Abstract
Topic modeling is a suite of algorithms that aims to discover the hidden structures in large digital archives. Topic modeling algorithms such as Latent Dirichlet Allocation perform unsupervised learning and therefore require no prior annotation or labeling of the documents; the topics emerge from the analysis of the original texts.
Traditional topic models have several shortcomings when applied to archives consisting of short and noisy documents, such as Twitter. More specifically, these models assume that each document is drawn from its own topic distribution, but the sparsity of short texts calls this assumption, and hence the applicability of topic models, into question.
To investigate this, a wide range of models, belonging to the classes of finite Bayesian mixture models and nonparametric Bayesian models, are compared in terms of statistical likelihood and topic coherence. In addition, an extension of the existing state of the art, called the biterm Pitman-Yor process, is proposed and compared to the other models.
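Biterm-based models address short-text sparsity by modeling unordered word co-occurrence pairs (biterms) across the corpus rather than per-document word counts. As a minimal sketch of the representation such models operate on (the example tokens are hypothetical, and this is a generic illustration, not the proposed model's implementation):

```python
from itertools import combinations

def biterms(doc_tokens):
    """All unordered word pairs co-occurring within one short document.

    A biterm model pools these pairs over the whole corpus, so topic
    inference no longer depends on the (sparse) per-document counts.
    """
    return list(combinations(doc_tokens, 2))

# A hypothetical tokenized tweet yields three biterms:
print(biterms(["rain", "in", "leuven"]))
# → [('rain', 'in'), ('rain', 'leuven'), ('in', 'leuven')]
```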
Experimental analysis is performed on three datasets with documents of different lengths in order to demonstrate the short-text issue: the Reuters-21578 dataset (normal-length documents), Tweets2011 (short texts), and a third dataset in the Dutch language (extremely short texts). Results indicate that the newly proposed methods for overcoming data sparsity improve on traditional models at both the statistical and the semantic level. The newly proposed biterm Pitman-Yor process performs comparably to the state of the art while increasing the flexibility of the modeling process, making the result more malleable to the user's expectations.
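To make concrete how topics can "emerge" from raw text without labels, the following is a minimal collapsed Gibbs sampler for standard LDA on toy documents. Everything here (the function, hyperparameters, and toy corpus) is an illustrative sketch of the general technique, not the thesis's implementation or datasets:

```python
import random

def lda_gibbs(docs, K=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the vocabulary and the
    estimated topic-word distributions phi (K rows summing to 1)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Count tables: topic-word counts, document-topic counts, topic totals.
    nkw = [[0] * V for _ in range(K)]
    ndk = [[0] * K for _ in docs]
    nk = [0] * K
    z = []  # topic assignment for every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)  # random initialization
            zs.append(k)
            nkw[k][widx[w]] += 1
            ndk[d][k] += 1
            nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wv = z[d][i], widx[w]
                # Remove the token's current assignment from the counts.
                nkw[k][wv] -= 1; ndk[d][k] -= 1; nk[k] -= 1
                # Full conditional p(z = t | rest), up to a constant.
                probs = [(ndk[d][t] + alpha) * (nkw[t][wv] + beta)
                         / (nk[t] + V * beta) for t in range(K)]
                r = rng.random() * sum(probs)
                k, acc = 0, probs[0]
                while acc < r:
                    k += 1
                    acc += probs[k]
                z[d][i] = k
                nkw[k][wv] += 1; ndk[d][k] += 1; nk[k] += 1
    phi = [[(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
           for t in range(K)]
    return vocab, phi

# Hypothetical toy corpus with two thematic clusters (sports vs. politics).
docs = [["ball", "goal", "ball"], ["goal", "match", "ball"],
        ["vote", "law", "vote"], ["law", "parliament", "vote"]]
vocab, phi = lda_gibbs(docs, K=2)
```

Each row of `phi` is a distribution over the vocabulary; inspecting the highest-probability words per row reveals the induced topics, which is exactly the unlabeled structure discovery that the abstract describes.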
| Original language | English |
| --- | --- |
| Publication status | Published - Feb 2017 |
| Event | Computational Linguistics in the Netherlands - Leuven, Belgium. Duration: 10 Feb 2017 → 10 Feb 2017 |
Conference
| Conference | Computational Linguistics in the Netherlands |
| --- | --- |
| Abbreviated title | CLIN27 |
| Country/Territory | Belgium |
| City | Leuven |
| Period | 10/02/17 → 10/02/17 |