Bayesian Nonparametric Topic Models for Short Text Data

Renzo Poddighe, Gerasimos Spanakis

Research output: Contribution to conference > Abstract > Academic

Abstract

Topic modeling is a suite of algorithms that aim to discover the hidden structure in large digital archives. Topic modeling algorithms such as Latent Dirichlet Allocation perform unsupervised learning, so they do not require any prior annotation or labeling of the documents; the topics emerge from the analysis of the original texts.

Traditional topic models have several shortcomings when applied to archives consisting of short and noisy documents, such as tweets. More specifically, these models assume that each document is generated from its own distribution over topics, but a short text contains too few words to estimate that distribution reliably, which raises the question of whether such models are applicable at all.

To investigate this, a wide range of models from the classes of finite Bayesian mixture models and nonparametric Bayesian models is compared in terms of statistical likelihood and topic coherence. In addition, an extension to the existing state of the art, called the Biterm Pitman-Yor process, is proposed and compared to the other models.
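The abstract does not spell out how the Biterm Pitman-Yor process is constructed, so the following is only a minimal sketch of the two standard ingredients its name suggests: extracting biterms (unordered word pairs that co-occur in one short document) and drawing cluster assignments from a Pitman-Yor Chinese restaurant process prior. The function names, the toy documents, and the parameter values are illustrative assumptions, not the authors' implementation, and the sketch samples from the prior only; it performs no inference over word likelihoods.

```python
import random
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair that co-occurs in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def pitman_yor_crp(n_items, discount, concentration, rng):
    """Seat n_items sequentially under a Pitman-Yor Chinese restaurant process.

    With n items already seated in K clusters, the next item joins cluster k
    with weight (size_k - discount) and opens a new cluster with weight
    (concentration + discount * K). Assumes 0 <= discount < 1 and
    concentration > 0.
    """
    sizes = []        # current size of each cluster
    assignments = []  # cluster index chosen for each item, in order
    for _ in range(n_items):
        n_clusters = len(sizes)
        weights = [size - discount for size in sizes]
        weights.append(concentration + discount * n_clusters)
        k = rng.choices(range(n_clusters + 1), weights=weights)[0]
        if k == n_clusters:
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, sizes


# Hypothetical toy corpus of two very short documents.
docs = [["cheap", "flight", "amsterdam"], ["flight", "delay", "amsterdam"]]
biterms = [b for doc in docs for b in extract_biterms(doc)]
print(Counter(biterms).most_common(1))

# Prior draw only: each biterm gets a cluster label without looking at its words.
assignments, sizes = pitman_yor_crp(len(biterms), 0.5, 1.0, random.Random(0))
print(assignments, sizes)
```

Pooling biterms over the whole corpus is the usual way biterm models sidestep per-document sparsity, and the discount parameter lets the number of clusters grow with the data instead of being fixed in advance. A full model would additionally weight each seating choice by how well the biterm's words fit the cluster.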

Experimental analysis is performed on three datasets with documents of different lengths in order to demonstrate the short text issue: the Reuters-21578 dataset (normal-length documents), the Tweets2011 dataset (short text), and a third dataset in Dutch (extremely short text). Results indicate that newly proposed methods for overcoming data sparsity improve on traditional models at both the statistical and the semantic level. The newly proposed Biterm Pitman-Yor process shows performance comparable to the state of the art while increasing the flexibility of the modeling process, making the result more malleable to the user's expectations.
Original language: English
Publication status: Published - Feb 2017
Event: Computational Linguistics in the Netherlands - Leuven, Belgium
Duration: 10 Feb 2017 - 10 Feb 2017

Conference

Conference: Computational Linguistics in the Netherlands
Abbreviated title: CLIN27
Country: Belgium
City: Leuven
Period: 10/02/17 - 10/02/17

Cite this

Poddighe, R., & Spanakis, G. (2017). Bayesian Nonparametric Topic Models for Short Text Data. Abstract from Computational Linguistics in the Netherlands, Leuven, Belgium.