Topic models over text streams: A study of batch and online unsupervised learning

Arindam Banerjee, Sugato Basu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

87 Scopus citations

Abstract

Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures, and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM, and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that although LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and the efficiency of incremental online algorithms.

Original language: English (US)
Title of host publication: Proceedings of the 7th SIAM International Conference on Data Mining
Pages: 431-436
Number of pages: 6
State: Published - Dec 1 2007
Event: 7th SIAM International Conference on Data Mining - Minneapolis, MN, United States
Duration: Apr 26 2007 to Apr 28 2007

Publication series

Name: Proceedings of the 7th SIAM International Conference on Data Mining

Other

Other: 7th SIAM International Conference on Data Mining
Country/Territory: United States
City: Minneapolis, MN
Period: 4/26/07 to 4/28/07
