Topic models over text streams: A study of batch and online unsupervised learning

Arindam Banerjee; Sugato Basu

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

87 Scopus citations

Abstract

Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures, and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and the efficiency of incremental online algorithms.

Original language	English (US)
Title of host publication	Proceedings of the 7th SIAM International Conference on Data Mining
Pages	431-436
Number of pages	6
State	Published - Dec 1 2007
Event	7th SIAM International Conference on Data Mining - Minneapolis, MN, United States
Duration	Apr 26 2007 → Apr 28 2007

Publication series

Name	Proceedings of the 7th SIAM International Conference on Data Mining

Other

Other	7th SIAM International Conference on Data Mining
Country/Territory	United States
City	Minneapolis, MN
Period	4/26/07 → 4/28/07

Cite this

@inproceedings{28cef78a2ff94633a86d76b9e3992af4,

title = "Topic models over text streams: A study of batch and online unsupervised learning",

abstract = "Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures, and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and the efficiency of incremental online algorithms.",

author = "Arindam Banerjee and Sugato Basu",

year = "2007",

month = dec,

day = "1",

language = "English (US)",

isbn = "9780898716306",

series = "Proceedings of the 7th SIAM International Conference on Data Mining",

pages = "431--436",

booktitle = "Proceedings of the 7th SIAM International Conference on Data Mining",

note = "7th SIAM International Conference on Data Mining ; Conference date: 26-04-2007 Through 28-04-2007",

}

TY - GEN

T1 - Topic models over text streams

T2 - 7th SIAM International Conference on Data Mining

AU - Banerjee, Arindam

AU - Basu, Sugato

PY - 2007/12/1

Y1 - 2007/12/1

N2 - Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures, and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and the efficiency of incremental online algorithms.

AB - Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures, and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and the efficiency of incremental online algorithms.

UR - http://www.scopus.com/inward/record.url?scp=70449126967&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449126967&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:70449126967

SN - 9780898716306

T3 - Proceedings of the 7th SIAM International Conference on Data Mining

SP - 431

EP - 436

BT - Proceedings of the 7th SIAM International Conference on Data Mining

Y2 - 26 April 2007 through 28 April 2007

ER -
