TY - GEN
T1 - A statistical model for topically segmented documents
AU - Ponti, Giovanni
AU - Tagarelli, Andrea
AU - Karypis, George
PY - 2011
Y1 - 2011
AB - Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over the terms. Such models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate for handling documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: using this variable at the document level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated with the segments, rather than with the document as a whole. Experimental results show that, compared to existing generative models, the proposed model provides better perplexity of language modeling and better support for effective clustering of documents.
UR - http://www.scopus.com/inward/record.url?scp=80053959322&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053959322&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-24477-3_21
DO - 10.1007/978-3-642-24477-3_21
M3 - Conference contribution
AN - SCOPUS:80053959322
SN - 9783642244766
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 247
EP - 261
BT - Discovery Science - 14th International Conference, DS 2011, Proceedings
T2 - 14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011
Y2 - 5 October 2011 through 7 October 2011
ER -