A statistical model for topically segmented documents

Giovanni Ponti; Andrea Tagarelli; George Karypis

doi:10.1007/978-3-642-24477-3_21

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over the terms. Such models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate to handle documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents in smaller text blocks which are topically cohesive (segments). A new variable is introduced to model the within-document segments: using this variable at document-level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated to the segments, rather than to the document as a whole. Experimental results have shown that, compared to existing generative models, our proposed model provides better perplexity of language modeling and better support for effective clustering of documents.

Original language	English (US)
Title of host publication	Discovery Science - 14th International Conference, DS 2011, Proceedings
Pages	247-261
Number of pages	15
DOIs	https://doi.org/10.1007/978-3-642-24477-3_21
State	Published - 2011
Event	14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011 - Espoo, Finland
Duration	Oct 5 2011 → Oct 7 2011

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	6926 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011
Country/Territory	Finland
City	Espoo
Period	Oct 5 2011 → Oct 7 2011

Access

10.1007/978-3-642-24477-3_21

Cite this

Ponti, G., Tagarelli, A., & Karypis, G. (2011). A statistical model for topically segmented documents. In Discovery Science - 14th International Conference, DS 2011, Proceedings (pp. 247-261). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6926 LNAI). https://doi.org/10.1007/978-3-642-24477-3_21

A statistical model for topically segmented documents. / Ponti, Giovanni; Tagarelli, Andrea; Karypis, George.
Discovery Science - 14th International Conference, DS 2011, Proceedings. 2011. p. 247-261 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6926 LNAI).

Ponti, G, Tagarelli, A & Karypis, G 2011, A statistical model for topically segmented documents. in Discovery Science - 14th International Conference, DS 2011, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6926 LNAI, pp. 247-261, 14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011, Espoo, Finland, 10/5/11. https://doi.org/10.1007/978-3-642-24477-3_21

@inproceedings{d2db9945a13c417fbfaeb77852bc320c,
  title = "A statistical model for topically segmented documents",
  abstract = "Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over the terms. Such models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate to handle documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents in smaller text blocks which are topically cohesive (segments). A new variable is introduced to model the within-document segments: using this variable at document-level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated to the segments, rather than to the document as a whole. Experimental results have shown that, compared to existing generative models, our proposed model provides better perplexity of language modeling and better support for effective clustering of documents.",
  author = "Giovanni Ponti and Andrea Tagarelli and George Karypis",
  year = "2011",
  doi = "10.1007/978-3-642-24477-3_21",
  language = "English (US)",
  isbn = "9783642244766",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  pages = "247--261",
  booktitle = "Discovery Science - 14th International Conference, DS 2011, Proceedings",
  note = "14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011 ; Conference date: 05-10-2011 Through 07-10-2011",
}

TY - GEN

T1 - A statistical model for topically segmented documents

AU - Ponti, Giovanni

AU - Tagarelli, Andrea

AU - Karypis, George

PY - 2011

Y1 - 2011

AB - Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over the terms. Such models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate to handle documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents in smaller text blocks which are topically cohesive (segments). A new variable is introduced to model the within-document segments: using this variable at document-level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated to the segments, rather than to the document as a whole. Experimental results have shown that, compared to existing generative models, our proposed model provides better perplexity of language modeling and better support for effective clustering of documents.

UR - http://www.scopus.com/inward/record.url?scp=80053959322&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053959322&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-24477-3_21

DO - 10.1007/978-3-642-24477-3_21

M3 - Conference contribution

AN - SCOPUS:80053959322

SN - 9783642244766

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 247

EP - 261

BT - Discovery Science - 14th International Conference, DS 2011, Proceedings

T2 - 14th International Conference on Discovery Science, DS 2011, Co-located with the 22nd International Conference on Algorithmic Learning Theory, ALT 2011

Y2 - 5 October 2011 through 7 October 2011

ER -
