TY - GEN
T1 - Improved unsupervised name discrimination with very wide bigrams and automatic cluster stopping
AU - Pedersen, Ted
PY - 2009
Y1 - 2009
N2 - We cast name discrimination as a problem in clustering short contexts. Each occurrence of an ambiguous name is treated independently, and represented using second-order context vectors. We calibrate our approach using a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been reported on in previous publications. We find that significant improvements in the accuracy of name discrimination can be achieved by using very wide bigrams, which are ordered pairs of words with up to 48 intervening words between them. We also show that recent developments in automatic cluster stopping can be used to predict the number of underlying identities without any significant loss of accuracy as compared to previous approaches which have set these values manually.
AB - We cast name discrimination as a problem in clustering short contexts. Each occurrence of an ambiguous name is treated independently, and represented using second-order context vectors. We calibrate our approach using a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been reported on in previous publications. We find that significant improvements in the accuracy of name discrimination can be achieved by using very wide bigrams, which are ordered pairs of words with up to 48 intervening words between them. We also show that recent developments in automatic cluster stopping can be used to predict the number of underlying identities without any significant loss of accuracy as compared to previous approaches which have set these values manually.
UR - http://www.scopus.com/inward/record.url?scp=67650535509&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650535509&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-00382-0_24
DO - 10.1007/978-3-642-00382-0_24
M3 - Conference contribution
AN - SCOPUS:67650535509
SN - 3642003818
SN - 9783642003813
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 294
EP - 305
BT - Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings
T2 - 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009
Y2 - 1 March 2009 through 7 March 2009
ER -