Empirical and theoretical comparisons of selected criterion functions for document clustering

Ying Zhao, George Karypis

Research output: Contribution to journal › Article › peer-review

481 Scopus citations

Abstract

This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven criterion functions, three of which are introduced in this paper and four of which have been proposed previously. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperforms the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can operate correctly when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
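To make the notion of a criterion function concrete, the sketch below scores a given partition of documents with one common cosine-based cohesion measure: the sum, over clusters, of the length of each cluster's composite (summed) unit vector. This is an illustrative assumption only; the abstract does not define the seven criterion functions, and the names `unit` and `cosine_cohesion` and the toy data are hypothetical.

```python
import math

def unit(v):
    """Scale a term-weight vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_cohesion(docs, labels):
    """Sum, over clusters, of the length of each cluster's composite vector.

    For unit-length documents this equals the total cosine similarity
    between every document and its cluster centroid; larger is better.
    Illustrative criterion only, not taken from the abstract.
    """
    composites = {}
    for doc, label in zip(docs, labels):
        u = unit(doc)
        acc = composites.setdefault(label, [0.0] * len(u))
        for i, x in enumerate(u):
            acc[i] += x
    return sum(math.sqrt(sum(x * x for x in c)) for c in composites.values())

# Toy term-weight vectors: two documents per topic (hypothetical data).
docs = [[2, 0, 0], [1, 0, 0], [0, 3, 0], [0, 1, 0]]
good = cosine_cohesion(docs, [0, 0, 1, 1])  # groups documents by topic
bad = cosine_cohesion(docs, [0, 1, 0, 1])   # mixes the two topics
```

A partitional algorithm would search over label assignments for the one that maximizes such a score; here the topic-aligned partition scores higher than the mixed one.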

Original language: English (US)
Pages (from-to): 311-331
Number of pages: 21
Journal: Machine Learning
Volume: 55
Issue number: 3
State: Published - Jun 2004

Bibliographical note

Funding Information:
This work was supported by NSF ACI-0133464, CCR-9972519, EIA-9986042, ACI-9982274, and by Army HPC Research Center contract number DAAH04-95-C-0008.

Keywords

  • Criterion function
  • Data mining
  • Information retrieval
  • Partitional clustering
