Centroid-based document classification: Analysis and experimental results

Eui Hong Sam Han, George Karypis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

220 Scopus citations

Abstract

In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms, such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and to account for dependencies between the terms in the different classes.
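The general idea of a centroid-based classifier can be sketched in a few lines. The abstract does not specify the term weighting or the similarity measure, so this minimal sketch assumes plain term-frequency vectors scaled to unit length and cosine similarity; the function names (`unit_vector`, `train_centroids`, `cosine`, `classify`) and the toy training data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def unit_vector(tokens):
    """Term-frequency vector scaled to unit length (assumed weighting)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_centroids(labeled_docs):
    """Single pass over the training set: average the vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for tokens, label in labeled_docs:
        sizes[label] += 1
        for term, weight in unit_vector(tokens).items():
            sums[label][term] += weight
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = unit_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy, made-up training data for illustration only.
train = [
    ("goal match team score".split(), "sports"),
    ("team coach match win".split(), "sports"),
    ("stock market price trade".split(), "finance"),
    ("market bank price rate".split(), "finance"),
]
centroids = train_centroids(train)
print(classify("team score win".split(), centroids))
print(classify("bank stock rate".split(), centroids))
```

Under these assumptions, training touches each document once and classification compares a new document against one centroid per class, which is consistent with the linear-time behavior the abstract describes.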

Original language: English (US)
Title of host publication: Principles of Data Mining and Knowledge Discovery - 4th European Conference, PKDD 2000, Proceedings
Editors: Djamel A. Zighed, Jan Komorowski, Jan Zytkow
Publisher: Springer Verlag
Pages: 424-431
Number of pages: 8
ISBN (Print): 9783540410669
State: Published - 2000
Event: 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000 - Lyon, France
Duration: Sep 13 2000 to Sep 16 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1910
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000
Country: France
City: Lyon
Period: 9/13/00 to 9/16/00

Bibliographical note

Funding Information:
This work was supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute.

