Centroid-based document classification: Analysis and experimental results

Eui Hong Sam Han; George Karypis

doi:10.1007/3-540-45372-5_46

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

260 Scopus citations

Abstract

In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms, such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and accounts for dependencies between the terms in the different classes.
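The general idea of a centroid-based classifier can be illustrated with a minimal sketch. The abstract does not specify the term weighting or the similarity measure, so this example assumes plain term-frequency vectors scaled to unit length and cosine similarity; the function names (`unit_vector`, `train_centroids`, `classify`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def unit_vector(tokens):
    """Term-frequency vector scaled to unit length (assumed weighting)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_centroids(labeled_docs):
    """Average the unit vectors of each class in one pass over the data."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for tokens, label in labeled_docs:
        sizes[label] += 1
        for term, weight in unit_vector(tokens).items():
            sums[label][term] += weight
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(tokens, centroids):
    """Assign the class whose centroid is most similar to the document."""
    doc = unit_vector(tokens)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy training set (illustrative only, not from the paper's experiments).
training = [
    ("stock market shares rise".split(), "finance"),
    ("bank shares fall market".split(), "finance"),
    ("team wins match goal".split(), "sports"),
    ("goal scored late match".split(), "sports"),
]
centroids = train_centroids(training)
prediction = classify("market shares fall".split(), centroids)
```

Training touches each document once, and classifying a document needs one similarity computation per class rather than one per training document, which is the sense in which such a scheme is cheap compared with k-nearest-neighbors.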

Original language	English (US)
Title of host publication	Principles of Data Mining and Knowledge Discovery - 4th European Conference, PKDD 2000, Proceedings
Editors	Djamel A. Zighed, Jan Komorowski, Jan Zytkow
Publisher	Springer Verlag
Pages	424-431
Number of pages	8
ISBN (Print)	9783540410669
DOIs	https://doi.org/10.1007/3-540-45372-5_46
State	Published - 2000
Event	4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000 - Lyon, France; Duration: Sep 13 2000 → Sep 16 2000

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	1910
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	4th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2000
Country/Territory	France
City	Lyon
Period	9/13/00 → 9/16/00

Bibliographical note

Funding Information:
This work was supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute.

Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 2000.

Cite this

Han, E. H. S., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. In D. A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery - 4th European Conference, PKDD 2000, Proceedings (pp. 424-431). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1910). Springer Verlag. https://doi.org/10.1007/3-540-45372-5_46
