A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization

Yindalon Aphinyanaphongs; Lawrence D. Fu; Zhiguo Li; Eric R. Peskin; Efstratios Efstathiadis; Constantin F. Aliferis; Alexander Statnikov

doi:10.1002/asi.23110

Institute for Health Informatics

Research output: Contribution to journal › Article › peer-review

49 Scopus citations

Abstract

An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.

Original language	English (US)
Pages (from-to)	1964-1987
Number of pages	24
Journal	Journal of the Association for Information Science and Technology
Volume	65
Issue number	10
DOIs	https://doi.org/10.1002/asi.23110
State	Published - Oct 1 2014

Bibliographical note

Publisher Copyright:
© 2014 ASIS&T.

Keywords

information retrieval
machine learning
text processing

Cite this

A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. / Aphinyanaphongs, Yindalon; Fu, Lawrence D.; Li, Zhiguo et al.
In: Journal of the Association for Information Science and Technology, Vol. 65, No. 10, 01.10.2014, p. 1964-1987.

@article{18ebf0468a9745fba997d261b5575af5,

title = "A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization",

abstract = "An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.",

keywords = "information retrieval, machine learning, text processing",

author = "Yindalon Aphinyanaphongs and Fu, {Lawrence D.} and Zhiguo Li and Peskin, {Eric R.} and Efstratios Efstathiadis and Aliferis, {Constantin F.} and Alexander Statnikov",

note = "Publisher Copyright: {\textcopyright} 2014 ASIS\&T.",

year = "2014",

month = oct,

day = "1",

doi = "10.1002/asi.23110",

language = "English (US)",

volume = "65",

pages = "1964--1987",

journal = "Journal of the Association for Information Science and Technology",

issn = "2330-1635",

publisher = "John Wiley and Sons Ltd",

number = "10",

}

TY - JOUR

T1 - A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization

AU - Aphinyanaphongs, Yindalon

AU - Fu, Lawrence D.

AU - Li, Zhiguo

AU - Peskin, Eric R.

AU - Efstathiadis, Efstratios

AU - Aliferis, Constantin F.

AU - Statnikov, Alexander

PY - 2014/10/1

Y1 - 2014/10/1

N2 - An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including adding recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.

KW - information retrieval

KW - machine learning

KW - text processing

UR - http://www.scopus.com/inward/record.url?scp=84925450131&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84925450131&partnerID=8YFLogxK

U2 - 10.1002/asi.23110

DO - 10.1002/asi.23110

M3 - Article

AN - SCOPUS:84925450131

SN - 2330-1635

VL - 65

SP - 1964

EP - 1987

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

IS - 10

ER -
