Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors

C. F. Aliferis, I. Tsamardinos, P. Massion, A. R. Statnikov, D. Hardin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy, and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers, and sample size (as long as kernel classifiers are used) have relatively small effects; train-test sample ratio, and the choice of cross-validation sample selection method appear to have small-to-negligible effects.
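
The kind of experiment the abstract describes (cross-validated classification with far more variables than samples, repeated after removing informative genes to probe redundancy) can be illustrated with a minimal sketch. This is not the authors' code, data, or classifiers: it assumes synthetic Gaussian data and swaps in a simple nearest-centroid classifier, and every name in it (`make_data`, `stratified_folds`, `cv_accuracy`, and all parameter values) is hypothetical.

```python
import random
from statistics import mean


def make_data(n_samples=40, n_genes=200, n_informative=20, shift=1.0, seed=0):
    """Synthetic two-class 'expression' matrix (an assumption, not real array data).

    The first n_informative genes have their mean shifted by `shift` in class 1;
    all remaining genes are pure noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        row = [rng.gauss(shift * label if g < n_informative else 0.0, 1.0)
               for g in range(n_genes)]
        X.append(row)
        y.append(label)
    return X, y


def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def centroid(rows):
    """Per-gene mean of a list of samples."""
    return [mean(col) for col in zip(*rows)]


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose training centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        c = centroid([r for r, v in zip(train_X, train_y) if v == label])
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def cv_accuracy(X, y, genes, k=5, seed=0):
    """k-fold cross-validated accuracy using only the listed gene columns."""
    Xs = [[row[g] for g in genes] for row in X]
    correct = 0
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train_X = [Xs[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        correct += sum(
            nearest_centroid_predict(train_X, train_y, Xs[i]) == y[i] for i in fold
        )
    return correct / len(y)


if __name__ == "__main__":
    X, y = make_data()
    all_genes = list(range(200))
    half_informative_removed = list(range(10, 200))  # drop 10 of the 20 informative genes
    print("all genes:", cv_accuracy(X, y, all_genes))
    print("half of informative genes removed:", cv_accuracy(X, y, half_informative_removed))
```

If accuracy changes little when half of the informative genes are dropped, the remaining genes carry largely the same signal, which is one way to picture the redundancy effect the abstract reports. The sample and gene counts here are deliberately tiny compared with the 203 cases and 12,600 oligonucleotides of the study, and the outcome on this toy data says nothing about the paper's actual results.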

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03
Editors: F. Valafar, H. Valafar
Pages: 47-53
Number of pages: 7
State: Published - 2003
Event: Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03 - Las Vegas, NV, United States
Duration: Jun 23 2003 to Jun 26 2003

Publication series

Name: Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences

Other

Other: Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03
Country/Territory: United States
City: Las Vegas, NV
Period: 6/23/03 to 6/26/03

Keywords

  • Bioinformatics and Medicine
  • Expression Data Analysis
  • Gene Expression
