Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors

C. F. Aliferis; I. Tsamardinos; P. Massion; A. R. Statnikov; D. Hardin

Institute for Health Informatics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy, and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers, and sample size (as long as kernel classifiers are used) have relatively small effects; train-test sample ratio, and the choice of cross-validation sample selection method appear to have small-to-negligible effects.
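The abstract's interplay of gene redundancy, small sample-to-variable ratios, and cross-validation can be illustrated with a small, self-contained sketch. This is not the authors' method or data: the synthetic generator `make_data`, the nearest-centroid classifier (a simple non-kernel, linear stand-in), and all parameter names and values below are assumptions chosen only for illustration.

```python
import random

def make_data(n_samples, n_redundant, n_noise, shift, rng):
    """Synthetic two-class data (hypothetical): n_redundant noisy copies of
    one class signal, followed by n_noise pure-noise features."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        informative = [label * shift + rng.gauss(0, 1) for _ in range(n_redundant)]
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y

def stratified_folds(y, k, rng):
    """Split sample indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def centroid(rows):
    """Feature-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def cv_accuracy(X, y, k, rng):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    correct = 0
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        # Fit class centroids on the training folds only.
        cents = {}
        for label in set(y):
            rows = [X[i] for i in range(len(X))
                    if i not in held_out and y[i] == label]
            cents[label] = centroid(rows)
        # Score the held-out fold.
        for i in test_idx:
            pred = min(cents, key=lambda c: sq_dist(X[i], cents[c]))
            correct += pred == y[i]
    return correct / len(y)

if __name__ == "__main__":
    rng = random.Random(0)
    # Same sample size and noise dimension, varying redundancy of the signal.
    for n_redundant in (1, 5, 25):
        X, y = make_data(40, n_redundant, 500, 1.0, rng)
        print(n_redundant, cv_accuracy(X, y, 5, rng))
```

Varying `n_redundant` while holding the sample size and number of noise features fixed gives a rough feel for how redundant informative variables might offset a small sample-to-variable ratio; the numbers it prints say nothing about the paper's actual results.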

Original language	English (US)
Title of host publication	Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03
Editors	F. Valafar, H. Valafar
Pages	47-53
Number of pages	7
State	Published - 2003
Event	Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03 - Las Vegas, NV, United States
Duration	Jun 23 2003 → Jun 26 2003

Publication series

Name	Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences

Other

Other	Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03
Country/Territory	United States
City	Las Vegas, NV
Period	6/23/03 → 6/26/03

Keywords

Bioinformatics and Medicine
Expression Data Analysis
Gene Expression

Cite this

Aliferis, C. F., Tsamardinos, I., Massion, P., Statnikov, A. R., & Hardin, D. (2003). Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors. In F. Valafar, & H. Valafar (Eds.), Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03 (pp. 47-53). (Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences).

Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors. / Aliferis, C. F.; Tsamardinos, I.; Massion, P. et al.
Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03. ed. / F. Valafar; H. Valafar. 2003. p. 47-53 (Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences).

Aliferis, CF, Tsamardinos, I, Massion, P, Statnikov, AR & Hardin, D 2003, Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors. in F Valafar & H Valafar (eds), Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03. Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, pp. 47-53, Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03, Las Vegas, NV, United States, 6/23/03.

Aliferis CF, Tsamardinos I, Massion P, Statnikov AR, Hardin D. Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors. In Valafar F, Valafar H, editors, Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03. 2003. p. 47-53. (Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences).

Aliferis, C. F. ; Tsamardinos, I. ; Massion, P. et al. / Why Classification Models Using Array Gene Expression Data Perform So Well : A Preliminary Investigation Of Explanatory Factors. Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03. editor / F. Valafar ; H. Valafar. 2003. pp. 47-53 (Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences).

@inproceedings{9304de95d60a4fd0983a5d1c2ba3ced4,

title = "Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation Of Explanatory Factors",

abstract = "Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy, and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers, and sample size (as long as kernel classifiers are used) have relatively small effects; train-test sample ratio, and the choice of cross-validation sample selection method appear to have small-to-negligible effects.",

keywords = "Bioinformatics and Medicine, Expression Data Analysis, Gene Expression",

author = "Aliferis, {C. F.} and I. Tsamardinos and P. Massion and Statnikov, {A. R.} and D. Hardin",

year = "2003",

language = "English (US)",

isbn = "1932415041",

series = "Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences",

pages = "47--53",

editor = "F. Valafar and H. Valafar",

booktitle = "Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03",

note = "Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03 ; Conference date: 23-06-2003 Through 26-06-2003",

}

TY - GEN

T1 - Why Classification Models Using Array Gene Expression Data Perform So Well

T2 - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03

AU - Aliferis, C. F.

AU - Tsamardinos, I.

AU - Massion, P.

AU - Statnikov, A. R.

AU - Hardin, D.

PY - 2003

Y1 - 2003

N2 - Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy, and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers, and sample size (as long as kernel classifiers are used) have relatively small effects; train-test sample ratio, and the choice of cross-validation sample selection method appear to have small-to-negligible effects.

KW - Bioinformatics and Medicine

KW - Expression Data Analysis

KW - Gene Expression

UR - http://www.scopus.com/inward/record.url?scp=1642399779&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1642399779&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:1642399779

SN - 1932415041

T3 - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences

SP - 47

EP - 53

BT - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS 03

A2 - Valafar, F.

A2 - Valafar, H.

Y2 - 23 June 2003 through 26 June 2003

ER -