TY - GEN
T1 - Why Classification Models Using Array Gene Expression Data Perform So Well
T2 - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03
AU - Aliferis, C. F.
AU - Tsamardinos, I.
AU - Massion, P.
AU - Statnikov, A. R.
AU - Hardin, D.
PY - 2003
Y1 - 2003
N2 - Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy, and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression array dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, sample selection within cross-validation, and gene information redundancy. Our analyses show that gene redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers, and sample size (as long as kernel classifiers are used) have relatively small effects; train-test sample ratio, and the choice of cross-validation sample selection method appear to have small-to-negligible effects.
KW - Bioinformatics and Medicine
KW - Expression Data Analysis
KW - Gene Expression
UR - http://www.scopus.com/inward/record.url?scp=1642399779&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=1642399779&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:1642399779
SN - 1932415041
T3 - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences
SP - 47
EP - 53
BT - Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'03
A2 - Valafar, F.
A2 - Valafar, H.
Y2 - 23 June 2003 through 26 June 2003
ER -