Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data

Constantin F. Aliferis; Alexander Statnikov; Ioannis Tsamardinos; Jonathan S. Schildcrout; Bryan E. Shepherd; Frank E. Harrell

doi:10.1371/journal.pone.0004922

Institute for Health Informatics

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.

Methodology/Principal Findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data.

Conclusions/Significance: The findings of the present study have two important practical implications: First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Original language	English (US)
Article number	e4922
Journal	PLoS ONE
Volume	4
Issue number	3
DOIs	https://doi.org/10.1371/journal.pone.0004922
State	Published - Mar 17 2009

Cite this

@article{384ae1307fbe4c6d93aaf18f9737c1b5,
title = "Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data",

abstract = "Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. Methodology/Principal Findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. Conclusions/Significance: The findings of the present study have two important practical implications: First, high-throughput studies by avoiding under-powered data analysis protocols, can achieve substantial economies in sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.",

author = "Aliferis, {Constantin F.} and Alexander Statnikov and Ioannis Tsamardinos and Schildcrout, {Jonathan S.} and Shepherd, {Bryan E.} and Harrell, {Frank E.}",
year = "2009",
month = mar,
day = "17",
doi = "10.1371/journal.pone.0004922",
language = "English (US)",
volume = "4",
journal = "PLoS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "3",
}

TY - JOUR
T1 - Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data
AU - Aliferis, Constantin F.
AU - Statnikov, Alexander
AU - Tsamardinos, Ioannis
AU - Schildcrout, Jonathan S.
AU - Shepherd, Bryan E.
AU - Harrell, Frank E.
PY - 2009/3/17
Y1 - 2009/3/17

N2 - Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. Methodology/Principal Findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. Conclusions/Significance: The findings of the present study have two important practical implications: First, high-throughput studies by avoiding under-powered data analysis protocols, can achieve substantial economies in sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

UR - http://www.scopus.com/inward/record.url?scp=62849087796&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=62849087796&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0004922
DO - 10.1371/journal.pone.0004922
M3 - Article
C2 - 19290050
AN - SCOPUS:62849087796
SN - 1932-6203
VL - 4
JO - PLoS ONE
JF - PLoS ONE
IS - 3
M1 - e4922
ER -
