High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically performed, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size, after which one may apply a familiar variable selection technique. While this approach has become popular, the issue of screening bias has been largely ignored. Screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we set out to examine this screening bias both theoretically and numerically, and to compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias in variable selection and improve prediction accuracy as well.
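The two approaches contrasted above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it screens predictors by absolute marginal correlation (the core of SIS) and contrasts screening on the full sample with the data-splitting alternative, where screening uses one half of the data and subsequent selection or fitting uses the other half. All variable names and the choice of screening size `d` are illustrative assumptions.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by absolute marginal correlation with y and
    keep the top d (the basic sure-independence-screening step)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

# Toy ultra-high-dimensional setting: n = 100 observations, p = 1000 predictors,
# with only the first two predictors truly influencing the response.
rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(n)

# Naive approach: screen and then select on the SAME data,
# which is where the screening bias discussed above arises.
kept_full = sis_screen(X, y, d=20)

# Data-splitting alternative: screen on one half only, then hand the
# untouched second half (restricted to the screened predictors) to any
# familiar variable selection technique.
half = n // 2
kept_split = sis_screen(X[:half], y[:half], d=20)
X_select, y_select = X[half:, kept_split], y[half:]
```

In the split version, the selection stage never sees the data used to choose the candidate set, so the usual inferential guarantees of the selection method are not distorted by the screening step.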
Bibliographical note — Publisher Copyright:
© 2014, Springer-Verlag Berlin Heidelberg.
- Model selection
- Sparse regression
- Variable screening