Variable selection after screening: with or without data splitting?

Xiaoyi Zhu, Yuhong Yang

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically done, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size. One may then apply a familiar variable selection technique. While this approach has become popular, the screening bias issue has been largely ignored. The screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we set out to examine this screening bias both theoretically and numerically, and to compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias for variable selection and improve the prediction accuracy as well.
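The two workflows contrasted in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the screening step ranks predictors by absolute marginal correlation in the spirit of SIS, while the selection step (ordinary least squares with a t-statistic cutoff) is only a simple stand-in for "a familiar variable selection technique". The function names, the cutoff of 2.0, the half-and-half split, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def sis_screen(X, y, d):
    """Keep the d predictors with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(corr))[:d])

def select_by_t(X, y, cols, t_cut=2.0):
    """Stand-in selection step (an assumption): OLS on the screened columns,
    keeping those whose |t-statistic| exceeds t_cut."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    t = beta[1:] / np.sqrt(np.diag(cov)[1:])
    return cols[np.abs(t) > t_cut]

def screen_then_select(X, y, d, split, rng):
    """Screen and select on the same data (split=False), or screen on one
    random half and select on the other half (split=True)."""
    n = X.shape[0]
    if split:
        perm = rng.permutation(n)
        first, second = perm[: n // 2], perm[n // 2:]
        cols = sis_screen(X[first], y[first], d)
        return select_by_t(X[second], y[second], cols)
    cols = sis_screen(X, y, d)
    return select_by_t(X, y, cols)

# Simulated data (hypothetical): only the first three predictors matter.
rng = np.random.default_rng(0)
n, p, d = 200, 2000, 20
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -2.0, 1.5]) + rng.standard_normal(n)

no_split = screen_then_select(X, y, d, split=False, rng=rng)
with_split = screen_then_select(X, y, d, split=True, rng=rng)
print("selected without splitting:", no_split.tolist())
print("selected with splitting:   ", with_split.tolist())
```

Without splitting, the noise predictors that survive screening were chosen precisely because they looked correlated with the response in the same sample, so the selection step tends to be too optimistic about them. With splitting, the selection step sees those predictors on fresh observations, at the cost of using fewer observations in each stage.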

Original language: English (US)
Pages (from-to): 191-203
Number of pages: 13
Journal: Computational Statistics
Volume: 30
Issue number: 1
State: Published - Mar 2014

Bibliographical note

Funding Information:
The authors thank Ying Nan for sharing her computer codes related to their work. A referee and the editors are appreciated for their very helpful comments on improving the paper. The research was partially supported by the NSF Grant DMS-1106576.

Publisher Copyright:
© 2014, Springer-Verlag Berlin Heidelberg.

Keywords

  • Model selection
  • Prediction
  • Sparse regression
  • Variable screening
