Marginal screening is a widely applied technique for reducing the dimensionality of data when the number of potential features far exceeds the sample size. Because marginal screening procedures rank each feature by its individual association with the response, they are known to have difficulty identifying so-called hidden variables, which are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable at the screening stage has two undesirable consequences: (1) important features are missed in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by recent work on conditional screening, we propose a data-driven conditional screening algorithm that is computationally efficient, enjoys the sure screening property under weaker assumptions on the model, and works robustly in a variety of settings to reduce false negatives of hidden variables. A numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.
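The hidden-variable problem described above can be illustrated with a small simulation. The following is a minimal sketch, not the data-driven algorithm proposed in the paper: the equicorrelated design, the coefficients, the screening size `d = n / log(n)`, and the helper names `marginal_screen` and `residualize` are all assumptions chosen for illustration. The conditioning set is taken as known here, whereas the proposed method selects it from the data.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Rank features by absolute marginal correlation with y; keep the top d."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / len(y)
    return np.argsort(-corr)[:d], corr

def residualize(A, C):
    """Remove the linear effect of the columns of C (plus an intercept) from A."""
    C1 = np.column_stack([np.ones(len(C)), C])
    coef, *_ = np.linalg.lstsq(C1, A, rcond=None)
    return A - C1 @ coef

rng = np.random.default_rng(0)
n, p, rho, b = 400, 1000, 0.5, 2.0

# Equicorrelated design (assumed): every pair of features has correlation rho.
shared = rng.standard_normal((n, 1))
X = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((n, p))

# Features 0-2 are ordinary signals. Feature 3 is a hidden variable: its
# coefficient is chosen so that its population covariance with y is zero.
beta = np.zeros(p)
beta[:3] = b
beta[3] = -3 * b * rho
y = X @ beta + rng.standard_normal(n)

d = int(n / np.log(n))  # assumed screening size

# Plain marginal screening: the hidden variable is expected to be dropped.
kept, corr = marginal_screen(X, y, d)

# Conditional screening given features 0-2 (assumed known for illustration):
# screen the remaining features after removing the effect of the given set.
given = np.arange(3)
rest = np.arange(3, p)
X_res = residualize(X[:, rest], X[:, given])
y_res = residualize(y, X[:, given])
kept_res, _ = marginal_screen(X_res, y_res, d)
kept_cond = rest[kept_res]

print("hidden variable kept by marginal screening:", 3 in kept)
print("hidden variable kept by conditional screening:", 3 in kept_cond)
```

In this design the hidden variable's marginal correlation with the response is near zero, so marginal ranking places it below the pure noise features; once the effect of the conditioning set is removed, its association with the residualized response becomes the strongest among the remaining features.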
Bibliographical note
Funding Information:
We would like to thank Dr Emre Barut and Dr Vincent Vu for helpful discussions, Dr Zongming Ma for sharing his code for sparse principal component analysis, and Dr Chenlei Leng and Dr Yiyuan She for sharing their unpublished papers. H.G.H. is supported by NSA grant H98230-15-1-0260. Lan Wang is supported by NSF grant DMS-1512267. Xuming He is supported by NSF grant DMS-1307566.
- conditional screening
- false negative
- feature screening
- high dimension
- sparse principal component analysis
- sure screening property