A data-driven approach to conditional screening of high-dimensional variables

Hyokyoung G. Hong; Lan Wang; Xuming He

doi:10.1002/sta4.115

Statistics (Twin Cities)

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Marginal screening is a widely applied technique to handily reduce the dimensionality of the data when the number of potential features overwhelms the sample size. Because of the nature of the marginal screening procedures, they are also known for their difficulty in identifying the so-called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed out in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data-driven conditional screening algorithm, which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.
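
The hidden-variable problem described in the abstract can be illustrated with a small simulation. The Python sketch below is not the paper's algorithm: the conditioning set is fixed by hand to a single feature, whereas the paper selects it in a data-driven way. The simulated design, the constants, and the helper names `abs_corr` and `residualize` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 500, 200, 0.5  # assumed sizes, chosen only for illustration

# Features 0 and 1 are correlated (corr = rho); all others are independent noise.
X = rng.standard_normal((n, p))
X[:, 1] = rho * X[:, 0] + np.sqrt(1 - rho**2) * X[:, 1]

# Feature 1 is "hidden": its population covariance with y is rho - rho = 0,
# even though its regression coefficient (-rho) is non-zero.
y = X[:, 0] - rho * X[:, 1] + 0.3 * rng.standard_normal(n)


def abs_corr(A, b):
    """Absolute sample correlation between each column of A and the vector b."""
    Ac = A - A.mean(axis=0)
    bc = b - b.mean()
    return np.abs(Ac.T @ bc) / (np.linalg.norm(Ac, axis=0) * np.linalg.norm(bc))


def residualize(M, Z):
    """Residuals of M after least-squares regression on Z plus an intercept."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(Z1, M, rcond=None)
    return M - Z1 @ coef


# Marginal screening: rank features by |corr(X_j, y)|.
marginal = abs_corr(X, y)

# Conditional screening (hand-picked conditioning set, an assumption):
# rank the remaining features by |partial correlation with y given feature 0|.
cond = [0]
rest = np.setdiff1d(np.arange(p), cond)
partial = abs_corr(residualize(X[:, rest], X[:, cond]),
                   residualize(y, X[:, cond]))

# 1-based rank of the hidden feature (index 1) under each criterion.
marginal_rank = int(np.where(np.argsort(-marginal) == 1)[0][0]) + 1
conditional_rank = int(np.where(rest[np.argsort(-partial)] == 1)[0][0]) + 1
print("hidden feature rank, marginal:   ", marginal_rank)
print("hidden feature rank, conditional:", conditional_rank)
```

In this design the hidden feature has a marginal correlation near zero, so its marginal rank is essentially left to chance among the noise features. Its partial correlation given feature 0 is large, so conditioning brings it to the top of the ranking.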

Original language	English (US)
Pages (from-to)	200-212
Number of pages	13
Journal	Stat
Volume	5
Issue number	1
DOIs	https://doi.org/10.1002/sta4.115
State	Published - 2016

Bibliographical note

Funding Information:
We would like to thank Dr Emre Barut and Dr Vincent Vu for helpful discussions, Dr Zongming Ma for sharing his codes for sparse principal component analysis and Dr Chenlei Leng and Dr Yiyuan She for sharing their unpublished papers. H.G.H. is supported by NSA grant H98230-15-1-0260. Lan Wang is supported by NSF grant DMS-1512267. Xuming He is supported by NSF grant DMS-1307566.

Publisher Copyright:
Copyright © 2016 John Wiley & Sons, Ltd.

Keywords

conditional screening
false negative
feature screening
high dimension
sparse principal component analysis
sure screening property

Cite this

@article{754b6b33aa83411db627b363df0e0425,

title = "A data-driven approach to conditional screening of high-dimensional variables",

abstract = "Marginal screening is a widely applied technique to handily reduce the dimensionality of the data when the number of potential features overwhelms the sample size. Because of the nature of the marginal screening procedures, they are also known for their difficulty in identifying the so-called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed out in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data-driven conditional screening algorithm, which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.",

keywords = "conditional screening, false negative, feature screening, high dimension, sparse principal component analysis, sure screening property",

author = "Hong, {Hyokyoung G.} and Lan Wang and Xuming He",

note = "Funding Information: We would like to thank Dr Emre Barut and Dr Vincent Vu for helpful discussions, Dr Zongming Ma for sharing his codes for sparse principal component analysis and Dr Chenlei Leng and Dr Yiyuan She for sharing their unpublished papers. H.G.H. is supported by NSA grant H98230-15-1-0260. Lan Wang is supported by NSF grant DMS-1512267. Xuming He is supported by NSF grant DMS-1307566. Publisher Copyright: Copyright {\textcopyright} 2016 John Wiley \& Sons, Ltd.",

year = "2016",

doi = "10.1002/sta4.115",

language = "English (US)",

volume = "5",

pages = "200--212",

journal = "Stat",

issn = "2049-1573",

publisher = "Wiley-Blackwell Publishing Ltd",

number = "1",

}

TY - JOUR

T1 - A data-driven approach to conditional screening of high-dimensional variables

AU - Hong, Hyokyoung G.

AU - Wang, Lan

AU - He, Xuming

N1 - Funding Information: We would like to thank Dr Emre Barut and Dr Vincent Vu for helpful discussions, Dr Zongming Ma for sharing his codes for sparse principal component analysis and Dr Chenlei Leng and Dr Yiyuan She for sharing their unpublished papers. H.G.H. is supported by NSA grant H98230-15-1-0260. Lan Wang is supported by NSF grant DMS-1512267. Xuming He is supported by NSF grant DMS-1307566. Publisher Copyright: Copyright © 2016 John Wiley & Sons, Ltd.

PY - 2016

Y1 - 2016

N2 - Marginal screening is a widely applied technique to handily reduce the dimensionality of the data when the number of potential features overwhelms the sample size. Because of the nature of the marginal screening procedures, they are also known for their difficulty in identifying the so-called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed out in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data-driven conditional screening algorithm, which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.

AB - Marginal screening is a widely applied technique to handily reduce the dimensionality of the data when the number of potential features overwhelms the sample size. Because of the nature of the marginal screening procedures, they are also known for their difficulty in identifying the so-called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed out in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data-driven conditional screening algorithm, which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.

KW - conditional screening

KW - false negative

KW - feature screening

KW - high dimension

KW - sparse principal component analysis

KW - sure screening property

UR - http://www.scopus.com/inward/record.url?scp=84994876279&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994876279&partnerID=8YFLogxK

U2 - 10.1002/sta4.115

DO - 10.1002/sta4.115

M3 - Article

AN - SCOPUS:84994876279

SN - 2049-1573

VL - 5

SP - 200

EP - 212

JO - Stat

JF - Stat

IS - 1

ER -
