Model selection procedure for high-dimensional data

Yongli Zhang; Xiaotong Shen

doi:10.1002/sam.10088

Statistics (Twin Cities)

Research output: Contribution to journal › Article › peer-review

29 Scopus citations

Abstract

For high-dimensional regression, the number of predictors may greatly exceed the sample size, but only a small fraction of them are related to the response. Variable selection is therefore unavoidable, and consistent model selection is the primary concern. However, conventional consistent model selection criteria such as the Bayesian information criterion (BIC) may be inadequate because they do not adapt to the model space and because exhaustive search is infeasible. To address these two issues, we establish a lower bound on the probability of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, which we call RIC_c, that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RIC_c. Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms backward variable selection in terms of price forecasting accuracy.
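
To make the point about exhaustive search concrete, the following minimal Python sketch counts candidate models. It is an illustration only, not the authors' procedure: the function names and the values p = 1000, n = 100 are hypothetical, and the bound of min(n, p) models along a stepwise path is an assumption used for comparison, not a result stated in the abstract.

```python
from math import comb


def n_subsets_up_to(p: int, k: int) -> int:
    """Count candidate models that use at most k of the p predictors."""
    return sum(comb(p, j) for j in range(k + 1))


def n_path_models(n: int, p: int) -> int:
    """Assumed upper bound on the number of models along a stepwise path
    (one model per step, at most min(n, p) steps)."""
    return min(n, p)


if __name__ == "__main__":
    p, n = 1000, 100  # hypothetical "large p but small n" setting
    # Exhaustive search, even when restricted to models with at most 5 predictors
    print("subsets with <= 5 predictors:", n_subsets_up_to(p, 5))
    # Candidate models if the search is restricted to a single path
    print("models along one path (assumed bound):", n_path_models(n, p))
```

The contrast between the two counts is why a criterion is scored only on the short sequence of models produced by a path algorithm such as LAR rather than on every subset.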

Original language	English (US)
Pages (from-to)	350-358
Number of pages	9
Journal	Statistical Analysis and Data Mining
Volume	3
Issue number	5
DOIs	https://doi.org/10.1002/sam.10088
State	Published - Oct 2010

Keywords

Information criterion
Large p but small n
Model selection
Power market
RIC

Cite this

@article{470ec08774a14c53946c5b5b33408e4c,

title = "Model selection procedure for high-dimensional data",

abstract = "For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like Bayesian information criterion (BIC) may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RICc, which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RICc. Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.",

keywords = "Information criterion, Large p but small n, Model selection, Power market, RIC",

author = "Yongli Zhang and Xiaotong Shen",

year = "2010",

month = oct,

doi = "10.1002/sam.10088",

language = "English (US)",

volume = "3",

pages = "350--358",

journal = "Statistical Analysis and Data Mining",

issn = "1932-1872",

publisher = "John Wiley and Sons Inc.",

number = "5",

}

TY - JOUR

T1 - Model selection procedure for high-dimensional data

AU - Zhang, Yongli

AU - Shen, Xiaotong

PY - 2010/10

Y1 - 2010/10

N2 - For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like Bayesian information criterion (BIC) may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RICc, which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RICc. Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.

KW - Information criterion

KW - Large p but small n

KW - Model selection

KW - Power market

KW - RIC

UR - http://www.scopus.com/inward/record.url?scp=78649976558&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78649976558&partnerID=8YFLogxK

U2 - 10.1002/sam.10088

DO - 10.1002/sam.10088

M3 - Article

C2 - 21116443

AN - SCOPUS:78649976558

SN - 1932-1872

VL - 3

SP - 350

EP - 358

JO - Statistical Analysis and Data Mining

JF - Statistical Analysis and Data Mining

IS - 5

ER -
