TY - JOUR
T1 - Adjustment for Population Stratification via Principal Components in Association Analysis of Rare Variants
AU - Zhang, Yiwei
AU - Guan, Weihua
AU - Pan, Wei
PY - 2013/1
Y1 - 2013/1
N2 - For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 5%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
AB - For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 5%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
KW - 1000 Genomes Project
KW - Association tests
KW - Logistic regression
KW - Next-generation sequencing
KW - SNP
KW - SSU test
UR - http://www.scopus.com/inward/record.url?scp=84871046806&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84871046806&partnerID=8YFLogxK
U2 - 10.1002/gepi.21691
DO - 10.1002/gepi.21691
M3 - Article
C2 - 23065775
AN - SCOPUS:84871046806
SN - 0741-0395
VL - 37
SP - 99
EP - 109
JO - Genetic epidemiology
JF - Genetic epidemiology
IS - 1
ER -