Adjustment for Population Stratification via Principal Components in Association Analysis of Rare Variants

Yiwei Zhang; Weihua Guan; Wei Pan

doi:10.1002/gepi.21691

Biostatistics

Research output: Contribution to journal › Article › peer-review

34 Scopus citations

Abstract

For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
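
As a rough illustration of the workflow the abstract describes, the sketch below bins variants by MAF and computes top PCs from one bin. It is not the authors' code. The function names, the handling of variants lying exactly on a threshold, and the per-variant standardization before the SVD are all assumptions made for this example.

```python
import numpy as np


def minor_allele_freq(genotypes):
    """MAF per variant from an (n_samples, n_variants) array of 0/1/2 allele counts."""
    freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(freq, 1.0 - freq)


def classify_variants(maf):
    """Label variants by the abstract's thresholds: CV > 5%, LFV 1% to 5%, RV < 1%.

    Assumption: variants exactly at 1% or 5% are counted as LFVs.
    """
    labels = np.full(maf.shape, "RV", dtype="U3")
    labels[maf >= 0.01] = "LFV"
    labels[maf > 0.05] = "CV"
    return labels


def top_pcs(genotypes, n_pcs=2):
    """Sample scores on the top principal components, computed by SVD.

    Assumption: each variant is centered and scaled to unit variance first;
    monomorphic variants (zero variance) are dropped.
    """
    g = genotypes.astype(float)
    g = g - g.mean(axis=0)
    sd = g.std(axis=0)
    keep = sd > 0
    g = g[:, keep] / sd[keep]
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


if __name__ == "__main__":
    # Hypothetical simulated data: two groups with unrelated allele frequencies.
    rng = np.random.default_rng(0)
    p_a = rng.uniform(0.1, 0.9, size=300)
    p_b = rng.uniform(0.1, 0.9, size=300)
    g = np.vstack([
        rng.binomial(2, p_a, size=(40, 300)),
        rng.binomial(2, p_b, size=(40, 300)),
    ])
    labels = classify_variants(minor_allele_freq(g))
    pcs = top_pcs(g[:, labels == "CV"], n_pcs=2)
    print(pcs.shape)
```

In an association analysis, the resulting PC scores would typically be entered as covariates alongside the variant of interest, for example in a logistic regression model. That step is omitted here.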

Original language	English (US)
Pages (from-to)	99-109
Number of pages	11
Journal	Genetic Epidemiology
Volume	37
Issue number	1
DOIs	https://doi.org/10.1002/gepi.21691
State	Published - Jan 2013

Keywords

1000 Genomes Project
Association tests
Logistic regression
Next-generation sequencing
SNP
SSU test

Cite this

@article{894a3db491924c70b596e47c4aa4e7de,

title = "Adjustment for Population Stratification via Principal Components in Association Analysis of Rare Variants",

abstract = "For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.",

keywords = "1000 Genomes Project, Association tests, Logistic regression, Next-generation sequencing, SNP, SSU test",

author = "Yiwei Zhang and Weihua Guan and Wei Pan",

year = "2013",

month = jan,

doi = "10.1002/gepi.21691",

language = "English (US)",

volume = "37",

pages = "99--109",

journal = "Genetic Epidemiology",

issn = "0741-0395",

publisher = "Wiley-Liss Inc.",

number = "1",

}

TY - JOUR

T1 - Adjustment for Population Stratification via Principal Components in Association Analysis of Rare Variants

AU - Zhang, Yiwei

AU - Guan, Weihua

AU - Pan, Wei

PY - 2013/1

Y1 - 2013/1

N2 - For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.

AB - For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.

KW - 1000 Genomes Project

KW - Association tests

KW - Logistic regression

KW - Next-generation sequencing

KW - SNP

KW - SSU test

UR - http://www.scopus.com/inward/record.url?scp=84871046806&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84871046806&partnerID=8YFLogxK

U2 - 10.1002/gepi.21691

DO - 10.1002/gepi.21691

M3 - Article

C2 - 23065775

AN - SCOPUS:84871046806

SN - 0741-0395

VL - 37

SP - 99

EP - 109

JO - Genetic Epidemiology

JF - Genetic Epidemiology

IS - 1

ER -