Copy-number variants (CNVs) account for a substantial proportion of human genetic variation. Understanding CNV diversity across populations is a computational challenge because CNV patterns are often shared by several related populations and occur in only a subgroup of individuals within each population. This paper introduces a tree-guided sparse group selection algorithm (treeSGS) to detect population-differentiation CNV markers of subgroups across populations organized by a phylogenetic tree of human populations. The treeSGS algorithm detects CNV markers of populations associated with nodes at all levels of the tree, so that the evolutionary relations among the populations are incorporated for more accurate detection of population-differentiation CNVs. We applied the treeSGS algorithm to the 1,179 samples from the 11 populations in the HapMap3 CNV data. The treeSGS algorithm accurately identifies CNV markers of each population and of the collections of populations organized under the branches of the human population tree, as validated by consistency among family trios and SNP characterizations of the CNV regions. Further comparison between the detected CNV markers and other population-differentiation CNVs reported in the 1000 Genomes data and other recent studies also shows that treeSGS can significantly improve the current annotations of population-differentiation CNV markers. The treeSGS package is available at https://github.com/kuanglab/treeSGS.
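To make the phrase "nodes from all levels of the tree" concrete, the following is a minimal illustrative sketch and not the treeSGS implementation (for that, see the package linked above). It assumes a toy, hypothetical tree topology over six HapMap3 population codes, shows how each tree node induces a group of populations, and includes the standard group LASSO soft-thresholding step for a single group. The names `TREE`, `node_groups`, and `group_soft_threshold` are invented for this illustration.

```python
import math

# Toy population tree (hypothetical topology, for illustration only).
# Internal nodes map to child subtrees; leaves map to None.
TREE = {
    "root": {
        "African": {"YRI": None, "LWK": None},
        "European": {"CEU": None, "TSI": None},
        "EastAsian": {"CHB": None, "JPT": None},
    }
}


def node_groups(tree):
    """Map every node (internal or leaf) to the sorted leaf populations below it."""
    groups = {}

    def visit(name, children):
        if children is None:  # leaf: the group is the population itself
            leaves = [name]
        else:
            leaves = []
            for child_name, grandchildren in children.items():
                leaves.extend(visit(child_name, grandchildren))
        groups[name] = sorted(leaves)
        return leaves

    for name, children in tree.items():
        visit(name, children)
    return groups


def group_soft_threshold(weights, lam):
    """Proximal operator of lam * ||w||_2, the group LASSO penalty for one group."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= lam:
        return [0.0] * len(weights)  # the whole group is zeroed out together
    scale = 1.0 - lam / norm
    return [scale * w for w in weights]


groups = node_groups(TREE)
print(groups["African"])                       # populations under one internal node
print(group_soft_threshold([3.0, 4.0], 1.0))   # shrunk, but kept as a group
```

In this sketch, a marker could be associated with a leaf group (one population) or with an internal-node group (several related populations), which is the sense in which a tree lets related populations share a marker. How treeSGS actually combines feature selection and sample-subgroup selection is defined in the paper and package, not here.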
|Original language|English (US)|
|Number of pages|12|
|Journal|IEEE/ACM Transactions on Computational Biology and Bioinformatics|
|State|Published - Mar 1 2019|
Bibliographical note
Funding Information:
This work was supported by the US National Science Foundation (IIS 1149697).
© 2017 IEEE.
- DNA copy number variants
- Machine learning algorithms
- computational biology
- group LASSO
- population genetics
- tree algorithms