Integrative and regularized principal component analysis of multiple sources of data

Binghui Liu; Xiaotong Shen; Wei Pan

doi:10.1002/sim.6866

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Integration of data of disparate types has become increasingly important to enhancing the power for new discoveries by combining complementary strengths of multiple types of data. One application is to uncover tumor subtypes in human cancer research in which multiple types of genomic data are integrated, including gene expression, DNA copy number, and DNA methylation data. In spite of their successes, existing approaches based on joint latent variable models require stringent distributional assumptions and may suffer from unbalanced scales (or units) of different types of data and non-scalability of the corresponding algorithms. In this paper, we propose an alternative based on integrative and regularized principal component analysis, which is distribution-free, computationally efficient, and robust against unbalanced scales. The new method performs dimension reduction simultaneously on multiple types of data, seeking data-adaptive sparsity and scaling. As a result, in addition to feature selection for each type of data, integrative clustering is achieved. Numerically, the proposed method compares favorably against its competitors in terms of accuracy (in identifying hidden clusters), computational efficiency, and robustness against unbalanced scales. In particular, compared with a popular method, the new method was competitive in identifying tumor subtypes associated with distinct patient survival patterns when applied to a combined analysis of DNA copy number, mRNA expression, and DNA methylation data in a glioblastoma multiforme study.

Original language	English (US)
Pages (from-to)	2235-2250
Number of pages	16
Journal	Statistics in Medicine
Volume	35
Issue number	13
DOIs	https://doi.org/10.1002/sim.6866
State	Published - Jun 15 2016

Bibliographical note

Publisher Copyright:
© 2016 John Wiley & Sons, Ltd.

Keywords

Integrative clustering
PCA
Tumor subtypes

Access

http://europepmc.org/articles/pmc4853304

Cite this

@article{2fa38a88cc7b4514ac11d691fc8576ab,

title = "Integrative and regularized principal component analysis of multiple sources of data",

abstract = "Integration of data of disparate types has become increasingly important to enhancing the power for new discoveries by combining complementary strengths of multiple types of data. One application is to uncover tumor subtypes in human cancer research in which multiple types of genomic data are integrated, including gene expression, DNA copy number, and DNA methylation data. In spite of their successes, existing approaches based on joint latent variable models require stringent distributional assumptions and may suffer from unbalanced scales (or units) of different types of data and non-scalability of the corresponding algorithms. In this paper, we propose an alternative based on integrative and regularized principal component analysis, which is distribution-free, computationally efficient, and robust against unbalanced scales. The new method performs dimension reduction simultaneously on multiple types of data, seeking data-adaptive sparsity and scaling. As a result, in addition to feature selection for each type of data, integrative clustering is achieved. Numerically, the proposed method compares favorably against its competitors in terms of accuracy (in identifying hidden clusters), computational efficiency, and robustness against unbalanced scales. In particular, compared with a popular method, the new method was competitive in identifying tumor subtypes associated with distinct patient survival patterns when applied to a combined analysis of DNA copy number, mRNA expression, and DNA methylation data in a glioblastoma multiforme study.",

keywords = "Integrative clustering, PCA, Tumor subtypes",

author = "Binghui Liu and Xiaotong Shen and Wei Pan",

note = "Publisher Copyright: {\textcopyright} 2016 John Wiley \& Sons, Ltd.",

year = "2016",

month = jun,

day = "15",

doi = "10.1002/sim.6866",

language = "English (US)",

volume = "35",

pages = "2235--2250",

journal = "Statistics in Medicine",

issn = "0277-6715",

publisher = "John Wiley and Sons Ltd",

number = "13",

}

TY - JOUR

T1 - Integrative and regularized principal component analysis of multiple sources of data

AU - Liu, Binghui

AU - Shen, Xiaotong

AU - Pan, Wei

PY - 2016/6/15

Y1 - 2016/6/15

AB - Integration of data of disparate types has become increasingly important to enhancing the power for new discoveries by combining complementary strengths of multiple types of data. One application is to uncover tumor subtypes in human cancer research in which multiple types of genomic data are integrated, including gene expression, DNA copy number, and DNA methylation data. In spite of their successes, existing approaches based on joint latent variable models require stringent distributional assumptions and may suffer from unbalanced scales (or units) of different types of data and non-scalability of the corresponding algorithms. In this paper, we propose an alternative based on integrative and regularized principal component analysis, which is distribution-free, computationally efficient, and robust against unbalanced scales. The new method performs dimension reduction simultaneously on multiple types of data, seeking data-adaptive sparsity and scaling. As a result, in addition to feature selection for each type of data, integrative clustering is achieved. Numerically, the proposed method compares favorably against its competitors in terms of accuracy (in identifying hidden clusters), computational efficiency, and robustness against unbalanced scales. In particular, compared with a popular method, the new method was competitive in identifying tumor subtypes associated with distinct patient survival patterns when applied to a combined analysis of DNA copy number, mRNA expression, and DNA methylation data in a glioblastoma multiforme study.

KW - Integrative clustering

KW - PCA

KW - Tumor subtypes

UR - http://www.scopus.com/inward/record.url?scp=84954271792&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84954271792&partnerID=8YFLogxK

U2 - 10.1002/sim.6866

DO - 10.1002/sim.6866

M3 - Article

C2 - 26756854

AN - SCOPUS:84954271792

SN - 0277-6715

VL - 35

SP - 2235

EP - 2250

JO - Statistics in Medicine

JF - Statistics in Medicine

IS - 13

ER -
