Sparse cluster analysis of large-scale discrete variables with application to single nucleotide polymorphism data

Baolin Wu

doi:10.1080/02664763.2012.743977

Biostatistics

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Extremely large-scale genetic data present significant challenges for cluster analysis. Most existing clustering methods are built on the Euclidean distance and geared toward analyzing continuous responses. They work well for clustering data such as microarray gene expression measurements, but often perform poorly on large-scale single nucleotide polymorphism (SNP) data. In this paper, we study the penalized latent class model for clustering extremely large-scale discrete data. The model accounts for the discrete nature of the response through appropriate generalized linear models and adopts the lasso penalized likelihood approach to estimate the model and select important covariates simultaneously. We develop efficient numerical algorithms for model estimation based on iterative coordinate descent, and we extend the expectation-maximization algorithm to incorporate and model missing values. Simulation studies and applications to the international HapMap SNP data illustrate the competitive performance of the penalized latent class model.
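The abstract names lasso penalization and iterative coordinate descent without spelling out the update. As a rough illustration only, the sketch below shows the generic textbook form of cyclic coordinate descent for a lasso-penalized least-squares problem, where each coefficient is updated by soft-thresholding. It is not the paper's penalized latent class algorithm: the least-squares loss, the function names `soft_threshold` and `lasso_coordinate_descent`, and the toy data are all assumptions made for this example.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    Hypothetical helper for illustration; X is a list of rows, y a list of floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(v * v for v in col) / n
            if norm_sq == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the partial residual,
            # i.e. the residual with coordinate j's own contribution added back.
            rho = sum(col[i] * residual[i] for i in range(n)) / n + norm_sq * beta[j]
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


# Toy data (made up): the first column drives y, the second is nearly irrelevant.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.1, -0.1]
beta = lasso_coordinate_descent(X, y, lam=0.1)
```

In this toy setup the weak second coefficient is set exactly to zero by the threshold, which is the mechanism that lets a lasso penalty perform variable selection during estimation.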

Original language	English (US)
Pages (from-to)	358-367
Number of pages	10
Journal	Journal of Applied Statistics
Volume	40
Issue number	2
DOIs	https://doi.org/10.1080/02664763.2012.743977
State	Published - Feb 2013

Bibliographical note

Funding Information:
This research was supported in part by NIH grants GM083345 and CA134848. I would like to thank two anonymous referees for their constructive comments, which have dramatically improved the presentation of the paper.

Keywords

clustering
expectation-maximization algorithm
k-means
lasso
latent class model
principal components
single nucleotide polymorphism
sparse clustering

@article{b80a3c5e63414a7199b7083e4390d2c4,

title = "Sparse cluster analysis of large-scale discrete variables with application to single nucleotide polymorphism data",

abstract = "Currently, extreme large-scale genetic data present significant challenges for cluster analysis. Most of the existing clustering methods are typically built on the Euclidean distance and geared toward analyzing continuous response. They work well for clustering, e.g. microarray gene expression data, but often perform poorly for clustering, e.g. large-scale single nucleotide polymorphism (SNP) data. In this paper, we study the penalized latent class model for clustering extremely large-scale discrete data. The penalized latent class model takes into account the discrete nature of the response using appropriate generalized linear models and adopts the lasso penalized likelihood approach for simultaneous model estimation and selection of important covariates. We develop very efficient numerical algorithms for model estimation based on the iterative coordinate descent approach and further develop the expectation-maximization algorithm to incorporate and model missing values. We use simulation studies and applications to the international HapMap SNP data to illustrate the competitive performance of the penalized latent class model.",

keywords = "clustering, expectation-maximization algorithm, k-means, lasso, latent class model, principal components, single nucleotide polymorphism, sparse clustering",

author = "Baolin Wu",

note = "Funding Information: This research was supported in part by NIH grant GM083345 and CA134848. I would like to thank two anonymous referees for their constructive comments that have dramatically improved the presentation of the paper.",

year = "2013",

month = feb,

doi = "10.1080/02664763.2012.743977",

language = "English (US)",

volume = "40",

pages = "358--367",

journal = "Journal of Applied Statistics",

issn = "0266-4763",

publisher = "Routledge",

number = "2",

}

TY - JOUR

T1 - Sparse cluster analysis of large-scale discrete variables with application to single nucleotide polymorphism data

AU - Wu, Baolin

N1 - Funding Information: This research was supported in part by NIH grant GM083345 and CA134848. I would like to thank two anonymous referees for their constructive comments that have dramatically improved the presentation of the paper.

PY - 2013/2

Y1 - 2013/2

AB - Currently, extreme large-scale genetic data present significant challenges for cluster analysis. Most of the existing clustering methods are typically built on the Euclidean distance and geared toward analyzing continuous response. They work well for clustering, e.g. microarray gene expression data, but often perform poorly for clustering, e.g. large-scale single nucleotide polymorphism (SNP) data. In this paper, we study the penalized latent class model for clustering extremely large-scale discrete data. The penalized latent class model takes into account the discrete nature of the response using appropriate generalized linear models and adopts the lasso penalized likelihood approach for simultaneous model estimation and selection of important covariates. We develop very efficient numerical algorithms for model estimation based on the iterative coordinate descent approach and further develop the expectation-maximization algorithm to incorporate and model missing values. We use simulation studies and applications to the international HapMap SNP data to illustrate the competitive performance of the penalized latent class model.

KW - clustering

KW - expectation-maximization algorithm

KW - k-means

KW - lasso

KW - latent class model

KW - principal components

KW - single nucleotide polymorphism

KW - sparse clustering

UR - http://www.scopus.com/inward/record.url?scp=84871197533&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84871197533&partnerID=8YFLogxK

U2 - 10.1080/02664763.2012.743977

DO - 10.1080/02664763.2012.743977

M3 - Article

AN - SCOPUS:84871197533

SN - 0266-4763

VL - 40

SP - 358

EP - 367

JO - Journal of Applied Statistics

JF - Journal of Applied Statistics

IS - 2

ER -
