Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables

Benhuai Xie; Wei Pan; Xiaotong T Shen

doi:10.1214/08-EJS194

Research output: Contribution to journal › Article › peer-review

46 Scopus citations

Abstract

Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.
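
The "shrinkage and thresholding" mentioned in the abstract can be illustrated with the generic soft-thresholding operator commonly associated with L1-type penalties. The sketch below is only an illustration under stated assumptions: it is not the M-step derived in the article, and the function name `soft_threshold`, the example means, and the penalty value are all hypothetical.

```python
def soft_threshold(value: float, threshold: float) -> float:
    """Shrink `value` toward zero by `threshold`; return 0.0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


# Hypothetical unpenalized mean estimates for five variables in one cluster
# (illustrative numbers, not taken from the article).
raw_means = [2.5, -0.25, 0.5, -1.75, 0.0]
penalty = 0.5  # hypothetical tuning value

# Large estimates are shrunk by the penalty; small ones are set exactly to zero.
shrunken_means = [soft_threshold(m, penalty) for m in raw_means]

# Indices of variables whose shrunken mean is still nonzero.
selected = [j for j, m in enumerate(shrunken_means) if m != 0.0]

print(shrunken_means)
print(selected)
```

In this toy setting, estimates whose magnitude does not exceed the penalty are set exactly to zero, which is how an L1-type penalty can remove a variable rather than merely damp it.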

Original language	English (US)
Pages (from-to)	168-212
Number of pages	45
Journal	Electronic Journal of Statistics
Volume	2
DOIs	https://doi.org/10.1214/08-EJS194
State	Published - 2008

Keywords

BIC
EM algorithm
High-dimension but low-sample size
L penalization
Microarray gene expression
Mixture model
Penalized likelihood

Cite this

@article{ee705a5d736743caac6e7c56fbcc0bc2,

title = "Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables",

abstract = "Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.",

keywords = "BIC, EM algorithm, High-dimension but low-sample size, L penalization, Microarray gene expression, Mixture model, Penalized likelihood",

author = "Xie, Benhuai and Pan, Wei and Shen, {Xiaotong T}",

year = "2008",

doi = "10.1214/08-EJS194",

language = "English (US)",

volume = "2",

pages = "168--212",

journal = "Electronic Journal of Statistics",

issn = "1935-7524",

publisher = "Institute of Mathematical Statistics",

}

TY - JOUR

T1 - Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables

AU - Xie, Benhuai

AU - Pan, Wei

AU - Shen, Xiaotong T

PY - 2008

Y1 - 2008

N2 - Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.

AB - Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.

KW - BIC

KW - EM algorithm

KW - High-dimension but low-sample size

KW - L penalization

KW - Microarray gene expression

KW - Mixture model

KW - Penalized likelihood

UR - http://www.scopus.com/inward/record.url?scp=70449374222&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449374222&partnerID=8YFLogxK

U2 - 10.1214/08-EJS194

DO - 10.1214/08-EJS194

M3 - Article

AN - SCOPUS:70449374222

SN - 1935-7524

VL - 2

SP - 168

EP - 212

JO - Electronic Journal of Statistics

JF - Electronic Journal of Statistics

ER -
