The penalized biclustering model and related algorithms

Thierry Chekouo; Alejandro Murua

doi:10.1080/02664763.2014.999647

Mathematics & Statistics

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Biclustering is the simultaneous clustering of two related dimensions, for example, of individuals and features, or genes and experimental conditions. Very few statistical models for biclustering have been proposed in the literature. Instead, most of the research has focused on algorithms to find biclusters. The models underlying them have not received much attention. Hence, very little is known about the adequacy and limitations of the models and the efficiency of the algorithms. In this work, we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the known popular biclustering techniques, and to justify, and many times improve on, the algorithms used to find the biclusters. It turns out that most of the known techniques have a hidden Bayesian flavor. Therefore, we adopt a Bayesian framework to model biclustering. We propose a measure of biclustering complexity (number of biclusters and overlapping) through a penalized plaid model, and present a suitable version of the deviance information criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Our ideas are motivated by the analysis of gene expression data.

Original language	English (US)
Pages (from-to)	1255-1277
Number of pages	23
Journal	Journal of Applied Statistics
Volume	42
Issue number	6
DOIs	https://doi.org/10.1080/02664763.2014.999647
State	Published - Jun 3 2015

Bibliographical note

Publisher Copyright:
© 2015 Taylor & Francis.

Keywords

clustering
deviance information criterion
gene expression
mixture
model selection
plaid model

Cite this

@article{8d23a709fa93491c9348372915f7ff7b,

title = "The penalized biclustering model and related algorithms",

abstract = "Biclustering is the simultaneous clustering of two related dimensions, for example, of individuals and features, or genes and experimental conditions. Very few statistical models for biclustering have been proposed in the literature. Instead, most of the research has focused on algorithms to find biclusters. The models underlying them have not received much attention. Hence, very little is known about the adequacy and limitations of the models and the efficiency of the algorithms. In this work, we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the known popular biclustering techniques, and to justify, and many times improve on, the algorithms used to find the biclusters. It turns out that most of the known techniques have a hidden Bayesian flavor. Therefore, we adopt a Bayesian framework to model biclustering. We propose a measure of biclustering complexity (number of biclusters and overlapping) through a penalized plaid model, and present a suitable version of the deviance information criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Our ideas are motivated by the analysis of gene expression data.",

keywords = "clustering, deviance information criterion, gene expression, mixture, model selection, plaid model",

author = "Thierry Chekouo and Alejandro Murua",

note = "Publisher Copyright: {\textcopyright} 2015 Taylor \& Francis.",

year = "2015",

month = jun,

day = "3",

doi = "10.1080/02664763.2014.999647",

language = "English (US)",

volume = "42",

pages = "1255--1277",

journal = "Journal of Applied Statistics",

issn = "0266-4763",

publisher = "Routledge",

number = "6",

}

TY - JOUR

T1 - The penalized biclustering model and related algorithms

AU - Chekouo, Thierry

AU - Murua, Alejandro

PY - 2015/6/3

Y1 - 2015/6/3

N2 - Biclustering is the simultaneous clustering of two related dimensions, for example, of individuals and features, or genes and experimental conditions. Very few statistical models for biclustering have been proposed in the literature. Instead, most of the research has focused on algorithms to find biclusters. The models underlying them have not received much attention. Hence, very little is known about the adequacy and limitations of the models and the efficiency of the algorithms. In this work, we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the known popular biclustering techniques, and to justify, and many times improve on, the algorithms used to find the biclusters. It turns out that most of the known techniques have a hidden Bayesian flavor. Therefore, we adopt a Bayesian framework to model biclustering. We propose a measure of biclustering complexity (number of biclusters and overlapping) through a penalized plaid model, and present a suitable version of the deviance information criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Our ideas are motivated by the analysis of gene expression data.

KW - clustering

KW - deviance information criterion

KW - gene expression

KW - mixture

KW - model selection

KW - plaid model

UR - http://www.scopus.com/inward/record.url?scp=84924906149&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84924906149&partnerID=8YFLogxK

U2 - 10.1080/02664763.2014.999647

DO - 10.1080/02664763.2014.999647

M3 - Article

AN - SCOPUS:84924906149

SN - 0266-4763

VL - 42

SP - 1255

EP - 1277

JO - Journal of Applied Statistics

JF - Journal of Applied Statistics

IS - 6

ER -
