Machine learning and statistical methods for clustering single-cell RNA-sequencing data

Raphael Petegrosso; Zhuliu Li; Rui Kuang

doi:10.1093/bib/bbz063

Research output: Contribution to journal › Review article › peer-review

130 Scopus citations

Abstract

Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.

Original language	English (US)
Pages (from-to)	1209-1223
Number of pages	15
Journal	Briefings in Bioinformatics
Volume	21
Issue number	4
DOIs	https://doi.org/10.1093/bib/bbz063
State	Published - Jul 10 2019
Externally published	Yes

Bibliographical note

Publisher Copyright:
© The Author(s) 2019. Published by Oxford University Press. All rights reserved.

Keywords

Clustering
Machine learning
ScRNA sequencing
Single-cell technology

Cite this

@article{b2da3f76c4be4ddfa8c76a5e16414e0c,

title = "Machine learning and statistical methods for clustering single-cell RNA-sequencing data",

abstract = "Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.",

keywords = "Clustering, Machine learning, ScRNA sequencing, Single-cell technology",

author = "Raphael Petegrosso and Zhuliu Li and Rui Kuang",

year = "2019",

month = jul,

day = "10",

doi = "10.1093/bib/bbz063",

language = "English (US)",

volume = "21",

pages = "1209--1223",

journal = "Briefings in Bioinformatics",

issn = "1467-5463",

publisher = "Oxford University Press",

number = "4",

}

TY - JOUR

T1 - Machine learning and statistical methods for clustering single-cell RNA-sequencing data

AU - Petegrosso, Raphael

AU - Li, Zhuliu

AU - Kuang, Rui

PY - 2019/7/10

Y1 - 2019/7/10

N2 - Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.

AB - Single-cell RNA-sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.

KW - Clustering

KW - Machine learning

KW - ScRNA sequencing

KW - Single-cell technology

UR - http://www.scopus.com/inward/record.url?scp=85088253517&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85088253517&partnerID=8YFLogxK

U2 - 10.1093/bib/bbz063

DO - 10.1093/bib/bbz063

M3 - Review article

C2 - 31243426

AN - SCOPUS:85088253517

SN - 1467-5463

VL - 21

SP - 1209

EP - 1223

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

IS - 4

ER -
