Statistical Methods in Proteomics

Weichuan Yu; Baolin Wu; Tao Huang; Xiaoye Li; Kenneth Williams; Hongyu Zhao

doi:10.1007/978-1-84628-288-1_34

Biostatistics

Research output: Chapter in Book/Report/Conference proceeding › Chapter

14 Scopus citations

Abstract

Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research. We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem MS/MS with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions. Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5. In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification. Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.

Original language	English (US)
Title of host publication	Statistical Methods in Proteomics
Editors	Hoang Pham
Place of Publication	London
Publisher	Springer
Pages	623-638
Number of pages	16
ISBN (Print)	978-1-84628-288-1
DOIs	https://doi.org/10.1007/978-1-84628-288-1_34
State	Published - 2006

Publication series

Name	Springer Handbooks
ISSN (Print)	2522-8692
ISSN (Electronic)	2522-8706

Keywords

Feature Selection
Feature Selection Method
Mass Spectrometry Data
Random Forest
Sample Classification

Cite this

@inbook{2d46af0c17fe45e99cb057cbf18a5558,

title = "Statistical Methods in Proteomics",

abstract = "Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research. We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem MS/MS with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions. Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5. In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification. Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.",

keywords = "Feature Selection, Feature Selection Method, Mass Spectrometry Data, Random Forest, Sample Classification",

author = "Weichuan Yu and Baolin Wu and Tao Huang and Xiaoye Li and Kenneth Williams and Hongyu Zhao",

year = "2006",

doi = "10.1007/978-1-84628-288-1_34",

language = "English (US)",

isbn = "978-1-84628-288-1",

series = "Springer Handbooks",

publisher = "Springer",

pages = "623--638",

editor = "Hoang Pham",

booktitle = "Statistical Methods in Proteomics",

}

TY - CHAP

T1 - Statistical Methods in Proteomics

AU - Yu, Weichuan

AU - Wu, Baolin

AU - Huang, Tao

AU - Li, Xiaoye

AU - Williams, Kenneth

AU - Zhao, Hongyu

PY - 2006

Y1 - 2006

N2 - Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research. We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem MS/MS with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions. Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5. In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification. Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.

KW - Feature Selection

KW - Feature Selection Method

KW - Mass Spectrometry Data

KW - Random Forest

KW - Sample Classification

UR - http://www.scopus.com/inward/record.url?scp=71449090611&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=71449090611&partnerID=8YFLogxK

U2 - 10.1007/978-1-84628-288-1_34

DO - 10.1007/978-1-84628-288-1_34

M3 - Chapter

SN - 978-1-84628-288-1

T3 - Springer Handbooks

SP - 623

EP - 638

BT - Statistical Methods in Proteomics

A2 - Pham, Hoang

PB - Springer

CY - London

ER -
