Amb-EM: A SNP-based prediction of HLA alleles using ambiguous HLA Data

Vanja Paunić; Michael Steinbach; Abeer Madbouly; Vipin Kumar

doi:10.1145/2649387.2649408

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The Human Leukocyte Antigen (HLA) genes are some of the most studied genes in the genome. This is due to their importance in bone marrow and solid organ transplantation, as well as their strong associations with many autoimmune, infectious, and inflammatory diseases. As such, they can be a highly valuable asset to clinicians and researchers for elucidating the biological mechanisms that may drive those diseases. The extraordinary genetic polymorphism that exists in this region makes it very challenging to type. Therefore, several approaches have been proposed for predicting HLA genes from widely available genome-wide single nucleotide polymorphism (SNP) data sets, in an attempt to reduce cost and utilize existing data. These methods use SNPs and high-resolution training HLA data to build models for prediction of HLA genes in new samples. However, most existing HLA data sets are not available in high resolution (exact allele assignments) but contain allelic ambiguities (inexact allele assignments). This is because existing typing methodologies cannot always distinguish between several possible alleles at a given gene and therefore report an ambiguous allele. Current approaches for prediction of HLA genes from SNP data do not accommodate learning from ambiguous HLA data and, as such, miss the potential for an increased sample size and consequent improvements in prediction performance. In this paper, we propose Amb-EM, a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguities in the HLA data and predicts high-resolution alleles using ambiguous HLA alleles to build the model. Additionally, we measure the impact that uncertainty in the training data has on prediction accuracy, and evaluate it on a real-world data set. Our results show that prediction from ambiguous HLA data outperforms the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model.
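The idea of learning directly from ambiguous typings, rather than imputing them first, can be illustrated with a generic expectation-maximization loop. The sketch below is not the Amb-EM algorithm: it ignores SNPs entirely and only estimates allele frequencies from set-valued observations. The function name `em_allele_frequencies`, the iteration count, and the example allele names are assumptions made for illustration.

```python
from collections import defaultdict


def em_allele_frequencies(observations, n_iter=50):
    """Estimate allele frequencies from set-valued (ambiguous) typings.

    Each observation is a set of candidate alleles; a set of size one is an
    unambiguous, high-resolution typing. Hypothetical helper, not Amb-EM.
    """
    alleles = sorted({a for obs in observations for a in obs})
    # Start from a uniform frequency estimate.
    freq = {a: 1.0 / len(alleles) for a in alleles}
    n = len(observations)
    for _ in range(n_iter):
        # E-step: split each observation across its candidate alleles in
        # proportion to the current frequency estimates.
        counts = defaultdict(float)
        for obs in observations:
            total = sum(freq[a] for a in obs)
            for a in obs:
                counts[a] += freq[a] / total
        # M-step: re-estimate frequencies from the expected counts.
        freq = {a: counts[a] / n for a in alleles}
    return freq


observations = [
    {"A*02:01"},
    {"A*02:01"},
    {"A*02:01", "A*02:05"},  # ambiguous typing: either allele is possible
    {"A*02:05"},
]
print(em_allele_frequencies(observations))
```

In this toy input the ambiguous sample is shared between the two candidates instead of being forced onto one of them, so the estimates settle near 2/3 and 1/3. An impute-first pipeline would instead assign that sample wholly to one allele before fitting.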

Original language	English (US)
Title of host publication	ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher	Association for Computing Machinery
Pages	104-113
Number of pages	10
ISBN (Electronic)	9781450328944
DOIs	https://doi.org/10.1145/2649387.2649408
State	Published - Sep 20 2014
Event	5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014 - Newport Beach, United States
Duration	Sep 20 2014 → Sep 23 2014

Publication series

Name	ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Other

Other	5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014
Country/Territory	United States
City	Newport Beach
Period	9/20/14 → 9/23/14

Bibliographical note

Publisher Copyright:
Copyright © 2014 ACM.

Keywords

Ambiguous genotypes
Expectation-maximization
HLA prediction
SNPs
Uncertain data

Access

10.1145/2649387.2649408

Cite this

Paunić, V., Steinbach, M., Madbouly, A., & Kumar, V. (2014). Amb-EM: A SNP-based prediction of HLA alleles using ambiguous HLA Data. In ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 104-113). (ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics). Association for Computing Machinery. https://doi.org/10.1145/2649387.2649408

@inproceedings{466ffc9862f044d8893b1394d0570b30,

title = "Amb-EM: A SNP-based prediction of HLA alleles using ambiguous HLA Data",

abstract = "The Human Leukocyte Antigen (HLA) genes are some of the most studied genes in the genome. This is due to their importance in bone marrow and solid organ transplantation, as well as their strong associations with many autoimmune, infectious, and inflammatory diseases. As such, they can be a highly valuable asset to clinicians and researchers for elucidating the biological mechanisms that may drive those diseases. The extraordinary genetic polymorphism that exists in this region makes it very challenging to type. Therefore, several approaches have been proposed for predicting HLA genes from widely available genome-wide single nucleotide polymorphism (SNP) data sets, in an attempt to reduce cost and utilize existing data. These methods use SNPs and high-resolution training HLA data to build models for prediction of HLA genes in new samples. However, most existing HLA data sets are not available in high resolution (exact allele assignments) but contain allelic ambiguities (inexact allele assignments). This is because existing typing methodologies cannot always distinguish between several possible alleles at a given gene and therefore report an ambiguous allele. Current approaches for prediction of HLA genes from SNP data do not accommodate learning from ambiguous HLA data and, as such, miss the potential for an increased sample size and consequent improvements in prediction performance. In this paper, we propose Amb-EM, a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguities in the HLA data and predicts high-resolution alleles using ambiguous HLA alleles to build the model. Additionally, we measure the impact that uncertainty in the training data has on prediction accuracy, and evaluate it on a real-world data set. Our results show that prediction from ambiguous HLA data outperforms the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model.",

keywords = "Ambiguous genotypes, Expectation-maximization, HLA prediction, SNPs, Uncertain data",

author = "Vanja Pauni{\'c} and Michael Steinbach and Abeer Madbouly and Vipin Kumar",

note = "Publisher Copyright: Copyright {\textcopyright} 2014 ACM.; 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014 ; Conference date: 20-09-2014 Through 23-09-2014",

year = "2014",

month = sep,

day = "20",

doi = "10.1145/2649387.2649408",

language = "English (US)",

series = "ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics",

publisher = "Association for Computing Machinery",

pages = "104--113",

booktitle = "ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics",

}

TY - GEN

T1 - Amb-EM

T2 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM BCB 2014

AU - Paunić, Vanja

AU - Steinbach, Michael

AU - Madbouly, Abeer

AU - Kumar, Vipin

PY - 2014/9/20

Y1 - 2014/9/20

N2 - The Human Leukocyte Antigen (HLA) genes are some of the most studied genes in the genome. This is due to their importance in bone marrow and solid organ transplantation, as well as their strong associations with many autoimmune, infectious, and inflammatory diseases. As such, they can be a highly valuable asset to clinicians and researchers for elucidating the biological mechanisms that may drive those diseases. The extraordinary genetic polymorphism that exists in this region makes it very challenging to type. Therefore, several approaches have been proposed for predicting HLA genes from widely available genome-wide single nucleotide polymorphism (SNP) data sets, in an attempt to reduce cost and utilize existing data. These methods use SNPs and high-resolution training HLA data to build models for prediction of HLA genes in new samples. However, most existing HLA data sets are not available in high resolution (exact allele assignments) but contain allelic ambiguities (inexact allele assignments). This is because existing typing methodologies cannot always distinguish between several possible alleles at a given gene and therefore report an ambiguous allele. Current approaches for prediction of HLA genes from SNP data do not accommodate learning from ambiguous HLA data and, as such, miss the potential for an increased sample size and consequent improvements in prediction performance. In this paper, we propose Amb-EM, a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguities in the HLA data and predicts high-resolution alleles using ambiguous HLA alleles to build the model. Additionally, we measure the impact that uncertainty in the training data has on prediction accuracy, and evaluate it on a real-world data set. Our results show that prediction from ambiguous HLA data outperforms the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model.

KW - Ambiguous genotypes

KW - Expectation-maximization

KW - HLA prediction

KW - SNPs

KW - Uncertain data

UR - http://www.scopus.com/inward/record.url?scp=84920719089&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84920719089&partnerID=8YFLogxK

U2 - 10.1145/2649387.2649408

DO - 10.1145/2649387.2649408

M3 - Conference contribution

AN - SCOPUS:84920719089

T3 - ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

SP - 104

EP - 113

BT - ACM BCB 2014 - 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

PB - Association for Computing Machinery

Y2 - 20 September 2014 through 23 September 2014

ER -
