Local epigenomic data are more informative than local genome sequence data in predicting enhancer-promoter interactions using neural networks

Mengli Xiao; Zhong Zhuang; Wei Pan

doi:10.3390/genes11010041

Biostatistics

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
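To make the data-splitting issue concrete, here is a minimal Python sketch; it is not the authors' code. It assumes a hypothetical list of pair records with `enhancer`, `promoter`, `chrom`, and `label` fields, and it assumes element IDs are unique to a chromosome. Chromosome-level holdout is used here only as one simple way to keep training and test sets from sharing an enhancer or promoter; the abstract does not say which splitting scheme the authors adopted. All function and field names are illustrative.

```python
import random
from collections import defaultdict

# Hypothetical toy data: several pairs reuse the same enhancer or promoter.
pairs = [
    {"enhancer": "e1", "promoter": "p1", "chrom": "chr1", "label": 1},
    {"enhancer": "e1", "promoter": "p2", "chrom": "chr1", "label": 0},
    {"enhancer": "e2", "promoter": "p2", "chrom": "chr1", "label": 1},
    {"enhancer": "e3", "promoter": "p3", "chrom": "chr2", "label": 1},
    {"enhancer": "e3", "promoter": "p4", "chrom": "chr2", "label": 0},
    {"enhancer": "e5", "promoter": "p5", "chrom": "chr3", "label": 1},
]


def shared_elements(train, test, key):
    """Return the enhancer (or promoter) IDs that occur in both sets."""
    return {p[key] for p in train} & {p[key] for p in test}


def split_pairs_by_chromosome(pairs, test_fraction=0.2, seed=0):
    """Hold out whole chromosomes so no element is shared across sets."""
    by_chrom = defaultdict(list)
    for pair in pairs:
        by_chrom[pair["chrom"]].append(pair)
    chroms = sorted(by_chrom)
    if len(chroms) < 2:
        raise ValueError("need at least two chromosomes to split")
    random.Random(seed).shuffle(chroms)
    # Keep at least one chromosome on each side of the split.
    n_test = min(max(1, round(len(chroms) * test_fraction)), len(chroms) - 1)
    test_chroms = set(chroms[:n_test])
    train = [p for c in chroms if c not in test_chroms for p in by_chrom[c]]
    test = [p for c in chroms if c in test_chroms for p in by_chrom[c]]
    return train, test


# An interleaved split stands in for a pair-level random split:
# the same enhancers and promoters end up on both sides.
naive_train, naive_test = pairs[::2], pairs[1::2]
print("pair-level split, shared enhancers:",
      sorted(shared_elements(naive_train, naive_test, "enhancer")))
print("pair-level split, shared promoters:",
      sorted(shared_elements(naive_train, naive_test, "promoter")))

# The chromosome-level split leaves nothing shared.
train, test = split_pairs_by_chromosome(pairs, test_fraction=0.34, seed=0)
print("chromosome split, shared enhancers:",
      sorted(shared_elements(train, test, "enhancer")))
print("chromosome split, shared promoters:",
      sorted(shared_elements(train, test, "promoter")))
```

In the toy data, the pair-level split places enhancers `e1` and `e3` and promoter `p2` in both sets, which is the kind of dependence the abstract describes. The chromosome-level split removes that overlap, at the cost of less control over set sizes.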

Original language	English (US)
Article number	41
Journal	Genes
Volume	11
Issue number	1
DOIs	https://doi.org/10.3390/genes11010041
State	Published - Jan 2020

Bibliographical note

Publisher Copyright:
© 2019 by the authors. Licensee MDPI, Basel, Switzerland.

Keywords

Boosting
Convolutional neural networks
Deep learning
Feed-forward neural networks
Machine learning

Cite this

@article{4f330efe50ee420db8d3debcae3f9f40,
  title = "Local epigenomic data are more informative than local genome sequence data in predicting enhancer-promoter interactions using neural networks",
  abstract = "Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.",
  keywords = "Boosting, Convolutional neural networks, Deep learning, Feed-forward neural networks, Machine learning",
  author = "Mengli Xiao and Zhong Zhuang and Wei Pan",
  note = "Publisher Copyright: {\textcopyright} 2019 by the authors. Licensee MDPI, Basel, Switzerland.",
  year = "2020",
  month = jan,
  doi = "10.3390/genes11010041",
  language = "English (US)",
  volume = "11",
  journal = "Genes",
  issn = "2073-4425",
  publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",
  number = "1",
}

TY  - JOUR
T1  - Local epigenomic data are more informative than local genome sequence data in predicting enhancer-promoter interactions using neural networks
AU  - Xiao, Mengli
AU  - Zhuang, Zhong
AU  - Pan, Wei
PY  - 2020/1
Y1  - 2020/1
N2  - Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
AB  - Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
KW  - Boosting
KW  - Convolutional neural networks
KW  - Deep learning
KW  - Feed-forward neural networks
KW  - Machine learning
UR  - http://www.scopus.com/inward/record.url?scp=85077594005&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85077594005&partnerID=8YFLogxK
U2  - 10.3390/genes11010041
DO  - 10.3390/genes11010041
M3  - Article
C2  - 31905774
AN  - SCOPUS:85077594005
SN  - 2073-4425
VL  - 11
JO  - Genes
JF  - Genes
IS  - 1
M1  - 41
ER  - 
