Local epigenomic data are more informative than local genome sequence data in predicting enhancer-promoter interactions using neural networks

Mengli Xiao, Zhong Zhuang, Wei Pan

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
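The leakage problem described above can be illustrated with a small sketch. The abstract does not state how the split was corrected; holding out whole chromosomes is assumed here as one common way to keep training, tuning, and test data independent. The record layout (`chrom`, `enhancer`, `promoter`, `label`), the function names, and the toy data are hypothetical and are not taken from the paper. The sketch also assumes that each pair is intra-chromosomal, so one chromosome label covers both the enhancer and the promoter.

```python
def split_by_chromosome(pairs, test_chroms, tune_chroms):
    """Assign each enhancer-promoter pair to train/tune/test by chromosome.

    Because whole chromosomes are held out, an enhancer (or promoter) that
    appears in several pairs can never be split across two data sets.
    """
    test_chroms, tune_chroms = set(test_chroms), set(tune_chroms)
    if test_chroms & tune_chroms:
        raise ValueError("test and tuning chromosomes must not overlap")
    splits = {"train": [], "tune": [], "test": []}
    for pair in pairs:
        if pair["chrom"] in test_chroms:
            splits["test"].append(pair)
        elif pair["chrom"] in tune_chroms:
            splits["tune"].append(pair)
        else:
            splits["train"].append(pair)
    return splits


def shared_elements(split_a, split_b, key):
    """Return the enhancer or promoter IDs that occur in both splits."""
    return {p[key] for p in split_a} & {p[key] for p in split_b}


# Hypothetical toy data: enhancer e1 and promoter p1 each occur in several
# pairs, which is what makes a random pair-level split leak information.
pairs = [
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p1", "label": 1},
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p2", "label": 0},
    {"chrom": "chr1", "enhancer": "e2", "promoter": "p1", "label": 0},
    {"chrom": "chr2", "enhancer": "e3", "promoter": "p3", "label": 1},
    {"chrom": "chr2", "enhancer": "e3", "promoter": "p4", "label": 0},
    {"chrom": "chr3", "enhancer": "e4", "promoter": "p5", "label": 1},
]

splits = split_by_chromosome(pairs, test_chroms=["chr3"], tune_chroms=["chr2"])
leak = shared_elements(splits["train"], splits["test"], "enhancer")
```

With a random pair-level split, the two pairs that share enhancer e1 could land in different data sets; with the chromosome-level split, `leak` is empty by construction.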

Original language: English (US)
Article number: 41
Journal: Genes
Volume: 11
Issue number: 1
State: Published - Jan 2020

Bibliographical note

Funding Information:
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Table S1: Performance of CNNs with varying window and step sizes for the TargetFinder datasets without correct training/validation/test data splitting for cell line GM12878. Table S2: Parameter search grids for CNN models. Table S3: … the results for the training data in parentheses). Table S4: Performance summary of additional epigenomics CNN models. Table S5: The single-cell-line and cross-cell-line mean (SD) test AUROCs across each of the 21 test chromosomes for Gradient Boosting (GB) in comparison with the CNNs and FNNs with the same data format.

Author Contributions: Conceptualization, W.P.; methodology, M.X., Z.Z., W.P.; software, M.X., Z.Z.; validation, M.X.; formal analysis, M.X.; investigation, M.X., Z.Z., W.P.; resources, W.P.; data curation, M.X.; writing (original draft preparation), M.X.; writing (review and editing), W.P.; visualization, M.X.; supervision, W.P.

Funding: This research was supported by NIH grants R21AG057038, R01HL116720, R01GM113250, R01HL105397 and R01GM126002, and by the Minnesota Supercomputing Institute.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Keywords

  • Boosting
  • Convolutional neural networks
  • Deep learning
  • Feed-forward neural networks
  • Machine learning

PubMed: MeSH publication types

  • Journal Article
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
