Machine learning-based prediction of activity and substrate specificity for OleA enzymes in the thiolase superfamily

Serina L. Robinson; Megan D. Smith; Jack E. Richman; Kelly G. Aukema; Lawrence P. Wackett

doi:10.1093/SYNBIO/YSAA004

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Enzymes in the thiolase superfamily catalyze carbon-carbon bond formation for the biosynthesis of polyhydroxyalkanoate storage molecules, membrane lipids and bioactive secondary metabolites. Natural and engineered thiolases have applications in synthetic biology for the production of high-value compounds, including personal care products and therapeutics. A fundamental understanding of thiolase substrate specificity is lacking, particularly within the OleA protein family. The ability to predict substrates from sequence would advance (meta)genome mining efforts to identify active thiolases for the production of desired metabolites. To gain a deeper understanding of substrate scope within the OleA family, we measured the activity of 73 diverse bacterial thiolases with a library of 15 p-nitrophenyl ester substrates to build a training set of 1095 unique enzyme-substrate pairs. We then used machine learning to predict thiolase substrate specificity from physicochemical and structural features. The area under the receiver operating characteristic curve was 0.89 for random forest classification of enzyme activity, and our regression model had a test set root mean square error of 0.22 (R² = 0.75) to quantitatively predict enzyme activity levels. Substrate aromaticity, oxygen content and molecular connectivity were the strongest predictors of enzyme-substrate pairing. Key amino acid residues A173, I284, V287, T292 and I316 in the Xanthomonas campestris OleA crystal structure lining the substrate binding pockets were important for thiolase substrate specificity and are attractive targets for future protein engineering studies. The predictive framework described here is generalizable and demonstrates how machine learning can be used to quantitatively understand and predict enzyme substrate specificity.
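The dataset size and the three evaluation metrics quoted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the enzyme and substrate identifiers are hypothetical placeholders, the toy labels and scores are invented purely for illustration, and R² is assumed here to mean the coefficient of determination, which the abstract does not define.

```python
from itertools import product
from math import sqrt


def pair_grid(enzymes, substrates):
    """Return every enzyme-substrate combination, one tuple per pair."""
    return list(product(enzymes, substrates))


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen active pair scores higher than a randomly chosen inactive pair
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted activity."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder names; only the counts (73 and 15) come from the abstract.
enzymes = [f"enzyme_{i}" for i in range(1, 74)]
substrates = [f"substrate_{j}" for j in range(1, 16)]
pairs = pair_grid(enzymes, substrates)
print(len(pairs))  # 1095

# Invented toy values, used only to exercise the metric functions.
print(roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
print(rmse([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 1.0]))
print(r_squared([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 1.0]))
```

The pairwise AUC formulation is quadratic in the number of pairs, which is acceptable at the scale of about a thousand observations; a rank-based computation would be the usual choice for larger screens.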

Original language	English (US)
Article number	ysaa004
Journal	Synthetic Biology
Volume	5
Issue number	1
DOIs	https://doi.org/10.1093/SYNBIO/YSAA004
State	Published - 2020

Bibliographical note

Funding Information:
Raw data and scripts to reproduce analyses, figures and tables are available at https://github.com/serina-robinson/thiolase-machine-learning/. An interactive web application with a searchable database and predictive models trained on the complete dataset are also available at z.umn.edu/thiolases (shortened URL) and srobinson.shinyapps.io/thiolases (permanent URL). The DNA constructs, provided in Supplementary Material S1, will be provided upon request. The DNA constructs were provided by the United States Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231. DNA requests will be honored with the completion of a Materials Transfer Agreement as required by our contracts with the U.S. Department of Energy.

Funding Information:
We thank the U.S. Department of Energy Joint Genome Institute for synthetic DNA. The work conducted by the U.S. Department of Energy (DOE) Joint Genome Institute, a DOE Office of Science User Facility, is supported under [DE-AC02-05CH11231]; The National Science Foundation Graduate Research Fellowship [00039202 to S.L.R.]; National Institutes of Health Biotechnology training grant [5T32GM008347-27 to M.D.S.]. We also acknowledge support from the MnDRIVE initiative for Industry and the Environment.

Publisher Copyright:
© The Author(s) 2020.

Keywords

Enzyme activity screen
Machine learning
P-nitrophenyl esters
Substrate specificity
Thiolase

Cite this

@article{9b847274281142bcb175921611dd815b,

title = "Machine learning-based prediction of activity and substrate specificity for OleA enzymes in the thiolase superfamily",

abstract = "Enzymes in the thiolase superfamily catalyze carbon-carbon bond formation for the biosynthesis of polyhydroxyalkanoate storage molecules, membrane lipids and bioactive secondary metabolites. Natural and engineered thiolases have applications in synthetic biology for the production of high-value compounds, including personal care products and therapeutics. A fundamental understanding of thiolase substrate specificity is lacking, particularly within the OleA protein family. The ability to predict substrates from sequence would advance (meta)genome mining efforts to identify active thiolases for the production of desired metabolites. To gain a deeper understanding of substrate scope within the OleA family, we measured the activity of 73 diverse bacterial thiolases with a library of 15 p-nitrophenyl ester substrates to build a training set of 1095 unique enzyme-substrate pairs. We then used machine learning to predict thiolase substrate specificity from physicochemical and structural features. The area under the receiver operating characteristic curve was 0.89 for random forest classification of enzyme activity, and our regression model had a test set root mean square error of 0.22 (R2 = 0.75) to quantitatively predict enzyme activity levels. Substrate aromaticity, oxygen content and molecular connectivity were the strongest predictors of enzyme-substrate pairing. Key amino acid residues A173, I284, V287, T292 and I316 in the Xanthomonas campestris OleA crystal structure lining the substrate binding pockets were important for thiolase substrate specificity and are attractive targets for future protein engineering studies. The predictive framework described here is generalizable and demonstrates how machine learning can be used to quantitatively understand and predict enzyme substrate specificity. ",

keywords = "Enzyme activity screen, Machine learning, P-nitrophenyl esters, Substrate specificity, Thiolase",

author = "Robinson, {Serina L.} and Smith, {Megan D.} and Richman, {Jack E.} and Aukema, {Kelly G.} and Wackett, {Lawrence P.}",

note = "Funding Information: Raw data and scripts to reproduce analyses, figures and tables are available at https://github.com/serina-robinson/thiolase-machine-learning/. An interactive web application with a searchable database and predictive models trained on the complete dataset are also available at z.umn.edu/thiolases (shortened URL) and srobinson.shinyapps.io/thiolases (permanent URL). The DNA constructs, provided in Supplementary Material S1, will be provided upon request. The DNA constructs were provided by the United States Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231. DNA requests will be honored with the completion of a Materials Transfer Agreement as required by our contracts with the U.S. Department of Energy. Funding Information: We thank the U.S. Department of Energy Joint Genome Institute for synthetic DNA. The work conducted by the U.S. Department of Energy (DOE) Joint Genome Institute, a DOE Office of Science User Facility, is supported under [DE-AC02-05CH11231]; The National Science Foundation Graduate Research Fellowship [00039202 to S.L.R.]; National Institutes of Health Biotechnology training grant [5T32GM008347-27 to M.D.S.]. We also acknowledge support from the MnDRIVE initiative for Industry and the Environment. Publisher Copyright: {\textcopyright} The Author(s) 2020.",

year = "2020",

doi = "10.1093/SYNBIO/YSAA004",

language = "English (US)",

volume = "5",

journal = "Synthetic Biology",

issn = "1939-7267",

publisher = "Oxford University Press",

number = "1",

}

TY - JOUR

T1 - Machine learning-based prediction of activity and substrate specificity for OleA enzymes in the thiolase superfamily

AU - Robinson, Serina L.

AU - Smith, Megan D.

AU - Richman, Jack E.

AU - Aukema, Kelly G.

AU - Wackett, Lawrence P.

N1 - Funding Information: Raw data and scripts to reproduce analyses, figures and tables are available at https://github.com/serina-robinson/thiolase-machine-learning/. An interactive web application with a searchable database and predictive models trained on the complete dataset are also available at z.umn.edu/thiolases (shortened URL) and srobinson.shinyapps.io/thiolases (permanent URL). The DNA constructs, provided in Supplementary Material S1, will be provided upon request. The DNA constructs were provided by the United States Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231. DNA requests will be honored with the completion of a Materials Transfer Agreement as required by our contracts with the U.S. Department of Energy. Funding Information: We thank the U.S. Department of Energy Joint Genome Institute for synthetic DNA. The work conducted by the U.S. Department of Energy (DOE) Joint Genome Institute, a DOE Office of Science User Facility, is supported under [DE-AC02-05CH11231]; The National Science Foundation Graduate Research Fellowship [00039202 to S.L.R.]; National Institutes of Health Biotechnology training grant [5T32GM008347-27 to M.D.S.]. We also acknowledge support from the MnDRIVE initiative for Industry and the Environment. Publisher Copyright: © The Author(s) 2020.

PY - 2020

Y1 - 2020

AB - Enzymes in the thiolase superfamily catalyze carbon-carbon bond formation for the biosynthesis of polyhydroxyalkanoate storage molecules, membrane lipids and bioactive secondary metabolites. Natural and engineered thiolases have applications in synthetic biology for the production of high-value compounds, including personal care products and therapeutics. A fundamental understanding of thiolase substrate specificity is lacking, particularly within the OleA protein family. The ability to predict substrates from sequence would advance (meta)genome mining efforts to identify active thiolases for the production of desired metabolites. To gain a deeper understanding of substrate scope within the OleA family, we measured the activity of 73 diverse bacterial thiolases with a library of 15 p-nitrophenyl ester substrates to build a training set of 1095 unique enzyme-substrate pairs. We then used machine learning to predict thiolase substrate specificity from physicochemical and structural features. The area under the receiver operating characteristic curve was 0.89 for random forest classification of enzyme activity, and our regression model had a test set root mean square error of 0.22 (R2 = 0.75) to quantitatively predict enzyme activity levels. Substrate aromaticity, oxygen content and molecular connectivity were the strongest predictors of enzyme-substrate pairing. Key amino acid residues A173, I284, V287, T292 and I316 in the Xanthomonas campestris OleA crystal structure lining the substrate binding pockets were important for thiolase substrate specificity and are attractive targets for future protein engineering studies. The predictive framework described here is generalizable and demonstrates how machine learning can be used to quantitatively understand and predict enzyme substrate specificity.

KW - Enzyme activity screen

KW - Machine learning

KW - P-nitrophenyl esters

KW - Substrate specificity

KW - Thiolase

UR - http://www.scopus.com/inward/record.url?scp=85102056300&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85102056300&partnerID=8YFLogxK

U2 - 10.1093/SYNBIO/YSAA004

DO - 10.1093/SYNBIO/YSAA004

M3 - Article

AN - SCOPUS:85102056300

SN - 1939-7267

VL - 5

JO - Synthetic Biology

JF - Synthetic Biology

IS - 1

M1 - ysaa004

ER -
