Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions

Hamid Safizadeh; Scott W. Simpkins; Justin Nelson; Sheena C. Li; Jeff S. Piotrowski; Mami Yoshimura; Yoko Yashiroda; Hiroyuki Hirano; Hiroyuki Osada; Minoru Yoshida; Charles Boone; Chad L. Myers

doi:10.1021/acs.jcim.0c00993

Computer Science and Engineering

Research output: Contribution to journal › Review article › peer-review

10 Scopus citations

Abstract

A common strategy for identifying molecules likely to possess a desired biological activity is to search large databases of compounds for high structural similarity to a query molecule that demonstrates this activity, under the assumption that structural similarity is predictive of similar biological activity. However, efforts to systematically benchmark the diverse array of available molecular fingerprints and similarity coefficients have been limited by a lack of large-scale datasets that reflect biological similarities of compounds. To elucidate the relative performance of these alternatives, we systematically benchmarked 11 different molecular fingerprint encodings, each combined with 13 different similarity coefficients, using a large set of chemical-genetic interaction data from the yeast Saccharomyces cerevisiae as a systematic proxy for biological activity. We found that the performance of different molecular fingerprints and similarity coefficients varied substantially and that the all-shortest path fingerprints paired with the Braun-Blanquet similarity coefficient provided superior performance that was robust across several compound collections. We further proposed a machine learning pipeline based on support vector machines that offered a fivefold improvement relative to the best unsupervised approach. Our results generally suggest that using high-dimensional chemical-genetic data as a basis for refining molecular fingerprints can be a powerful approach for improving prediction of biological functions from chemical structures.
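
As an illustration of what a similarity coefficient computes on a pair of binary fingerprints, the following minimal sketch implements the Braun-Blanquet coefficient named in the abstract (shared on-bits divided by the larger of the two on-bit counts), with the widely used Tanimoto coefficient for comparison. It is not the authors' implementation: the function names, the set-of-indices fingerprint representation, the example bit sets, and the convention of returning 0.0 for two empty fingerprints are all assumptions made for this example, and the all-shortest path fingerprint encoding itself is not reproduced.

```python
from typing import Set


def braun_blanquet(a: Set[int], b: Set[int]) -> float:
    """Braun-Blanquet coefficient: |A & B| / max(|A|, |B|)."""
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / max(len(a), len(b))


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical fingerprints, each given as the set of its on-bit indices.
fp_query = {1, 4, 7, 9}
fp_candidate = {1, 4, 7, 12, 15, 20}

bb = braun_blanquet(fp_query, fp_candidate)  # 3 shared bits / max(4, 6)
tc = tanimoto(fp_query, fp_candidate)        # 3 shared bits / 7 bits in the union

print(f"Braun-Blanquet: {bb:.3f}")
print(f"Tanimoto:       {tc:.3f}")
```

Because the two coefficients normalize the shared-bit count differently, they can rank the same candidate compounds differently, which is the kind of variation the benchmark described above measures.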

Original language	English (US)
Pages (from-to)	4156-4172
Number of pages	17
Journal	Journal of Chemical Information and Modeling
Volume	61
Issue number	9
DOIs	https://doi.org/10.1021/acs.jcim.0c00993
State	Published - Sep 27 2021

Bibliographical note

Funding Information:
We thank Benjamin VanderSluis, Wen Wang, and Maximilian Billmann for proofreading the early drafts of this article. H.S. was partially supported by the National Institutes of Health (NIH) (R01HG005084 and R01GM104975) and the National Science Foundation (NSF) (DBI 0953881). S.W.S. was supported by a NSF Graduate Research Fellowship (00039202), a NIH Biotechnology training grant (T32GM008347), and a BICB one-year fellowship. S.C.L. was supported by a RIKEN Foreign Postdoctoral Research Fellowship and a RIKEN Incentive Research Projects grant. Minoru Yoshida was supported in part by a Grant-in-Aid for Scientific Research (S) (19H05640), the Japan Society for the Promotion of Science (JSPS), and by a Grant-in-Aid for Scientific Research on Innovative Areas (18H05503) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan. C.B. was supported by a JSPS Grant-in-Aid for Scientific Research (B) (15H04483). C.B. and Y.Y. were supported by a Grant-in-Aid for Scientific Research on Innovative Areas (17H06411) from MEXT. C.L.M. is a fellow in the Canadian Institute for Advanced Research (CIFAR) Genetic Networks Program. Computing resources and data storage services were partially provided by the Minnesota Supercomputing Institute and the UMN Office of Information Technology, respectively.

Publisher Copyright:
© 2021 The Authors. Published by American Chemical Society

Cite this

Safizadeh, H., Simpkins, S. W., Nelson, J., Li, S. C., Piotrowski, J. S., Yoshimura, M., Yashiroda, Y., Hirano, H., Osada, H., Yoshida, M., Boone, C., & Myers, C. L. (2021). Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions. Journal of Chemical Information and Modeling, 61(9), 4156-4172. https://doi.org/10.1021/acs.jcim.0c00993

Safizadeh, H, Simpkins, SW, Nelson, J, Li, SC, Piotrowski, JS, Yoshimura, M, Yashiroda, Y, Hirano, H, Osada, H, Yoshida, M, Boone, C & Myers, CL 2021, 'Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions', Journal of Chemical Information and Modeling, vol. 61, no. 9, pp. 4156-4172. https://doi.org/10.1021/acs.jcim.0c00993

@article{7f683c21157747c89f8faf23478b089e,

title = "Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions",

abstract = "A common strategy for identifying molecules likely to possess a desired biological activity is to search large databases of compounds for high structural similarity to a query molecule that demonstrates this activity, under the assumption that structural similarity is predictive of similar biological activity. However, efforts to systematically benchmark the diverse array of available molecular fingerprints and similarity coefficients have been limited by a lack of large-scale datasets that reflect biological similarities of compounds. To elucidate the relative performance of these alternatives, we systematically benchmarked 11 different molecular fingerprint encodings, each combined with 13 different similarity coefficients, using a large set of chemical-genetic interaction data from the yeast Saccharomyces cerevisiae as a systematic proxy for biological activity. We found that the performance of different molecular fingerprints and similarity coefficients varied substantially and that the all-shortest path fingerprints paired with the Braun-Blanquet similarity coefficient provided superior performance that was robust across several compound collections. We further proposed a machine learning pipeline based on support vector machines that offered a fivefold improvement relative to the best unsupervised approach. Our results generally suggest that using high-dimensional chemical-genetic data as a basis for refining molecular fingerprints can be a powerful approach for improving prediction of biological functions from chemical structures.",

author = "Hamid Safizadeh and Simpkins, {Scott W.} and Justin Nelson and Li, {Sheena C.} and Piotrowski, {Jeff S.} and Mami Yoshimura and Yoko Yashiroda and Hiroyuki Hirano and Hiroyuki Osada and Minoru Yoshida and Charles Boone and Myers, {Chad L.}",

note = "Funding Information: We thank Benjamin VanderSluis, Wen Wang, and Maximilian Billmann for proofreading the early drafts of this article. H.S. was partially supported by the National Institutes of Health (NIH) (R01HG005084 and R01GM104975) and the National Science Foundation (NSF) (DBI 0953881). S.W.S. was supported by a NSF Graduate Research Fellowship (00039202), a NIH Biotechnology training grant (T32GM008347), and a BICB one-year fellowship. S.C.L. was supported by a RIKEN Foreign Postdoctoral Research Fellowship and a RIKEN Incentive Research Projects grant. Minoru Yoshida was supported in part by a Grant-in-Aid for Scientific Research (S) (19H05640), the Japan Society for the Promotion of Science (JSPS), and by a Grant-in-Aid for Scientific Research on Innovative Areas (18H05503) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan. C.B. was supported by a JSPS Grant-in-Aid for Scientific Research (B) (15H04483). C.B. and Y.Y. were supported by a Grant-in-Aid for Scientific Research on Innovative Areas (17H06411) from MEXT. C.L.M. is a fellow in the Canadian Institute for Advanced Research (CIFAR) Genetic Networks Program. Computing resources and data storage services were partially provided by the Minnesota Supercomputing Institute and the UMN Office of Information Technology, respectively. Publisher Copyright: {\textcopyright} 2021 The Authors. Published by American Chemical Society",

year = "2021",

month = sep,

day = "27",

doi = "10.1021/acs.jcim.0c00993",

language = "English (US)",

volume = "61",

pages = "4156--4172",

journal = "Journal of Chemical Information and Modeling",

issn = "1549-9596",

publisher = "American Chemical Society",

number = "9",

}

TY - JOUR

T1 - Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions

AU - Safizadeh, Hamid

AU - Simpkins, Scott W.

AU - Nelson, Justin

AU - Li, Sheena C.

AU - Piotrowski, Jeff S.

AU - Yoshimura, Mami

AU - Yashiroda, Yoko

AU - Hirano, Hiroyuki

AU - Osada, Hiroyuki

AU - Yoshida, Minoru

AU - Boone, Charles

AU - Myers, Chad L.

N1 - Funding Information: We thank Benjamin VanderSluis, Wen Wang, and Maximilian Billmann for proofreading the early drafts of this article. H.S. was partially supported by the National Institutes of Health (NIH) (R01HG005084 and R01GM104975) and the National Science Foundation (NSF) (DBI 0953881). S.W.S. was supported by a NSF Graduate Research Fellowship (00039202), a NIH Biotechnology training grant (T32GM008347), and a BICB one-year fellowship. S.C.L. was supported by a RIKEN Foreign Postdoctoral Research Fellowship and a RIKEN Incentive Research Projects grant. Minoru Yoshida was supported in part by a Grant-in-Aid for Scientific Research (S) (19H05640), the Japan Society for the Promotion of Science (JSPS), and by a Grant-in-Aid for Scientific Research on Innovative Areas (18H05503) from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan. C.B. was supported by a JSPS Grant-in-Aid for Scientific Research (B) (15H04483). C.B. and Y.Y. were supported by a Grant-in-Aid for Scientific Research on Innovative Areas (17H06411) from MEXT. C.L.M. is a fellow in the Canadian Institute for Advanced Research (CIFAR) Genetic Networks Program. Computing resources and data storage services were partially provided by the Minnesota Supercomputing Institute and the UMN Office of Information Technology, respectively. Publisher Copyright: © 2021 The Authors. Published by American Chemical Society

PY - 2021/9/27

Y1 - 2021/9/27

N2 - A common strategy for identifying molecules likely to possess a desired biological activity is to search large databases of compounds for high structural similarity to a query molecule that demonstrates this activity, under the assumption that structural similarity is predictive of similar biological activity. However, efforts to systematically benchmark the diverse array of available molecular fingerprints and similarity coefficients have been limited by a lack of large-scale datasets that reflect biological similarities of compounds. To elucidate the relative performance of these alternatives, we systematically benchmarked 11 different molecular fingerprint encodings, each combined with 13 different similarity coefficients, using a large set of chemical-genetic interaction data from the yeast Saccharomyces cerevisiae as a systematic proxy for biological activity. We found that the performance of different molecular fingerprints and similarity coefficients varied substantially and that the all-shortest path fingerprints paired with the Braun-Blanquet similarity coefficient provided superior performance that was robust across several compound collections. We further proposed a machine learning pipeline based on support vector machines that offered a fivefold improvement relative to the best unsupervised approach. Our results generally suggest that using high-dimensional chemical-genetic data as a basis for refining molecular fingerprints can be a powerful approach for improving prediction of biological functions from chemical structures.

UR - http://www.scopus.com/inward/record.url?scp=85112510092&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85112510092&partnerID=8YFLogxK

U2 - 10.1021/acs.jcim.0c00993

DO - 10.1021/acs.jcim.0c00993

M3 - Review article

C2 - 34318674

AN - SCOPUS:85112510092

SN - 1549-9596

VL - 61

SP - 4156

EP - 4172

JO - Journal of Chemical Information and Modeling

JF - Journal of Chemical Information and Modeling

IS - 9

ER -
