Frequent substructure-based approaches for classifying chemical compounds

Mukund Deshpande; Michihiro Kuramochi; Nikil Wale; George Karypis

doi:10.1109/TKDE.2005.127

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

299 Scopus citations

Abstract

Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases during the drug development process. These techniques are used to solve a number of classification problems such as predicting whether or not a chemical compound has the desired biological activity, is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that during classification model construction, all relevant substructures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.
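The decoupled pipeline described in the abstract (discover frequent substructures first, then let the classifier pick the discriminating ones) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: compounds are given as hypothetical sets of pre-mined substructure labels in place of a real frequent subgraph miner, features are ranked by the difference in support between the two classes, and a plain perceptron stands in for the SVM named in the keywords.

```python
from collections import Counter

def frequent_features(compounds, min_support):
    """Keep features that occur in at least min_support (fraction) of compounds.
    Stand-in for frequent subgraph discovery: features are assumed pre-mined."""
    counts = Counter(f for feats in compounds for f in feats)
    n = len(compounds)
    return {f for f, c in counts.items() if c / n >= min_support}

def select_discriminating(pos, neg, candidates, top_k):
    """Rank candidates by absolute support difference between the classes.
    This ranking criterion is an illustrative assumption."""
    def support(group, f):
        return sum(f in feats for feats in group) / len(group)
    ranked = sorted(
        candidates,
        key=lambda f: (-abs(support(pos, f) - support(neg, f)), f),
    )
    return ranked[:top_k]

def vectorize(feats, selected):
    """Binary presence vector over the selected substructures."""
    return [1 if f in feats else 0 for f in selected]

def train_perceptron(X, y, epochs=20):
    """Simple perceptron (swapped in for an SVM to stay dependency-free)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else -1
            if pred != t:
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical toy data: each compound is a set of substructure labels.
pos = [{"C-N", "ring6", "C=O"}, {"C-N", "ring6"}, {"C-N", "C=O", "C-C"}]
neg = [{"C-C", "ring6"}, {"C-C", "C=O"}, {"C-C", "ring5"}]

# Step 1: substructure discovery, independent of the classifier.
candidates = frequent_features(pos + neg, min_support=0.5)
# Step 2: feature selection over all discovered substructures.
selected = select_discriminating(pos, neg, candidates, top_k=2)
# Step 3: classification model construction on the selected features.
X = [vectorize(c, selected) for c in pos + neg]
y = [1] * len(pos) + [-1] * len(neg)
w, b = train_perceptron(X, y)

print(selected)
print(predict(w, b, vectorize({"C-N", "ring5"}, selected)))
```

Because discovery and model construction are separate steps, the mining stage or the classifier could be replaced without touching the other, which is the design point the abstract emphasizes.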

Original language	English (US)
Pages (from-to)	1036-1050
Number of pages	15
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	17
Issue number	8
DOIs	https://doi.org/10.1109/TKDE.2005.127
State	Published - Aug 2005

Bibliographical note

Funding Information:
The authors would like to thank Dr. Ian Watson from Lilly Research Laboratories and Dr. Peter Henstock from Pfizer Inc. for providing them with the various fingerprints used in the experimental evaluation and for the numerous discussions on the practical aspects of virtual screening. This work was supported by the US National Science Foundation EIA-9986042, ACI-9982274, ACI-0133464, ACI-0312828, IIS-0431135, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.

Keywords

Chemical compounds
Classification
Graphs
SVM
Virtual screening

Cite this

@article{2af6dc31adc741bd8829864d05ee7fcb,

title = "Frequent substructure-based approaches for classifying chemical compounds",

abstract = "Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases during the drug development process. These techniques are used to solve a number of classification problems such as predicting whether or not a chemical compound has the desired biological activity, is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that during classification model construction, all relevant substructures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.",

keywords = "Chemical compounds, Classification, Graphs, SVM, Virtual screening",

author = "Mukund Deshpande and Michihiro Kuramochi and Nikil Wale and George Karypis",

note = "Funding Information: The authors would like to thank Dr. Ian Watson from Lilly Research Laboratories and Dr. Peter Henstock from Pfizer Inc. for providing them with the various fingerprints used in the experimental evaluation and for the numerous discussions on the practical aspects of virtual screening. This work was supported by the US National Science Foundation EIA-9986042, ACI-9982274, ACI-0133464, ACI-0312828, IIS-0431135, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.",

year = "2005",

month = aug,

doi = "10.1109/TKDE.2005.127",

language = "English (US)",

volume = "17",

pages = "1036--1050",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "8",

}

TY - JOUR

T1 - Frequent substructure-based approaches for classifying chemical compounds

AU - Deshpande, Mukund

AU - Kuramochi, Michihiro

AU - Wale, Nikil

AU - Karypis, George

N1 - Funding Information: The authors would like to thank Dr. Ian Watson from Lilly Research Laboratories and Dr. Peter Henstock from Pfizer Inc. for providing them with the various fingerprints used in the experimental evaluation and for the numerous discussions on the practical aspects of virtual screening. This work was supported by the US National Science Foundation EIA-9986042, ACI-9982274, ACI-0133464, ACI-0312828, IIS-0431135, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.

PY - 2005/8

Y1 - 2005/8

AB - Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases during the drug development process. These techniques are used to solve a number of classification problems such as predicting whether or not a chemical compound has the desired biological activity, is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that during classification model construction, all relevant substructures are available allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.

KW - Chemical compounds

KW - Classification

KW - Graphs

KW - SVM

KW - Virtual screening

UR - http://www.scopus.com/inward/record.url?scp=24344484786&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=24344484786&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2005.127

DO - 10.1109/TKDE.2005.127

M3 - Article

AN - SCOPUS:24344484786

SN - 1041-4347

VL - 17

SP - 1036

EP - 1050

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 8

ER -
