Prediction of mutagenicity of chemicals from their calculated molecular descriptors: A case study with structurally homogeneous versus diverse datasets

Subhash C. Basak; Subhabrata Majumdar

doi:10.2174/1871524915666150722121322

Natural Resources Research Institute

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in extracting information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e., the n << p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other has a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that, for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.
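
The n << p setting mentioned in the abstract can be illustrated with a minimal sketch of ridge regression, one of the traditional procedures it names. This is not the authors' implementation: it leaves out interrelated two-way clustering and envelope models entirely, it uses synthetic random numbers in place of molecular descriptors, and the function names (`solve`, `ridge_dual`) are hypothetical. The sketch relies on the dual form of the ridge solution, which only needs an n-by-n linear system when predictors outnumber samples.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_dual(X, y, lam):
    """Ridge coefficients via the dual form beta = X^T (X X^T + lam I)^{-1} y."""
    n, p = len(X), len(X[0])
    # n-by-n Gram matrix with the ridge penalty added on the diagonal.
    gram = [
        [sum(X[i][k] * X[j][k] for k in range(p)) + (lam if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(gram, y)
    # Map the n dual weights back to p coefficients.
    return [sum(X[i][k] * alpha[i] for i in range(n)) for k in range(p)]


# Synthetic stand-in for a descriptor matrix: 5 samples, 20 predictors (n << p).
random.seed(0)
n, p = 5, 20
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0.0, 1.0) for _ in range(n)]
lam = 1.0  # assumed penalty; in practice it would be tuned, e.g. by cross-validation

beta = ridge_dual(X, y, lam)
print(len(beta), "coefficients estimated from", n, "samples")
```

Without the penalty `lam`, the least-squares problem has infinitely many solutions when p exceeds n; the penalty makes the solution unique, which is why ridge-type procedures are a common choice for descriptor sets of this shape.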

Original language	English (US)
Pages (from-to)	117-123
Number of pages	7
Journal	Current Computer-Aided Drug Design
Volume	11
Issue number	2
DOIs	https://doi.org/10.2174/1871524915666150722121322
State	Published - Sep 1 2015

Bibliographical note

Publisher Copyright:
© 2015 Bentham Science Publishers.

Keywords

Congenericity principle
Diversity begets diversity principle
Envelope models
Hierarchical quantitative structure-activity relationship (HiQSAR)
Interrelated two-way clustering
Linear discriminant analysis
Mutagenicity
Ridge regression
Topological indices

Cite this

@article{b4ca30cef3c4429f8a08a3810c484948,
title = "Prediction of mutagenicity of chemicals from their calculated molecular descriptors: A case study with structurally homogeneous versus diverse datasets",
abstract = "Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in extracting information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e., the n << p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other has a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that, for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.",
keywords = "Congenericity principle, Diversity begets diversity principle, Envelope models, Hierarchical quantitative structure-activity relationship (HiQSAR), Interrelated two-way clustering, Linear discriminant analysis, Mutagenicity, Ridge regression, Topological indices",
author = "Basak, Subhash C. and Majumdar, Subhabrata",
note = "Publisher Copyright: {\textcopyright} 2015 Bentham Science Publishers.",
year = "2015",
month = sep,
day = "1",
doi = "10.2174/1871524915666150722121322",
language = "English (US)",
volume = "11",
pages = "117--123",
journal = "Current Computer-Aided Drug Design",
issn = "1573-4099",
publisher = "Bentham Science Publishers B.V.",
number = "2",
}

TY  - JOUR
T1  - Prediction of mutagenicity of chemicals from their calculated molecular descriptors
T2  - A case study with structurally homogeneous versus diverse datasets
AU  - Basak, Subhash C.
AU  - Majumdar, Subhabrata
PY  - 2015/9/1
Y1  - 2015/9/1
N2  - Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in extracting information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e., the n << p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other has a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that, for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.
AB  - Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in extracting information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e., the n << p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other has a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that, for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.
KW  - Congenericity principle
KW  - Diversity begets diversity principle
KW  - Envelope models
KW  - Hierarchical quantitative structure-activity relationship (HiQSAR)
KW  - Interrelated two-way clustering
KW  - Linear discriminant analysis
KW  - Mutagenicity
KW  - Ridge regression
KW  - Topological indices
UR  - http://www.scopus.com/inward/record.url?scp=84938724509&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84938724509&partnerID=8YFLogxK
U2  - 10.2174/1871524915666150722121322
DO  - 10.2174/1871524915666150722121322
M3  - Article
C2  - 26202887
AN  - SCOPUS:84938724509
SN  - 1573-4099
VL  - 11
SP  - 117
EP  - 123
JO  - Current Computer-Aided Drug Design
JF  - Current Computer-Aided Drug Design
IS  - 2
ER  -
