QSAR with Few Compounds and Many Features

Douglas M. Hawkins; Subhash C. Basak; Xiaofang Shi

doi:10.1021/ci0001177

Research output: Contribution to journal › Article › peer-review

82 Scopus citations

Abstract

Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies depending on the "shape" of the data matrix. When few features are used and there are many compounds, it is a reasonable expectation that good feature subset selection may be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), PLS, and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.
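
The central idea of the abstract, choosing the ridge constant by generalized cross-validation (GCV) when there are more features than compounds, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name ridge_gcv, the candidate grid of ridge constants, and the synthetic data are assumptions made for illustration, and the F-tests for additional information are not covered. The sketch uses the standard GCV criterion n * RSS / (n - df)^2, where df is the trace of the ridge hat matrix plus one for the intercept.

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    """Ridge regression with the ridge constant chosen by GCV (illustrative sketch).

    X: (n, p) descriptor matrix, p may exceed n. y: (n,) activities.
    lambdas: strictly positive candidate ridge constants (an assumed grid).
    """
    n = X.shape[0]
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # center so the intercept is unpenalized
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Uty = U.T @ yc
    best = None
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)           # eigenvalues of the ridge hat matrix
        resid = yc - U @ (shrink * Uty)
        rss = float(resid @ resid)
        df = shrink.sum() + 1.0                # effective parameters, +1 for the intercept
        gcv = n * rss / (n - df) ** 2
        if best is None or gcv < best[0]:
            coef = Vt.T @ ((s / (s**2 + lam)) * Uty)
            best = (gcv, float(lam), coef)
    gcv, lam, coef = best
    intercept = y_mean - x_mean @ coef
    return lam, coef, intercept, gcv

# Synthetic underdetermined example: 20 "compounds", 50 "descriptors" (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=20)
lam, coef, intercept, gcv = ridge_gcv(X, y, np.logspace(-2, 3, 30))
print("chosen ridge constant:", lam, "GCV score:", gcv)
```

Working through the singular value decomposition keeps each candidate ridge constant cheap to evaluate and avoids inverting a p-by-p matrix, which matters when features far outnumber compounds. Each candidate must be strictly positive: in the underdetermined case a zero ridge constant interpolates the data and makes the GCV denominator vanish.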

Original language	English (US)
Pages (from-to)	663-670
Number of pages	8
Journal	Journal of Chemical Information and Computer Sciences
Volume	41
Issue number	3
DOIs	https://doi.org/10.1021/ci0001177
State	Published - 2001

Cite this

@article{4485d86fcdc1418b9d4655a2dbe2674e,
title = "QSAR with Few Compounds and Many Features",
abstract = "Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies depending on the {"}shape{"} of the data matrix. When few features are used and there are many compounds, it is a reasonable expectation that good feature subset selection may be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), PLS, and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.",
author = "Hawkins, {Douglas M.} and Basak, {Subhash C.} and Shi, Xiaofang",
year = "2001",
doi = "10.1021/ci0001177",
language = "English (US)",
volume = "41",
pages = "663--670",
journal = "Journal of Chemical Information and Computer Sciences",
issn = "0095-2338",
publisher = "American Chemical Society",
number = "3",
}

TY - JOUR

T1 - QSAR with Few Compounds and Many Features

AU - Hawkins, Douglas M.

AU - Basak, Subhash C.

AU - Shi, Xiaofang

PY - 2001

Y1 - 2001

AB - Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies depending on the "shape" of the data matrix. When few features are used and there are many compounds, it is a reasonable expectation that good feature subset selection may be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), PLS, and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.

UR - http://www.scopus.com/inward/record.url?scp=0035350283&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035350283&partnerID=8YFLogxK

U2 - 10.1021/ci0001177

DO - 10.1021/ci0001177

M3 - Article

C2 - 11410044

AN - SCOPUS:0035350283

SN - 0095-2338

VL - 41

SP - 663

EP - 670

JO - Journal of Chemical Information and Computer Sciences

JF - Journal of Chemical Information and Computer Sciences

IS - 3

ER -
