QSAR with Few Compounds and Many Features

Douglas M. Hawkins, Subhash C. Basak, Xiaofang Shi

Research output: Contribution to journal › Article › peer-review

82 Scopus citations

Abstract

Fitting quantitative structure-activity relationships (QSAR) requires different statistical methodologies and, to some degree, philosophies, depending on the "shape" of the data matrix. When few features are used and there are many compounds, it is reasonable to expect that good feature subset selection can be made and that nonlinearities and nonadditivities can be detected and diagnosed. Where there are many features and few compounds, this is unrealistic. Methods such as ridge regression (RR), partial least squares (PLS), and principal component regression (PCR), which abjure feature selection and rely on linearity, may provide good predictions and fair understanding. We report a development of ridge regression for the underdetermined case by using generalized cross-validation to choose the ridge constant and perform F-tests for additional information. Conventional regression diagnostics can be used in follow-up to identify nonlinearities and other departures from the model. We illustrate the approach with QSAR models of four data sets using calculated molecular descriptors.
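As a rough illustration of the idea in the abstract, the sketch below fits ridge regression to a data matrix with more features than compounds and picks the ridge constant by minimizing a generalized cross-validation (GCV) score over a grid. It is not the authors' implementation: the function name `ridge_gcv`, the grid `ks`, the synthetic data, and the specific GCV form n * RSS / (n - edf)^2 (with one degree of freedom counted for the intercept) are all assumptions made for this example, and the F-tests mentioned in the abstract are not covered.

```python
import numpy as np

def ridge_gcv(X, y, ks):
    """Ridge regression with the ridge constant chosen by GCV (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Center so the intercept is handled separately and left unpenalized.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Thin SVD works whether or not there are more features than compounds.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > s.max() * 1e-10          # drop numerically zero directions
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    Uty = U.T @ yc

    scores = []
    for k in ks:
        shrink = s**2 / (s**2 + k)      # per-component shrinkage factors
        resid = yc - U @ (shrink * Uty)
        rss = resid @ resid
        edf = 1.0 + shrink.sum()        # trace of hat matrix, +1 for intercept
        scores.append(n * rss / (n - edf) ** 2)
    scores = np.array(scores)

    best_k = ks[int(np.argmin(scores))]
    coef = Vt.T @ (s / (s**2 + best_k) * Uty)
    intercept = y_mean - x_mean @ coef
    return best_k, coef, intercept, scores

# Synthetic stand-in for "few compounds, many features": 20 rows, 100 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = X[:, :5] @ np.array([1.0, -2.0, 0.5, 1.5, -1.0]) + rng.normal(scale=0.1, size=20)

ks = np.logspace(-3, 3, 25)             # strictly positive grid (assumed range)
best_k, coef, intercept, scores = ridge_gcv(X, y, ks)
print("ridge constant chosen by GCV:", best_k)
```

The grid is kept strictly positive because, with more features than compounds, the residual sum of squares and the residual degrees of freedom both approach zero as the ridge constant goes to zero, which leaves the GCV score undefined at that limit.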

Original language: English (US)
Pages (from-to): 663-670
Number of pages: 8
Journal: Journal of Chemical Information and Computer Sciences
Volume: 41
Issue number: 3
State: Published - 2001
