The aqueous solubility (log S) of xenobiotic chemicals has been identified as a key characteristic in determining their bioaccessibility/bioavailability and their fate and transport in aquatic environments. We here explore and evaluate the use of a state-of-the-art data analysis technique (Project to Latent Structures, PLS) to estimate log S of environmentally relevant chemicals. A large number (n=624) of molecular descriptors was computed for over 1400 organic chemicals, and then refined by a feature selection technique. Candidate predictor descriptors were fitted to data by means of PLS, which was optimized by an internal leave-one-out cross-validation technique and validated by an external data set. The final (best) PLS model with only four variables (AlogP, X1. sol, Mv, and E) exhibited noteworthy stability and good predictive power. It was able to explain 91% of the data (n=1400) variance with an average absolute error of 0.5 log units through the solubilities span over 12 orders of magnitude. The newly proposed model is transparent, easily portable from one user to another, and robust enough to accurately estimate log S of a wide range of emerging contaminants.
- Aqueous solubility
- Environmental contaminants
- Environmental mobility
- Partial least-squares regression
- Water quality