TORBJÖRN LUNDSTEDT, Acurepharma AB, Uppsala, Sweden and BMC, Uppsala, Sweden
OVERVIEW
approaches useful for QSAR analysis in predictive toxicology. The methods discussed and exemplified are: multiple linear regression (MLR), principal component analysis (PCA), principal component regression (PCR), and partial least squares projections to latent structures (PLS). Two QSAR data sets, drawn from the fields of environmental toxicology and drug design, are worked out in detail, showing the benefits of these methods. PCA is useful when overviewing a data set and
exploring relationships among compounds and relationships among variables. MLR, PCR, and PLS are used for establishing the QSARs. Additionally, the concept of statistical molecular design is considered, which is an essential ingredient for selecting an informative training set of compounds for QSAR calibration.
1 INTRODUCTION
Much of today’s activities in medicinal chemistry, molecular biology, predictive toxicology, and drug design are centered on properties such as toxicity, solubility, acidity, enzyme binding, and membrane penetration. For almost any series of compounds, dependencies between chemistry and biology are usually very complex, particularly when addressing in vivo biological data. To investigate, understand, and use such relationships, we need a sound description (‘‘characterization’’) of the variation in chemical structure of relevant molecules and biological targets, reliable biological and pharmacological data, and possibilities of fabricating new compounds deemed to be of interest. In addition, we need good mathematical tools to establish and express the relationships, as well as informationally optimal strategies to select compounds for closer scrutiny, so that the resulting model is indeed informative and relevant for the stated purposes.
Mathematical analysis of the relationships between chemical structure and biological properties of compounds is often called quantitative structure–activity relationship (QSAR) modeling (1,2). Thus, QSARs link biological properties of a chemical to its molecular structure. Consequently, a hypothesis can often be proposed to identify which physical, chemical, or structural (conformational) features are crucial for the biological response(s) elicited. In this chapter, we will discuss two aspects of the QSAR problem, two parts which are intimately linked. The first deals with how to select informative and relevant compounds to make the model as good as possible (Sec. 2). The second involves methods to capture the structure–activity relationships (Sec. 3).
2 CHARACTERIZATION AND SELECTION OF COMPOUNDS: STATISTICAL MOLECULAR DESIGN
2.1 Characterization
A key issue in QSAR is the characterization of the compounds investigated, both concerning chemical and biological properties. This description of chemical and biological features may well be done multivariately, i.e., by using a wide set of chemical descriptors and biological responses (3). The use of multivariate chemical and biological data is becoming increasingly widespread in QSAR, both regarding drug design and environmental sciences. A multitude of chemical descriptors will stabilize the description of the chemical properties of the compounds, facilitate the detection of groups (classes) of compounds, and unravel chemical outliers. A multivariate description of the biological properties is highly recommended as well. This leads to statistically beneficial properties of the QSAR and improved possibilities of exploring the biological similarity of the studied substances. The absence of outliers in multivariate biological data is a very valuable indication of homogeneity of the biological response profiles among the compounds.

This rapidly developing emphasis on the use of many X-descriptors and Y-responses stands in some contrast to the traditional way of QSAR-conduct, where single parameters are usually used to account for chemical properties, parameters that are often derived from measurements in chemical model systems (1). However, with the advancement of computers, quantum chemical theories, and dedicated QSAR software, it is becoming increasingly common to be confronted with a wide set of molecular descriptors of different kinds (4). An advantage of theoretical descriptors is that they are calculable for not yet synthesized chemicals.
Descriptors that are found useful in QSAR often mirror fundamental physico-chemical factors that in some way relate to the biological endpoint(s) under study. Examples of such molecular properties are hydrophobicity, steric and electronic properties, molecular weight, pKa, etc. These descriptors provide valuable insight into plausible mechanistic properties. It is also desirable for the chemical description to be reversible, so that the model interpretation leads forward to an understanding of how to modify chemical structure to possibly influence biological activity. (A deeper account of tools and descriptors used for representation of chemicals is provided elsewhere in this text.)

Furthermore, knowledge about the biological data is essential in QSAR. To quote Cronin and Schultz (5): ‘‘Reliable data are required to build reliable predictive models. In terms of biological activities, such data should ideally be measured by a single protocol, ideally even the same laboratory and by the same workers. High quality biological data will have lower experimental error associated with them. Biological data should ideally be from well standardized assays, with a clear and unambiguous endpoint.’’ This article also discusses in depth the importance of appreciation of biological data quality, and that it is important to know the uncertainty with which the biological data were measured. (Issues related to representation of biological data are discussed elsewhere in this book.)
2.2 Selection of Representative Compounds
A second key issue in QSAR concerns the selection of molecules on which the QSAR model is to be based. This phase may perhaps also involve consideration of a second subset of compounds, which is used for validation purposes. Unfortunately, the selection of relevant compounds is an often overlooked issue in QSAR. Without the use of a formal selection strategy, the result is often a poor and unbalanced coverage of the structural space (Fig. 1, bottom). In contrast, statistical molecular design (SMD) (1–4) is an efficient tool resulting in the selection of a diverse set of compounds (Fig. 1, top). One of the early proponents of SMD was Austel (6), who introduced formal design in the QSAR arena.
The basic idea in SMD is to first describe thoroughly the available compounds using several chemical and structural descriptor variables. These variables may be measurable in chemical model systems, calculable using, e.g., quantum-chemical orbital theory, or simply based on atom- and/or fragment counts.
The collected chemical descriptors make up the matrix X. Principal component analysis (PCA) is then used to condense the information of the original variables into a set of ‘‘new’’ variables, the principal component scores (1–4). These score vectors are linear combinations of the original variables, and reflect the major chemical properties of the compounds. Because the scores are independent (orthogonal) of one another, they are often used in statistical experimental design protocols. This process
Figure 1 (Top) A set of nine compounds uniformly distributed in the structural space (S-space) of a series of compounds. The axes correspond to appropriate structural features, e.g., lipophilicity, size, polarizability, chemical reactivity, etc. The information content of the selected set of compounds is closely linked to how well the set is spread in the given S-space. In the given example, the selected compounds represent a good coverage of the S-space. (Bottom) The same number of compounds but distributed in an uninformative manner. The information provided by this set of compounds corresponds approximately to the information obtained from two compounds, the remote one plus one drawn from the eight-membered main cluster.
is called SMD. Design protocols commonly used in SMD are drawn from the factorial and D-optimal design families (1–4).
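As a toy illustration of this workflow, the sketch below condenses a hypothetical descriptor matrix by PCA and then spreads the selected training set in score space with a greedy max-min rule. Both the data and the max-min rule are invented stand-ins for illustration; they are not the factorial or D-optimal protocols cited above.

```python
import numpy as np

def pca_scores(X, n_components):
    """Condense the descriptor matrix X into principal component scores."""
    Xc = X - X.mean(axis=0)                      # mean-center the descriptors
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def maxmin_select(scores, n_select):
    """Greedy max-min selection: repeatedly add the compound farthest
    from all compounds chosen so far, giving a spread-out training set."""
    # start from the compound farthest from the centroid
    start = int(np.argmax(np.linalg.norm(scores - scores.mean(axis=0), axis=1)))
    chosen = [start]
    while len(chosen) < n_select:
        dist = np.min(
            [np.linalg.norm(scores - scores[i], axis=1) for i in chosen], axis=0
        )
        chosen.append(int(np.argmax(dist)))      # farthest remaining compound
    return chosen

# Hypothetical pool of 30 candidate compounds described by 6 descriptors
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
subset = maxmin_select(pca_scores(X, 2), n_select=8)
```

A formal D-optimal search would instead maximize the determinant of the information matrix over candidate subsets; the greedy rule above merely conveys the idea of spreading the training set in S-space.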
3 DATA ANALYTICAL TECHNIQUES
In this section, we will be concerned with four regression- and projection-based methods, which are frequently used in QSAR. The first method we discuss is multiple linear regression (MLR), which is a workhorse used extensively in QSAR (7). Next, we introduce three projection-based approaches, the methods of PCA, principal component regression (PCR), and projections to latent structures (PLS). These methods are particularly apt at handling the situation when the number of variables equals or exceeds the number of compounds (1–4). This is because projections to latent variables in multivariate space remain well-behaved even when many, strongly correlated, variables are involved (3).

Geometrically, PCA, PCR, PLS, and similar methods can be seen as the projection of the observation points (compounds) in variable-space down onto an A-dimensional hyper-plane. The positions of the observation points on this hyper-plane are given by the scores, and the orientation of the plane in relation to the original variables is indicated by the loadings.
3.1 Multiple Linear Regression (MLR)
The method of MLR represents the classical approach to statistical analysis in QSAR (7,8). Multiple linear regression is usually used to fit the regression model (1), y = Xb + e, which models a single response variable, y, as a linear combination of the X-variables, with the coefficients b. The deviations between the data (y) and the model (Xb) are called residuals, and are denoted by e.

Multiple linear regression assumes the predictor variables, normally called X, to be mathematically independent (‘‘orthogonal’’). Mathematical independence means that the rank of X is K (i.e., equals the number of X-variables). Hence, MLR does not work well with correlated descriptors. One practical work-around is long and lean data matrices, i.e., matrices where the number of compounds substantially exceeds the number of chemical descriptors, in which case inter-relatedness among variables usually drops. It has been suggested to preserve the ratio of compounds to variables above five (9). We note that one way to introduce orthogonality or near-orthogonality among the descriptors is to select compounds according to a statistical molecular design (Sec. 2).

For many response variables (columns in the response matrix Y), regression normally forms one model for each of the M Y-variables, i.e., M separate models. Another key feature of MLR is that it exhausts the X-matrix, i.e., uses all (100%) of its variance (i.e., there will be no X-matrix error term in the regression model). Hence, it is assumed that the X-variables are exact and completely (100%) relevant for the modelling of Y.
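A minimal numerical sketch of the model y = Xb + e, using simulated (hypothetical) data in the ‘‘long and lean’’ shape recommended above, i.e., far more compounds than descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)

# 40 compounds x 3 descriptors: well above the suggested 5:1 ratio
X = rng.normal(size=(40, 3))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=40)   # y = Xb + e

# Ordinary least squares estimate of b (X assumed to have full column rank)
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_hat                            # fitted residuals
```

For brevity the sketch omits mean-centering and an intercept term; with correlated columns in X, the estimate b_hat becomes unstable, which is the motivation for the projection methods discussed next.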
3.2 Principal Component Analysis (PCA)
Principal component analysis forms the basis for multivariate data analysis (10–13). This is an exploratory and summarizing technique. The starting point for PCA is a matrix of data with N rows (observations) and K columns (variables), here denoted by X. In QSAR, the observations are the compounds and the variables are the descriptors used to characterize them.

PCA goes back to Cauchy, but was first formulated in statistics by Pearson, who described the analysis as finding ‘‘lines and planes of closest fit to systems of points in space’’ (10). The most important use of PCA is indeed to represent a multivariate data table as a low-dimensional plane, usually consisting of 2–5 dimensions, such that an overview of the data is obtained. This overview may reveal groups of observations (in QSAR: compounds), trends, and outliers. It also uncovers the relationships between the observations and the variables, and among the variables themselves.
Statistically, PCA finds lines, planes, and hyper-planes
in the K-dimensional space that approximate the data as well
Figure 3 Two PCs form a plane. This plane is a window into the multidimensional space, which can be visualized graphically. Each observation may be projected onto this plane, giving a score for each. The scores give the location of the points on the plane. The loadings give the orientation of the plane. (From Ref. 3.)
Figure 2 Notation used in PCA. The observations (rows) can be analytical samples, chemical compounds or reactions, process time points of a continuous process, batches from a batch process, biological individuals, trials of a DOE-protocol, and so on. The variables (columns) might be of spectral origin, of chromatographic origin, or be measurements from sensors and instruments in a process. (From Ref. 3.)
as possible in the least squares sense. It is easy to see that a line or a plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible (Fig. 4).
By using PCA, a data table X is modeled as

X = 1x̄′ + TP′ + E

Here, the first term, 1x̄′, represents the variable averages and originates from the preprocessing step. The second term, the matrix product TP′, models the systematic structure, and the third term, the residual matrix E, contains the deviations between the model and the data.

The principal component scores of the first, second, third, . . . , components (t1, t2, t3, . . .) are columns of the score matrix T. These scores are the coordinates of the observations in the
Figure 4 Principal component analysis derives a model that fits the data as well as possible in the least squares sense. Alternatively, PCA may be understood as maximizing the variance of the projection coordinates. (From Ref. 3.)
model (hyper-)plane. Alternatively, these scores may be seen as new variables which summarize the old ones. In their derivation, the scores are sorted in descending importance.
The meaning of the scores is given by the loadings, which make up the loading matrix P. Note that in Fig. 5, a prime has been used with P to denote its transpose.
3.3 Principal Component Regression (PCR)
Principal component regression can be understood as a hyphenation of PCA and MLR. In the first step, PCA is applied to the original set of descriptor variables. In the second step, the resulting score vectors are used as input in the MLR model to estimate Eq. (1). Thus, PCR uses PCA as a means to summarize the original X-variables as orthogonal score vectors and hence the collinearity problem is circumvented. However, as pointed out by Jolliffe (13) and others, there is a risk that numerically small structures in the X-data which explain Y may disappear in the PC-modeling of X. This will then give bad predictions of Y from the X-score vectors (T). Hence, to begin with, a subset selection among the score vectors might be necessary prior to the MLR-step.

Figure 5 A matrix representation of how a data table X is modeled by PCA. (From Ref. 3.)
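The two-step procedure can be sketched as follows. The code below runs on synthetic, deliberately collinear descriptors, and for simplicity keeps all retained score vectors, without the subset selection just mentioned.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: summarize X by PCA scores,
    then regress y on the orthogonal scores by least squares."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]    # orthogonal score vectors
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)    # MLR step on the scores
    b = Vt[:n_components].T @ q                   # back to descriptor space
    return x_mean, y_mean, b

def pcr_predict(model, Xnew):
    x_mean, y_mean, b = model
    return y_mean + (Xnew - x_mean) @ b

# Synthetic example: the third descriptor nearly duplicates the first,
# so plain MLR on all three columns would be ill-conditioned
rng = np.random.default_rng(3)
z = rng.normal(size=(30, 2))
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + 1e-6 * rng.normal(size=30)])
y = 2 * z[:, 0] - z[:, 1] + 0.05 * rng.normal(size=30)
model = pcr_fit(X, y, n_components=2)
y_hat = pcr_predict(model, X)
```

Because the regression is done on the orthogonal scores, the near-duplicate descriptor causes no instability; two components suffice here since the data effectively span a two-dimensional subspace.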
3.4 Partial Least Squares Projections to Latent Structures (PLS)
The PLS method (1,4,10–12) has properties that alleviate some of the difficulties noted for MLR and PCR. The matrix T, a projection of X, is calculated to fulfill two objectives, i.e., (i) to well approximate X and Y, and (ii) to maximize the squared covariance between T and Y (14). Hence, as opposed to MLR, PLS can handle correlated variables, which are noisy and possibly also incomplete (i.e., containing missing data elements). The PLS method has the additional advantage to handle also the case with several Y-variables. This is accomplished by using a separate model for the Y-data and computing a projection U (see below), which is modeled and predicted well by T, and which is a good description of Y. The PLS regression method estimates the relationship between X and Y by making the bilinear projections

X = 1x̄′ + TP′ + E
Y = 1ȳ′ + UC′ + F

and connecting X and Y through the inner relation U = T + H, where H contains the inner-relation residuals (cf. Fig. 6). The X-weights, W, are useful for interpreting which X-variables are influential for modelling the Y-variables. Finally, A is the number of PLS components, usually estimated by cross-validation (see below).
One way to understand PLS is that it simultaneously projects the X- and Y-variables onto the same subspace, T, in such a way that there is a good relationship between the predictor and response data. Technically, PLS forms ‘‘new’’ X-variables, t, as linear combinations of the old ones, and subsequently uses these new ts as predictors of Y. Only as many new ts are formed as are found significant by cross-validation.
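One classical way to compute the T, U, P, W, and C matrices of Fig. 6 is the NIPALS algorithm. The sketch below is a bare-bones illustration for already mean-centered X and Y on synthetic data, not a substitute for a validated chemometrics package.

```python
import numpy as np

def pls_nipals(X, Y, n_components):
    """Bare-bones NIPALS PLS for centered X (N x K) and Y (N x M).
    Returns scores T, X-loadings P, X-weights W, and Y-weights C."""
    X, Y = X.astype(float).copy(), Y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(n_components):
        u = Y[:, [int(np.argmax(Y.var(axis=0)))]]   # start from a Y column
        for _ in range(500):
            w = X.T @ u / (u.T @ u)                 # X-weights
            w /= np.linalg.norm(w)
            t = X @ w                               # X-scores
            c = Y.T @ t / (t.T @ t)                 # Y-weights
            u_new = Y @ c / (c.T @ c)               # Y-scores
            converged = np.linalg.norm(u_new - u) < 1e-12
            u = u_new
            if converged:
                break
        p = X.T @ t / (t.T @ t)                     # X-loadings
        X -= t @ p.T                                # deflate X
        Y -= t @ c.T                                # deflate Y (inner relation u ~ t)
        T.append(t); P.append(p); W.append(w); C.append(c)
    return tuple(np.hstack(m) for m in (T, P, W, C))

# Synthetic centered data: Y depends on the first two X-variables
rng = np.random.default_rng(4)
Xc = rng.normal(size=(20, 5))
Xc -= Xc.mean(axis=0)
Yc = Xc[:, :2] @ np.array([[1.0, 0.5], [-1.0, 2.0]]) + 0.01 * rng.normal(size=(20, 2))
Yc -= Yc.mean(axis=0)
T, P, W, C = pls_nipals(Xc, Yc, n_components=2)
```

Because X is deflated after each component, the score vectors in T come out mutually orthogonal; regression coefficients for prediction can then be assembled from W, P, and C in the usual way.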
3.5 Model Performance Indicators
The performance of a regression model, and hence also a QSAR, is often expressed as the explained variation (or goodness of fit), R2Y. It is defined as

R2Y = 1 − RSS/SSY
Figure 6 The matrix relationships in PLS. Here, 1x̄′ and 1ȳ′ represent the variable averages and originate from the preprocessing step. The PLS scores comprise T and U, the X-loadings P′, the X-weights W′, and the Y-weights C′. The variation in the data that was left out of the modeling forms the E and F residual matrices. (From Ref. 3.)
Here RSS is the sum of squares of the Y-residuals, and SSY the initial sum of squares of the mean-centered response data.
The predicted variation, Q2Y, is also becoming increasingly used in this context. The value of Q2Y is defined as

Q2Y = 1 − PRESS/SSY

Here PRESS is the sum of squares of the predictive residuals. The difference between R2Y and Q2Y is thus that the former is based on the fitted residuals, whereas the latter is based on residuals of predictions of data kept out of the model calculation.
The size of Q2Y is often estimated via cross-validation (15,16). During cross-validation some of the data points are kept out, and are then predicted by the model and compared with the measured values. This procedure is repeated until each data point has been eliminated once and only once. Then PRESS is formed and the number of PLS components resulting in the lowest PRESS-value is considered optimal.
Cross-validation is also used in the context of PCA, the difference being that PRESS and related statistics then refer to the X-data rather than to the Y-data.
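A leave-one-out version of this procedure can be sketched in a few lines. Here a plain least squares model stands in for the PLS fit, and the data are hypothetical.

```python
import numpy as np

def press_q2(X, y):
    """Leave-one-out cross-validation: each compound is kept out once and
    predicted from a model fitted to the rest; Q2 = 1 - PRESS/SSY."""
    n = len(y)
    design = lambda M: np.column_stack([np.ones(len(M)), M])  # add intercept
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(design(X[keep]), y[keep], rcond=None)
        y_pred = design(X[i:i + 1]) @ b           # predict the held-out compound
        press += float((y[i] - y_pred[0]) ** 2)
    ssy = float(np.sum((y - y.mean()) ** 2))      # mean-centered total SS
    return press, 1.0 - press / ssy

# Hypothetical data: 15 compounds, 2 descriptors, nearly linear response
rng = np.random.default_rng(5)
X = rng.normal(size=(15, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=15)
press, q2 = press_q2(X, y)
```

In the component-selection setting described above, this loop would be repeated for each candidate number of PLS components, and the model with the lowest PRESS (highest Q2) retained.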
4 RESULTS FOR THE FIRST EXAMPLE—MODELING AND PREDICTING IN VITRO TOXICITY OF SMALL HALOALKANES
4.1 Introduction and Background
In the first example, the aim is to contrast the methods introduced in Sec. 3. For this purpose, we shall deal with a series of halogenated aliphatic hydrocarbons and their in vitro genotoxicity and cytotoxicity. We call this dataset CELLTEST. The complete CELLTEST data set consists of
The objective of the study was to set up a QSAR model enabling large-scale prediction of the two endpoints for very many similar untested compounds (17,18). In order to accomplish this, SMD was used to select a diverse subset of 16 compounds. These compounds were tested for their genotoxic and cytotoxic potencies (17,18). Ten out of the 16 tested compounds were defined as the training set (17,18) and will here be used as a basis to calculate QSARs. The remaining six compounds will be used for assessing the predictive ability of the developed QSARs. Further details are found in the original publications (17,18). The identity of the compounds, and the nature of the X- and the Y-variables, is seen in the legends to Figs. 8–10.
4.2 Obtaining an Overview: PCA Modeling
The PCA-modeling of the six X-variables of the training set resulted in a three-component model. There are no signs of deviating compounds (Fig. 8). The loading plot (Fig. 9) indicates that the two HPLC retention indices (LC1 and LC2)
Figure 8 The PCA t1/t2 score plot of the model for the CELLTEST training data set. Each plot mark represents one aliphatic hydrocarbon. Open squares designate the training set; solid dots represent the prediction set (as classified into the model of the training set). The compounds are: (2) dichloromethane, (3) trichloromethane, (6) tetrachloromethane, (7) fluorotrichloromethane, (11) 1,2-dichloroethane, (12) 1-bromo-2-chloroethane, (15) 1,1,2,2-tetrachloroethane, (19) 1,2-dibromoethane, (23) 1,2,3-trichloropropane, (30) 1-bromoethane, (33) 1,1-dibromoethane, (37) bromochloromethane, (39) fluorotribromomethane, (47) 1-chloropropane, (48) 2-chloropropane, (52) 1-bromobutane.
are very correlated, as are surface area (SA) and van der Waals volume (vdW). Log P is also correlated with these four descriptors, whereas Mw partly encodes unique information. The conclusions of the PCA-model are as follows:

There are no outliers. Therefore, there is no need to delete any compound prior to regression (QSAR) modeling.
Figure 9 The PCA p1/p2 loading plot of the model for the CELLTEST data set. In this plot, one considers the distance to the plot origin. The closer to the origin a variable point lies, the less informative it is for the model. Variables that are close to each other provide similar information (= are correlated). Variable description: Mw = molecular weight; vdW = van der Waals volume; log P = logarithm of octanol/water partition coefficient; SA = accessible molecular surface area; LC1 = log HPLC retention times for Supelcosil LC-08 column; LC2 = log HPLC retention times for Nucleosil 18 column.
The first PC mainly describes lipophilicity, the second predominantly molecular weight, and the third chiefly variation in surface area and volume not accounted for in the first component.
4.3 One- and Two-Parameter QSARs for Single Y-Variables: MLR Modelling
Table 1 shows the correlation matrix of the training set. There are strong correlations among the six chemical descriptors, and the absolute values of the correlation coefficients invariably lie between 0.5 and 1.0. Clearly, as MLR cannot cope with strongly correlated descriptor variables, care must be exercised in the regression modeling. We deployed MLR, and for each of the two responses six one-parameter QSARs were calculated (see models M1–M6).
Figure 10 Overview plot of regression coefficients of model M12 for both responses. Genotox = log slope of concentration–response curve from the DNA-precipitation assay using Chinese hamster V79 cells; Cytotox = log inhibitory concentration (in mM) decreasing cell viability by 50% in the MTT-assay using human HeLa cells.
Table 1 Correlation Matrix of Example 1
Table 2 QSAR models for the CELLTEST data set. Columns: Model, Method, Parameter(s), and, for each of the Genotox and Cytotox responses, R2X, R2Y, R2Yadj, Q2Yint, and Q2Yext.
Table 2 is divided into two parts along the vertical direction. The five left-most columns with numerical entries relate to the first biological response (Genotox) and the five right-most columns to the second response (Cytotox). For each response, the following five model performance indicators are listed: R2X (showing how much of the X-variation is used to model the biological response); R2Y (the explained variation); R2Yadj (R2Y adjusted for degrees of freedom); Q2Yint, based on internal prediction (cross-validation) (15,16); and Q2Yext, based on external prediction of the six compounds in the prediction set.
As shown by Table 2, the one-parameter QSARs based either on log P, LC1, or LC2 are the successful ones from a predictive point of view. Interpretation of these three models shows (no plots provided) that an increasing value of the chemical descriptor is coupled to an increasing toxicity for both biological responses.

In this context, it might be tempting to use more than one descriptor variable to see whether predictions can be sharpened. We note, however, that due to the strong correlations among the six X-variables, there is a substantial risk that the interpretation of the model(s) might give misleading hints about structure–activity relationships.

We decided to use molecular weight as foundation for two-parameter correlations, as this is the descriptor that is least correlated with the other chemical descriptors. Models M7–M11 of Table 2 represent the five two-parameter combinations used. Again log P, LC1, and LC2 work best, but compared with models M1–M6, there is practically no gain in predictive power.
Finally, in order to illustrate how model interpretation may break down with correlated descriptors, we calculated model M12 for each response, in which the two most correlated descriptors, LC1 and LC2, were employed. Figure 10 shows a plot of the regression coefficients of the QSAR model for each of the two responses. Since LC1 and LC2 are positively correlated with a correlation coefficient of >0.99, we would expect their regression coefficients to display the same sign. However, as seen from Fig. 10, LC1 and LC2 have opposite signs in both cases. Regarding the first response variable LC2 has the wrong sign, whereas the converse is true for the second endpoint.
In conclusion, by using MLR, very good one-parameter QSARs are obtainable using either log P, LC1, or LC2, and this holds true for both responses. As shown by the last example (model M12), however, some caution is needed when using correlated descriptors. Here, the risk is that the interpretation may break down, because the coefficients do not get the right numerical size and they may even get the wrong sign (see the literature for a theoretical account of this phenomenon). Note, however, that the predictive ability is not influenced in any way. On the contrary, the predictive power of model M12 is comparable to that observed for the previous best MLR models (models M3, M5, M6, M8, M9, and M11).
4.4 Obtaining an Orthogonal Representation of the X-Descriptors and Regressing this Against Single Y-Variables: PCR Modeling
In order to circumvent the collinearity problem, the three PCA score vectors derived in Sec. 4.2 were used to replace the original descriptor variables. The three score vectors hence constitute orthogonal variables optimally summarizing the original X-variables. Their modelling ability of the six original variables is excellent.
We developed seven different PCR models for each response. Apart from the model in which all three score vectors are present (model M13), the best model from a predictive point of view is M15. An overview plot displays the regression coefficients of M15. We recall from Sec. 4.2 that the first score vector predominantly describes lipophilicity.
The modeling results of the remaining PCR models (M14 and onwards) indicate that including the score vector dominated by molecular weight among the X-variables is not favorable as far as predictive power is concerned. Thus, it appears that variation in molecular weight is not critically linked to the variation in biological activity among the studied chemicals.
4.5 Using All X- and Y-Variables in One Shot: PLS Modeling
In contrast to MLR, PCR, and the like, PLS can handle many X- and Y-variables in one single model. This is possible since PLS is based on the assumption of correlated variables, which may be noisy and incomplete (3,4,10). Hence, PLS allows correlations among the X-variables, among the Y-variables, and between the X- and the Y-variables, to be

Figure 11 Overview plot of regression coefficients of model M13 for both responses.