Description and Representation
of Chemicals
WOLFGANG GUBA
F. Hoffmann-La Roche Ltd, Pharmaceuticals
Division, Basel, Switzerland
1 INTRODUCTION
Biological effects are mediated by intermolecular interactions, for instance, through the binding of a ligand to a receptor, which triggers a signaling event in a signal transduction pathway. Three-dimensional (3D) structures of receptor–ligand complexes are of great value to rationalize pharmacological or toxicological effects of small molecule ligands. However, due to experimental constraints such as purity and homogeneity of the protein, crystallizability, solubility, size of the protein–ligand complex, etc., an X-ray or NMR-based structure determination is often not feasible. In those cases, empirical models have to be developed that deduce biological effects from the 2D or 3D molecular structures of small molecule ligands only, and as a result structure–activity relationships (SAR) are formulated. These SARs may be either qualitative [i.e., molecular features (substructures, functional groups) are associated with activity] or quantitative (QSARs, defined by correlating molecular structures with biological effects via mathematical equations). QSAR models require the translation of molecular structures into numerical scales, i.e., molecular descriptors. These descriptors are used by various linear [partial least squares (PLS) (1), etc.] or non-linear [neural networks (2)] regression algorithms to predict biological effects of molecules which have not been tested or not even synthesized.
The core of empirical model building by QSAR is the similarity principle (3), which states that similar chemical structures should have similar biological activities. The converse is not true, since similar biological activities may be displayed by chemically diverse molecules (4). Within the context of QSAR, the similarity principle implies that small changes in molecular structure cause correspondingly slight variations in biological activity, which allows the interpolation of biological activity from a calibration set to a structurally related prediction set of compounds. Thus, molecular descriptors have a pivotal role in quantifying the similarity principle, and their usefulness can be ranked by the following criteria:
relevance for the biological effect to be described
interpretability
speed of calculation
The biological relevance of molecular descriptors can be easily checked by the stability and predictivity of the generated mathematical QSAR models, and speed is (at least for up to 1000–10,000 compounds in most cases) no longer a limiting factor. Only for virtual screening campaigns with >10^5 compounds does this issue need to be taken into consideration. The most critical factor is the interpretability of molecular descriptors, because a clear understanding of the correlation of molecular structures with toxicological effects is crucial for correctly associating structural features with toxic liabilities and for optimizing the biological profile of a compound.
This chapter will describe how molecules are transformed into numerical descriptors. Fragment-based and whole molecule descriptor schemes will be discussed, followed by examples for 1D, 2D, and 3D molecular descriptors. The focus will not be on reviewing algorithms for descriptor generation but rather on illustrating strategies on how to deal with homogeneous and diverse sets of molecules and on outlining the scope of commonly used descriptor schemes. For more detailed information about molecular descriptors and algorithms, the reader is referred to the references and to the encyclopedic Handbook of Molecular Descriptors (5). The quest for a universal set of descriptors which can be generally applied to structure–activity modeling is ongoing and will probably never succeed. The choice of molecular descriptors is determined by the biological phenomenon to be analyzed, and very often experience in descriptor selection is a critical success factor in QSAR model building.
2 FRAGMENT-BASED AND WHOLE
MOLECULE DESCRIPTOR SCHEMES
Molecular descriptors are usually classified in terms of the dimensionality of the molecular representation (1D, 2D, and 3D) from which they are derived. However, before selecting the dimensionality of the molecular descriptor scheme, the following question needs to be answered: Do the molecules in the dataset contain an invariant substructure, a common scaffold, with one or more substitution sites to which variable building blocks are attached, or is there no common substructure?
A QSAR analysis correlates the variation in molecular structures with the variation in biological activities. In the case of an invariant scaffold, the obvious strategy is to establish a relationship between the structural variation of the substituents (R groups), the substitution site (R1, R2, etc.), and the resulting biological effects. Since the R groups also influence the scaffold (e.g., via electronic effects), another approach would be to compare the effects of the substituent groups on the common set of scaffold atoms (Fig. 1). Hybrid approaches are also feasible, correlating both the variation of building blocks with respect to a substitution site and the modified properties of common scaffold atoms with biological activities.
If no common substructure can be identified, whole molecule descriptors have to be calculated. These heterogeneous datasets are more challenging than series with common scaffolds. It cannot be assumed a priori that each molecule in the dataset interacts with the biological target in the same way, and it is usually not a trivial task to identify those structural features which cause a biological effect. Later it will be illustrated how topological and 3D descriptor schemes attempt to tackle this problem.
Figure 1. In datasets with a common, invariant scaffold (here, biphenyl), molecular descriptors can be generated both for the variation of substituents R (marked by circles) and for the substitution sites (marked by squares).
constants and the biological effect is established. Commonly used fragment values are hydrophobic constants (π), molar refractivity (MR), and the electronic Hammett constant (σ) (7).

The hydrophobic constant π has been derived from the difference of octanol–water partition coefficients of a substituted molecule and the unsubstituted parent compound:

\[ \pi_{\text{substituent}} = \log P_{\text{substituted compound}} - \log P_{\text{parent compound}} \]

The octanol–water partition coefficient is defined by

\[ \log P = \log_{10} \frac{\text{concentration of solute in octanol phase}}{\text{concentration of solute in aqueous phase}} \]

and assumes positive values for lipophilic compounds favoring the octanol phase and negative values for polar molecules with a preference for the aqueous phase. There is a general trend between lipophilicity and toxicity of xenobiotics which is caused by an enrichment in hydrophobic body compartments (membranes, body fat, etc.) and by extensive metabolism leading to reactive species. The octanol–water partition coefficient (log P) is a highly relevant measure and it can be calculated via a battery of atom- and fragment-based methods (8).

Molar refractivity is determined from the Lorentz–Lorenz equation (9) and is a function of the refractive index (n), density (d), and molecular weight (MW):
\[ \mathrm{MR} = \frac{n^{2} - 1}{n^{2} + 2} \cdot \frac{\mathrm{MW}}{d} \]

Molar refractivity is related to the molar volume (MW/density), but the refractive index correction also accounts for polarizability. However, molar refractivity does not discriminate between substituents with different shapes (10); e.g., the MR values of butyl and tert-butyl are 19.61 and 19.62, respectively. Therefore, Verloop (11) developed the STERIMOL parameters, which are based on 3D models of fragments generated with standard bond lengths and angles. Topologically based approaches to derive shape descriptors will be described below.
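The three fragment quantities defined above can be written as a minimal Python sketch. The function names are hypothetical, and the numerical inputs (log P values of 2.69 and 2.13, and a refractive index, density, and molecular weight resembling water) are illustrative assumptions rather than data from this chapter.

```python
import math


def log_p(conc_octanol, conc_water):
    """log10 of the octanol/water concentration ratio of a solute."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


def pi_substituent(log_p_substituted, log_p_parent):
    """Hydrophobic constant: log P difference to the unsubstituted parent."""
    return log_p_substituted - log_p_parent


def molar_refractivity(refractive_index, density, mol_weight):
    """Lorentz-Lorenz molar refractivity (cm^3/mol for g/cm^3 and g/mol)."""
    n2 = refractive_index ** 2
    return (n2 - 1.0) / (n2 + 2.0) * mol_weight / density


# Illustrative inputs only (assumptions, not values from the chapter).
lipophilic = log_p(100.0, 1.0)      # solute prefers octanol: positive log P
polar = log_p(1.0, 10.0)            # solute prefers water: negative log P
pi_methyl = pi_substituent(2.69, 2.13)
mr_water = molar_refractivity(1.333, 0.997, 18.015)
```

Because the molar volume term MW/d enters linearly, two fragments with nearly identical composition and density receive nearly identical MR values regardless of their shape, which is the limitation that motivates shape-specific parameters.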
The Hammett constant is a measure of the electron-withdrawing or electron-donating properties of substituents and was originally derived from the effect of substituents on the ionization of benzoic acids (12,13). Hammett defined the parameter σ as follows, with pKa being the negative decadic logarithm of the ionization constant:

\[ \sigma = \mathrm{p}K_{a}^{\text{benzoic acid}} - \mathrm{p}K_{a}^{\text{meta-/para-substituted benzoic acid}} \]

Positive values of σ correspond to electron withdrawal by the substituent from the aromatic ring (σ_para-nitro = 0.78), whereas negative σ values indicate an electron-donating substituent (σ_para-methoxy = −0.27). Electronic effects can be categorized into field-inductive and resonance effects. Field-inductive effects consist of two components: 1) the σ- or π-bond mediated inductive effect and 2) the electrostatic field effect, which is transmitted through solvent space. Resonance effects energetically stabilize a molecule by the delocalization of π electrons or by hyperconjugation (delocalization of σ electrons in a π orbital aligned with the σ bond). Swain and Lupton (14) introduced the factoring of σ values into field and resonance effects, which were more consistently redefined by Hansch and Leo (9). A compilation of electronic substituent constants has been published for 530 substituents (15). Although these tabulated electronic substituent constants are of great value, the main drawback is the limited number of available data. Therefore, quantum-chemical calculations to derive electronic whole molecule and fragment descriptors are becoming increasingly common in QSAR/QSPR modeling (16,17).
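As a small illustration of the Hammett definition, the following Python sketch computes σ from two pKa values and reads off the sign. The function names are hypothetical, and the pKa inputs are illustrative assumptions chosen so that the results match the para-nitro and para-methoxy σ values quoted in the text; they are not measured data from this chapter.

```python
def hammett_sigma(pka_benzoic_acid, pka_substituted):
    """sigma = pKa(benzoic acid) - pKa(substituted benzoic acid)."""
    return pka_benzoic_acid - pka_substituted


def electronic_character(sigma):
    """Interpret the sign of sigma as described in the text."""
    if sigma > 0:
        return "electron-withdrawing"
    if sigma < 0:
        return "electron-donating"
    return "neutral"


# Illustrative pKa inputs (assumptions, not measured values).
PKA_BENZOIC_ACID = 4.20
sigma_acceptor = hammett_sigma(PKA_BENZOIC_ACID, 3.42)  # stronger acid
sigma_donor = hammett_sigma(PKA_BENZOIC_ACID, 4.47)     # weaker acid

print(round(sigma_acceptor, 2), electronic_character(sigma_acceptor))
print(round(sigma_donor, 2), electronic_character(sigma_donor))
```

A substituent that withdraws electron density stabilizes the benzoate anion, lowers the pKa, and therefore yields a positive σ; a donor has the opposite effect.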
Finally, indicator variables have to be mentioned as a special case of fragment descriptors. Indicator variables show whether a particular substructure or feature is present (l = 1) or absent (l = 0) at a given substitution site on a scaffold. Typical applications of indicator variables are the description of ortho effects, cis/trans isomerism, chirality, different parent skeletons, charge state, etc. (6).

The original Hansch analysis (6) correlates the above-mentioned physicochemical descriptors of the substituents of a congeneric series with a biological activity. From a more general perspective, each substitution site (R1, R2, etc.) is characterized with respect to principal physicochemical properties (steric bulk, lipophilicity, hydrogen bonding, electronics), and the presence of special structural features is encoded by indicator variables. This descriptor matrix is correlated with the ligand concentration C which causes a biological effect (e.g., EC50: concentration of an agonist that produces 50% of the maximal response). The regression coefficients are often determined by multiple linear regression (MLR). However, MLR assumes that the descriptors are uncorrelated, which is predominantly not the case. Therefore, PLS is recommended as the general statistical engine since it does not suffer from the drawbacks of MLR.
In addition, quadratic terms can be introduced to account for the frequently observed non-linear correlation of physicochemical properties, such as log P, with bioactivity. For example, let us assume that the lipophilicity within a compound series is positively correlated with a biological response. Even if an increase in lipophilicity is accompanied by an analogous rise of activity, there will be a lipophilicity optimum beyond which activity drops. Possible causes for this deviation from a linear correlation model are a decreasing solubility in aqueous body fluids or an enrichment in biological membranes or body fat, which would reduce the effective concentration of the ligand at the site of action. This can be described by the following parabolic model that defines a log P optimum beyond which activity is reduced:
\[ \log (1/C) = a_0 - a_1 (\log P)^{2} + a_2 \log P \]
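The position of the optimum follows directly from the parabolic form: setting the derivative with respect to log P to zero gives log P_opt = a2 / (2 a1) for a1 > 0. The Python sketch below evaluates the model and this optimum. It assumes that the left-hand side is an activity measure that increases with potency, and the coefficient values are hypothetical.

```python
def parabolic_activity(log_p, a0, a1, a2):
    """Activity predicted by the parabolic model a0 - a1*(log P)^2 + a2*log P."""
    return a0 - a1 * log_p ** 2 + a2 * log_p


def optimal_log_p(a1, a2):
    """log P at which the parabolic model reaches its maximum (requires a1 > 0)."""
    if a1 <= 0:
        raise ValueError("a1 must be positive for the model to have a maximum")
    return a2 / (2.0 * a1)


# Hypothetical regression coefficients, for illustration only.
A0, A1, A2 = 1.0, 0.5, 2.0
best = optimal_log_p(A1, A2)
activity_at_best = parabolic_activity(best, A0, A1, A2)
```

With these assumed coefficients the predicted activity rises up to the optimum and falls on either side of it, which mirrors the solubility and membrane-enrichment argument given in the text.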
If a stable and predictive QSAR can be derived, the analysis of the statistical model allows one to determine which substitution site or structural feature has the largest impact on activity and what physicochemical property profile is required for the optimization of activity. However, this strategy is confined to a congeneric series with a common scaffold.
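The descriptor matrix for a congeneric series discussed in this section can be sketched in Python as follows. Each row combines fragment constants for one substitution site with an indicator variable, and a Pearson correlation between two descriptor columns shows how collinearity can be checked before choosing between MLR and PLS. The substituent constants, the ortho flags, and all names are placeholders assumed for illustration; real analyses should take values from a tabulated compilation.

```python
from math import sqrt

# Placeholder substituent constants (assumed values, for illustration only).
SUBSTITUENT_CONSTANTS = {
    "H":    {"pi": 0.00, "mr": 1.0, "sigma": 0.00},
    "CH3":  {"pi": 0.56, "mr": 5.7, "sigma": -0.17},
    "Cl":   {"pi": 0.71, "mr": 6.0, "sigma": 0.23},
    "OCH3": {"pi": -0.02, "mr": 7.9, "sigma": -0.27},
    "NO2":  {"pi": -0.28, "mr": 7.4, "sigma": 0.78},
}


def descriptor_row(substituent, ortho):
    """Row of the matrix: [pi, MR, sigma, indicator for an ortho substituent]."""
    c = SUBSTITUENT_CONSTANTS[substituent]
    return [c["pi"], c["mr"], c["sigma"], 1 if ortho else 0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two descriptor columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / (sd_x * sd_y)


# Hypothetical series: (substituent, is it in the ortho position?)
compounds = [("H", False), ("CH3", False), ("Cl", True),
             ("OCH3", True), ("NO2", False)]
matrix = [descriptor_row(name, ortho) for name, ortho in compounds]
r_pi_mr = pearson([row[0] for row in matrix], [row[1] for row in matrix])
```

A correlation coefficient close to +1 or −1 between two columns would signal that the MLR assumption of uncorrelated descriptors is violated for that pair.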
3.2 Heterogeneous Dataset
In heterogeneous datasets no common molecular substructure can be defined. A QSAR model of a molecular descriptor matrix consisting of n rows (molecules) and k descriptor columns requires that the column entries of the descriptor matrix denote the same property for each molecule. For heterogeneous datasets atomic descriptors cannot be compared directly due to the different number of atoms in each molecule. In the following two sections, van der Waals surface area descriptors and autocorrelation functions will be introduced as a means to allow for QSAR/QSPR modeling of compound sets with no common core structures.
3.2.1 van der Waals Surface Area Descriptors
The Hansch concept of correlating biological effects with principal physicochemical properties of substituents has been extended to whole molecules by Paul Labute (18). In a first step, the van der Waals surface area (VSA) is calculated for each atom of a molecule from a topological connection table with a predefined set of van der Waals radii and ideal bond lengths. Thus, V_i, the contribution of atom i to the VSA of a molecule, is a conformationally independent approximate 3D property which only requires 2D connectivity information. In a second step, steric, lipophilic, and electrostatic properties are calculated for each atom by applying the algorithms of Wildman and Crippen (19) for determining log P and molar refractivity and assigning the partial charges of Gasteiger and Marsili (20). Each of the three properties is divided into a set of predefined ranges (10 bins for log P, 8 bins for MR, and 14 bins for partial charges), and for each property the atomic VSA contributions in a given property range bin are added up. Thus, the VSA descriptors correspond to a subdivision of the total molecular surface area into surface patches that are assigned to ranges of steric, lipophilic, and electrostatic properties. Summing up, each molecule is transformed into a 10 + 8 + 14 = 32 dimensional vector. Linear combinations of VSA descriptors correlate well with many widely used descriptors such as connectivity indices, physicochemical properties, atom counts, polarizability, etc. However, the interpretation of VSA-based QSAR models with respect to proposing chemical modifications for the optimization of compound properties is not straightforward.
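The binning step described above can be sketched in a few lines of Python. The sketch assumes that the per-atom surface contributions and per-atom property values have already been computed elsewhere. The bin boundaries used here are evenly spaced placeholders, not the published ranges, and all function names and atom values are hypothetical.

```python
from bisect import bisect_right


def even_edges(low, high, n_bins):
    """Return the n_bins - 1 interior edges of n_bins equal-width ranges."""
    step = (high - low) / n_bins
    return [low + step * i for i in range(1, n_bins)]


# Placeholder bin boundaries (assumptions, not the published ranges).
LOGP_EDGES = even_edges(-1.0, 1.0, 10)    # 10 bins for atomic log P
MR_EDGES = even_edges(0.0, 8.0, 8)        # 8 bins for atomic MR
CHARGE_EDGES = even_edges(-0.7, 0.7, 14)  # 14 bins for partial charges


def binned_vsa(atom_vsa, atom_property, edges):
    """Add up atomic VSA contributions per property range bin."""
    bins = [0.0] * (len(edges) + 1)
    for area, value in zip(atom_vsa, atom_property):
        bins[bisect_right(edges, value)] += area
    return bins


def vsa_vector(atom_vsa, logp, mr, charge):
    """Concatenate the three binned profiles into one 32-dimensional vector."""
    return (binned_vsa(atom_vsa, logp, LOGP_EDGES)
            + binned_vsa(atom_vsa, mr, MR_EDGES)
            + binned_vsa(atom_vsa, charge, CHARGE_EDGES))


# Hypothetical three-atom molecule: surface areas and atomic properties.
areas = [10.0, 5.0, 8.0]
vector = vsa_vector(areas,
                    logp=[-0.5, 0.1, 0.3],
                    mr=[2.5, 1.0, 3.0],
                    charge=[-0.3, 0.1, 0.05])
```

Each of the three blocks of the vector sums to the total surface area of the molecule, so every molecule is mapped to a vector of the same length regardless of its number of atoms.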
4 TOPOLOGICAL DESCRIPTORS
Topological descriptors are derived entirely from 2D structural formulas and, therefore, missing parameters, conformational flexibility, or molecular alignment do not have to be taken into account. The pros and cons of 2D vs. 3D descriptors will be briefly discussed in the following section. Whereas topological descriptors can be easily calculated from molecular graphs, the interpretation of topological indices with respect to molecular structures is often far from obvious. There is still a highly controversial debate about the utility of topological indices, which peaked in provocative statements like "connectivity parameters are artificial parameters that are worthless in real quantitative structure–activity relationships" (23). Nevertheless, the interested reader should develop his/her own opinion and, therefore, the electrotopological state (E-state) indices developed by Kier and Hall (24) will be introduced as one of the more intuitive examples of topological indices.

The general concept of the E-state indices is to characterize each atom of a molecule in terms of its potential for electronic interactions, which is influenced by the bound neighboring atoms. Kier and Hall describe the topological environment of an atom by the δ-value, which is defined as the number of adjacent atoms minus the number of bound hydrogens
4 lone pair electrons)
From the parameters δ and δ^v the term δ^v − δ is derived:

\[ \delta^{v} - \delta = \pi + n \]

Thus, the term δ^v − δ is the total count of pi and lone-pair electrons for each atom in a molecule. It provides quantitative information about the potential of an atom for intermolecular interactions and, in addition, it is correlated with electronegativity (25).

The intrinsic state I combines the information about the topological environment of an atom and the availability of electrons for intermolecular interactions. This is achieved by multiplying the electronegativity-related term δ^v − δ with the accessibility 1/δ (the more sigma bonds, δ, there are around a given atom, the less accessible it is):
\[ I = \frac{(2/N)^{2}\,\delta^{v} + 1}{\delta} \]

Finally, the influence of the molecular environment on the I-states of each atom within a molecule is determined by summing up pairwise atomic interactions. These interactions represent the perturbation of the I-state of a given atom by the differences in I-states with all the other atoms in the same molecule. Since more distant atoms exert less perturbation than neighboring atoms, these pairwise differences in I-states are scaled by the inverse squared distances between atom pairs. Thus, the electrotopological state S_i or "E-state" of an atom i is defined as

\[ S_i = I_i + \sum_{j} \Delta I_{ij}, \qquad \Delta I_{ij} = \frac{I_i - I_j}{r_{ij}^{2}} \]

Summing up, the E-state concept quantifies steric and electronic effects. Both the topological accessibility and the electronegativity of an atom as well as the field effect of the local intramolecular environment are captured by a single parameter.
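The two-stage calculation (intrinsic states first, then pairwise perturbations) can be sketched in Python for a hydrogen-suppressed molecular graph. Several points are assumptions of this sketch rather than statements of the chapter: N is taken as the principal quantum number of the valence shell, δ^v is obtained as δ + π + n from the relation above, and r_ij is taken as the number of atoms on the shortest path between i and j (bond count plus one). The ethanol example and all names are illustrative.

```python
from collections import deque


def intrinsic_state(delta, n_pi, n_lone, n_quantum=2):
    """I = ((2/N)^2 * delta_v + 1) / delta, with delta_v = delta + pi + n."""
    delta_v = delta + n_pi + n_lone
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta


def bond_counts(adjacency, start):
    """Breadth-first search: number of bonds from start to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def e_states(atoms, adjacency):
    """Return (I-states, E-states) for atoms given as (delta, pi, n, N)."""
    i_states = [intrinsic_state(*atom) for atom in atoms]
    s_states = []
    for i in range(len(atoms)):
        perturbation = 0.0
        for j, bonds in bond_counts(adjacency, i).items():
            if j != i:
                r = bonds + 1  # assumption: r counts atoms on the path
                perturbation += (i_states[i] - i_states[j]) / r ** 2
        s_states.append(i_states[i] + perturbation)
    return i_states, s_states


# Ethanol heavy atoms: CH3, CH2, OH as (delta, pi, lone-pair electrons, N).
ethanol_atoms = [(1, 0, 0, 2), (2, 0, 0, 2), (1, 0, 4, 2)]
ethanol_bonds = {0: [1], 1: [0, 2], 2: [1]}
i_vals, s_vals = e_states(ethanol_atoms, ethanol_bonds)
```

Because every pairwise term appears once with each sign, the perturbations cancel over the molecule and the sum of the E-states equals the sum of the intrinsic states. In this example the oxygen gains electronic character at the expense of the two carbon atoms.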
Although both I-states and E-states cannot be translated back to molecular structures directly, they can, nevertheless, be interpreted in terms of electronic and topological features. For instance, I-states for sp3 carbon atoms decrease from primary to quaternary carbon atoms (2.000–1.250), which reflects the reduced steric accessibility. The sp3 hybridized terminal groups –F, –OH, –NH2, and –CH3 have decreasing I-states of 8.0, 6.0, 4.0, and 2.0, respectively, which correlates with the corresponding electronegativities. In general, it is found that E-states > 8 indicate a strong electrostatic and H-bonding effect, values between 3 and 8 represent weak H-bonding and dipolar forces, E-states in the range from 1 to 3 are associated with van der Waals and hydrophobic interactions, and values below 1 are typical of low electronegativity and topologically buried atoms (24).

Recently, the E-state concept has also been extended to hydrogen atoms (26). Since hydrogen atoms always occupy a terminal position, topology has no influence on the E-state of a hydrogen atom. The H-bond donor strength is directly proportional to the electronegativity of the attached heavy atom and, therefore, the hydrogen E-state is entirely based on the Kier–Hall relative electronegativities (KHE) (25):

\[ \mathrm{KHE} = \frac{\delta^{v} - \delta}{N^{2}} \]

with a predefined KHE of −0.2 for hydrogen atoms. Small numerical E-state values for polar hydrogen atoms indicate a low electron density on the hydrogen atom and, therefore, a high polarity of the hydrogen bond.
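As a brief illustration of the relative electronegativity scale, the following Python sketch evaluates the commonly cited Kier–Hall form KHE = (δ^v − δ)/N², which is assumed here, for the terminal groups mentioned earlier. The function name and the choice of N as the principal quantum number of the valence shell are assumptions of this sketch.

```python
def kier_hall_electronegativity(delta, delta_v, n_quantum):
    """Assumed Kier-Hall form: (delta_v - delta) / N^2."""
    return (delta_v - delta) / n_quantum ** 2


# Terminal sp3 groups as (delta, delta_v, N); hydrogens are suppressed.
TERMINAL_GROUPS = {
    "-F": (1, 7, 2),
    "-OH": (1, 5, 2),
    "-NH2": (1, 3, 2),
    "-CH3": (1, 1, 2),
}

khe = {name: kier_hall_electronegativity(*params)
       for name, params in TERMINAL_GROUPS.items()}
ranking = sorted(khe, key=khe.get, reverse=True)
```

Under these assumptions the ranking follows the familiar order of electronegativity (F > O > N > C), with sp3 carbon at zero as the reference point of the scale.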
After having discussed the definition and the interpretation of the E-state indices, the focus will now shift to setting up the descriptor matrices. For congeneric series with a common scaffold, individual E-state (atom-level) values are calculated for topologically equivalent scaffold atoms. Thus, the impact of the substituents on the electron accessibility of the scaffold atoms is correlated with biological effects, which allows one to identify those atoms in a molecule that are most important for activity. For heterogeneous datasets, an atom-type classification scheme is applied where each atom is