Lazy Structure–Activity Relationships for Toxicity Prediction

CHRISTOPH HELMA
Institute for Computer Science, Universität Freiburg, Georges-Köhler-Allee, Freiburg, Germany
1 INTRODUCTION
The development of the lazar system was more or less a byproduct of editing this book. The intention was to demonstrate how to use some of the concepts from previous chapters as building blocks for the creation of a simple predictive toxicology system that can serve as a reference for further developments. Most of the machine learning and data mining techniques have been described in an earlier chapter (1). Some of the basic ideas (utilizing linear fragments for predictions) were developed already in the 1980s by Klopman and are implemented in the MC4PC system [a description of MC4PC is given elsewhere in this book (2)]. Please note that lazar is not a reimplementation of an existing system, because it uses its own distinct algorithms to generate descriptors and to achieve the predictions.

lazar stands for Lazy Structure–Activity Relationships.
Figure 1 Example learning set for lazar. Compound structures are written in SMILES notation (17); "1" indicates active compounds, "0" indicates inactivity.
Initially in this chapter, I review some observations from the application of machine learning and data mining techniques to non-congeneric data sets (compounds that do not have a common core structure) that led to the basic lazar concept. After a description of the lazar algorithm, I present an example application for the prediction of Salmonella mutagenicity and draw some conclusions for further improvements from the analysis of misclassified compounds.

Figure 2 Example test set. Compound structures are written in SMILES notation (17); toxicological activities are unknown.
2 PROBLEM DEFINITION
First, a brief review of the problem setting: We have a data set with chemical structures and measured toxicological activities (Fig. 1) and a set of untested chemical structures (Fig. 2). Our task is to predict the toxicological activities of these untested compounds.
More specifically, we want to infer from the chemical structure (or some of its properties) of the test compound to its toxicological activity, assuming that the biological activity of a compound is determined by the chemical structure (or its properties).
For this purpose, we need:
A description of the chemical features that are responsible for toxicological activity, and
A model that makes predictions based on these features.
Most efforts in predictive toxicology have been devoted to the second task: the identification of predictive models based on a given set of features. For toxicological effects, we frequently face the problem that biochemical mechanisms are diverse and unknown. It is therefore hard to guess (and select) the chemical features that are relevant for a particular effect, and we risk making one of the following mistakes:
Omission of important features that determine activity/inactivity by selecting too few features

a Classifications (e.g., active/inactive) or numerical values (e.g., LC50s); for the sake of simplicity, we will cover only the classification case in this chapter.
Deterioration of the performance (accuracy and speed) by selecting too many features
Recently, several schemes for feature selection have been devised in the machine learning community (see Journal of Machine Learning Research 3, 2003, which contains a Special Issue on this topic). Recursive feature extraction (RFE) (8), for example, performs very well on a variety of problems, but at the risk of overfitting the data (9). In theory, it is possible to counteract this by performing another cross-validation step for feature selection, but this frequently leads to practical problems: long computation times and fragmented learning data sets due to nested cross-validation.
Graph theoretic descriptors
Biological properties (e.g., from screening assays)
Spectra (IR, NMR, MS, …)
c Most (if not all) of these systems are sensitive towards large numbers of irrelevant features. Based on my experience, this is also true for systems that claim to be insensitive in this respect (e.g., support vector machines).

d In predictive toxicology, Klopman (7) has used an automated feature selection process in the CASE and MULTICASE systems since the 1980s.

e Recursive feature extraction needs an internal (10-fold) cross-validation step to evaluate the results (8). If we use another 10-fold cross-validation for feature selection, we have to generate 10 × 10 models. On the other hand, the data for a single model shrink to 0.9 × 0.9 = 81% of the original size, which can be problematic for small data sets.
Models that are too general (improper consideration of compounds with a specific mode of action) (11).
Limitations of the models (e.g., substructures that are not in the training set) are often unclear; there is no indication if a structure falls beyond the scope of the model.
Sensitivity toward skewed distributions of actives/inactives in the learning set

Handling of missing values in the training set

Ambiguous parameter settings (e.g., cutoff frequencies in MolFea (12); kernel type, gamma, epsilon, and tolerance parameters in support vector machines).
My intention was to address these problems with the development of lazar.
3 THE BASIC lazar CONCEPT
lazar is, in contrast to the majority of the approaches described in this book, a lazy learning scheme. Lazy learning means that we do not generate a global model from the complete training set. Instead, we are creating small models on demand: one individual model for each test structure. This has the key advantage that the prediction models are more specific (13), because we can consider the properties of the test structure during model generation. If we want to predict, e.g., the mutagenicity of nitroaromatic compounds, information from chlorinated aliphatic structures will be of little value.
As we will see below, the selection of relevant features and relevant examples from the training set is done automatically by the system and does not require any input of chemical concepts. On a practical side, we have integrated model creation and prediction. Therefore, we need no computation time for model generation (and validation), but predictions may require more computation time than predictions from a global model (this is of course very implementation dependent).

lazar presently uses linear fragments to describe chemical structures (but it is easy to use more complex fragments, e.g., subgraphs, or to include other features like molecular properties, e.g., log P, HOMO, LUMO, etc.). Linear fragments are defined as chains of heavy atoms with connecting bonds. Branches or cycles are not considered explicitly in linear fragments. Formally, linear fragments are valid SMARTS expressions (see http://www.daylight.com for further references), which can be handled by various computational chemistry libraries.
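As an illustration of this representation, a linear fragment can be rendered as a SMARTS-style string by interleaving atom and bond symbols. The helper below is a hypothetical sketch (not part of lazar), using "-" for single and ":" for aromatic bonds:

```python
def fragment_to_smarts(atoms, bonds):
    """Render a linear fragment as a SMARTS-style string, e.g.
    atoms ["N", "C", "C", "O"] and bonds ["-", "-", "-"] -> "N-C-C-O"."""
    assert len(bonds) == len(atoms) - 1
    parts = [atoms[0]]
    for bond, atom in zip(bonds, atoms[1:]):
        parts.append(bond)   # bond symbol between consecutive heavy atoms
        parts.append(atom)
    return "".join(parts)
```

An aromatic chain such as `fragment_to_smarts(["c", "c", "n"], [":", ":"])` yields `c:c:n`, matching the notation used for fragments later in this chapter.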
For the sake of clarity, we will separately discuss the three steps that are needed to obtain a prediction for an untested structure:

Generation of the fragments of the test structure
Identification of the relevant fragments
Classification of the test structure (the prediction)

In the following sections, we will give a more detailed description of the algorithms for each of these steps.
4 DETAILED DESCRIPTION
4.1 Fragment Generation
In lazar, we are presently using a very simplified variant of the MolFea (14) algorithm to determine the fragments of a given structure. The procedure is as follows.
As a starting point, we use all elements from the periodic table of elements (including aromatic atoms). These are the candidate fragments for the first level. First, we examine which of them occur in (or match) the test structure and eliminate those that do not match. Then we check which of the remaining fragments occur in the training structures. If a candidate fragment does not occur in the training structures, we remove it from the current level and store it in the set of unknown fragments, because we cannot determine if it contributes to activity or inactivity. From the remaining candidates (i.e., those that occur in the test structure and the training structures), we generate the candidates for the next level (i.e., candidates with an additional bond and atom); this step is called refinement and will be described below. The whole procedure is repeated until the candidate pool has been depleted [i.e., all fragments of the test structure that occur also in the training set have been identified (Fig. 3)].

f But they are frequently implicitly considered: chains with more than six aromatic carbons, for example, indicate condensated aromatic rings.
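The levelwise search can be sketched as follows. This is a simplified illustration, not the lazar implementation: molecules are reduced to labeled graphs with uniform bonds, fragments to atom tuples (canonicalized against reversal), and matching is a naive path search instead of SMARTS matching.

```python
def occurs(fragment, mol):
    """Check whether the atom chain `fragment` (or its reversal) exists
    as a simple path in the molecule `mol` = (labels, adjacency)."""
    labels, adj = mol

    def dfs(node, pos, visited, frag):
        if labels[node] != frag[pos]:
            return False
        if pos == len(frag) - 1:
            return True
        return any(dfs(nb, pos + 1, visited | {nb}, frag)
                   for nb in adj[node] if nb not in visited)

    return any(dfs(n, 0, {n}, frag) for frag in (fragment, fragment[::-1])
               for n in labels)

def generate_fragments(test_mol, training_mols, elements):
    known, unknown = [], []
    level = [(e,) for e in elements]          # level-1 candidates: single atoms
    while level:
        next_level = set()
        for cand in level:
            if not occurs(cand, test_mol):    # must occur in the test structure
                continue
            if not any(occurs(cand, m) for m in training_mols):
                unknown.append(cand)          # no training data for this fragment
                continue
            known.append(cand)
            # naive refinement: extend by one atom (the efficient join-based
            # refinement is described in the next section)
            for e in elements:
                ext = cand + (e,)
                next_level.add(min(ext, ext[::-1]))  # canonical orientation
        level = sorted(next_level)
    return known, unknown
```

For a C–C–O test molecule and a C–O training molecule, the fragments C, O, and C–O are identified as known, while C–C ends up in the set of unknown fragments.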
4.2 Fragment Refinement
The naive refinement attaches a bond and an atom to each fragment of level n. This is, of course, very inefficient, because we generate (and match) way too many fragments. We want to avoid generating unnecessary fragments, because fragment matching is the time-critical step. Fortunately, we can determine in many cases if a fragment cannot match before generating the fragment. For this purpose, we can use an important property of the language of molecular fragments: the generality relationship.
Figure 3 Procedure for fragment generation.
g lazar delegates this to the OpenBabel libraries: http://openbabel.sourceforge.net
We define that one fragment g is more general than a fragment s if g is a subfragment of s (e.g., C–O is more general than N–C–C–O). This has the consequence that g matches whenever s does.

Linear fragments are symmetric, which means that two syntactically different fragments are equivalent when they are a reversal of one another (e.g., C–C–O and O–C–C denote the same fragment). Therefore we can conclude that g is more general than s also if the reversal of g is a subfragment of s.
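With fragments represented as atom tuples, the generality relationship (including the reversal equivalence) can be sketched as:

```python
def is_subchain(g, s):
    """True if tuple g occurs as a contiguous run inside tuple s."""
    return any(s[i:i + len(g)] == g for i in range(len(s) - len(g) + 1))

def more_general(g, s):
    """g is more general than s if g or its reversal is a subfragment of s
    (linear fragments are equivalent to their reversals)."""
    return is_subchain(g, s) or is_subchain(g[::-1], s)
```

Both C–O and its reversal O–C are thus more general than N–C–C–O, while C–N is not more general than C–C–O.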
We can use this generality relationship to refine fragments efficiently: As we know, all subfragments of a new candidate fragment have to match; therefore, we need to combine only the fragments of the present level (i.e., those that match on the test and training compounds) to reach the next level. The two fragments C–C and C–O, e.g., can be refined to C–C–C, C–C–O, and O–C–O. This reduces the number of candidates considerably in comparison to attaching naively a bond and an atom.
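This join-based refinement can be sketched as follows, again with fragments as atom tuples: two level-n fragments that overlap in n-1 atoms are merged into a level-(n+1) candidate. (Note that combining C–C and C–O in this way also yields C–O–C, whose subfragments C–O and O–C likewise match.)

```python
def refine(level):
    """Combine level-n fragments that overlap in n-1 atoms to build the
    level-(n+1) candidates, instead of naively attaching bonds and atoms."""
    frags = set()
    for f in level:                 # treat a fragment and its reversal alike
        frags.add(f)
        frags.add(f[::-1])
    out = set()
    for f in frags:
        for g in frags:
            if f[1:] == g[:-1]:     # n-1 atoms overlap: merge f and g
                cand = f + g[-1:]
                out.add(min(cand, cand[::-1]))  # canonical orientation
    return sorted(out)
```

Running `refine({("C", "C"), ("C", "O")})` produces the candidates C–C–C, C–C–O, and O–C–O named in the text (plus C–O–C).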
Another method to reduce the search space is to utilize the known matches of the current level to determine the potential matches of the new fragment. If we know, e.g., that C–C matches the compounds {A, B, C} and C–O matches the compounds {B, D}, we can conclude that:

C–C–C can occur in compounds {A, B, C}
C–C–O can occur in compounds {B}
O–C–O can occur in compounds {B, D}
Knowing the potential matches of a new fragment allows us: (i) to remove candidates if they have no potential matches, and (ii) to perform the time-consuming matching step only on the potential matches and not the complete data set.
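The potential matches of a refined candidate are simply the intersection of the match sets of its two overlapping subfragments. A minimal sketch, with hypothetical compound identifiers A–D as in the example above:

```python
# Known matches of the current level (hypothetical compound sets)
matches = {("C", "C"): {"A", "B", "C"},
           ("C", "O"): {"B", "D"}}

def canon(frag):
    """Canonical orientation: a fragment and its reversal are equivalent."""
    return min(frag, frag[::-1])

def potential_matches(candidate):
    """Intersect the match sets of the two overlapping subfragments."""
    left, right = candidate[:-1], candidate[1:]
    return matches[canon(left)] & matches[canon(right)]
```

For C–C–O this yields {B}, the intersection of {A, B, C} and {B, D}; candidates with an empty intersection can be discarded without any matching.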
As predictions are usually performed for a set of test compounds, fragments (especially those from the lower levels, like C, C–C, etc.) frequently have to be reevaluated on the training set. Storing the matches of the fragments in a database (that can be saved permanently) helps to prevent this reevaluation.
4.3 Identification of Relevant Fragments
After the successful identification of fragments of the test structure, we have the following information:
The set of linear fragments that occur in the test structure and in the training structures
The set of the most general fragments that occur in the test structure but not in the training structures (i.e., the shortest unknown fragments)
For each fragment, the set of training structures where the fragment matches
The activity classifications for the training structures

Let us now consider each fragment f as a hypothesis that indicates if a compound C with this fragment is active or inactive. First we have to evaluate if a fragment indicates activity or inactivity; then we have to distinguish between more or less predictive hypotheses (i.e., fragments) and select the most predictive ones.
Fortunately, it is rather straightforward to evaluate the fragments on the training set, because we know which compounds the fragment matches as well as the activity classifications for these compounds. If a fragment matches only active or inactive compounds, the decision is obvious: We will call a fragment activating if it matches only active compounds, or inactivating if it matches only inactive compounds. In real life, however, most fragments will occur in active as well as inactive compounds. It is tempting to call a fragment activating if it matches more active compounds than inactives. This is certainly true for training sets that contain an equal number of active and inactive compounds. But let us assume that only 10% (e.g., 10 of 100) of the training structures are active and we have identified a fragment that matches five active and five inactive compounds. In this case, the fragment matches 5 of 10 (50%) active compounds and 5 of 90 (5.6%) inactive compounds, and it is justified to call it an activating fragment. This consideration leads to the following definitions:

A fragment indicates activity if compounds with this fragment are more frequently active than the compounds of the complete training set.

A fragment indicates inactivity if compounds with this fragment are more frequently inactive than the compounds of the complete training set.
Furthermore, we would like to know if the observed deviation from the default distribution is due to chance or not. For this purpose, we can use the Chi-square test (with corrections for small sample sizes) to calculate the probability p(χ²) that the observed deviation is due to chance.

Figure 4 An activating fragment. f_a: number of active compounds with fragment f; f_i: number of inactive compounds with fragment f; f_all: total number of compounds with fragment f (f_all = f_a + f_i); n_a: number of active compounds in the training set; n_i: number of inactive compounds in the training set; n_all: total number of compounds in the training set (n_all = n_a + n_i).
h Which is equivalent to the leverage in association rule mining (15).
To obtain the total probability that a fragment indicates (in)activity, we multiply both contributions:

p_f = p_a · p(χ²)
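The χ² contribution can be sketched as follows, assuming (as one plausible reading) that p(χ²) here denotes the probability that the deviation is not due to chance, computed from a 2×2 contingency table with Yates' correction and one degree of freedom. This is an illustration, not the exact lazar code; multiplying its result by p_a gives p_f.

```python
import math

def chi2_significance(f_a, f_i, n_a, n_i):
    """Significance (one minus the chance probability) of the deviation of
    the fragment's active/inactive counts (f_a, f_i) from the training-set
    totals (n_a, n_i), via a 2x2 chi-square test with Yates' correction."""
    # rows: compounds with / without the fragment; columns: active / inactive
    table = [[f_a, f_i], [n_a - f_a, n_i - f_i]]
    rows = [sum(r) for r in table]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    total = float(rows[0] + rows[1])
    chi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = rows[i] * cols[j] / total
            chi2 += max(0.0, abs(table[i][j] - expected) - 0.5) ** 2 / expected
    # survival function of the chi-square distribution with 1 degree of freedom
    return 1.0 - math.erfc(math.sqrt(chi2 / 2.0))
```

For the example above (5 active and 5 inactive matches, 10 actives and 90 inactives in the training set), the deviation is highly significant; a fragment matching actives and inactives in the default 10:90 proportion is not.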
4.4 Redundancy of Fragments
Figure 5 lists a set of fragments that have been generated for a particular compound. It is obvious that the chemical meaning of most of the fragments is almost identical. The first six fragments point towards two aromatic rings that are connected by a single bond (probably a biphenyl). Using the same information six times would lead to an overestimation of the effect of the biphenyl structure; therefore we use only the most predictive fragment (i.e., the fragment with the highest p_f). More formally, we define redundancy as follows.

Figure 5 A set of redundant fragments for a particular compound. The most predictive, non-redundant fragments are marked by an arrow.
Two fragments f1 and f2 are redundant if the matches of f1 coincide with the matches of f2 on the training set. From each set of redundant fragments we therefore use only the most predictive one (e.g., c:c:c:n:n in Fig. 5) for our predictions.
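Under this definition, the redundancy filter amounts to grouping fragments by their (identical) match sets and keeping only the fragment with the highest p_f per group. A minimal sketch with hypothetical fragment names and p_f values:

```python
def filter_redundant(fragments):
    """fragments: dict mapping fragment -> (match_set, p_f).
    Keep only the most predictive fragment per identical match set."""
    best = {}
    for frag, (match_set, p_f) in fragments.items():
        key = frozenset(match_set)          # redundant fragments share this key
        if key not in best or p_f > best[key][1]:
            best[key] = (frag, p_f)
    return sorted(frag for frag, _ in best.values())
```

Two fragments matching exactly the same training compounds collapse into one entry; fragments with distinct match sets all survive.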
4.5 Classification
With this set of most predictive, non-redundant fragments, it is possible to classify the test structure. As each fragment is a hypothesis that indicates activity or inactivity with a certain probability, we can use all matching fragments to classify the test structure. For this purpose, we sum the p_f values of the activating and of the inactivating fragments separately and predict the activity with the larger sum. We can calculate the confidence in our prediction (the probability that the prediction "active" is true) from the ratio of these sums.
In addition, we will output the set of unknown fragments. Figure 6 demonstrates an example lazar prediction.

This can be changed if we know that the test compounds have a different distribution between actives and inactives than the training set.
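The voting step can be sketched as follows. The confidence definition (a normalized sum) is one natural choice and not necessarily the exact lazar formula; the p_f values in the usage example are hypothetical.

```python
def classify(fragments):
    """fragments: list of (indication, p_f) pairs, with indication being
    "activating" or "inactivating". Returns (prediction, confidence)."""
    s_act = sum(p for kind, p in fragments if kind == "activating")
    s_inact = sum(p for kind, p in fragments if kind == "inactivating")
    prediction = "active" if s_act > s_inact else "inactive"
    # confidence: share of the winning class in the total evidence
    confidence = max(s_act, s_inact) / (s_act + s_inact)
    return prediction, confidence
```

For instance, two activating fragments with p_f 0.9 and 0.6 against one inactivating fragment with p_f 0.5 yield the prediction "active" with confidence 1.5 / 2.0 = 0.75.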