Machine Learning and Data Mining
STEFAN KRAMER
Institut für Informatik, Technische Universität München, Garching, München, Germany

CHRISTOPH HELMA
Institute for Computer Science, Universität Freiburg, Georges-Köhler-Allee, Freiburg, Germany
1 INTRODUCTION
In this chapter, we will review basic techniques from knowledge discovery in databases (KDD), data mining (DM), and machine learning (ML) that are suited for applications in predictive toxicology. We will primarily discuss methods which are capable of providing new insights and theories. Methods which work well for predictive purposes but do not return models that are easily interpretable in terms of toxicological knowledge (e.g., black-box approaches) will not be discussed here, but are discussed elsewhere in this book.

Also not included in this chapter, yet important, are visualization techniques, which are valuable for giving first clues about regularities or errors in the data.

The chapter will feature data analysis techniques originating from a variety of fields, such as artificial intelligence, databases, and statistics. From artificial intelligence, we know about the structure of search spaces for patterns and models, and how to search them efficiently. The database literature is a valuable source of information about efficient storage of and access to large volumes of data, provides abstractions of data management, and has contributed the concept of query languages to data mining. Statistics is of utmost importance to data mining and machine learning, since it provides answers to many important questions arising in data analysis. For instance, it is necessary to avoid flukes, that is, patterns or models that are due to chance and do not reflect structure inherent in the data. Also, the issue of prior knowledge has been studied to some extent in the statistical literature.
One of the most important lessons in data analysis is that one cannot be too cautious with respect to the conclusions to be drawn from the data. It is never a good idea to rely too much on automatic tools without checking the results for plausibility. Data analysis tools should never be applied naively; the prime directive is "know your data." Therefore, sanity checks, (statistical) quality control, configuration management, and versioning are a necessity. One should always be aware of the possible threats to validity.
Regarding the terminology in this chapter, we will talk about instances, cases, examples, and observations interchangeably. The same holds for attributes/features/variables (e.g., properties of the molecules) throughout the chapter. If we are considering prediction, we are aiming at the prediction of one (or a few) dependent variables (or target variables, e.g., a toxicological endpoint) from the independent variables (e.g., molecular properties).
In several cases, we will refer to the computational complexity of the respective methods. The time complexity of an algorithm gives us an asymptotic upper bound on the runtime of the algorithm as a function of the size of the input problem. Thus, it gives us the worst-case behavior of algorithms. It is written in the O() ("big O") notation, which in effect suppresses constants. If the input size of a dataset is measured in terms of the number of instances n, then O(n) means that the computation scales linearly with n. (Note that we are also interested in the scalability in the number of features, m.) Sometimes, we will refer to the space complexity, which makes statements about the worst-case memory usage of algorithms. Finally, we will assume basic knowledge of statistics and probability theory in the remainder of the chapter.
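To make the O() notation concrete, here is a minimal sketch (with hypothetical one-dimensional data) contrasting an O(n) pass over the instances with an O(n^2) pairwise computation, such as building a distance matrix:

```python
data = [0.5, 1.2, 3.1, 2.7]  # hypothetical instances, n = 4

# O(n): a single pass over the n instances
mean = sum(data) / len(data)

# O(n^2): all n*(n-1)/2 pairwise distances; quadratic in time,
# and also in space if the results are stored
dists = [abs(a - b) for i, a in enumerate(data) for b in data[i + 1:]]
```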
This chapter consists of four main sections: The first part is an introduction to data mining. Among other things, it introduces the terminology used in the rest of the chapter. The second part focuses on so-called descriptive data mining, the third part on predictive data mining. Each class of techniques is described in terms of the inputs and outputs of the respective algorithms, sometimes including examples thereof. We also emphasize the typical usage and the advantages of the algorithms, as well as the typical pitfalls and disadvantages. The fourth part of the chapter is devoted to references to the relevant literature, available tools, and implementations.
1.1 Data Mining (DM) and Knowledge Discovery in Databases
This section shall provide a non-technical introduction to data mining (DM). The book Data Mining by Witten and Frank (2) provides an excellent introduction to this area and is quite readable even for non-computer scientists. A recent review (3) covers DM applications in toxicology. Another recommended reading is Advances in Knowledge Discovery and Data Mining by Fayyad et al. (4).
First, we will have to clarify the meaning of DM and its relation to other terms frequently used in this area, namely knowledge discovery in databases (KDD) and machine learning (ML). Common definitions (2,5-8) are:

Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable structure in data.

Data mining (DM) is the actual data analysis step within this process. It consists of the application of statistics, machine learning, and database techniques to the dataset at hand.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. One ML task of particular interest in DM is classification; that is, to classify new unseen instances on the basis of known training instances.
This means that knowledge discovery is the process of supporting humans in their enterprise to make sense of massive amounts of data; data mining is the application of techniques to achieve this goal; and machine learning is one of the techniques suitable for this task. Other DM techniques originate from diverse fields, such as statistics, visualization, and database research. The focus in this chapter will be primarily on DM techniques based on machine learning.

In practice, many of these terms are not used in their strict sense. In this chapter, we will also sometimes use the popular term DM when we mean KDD or ML.
Table 1 shows the typical KDD process as described by Fayyad et al. (6).
Table 1 The Knowledge Discovery (KDD) Process According to Fayyad et al.
1 Definition of the goals of the KDD process
2 Creation or selection of a data set
3 Data cleaning and preprocessing
4 Data reduction and projection
5 Selecting data mining methods
6 Exploratory analysis and model/hypothesis selection
7 Data mining
8 Interpretation of the mined patterns
9 Acting on the discovered knowledge

In the following, we will sketch this process as adapted for the task of extracting structure-activity relationships (SARs) from experimental data. The steps closely resemble those of the generic process by Fayyad:
1 Definition of the goals of the SAR models (e.g., predictions for untested compounds, identification of toxicological mechanisms)
2 Creation or selection of a data set (e.g., by performing experiments, downloading data)
3 Data cleaning and preprocessing: detect errors and inconsistencies, and perform corrections
4 Selection of the features relevant for the project and transformation of the data into a format which is readable by DM programs
5 Exploratory analysis with several DM tools to see if they provide useful results
6 Application of the selected DM technique to the dataset
7 Interpretation of the resulting model and evaluation of its performance
8 Application of the model to predict the activity of untested compounds
The typical KDD setting involves several iterations over these steps. Human intervention is an essential component of the KDD process. Although most research has been focused on the DM step of the process (and the present chapter will not make an exception), the other steps are at least equally important. In practical applications, the data cleaning and preprocessing step is the most laborious and time-consuming task in the KDD process (and therefore often neglected).
In the following sections, we will introduce a few general terms that are useful for describing and choosing DM systems on a general level. First, we will discuss the structure of the data that can be used by DM programs. Then, we will have a closer look at DM as search or optimization in the space of patterns and models. Subsequently, we will distinguish between descriptive and predictive DM.
1.2 Data Representation
Before feeding the data into a DM program, we have to transform it into a computer-readable form. From a computer scientist's point of view, there are two basic data representations relevant to DM; both will be illustrated with examples. Table 2 shows a table with physico-chemical properties of chemical compounds. For every compound, there is a fixed number of parameters or features available; therefore, it is possible to represent the data in a single table. In this table, each row represents an example and each column an attribute. We call this type of representation propositional.

Let us assume we want to represent chemical structures by identifying atoms and the connections (bonds) between them. It is obvious that this type of data does not fit into a single table, because each compound may have a different number of atoms and bonds. Instead, we may write down the atoms and bonds in separate, linked tables (Fig. 1). This type of representation is called a relational representation. Other biologically relevant structures (e.g., genes, proteins) may be represented in a similar manner.
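To make the two representations concrete, the following sketch (with hypothetical compounds and values) shows the single-table propositional form next to a relational form in which atoms and bonds are kept in separate tables linked by a compound identifier:

```python
# Propositional: one row per compound, a fixed number of attributes.
# (compound, molecular weight, logP, activity)
propositional = [
    ("compound_1", 78.1, 2.1, "active"),
    ("compound_2", 94.1, 1.5, "inactive"),
]

# Relational: a compound may have any number of atoms and bonds,
# so they are stored in separate tables joined on the compound id.
atoms = [
    # (compound, atom id, element)
    ("compound_1", 1, "C"),
    ("compound_1", 2, "C"),
    ("compound_1", 3, "O"),
]
bonds = [
    # (compound, first atom, second atom, bond type)
    ("compound_1", 1, 2, "single"),
    ("compound_1", 2, 3, "double"),
]
```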
The majority of research on ML and DM has been devoted to propositional representations. However, there exists a substantial body of work on DM in relational representations. Work in this area is published under the headings of inductive logic programming and (multi-)relational data mining. As of this writing, only very few commercial products are explicitly dealing with relational representations. Available non-commercial software packages include ACE, developed at the university of Leuven, and Safarii by the Dutch company Kiminkii.

Table 2 Example of a Propositional Representation of Chemical Compounds Using Molecular Properties
One of the implications of choosing a relational representation is that the complexity of the DM task grows substantially. This means that the runtimes of relational DM algorithms are usually larger than those of their propositional relatives. For the sake of brevity, we will not discuss relational DM algorithms in the remainder of the chapter.

Figure 1 (relational representation of a chemical structure as atom and bond tables)
1.3 DM as Search for Patterns and Models
The DM step in the KDD process can be viewed as the search for structure in the given data. In the most extreme case, we are interested in the probability distribution of all variables (i.e., the full joint probability distribution of the data). Knowing the joint probability distribution, we would be able to answer all conceivable questions regarding the data. If we only want to predict one variable given the other variables, we are dealing with a classification or regression task: Classification is the prediction of one of a finite number of discrete classes (e.g., carcinogens, non-carcinogens); regression is the prediction of a continuous, real-valued target variable (e.g., a measure of toxic potency). In both cases, it suffices to model the distribution of the dependent variable given the independent variables, which requires less data than estimating the full joint probability distribution. In all of the above cases, we are looking for global regularities, that is, models of the data. However, we might just as well be satisfied with local regularities in the data. Local regularities are often called patterns in the DM literature. Frequently occurring substructures in molecules fall into this category, for instance. Other examples of patterns are dependencies among variables (functional or multivalued dependencies) as known from the database literature (9). Again, looking for patterns is an easier task than predicting a target variable or modeling the joint probability distribution.
Most ML and DM approaches, at least conceptually, perform some kind of search for patterns or models. In many cases, we can distinguish between (a) the search for the structure of the pattern/model (e.g., a subgroup or a decision tree), and (b) the search for parameters (e.g., of a linear classifier or a Bayesian network). Almost always the goal is to optimize some scoring or loss function, be it simply the absolute or relative frequency, information-theoretic measures that evaluate the information content of a model, numerical error measures such as the root mean squared error, the degree to which the information in the data can be compressed, or the like. Sometimes, we do not explicitly perform search in the space of patterns or models, but, more directly, employ optimization techniques.
Given these preliminaries, we can summarize the elements of ML and DM as follows. First, we have to fix the representation of the data and of the patterns or models. Then, we often have a partial order and a lattice over the patterns or models that allows an efficient search for patterns and models of interest. With these ingredients, data mining often boils down to search/optimization over the structure/parameters of patterns/models with respect to some scoring/loss function.
Finally, descriptive DM is the task of describing and characterizing the data in some way, e.g., by finding frequently occurring patterns in the data. In contrast, the goal of predictive DM is to make predictions for yet unseen data. Predictive DM mostly involves the search for classification or regression models (see below). Please note that clustering should be categorized as descriptive DM, although some probabilistic variants thereof could be used indirectly for predictive purposes as well.
Given complex data, one popular approach is to perform descriptive DM first (i.e., to find interesting patterns to describe the data), and to perform predictive DM as a second step (i.e., to use these patterns as descriptors in a predictive model). For instance, we might search for frequently occurring substructures in molecules and then use them as features in statistical models, as sketched below.
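As a minimal sketch of this two-step strategy (hypothetical fragments, compounds, and activity labels; a scikit-learn logistic regression stands in for "some statistical model"), previously discovered frequent substructures become binary descriptors of each compound:

```python
from sklearn.linear_model import LogisticRegression

# Step 1 (descriptive DM), assumed done: fragments found to occur
# frequently in the compound database.
fragments = ["c1ccccc1", "C(=O)O", "N=N"]  # hypothetical patterns

# Placeholder substructure test; a real system would match the fragment
# against the molecular graph with a chemistry toolkit.
def contains(compound, fragment):
    return fragment in compound

# Step 2 (predictive DM): one 0/1 feature per frequent fragment.
compounds = ["c1ccccc1C(=O)O", "N=Nc1ccccc1", "CCO"]  # hypothetical
activity = [1, 1, 0]                                  # hypothetical labels
X = [[int(contains(c, f)) for f in fragments] for c in compounds]
model = LogisticRegression().fit(X, activity)
```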
2 DESCRIPTIVE DM
2.1 Tasks in Descriptive DM
In the subsequent sections, we will discuss two popular tasks in descriptive DM. First, we will sketch clustering, the task of finding groups of instances such that the similarity within the groups is maximized and the similarity between the groups is minimized. Second, we will sketch frequent pattern discovery and its descendants, where the task is to find all patterns with a minimum number of occurrences in the data (the threshold being specified by the user).
2.2 Clustering
The task of clustering is to find groups of observations such that the intragroup similarity is maximized and the intergroup similarity is minimized. There are tons of papers and books on clustering, and it is hard to tell the advantages and disadvantages of the respective methods. Part of the problem is that the evaluation and validation of clustering results is, to some degree, subjective. Clustering is unsupervised learning in the sense that there is no target value to be predicted.
The content of this section is complementary to that of Marchal et al. (10): We focus on the advantages and disadvantages of the respective techniques and their computational complexity, and give references to recent literature. In the section on resources, several pointers to existing implementations will be given. In the following exposition, we will closely follow Witten and Frank (12).
Clustering algorithms can be categorized along several dimensions:

Categorical vs. probabilistic: Are the observations assigned to clusters categorically or with some probability?

Exclusive vs. overlapping: Does the algorithm allow for overlapping clusters, or is each instance assigned to exactly one cluster?

Hierarchical vs. flat: Are the clusters ordered hierarchically (nested), or does the algorithm return a flat list of clusters?
Practically, clustering algorithms exhibit large differences in computational complexity (the worst-case runtime behavior as a function of the problem size). Methods depending on the pairwise distances of all instances (stored in a so-called proximity matrix) are at least quadratic in time and space, meaning that these methods are not suitable for very large datasets. Other methods like k-means are better suited for such problems (see below).
As stated above, the goal of clustering is to optimize the conflicting goals of maximal homogeneity and maximal separation. However, in general, the evaluation of clustering results is not trivial. Usually, the evaluation of clusters involves inspection by human domain experts. There exist several approaches to evaluating and comparing methods:
If clustering is viewed as density estimation, then the likelihood of the data given the clusters can be estimated. This method can be used to evaluate clustering results on fresh test data.

Other measures can be applied, for instance, based on the mean (minimum, maximum) distance within a cluster and the mean (minimum, maximum) distance between instances coming from different clusters.

Take a classification task and see whether the known class structure is rediscovered by the clustering algorithm. The idea is to hide the class labels, cluster the data, and check to which degree the clusters contain instances of the same class (see the sketch after this list).

Vice versa, one can turn the discovered clusters into a classification problem: Define one class per cluster, that is, assign the same class label to all instances within each cluster, then apply a classifier to the dataset and estimate how well the classes can be separated. In this way, we can also obtain an interpretation of the clusters found.
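A minimal sketch of the class-rediscovery check, assuming scikit-learn; the adjusted Rand index is used here as one possible measure of agreement between the hidden classes and the discovered clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Hide the class labels from the clusterer, cluster, then compare.
X, true_classes = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 means the clusters reproduce the classes exactly;
# values near 0 indicate chance-level agreement.
print(adjusted_rand_score(true_classes, clusters))
```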
Example applications can be found in the area of microarray data (10). For instance, we might have the expression levels of several thousand genes for a group of patients. The user may be interested in two tasks: clustering genes and clustering patients. In the former case, we are interested in finding genes that behave similarly across all patients. In the latter case, we are looking for subgroups of patients that share a common gene expression profile.
2.2.1 Hierarchical Clustering
In hierarchical clustering, the goal is to find a hierarchy of nested clusters. Most algorithms in this category are agglomerative, that is, they work bottom-up, starting with single-instance clusters and merging the closest clusters until all data points lie within the same cluster. Obviously, one of the design decisions is how to define the distance between two clusters, as opposed to the distance between two instances. Hierarchical clustering algorithms can also work top-down, in which case they are called divisive. Divisive hierarchical clustering starts off with all instances in the same cluster. In each iteration, one cluster is selected and split according to some criterion, until all clusters contain a single instance. There are many more agglomerative than divisive clustering algorithms in the literature, and usually divisive clustering is more time-consuming.

In general, hierarchical clustering is at least quadratic in the number of instances, which makes it impractical for very large datasets. Both agglomerative and divisive methods produce a so-called dendrogram, i.e., a graph showing at which "costs" two clusters are merged or divided. Dendrograms can readily be interpreted, but have to be handled with care. It is clear that hierarchical clustering algorithms by definition detect hierarchies of clusters in the data, whether they exist or not ("to a hammer everything looks like a nail"). A frequent mistake is to apply this type of clustering uncritically and to present the clustering results as the structure inherent in the data, as opposed to the result of an algorithm.
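A minimal sketch of agglomerative clustering with SciPy, on synthetic data; the heights of the joins in the dendrogram are the "costs" at which clusters are merged:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Two synthetic groups of points in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Bottom-up (agglomerative) merging; average linkage defines the
# distance between two clusters as the mean pairwise distance.
Z = linkage(X, method="average")
dendrogram(Z)  # merge costs appear as the heights of the joins
plt.show()
```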
2.2.2 k-Means
k-Means clusters the data into k groups, where k is specified by the user. In the first step, k cluster centers are chosen (e.g., randomly). In the second step, the instances are assigned to the clusters based on their distance to the cluster centers determined in the first step. Third, the centroids of the clusters from step two are computed. These steps are repeated until convergence is reached.

The complexity of k-means (if run for a constant number of iterations; so far, no results about the convergence behavior of this old and simple algorithm are known) is O(kn), where k is the number of centroids and n is the size of the dataset. The linearity in the number of instances makes the algorithm well suited for very large datasets.
Again, a word of caution is in order: The results can vary significantly based on the initial choice of cluster centers. The algorithm is guaranteed to converge, but it converges only to a local optimum. Therefore, the standard approach is to run k-means several times with different random seeds ("random restarts"). Another disadvantage of k-means is that the number of clusters has to be specified beforehand. In the meantime, algorithms automatically choosing k, such as X-means (11), have been proposed.
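A minimal NumPy sketch of the three steps and the random-restart strategy; the within-cluster sum of squared distances serves as the score for choosing among restarts:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=None):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: pick centers
    for _ in range(n_iter):                                 # fixed number of iterations
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                       # step 2: nearest center
        centers = np.array([X[assign == j].mean(axis=0)     # step 3: new centroids
                            if np.any(assign == j) else centers[j]
                            for j in range(k)])
    score = ((X - centers[assign]) ** 2).sum()              # within-cluster scatter
    return assign, centers, score

X = np.random.default_rng(1).normal(size=(200, 2))
# Random restarts: keep the run with the lowest score.
best = min((kmeans(X, k=3, seed=s) for s in range(10)), key=lambda run: run[2])
```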
2.2.3 Probabilistic/Model-Based Clustering
Ideally, clustering boils down to density estimation, that is, estimating the joint probability distribution of all our random variables of interest. The advantage of this view is that we can compare clusterings objectively using the log-likelihood. From a probabilistic perspective, we want to find the clusters that give the best explanation of the data. Also, it is desirable that each instance is not assigned deterministically to a cluster, but only with a certain probability. One of the most prominent approaches to this task is mixture modeling, where the joint probability distribution is modeled as a weighted sum of component distributions, for instance Gaussians. In the latter case, we are speaking of Gaussian mixture models. In mixture modeling, each cluster is represented by one distribution, governing the probabilities of feature values in the corresponding cluster. Since we usually consider only a finite number of clusters, we are speaking of finite mixture models (12).

One of the most fundamental algorithms for finding finite mixture models is the EM (expectation-maximization) algorithm (13). EM can be viewed as a generalization of k-means as sketched above. EM also relies on random initializations and random restarts, and often converges in a few iterations to a local optimum. Still, probabilistic/model-based clustering can be computationally very costly and relies on reasonable assumptions about the distributions governing the data.
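A minimal sketch of Gaussian mixture modeling, assuming scikit-learn (whose GaussianMixture is fitted with EM); the mean log-likelihood of held-out data gives the objective comparison mentioned above, and predict_proba yields the soft cluster assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data from two Gaussian clusters, shuffled and split.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
rng.shuffle(X)
X_train, X_test = X[:300], X[300:]

# n_init > 1 gives EM the random restarts it needs to escape
# poor local optima.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X_train)

print(gmm.score(X_test))          # mean log-likelihood of fresh test data
print(gmm.predict_proba(X_test))  # probabilistic cluster memberships
```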
2.2.4 Other Relevant References
Another popular clustering algorithm is CLICK by Ron Shamir and colleagues, which works in two phases: the first phase is divisive, the second agglomerative (14). Implementations are available from the website of the University of Tel Aviv. A survey paper by the same author (15) compares clustering algorithms in the context of gene expression data. Another recent experimental comparison of several algorithms on gene expression data has been performed by Datta and Datta (16). Finally, fuzzy c-means (17), an algorithm based on k-means and fuzzy logic, appears to be popular in the bioinformatics literature as well.
2.3 Mining for Interesting Patterns
In this section, we introduce the task of mining for interesting patterns. Generally speaking, the task is defined as follows: We are given a language of patterns L, a database D, and an "interestingness predicate" q, which specifies which patterns in L are of interest to the user with respect to D. The task is then to find all patterns in L that satisfy q with respect to D. Alternatively, we might have a numeric measure of interestingness that is to be maximized. For space reasons, we will only focus on the former problem definition here.
In its simplest form, interesting pattern discovery boils down to frequent pattern discovery. That is, we are interested in all patterns that occur with a minimum frequency in a given database. Why is frequency an interesting property? First, if a pattern occurs only in a few cases, it might be due to chance. Second, frequent patterns may be useful when it comes to prediction. Infrequent patterns are not likely to be useful when we have to generalize over several instances in order to make a reasonable prediction.
Frequent pattern discovery can be defined for many so-called pattern domains. Pattern domains are given by the types of data in D and the types of patterns in L that are searched for in D. Obviously, L is to a large extent determined by the types of data in D. For instance, we might want to analyze databases of graphs (e.g., 2D structures of small molecules). The language L could then be defined as general subgraphs. Alternatively, we might look for free (that is, unrooted) trees or linear paths (e.g., linear fragments, see below).
The most common pattern domain is that of so-called itemsets. Consider a database of supermarket transactions consisting of items that are purchased together. Let I be the set of all possible products (so-called items). Then every purchase can be represented as a subset of I (neglecting multiplicities). Sets of items X are commonly called itemsets. Thus, the transaction database D is a multiset of itemsets. The classic DM problem then consists of finding all itemsets occurring with a minimum frequency in the database. Note that the pattern language L here consists of all possible itemsets, and the interestingness predicate q requires that the frequency of a pattern in D is at least some user-defined threshold min_freq. In other words, an itemset is of interest to the user if it occurs frequently enough in the transaction database.
Let us illustrate these concepts with an example. In the example, the database consists of the following six transactions: D = {x1x2x3x4, x1x3, x3x4, x1x3, x1x2x3, x1x4}. If we are asking for all itemsets with a minimum absolute frequency of 3 in this database, we obtain the following set of solution patterns: {x1, x3, x4, x1x3}.
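A minimal levelwise sketch in the spirit of Apriori that reproduces this example: each level counts candidate itemsets, and only the frequent ones are extended.

```python
from itertools import combinations

D = [{"x1", "x2", "x3", "x4"}, {"x1", "x3"}, {"x3", "x4"},
     {"x1", "x3"}, {"x1", "x2", "x3"}, {"x1", "x4"}]
min_freq = 3

def freq(itemset):
    # Number of transactions that contain the itemset.
    return sum(itemset <= t for t in D)

items = sorted({i for t in D for i in t})
level = [frozenset([i]) for i in items]
solutions = []
while level:
    frequent = [s for s in level if freq(s) >= min_freq]
    solutions += frequent
    # Candidate generation: unions of frequent itemsets that are
    # exactly one item larger.
    level = list({a | b for a, b in combinations(frequent, 2)
                  if len(a | b) == len(a) + 1})

print(solutions)  # the itemsets x1, x3, x4, and x1x3
```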
Algorithms for frequent pattern discovery generally suffer from the combinatorial explosion that is due to the structure of the pattern space: the worst-case complexity is mostly exponential. The practical behavior of these algorithms, however, depends very much on properties of the data (e.g., how dense the transaction database is) and, conversely, on the minimum frequency threshold specified by the user. Often, the question is how low the minimum frequency can be set before the programs run out of memory. Another problem with most algorithms for frequent pattern discovery is that they usually return far too many patterns. Therefore, the relevant patterns have to be organized in a human-readable way or ranked before they can be presented to the user. Finally, users should be aware that frequent patterns might only describe one part of the dataset and ignore the rest. In other words, all the frequent patterns might occur in the same part of the database, and the remaining part is not represented at all.

On the positive side, search strategies like levelwise search (19) are quite general, so that their implementation for new pattern domains, say, strings or XML data, is mostly straightforward, if necessary.
Examples of frequent pattern discovery are the search for frequently occurring molecular fragments in small molecules, the search for motifs in protein data, or the search for coexpressed genes in microarray experiments.
Continuing the example from above, we might have not just one database, but two, and ask for all patterns that are frequent in the first and infrequent in the second. This is the kind of query we would like to pose in differential data analysis. For instance, we might be looking for differentially expressed genes. The general idea is to build systems that are able to answer complex queries of this kind. The user poses constraints on the patterns of interest, and the systems employ intelligent search techniques to come up with solutions satisfying the constraints. The use of constraints and query languages in DM