Machine Learning and Data Mining
STEFAN KRAMER
Institut für Informatik, Technische Universität München, Garching, München, Germany

CHRISTOPH HELMA
Institute for Computer Science, Universität Freiburg, Georges-Köhler-Allee, Freiburg, Germany
1 INTRODUCTION
In this chapter, we will review basic techniques from knowledge discovery in databases (KDD), data mining (DM), and machine learning (ML) that are suited for applications in predictive toxicology. We will primarily discuss methods which are capable of providing new insights and theories. Methods which work well for predictive purposes but do not return models that are easily interpretable in terms of toxicological knowledge (e.g., black-box approaches) will not be discussed here, but are discussed elsewhere in this book.

Also not included in this chapter, yet important, are visualization techniques, which are valuable for giving first clues about regularities or errors in the data.

The chapter will feature data analysis techniques originating from a variety of fields, such as artificial intelligence, databases, and statistics. From artificial intelligence, we know about the structure of search spaces for patterns and models, and how to search them efficiently. The database literature is a valuable source of information about efficient storage of and access to large volumes of data, provides abstractions of data management, and has contributed the concept of query languages to data mining. Statistics is of utmost importance to data mining and machine learning, since it provides answers to many important questions arising in data analysis. For instance, it is necessary to avoid flukes, that is, patterns or models that are due to chance and do not reflect structure inherent in the data. Also, the issue of prior knowledge has been studied to some extent in the statistical literature.
One of the most important lessons in data analysis is that one cannot be too cautious with respect to the conclusions to be drawn from the data. It is never a good idea to rely too much on automatic tools without checking the results for plausibility. Data analysis tools should never be applied naively; the prime directive is "know your data." Therefore, sanity checks, (statistical) quality control, configuration management, and versioning are a necessity. One should always be aware of the possible threats to validity.
Regarding the terminology in this chapter, we will talk about instances, cases, examples, and observations interchangeably. The same holds for attributes/features/variables (e.g., properties of the molecules) throughout the chapter. If we are considering prediction, we are aiming at the prediction of one (or a few) dependent variables (or target variables, e.g., a toxicological endpoint) from the independent variables (e.g., molecular properties).
In several cases, we will refer to the computational complexity of the respective methods. The time complexity of an algorithm gives us an asymptotic upper bound on the runtime of the algorithm as a function of the size of the input problem. Thus, it gives us the worst-case behavior of algorithms. It is written in the O() ("big O") notation, which in effect suppresses constants. If the input size of a dataset is measured in terms of the number of instances n, then O(n) means that the computation scales linearly with n. (Note that we are also interested in the scalability in the number of features, m.) Sometimes, we will refer to the space complexity, which makes statements about the worst-case memory usage of algorithms. Finally, we will assume basic knowledge of statistics and probability theory in the remainder of the chapter.
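To make the O() notation concrete, here is a minimal sketch (with hypothetical one-dimensional data) contrasting an O(n) pass over the instances with an O(n^2) pairwise computation, such as building a distance matrix:

```python
data = [0.5, 1.2, 3.1, 2.7]  # hypothetical instances, n = 4

# O(n): a single pass over the n instances
mean = sum(data) / len(data)

# O(n^2): all n*(n-1)/2 pairwise distances; quadratic in time,
# and also in space if the results are stored
dists = [abs(a - b) for i, a in enumerate(data) for b in data[i + 1:]]
```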
This chapter consists of four main sections: The first part is an introduction to data mining. Among other things, it introduces the terminology used in the rest of the chapter. The second part focuses on so-called descriptive data mining, the third part on predictive data mining. Each class of techniques is described in terms of the inputs and outputs of the respective algorithms, sometimes including examples thereof. We also emphasize the typical usage and the advantages of the algorithms, as well as the typical pitfalls and disadvantages. The fourth part of the chapter is devoted to references to the relevant literature, available tools, and implementations.
1.1 Data Mining (DM) and Knowledge Discovery in Databases
This section shall provide a non-technical introduction to data mining (DM). The book Data Mining by Witten and Frank (2) provides an excellent introduction to this area and is quite readable even for non-computer scientists. A recent review (3) covers DM applications in toxicology. Another recommended reading is Advances in Knowledge Discovery and Data Mining by Fayyad et al. (4).
First, we will have to clarify the meaning of DM and its relation to other terms frequently used in this area, namely knowledge discovery in databases (KDD) and machine learning (ML). Common definitions (2,5-8) are:

Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable structure in data.

Data mining (DM) is the actual data analysis step within this process. It consists of the application of statistics, machine learning, and database techniques to the dataset at hand.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. One ML task of particular interest in DM is classification; that is, to classify new unseen instances on the basis of known training instances.
This means that knowledge discovery is the process of supporting humans in their enterprise to make sense of massive amounts of data; data mining is the application of techniques to achieve this goal; and machine learning is one of the techniques suitable for this task. Other DM techniques originate from diverse fields, such as statistics, visualization, and database research. The focus in this chapter will be primarily on DM techniques based on machine learning.

In practice, many of these terms are not used in their strict sense. In this chapter, we will also sometimes use the popular term DM when we mean KDD or ML.
Table 1 shows the typical KDD process as described by Fayyad et al. (6).
Table 1 The Knowledge Discovery (KDD) Process According to Fayyad et al.
1 Definition of the goals of the KDD process
2 Creation or selection of a data set
3 Data cleaning and preprocessing
4 Data reduction and projection
5 Selecting data mining methods
6 Exploratory analysis and model/hypothesis selection
7 Data mining
8 Interpretation of the mined patterns
9 Acting on the discovered knowledge

In the following, we will sketch this process as adapted for the task of extracting structure-activity relationships (SARs) from experimental data. The steps closely resemble those of the generic process by Fayyad:
1 Definition of the goals of the SAR models (e.g., predictions for untested compounds, identification of toxicological mechanisms)
2 Creation or selection of a data set (e.g., by performing experiments, downloading data)
3 Data cleaning and preprocessing: detect errors and inconsistencies, and perform corrections
4 Selection of the features relevant for the project and transformation of the data into a format which is readable by DM programs
5 Exploratory analysis with several DM tools to see if they provide useful results
6 Application of the selected DM technique to the dataset
7 Interpretation of the resulting model and evaluation of its performance
8 Application of the model to predict the activity of untested compounds
The typical KDD setting involves several iterations over these steps. Human intervention is an essential component of the KDD process. Although most research has been focused on the DM step of the process (and the present chapter will not make an exception), the other steps are at least equally important. In practical applications, the data cleaning and preprocessing step is the most laborious and time-consuming task in the KDD process (and therefore often neglected).
In the following sections, we will introduce a few general terms that are useful for describing and choosing DM systems on a general level. First, we will discuss the structure of the data that can be used by DM programs. Then, we will have a closer look at DM as search or optimization in the space of patterns and models. Subsequently, we will distinguish between descriptive and predictive DM.
1.2 Data Representation
Before feeding the data into a DM program, we have to transform it into a computer-readable form. From a computer scientist's point of view, there are two basic data representations relevant to DM; both will be illustrated with examples. Table 2 shows a table with physico-chemical properties of chemical compounds. For every compound, there is a fixed number of parameters or features available; therefore, it is possible to represent the data in a single table. In this table, each row represents an example and each column an attribute. We call this type of representation propositional.

Let us assume we want to represent chemical structures by identifying atoms and the connections (bonds) between them. It is obvious that this type of data does not fit into a single table, because each compound may have a different number of atoms and bonds. Instead, we may write down the atoms and bonds in separate, linked tables (Fig. 1). This type of representation is called a relational representation. Other biologically relevant structures (e.g., genes, proteins) may be represented in a similar manner.
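To make the two representations concrete, the following sketch (with hypothetical compounds and values) shows the single-table propositional form next to a relational form in which atoms and bonds are kept in separate tables linked by a compound identifier:

```python
# Propositional: one row per compound, a fixed number of attributes.
# (compound, molecular weight, logP, activity)
propositional = [
    ("compound_1", 78.1, 2.1, "active"),
    ("compound_2", 94.1, 1.5, "inactive"),
]

# Relational: a compound may have any number of atoms and bonds,
# so they are stored in separate tables joined on the compound id.
atoms = [
    # (compound, atom id, element)
    ("compound_1", 1, "C"),
    ("compound_1", 2, "C"),
    ("compound_1", 3, "O"),
]
bonds = [
    # (compound, first atom, second atom, bond type)
    ("compound_1", 1, 2, "single"),
    ("compound_1", 2, 3, "double"),
]
```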
The majority of research on ML and DM has been devoted to propositional representations. However, there exists a substantial body of work on DM in relational representations. Work in this area is published under the headings of inductive logic programming and (multi-)relational data mining. As of this writing, only very few commercial products are explicitly dealing with relational representations. Available non-commercial software packages include ACE, developed at the university of Leuven, and Safarii by the Dutch company Kiminkii.

Table 2 Example of a Propositional Representation of Chemical Compounds Using Molecular Properties
One of the implications of choosing a relational representation is that the complexity of the DM task grows substantially. This means that the runtimes of relational DM algorithms are usually larger than those of their propositional relatives. For the sake of brevity, we will not discuss relational DM algorithms in the remainder of the chapter.

Figure 1 (relational representation of a chemical structure as atom and bond tables)
1.3 DM as Search for Patterns and Models
The DM step in the KDD process can be viewed as the search for structure in the given data. In the most extreme case, we are interested in the probability distribution of all variables (i.e., the full joint probability distribution of the data). Knowing the joint probability distribution, we would be able to answer all conceivable questions regarding the data. If we only want to predict one variable given the other variables, we are dealing with a classification or regression task: Classification is the prediction of one of a finite number of discrete classes (e.g., carcinogens, non-carcinogens); regression is the prediction of a continuous, real-valued target variable (e.g., a measure of toxic potency). In both cases, it suffices to model the distribution of the dependent variable given the independent variables, which requires less data than estimating the full joint probability distribution. In all of the above cases, we are looking for global regularities, that is, models of the data. However, we might just as well be satisfied with local regularities in the data. Local regularities are often called patterns in the DM literature. Frequently occurring substructures in molecules fall into this category, for instance. Other examples of patterns are dependencies among variables (functional or multivalued dependencies) as known from the database literature (9). Again, looking for patterns is an easier task than predicting a target variable or modeling the joint probability distribution.
Most ML and DM approaches, at least conceptually, perform some kind of search for patterns or models. In many cases, we can distinguish between (a) the search for the structure of the pattern/model (e.g., a subgroup or a decision tree), and (b) the search for parameters (e.g., of a linear classifier or a Bayesian network). Almost always the goal is to optimize some scoring or loss function, be it simply the absolute or relative frequency, information-theoretic measures that evaluate the information content of a model, numerical error measures such as the root mean squared error, the degree to which the information in the data can be compressed, or the like. Sometimes, we do not explicitly perform search in the space of patterns or models, but, more directly, employ optimization techniques.
Given these preliminaries, we can summarize the elements of ML and DM as follows. First, we have to fix the representation of the data and of the patterns or models. Then, we often have a partial order and a lattice over the patterns or models that allows an efficient search for patterns and models of interest. With these ingredients, data mining often boils down to search/optimization over the structure/parameters of patterns/models with respect to some scoring/loss function.
Finally, descriptive DM is the task of describing and characterizing the data in some way, e.g., by finding frequently occurring patterns in the data. In contrast, the goal of predictive DM is to make predictions for yet unseen data. Predictive DM mostly involves the search for classification or regression models (see below). Please note that clustering should be categorized as descriptive DM, although some probabilistic variants thereof could be used indirectly for predictive purposes as well.
Given complex data, one popular approach is to perform descriptive DM first (i.e., to find interesting patterns to describe the data), and to perform predictive DM as a second step (i.e., to use these patterns as descriptors in a predictive model). For instance, we might search for frequently occurring substructures in molecules and then use them as features in statistical models, as sketched below.
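As a minimal sketch of this two-step strategy (hypothetical fragments, compounds, and activity labels; a scikit-learn logistic regression stands in for "some statistical model"), previously discovered frequent substructures become binary descriptors of each compound:

```python
from sklearn.linear_model import LogisticRegression

# Step 1 (descriptive DM), assumed done: fragments found to occur
# frequently in the compound database.
fragments = ["c1ccccc1", "C(=O)O", "N=N"]  # hypothetical patterns

# Placeholder substructure test; a real system would match the fragment
# against the molecular graph with a chemistry toolkit.
def contains(compound, fragment):
    return fragment in compound

# Step 2 (predictive DM): one 0/1 feature per frequent fragment.
compounds = ["c1ccccc1C(=O)O", "N=Nc1ccccc1", "CCO"]  # hypothetical
activity = [1, 1, 0]                                  # hypothetical labels
X = [[int(contains(c, f)) for f in fragments] for c in compounds]
model = LogisticRegression().fit(X, activity)
```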
2 DESCRIPTIVE DM
2.1 Tasks in Descriptive DM
In the subsequent sections, we will discuss two popular tasks in descriptive DM. First, we will sketch clustering, the task of finding groups of instances such that the similarity within the groups is maximized and the similarity between the groups is minimized. Second, we will sketch frequent pattern discovery and its descendants, where the task is to find all patterns with a minimum number of occurrences in the data (the threshold being specified by the user).
2.2 Clustering
The task of clustering is to find groups of observations such that the intragroup similarity is maximized and the intergroup similarity is minimized. There are tons of papers and books on clustering, and it is hard to tell the advantages and disadvantages of the respective methods. Part of the problem is that the evaluation and validation of clustering results is, to some degree, subjective. Clustering is unsupervised learning in the sense that there is no target value to be predicted.
The content of this section is complementary to that of Marchal et al. (10): We focus on the advantages and disadvantages of the respective techniques and their computational complexity, and give references to recent literature. In the section on resources, several pointers to existing implementations will be given. In the following exposition, we will closely follow Witten and Frank (12).
Clustering algorithms can be categorized along several dimensions:

Categorical vs. probabilistic: Are the observations assigned to clusters categorically or with some probability?

Exclusive vs. overlapping: Does the algorithm allow for overlapping clusters, or is each instance assigned to exactly one cluster?

Hierarchical vs. flat: Are the clusters ordered hierarchically (nested), or does the algorithm return a flat list of clusters?
Practically, clustering algorithms exhibit large differences in computational complexity (the worst-case runtime behavior as a function of the problem size). Methods depending on the pairwise distances of all instances (stored in a so-called proximity matrix) are at least quadratic in time and space, meaning that these methods are not suitable for very large datasets. Other methods like k-means are better suited for such problems (see below).
As stated above, the goal of clustering is to optimize the conflicting goals of maximal homogeneity and maximal separation. However, in general, the evaluation of clustering results is not trivial. Usually, the evaluation of clusters involves inspection by human domain experts. There exist several approaches to evaluating and comparing methods:
If clustering is viewed as density estimation, then the likelihood of the data given the clusters can be estimated. This method can be used to evaluate clustering results on fresh test data.

Other measures can be applied, for instance, based on the mean (minimum, maximum) distance within a cluster and the mean (minimum, maximum) distance between instances coming from different clusters.

Take a classification task and see whether the known class structure is rediscovered by the clustering algorithm. The idea is to hide the class labels, cluster the data, and check to which degree the clusters contain instances of the same class (see the sketch after this list).

Vice versa, one can turn the discovered clusters into a classification problem: Define one class per cluster, that is, assign the same class label to all instances within each cluster, then apply a classifier to the dataset and estimate how well the classes can be separated. In this way, we can also obtain an interpretation of the clusters found.
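A minimal sketch of the class-rediscovery check, assuming scikit-learn; the adjusted Rand index is used here as one possible measure of agreement between the hidden classes and the discovered clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Hide the class labels from the clusterer, cluster, then compare.
X, true_classes = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 means the clusters reproduce the classes exactly;
# values near 0 indicate chance-level agreement.
print(adjusted_rand_score(true_classes, clusters))
```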
Example applications can be found in the area of microarray data (10). For instance, we might have the expression levels of several thousand genes for a group of patients. The user may be interested in two tasks: clustering genes and clustering patients. In the former case, we are interested in finding genes that behave similarly across all patients. In the latter case, we are looking for subgroups of patients that share a common gene expression profile.
2.2.1 Hierarchical Clustering
In hierarchical clustering, the goal is to find a hierarchy of nested clusters. Most algorithms in this category are agglomerative, that is, they work bottom-up, starting with single-instance clusters and merging the closest clusters until all data points lie within the same cluster. Obviously, one of the design decisions is how to define the distance between two clusters, as opposed to the distance between two instances. Hierarchical clustering algorithms can also work top-down, in which case they are called divisive. Divisive hierarchical clustering starts off with all instances in the same cluster. In each iteration, one cluster is selected and split according to some criterion, until all clusters contain a single instance. There are many more agglomerative than divisive clustering algorithms in the literature, and usually divisive clustering is more time-consuming.

In general, hierarchical clustering is at least quadratic in the number of instances, which makes it impractical for very large datasets. Both agglomerative and divisive methods produce a so-called dendrogram, i.e., a graph showing at which "costs" two clusters are merged or divided. Dendrograms can readily be interpreted, but have to be handled with care. It is clear that hierarchical clustering algorithms by definition detect hierarchies of clusters in the data, whether they exist or not ("to a hammer everything looks like a nail"). A frequent mistake is to apply this type of clustering uncritically and to present the clustering results as the structure inherent in the data, as opposed to the result of an algorithm.
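A minimal sketch of agglomerative clustering with SciPy, on synthetic data; the heights of the joins in the dendrogram are the "costs" at which clusters are merged:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Two synthetic groups of points in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Bottom-up (agglomerative) merging; average linkage defines the
# distance between two clusters as the mean pairwise distance.
Z = linkage(X, method="average")
dendrogram(Z)  # merge costs appear as the heights of the joins
plt.show()
```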
2.2.2 k-Means
k-Means clusters the data into k groups, where k is specified by the user. In the first step, k cluster centers are chosen (e.g., randomly). In the second step, the instances are assigned to the clusters based on their distance to the cluster centers determined in the first step. Third, the centroids of the clusters from step two are computed. These steps are repeated until convergence is reached.

The complexity of k-means (if run for a constant number of iterations; so far, no results about the convergence behavior of this old and simple algorithm are known) is O(kn), where k is the number of centroids and n is the size of the dataset. The linearity in the number of instances makes the algorithm well suited for very large datasets.
Again, a word of caution is in order: The results can vary significantly based on the initial choice of cluster centers. The algorithm is guaranteed to converge, but it converges only to a local optimum. Therefore, the standard approach is to run k-means several times with different random seeds ("random restarts"). Another disadvantage of k-means is that the number of clusters has to be specified beforehand. In the meantime, algorithms automatically choosing k, such as X-means (11), have been proposed.
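A minimal NumPy sketch of the three steps and the random-restart strategy; the within-cluster sum of squared distances serves as the score for choosing among restarts:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=None):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: pick centers
    for _ in range(n_iter):                                 # fixed number of iterations
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                       # step 2: nearest center
        centers = np.array([X[assign == j].mean(axis=0)     # step 3: new centroids
                            if np.any(assign == j) else centers[j]
                            for j in range(k)])
    score = ((X - centers[assign]) ** 2).sum()              # within-cluster scatter
    return assign, centers, score

X = np.random.default_rng(1).normal(size=(200, 2))
# Random restarts: keep the run with the lowest score.
best = min((kmeans(X, k=3, seed=s) for s in range(10)), key=lambda run: run[2])
```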
2.2.3 Probabilistic/Model-Based Clustering
Ideally, clustering boils down to density estimation, that is, estimating the joint probability distribution of all our random variables of interest. The advantage of this view is that we can compare clusterings objectively using the log-likelihood. From a probabilistic perspective, we want to find the clusters that give the best explanation of the data. Also, it is desirable that each instance is not assigned deterministically to a cluster, but only with a certain probability. One of the most prominent approaches to this task is mixture modeling, where the joint probability distribution is modeled as a weighted sum of component distributions, for instance Gaussians. In the latter case, we are speaking of Gaussian mixture models. In mixture modeling, each cluster is represented by one distribution, governing the probabilities of feature values in the corresponding cluster. Since we usually consider only a finite number of clusters, we are speaking of finite mixture models (12).

One of the most fundamental algorithms for finding finite mixture models is the EM (expectation-maximization) algorithm (13). EM can be viewed as a generalization of k-means as sketched above. EM also relies on random initializations and random restarts, and often converges in a few iterations to a local optimum. Still, probabilistic/model-based clustering can be computationally very costly and relies on reasonable assumptions about the distributions governing the data.
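A minimal sketch of Gaussian mixture modeling, assuming scikit-learn (whose GaussianMixture is fitted with EM); the mean log-likelihood of held-out data gives the objective comparison mentioned above, and predict_proba yields the soft cluster assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data from two Gaussian clusters, shuffled and split.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
rng.shuffle(X)
X_train, X_test = X[:300], X[300:]

# n_init > 1 gives EM the random restarts it needs to escape
# poor local optima.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X_train)

print(gmm.score(X_test))          # mean log-likelihood of fresh test data
print(gmm.predict_proba(X_test))  # probabilistic cluster memberships
```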
2.2.4 Other Relevant References
Another popular clustering algorithm is CLICK by Ron Shamir and colleagues, which works in two phases: the first phase is divisive, the second agglomerative (14). Implementations are available from the website of the University of Tel Aviv. A survey paper by the same author (15) compares clustering algorithms in the context of gene expression data. Another recent experimental comparison of several algorithms on gene expression data has been performed by Datta and Datta (16). Finally, fuzzy c-means (17), an algorithm based on k-means and fuzzy logic, appears to be popular in the bioinformatics literature as well.
2.3 Mining for Interesting Patterns
In this section, we introduce the task of mining for interesting patterns. Generally speaking, the task is defined as follows: We are given a language of patterns L, a database D, and an "interestingness predicate" q, which specifies which patterns in L are of interest to the user with respect to D. The task is then to find all patterns in L that satisfy q with respect to D. Alternatively, we might have a numeric measure of interestingness that is to be maximized. For space reasons, we will only focus on the former problem definition here.
In its simplest form, interesting pattern discovery boils down to frequent pattern discovery. That is, we are interested in all patterns that occur with a minimum frequency in a given database. Why is frequency an interesting property? First, if a pattern occurs only in a few cases, it might be due to chance. Second, frequent patterns may be useful when it comes to prediction. Infrequent patterns are not likely to be useful when we have to generalize over several instances in order to make a reasonable prediction.
Frequent pattern discovery can be defined for many so-called pattern domains. Pattern domains are given by the types of data in D and the types of patterns in L that are searched for in D. Obviously, L is to a large extent determined by the types of data in D. For instance, we might want to analyze databases of graphs (e.g., 2D structures of small molecules). The language L could then be defined as general subgraphs. Alternatively, we might look for free (that is, unrooted) trees or linear paths (e.g., linear fragments, see below).
The most common pattern domain is that of so-called itemsets. Consider a database of supermarket transactions consisting of items that are purchased together. Let I be the set of all possible products (so-called items). Then every purchase can be represented as a subset of I (neglecting multiplicities). Sets of items X are commonly called itemsets. Thus, the transaction database D is a multiset of itemsets. The classic DM problem then consists of finding all itemsets occurring with a minimum frequency in the database. Note that the pattern language L here consists of all possible itemsets, and the interestingness predicate q requires that the frequency of a pattern in D is at least some user-defined threshold min_freq. In other words, an itemset is of interest to the user if it occurs frequently enough in the transaction database.
Let us illustrate these concepts with an example. In the example, the database consists of the following six transactions: D = {x1x2x3x4, x1x3, x3x4, x1x3, x1x2x3, x1x4}. If we are asking for all itemsets with a minimum absolute frequency of 3 in this database, we obtain the following set of solution patterns: {x1, x3, x4, x1x3}.
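A minimal levelwise sketch in the spirit of Apriori that reproduces this example: each level counts candidate itemsets, and only the frequent ones are extended.

```python
from itertools import combinations

D = [{"x1", "x2", "x3", "x4"}, {"x1", "x3"}, {"x3", "x4"},
     {"x1", "x3"}, {"x1", "x2", "x3"}, {"x1", "x4"}]
min_freq = 3

def freq(itemset):
    # Number of transactions that contain the itemset.
    return sum(itemset <= t for t in D)

items = sorted({i for t in D for i in t})
level = [frozenset([i]) for i in items]
solutions = []
while level:
    frequent = [s for s in level if freq(s) >= min_freq]
    solutions += frequent
    # Candidate generation: unions of frequent itemsets that are
    # exactly one item larger.
    level = list({a | b for a, b in combinations(frequent, 2)
                  if len(a | b) == len(a) + 1})

print(solutions)  # the itemsets x1, x3, x4, and x1x3
```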
Algorithms for frequent pattern discovery generally suffer from the combinatorial explosion that is due to the structure of the pattern space: the worst-case complexity is mostly exponential. The practical behavior of these algorithms, however, depends very much on properties of the data (e.g., how dense the transaction database is) and, conversely, on the minimum frequency threshold specified by the user. Often, the question is how low the minimum frequency can be set before the programs run out of memory. Another problem with most algorithms for frequent pattern discovery is that they usually return far too many patterns. Therefore, the relevant patterns have to be organized in a human-readable way or ranked before they can be presented to the user. Finally, users should be aware that frequent patterns might only describe one part of the dataset and ignore the rest. In other words, all the frequent patterns might occur in the same part of the database, and the remaining part is not represented at all.

On the positive side, search strategies like levelwise search (19) are quite general, so that their implementation for new pattern domains, say, strings or XML data, is mostly straightforward, if necessary.
Examples of frequent pattern discovery are the search for frequently occurring molecular fragments in small molecules, the search for motifs in protein data, or the search for coexpressed genes in microarray experiments.
Continuing the example from above, we might have not just one database, but two, and ask for all patterns that are frequent in the first and infrequent in the second. This is the kind of query we would like to pose in differential data analysis. For instance, we might be looking for differentially expressed genes. The general idea is to build systems that are able to answer complex queries of this kind. The user poses constraints on the patterns of interest, and the systems employ intelligent search techniques to come up with solutions satisfying the constraints. The use of constraints and query languages in DM