...all their parameters. Also, 86 commonly used classification datasets were taken from the UCI repository and inserted together with their calculated characteristics. Then, to generate a sample of classification experiments that covers a wide range of conditions, while also allowing us to test the performance of some algorithms under very specific conditions, some algorithms were explored more thoroughly than others. First, we ran all experiments with their default parameter settings on all datasets. Secondly, we defined sensible values for the most important parameters of the algorithms SMO (which trains a support vector machine), MultilayerPerceptron, J48 (a C4.5 implementation), 1R (a simple rule learner) and Random Forests (an ensemble learner) and varied each of these parameters one by one, while keeping all other parameters at default. Finally, we further explored the parameter spaces of J48 and 1R by selecting random parameter settings until we had about 1000 experiments on each dataset. For all randomized algorithms, each experiment was repeated 20 times with different random seeds. All experiments (about 250,000 in total) were evaluated using 10-fold cross-validation, using the same folds for each dataset.
An online interface is available at http://www.cs.kuleuven.be/~dtai/expdb/ for those who want to reuse experiments for their own purposes, together with a full description and code which may be of use to set up similar databases, for example to store, analyse and publish the results of large benchmark studies.
4 Using the database
We will now illustrate how easy it is to use this experiment database to investigate a wide range of questions on the behavior of learning algorithms by simply writing the right queries and interpreting the results, or by applying data mining algorithms to model more complex interactions.
4.1 Comparing different algorithms
A first question may be "How do all algorithms in this database compare on a specific dataset D?" To investigate this, we query for the learning algorithm name and evaluation result (e.g. predictive accuracy), linked to all experiments on (an instance of) dataset D, which yields the following query:

SELECT l.name, v.pred_acc
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid
  and e.data_inst = di.diid and di.did = d.did and d.name='D'
Fig. 2 Algorithm performance comparison on the monks-problems-2_test dataset.

...spectrum (like J48), others jump to 100% accuracy for certain parameter settings (SMO with higher-order polynomial kernels and MultilayerPerceptron when enough hidden nodes are used).
We can also compare two algorithms A1 and A2 on all datasets by joining their performance results (with default settings) on each dataset and plotting them against each other, as shown in Fig. 3. Moreover, querying also allows us to use aggregates and to order results, e.g. to directly build rankings of all algorithms by their average error over all datasets, using default parameters:
SELECT l.name, avg(v.mn_abs_err) AS avg_err
FROM experiment e, learner l, learner_inst li, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid
  and li.lid = l.lid and li.default = true
GROUP BY l.name ORDER BY avg_err asc
Similar questions can be answered in the same vein. With small adjustments, we can query for the variance of each algorithm's error (over all datasets or a single one), study how much error rankings differ from one dataset to another, or study how parameter optimization affects these rankings.
SELECT s1.name, avg(s1.pred_acc) AS A1_acc, avg(s2.pred_acc) AS A2_acc
FROM (SELECT d.name, e.pred_acc FROM ... WHERE l.name = 'A1') AS s1
JOIN (SELECT d.name, e.pred_acc FROM ... WHERE l.name = 'A2') AS s2
  ON s1.name = s2.name
GROUP BY s1.name

Fig. 3 Comparing relative performance of J48 and OneR with a single query.
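Because the repository is an ordinary relational database, such rankings can also be retrieved programmatically. Below is a minimal JDBC sketch using the ranking query from above; the connection URL, user and password are hypothetical placeholders for a MySQL deployment of the database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AlgorithmRanking {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost/expdb";  // placeholder location
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             // The ranking query from above; note that depending on the DBMS,
             // li.default may need quoting, since DEFAULT is a reserved word.
             ResultSet rs = st.executeQuery(
                 "SELECT l.name, avg(v.mn_abs_err) AS avg_err " +
                 "FROM experiment e, learner l, learner_inst li, evaluation v " +
                 "WHERE v.eid = e.eid and e.learner_inst = li.liid " +
                 "and li.lid = l.lid and li.default = true " +
                 "GROUP BY l.name ORDER BY avg_err asc")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getDouble("avg_err"));
            }
        }
    }
}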
4.2 Querying for parameter effects
Previous queries generalized over all parameter settings. Yet, starting from our first query, we can easily study the effect of a specific parameter P by "zooming in" on the results of algorithm A (by adding this constraint) and selecting the value of P linked to (an instantiation of) A, yielding Fig. 4a:
SELECT v.pred_acc, lv.value
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v, learner_parameter lp, learner_parval lv
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid
  and l.name='A' and lv.liid = li.liid and lv.pid = lp.pid
  and lp.name='P' and e.data_inst = di.diid and di.did = d.did and d.name='D'

Sometimes the effect of a parameter P may depend on the value of another parameter. Such a parameter P2 can however be controlled (e.g. by demanding that its value be larger than V) by extending the previous query with a constraint requiring that the learner instances additionally be amongst those where parameter P2 obeys those constraints:
WHERE ... and lv.liid IN
  (SELECT lv.liid FROM learner_parval lv, learner_parameter lp
   WHERE lv.pid = lp.pid and lp.name='P2' and lv.value > V)
Launching and visualizing such queries yields results such as those in Fig. 4, clearly showing the effect of the selected parameter and the variation caused by other parameters. As such, it is immediately obvious how general an observed trend is: all constraints are explicitly mentioned in the query.
Fig. 4 The effect of the minimal leafsize of J48 on monks-problems-2_test (a), after requiring binary trees (b), and after also suppressing reduced error pruning (c).
4.3 Querying for the effect of dataset properties
It also becomes easy to investigate the interactions between data properties and learning algorithms. For instance, we can use our experiments to study the effect of a dataset's size on the performance of algorithm A²:

SELECT v.pred_acc, d.nr_examples
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid
  and l.name='A' and e.data_inst = di.diid and di.did = d.did
4.4 Applying data mining techniques to the experiment database
There can be very complex interactions between parameter settings, dataset characteristics and the resulting performance of learning algorithms. However, since a large number of experimental results are available for each algorithm, we can apply data mining algorithms to model those interactions.

For instance, to automatically learn which of J48's parameters have the greatest impact on its performance on monks-problems-2_test (see Fig. 4), we queried for the available parameter settings and corresponding results. We discretized the performance with thresholds on 67% (default accuracy) and 85%, and we used J48 to generate a (meta-)decision tree that, given the used parameter settings, predicts in which interval the accuracy lies. The resulting tree (with 97.3% accuracy) is shown in Fig. 5. It clearly shows which are the most important parameters to tune, and how they affect J48's performance.
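The meta-learning step itself needs only a few standard WEKA calls. A minimal sketch, assuming the queried parameter settings and discretized accuracies have been exported to a (hypothetical) ARFF file whose last attribute is the accuracy interval:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class MetaTree {
    public static void main(String[] args) throws Exception {
        // Hypothetical export from the experiment database: one row per
        // parameter setting, class attribute = discretized accuracy interval.
        Instances data = new Instances(
            new BufferedReader(new FileReader("j48_parameters_meta.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();      // the same learner, now applied at the meta level
        tree.buildClassifier(data);
        System.out.println(tree);  // textual form of the (meta-)decision tree
    }
}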
Likewise, we can study for which dataset characteristics one algorithm greatly outperforms another. Starting from the query in Fig. 3, we additionally queried for a wide range of data characteristics and discretized the performance gain of J48 over 1R into three classes: "draw", "win_J48" (4% to 20% gain), and "large_win_J48" (20% to 70% gain). The tree returned by J48 on this meta-dataset is shown in Fig. 6, and clearly shows for which kinds of datasets J48 has a clear advantage over OneR.
Fig. 5 Impact of parameter settings.
Fig. 6 Impact of dataset properties.
² To control the value of additional dataset properties, simply add these constraints to the list: WHERE ... and d.nr_attributes > 5.
4.5 On user-friendliness
The above SQL queries are relatively complicated. Part of this is however a consequence of the relatively complex structure of the database. A good user interface, including a graphical query tool and an integrated visualization tool, would greatly improve the usability of the database.

5 Conclusions
We have presented an experiment database for classification, providing a well-structured repository of fully described classification experiments, thus allowing them to be easily verified, reused and related to theoretical properties of algorithms and datasets. We show how easy it is to investigate a wide range of questions on the behavior of these learning algorithms by simply writing the right queries and interpreting the results, or by applying data mining algorithms to model more complex interactions. The database is available online and can be used to gain new insights into classifier learning and to validate and refine existing results. We believe this database and the underlying software may become a valuable resource for research in classification and, more broadly, machine learning and data analysis.
Acknowledgements
We thank Anneleen Van Assche and Celine Vens for their useful comments and help in building meta-decision trees, and Anton Dries for implementing the dataset characterizations. Hendrik Blockeel is a Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (FWO-Vlaanderen), and this research is further supported by GOA 2003/08 "Inductive Knowledge Bases".
References
BLOCKEEL, H. (2006): Experiment databases: A novel methodology for experimental research. Lecture Notes in Computer Science, 3933, 72-85.

BLOCKEEL, H. and VANSCHOREN, J. (2007): Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning. Lecture Notes in Computer Science, 4702, to appear.

KALOUSIS, A. and HILARIO, M. (2000): Building Algorithm Profiles for prior Model Selection in Knowledge Discovery Systems. Engineering Intelligent Systems, 8(2).

PENG, Y. et al. (2002): Improved Dataset Characterisation for Meta-Learning. Lecture Notes in Computer Science, 2534, 141-152.

VAN SOMEREN, M. (2001): Model Class Selection and Construction: Beyond the Procrustean Approach to Machine Learning Applications. Lecture Notes in Computer Science, 2049, 196-217.

WITTEN, I.H. and FRANK, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann.
KNIME: The Konstanz Information Miner
Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel and Bernd Wiswedel

ALTANA Chair for Bioinformatics and Information Mining,
Department of Computer and Information Science, University of Konstanz,
Box M712, 78457 Konstanz, Germany
contact@knime.org
Abstract. The Konstanz Information Miner is a modular environment, which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables simple integration of new algorithms and tools as well as data manipulation or visualization methods in the form of new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.
...or models. An additional advantage of these systems is the intuitive, graphical way to document what has been done.
KNIME, the Konstanz Information Miner, provides such a pipelining environment. Figure 1 shows a screenshot of an example analysis flow. In the center, a flow is reading in data from two sources and processes it in several parallel analysis flows, consisting of preprocessing, modeling, and visualization nodes. On the left a repository of nodes is shown. From this large variety of nodes, one can select data sources, data preprocessing steps, model building algorithms, as well as visualization tools, and drag them onto the workbench, where they can be connected to other nodes. The ability to have all views interact graphically (visual brushing) creates a powerful environment to visually explore the data sets at hand.

Fig. 1 An example analysis flow inside KNIME.

KNIME is written in Java and its graphical workflow editor is implemented as an Eclipse (Eclipse Foundation (2007)) plug-in. It is easy to extend through an open API and a data abstraction framework, which allows for new nodes to be quickly added in a well-defined way.

In this paper we describe some of the internals of KNIME in more detail. More information as well as downloads can be found at http://www.knime.org.
2 Architecture
The architecture of KNIME was designed with three main principles in mind.

• Visual, interactive framework: Data flows should be combined by simple drag&drop from a variety of processing units. Customized applications can be modeled through individual data pipelines.
• Modularity: Processing units and data containers should not depend on each other in order to enable easy distribution of computation and allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined, new types can easily be added bringing along type-specific renderers and comparators. New types can be declared compatible to existing types.
• Easy expandability: It should be easy to add new processing nodes or views and distribute them through a simple plugin mechanism without the need for complicated install/deinstall procedures.
In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models. Each node processes the arriving data and/or model(s) and produces results on its outputs when requested. Figure 2 schematically illustrates this process. The type of processing ranges from basic data operations such as filtering or merging, to simple statistical functions such as computations of mean, standard deviation or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow for an interactive exploration of their results through accompanying views. In the following we will briefly describe the underlying schemata of data, node, workflow management, and how the interactive views communicate.
2.1 Data structures
All data flowing between nodes is wrapped within a class called DataTable, which holds meta-information concerning the type of its columns in addition to the actual data. The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key) and a specific number of DataCell objects, which hold the actual data. The reason to avoid access by Row ID or index is scalability, that is, the desire to be able to process large amounts of data and therefore not be forced to keep all of the rows in memory for fast random access. KNIME employs a powerful caching strategy which moves parts of a data table to the hard drive if it becomes too large. Figure 3 shows a UML diagram of the main underlying data structure.
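As a small illustration of this access pattern, the following sketch prints the first column of a table. It is based on the classes described above; exact package names and method signatures may differ between KNIME versions.

import org.knime.core.data.DataCell;
import org.knime.core.data.DataRow;
import org.knime.core.data.DataTable;

public final class TableUtil {
    // Rows are accessed by iteration only, so the table as a whole
    // never has to be held in memory for random access.
    public static void printFirstColumn(final DataTable table) {
        for (DataRow row : table) {
            DataCell cell = row.getCell(0);  // the first DataCell of this row
            System.out.println(row.getKey() + ": " + cell);  // key = unique row identifier
        }
    }
}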
2.2 Nodes
Nodes in KNIME are the most general processing units and usually resemble one node in the visual workflow representation. The class Node wraps all functionality and makes use of user-defined implementations of a NodeModel, possibly a NodeDialog, and one or more NodeView instances if appropriate. Neither dialog nor view must be implemented if no user settings or views are needed. This schema follows the well-known Model-View-Controller design pattern. In addition, for the input and output connections, each node has a number of Inport and Outport instances, which can either transport data or models. Figure 4 shows a UML diagram of this structure.

Fig. 2 A schematic for the flow of data and models in a KNIME workflow.
2.3 Workflow management

...of servers. Thanks to the underlying graph structure, the workflow manager is able to determine all nodes required to be executed along the paths leading to the node the user actually wants to execute.
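The underlying idea can be sketched as a plain graph traversal. The code below is a generic illustration, not the actual KNIME API; the string-keyed predecessor map stands in for the real workflow graph.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class WorkflowSketch {
    // Given the predecessor relation of the workflow graph, collect every
    // node (including 'target' itself) that must run so 'target' can execute.
    static Set<String> nodesToExecute(Map<String, List<String>> predecessors, String target) {
        Set<String> required = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(target);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (required.add(node)) {  // visit each node at most once
                for (String pred : predecessors.getOrDefault(node, List.of())) {
                    stack.push(pred);
                }
            }
        }
        return required;
    }
}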
Fig. 3 A UML diagram of the data structure and the main classes it relies on.
Fig. 4 A UML diagram of the Node and the main classes it relies on.
2.4 Views and interactive brushing
Each Node can have an arbitrary number of views associated with it. Through receiving events from a HiLiteHandler (and sending events to it) it is possible to mark selected points in such a view to enable visual brushing. Views can range from simple table views to more complex views on the underlying data (e.g. scatterplots, parallel coordinates) or the generated model (e.g. decision trees, rules).
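Conceptually this is an observer pattern. The following generic sketch mirrors the roles described above; the real HiLiteHandler and its listener interface differ in their details.

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

interface HiLiteListenerSketch {
    void hiLite(Set<String> rowKeys);    // rows newly marked in some other view
    void unHiLite(Set<String> rowKeys);  // rows that were unmarked
}

final class HiLiteHandlerSketch {
    private final Set<HiLiteListenerSketch> listeners = new LinkedHashSet<>();
    private final Set<String> hilit = new HashSet<>();

    void addHiLiteListener(HiLiteListenerSketch l) {
        listeners.add(l);
    }

    // Called by a view when the user selects rows; all registered views
    // are notified so they can highlight the same rows (visual brushing).
    void fireHiLiteEvent(Set<String> rowKeys) {
        hilit.addAll(rowKeys);
        for (HiLiteListenerSketch l : listeners) {
            l.hiLite(rowKeys);
        }
    }
}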
2.5 Meta nodes
So-called Meta Nodes wrap a sub-workflow into an encapsulating special node. This provides a series of advantages, such as enabling the user to design much larger, more complex workflows and the encapsulation of specific actions. To this end some customized meta nodes are available, which allow for a repeated execution of the enclosed sub-workflow, offering the ability to model more complex scenarios such as cross-validation, bagging and boosting, ensemble learning etc. Due to the modularity of KNIME, these techniques can then be applied virtually to any (learning) algorithm available in the repository.

Additionally, the concept of Meta Nodes helps to assign dedicated servers to this subflow or export the wrapped flow to other users as a predefined module.
2.6 Distributed processing
Due to the modular architecture it is easy to designate specific nodes to be run on separate machines. But to accommodate the increasing availability of multi-core machines, the support for shared memory parallelism also becomes increasingly important. KNIME offers a unified framework to parallelize data-parallel operations. Sieb et al. (2007) describe further extensions, which enable the distribution of complex tasks such as cross-validation on a cluster or a GRID.
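The essence of such data-parallel execution can be sketched with standard Java concurrency utilities. This is a generic illustration of the idea, not KNIME's actual parallelization framework.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public final class ParallelSketch {
    // Apply 'op' to every row using a fixed-size thread pool; the result
    // list preserves input order although rows are processed concurrently.
    static <T, R> List<R> mapParallel(List<T> rows, Function<T, R> op, int nThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T row : rows) {
                futures.add(pool.submit(() -> op.apply(row)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}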
3 Repository
KNIME already offers a large variety of nodes, among them nodes for various types of data I/O, manipulation and transformation, as well as data mining and machine learning, and a number of visualization components. Most of these nodes have been specifically developed for KNIME to enable tight integration with the framework; other nodes are wrappers, which integrate functionality from third-party libraries. Some of these are summarized in the next section.
• Mining algorithms: clustering (k-means, sota, fuzzy c-means), decision tree, (fuzzy) rule induction, regression, subgroup and association rule mining, neural networks (probabilistic neural networks and multi-layer perceptrons)
• Visualization: scatter plot, histogram, parallel coordinates, multidimensional scaling, rule plotters
• Misc: scripting nodes
3.2 External tools
KNIME integrates functionality of different open source projects that essentially cover all major areas of data analysis, such as WEKA (Witten and Frank (2005)) for machine learning and data mining, the R environment (R Development Core Team (2007)) for statistical computations and graphics, and JFreeChart (Gilbert (2005)) for visualization.

• WEKA: essentially all algorithm implementations, for instance support vector machines, Bayes networks and Bayes classifier, decision tree learners
• R-project: a console node to interactively execute R commands, a basic R plotting node
• JFreeChart: various line, pie and histogram charts

The integration of these tools not only enriches the functionality available in KNIME but has also proven helpful to overcome compatibility limitations when the aim is to use these different libraries in a shared setup.
4 Extending KNIME

KNIME already includes plug-ins to incorporate existing data analysis tools. It is usually straightforward to create wrappers for external tools without having to modify these executables themselves. Adding new nodes to KNIME, also for native new operations, is easy. For this, one needs to extend three abstract classes:

• NodeModel: this class is responsible for the main computations. It requires one to overwrite three main methods: configure(), execute(), and reset(). The first takes the meta information of the input tables and creates the definition of the output specification. The execute function performs the actual creation of the output data or models, and reset discards all intermediate results.
• NodeDialog: this class is used to specify the dialog that enables the user to adjust individual settings that affect the node's execution. A standardized set of DefaultDialogComponent objects allows the node developer to quickly create dialogs when only a few standard settings are needed.
• NodeView: this class can be extended multiple times to allow for different views onto the underlying model. Each view is automatically registered with a HiLiteHandler which sends events when other views have hilited points and allows to launch events in case points have been hilited inside this view.

In addition to the three model, dialog, and view classes, the programmer also needs to provide a NodeFactory, which serves to create new instances of the above classes. The factory also provides names and other details such as the number of available views or a flag indicating absence or presence of a dialog.

A wizard integrated in the Eclipse-based development environment enables convenient generation of all required class bodies for a new node.
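To make the division of labor concrete, a skeletal NodeModel for a node that simply passes its input table through might look as follows. This is a sketch only: the signatures are simplified/assumed and may differ between KNIME versions, and the settings-persistence callbacks required by the actual base class are omitted.

import org.knime.core.data.DataTableSpec;
import org.knime.core.node.BufferedDataTable;
import org.knime.core.node.ExecutionContext;
import org.knime.core.node.InvalidSettingsException;
import org.knime.core.node.NodeModel;

public class PassThroughNodeModel extends NodeModel {

    public PassThroughNodeModel() {
        super(1, 1);  // one data input port, one data output port
    }

    @Override
    protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
            throws InvalidSettingsException {
        // Take the meta information of the input table and create the
        // definition of the output specification (unchanged in this sketch).
        return new DataTableSpec[]{inSpecs[0]};
    }

    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
            final ExecutionContext exec) throws Exception {
        // Perform the actual creation of the output data or models;
        // a real node would transform inData here.
        return new BufferedDataTable[]{inData[0]};
    }

    @Override
    protected void reset() {
        // Discard all intermediate results.
    }

    // (Settings load/save and internals load/save callbacks omitted for brevity.)
}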
5 Conclusion
KNIME, the Konstanz Information Miner, offers a modular framework which provides a graphical workbench for visual assembly and interactive execution of data pipelines. It features a powerful and intuitive user interface, enables easy integration of new modules or nodes, and allows for interactive exploration of analysis results or trained models. In conjunction with the integration of powerful libraries such as the WEKA data mining toolkit and the R statistics software, it constitutes a feature-rich platform for various data analysis tasks.

KNIME is an open source project available at http://www.knime.org. The current release, version 1.2.1 (as of 14 May 2007), has numerous improvements over the first public version released in July 2006. KNIME is actively maintained by a group of about 10 people and has more than 6,000 downloads so far. It is free for non-profit and academic use.