The main goal of the Discovery Net project is to design, develop, and implement an infrastructure that effectively supports scientific knowledge discovery processes from high-throughput informatics. In this context, a series of testbeds and demonstrations are being carried out, applying the technology in the areas of life sciences, environmental modeling, and geo-hazard prediction.
The building blocks in Discovery Net are the so-called Knowledge Discovery Services (KDS), which are distinguished into Computation Services and Data Services. The former typically comprise algorithms, e.g., for data preparation and Data Mining, while the latter define relational tables (as queries) and other data sources. Both kinds of services are described (and registered) by means of Adapters, which provide information such as input and output types, parameters, location and/or platform and operating-system constraints, factories (objects allowing clients to retrieve references to services and to download them), keywords, and a human-readable description. KDS are used to compose moderately complex data-pipelined processes. The composition may be carried out by means of a GUI that provides access to a library of services. The XML-based language used to describe processes is called the Discovery Process Markup Language (DPML). Each composed process can be deployed and published as a new process. Typically, process descriptions are not bound to specific servers, since the actual resources are later resolved by lookup servers (see below).
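The adapter-based composition described above can be illustrated with a short sketch. All names here (the `Adapter` and `Process` classes, the type-checking rule) are assumptions made for illustration, not Discovery Net's actual API; the point is only to show how adapter metadata, such as input and output types, makes pipelined composition checkable before execution:

```python
# Hypothetical sketch of KDS-style composition: services are registered
# through adapters carrying metadata, and a pipeline is type-checked
# as it is assembled. Illustrative only; not the Discovery Net API.

class Adapter:
    """Describes a registered service: name, I/O types, and the callable."""
    def __init__(self, name, input_type, output_type, run):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type
        self.run = run  # the callable implementing the service

class Process:
    """A linear, data-pipelined process built from compatible services."""
    def __init__(self):
        self.stages = []

    def add(self, adapter):
        # Composition check: the output type of the previous stage
        # must match the input type of the stage being added.
        if self.stages and self.stages[-1].output_type != adapter.input_type:
            raise TypeError(f"cannot pipe {self.stages[-1].name} into {adapter.name}")
        self.stages.append(adapter)
        return self

    def execute(self, data):
        # Run each stage in order, feeding each output to the next stage.
        for stage in self.stages:
            data = stage.run(data)
        return data

# Example: a data-preparation service feeding a (toy) mining service.
clean = Adapter("clean", "table", "table", lambda rows: [r for r in rows if r is not None])
count = Adapter("count", "table", "model", lambda rows: {"n": len(rows)})

pipeline = Process().add(clean).add(count)
```

In a real system, the adapter metadata would also carry parameters, location, and platform constraints, and the pipeline description would be serialized (in Discovery Net's case, as DPML) rather than built directly in code.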
Discovery Net is based on an open architecture using common protocols and infrastructures such as the Globus Toolkit. Servers are distinguished into (i) Knowledge Servers, allowing storage and retrieval of knowledge (meant as raw data and knowledge models) and processes; (ii) Resource Discovery Servers, providing a knowledge base of service definitions and performing resource resolution; and (iii) Discovery Meta-Information Servers, used to store information about the Knowledge Schema, i.e., the sets of features of known databases, their types, and how they can be composed with each other.
Finally, we outline here some interesting Data Mining testbeds developed at the National Center for Data Mining (NCDM) at the University of Illinois at Chicago (UIC) (www.ncdm.uic.edu/testbeds.htm):
• The Terra Wide Data Mining Testbed (TWDM). TWDM is an infrastructure for the remote analysis, distributed mining, and real-time exploration of scientific, engineering, business, and other complex data. It consists of five geographically distributed nodes linked by optical networks through StarLight (an advanced optical infrastructure) in Chicago. These sites include StarLight itself, the Laboratory for Advanced Computing at UIC, SARA in Amsterdam, and Dalhousie University in Halifax. In 2003 new sites will be connected, including Imperial College in London. A central idea in TWDM is to keep generated predictive models up to date with respect to newly available data, in order to achieve better predictions (an important aspect in many "critical" domains, such as infectious disease tracking). TWDM is based on DataSpace, another NCDM project for supporting real-time streaming data; in DataSpace, the Data Transformation Markup Language (DTML) is used to describe how to update "profiles", i.e., aggregate data that are inputs of predictive models, on the basis of new "events", i.e., new bits of information.
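The profile/event mechanism can be sketched in a few lines. The `Profile` class and its running-mean aggregate below are assumptions chosen for illustration and do not reproduce actual DTML semantics; they only show why incremental profile updates suit streaming data:

```python
# Illustrative sketch of the DataSpace idea: a "profile" is an aggregate
# feeding a predictive model, kept up to date incrementally as "events"
# (new bits of information) arrive, without re-scanning historical data.

class Profile:
    """Running aggregate (count and mean) over a stream of numeric events."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, event):
        # Incremental update: constant work per event, which is the point
        # of profile-based processing of real-time streams.
        self.count += 1
        self.total += event

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

# Feed a small stream of events into the profile.
profile = Profile()
for event in [4.0, 6.0, 8.0]:
    profile.update(event)
```

A predictive model would then read the profile's current values (here `count` and `mean`) as its inputs, rather than the raw event history.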
• The Terabyte Challenge Testbed. The Terabyte Challenge Testbed is an open, distributed testbed for DataSpace tools, services, and protocols. It involves a number of organizations, including the University of Illinois at Chicago, the University of Pennsylvania, the University of California at Davis, and Imperial College. The testbed consists of ten sites distributed over three continents, connected by high-performance links. Each site provides a number of local clusters of workstations, which are connected to form wide-area meta-clusters maintained by the National Scalable Cluster Project. So far, meta-clusters have been used by applications in high-energy physics, computational chemistry, nonlinear simulation, bioinformatics, medical imaging, network traffic analysis, digital libraries of video data, etc. Currently, the Terabyte Challenge Testbed consists of approximately 100 nodes and 2 terabytes of disk storage.
• The Global Discovery Network (GDN). The GDN is a collaboration between the Laboratory for Advanced Computing of the National Center for Data Mining and the Discovery Net project (see above). It will link Discovery Net to the Terra Wide Data Mining Testbed to create a combined global testbed with a critical mass of data.
The GridMiner project at the University of Vienna aims to cover the main aspects of knowledge discovery on Grids. GridMiner is a model based on the OGSA framework (Foster et al., 2002) and embraces an open architecture in which a set of services is defined for handling data distribution and heterogeneity, supporting different types of analysis strategies, tools, and algorithms, and providing OLAP support. Key components in GridMiner are the Data Access service, the Data Mediation service, and the Data Mining service. Data Access implements access to databases and data repositories; Data Mediation provides a view of distributed data by logically integrating them into virtual data sources (VDS), allowing queries to be sent to them and the results to be combined and delivered back. The Data Mining layer comprises a set of specific services used to prepare and execute a Data Mining application, as well as to present its results. The system has not yet been implemented on a Grid; a preliminary, fully centralized version of the system is currently available.
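The mediation idea, a virtual data source that fans a query out to several physical sources and combines the results, can be sketched as follows. The class and method names are invented for this example; GridMiner's actual service interfaces differ:

```python
# Hedged sketch of data mediation: a virtual data source (VDS) logically
# integrates several physical sources behind one query interface.
# Illustrative only; not GridMiner's real API.

class ListSource:
    """Toy physical source backed by an in-memory list of records."""
    def __init__(self, records):
        self.records = records

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]

class VirtualDataSource:
    """Logical integration of several sources behind one query() call."""
    def __init__(self, sources):
        # Each source is any object exposing query(predicate) -> list.
        self.sources = sources

    def query(self, predicate):
        # Fan the query out to every underlying source and merge results.
        results = []
        for source in self.sources:
            results.extend(source.query(predicate))
        return results

# Two "distributed" sources integrated into one VDS.
vds = VirtualDataSource([ListSource([1, 5, 9]), ListSource([2, 6])])
large = vds.query(lambda x: x > 4)
```

A real mediation service would additionally handle schema heterogeneity, query translation, and remote transport, which this sketch deliberately omits.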
GATES (Grid-based AdapTive Execution on Streams) is an OGSA-based system that provides support for processing data streams in a Grid environment (Agrawal, 2003). GATES aims to support the distributed analysis of data streams arising from distributed sources (e.g., data from large-scale experiments and simulations), providing automatic resource discovery and an interface for enabling self-adaptation to meet real-time constraints.
Some of the systems discussed above support specific application domains; others support a more general class of problems. Moreover, some of these systems are mainly advanced interfaces for integrating, accessing, and elaborating large datasets, whereas others provide more specific functionalities for the support of typical knowledge discovery processes.
In the next section we present a Grid-based environment, named Knowledge Grid, whose aim is to support general PDKD applications, providing an interface both to manage and access large remote data sets, and to execute high-performance data analysis on them
53.4 The Knowledge Grid
The Knowledge Grid (Cannataro and Talia, 2003) is an environment providing knowledge discovery services for a wide range of high-performance distributed applications. Data sets and Data Mining and analysis tools used in such applications are increasingly becoming available as stand-alone packages and as remote services on the Internet. Examples include gene and DNA databases, network access and intrusion data, drug features and effects data repositories, astronomy data files, and data about web usage, content, and structure.
Knowledge discovery procedures in all these applications typically require the creation and management of complex, dynamic, multi-step workflows. At each step, data from various sources can be moved, filtered, integrated, and fed into a Data Mining tool. Based on the output results, the analyst chooses which other data sets and mining components should be integrated into the workflow, or how to iterate the process to obtain a knowledge model. Workflows are mapped onto a Grid by assigning their nodes to Grid hosts and using interconnections to implement communication among the workflow nodes.
The Knowledge Grid supports such activities by providing mechanisms and high-level services for searching resources; for representing, creating, and managing knowledge discovery processes; and for composing existing data services and Data Mining services in a structured manner, allowing designers to plan, store, document, verify, share, and re-execute their workflows, as well as manage their output results.
The Knowledge Grid architecture is composed of a set of services divided into two layers: the Core K-Grid layer, which interfaces the basic and generic Grid middleware services, and the High-level K-Grid layer, which interfaces the user by offering a set of services for the design and execution of knowledge discovery applications. Both layers make use of repositories that provide information about resource metadata, execution plans, and the knowledge obtained as a result of knowledge discovery applications.
Fig. 53.1 Main steps of application composition and execution in the Knowledge Grid
In the Knowledge Grid environment, discovery processes are represented as workflows that a user may compose using both concrete and abstract Grid resources. Knowledge discovery workflows are defined using a visual interface that shows resources (data, tools, and hosts) to the user and offers mechanisms for integrating them into a workflow. Information about single resources and workflows is stored using an XML-based notation that represents a workflow (called an execution plan in Knowledge Grid terminology) as a data-flow graph of nodes, each representing either a Data Mining service or a data-transfer service. The XML representation allows the workflows for discovery processes to be easily validated, shared, translated into executable scripts, and stored for future executions. Figure 53.1 shows the main steps of the composition and execution of a knowledge discovery application on the Knowledge Grid.
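As a rough illustration of such a notation, the following sketch serializes a tiny workflow, a data-flow graph of nodes and edges, to XML. The element and attribute names are invented for the example and do not reproduce the Knowledge Grid's actual execution-plan schema:

```python
# Minimal sketch of an execution plan as a data-flow graph: nodes are
# either Data Mining services or data-transfer services, and edges
# carry data between them. Schema names are illustrative assumptions.

import xml.etree.ElementTree as ET

def execution_plan(nodes, edges):
    """Serialize a workflow (nodes + data-flow edges) to a small XML plan."""
    plan = ET.Element("ExecutionPlan")
    for name, kind, host in nodes:
        # Each node records the service name, its kind, and a target host.
        ET.SubElement(plan, "Node", name=name, type=kind, host=host)
    for src, dst in edges:
        # "from" is a Python keyword, so pass the attributes as a dict.
        ET.SubElement(plan, "DataFlow", attrib={"from": src, "to": dst})
    return ET.tostring(plan, encoding="unicode")

xml_plan = execution_plan(
    nodes=[("extract", "data-transfer", "NodeA"),
           ("mine", "data-mining", "Node1")],
    edges=[("extract", "mine")],
)
```

Because the plan is plain XML, it can be validated against a schema, shared between users, or translated into executable scripts, which is the motivation the text gives for the XML representation.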
53.4.1 Knowledge Grid Components and Tools
Figure 53.2 shows the general structure of the Knowledge Grid system and its main components and interaction patterns.
Fig. 53.2 The Knowledge Grid general structure and components
The High-level K-Grid layer includes services used to compose, validate, and execute a parallel and distributed knowledge discovery computation. Moreover, the layer offers services to store and analyze the discovered knowledge. The main services of the High-level K-Grid layer are:
• The Data Access Service (DAS) allows for the search, selection, transfer, transformation, and delivery of data to be mined.
• The Tools and Algorithms Access Service (TAAS) is responsible for searching, selecting, and downloading Data Mining tools and algorithms.
• The Execution Plan Management Service (EPMS). An execution plan is represented by a graph describing interactions and data flows among data sources, extraction tools, Data Mining tools, and visualization tools. The EPMS allows for defining the structure of an application by building the corresponding graph and adding a set of constraints on resources. Generated execution plans are stored, through the RAEMS, in the Knowledge Execution Plan Repository (KEPR).
• The Results Presentation Service (RPS) offers facilities for presenting and visualizing the extracted knowledge models (e.g., association rules, clustering models, classifications).
The Core K-Grid layer includes two main services:
• The Knowledge Directory Service (KDS) manages metadata describing Knowledge Grid resources. Such resources comprise hosts; repositories of data to be mined; tools and algorithms used to extract, analyze, and manipulate data; distributed knowledge discovery execution plans; and knowledge obtained as a result of the mining process. The metadata information is represented by XML documents stored in a Knowledge Metadata Repository (KMR).
• The Resource Allocation and Execution Management Service (RAEMS) is used to find a suitable mapping between an "abstract" execution plan (formalized in XML) and the available resources, with the goal of satisfying the constraints (computing power, storage, memory, database, network performance) imposed by the execution plan. After the execution plan is activated, this service manages and coordinates the application execution and the storing of knowledge results in the Knowledge Base Repository (KBR).
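The matching step performed by the RAEMS can be sketched as a constraint check against host metadata. The constraint vocabulary below (`cpus`, `memory_gb`, `software`) is an assumption made for illustration; the real service consults KDS metadata and handles a richer set of constraints:

```python
# Illustrative sketch of the RAEMS mapping step: match an abstract
# execution plan's resource requirements against metadata describing
# the available hosts. Field names are illustrative assumptions.

def match_hosts(requirements, hosts):
    """Return the hosts whose metadata satisfies every requirement."""
    matched = []
    for host, meta in hosts.items():
        if (meta["cpus"] >= requirements["cpus"]
                and meta["memory_gb"] >= requirements["memory_gb"]
                and requirements["software"] in meta["software"]):
            matched.append(host)
    return matched

# Hypothetical host metadata, as a directory service might supply it.
hosts = {
    "Node1": {"cpus": 8, "memory_gb": 16, "software": {"Learner"}},
    "Node2": {"cpus": 2, "memory_gb": 4, "software": {"Learner"}},
    "NodeZ": {"cpus": 8, "memory_gb": 32, "software": {"CombinerTester"}},
}

# Find hosts able to run the Learner under the given constraints.
candidates = match_hosts({"cpus": 4, "memory_gb": 8, "software": "Learner"}, hosts)
```

In the full system, this matching would be followed by scheduling, data and software staging, and coordination of the executing steps, as described above.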
An Application Scenario
We discuss here a simple meta-learning process over the Knowledge Grid, to show how the execution of a distributed Data Mining application can benefit from the Knowledge Grid services (Cannataro et al., 2002B). Meta-learning aims to generate a number of independent classifiers by applying learning programs to a collection of distributed data sets in parallel. The classifiers computed by the learning programs are then collected and combined to obtain a global classifier (Prodromidis et al., 2000).
Figure 53.3 shows a distributed meta-learning scenario, in which a global classifier GC is obtained on NodeZ starting from the original data set DS stored on NodeA.
Fig. 53.3 A distributed meta-learning scenario
This process can be described through the following steps:
1. On NodeA, the training sets TR1, …, TRn, the testing set TS, and the validation set VS are extracted from DS by the partitioner P. Then TR1, …, TRn, TS, and VS are moved from NodeA to Node1, …, Noden, and to NodeZ, respectively.
2. On each Nodei (i = 1, …, n), the classifier Ci is trained from TRi by the learner Li. Then each Ci is moved from Nodei to NodeZ.
3. On NodeZ, the classifiers C1, …, Cn are combined and tested on TS and validated on VS by the combiner/tester CT to produce the global classifier GC.
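The three steps above can be sketched, in a purely local and much-simplified form, as follows. The threshold learners and the majority-vote combiner are stand-ins for real learning programs and combination strategies, and the "nodes" are only conceptual here:

```python
# Stdlib-only sketch of the meta-learning scenario: partition a data
# set, train one simple classifier per partition, then combine them by
# majority vote into a global classifier. Illustrative stand-ins only.

def train_threshold(training_set):
    """Learner L_i: use the midpoint between class means as a threshold."""
    pos = [x for x, y in training_set if y == 1]
    neg = [x for x, y in training_set if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0

def combine(classifiers):
    """Combiner CT: majority vote over the local classifiers."""
    def global_classifier(x):
        votes = sum(c(x) for c in classifiers)
        return 1 if votes > len(classifiers) / 2 else 0
    return global_classifier

# Step 1: partition the data set DS into training sets TR1..TRn.
ds = [(1, 0), (2, 0), (8, 1), (9, 1), (1.5, 0), (8.5, 1)]
tr1, tr2 = ds[:4], ds[2:]

# Step 2: train a classifier on each partition (conceptually on Node_i).
classifiers = [train_threshold(tr) for tr in (tr1, tr2)]

# Step 3: combine into the global classifier GC (conceptually on NodeZ).
gc = combine(classifiers)
```

In the Grid setting, steps 1–3 run on different hosts, so the sketch's function calls correspond to data transfers and remote executions coordinated by the services described next.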
To design such an application, a Knowledge Grid user interacts with the EPMS service, which provides a visual interface (see below) to compose a workflow (execution plan) describing at a high level the activities involved in the overall Data Mining computation.
Through the execution plan, computing, software, and data resources are specified, along with a set of requirements on them. In our example the user requires a set of n nodes providing the Learner software and a node providing the Combiner/Tester software, all of them satisfying given platform constraints and performance requirements. In addition, the execution plan includes information about how to coordinate the execution of all the steps, as outlined above. The execution plan is then processed by the RAEMS, which takes care of its allocation. In particular, it first finds appropriate resources matching the user requirements (i.e., a set of concrete hosts Node1, …, Noden offering the software L, and a host NodeZ providing the CT software), using the KDS services. Next, it manages the execution of the overall application, enforcing dependencies among the data extraction, transfer, and mining steps, as specified in the execution plan. The operations of data extraction and transfer are performed at a lower level by invoking the DAS services. We observe that, where needed, the RAEMS may perform software staging by means of the TAAS service.
Finally, the RAEMS manages the retrieval of results (i.e., the transfer of the global classifier GC to the user host) and visualizes them using the RPS facilities.
Implementation
A software environment that implements the main components of the Knowledge Grid, comprising services and functionalities ranging from information and discovery services to visual design and execution facilities, is VEGA (Visual Environment for Grid Applications) (Cannataro et al., 2002A).
The main goal of VEGA is to offer a set of visual functionalities that allow users to design applications starting from a view of the current Grid status (i.e., the available nodes and resources), composing the different stages constituting the applications inside a structured environment. The high-level features offered by VEGA are intended to provide the user with easy access to Grid facilities at a high level of abstraction, leaving her free to concentrate on the application design process. To fulfill this aim, VEGA builds a visual environment based on the component-framework concept, using and enhancing basic services offered by the Knowledge Grid and the Globus Toolkit.
Key concepts in the VEGA approach to the design of a Grid application are the visual language used to describe, in a component-like manner and through a graphical representation, the jobs constituting an application, and the possibility to group these jobs into workspaces to form specific interdependent stages. A consistency-checking module parses the model of the computation both while the design is in progress and prior to its execution, monitoring and driving user actions so as to obtain a correct and consistent graphical representation of the application. Together with the workspace concept, VEGA also makes available the virtual resource abstraction; thanks to these entities, it is possible to compose applications working on data processed or generated in previous phases, even if the execution has not yet been performed. VEGA includes an execution service, which gives the user the possibility to execute the designed application, monitor its status, and visualize results.
53.5 Summary
Parallel and Grid-based Data Mining are key technologies for enhancing the performance of knowledge discovery processes on large amounts of data. Parallel Data Mining is a mature area that has produced algorithms and techniques broadly integrated in Data Mining systems and suites. Today, parallel Data Mining systems and algorithms can be integrated as components of Grid-based systems to develop high-performance knowledge discovery applications.
This chapter introduced Data Mining techniques on parallel architectures, showing how large-scale Data Mining and knowledge discovery applications can achieve scalability by using the systems, tools, and performance offered by parallel processing systems. Some experiences and results in parallelizing Data Mining algorithms according to different approaches have also been reported.
To perform Data Mining on massive data sets distributed across multiple sites, knowledge discovery systems based on Grid infrastructures are emerging. The chapter discussed the main benefits coming from the use of Grid models and platforms in developing distributed knowledge discovery systems, analyzing some emerging Grid-based Data Mining systems. Parallel and Grid-based Data Mining will play an increasingly important role in data analysis and knowledge extraction in several application contexts. The Knowledge Grid environment, which we briefly described here, is a representative effort to build a Grid-based parallel and distributed knowledge discovery system for a wide set of high-performance distributed applications.
References
Agrawal G. High-level Interfaces and Abstractions for Grid-based Data Mining. Workshop on Data Mining and Exploration Middleware for Distributed and Grid Computing; 2003 September 18-19; Minneapolis, MN.
Agrawal R., Shafer J.C. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering 1996; 8:962-969.
Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Databases; 1994; Santiago, Chile.
Berman F. From TeraGrid to Knowledge Grid. Communications of the ACM 2001; 44(11):27-28.
Berry M.J.A., Linoff G. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: Wiley Computer Publishing, 1997.
Beynon M., Kurc T., Catalyurek U., Chang C., Sussman A., Saltz J. Distributed Processing of Very Large Datasets with DataCutter. Parallel Computing 2001; 27(11):1457-1478.
Bigus J.P. Data Mining with Neural Networks. New York: McGraw-Hill, 1996.
Bruynooghe M. Parallel Implementation of Fast Clustering Algorithms. Proceedings of the International Symposium on High Performance Computing; 1989 March 22-24; Montpellier, France. Elsevier Science, 1989; 65-78.
Cannataro M., Congiusta A., Talia D., Trunfio P. A Data Mining Toolset for Distributed High-performance Platforms. Proceedings of the International Conference on Data Mining Methods and Databases for Engineering; 2002 September 25-27; Bologna, Italy. Wessex Institute Press, 2002; 41-50.
Cannataro M., Talia D. The Knowledge Grid. Communications of the ACM 2003; 46(1):89-93.
Cannataro M., Talia D., Trunfio P. KNOWLEDGE GRID: High Performance Knowledge Discovery Services on the Grid. Proceedings of the 2nd International Workshop GRID 2001; 2001 November; Denver, CO. Springer-Verlag, 2001; LNCS 2242:38-50.
Cannataro M., Talia D., Trunfio P. Distributed Data Mining on the Grid. Future Generation Computer Systems 2002; 18(8):1101-1112.
Congiusta A., Talia D., Trunfio P. VEGA: A Visual Environment for Developing Complex Grid Applications. Proceedings of the First International Workshop on Knowledge Grid and Grid Intelligence (KGGI); 2003 October 13; Halifax, Canada.
Catlett C. The TeraGrid: a Primer, 2002.
Curcin V., Ghanem M., Guo Y., Kohler M., Rowe A., Syed J., Wendel P. Discovery Net: Towards a Grid of Knowledge Discovery. Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining; 2002 July 23-26; Edmonton, Canada.
Foster I., Kesselman C., Nick J., Tuecke S. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, 2002.
Foti D., Lipari D., Pizzuti C., Talia D. Scalable Parallel Clustering for Data Mining on Multicomputers. Proceedings of the 3rd International Workshop on High Performance Data Mining; 2000; Cancun. Springer-Verlag, 2000; LNCS 1800:390-398.
Freitas A.A., Lavington S.H. Mining Very Large Databases with Parallel Processing. Boston: Kluwer Academic Publishers, 1998.
Giannadakis N., Rowe A., Ghanem M., Guo Y. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences 2003; 155:199-226.
Han E.H., Karypis G., Kumar V. Scalable Parallel Data Mining for Association Rules. IEEE Transactions on Knowledge and Data Engineering 2000; 12(2):337-352.
Hinke T., Novotny J. Data Mining on NASA's Information Power Grid. Proceedings of the 9th International Symposium on High Performance Distributed Computing; 2000 August 1-4; Pittsburgh, PA.
Johnston W.E. Computational and Data Grids in Large-Scale Science and Engineering. Future Generation Computer Systems 2002; 18(8):1085-1100.
Judd D., McKinley K., Jain A.K. Large-Scale Parallel Data Clustering. Proceedings of the International Conference on Pattern Recognition; 1996; Vienna.
Kargupta H., Chan P. (Eds.) Advances in Distributed and Parallel Knowledge Discovery. Boston: AAAI/MIT Press, 2000.
Kufrin R. Generating C4.5 Production Rules in Parallel. Proceedings of the 14th National Conference on Artificial Intelligence; AAAI Press, 1997.
Li X., Fang Z. Parallel Clustering Algorithms. Parallel Computing 1989; 11:275-290.
Moore R.W. Knowledge-Based Grids: Two Use Cases. GGF-3 Meeting, 2001.
Neri F., Giordana A. A Parallel Genetic Algorithm for Concept Learning. Proceedings of the 6th International Conference on Genetic Algorithms; 1995 July 15-19; Pittsburgh, PA. Morgan Kaufmann, 1995; 436-443.
Olson C.F. Parallel Algorithms for Hierarchical Clustering. Parallel Computing 1995; 21:1313-1325.
Pearson R.A. A Coarse-grained Parallel Induction Heuristic. In Parallel Processing for Artificial Intelligence 2, H. Kitano, V. Kumar, C.B. Suttner, eds. Elsevier Science, 1994.
Prodromidis A.L., Chan P.K., Stolfo S.J. Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In Advances in Distributed and Parallel Knowledge Discovery, H. Kargupta, P. Chan, eds. AAAI Press, 2000.
Shafer J., Agrawal R., Mehta M. SPRINT: A Scalable Parallel Classifier for Data Mining. Proceedings of the 22nd International Conference on Very Large Databases; 1996; Bombay.
Skillicorn D. Strategies for Parallel Data Mining. IEEE Concurrency 1999; 7(4):26-35.
Skillicorn D., Talia D. Mining Large Data Sets on Grids: Issues and Prospects. Computing and Informatics 2002; 21:347-362.
Witten I.H., Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
Zaki M.J. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency 1999; 7(4):14-25.
Collaborative Data Mining
Steve Moyle
Oxford University Computing Laboratory
Summary. Collaborative Data Mining is a setting where the Data Mining effort is distributed to multiple collaborating agents, human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem that are considered better, by some metric, than the solutions that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well-defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.
Key words: Collaborative Data Mining, CRISP-DM, ROC
54.1 Introduction
Data Mining is about solving problems using data (Witten and Frank, 2000), and as such it is normally a creative activity leveraging human intelligence. This is similar to the spirit and practices of scientific discovery (Bacon, 1994, Popper, 1977, Kuhn, 1970), which utilize many techniques, including induction, abduction, hunches, and clever guessing, to propose hypotheses that aid in understanding the problem and finally lead to a solution. Collaboration is the act of working together with one or more people in order to achieve something (Soukhanov, 2001). Collaboration in intelligence-intensive activities may lead to better results. However, collaboration brings its own difficulties, including communication and coordination issues, as well as cultural and social difficulties. Some of these difficulties can be analyzed by the e-Collaboration Space model (McKenzie and van Winkelen, 2001).
Data Mining projects benefit from a rigorous process and methodology (Adriaans and Zantinge, 1996, Fayyad et al., 1996, Chapman et al., 2000). For collaborative Data Mining, such processes need to be embedded in a broader set of processes that support the collaboration.
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_54, © Springer Science+Business Media, LLC 2010