The main goal of the Discovery Net project is to design, develop, and implement an infrastructure that effectively supports scientific knowledge discovery processes from high-throughput informatics. In this context, a series of testbeds and demonstrations are being carried out, applying the technology in the areas of life sciences, environmental modeling, and geo-hazard prediction.
The building blocks in Discovery Net are the so-called Knowledge Discovery Services (KDS), which are distinguished into Computation Services and Data Services. The former typically comprise algorithms, e.g., for data preparation and Data Mining, while the latter define relational tables (as queries) and other data sources. Both kinds of services are described (and registered) by means of Adapters, which provide information such as input and output types, parameters, location and/or platform and operating-system constraints, factories (objects allowing clients to retrieve references to services and to download them), keywords, and a human-readable description. KDS are used to compose moderately complex data-pipelined processes. The composition may be carried out by means of a GUI that provides access to a library of services. The XML-based language used to describe processes is called the Discovery Process Markup Language (DPML). Each composed process can be deployed and published as a new process. Typically, process descriptions are not bound to specific servers, since the actual resources are later resolved by lookup servers (see below).
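The adapter-based composition described above can be illustrated with a short sketch. All names here (the `Adapter` and `Process` classes, the type-checking rule) are assumptions made for illustration, not Discovery Net's actual API; the point is only to show how adapter metadata, such as input and output types, makes pipelined composition checkable before execution:

```python
# Hypothetical sketch of KDS-style composition: services are registered
# through adapters carrying metadata, and a pipeline is type-checked
# as it is assembled. Illustrative only; not the Discovery Net API.

class Adapter:
    """Describes a registered service: name, I/O types, and the callable."""
    def __init__(self, name, input_type, output_type, run):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type
        self.run = run  # the callable implementing the service

class Process:
    """A linear, data-pipelined process built from compatible services."""
    def __init__(self):
        self.stages = []

    def add(self, adapter):
        # Composition check: the output type of the previous stage
        # must match the input type of the stage being added.
        if self.stages and self.stages[-1].output_type != adapter.input_type:
            raise TypeError(f"cannot pipe {self.stages[-1].name} into {adapter.name}")
        self.stages.append(adapter)
        return self

    def execute(self, data):
        # Run each stage in order, feeding each output to the next stage.
        for stage in self.stages:
            data = stage.run(data)
        return data

# Example: a data-preparation service feeding a (toy) mining service.
clean = Adapter("clean", "table", "table", lambda rows: [r for r in rows if r is not None])
count = Adapter("count", "table", "model", lambda rows: {"n": len(rows)})

pipeline = Process().add(clean).add(count)
```

In a real system, the adapter metadata would also carry parameters, location, and platform constraints, and the pipeline description would be serialized (in Discovery Net's case, as DPML) rather than built directly in code.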
Discovery Net is based on an open architecture using common protocols and infrastructures such as the Globus Toolkit. Servers are distinguished into (i) Knowledge Servers, allowing storage and retrieval of knowledge (meant as raw data and knowledge models) and processes; (ii) Resource Discovery Servers, providing a knowledge base of service definitions and performing resource resolution; and (iii) Discovery Meta-Information Servers, used to store information about the Knowledge Schema, i.e., the sets of features of known databases, their types, and how they can be composed with each other.
Finally, we outline here some interesting Data Mining testbeds developed at the National Center for Data Mining (NCDM) at the University of Illinois at Chicago (UIC) (www.ncdm.uic.edu/testbeds.htm):
• The Terra Wide Data Mining Testbed (TWDM). TWDM is an infrastructure for the remote analysis, distributed mining, and real-time exploration of scientific, engineering, business, and other complex data. It consists of five geographically distributed nodes linked by optical networks through StarLight (an advanced optical infrastructure) in Chicago. These sites include StarLight itself, the Laboratory for Advanced Computing at UIC, SARA in Amsterdam, and Dalhousie University in Halifax. In 2003 new sites will be connected, including Imperial College in London. A central idea in TWDM is to keep generated predictive models up to date with respect to newly available data, in order to achieve better predictions (an important aspect in many "critical" domains, such as infectious disease tracking). TWDM is based on DataSpace, another NCDM project for supporting real-time streaming data; in DataSpace, the Data Transformation Markup Language (DTML) is used to describe how to update "profiles", i.e., aggregate data that are inputs of predictive models, on the basis of new "events", i.e., new bits of information.
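The profile/event mechanism can be sketched in a few lines. The `Profile` class and its running-mean aggregate below are assumptions chosen for illustration and do not reproduce actual DTML semantics; they only show why incremental profile updates suit streaming data:

```python
# Illustrative sketch of the DataSpace idea: a "profile" is an aggregate
# feeding a predictive model, kept up to date incrementally as "events"
# (new bits of information) arrive, without re-scanning historical data.

class Profile:
    """Running aggregate (count and mean) over a stream of numeric events."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, event):
        # Incremental update: constant work per event, which is the point
        # of profile-based processing of real-time streams.
        self.count += 1
        self.total += event

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

# Feed a small stream of events into the profile.
profile = Profile()
for event in [4.0, 6.0, 8.0]:
    profile.update(event)
```

A predictive model would then read the profile's current values (here `count` and `mean`) as its inputs, rather than the raw event history.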
• The Terabyte Challenge Testbed. The Terabyte Challenge Testbed is an open, distributed testbed for DataSpace tools, services, and protocols. It involves a number of organizations, including the University of Illinois at Chicago, the University of Pennsylvania, the University of California at Davis, and Imperial College. The testbed consists of ten sites distributed over three continents, connected by high-performance links. Each site provides a number of local clusters of workstations, which are connected to form wide-area meta-clusters maintained by the National Scalable Cluster Project. So far, meta-clusters have been used by applications in high-energy physics, computational chemistry, nonlinear simulation, bioinformatics, medical imaging, network traffic analysis, digital libraries of video data, etc. Currently, the Terabyte Challenge Testbed consists of approximately 100 nodes and 2 terabytes of disk storage.
• The Global Discovery Network (GDN). The GDN is a collaboration between the Laboratory for Advanced Computing of the National Center for Data Mining and the Discovery Net project (see above). It will link Discovery Net to the Terra Wide Data Mining Testbed to create a combined global testbed with a critical mass of data.
The GridMiner project at the University of Vienna aims to cover the main aspects of knowledge discovery on Grids. GridMiner is a model based on the OGSA framework (Foster et al., 2002) and embraces an open architecture in which a set of services is defined for handling data distribution and heterogeneity, supporting different types of analysis strategies, tools, and algorithms, and providing OLAP support. Key components in GridMiner are the Data Access service, the Data Mediation service, and the Data Mining service. Data Access implements access to databases and data repositories; Data Mediation provides a view of distributed data by logically integrating them into virtual data sources (VDS), allowing queries to be sent to them and the results to be combined and delivered back. The Data Mining layer comprises a set of specific services used to prepare and execute a Data Mining application, as well as to present its results. The system has not yet been implemented on a Grid; a preliminary, fully centralized version of the system is currently available.
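The mediation idea, a virtual data source that fans a query out to several physical sources and combines the results, can be sketched as follows. The class and method names are invented for this example; GridMiner's actual service interfaces differ:

```python
# Hedged sketch of data mediation: a virtual data source (VDS) logically
# integrates several physical sources behind one query interface.
# Illustrative only; not GridMiner's real API.

class ListSource:
    """Toy physical source backed by an in-memory list of records."""
    def __init__(self, records):
        self.records = records

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]

class VirtualDataSource:
    """Logical integration of several sources behind one query() call."""
    def __init__(self, sources):
        # Each source is any object exposing query(predicate) -> list.
        self.sources = sources

    def query(self, predicate):
        # Fan the query out to every underlying source and merge results.
        results = []
        for source in self.sources:
            results.extend(source.query(predicate))
        return results

# Two "distributed" sources integrated into one VDS.
vds = VirtualDataSource([ListSource([1, 5, 9]), ListSource([2, 6])])
large = vds.query(lambda x: x > 4)
```

A real mediation service would additionally handle schema heterogeneity, query translation, and remote transport, which this sketch deliberately omits.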
GATES (Grid-based AdapTive Execution on Streams) is an OGSA-based system that provides support for processing data streams in a Grid environment (Agrawal, 2003). GATES aims to support the distributed analysis of data streams arising from distributed sources (e.g., data from large-scale experiments and simulations), providing automatic resource discovery and an interface for enabling self-adaptation to meet real-time constraints.
Some of the systems discussed above support specific application domains; others support a more general class of problems. Moreover, some of these systems are mainly advanced interfaces for integrating, accessing, and elaborating large datasets, whereas others provide more specific functionalities for the support of typical knowledge discovery processes.
In the next section we present a Grid-based environment, named Knowledge Grid, whose aim is to support general PDKD applications, providing an interface both to manage and access large remote data sets, and to execute high-performance data analysis on them
53.4 The Knowledge Grid
The Knowledge Grid (Cannataro and Talia, 2003) is an environment providing knowledge discovery services for a wide range of high-performance distributed applications. Data sets and Data Mining and analysis tools used in such applications are increasingly becoming available as stand-alone packages and as remote services on the Internet. Examples include gene and DNA databases, network access and intrusion data, drug features and effects data repositories, astronomy data files, and data about web usage, content, and structure.
Knowledge discovery procedures in all these applications typically require the creation and management of complex, dynamic, multi-step workflows. At each step, data from various sources can be moved, filtered, integrated, and fed into a Data Mining tool. Based on the output results, the analyst chooses which other data sets and mining components should be integrated into the workflow, or how to iterate the process to obtain a knowledge model. Workflows are mapped onto a Grid by assigning their nodes to Grid hosts and using interconnections to implement communication among the workflow nodes.
The Knowledge Grid supports such activities by providing mechanisms and high-level services for searching resources; for representing, creating, and managing knowledge discovery processes; and for composing existing data services and Data Mining services in a structured manner, allowing designers to plan, store, document, verify, share, and re-execute their workflows, as well as manage their output results.
The Knowledge Grid architecture is composed of a set of services divided into two layers: the Core K-Grid layer, which interfaces the basic and generic Grid middleware services, and the High-level K-Grid layer, which interfaces the user by offering a set of services for the design and execution of knowledge discovery applications. Both layers make use of repositories that provide information about resource metadata, execution plans, and the knowledge obtained as a result of knowledge discovery applications.
Fig. 53.1 Main steps of application composition and execution in the Knowledge Grid
In the Knowledge Grid environment, discovery processes are represented as workflows that a user may compose using both concrete and abstract Grid resources. Knowledge discovery workflows are defined using a visual interface that shows resources (data, tools, and hosts) to the user and offers mechanisms for integrating them into a workflow. Information about single resources and workflows is stored using an XML-based notation that represents a workflow (called an execution plan in Knowledge Grid terminology) as a data-flow graph of nodes, each representing either a Data Mining service or a data-transfer service. The XML representation allows the workflows for discovery processes to be easily validated, shared, translated into executable scripts, and stored for future executions. Figure 53.1 shows the main steps of the composition and execution of a knowledge discovery application on the Knowledge Grid.
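As a rough illustration of such a notation, the following sketch serializes a tiny workflow, a data-flow graph of nodes and edges, to XML. The element and attribute names are invented for the example and do not reproduce the Knowledge Grid's actual execution-plan schema:

```python
# Minimal sketch of an execution plan as a data-flow graph: nodes are
# either Data Mining services or data-transfer services, and edges
# carry data between them. Schema names are illustrative assumptions.

import xml.etree.ElementTree as ET

def execution_plan(nodes, edges):
    """Serialize a workflow (nodes + data-flow edges) to a small XML plan."""
    plan = ET.Element("ExecutionPlan")
    for name, kind, host in nodes:
        # Each node records the service name, its kind, and a target host.
        ET.SubElement(plan, "Node", name=name, type=kind, host=host)
    for src, dst in edges:
        # "from" is a Python keyword, so pass the attributes as a dict.
        ET.SubElement(plan, "DataFlow", attrib={"from": src, "to": dst})
    return ET.tostring(plan, encoding="unicode")

xml_plan = execution_plan(
    nodes=[("extract", "data-transfer", "NodeA"),
           ("mine", "data-mining", "Node1")],
    edges=[("extract", "mine")],
)
```

Because the plan is plain XML, it can be validated against a schema, shared between users, or translated into executable scripts, which is the motivation the text gives for the XML representation.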
53.4.1 Knowledge Grid Components and Tools
Figure 53.2 shows the general structure of the Knowledge Grid system and its main components and interaction patterns.
Fig. 53.2 The Knowledge Grid general structure and components
The High-level K-Grid layer includes services used to compose, validate, and execute a parallel and distributed knowledge discovery computation. Moreover, the layer offers services to store and analyze the discovered knowledge. The main services of the High-level K-Grid layer are:
• The Data Access Service (DAS) allows for the search, selection, transfer, transformation, and delivery of data to be mined.
• The Tools and Algorithms Access Service (TAAS) is responsible for searching, selecting, and downloading Data Mining tools and algorithms.
• The Execution Plan Management Service (EPMS). An execution plan is represented by a graph describing interactions and data flows among data sources, extraction tools, Data Mining tools, and visualization tools. The EPMS allows for defining the structure of an application by building the corresponding graph and adding a set of constraints on resources. Generated execution plans are stored, through the RAEMS, in the Knowledge Execution Plan Repository (KEPR).
• The Results Presentation Service (RPS) offers facilities for presenting and visualizing the extracted knowledge models (e.g., association rules, clustering models, classifications).
The Core K-Grid layer includes two main services:
• The Knowledge Directory Service (KDS) manages metadata describing Knowledge Grid resources. Such resources comprise hosts; repositories of data to be mined; tools and algorithms used to extract, analyze, and manipulate data; distributed knowledge discovery execution plans; and knowledge obtained as a result of the mining process. The metadata information is represented by XML documents stored in a Knowledge Metadata Repository (KMR).
• The Resource Allocation and Execution Management Service (RAEMS) is used to find a suitable mapping between an "abstract" execution plan (formalized in XML) and the available resources, with the goal of satisfying the constraints (computing power, storage, memory, database, network performance) imposed by the execution plan. After the execution plan is activated, this service manages and coordinates the application execution and the storing of knowledge results in the Knowledge Base Repository (KBR).
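The matching step performed by the RAEMS can be sketched as a constraint check against host metadata. The constraint vocabulary below (`cpus`, `memory_gb`, `software`) is an assumption made for illustration; the real service consults KDS metadata and handles a richer set of constraints:

```python
# Illustrative sketch of the RAEMS mapping step: match an abstract
# execution plan's resource requirements against metadata describing
# the available hosts. Field names are illustrative assumptions.

def match_hosts(requirements, hosts):
    """Return the hosts whose metadata satisfies every requirement."""
    matched = []
    for host, meta in hosts.items():
        if (meta["cpus"] >= requirements["cpus"]
                and meta["memory_gb"] >= requirements["memory_gb"]
                and requirements["software"] in meta["software"]):
            matched.append(host)
    return matched

# Hypothetical host metadata, as a directory service might supply it.
hosts = {
    "Node1": {"cpus": 8, "memory_gb": 16, "software": {"Learner"}},
    "Node2": {"cpus": 2, "memory_gb": 4, "software": {"Learner"}},
    "NodeZ": {"cpus": 8, "memory_gb": 32, "software": {"CombinerTester"}},
}

# Find hosts able to run the Learner under the given constraints.
candidates = match_hosts({"cpus": 4, "memory_gb": 8, "software": "Learner"}, hosts)
```

In the full system, this matching would be followed by scheduling, data and software staging, and coordination of the executing steps, as described above.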
An Application Scenario
We discuss here a simple meta-learning process over the Knowledge Grid, to show how the execution of a distributed Data Mining application can benefit from the Knowledge Grid services (Cannataro et al., 2002B). Meta-learning aims to generate a number of independent classifiers by applying learning programs to a collection of distributed data sets in parallel. The classifiers computed by the learning programs are then collected and combined to obtain a global classifier (Prodromidis et al., 2000).
Figure 53.3 shows a distributed meta-learning scenario, in which a global classifier GC is obtained on NodeZ starting from the original data set DS stored on NodeA.
Fig. 53.3 A distributed meta-learning scenario
This process can be described through the following steps:
1. On NodeA, the training sets TR1, …, TRn, the testing set TS, and the validation set VS are extracted from DS by the partitioner P. Then TR1, …, TRn, TS, and VS are moved from NodeA to Node1, …, Noden, and to NodeZ, respectively.
2. On each Nodei (i = 1, …, n), the classifier Ci is trained from TRi by the learner Li. Then each Ci is moved from Nodei to NodeZ.
3. On NodeZ, the classifiers C1, …, Cn are combined and tested on TS and validated on VS by the combiner/tester CT to produce the global classifier GC.
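The three steps above can be sketched, in a purely local and much-simplified form, as follows. The threshold learners and the majority-vote combiner are stand-ins for real learning programs and combination strategies, and the "nodes" are only conceptual here:

```python
# Stdlib-only sketch of the meta-learning scenario: partition a data
# set, train one simple classifier per partition, then combine them by
# majority vote into a global classifier. Illustrative stand-ins only.

def train_threshold(training_set):
    """Learner L_i: use the midpoint between class means as a threshold."""
    pos = [x for x, y in training_set if y == 1]
    neg = [x for x, y in training_set if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0

def combine(classifiers):
    """Combiner CT: majority vote over the local classifiers."""
    def global_classifier(x):
        votes = sum(c(x) for c in classifiers)
        return 1 if votes > len(classifiers) / 2 else 0
    return global_classifier

# Step 1: partition the data set DS into training sets TR1..TRn.
ds = [(1, 0), (2, 0), (8, 1), (9, 1), (1.5, 0), (8.5, 1)]
tr1, tr2 = ds[:4], ds[2:]

# Step 2: train a classifier on each partition (conceptually on Node_i).
classifiers = [train_threshold(tr) for tr in (tr1, tr2)]

# Step 3: combine into the global classifier GC (conceptually on NodeZ).
gc = combine(classifiers)
```

In the Grid setting, steps 1–3 run on different hosts, so the sketch's function calls correspond to data transfers and remote executions coordinated by the services described next.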
To design such an application, a Knowledge Grid user interacts with the EPMS service, which provides a visual interface (see below) to compose a workflow (execution plan) describing at a high level the activities involved in the overall Data Mining computation.
Through the execution plan, computing, software, and data resources are specified, along with a set of requirements on them. In our example the user requires a set of n nodes providing the Learner software and a node providing the Combiner/Tester software, all of them satisfying given platform constraints and performance requirements. In addition, the execution plan includes information about how to coordinate the execution of all the steps, as outlined above. The execution plan is then processed by the RAEMS, which takes care of its allocation. In particular, it first finds appropriate resources matching the user requirements (i.e., a set of concrete hosts Node1, …, Noden offering the software L, and a host NodeZ providing the CT software), using the KDS services. Next, it manages the execution of the overall application, enforcing dependencies among the data extraction, transfer, and mining steps, as specified in the execution plan. The operations of data extraction and transfer are performed at a lower level by invoking the DAS services. We observe that, where needed, the RAEMS may perform software staging by means of the TAAS service.
Finally, the RAEMS manages the retrieval of results (i.e., the transfer of the global classifier GC to the user host) and visualizes them using the RPS facilities.
Implementation
A software environment that implements the main components of the Knowledge Grid, comprising services and functionalities ranging from information and discovery services to visual design and execution facilities, is VEGA (Visual Environment for Grid Applications) (Cannataro et al., 2002A).
The main goal of VEGA is to offer a set of visual functionalities that allow users to design applications starting from a view of the current Grid status (i.e., the available nodes and resources), composing the different stages constituting the applications inside a structured environment. The high-level features offered by VEGA are intended to provide the user with easy access to Grid facilities at a high level of abstraction, leaving her free to concentrate on the application design process. To fulfill this aim, VEGA builds a visual environment based on the component-framework concept, using and enhancing basic services offered by the Knowledge Grid and the Globus Toolkit.
Key concepts in the VEGA approach to the design of a Grid application are the visual language used to describe, in a component-like manner and through a graphical representation, the jobs constituting an application, and the possibility to group these jobs into workspaces to form specific interdependent stages. A consistency-checking module parses the model of the computation both while the design is in progress and prior to its execution, monitoring and driving user actions so as to obtain a correct and consistent graphical representation of the application. Together with the workspace concept, VEGA also makes available the virtual resource abstraction; thanks to these entities, it is possible to compose applications working on data processed or generated in previous phases, even if the execution has not yet been performed. VEGA includes an execution service, which gives the user the possibility to execute the designed application, monitor its status, and visualize results.
53.5 Summary
Parallel and Grid-based Data Mining are key technologies for enhancing the performance of knowledge discovery processes on large amounts of data. Parallel Data Mining is a mature area that has produced algorithms and techniques broadly integrated in Data Mining systems and suites. Today, parallel Data Mining systems and algorithms can be integrated as components of Grid-based systems to develop high-performance knowledge discovery applications.
This chapter introduced Data Mining techniques on parallel architectures, showing how large-scale Data Mining and knowledge discovery applications can achieve scalability by using the systems, tools, and performance offered by parallel processing systems. Some experiences and results in parallelizing Data Mining algorithms according to different approaches have also been reported.
To perform Data Mining on massive data sets distributed across multiple sites, knowledge discovery systems based on Grid infrastructures are emerging. The chapter discussed the main benefits coming from the use of Grid models and platforms in developing distributed knowledge discovery systems, analyzing some emerging Grid-based Data Mining systems. Parallel and Grid-based Data Mining will play an increasingly important role in data analysis and knowledge extraction in several application contexts. The Knowledge Grid environment, which we briefly described here, is a representative effort to build a Grid-based parallel and distributed knowledge discovery system for a wide set of high-performance distributed applications.
References
Agrawal G. High-level Interfaces and Abstractions for Grid-based Data Mining. Workshop on Data Mining and Exploration Middleware for Distributed and Grid Computing; 2003 September 18-19; Minneapolis, MN.
Agrawal R., Shafer J.C. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering 1996; 8:962-969.
Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Databases; 1994; Santiago, Chile.
Berman F. From TeraGrid to Knowledge Grid. Communications of the ACM 2001; 44(11):27-28.
Berry M.J.A., Linoff G. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: Wiley Computer Publishing, 1997.
Beynon M., Kurc T., Catalyurek U., Chang C., Sussman A., Saltz J. Distributed Processing of Very Large Datasets with DataCutter. Parallel Computing 2001; 27(11):1457-1478.
Bigus J.P. Data Mining with Neural Networks. New York: McGraw-Hill, 1996.
Bruynooghe M. Parallel Implementation of Fast Clustering Algorithms. Proceedings of the International Symposium on High Performance Computing; 1989 March 22-24; Montpellier, France. Elsevier Science, 1989; 65-78.
Cannataro M., Congiusta A., Talia D., Trunfio P. A Data Mining Toolset for Distributed High-performance Platforms. Proceedings of the International Conference on Data Mining Methods and Databases for Engineering; 2002 September 25-27; Bologna, Italy. Wessex Institute Press, 2002; 41-50.
Cannataro M., Talia D. The Knowledge Grid. Communications of the ACM 2003; 46(1):89-93.
Cannataro M., Talia D., Trunfio P. KNOWLEDGE GRID: High Performance Knowledge Discovery Services on the Grid. Proceedings of the 2nd International Workshop GRID 2001; 2001 November; Denver, CO. Springer-Verlag, 2001; LNCS 2242:38-50.
Cannataro M., Talia D., Trunfio P. Distributed Data Mining on the Grid. Future Generation Computer Systems 2002; 18(8):1101-1112.
Congiusta A., Talia D., Trunfio P. VEGA: A Visual Environment for Developing Complex Grid Applications. Proceedings of the First International Workshop on Knowledge Grid and Grid Intelligence (KGGI); 2003 October 13; Halifax, Canada.
Catlett C. The TeraGrid: a Primer, 2002.
Curcin V., Ghanem M., Guo Y., Kohler M., Rowe A., Syed J., Wendel P. Discovery Net: Towards a Grid of Knowledge Discovery. Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining; 2002 July 23-26; Edmonton, Canada.
Foster I., Kesselman C., Nick J., Tuecke S. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, 2002.
Foti D., Lipari D., Pizzuti C., Talia D. Scalable Parallel Clustering for Data Mining on Multicomputers. Proceedings of the 3rd International Workshop on High Performance Data Mining; 2000; Cancun. Springer-Verlag, 2000; LNCS 1800:390-398.
Freitas A.A., Lavington S.H. Mining Very Large Databases with Parallel Processing. Boston: Kluwer Academic Publishers, 1998.
Giannadakis N., Rowe A., Ghanem M., Guo Y. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences 2003; 155:199-226.
Han E.H., Karypis G., Kumar V. Scalable Parallel Data Mining for Association Rules. IEEE Transactions on Knowledge and Data Engineering 2000; 12(2):337-352.
Hinke T., Novotny J. Data Mining on NASA's Information Power Grid. Proceedings of the 9th International Symposium on High Performance Distributed Computing; 2000 August 1-4; Pittsburgh, PA.
Johnston W.E. Computational and Data Grids in Large-Scale Science and Engineering. Future Generation Computer Systems 2002; 18(8):1085-1100.
Judd D., McKinley K., Jain A.K. Large-Scale Parallel Data Clustering. Proceedings of the International Conference on Pattern Recognition; 1996; Vienna.
Kargupta H., Chan P. (Eds.) Advances in Distributed and Parallel Knowledge Discovery. Boston: AAAI/MIT Press, 2000.
Kufrin R. Generating C4.5 Production Rules in Parallel. Proceedings of the 14th National Conference on Artificial Intelligence; AAAI Press, 1997.
Li X., Fang Z. Parallel Clustering Algorithms. Parallel Computing 1989; 11:275-290.
Moore R.W. Knowledge-Based Grids: Two Use Cases. GGF-3 Meeting, 2001.
Neri F., Giordana A. A Parallel Genetic Algorithm for Concept Learning. Proceedings of the 6th International Conference on Genetic Algorithms; 1995 July 15-19; Pittsburgh, PA. Morgan Kaufmann, 1995; 436-443.
Olson C.F. Parallel Algorithms for Hierarchical Clustering. Parallel Computing 1995; 21:1313-1325.
Pearson R.A. A Coarse-grained Parallel Induction Heuristic. In Parallel Processing for Artificial Intelligence 2, H. Kitano, V. Kumar, C.B. Suttner, eds. Elsevier Science, 1994.
Prodromidis A.L., Chan P.K., Stolfo S.J. Meta-Learning in Distributed Data Mining Systems: Issues and Approaches. In Advances in Distributed and Parallel Knowledge Discovery, H. Kargupta, P. Chan, eds. AAAI Press, 2000.
Shafer J., Agrawal R., Mehta M. SPRINT: A Scalable Parallel Classifier for Data Mining. Proceedings of the 22nd International Conference on Very Large Databases; 1996; Bombay.
Skillicorn D. Strategies for Parallel Data Mining. IEEE Concurrency 1999; 7(4):26-35.
Skillicorn D., Talia D. Mining Large Data Sets on Grids: Issues and Prospects. Computing and Informatics 2002; 21:347-362.
Witten I.H., Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
Zaki M.J. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency 1999; 7(4):14-25.
Collaborative Data Mining
Steve Moyle
Oxford University Computing Laboratory
Summary. Collaborative Data Mining is a setting where the Data Mining effort is distributed to multiple collaborating agents, human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem that are considered better, by some metric, than the solutions that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well-defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.
Key words: Collaborative Data Mining, CRISP-DM, ROC
54.1 Introduction
Data Mining is about solving problems using data (Witten and Frank, 2000), and as such it is normally a creative activity leveraging human intelligence. This is similar to the spirit and practices of scientific discovery (Bacon, 1994, Popper, 1977, Kuhn, 1970), which utilize many techniques, including induction, abduction, hunches, and clever guessing, to propose hypotheses that aid in understanding the problem and finally lead to a solution. Collaboration is the act of working together with one or more people in order to achieve something (Soukhanov, 2001). Collaboration in intelligence-intensive activities may lead to better results. However, collaboration brings its own difficulties, including communication and coordination issues, as well as cultural and social difficulties. Some of these difficulties can be analyzed by the e-Collaboration Space model (McKenzie and van Winkelen, 2001).
Data Mining projects benefit from a rigorous process and methodology (Adriaans and Zantinge, 1996, Fayyad et al., 1996, Chapman et al., 2000). For collaborative Data Mining, such processes need to be embedded in a broader set of processes that support the collaboration.
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_54, © Springer Science+Business Media, LLC 2010