Data Mining benefits from well defined and strong processes. Such processes include, for instance, clear model evaluation procedures (Blockeel and Moyle, 2002).
Different perspectives exist on what collaborative Data Mining is (this is discussed further in Section 54.5). Three interpretations are: 1) multiple software agents applying Data Mining algorithms to solve the same problem; 2) humans using modern collaboration techniques to apply Data Mining to a single, defined problem; 3) Data Mining the artifacts of human collaboration. This chapter focuses solely on the second item, that of humans using collaboration techniques to apply Data Mining to a single task. With sufficient definition of a particular Data Mining problem, this is similar to a multiple software agent Data Mining framework (the first item), although that is not the aim of the chapter. Many of the difficulties encountered in human collaboration will also be encountered in designing a system for software agent collaboration.

Collaborative Data Mining aims to combine the results generated by isolated experts, by enabling the collaboration of geographically dispersed laboratories and companies. For each Data Mining problem, a virtual team of experts is selected on the basis of adequacy and availability. Experts apply their methods to solving the problem, but also communicate with each other to share their growing understanding of the problem. It is here that collaboration is key.
The process of analyzing data through models has many similarities to experimental research. Like the process of scientific discovery, Data Mining can benefit from different techniques used by multiple researchers who collaborate, compete, and compare results to improve their combined understanding.

The rest of this chapter is organized as follows. The potential difficulties in (remote) collaboration and a framework for analyzing such difficulties are outlined. A standard Data Mining process is reviewed and studied for the contributions that can be achieved collaboratively. A collaboration process for Data Mining is presented, with clear guidelines to help practitioners avoid the potential pitfalls of collaborative Data Mining. Real examples of the application of collaborative Data Mining are briefly summarized. The chapter concludes with a discussion.
54.2 Remote Collaboration
This section considers the motivations behind (remote) collaboration1 and the types of collaboration it enables. It then reviews the framework proposed by McKenzie and van Winkelen (McKenzie and van Winkelen, 2001) for working within e-Collaboration Space. The term e-Collaboration will be used as shorthand for remote collaboration, but many of the principles can also be applied to local collaboration.
54.2.1 E-Collaboration: Motivations and Forms
The main motivation for collaboration (Moyle et al., 2003) is to harness dispersed expertise and to enable knowledge sharing and learning in a manner that builds intellectual capital (Edvinsson and Malone, 1997). This offers tantalizing potential rewards, including boosted innovation, flexible resource management, and reduced risk (Amara, 1990, Mowshowitz, 1997, Nohria and Eccles, 1993, Snow et al., 1996), but these rewards are offset by numerous difficulties, mainly due to the increased complexity of a virtual environment.
In (McKenzie and van Winkelen, 2001), seven distinct forms of e-collaborating organizations that can be distinguished either by their structure or by the intent behind their formation are identified. These are: 1) virtual/smart organizations; 2) a community of interest and practice; 3) a virtual enterprise; 4) virtual teams; 5) a community of creation; 6) collaborative product commerce or customer communities; and 7) virtual sourcing and resource coalitions. For collaborative Data Mining, forms 4 and 5 are the most relevant. These forms are summarized below.

1 The term "remote" is omitted in the sequel.
• Virtual Teams are temporary, culturally diverse, geographically dispersed work groups that communicate electronically. These can be smaller entities within virtual enterprises, or within a transnational organization. They are characterized by changing membership and multiple organizational contexts.
• A Community of Creation revolves around a central firm and shares its knowledge for the purpose of innovation. This structure consists of individuals and organizations with ever-changing boundaries.
Recognizing the collaboration form makes it possible to analyze the difficulties that might be encountered. Such an analysis can be performed with respect to the e-collaboration space model described in the next section.
54.2.2 E-Collaboration Space
Each type of e-collaboration form can be usefully analyzed with respect to McKenzie and van Winkelen's e-Collaboration Space model (McKenzie and van Winkelen, 2001). This model casts each form into the space by studying its location on three dimensions: number of boundaries crossed, task, and relationships.
• Boundaries crossed: The more boundaries that are crossed in e-collaboration, the more barriers to a successful outcome are present. All communication takes place across some boundary (Wilson, 2002). Fewer boundaries between agents lead to a lower risk of misunderstanding; in e-collaboration the number of boundaries is automatically increased. Influential boundaries to successful e-collaboration are: technological, temporal, organizational, and cultural.
• Task: The nature of the tasks involved in the collaborative project is influenced by the complexity of the processes, the uncertainty of the available information and outcomes, and the interdependence of the various stages of the task. Complexity can be broadly classified into linear (step-by-step) processes and non-linear ones. The interdependence of a task relates to whether it can be decomposed into subtasks that can be worked on independently by different participants.
• Relationships: Relationships are key to any successful collaboration. When electronic communication is the only mode of interaction, it is harder for relationships to form, because the instinctive reading of the signals that establish trust and mutual understanding is less accessible to participants.
For the remainder of the chapter, only the task dimension of the e-collaboration space model will be highlighted. As described in the next subsection, task complexity makes collaborative Data Mining risk prone.
54.2.3 Collaborative Data Mining in E-Collaboration Space
Different forms of e-collaboration, as measured relative to the dimensions of task, boundaries, and relationships, can be viewed as locations in a three-dimensional e-collaboration space. The location of a collaborative Data Mining project depends on the actual setting of such a project. The most well defined dimension with respect to the Data Mining process (see Section 54.3) is that of task.
The task complexity of Data Mining is high. Not only is a high level of expertise involved in a Data Mining project, but there is also the risk that, in reaching the final solution(s), much effort will appear in hindsight to have been wasted. Data miners have long understood the need for a methodology to support the Data Mining process (Adriaans and Zantinge, 1996, Fayyad et al., 1996, Chapman et al., 2000). All these methodologies are explicit that the Data Mining process is non-linear, and warn that information uncovered in later phases can invalidate assumptions made in earlier phases; as a result, previous phases may need to be revisited. To exacerbate the situation, Data Mining is by its very nature a speculative process: there may be no valuable information contained in the data sources at all, or the techniques being used may not have sufficient power to uncover it. A typical Data Mining project at the start of the collaboration is summarized with respect to the e-collaboration model in Table 54.1.
Table 54.1 The position of a dispersed collaborative Data Mining project in e-collaboration space († potential boundary, depending on situation)

Task:
- Complex, non-linear interdependencies
- Large uncertainty

Boundaries crossed:
- Medium technological
- Temporal†
- Geographical
- Organizational†
- Cultural†

Relationships:
- Medium commonality of view
- Medium duration of existing relationship
- Medium duration of collaboration
54.3 The Data Mining Process
Data Mining processes broadly consist of a number of phases. These phases, however, are interrelated and are not necessarily executed in a linear manner. For example, the results of one phase may uncover more detail relating to an earlier phase and may force more effort to be expended on a phase previously thought complete. The CRISP-DM methodology (CRoss Industry Standard Process for Data Mining; Chapman et al., 2000) is an attempt to standardise the process of Data Mining. In CRISP-DM, six interrelated phases are used to describe the Data Mining process: business understanding, data understanding, data preparation, modelling, evaluation, and deployment (Figure 54.1). The main outputs of the business understanding phase are the definitions of the business and Data Mining objectives, as well as the business and Data Mining evaluation criteria; in this phase an assessment of resource requirements and an estimation of risk are also performed. In the data understanding phase, data are collected and characterized, and data quality is assessed.
Fig. 54.1 The CRISP-DM cycle

During data preparation, tables, records, and attributes are selected and transformed for modelling. Modelling is the process of extracting input/output patterns from given data and deriving models, typically mathematical or logical models. In the modelling phase, various techniques (e.g. association rules, decision trees, logistic regression, k-means clustering) are selected and applied, and their parameters are calibrated, or tuned, to optimal values. Different models are compared, and possibly combined.
In the evaluation phase, models are selected and reviewed according to the business criteria, the whole Data Mining process is reviewed, and a list of possible actions is elaborated. In the last phase, deployment is planned, implemented, and monitored. The entire project is typically documented and summarized in a report.
The CRISP-DM handbook (Chapman et al., 2000) describes in detail how each of the main phases is subdivided into specific tasks, with clearly defined predecessors/successors and inputs/outputs.
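The interrelated, non-linear character of the phases can be illustrated with a toy sketch. The phase names come from CRISP-DM, but the revisit table below is illustrative only, not part of the standard:

```python
# Illustrative sketch of CRISP-DM's six phases and the back-transitions
# that make the process non-linear: findings in a later phase can force
# a return to an earlier one. The REVISITS table is a made-up example.

PHASES = [
    "business understanding", "data understanding", "data preparation",
    "modelling", "evaluation", "deployment",
]

# Example fall-backs: e.g. evaluation results may invalidate assumptions
# made during business understanding, forcing that phase to be revisited.
REVISITS = {
    "data understanding": "business understanding",
    "modelling": "data preparation",
    "evaluation": "business understanding",
}

def next_phase(current: str, problem_found: bool = False) -> str:
    """Advance linearly, or fall back to an earlier phase on trouble."""
    if problem_found and current in REVISITS:
        return REVISITS[current]
    i = PHASES.index(current)
    return PHASES[min(i + 1, len(PHASES) - 1)]
```

In this sketch, a smooth project walks the list left to right, while a problem uncovered during evaluation sends the team back to business understanding, exactly the kind of rework the methodology warns about.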
54.4 Collaborative Data Mining Guidelines
The CRISP-DM Data Mining process described in the preceding section can be adopted by Data Mining agents collaborating remotely on a particular Data Mining project (SolEuNet, 2002, Flach et al., 2003). Not all of the CRISP-DM methodology can be performed entirely in a collaborative setting: business understanding, for instance, requires intense, close contact with the business environment for which the Data Mining is being performed. The phases that can most easily be performed in a remote-collaborative fashion are data preparation and modelling, although the other phases can nevertheless benefit from a collaborative approach. Although many of the specific tasks can be carried out independently, care must be taken by the participants to ensure that efforts are not wasted. Principles to guide the process of collaboration should be established in advance of a collaborative Data Mining project. For instance, individual agents must communicate or share any intermediate results, or improvements in the current best understanding of the Data Mining problem, so that all agents have the new knowledge. Providing a catalogue of up-to-date knowledge about the problem assists new agents entering the Data Mining project. Furthermore, procedures are required for how results from different agents are compared, and ultimately combined, so that the value of the efforts is greater than the sum of the individual components.
54.4.1 Collaboration Principles
Moyle et al. (2003) present a framework for collaborative Data Mining, involving both principles and technological support. Collaborative groupware technology, with specific functionality to support Data Mining, is described in (Voss et al., 2001). Principles for collaborative Data Mining are outlined as follows (Moyle et al., 2003).
1. Requisite management. Sufficient management processes should be established. In particular, the definition and objectives of the Data Mining problem should be clear from the start of the project to all participants. An infrastructure ensuring information flows within the network of agents should be provided.

2. Problem solving freedom. Agents should use their expertise and tools to execute Data Mining tasks to solve the problem in the manner they find best.

3. Start any time. All the necessary information about the Data Mining problem should be captured and made available to participants at all times. This includes the problem definition, data, evaluation criteria, and any knowledge produced.

4. Stop any time. Participants should work on their solutions so that a working solution, however crude, is available whenever a stop signal is issued. These solutions will typically be Data Mining models. One approach is to try simpler modeling techniques first (Holte, 1993).

5. Online knowledge sharing. The knowledge about the Data Mining problem gained by each participant at each phase should be shared with all participants in a timely manner.

6. Security. Data and information about the Data Mining problem may contain sensitive information and must not be revealed outside the project. Access to information must be controlled.
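A minimal sketch of the shared problem catalogue implied by the "start any time" and "online knowledge sharing" principles might look as follows. The class and method names are hypothetical, not taken from any of the cited systems:

```python
import datetime

class ProblemCatalogue:
    """Hypothetical shared catalogue of a collaborative Data Mining
    project: it captures the problem definition, evaluation criteria,
    and any knowledge produced, so that a newly joining agent can be
    briefed at any time and all agents see each other's findings."""

    def __init__(self, problem_definition, evaluation_criteria):
        self.problem_definition = problem_definition
        self.evaluation_criteria = evaluation_criteria
        self.knowledge = []  # timestamped shared findings

    def share(self, agent, finding):
        """'Online knowledge sharing': record who found what, and when."""
        self.knowledge.append(
            (datetime.datetime.now(datetime.timezone.utc), agent, finding))

    def briefing(self):
        """Everything a newly joining agent needs, in arrival order."""
        return {
            "problem": self.problem_definition,
            "criteria": self.evaluation_criteria,
            "findings": [(a, f) for _, a, f in self.knowledge],
        }
```

A real system would add access control (principle 6) and persistent, networked storage; the sketch only shows the information flow the principles require.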
Having established a collaborative Data Mining project with appropriate principles and support, how can the results of the Data Mining efforts be compared and combined so that their value is maximized? This is the question the next section addresses.
54.4.2 Data Mining Model Evaluation and Combination
One of the main outputs of the Data Mining process (Chapman et al., 2000) is the set of Data Mining models. These may take many forms, including decision trees, rules, artificial neural networks, and regression equations (see (Mitchell, 1997) for an introduction to machine learning, and (Hair et al., 1998) for an introduction to statistics). Different agents may produce models in different forms, which requires methods for both evaluating and combining them.
When multiple agents produce multiple models as the result of a Data Mining effort, a process for evaluating their relative merits must be established. Such processes are well defined in Data Mining challenge problems (e.g. (Srinivasan et al., 1999, Page and Hatzis, 2001)); for example, a challenge recipe for the production of classificatory models can be found in (Moyle and Srinivasan, 2001). To ensure accurate comparisons, models built by different agents must be evaluated in exactly the same way, on the same data. This sounds like an obvious statement, but agents can easily make adjustments to their copy of the data to suit their particular approaches without making the changes available to the other agents, which makes model evaluation and comparison extremely difficult.
Furthermore, the evaluation criterion or criteria (there may be several) deemed most appropriate may change during the knowledge discovery process. For instance, at some point one may wish to redefine the data set on which models are evaluated (e.g. because it is found to contain outliers that make the evaluation procedure inaccurate) and re-evaluate previously built models. Blockeel and Moyle (2002) discuss how this evaluation and re-evaluation leads to significant extra effort for the different agents, and consequently is a barrier to the knowledge discovery process unless adequate software support is provided.
One approach to controlling model evaluation is to centralize the process. Consider an abstracted Data Mining process where agents first tune their modeling algorithm (which outputs the algorithm and its parameter settings, I) before building a final model (which is output as M). The agent then uses the model to predict the labels on a test set (producing predictions, P), from which an overall evaluation of the model (resulting in a score, S) is determined. The point at which these outputs are published for all agents to access depends on the architecture of the evaluation system, as shown in Figure 54.2. A single evaluation agent provides the evaluation procedures; different agents submit information on their models to this agent, which stores the information and automatically evaluates it according to all relevant criteria. If the criteria change, the evaluation agent automatically re-evaluates previously submitted models.

In such a framework, information about produced models can be submitted at several levels, as illustrated in Figure 54.2. Agents can run their own models on a test set and send only predictions to the evaluation agent (assuming evaluation is based on predictions only); they can submit descriptions of the models themselves; or they can send a complete description of the model-producing algorithm and the parameters used to an evaluation agent that has been augmented with modeling algorithms. These respective options offer increasing centralization and increasingly flexible evaluation possibilities, but also involve increasingly sophisticated software support (Blockeel and Moyle, 2002).
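At the prediction ("P") level, the centralised architecture might be sketched as follows. The class and function names are my own, not from the cited systems, and a single accuracy criterion stands in for the full set of evaluation procedures:

```python
class EvaluationAgent:
    """Sketch of a centralised evaluation agent: Data Mining agents
    submit test-set predictions (the 'P' level); this agent scores them
    against the held-out true labels (producing 'S'), and automatically
    re-scores every stored submission when the criterion changes."""

    def __init__(self, true_labels, criterion):
        self.true_labels = list(true_labels)
        self.criterion = criterion   # maps (truth, predictions) -> score
        self.submissions = {}        # agent name -> predictions (P)
        self.scores = {}             # agent name -> score (S)

    def submit(self, agent, predictions):
        self.submissions[agent] = list(predictions)
        self.scores[agent] = self.criterion(self.true_labels,
                                            self.submissions[agent])

    def change_criterion(self, criterion):
        """Automatic re-evaluation of all previously submitted models."""
        self.criterion = criterion
        for agent, preds in self.submissions.items():
            self.scores[agent] = criterion(self.true_labels, preds)

def accuracy(truth, preds):
    """Example criterion: fraction of correct predictions."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)
```

Because all predictions are stored centrally, a change of criterion costs the submitting agents nothing: the evaluation agent simply replays the stored predictions against the new criterion.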
Communicating Data Mining models to the evaluation agent can be performed using a standard format. For instance, in (Flach et al., 2003) models from multiple agents were submitted in a standard XML-style format, using the Predictive Model Markup Language (PMML) (The Data Mining Group, 2003). Such a procedure has been adopted for a real-world collaborative Data Mining project (Flach et al., 2003).
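To give a feel for the format, the following sketch builds a deliberately simplified PMML-style document. The element names (PMML, Header, DataDictionary, TreeModel, Node, SimplePredicate) follow the PMML standard, but a conforming document requires more detail (e.g. data types on fields and a mining schema), so treat this only as a shape illustration:

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style document for a one-node decision tree.
# Real PMML requires additional attributes and elements omitted here.
pmml = ET.Element("PMML", version="2.1")
ET.SubElement(pmml, "Header", description="model submitted by agent A")

dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="age", optype="continuous")
ET.SubElement(dd, "DataField", name="class", optype="categorical")

tree = ET.SubElement(pmml, "TreeModel", functionName="classification")
root = ET.SubElement(tree, "Node", score="positive")
ET.SubElement(root, "SimplePredicate", field="age",
              operator="greaterThan", value="30")

document = ET.tostring(pmml, encoding="unicode")
```

The point of the exchange format is that the evaluation agent can parse `document` without knowing which tool or algorithm produced the model.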
Model combination is not always possible. However, when restricted to binary classification models, it is possible to utilize Receiver Operating Characteristic (ROC) curves (Provost and Fawcett, 2001) to assist both model comparison and model combination. ROC analysis plots different binary classification models in a two-dimensional space with respect to the types of errors the models make: false positive errors and false negative errors2. The actual performance of a model at run-time depends on the costs of errors at run-time and the distribution of the classes at run-time. The values of these run-time parameters, or operating characteristics, determine the optimal model(s) for use in prediction. ROC analysis enables models to be compared; some models will never be optimal under any operating conditions and can be discarded. The remaining models are those located on the ROC convex hull (ROCCH).
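A small sketch of this pruning step, assuming each model is summarized by its (false positive rate, true positive rate) point; the model names and the standard upper convex hull construction are illustrative:

```python
def roc_convex_hull(points):
    """points: dict of model name -> (fpr, tpr).
    Returns the names of the models on the ROC convex hull; all other
    models are never optimal under any operating condition."""
    # Add the trivial "always negative" and "always positive" classifiers,
    # which anchor the hull at (0, 0) and (1, 1).
    pts = dict(points, always_neg=(0.0, 0.0), always_pos=(1.0, 1.0))
    ordered = sorted(pts.items(), key=lambda kv: kv[1])
    hull = []  # upper convex hull over (fpr, tpr), built left to right
    for name, (x, y) in ordered:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2][1], hull[-1][1]
            # Pop the middle point if it lies on or below the new chord.
            if (x2 - x1) * (y - y1) >= (y2 - y1) * (x - x1):
                hull.pop()
            else:
                break
        hull.append((name, (x, y)))
    return [name for name, _ in hull]
```

For example, a model at (0.5, 0.6) is discarded when another model at (0.2, 0.7) exists: every point on the chord from (0.2, 0.7) to (1, 1) dominates it, so it is never the optimal choice.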
As well as identifying non-optimal models, ROC analysis can be used to combine models. One method is to use two adjacent models on the ROCCH, located on either side of the operating condition, in combination to make run-time predictions. Another approach is to modify a single model into multiple models that can then be plotted in ROC space (Flach et al., 2001), resulting in models that fit a broader range of operating conditions. Wettschereck et al. (2003) describe a support system that performs model evaluation, model visualization, and model comparison, which has been applied in a collaborative Data Mining setting (Flach et al., 2003).
2 The axes of an ROC curve are actually the true positive rate versus the false positive rate.
Fig. 54.2 Two different architectures for model evaluation. The path finishing in dashed arrows depicts agents in charge of building and evaluating their own models before publishing their results centrally. The path of solid arrows depicts Data Mining agents submitting their models to a centralized evaluation agent, which provides the services of executing submitted models on a test set, evaluating the predictions to produce scores, and then publishing the results. The information submitted to the central evaluation agent is: I = algorithm and parameter settings used to produce models; M = models; P = predictions made by the models on a test set; S = scores of the value of the models.
54.5 Discussion
References containing the keywords collaborative Data Mining and collaboration partition naturally into the following categories:

• Multiple software agents applying Data Mining algorithms to solve the same problem (e.g. (Ramakrishnan, 2001)): this presupposes that the Data Mining task and its associated data are well defined a priori.

• Humans using modern collaboration techniques to apply Data Mining to a single, defined problem (e.g. (Mladenic et al., 2003)).

• Data Mining the artifacts of human collaboration (e.g. (Biuk-Aghai and Simoff, 2001)): these artifacts are typically the conversations and associated documents collected via some electronic discussion forum.

• The collaboration process itself resulting in increased knowledge: a form of knowledge growth by collection within a context.

• Grid-style computing facilities collaborating to provide resources for Data Mining (e.g. (Singh et al., 2003)): these resources typically provide either federated data or distributed computing power.

• Integrating Data Mining techniques into business process software (e.g. (McDougall, 2001)), for example Enterprise Resource Planning systems and groupware. Note that this, too, implies a priori knowledge of what Data Mining problems are to be solved.
This chapter focused mainly on the second item, that of humans using collaboration techniques to apply Data Mining to a single task. With sufficient definition of a particular Data Mining problem, this can lead to a multiple software agent Data Mining framework (the first item), although that was not the aim of this chapter.
Many Data Mining challenges have been issued, which by their nature always result in "winners" and "losers". In collaborative approaches, however, much can be learned from the losers as the Data Mining project proceeds. Much initial effort is required to establish a Data Mining challenge (e.g. problem specification, data collection and preprocessing, and specification of the evaluation criteria), even before the participants register. This effort also needs to be expended in a collaborative setting, so that the objectives of the Data Mining project are clearly articulated in advance.
The collaborative methodology and techniques described here have been applied with mixed success in several Data Mining projects (Flach et al., 2003, Stepankova et al., 2003, Jorge et al., 2003). Further development of collaborative Data Mining processes, and of supporting tools and communication environments, is likely to improve the results of harnessing dispersed Data Mining expertise.
54.6 Conclusions
Collaborative Data Mining is more difficult than the single-team setting. Data Mining benefits from adhering to established processes. One key notion in Data Mining methodologies is that of understanding (e.g. CRISP-DM contains the phases business understanding and data understanding). How are such understandings produced, articulated, maintained, and communicated to all collaborating agents? What happens when understandings change: how much of the Data Mining process will need re-work? How does one agent's understanding differ from another's, simply due to communication, language, and cultural differences?
Practitioners embarking on collaborative Data Mining might wish to heed some of the lessons learned from other collaborative Data Mining projects:

• Analyze the form of collaboration proposed and understand how difficult it is likely to be.
• Establish a methodology that all participants can utilize, along with support tools and technologies.
• Ensure that all results, intermediate or otherwise, are recorded and shared in a timely manner.
• Encourage competition among participants.
• Define metrics for success at all stages.
• Define model evaluation and combination procedures.
References
Adriaans, P., and Zantinge, D. Data Mining. Addison-Wesley, New York, 1996.
Amara, R. New directions for innovations. Futures 22(2): 142-152, 1990.
Bacon, F. Novum Organum, eds. P. Urbach and J. Gibson. Open Court Publishing Company, 1994.
Biuk-Aghai, R.P. and S.J. Simoff. An integrative framework for knowledge extraction in collaborative virtual environments. In The 2001 International ACM SIGGROUP Conference on Supporting Group Work. Boulder, Colorado, USA, 2001.
Blockeel, H. and S.A. Moyle. Collaborative Data Mining needs centralised model evaluation. In Proceedings of the ICML-2002 Workshop on Data Mining Lessons Learned. The University of New South Wales, Sydney, 2002.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. CRISP-DM 1.0: Step-by-step data mining guide. The CRISP-DM consortium, 2000.
Edvinsson, L. and Malone, M.S. Intellectual Capital: Realizing Your Company's True Value by Finding Its Hidden Brainpower. HarperBusiness, New York, USA, 1997.
Fayyad, U., et al., eds. Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
Flach, P.A., et al. Decision support for Data Mining: introduction to ROC analysis and its application. In Data Mining and Decision Support: Integration and Collaboration, D. Mladenic, et al., editors. Kluwer Academic Publishers, 2003.
Flach, P., Blockeel, H., Gaertner, T., Grobelnik, M., Kavsek, B., Kejkula, M., Krzywania, D., Lavrac, N., Mladenic, D., Moyle, S., Raeymaekers, S., Rauch, J., Ribeiro, R., Sclep, G., Struyf, J., Todorovski, L., Torgo, L., Wettschereck, D., and Wu, S. On the road to knowledge: mining 21 years of UK traffic accident reports. In Data Mining and Decision Support: Integration and Collaboration, D. Mladenic, et al., editors. Kluwer Academic Publishers, 2003.
Hair, J.F., Anderson, R.E., Tatham, R.L., and Black, W.C. Multivariate Data Analysis. Prentice Hall, 1998.
Holte, R.C. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11: 63-90, 1993.
Jorge, A., Alves, M.A., Grobelnik, M., Mladenic, D., and Petrak, J. Web site access analysis for a national statistical agency. In Data Mining and Decision Support: Integration and Collaboration, D. Mladenic, et al., editors, p. 157-166. Kluwer Academic Publishers, 2003.
Kuhn, T.S. The Structure of Scientific Revolutions, 2nd, enlarged ed. University of Chicago Press, Chicago, 1970.
McDougall, P. Companies that dare to share information are cashing in on new opportunities. InformationWeek, May 7, 2001.
McKenzie, J. and C. van Winkelen. Exploring e-collaboration space. In Proceedings of the First Annual Knowledge Management Forum Conference. Henley Management College, 2001.
Mitchell, T.M. Machine Learning. McGraw-Hill, 1997.
Mladenic, D., Lavrac, N., Bohanec, M., and Moyle, S., editors. Data Mining and Decision Support: Integration and Collaboration. Kluwer Academic Publishers, 2003.
Mowshowitz, A. Virtual organization. Communications of the ACM 40(9): 30-37, 1997.
Moyle, S.A. and Srinivasan, A. Classificatory challenge-Data Mining: a recipe. Informatica 25(3): 343-347, 2001.
Moyle, S., J. McKenzie, and A. Jorge. Collaboration in a Data Mining virtual organization. In Data Mining and Decision Support: Integration and Collaboration, D. Mladenic, et al., editors. Kluwer Academic Publishers, 2003.
Nohria, N. and R.G. Eccles, eds. Networks and Organizations: Structure, Form, and Action. Harvard Business School Press, Boston, 1993.
Page, C.D. and C. Hatzis. KDD Cup 2001. University of Wisconsin, http://www.cs.wisc.edu/~dpage/kddcup2001/, 2001.
Popper, K. The Logic of Scientific Discovery. Routledge, 1977.
Provost, F. and T. Fawcett. Robust classification for imprecise environments. Machine Learning 42: 203-231, 2001.
Ramakrishnan, R. Mass collaboration and Data Mining (keynote address). In The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001). San Francisco, California, 2001.
Singh, R., Leigh, J., DeFanti, T.A., and Karayannis, F. TeraVision: a high resolution graphics streaming device for amplified collaboration environments. Journal of Future Generation Computer Systems (FGCS) 19(6): 957-972, 2003.
Snow, C.C., S.A. Snell, and S.C. Davison. Using transnational teams to globalize your company. Organizational Dynamics 24(4): 50-67, 1996.
SolEuNet. The Solomon European Network – Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise. http://soleunet.ijs.si/, 2002.
Soukhanov, A., ed. Microsoft Encarta College Dictionary: The First Dictionary for the Internet Age. St. Martin's Press, 2001.
Srinivasan, A., R.D. King, and D.W. Bristol. An assessment of submissions made to the Predictive Toxicology Evaluation Challenge. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99). Morgan Kaufmann, 1999.
Stepankova, O., Klema, J., and Miksovsky, P. Collaborative Data Mining with RAMSYS and Sumatra TT: prediction of resources for a health farm. In Data Mining and Decision Support: Integration and Collaboration, D. Mladenic, et al., editors, p. 215-227. Kluwer Academic Publishers, 2003.
The Data Mining Group. The Predictive Model Markup Language (PMML). http://www.dmg.org/, 2003.
Voss, A., Richter, G., Moyle, S., and Jorge, A. Collaboration support for virtual data mining enterprises. In 3rd International Workshop on Learning Software Organizations (LSO'01). Springer-Verlag, 2001.
Wettschereck, D., A. Jorge, and S. Moyle. Visualization and evaluation support of knowledge discovery through the Predictive Model Markup Language. In 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES 2003), Oxford. Springer-Verlag, 2003.
Wilson, T.D. The nonsense of knowledge management. Information Research 8(1), 2002.
Witten, I.H. and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.