


29 Visual Analysis of Sequences Using Fractal Geometry (Noa Ruschin Rimini and Oded Maimon)

In the next phase of the research we plan to further develop the proposed scheme, which is based on fractal representation, to account for online changes in monitored processes. We plan to suggest a novel type of online interactive SPC chart that enables dynamic inspection of non-linear state-dependent processes.

The presented algorithmic framework is applicable to many practical domains, for example visual analysis of the effect of operation sequence on product quality (see Ruschin-Rimini et al., 2009), visual analysis of customers' action history, visual analysis of product defect code history, and more.

The developed application was utilized by the General Motors research labs located in Bangalore, India, for visual analysis of vehicle failure history.

References

Arbel, R. and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition Letters, 27(14):1619–1631, 2006.

Barnsley, M., Fractals Everywhere, Academic Press, Boston, 1988.

Barnsley, M. and Hurd, L. P., Fractal Image Compression, A K Peters, Boston, 1993.

Cohen, S., Rokach, L., and Maimon, O., Decision Tree Instance Space Decomposition with Grouped Gain-Ratio, Information Sciences, 177(17):3592–3612, 2007.

Da Cunha, C., Agard, B., and Kusiak, A., Data mining for improvement of product quality, International Journal of Production Research, 44(18-19):4027–4041, 2006.

Falconer, K., Techniques in Fractal Geometry, John Wiley & Sons, 1997.

Jeffrey, H. J., Chaos game representation of genetic sequences, Nucleic Acids Research, 18:2163–2170, 1990.

Keim, D. A., Information Visualization and Visual Data Mining, IEEE Transactions on Visualization and Computer Graphics, 7(1):100–107, 2002.

Maimon, O. and Rokach, L., Data Mining by Attribute Decomposition with semiconductors manufacturing case study, in Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311–336, 2001.

Moskovitch, R., Elovici, Y., and Rokach, L., Detection of unknown computer worms based on behavioral classification of the host, Computational Statistics and Data Analysis, 52(9):4544–4566, 2008.

Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

Rokach, L., Mining manufacturing data using genetic algorithm-based feature set decomposition, International Journal of Intelligent Systems Technologies and Applications, 4(1):57–78, 2008.

Rokach, L., Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition, 41(5):1676–1700, 2008.

Rokach, L., Decomposition methodology for classification tasks: a meta decomposer framework, Pattern Analysis and Applications, 9:257–271, 2006.

Rokach, L. and Maimon, O., Theory and applications of attribute decomposition, IEEE International Conference on Data Mining, IEEE Computer Society Press, pp. 473–480, 2001.

Rokach, L. and Maimon, O., Feature Set Decomposition for Decision Trees, Journal of Intelligent Data Analysis, 9(2):131–158, 2005.

Rokach, L. and Maimon, O., Data mining for improving the quality of manufacturing: a feature set decomposition approach, Journal of Intelligent Manufacturing, 17(3):285–299, 2006.


Rokach, L., Maimon, O., and Arbel, R., Selective voting - getting more for less in sensor fusion, International Journal of Pattern Recognition and Artificial Intelligence, 20(3):329–350, 2006.

Rokach, L., Maimon, O., and Averbuch, M., Information Retrieval System for Medical Narrative Reports, Lecture Notes in Artificial Intelligence 3055, pp. 217–228, Springer-Verlag, 2004.

Rokach, L., Maimon, O., and Lavi, I., Space Decomposition in Data Mining: A Clustering Approach, Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, pp. 24–31, 2003.

Rokach, L., Romano, R., and Maimon, O., Mining manufacturing databases to discover the effect of operation sequence on the product quality, Journal of Intelligent Manufacturing, 2008.

Ruschin-Rimini, N., Maimon, O., and Romano, R., Visual Analysis of Quality-related Manufacturing Data Using Fractal Geometry, working paper submitted for publication, 2009.

Weiss, C. H., Visual Analysis of Categorical Time Series, Statistical Methodology, 5:56–71, 2008.


Interestingness Measures - On Determining What Is Interesting

Sigal Sahar

Department of Computer Science,

Tel-Aviv University, Israel

gales@post.tau.ac.il

Summary. As the size of databases increases, the sheer number of patterns mined from them can easily overwhelm users of the KDD process. Users run the KDD process because they are overloaded by data. To be successful, the KDD process needs to extract interesting patterns from large masses of data. In this chapter we examine methods of tackling this challenge: how to identify interesting patterns.

Key words: Interestingness Measures, Association Rules

Introduction

According to (Fayyad et al., 1996), "Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Mining algorithms primarily focus on discovering patterns in data; for example, the Apriori algorithm (Agrawal and Shafer, 1996) outputs the exhaustive list of association rules that have at least the predefined support and confidence thresholds. Interestingness differentiates between the "valid, novel, potentially useful and ultimately understandable" mined association rules and those that are not, separating the interesting patterns from those that are not interesting. Thus, determining what is interesting, or interestingness, is a critical part of the KDD process. In this chapter we review the main approaches to determining what is interesting.

Figure 30.1 summarizes the three main types of interestingness measures, or approaches to determining what is interesting. Subjective interestingness explicitly relies on users' specific needs and prior knowledge. Since what is interesting to any user is ultimately subjective, subjective interestingness measures will have to be used to reach any complete solution to determining what is interesting. (Silberschatz and Tuzhilin, 1996) differentiate between subjective and objective interestingness. Objective interestingness refers to measures of interest "where interestingness of a pattern is measured in terms of its structure and the underlying data used in the discovery process" (Silberschatz and Tuzhilin, 1996), but requires user intervention to select which of these measures to use and to initialize it. Impartial interestingness, introduced in (Sahar, 2001), refers to measures of interest that can be applied automatically to the output of any association rule mining algorithm to reduce the number of not-interesting rules independently of the domain, task, and users.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_30, © Springer Science+Business Media, LLC 2010

Fig. 30.1. Types of Interestingness Approaches. (Figure: a taxonomy of approaches; on the subjective side, expert/grammar, rule-by-rule classification, and interest via what is not interesting; on the objective side, ranking patterns, pruning & constraints, and summarization.)

30.1 Definitions and Notations

Let Λ be a set of attributes over the boolean domain. Λ is the superset of all attributes we discuss in this chapter. An itemset I is a set of attributes: I ⊆ Λ. A transaction is a subset of attributes of Λ that have the boolean value TRUE. We will refer to the set of transactions over Λ as a database. If exactly s% of the transactions in the database contain an itemset I, then we say that I has support s, and express the support of I as P(I). Given a support threshold s, we will call itemsets that have at least support s large, or frequent.

Let A and B be two sets of attributes such that A, B ⊆ Λ and A ∩ B = ∅. Let D be a set of transactions over Λ. Following the definition in (Agrawal and Shafer, 1996), an association rule A→B is defined to have support s% and confidence c% in D if s% of the transactions in D contain A ∪ B and c% of the transactions that contain A also contain B. For convenience, in an association rule A→B we will refer to A as the assumption and B as the consequent of the rule. We will express the support of A→B as P(A ∪ B). We will express the confidence of A→B as P(B|A) and denote it with confidence(A→B). (Agrawal and Shafer, 1996) present an elegantly simple algorithm to mine the exhaustive list of association rules that have at least predefined support and confidence thresholds from a boolean database.
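These definitions translate directly into code. A minimal sketch over a toy boolean database, where each transaction is represented as the set of attributes that are TRUE (the transactions and attribute names are invented for illustration):

```python
# Toy boolean database: each transaction is the set of TRUE attributes.
DB = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """P(I): the fraction of transactions containing every attribute of I."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(assumption, consequent, db):
    """P(B|A): among transactions containing the assumption A,
    the fraction that also contain the consequent B."""
    return support(assumption | consequent, db) / support(assumption, db)

print(support({"bread", "milk"}, DB))       # 0.5
print(confidence({"bread"}, {"milk"}, DB))  # 0.6666666666666666 (= 2/3)
```

With a support threshold of 50%, the itemset {bread, milk} is frequent in this toy database, and the rule bread→milk has confidence 2/3.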

30.2 Subjective Interestingness

What is interesting to users is ultimately subjective; what is interesting to one user may be known or irrelevant, and therefore not interesting, to another user. To determine what is subjectively interesting, users' domain knowledge, or at least the portion of it that pertains to the data at hand, needs to be incorporated into the solution. In this section we review the three main approaches to this problem.



30.2.1 The Expert-Driven Grammatical Approach

In the first and most popular approach, the domain knowledge required to subjectively determine which rules are interesting is explicitly described through a predefined grammar. In this approach a domain expert is expected to express, using the predefined grammar, what is, or what is not, interesting. This approach was introduced by (Klemettinen et al., 1994), who were the first to apply subjective interestingness, and many other applications followed. (Klemettinen et al., 1994) define pattern templates that describe the structure of interesting association rules through inclusive templates and the structure of not-interesting rules through restrictive templates. (Liu et al., 1997) present a formal grammar, the General Impressions, that allows the expression of imprecise or vague domain knowledge. (Srikant et al., 1997) introduce into the mining process user-defined constraints, including taxonomical constraints, in the form of boolean expressions, and (Ng et al., 1998) introduce user constraints as part of an architecture that supports exploratory association rule mining. (Padmanabhan and Tuzhilin, 2000) use a set of predefined user beliefs in the mining process to output a minimal set of unexpected association rules with respect to that set of beliefs. (Adomavicius and Tuzhilin, 1997) define an action hierarchy to determine which association rules are actionable; actionability is an aspect of being subjectively interesting. (Adomavicius and Tuzhilin, 2001, Tuzhilin and Adomavicius, 2002) iteratively apply expert-driven validation operators to incorporate subjective interestingness in the personalization and bioinformatics domains.
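A minimal sketch of how template-based filtering might work, loosely in the spirit of the inclusive/restrictive templates of (Klemettinen et al., 1994). The template representation (a pair of allowed attribute sets) and the attribute names are invented for illustration; the actual grammar in the paper is richer:

```python
# A rule is an (assumption, consequent) pair of frozensets.
# A template here is a pair (allowed_assumption, allowed_consequent):
# a rule matches if its attributes fall inside the allowed sets.

def matches(rule, template):
    assumption, consequent = rule
    allowed_assumption, allowed_consequent = template
    return assumption <= allowed_assumption and consequent <= allowed_consequent

def filter_rules(rules, inclusive, restrictive):
    """Keep rules matching some inclusive template and no restrictive one."""
    return [r for r in rules
            if any(matches(r, t) for t in inclusive)
            and not any(matches(r, t) for t in restrictive)]

rules = [(frozenset({"age", "income"}), frozenset({"churn"})),
         (frozenset({"zip"}), frozenset({"churn"}))]
inclusive = [({"age", "income", "tenure"}, {"churn"})]
restrictive = [({"zip"}, {"churn"})]
print(filter_rules(rules, inclusive, restrictive))
# keeps only the (age, income) -> churn rule
```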

In some cases the required domain knowledge can be obtained from a pre-existing knowledge base, thus eliminating the need to engage directly with a domain expert to acquire it. For example, in (Basu et al., 2001) the WordNet lexical knowledge base is used to measure the novelty, an indicator of interest, of an association rule by assessing the dissimilarity between the assumption and the consequent of the rule. An example of a domain where such a knowledge base exists naturally is the detection of rule changes over time, as in (Liu et al., 2001a). In many domains, however, these knowledge bases are not readily available. In those cases the success of this approach is conditioned on the availability of a domain expert willing and able to complete the task of defining all the required domain knowledge. This is no easy task: the domain expert may unintentionally neglect to define some of the required domain knowledge, some of it may not be applicable across all cases, and it could change over time. Securing such a domain expert for the duration of the task is often costly and sometimes infeasible. But given the required domain knowledge, this approach can output the small set of subjectively interesting rules.

30.2.2 The Rule-By-Rule Classification Approach

In the second approach, taken in (Subramonian, 1998), the required domain knowledge base is constructed by classifying rules from prior mining sessions. This approach does not depend on the availability of domain experts to define the domain knowledge, but does require very intensive user interaction of a mundane nature. Although the knowledge base can be constructed incrementally, this, as the author notes, can be a tedious process.

30.2.3 Interestingness Via What Is Not Interesting Approach

The third approach, introduced by (Sahar, 1999), capitalizes on an inherent aspect of the interestingness task: the majority of the mined association rules are not interesting. In this approach a user is iteratively presented with simple rules, with only one attribute in their assumption and one attribute in their consequent, for classification. These rules are selected so that a single user classification of a rule can imply that a large number of the mined association rules are also not interesting. The advantages of this approach are that it is simple, so that a naive user can apply it without depending on a domain expert to provide input; that it can, with only a few questions, very quickly eliminate a significant portion of the not-interesting rules; and that it circumvents the need to define why a rule is interesting. However, this approach is used only to reduce the size of the interestingness problem by substantially decreasing the number of potentially interesting association rules, rather than pinpointing the exact set of interesting rules. This approach has been integrated into the mining process in (Sahar, 2002b).
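One way the leverage of a single classification might look in code. This is a simplified, hypothetical reading of the family-elimination idea (a rule that merely elaborates a not-interesting simple rule is eliminated with it), not the actual algorithm of (Sahar, 1999):

```python
# When a user marks the simple one-attribute rule a -> b as not
# interesting, drop every mined rule whose assumption contains a and
# whose consequent contains b; one answer prunes a whole rule family.

def eliminate_family(rules, a, b):
    """Drop every rule A -> B with a in A and b in B."""
    return [(A, B) for (A, B) in rules if not (a in A and b in B)]

rules = [({"a"}, {"b"}),
         ({"a", "c"}, {"b"}),
         ({"c"}, {"d"})]
survivors = eliminate_family(rules, "a", "b")
print(survivors)  # only the ({"c"}, {"d"}) rule remains
```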

30.3 Objective Interestingness

The domain knowledge needed in order to apply subjective interestingness criteria is difficult to obtain. Although subjective interestingness is needed to reach the short list of interesting patterns, much can be done without explicitly using domain knowledge. The application of objective interestingness measures depends only on the structure of the data and the patterns extracted from it; some user intervention will still be required, for example to select the measure to be used. In this section we review the three main types of objective interestingness measures.

30.3.1 Ranking Patterns

To rank association rules according to their interestingness, a mapping, f, is introduced from the set of mined rules, Ω, to the domain of real numbers:

    f : Ω → ℝ    (30.1)

The number an association rule is mapped to is an indication of how interesting the rule is: the larger the number a rule is mapped to, the more interesting the rule is assumed to be. Thus, the mapping imposes an order, or ranking, of interest on a set of association rules.

Ranking rules according to their interest has been suggested in the literature as early as (Piatetsky-Shapiro, 1991), which introduced the first three principles of interestingness evaluation criteria, as well as a simple mapping that could satisfy them: P-S(A→B) = P(A ∪ B) − P(B) · P(A). Since then many different mappings, or rankings, have been proposed as measures of interest. Many definitions of such mappings, as well as their empirical and theoretical evaluations, can be found in (Klösgen, 1996, Bayardo Jr. and Agrawal, 1999, Sahar and Mansour, 1999, Hilderman and Hamilton, 2000, Hilderman and Hamilton, 2001, Tan et al., 2002). The work on the principles introduced by (Piatetsky-Shapiro, 1991) has been expanded by (Major and Mangano, 1995, Kamber and Shinghal, 1996). (Tan et al., 2002) extend the studies of the properties and principles of the ranking criteria. (Hilderman and Hamilton, 2001) provide a very thorough review and study of these criteria, and introduce an interestingness theory for them.

30.3.2 Pruning and Application of Constraints

The mapping in Equation 30.1 can also be used as a pruning technique: prune as not-interesting all the association rules that are mapped to an interest score lower than a user-defined threshold. Note that in this section we only refer to pruning and application of constraints performed using objective interestingness measures, and not subjective ones, such as removing rules if they contain, or do not contain, certain attributes.

Additional methods can be used to prune association rules without requiring the use of an interest mapping. Statistical tests such as the χ² test are used for pruning in (Brin et al., 1997, Liu et al., 1999, Liu et al., 2001b). These tests have parameters that need to be initialized. A collection of pruning methods is described in (Shah et al., 1999).
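A sketch of χ²-based pruning for a rule A→B, computed on the 2×2 contingency table of A's presence versus B's presence. The counts are invented, and the statistic is computed directly rather than with a statistics library; rules whose statistic falls below the chosen critical value would be pruned as statistically uninteresting:

```python
def chi2_2x2(n11, n10, n01, n00):
    """Chi-squared statistic for the 2x2 table [[n11, n10], [n01, n00]],
    where n11 counts transactions containing both A and B, etc."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00   # A present / absent
    col1, col0 = n11 + n01, n10 + n00   # B present / absent
    stat = 0.0
    for obs, r, c in [(n11, row1, col1), (n10, row1, col0),
                      (n01, row0, col1), (n00, row0, col0)]:
        expected = r * c / n
        stat += (obs - expected) ** 2 / expected
    return stat

# A vs. B over 100 transactions (invented counts):
stat = chi2_2x2(n11=30, n10=10, n01=20, n00=40)
print(stat)  # ≈ 16.67; well above the 3.84 critical value at the 0.05 level (1 df)
```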

Another type of pruning is the constraint-based approach of (Bayardo Jr. et al., 1999). To output a more concise list of rules from the mining process, the algorithm of (Bayardo Jr. et al., 1999) only mines rules that comply with the usual constraints of minimum support and confidence thresholds as well as with two new constraints. The first constraint is a user-specified consequent (subjective interestingness). The second, unprecedented, constraint is a user-specified minimum confidence improvement threshold: only rules whose confidence is at least the minimum confidence improvement threshold greater than the confidence of any of their simplifications are output; a simplification of a rule is formed by removing one or more attributes from its assumption.

30.3.3 Summarization of Patterns

Several distinct methods fall under the summarization approach. (Aggarwal and Yu, 1998) introduce a redundancy measure that summarizes all the rules at the predefined support and confidence levels very compactly by using more "complex" rules. The preference for complex rules is formally defined as follows: a rule C→D is redundant with respect to A→B if (1) A ∪ B = C ∪ D and A ⊂ C, or (2) C ∪ D ⊂ A ∪ B and A ⊆ C. A different type of summary, favoring less "complex" rules, was introduced by (Liu et al., 1999), who provide a summary of association rules with a single-attribute consequent using a subset of "direction-setting" rules: rules that represent the direction a group of non-direction-setting rules follows. The direction is calculated using the χ² test, which is also used to prune the mined rules prior to the discovery of direction-setting rules. (Liu et al., 2000) present a summary that simplifies the discovered rules by providing an overall picture of the relationships in the data and their exceptions. (Zaki, 2000) introduces an approach to mining only the non-redundant association rules from which all the other rules can be inferred. (Zaki, 2000) also favors "less-complex" rules, defining a rule C→D to be redundant if there exists another rule A→B such that A ⊆ C and B ⊆ D and both rules have the same confidence.
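The (Zaki, 2000) redundancy definition quoted above can be checked directly. The rules and confidence values are invented for illustration:

```python
# A rule C -> D is redundant if some other mined rule A -> B has
# A ⊆ C, B ⊆ D, and the same confidence.

def redundant(rule, rules):
    (C, D), conf_cd = rule
    for (A, B), conf_ab in rules:
        if (A, B) != (C, D) and A <= C and B <= D and conf_ab == conf_cd:
            return True
    return False

rules = [((frozenset({"a"}), frozenset({"b"})), 0.8),
         ((frozenset({"a", "c"}), frozenset({"b"})), 0.8)]
print(redundant(rules[1], rules))
# True: a -> b subsumes {a, c} -> b at equal confidence
```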

(Adomavicius and Tuzhilin, 2001) introduce summarization through similarity-based rule grouping. The similarity measure is specified via an attribute hierarchy, organized by a domain expert who also specifies a level of rule aggregation in the hierarchy, called a cut. The association rules are then mapped to aggregated rules by mapping to the cut, and the aggregated rules form the summary of all the mined rules.

(Toivonen et al., 1995) suggest clustering rules "that make statements about the same database rows [...]" using a simple distance measure, and introduce an algorithm to compute rule covers as short descriptions of large sets of rules. For this approach to work without losing any information, (Toivonen et al., 1995) make a monotonicity assumption, restricting the databases on which the algorithm can be used. (Sahar, 2002a) introduces a general clustering framework for association rules to facilitate the exploration of masses of mined rules by automatically organizing them into groups according to similarity. To simplify interpretation of the resulting clusters, (Sahar, 2002a) also introduces a data-inferred, concise representation of the clusters, the ancestor coverage.



30.4 Impartial Interestingness

To determine what is interesting, users need to first determine which interestingness measures to use for the task. Determining interestingness according to different measures can result in different sets of rules being output as interesting. This dependence of the output of the interestingness analysis on the interestingness measure used is clear when domain knowledge is applied explicitly, as in the case of the subjective interestingness measures (Section 30.2). When domain knowledge is applied implicitly, this dependence may not be as clear, but it still exists. As (Sahar, 2001) shows, objective interestingness measures depend implicitly on domain knowledge. This dependence is manifested during the selection of the objective interestingness measure to be used, and, when applicable, during its initialization (for pruning and constraints) and the interpretation of the results (for summarization).

(Sahar, 2001) introduces a new type of interestingness measure, as part of an interestingness framework, that can be applied automatically to eliminate a portion of the rules that is not interesting, as in Figure 30.2. This type of interestingness is called impartial interestingness because it is domain-independent, task-independent, and user-independent, making it impartial to all considerations affecting other interestingness measures. Since the impartial interestingness measures do not require any user intervention, they can be applied sequentially and automatically, directly following the Data Mining process, as depicted in Figure 30.2. The impartial interestingness measures preprocess the mined rules to eliminate those rules that are not interesting regardless of the domain, task, and user, and so they form the Interestingness PreProcessing Step. This step is followed by Interestingness Processing, which includes the application of objective (when needed) and subjective interestingness criteria.

Fig. 30.2. Framework for Determining Interestingness. (Figure: rules output by a data-mining algorithm enter Interestingness PreProcessing, which applies impartial criteria through several techniques; the preprocessed rules then enter Interestingness Processing, which applies objective and subjective criteria such as pruning and summarization; the result is the set of interesting rules.)

To be able to define impartial measures, (Sahar, 2001) assumes that the goal of the interestingness analysis on a set of mined rules is to find a subset of interesting rules, rather than to infer from the set of mined rules rules that have not been mined and could potentially be interesting. An example of an impartial measure is overfitting (Sahar, 2001): the deletion of all rules r = A ∪ C→B if there exists another mined rule r′ = A→B such that confidence(r′) ≥ confidence(r).
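The overfitting measure can be sketched as a filter over the mined rules: a rule is deleted when a mined rule with a strictly smaller assumption and the same consequent explains as much with at least the same confidence. The rule set below is invented for illustration:

```python
def prune_overfitting(rules):
    """rules: dict mapping (assumption, consequent) -> confidence.
    Delete r = A ∪ C -> B if some mined r' = A -> B has
    confidence(r') >= confidence(r)."""
    kept = {}
    for (A, B), conf in rules.items():
        dominated = any(A2 < A and B2 == B and conf2 >= conf
                        for (A2, B2), conf2 in rules.items())
        if not dominated:
            kept[(A, B)] = conf
    return kept

rules = {
    (frozenset({"a"}), frozenset({"b"})): 0.9,
    (frozenset({"a", "c"}), frozenset({"b"})): 0.85,  # dominated by a -> b: deleted
    (frozenset({"a", "d"}), frozenset({"b"})): 0.95,  # higher confidence: kept
}
print(sorted(len(A) for (A, B) in prune_overfitting(rules)))  # [1, 2]
```

No user input, threshold, or domain knowledge is needed, which is what makes the measure impartial.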


30.5 Concluding Remarks

Characterizing what is interesting is a difficult problem, primarily because what is interesting is ultimately subjective. Numerous attempts have been made to formulate these qualities, ranging from evidence and simplicity to novelty and actionability, with no formal definition for "interestingness" emerging so far. In this chapter we reviewed the three main approaches to tackling the challenge of discovering which rules are interesting under certain assumptions.

Some of the interestingness measures reviewed have been incorporated into the mining process as opposed to being applied after it. (Spiliopoulou and Roddick, 2000) discuss the advantages of processing the set of rules after the mining process, and introduce the concept of higher order mining, showing that rules with higher order semantics can be extracted by processing the mined results. (Hipp and Günter, 2002) argue that pushing constraints into the mining process "[...] is based on an understanding of KDD that is no longer up-to-date", as KDD is an iterative discovery process rather than "pure hypothesis investigation". There is no consensus on whether it is advisable to push constraints into the mining process. An optimal solution is likely to be produced through a balanced combination of these approaches; some interestingness measures (such as the impartial ones) can be pushed into the mining process without overfitting its output to match the subjective interests of only a small audience, permitting further interestingness analysis that will tailor it to each user's subjective needs.

Data Mining algorithms output patterns. Interestingness discovers the potentially interesting patterns. To be successful, the KDD process needs to extract the interesting patterns from large masses of data. That makes interestingness a very important capability in the extremely data-rich environment in which we live. It is likely that our environment will continue to inundate us with data, making the determination of interestingness critical for success.

References

Adomavicius, G. and Tuzhilin, A. (1997). Discovery of actionable patterns in databases: the action hierarchy approach. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 111–114, Newport Beach, CA, USA. AAAI Press.

Adomavicius, G. and Tuzhilin, A. (2001). Expert-driven validation of rule-based user models in personalization applications. Data Mining and Knowledge Discovery, 5(1/2):33–58.

Aggarwal, C. C. and Yu, P. S. (1998). A new approach to online generation of association rules. Technical Report RC 20899, IBM T. J. Watson Research Center.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. (1996). Advances in Knowledge Discovery and Data Mining, chapter 12: Fast Discovery of Association Rules, pages 307–328. AAAI Press/The MIT Press, Menlo Park, California.

Basu, S., Mooney, R. J., Pasupuleti, K. V., and Ghosh, J. (2001). Evaluating the novelty of text-mined rules using lexical knowledge. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–238, San Francisco, CA, USA.

Bayardo Jr., R. J. and Agrawal, R. (1999). Mining the most interesting rules. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145–154, San Diego, CA.
