Quality Assessment Approaches in Data Mining
Maria Halkidi1 and Michalis Vazirgiannis2
1 Department of Computer Science and Engineering, University of California at Riverside, USA, and Department of Informatics, Athens University of Economics and Business, Greece, mhalkidi@cs.ucr.edu
2 Department of Informatics, Athens University of Economics and Business, Greece, mvazirg@aueb.gr
Summary. The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators for assessing the quality of the results of the analysis. This quality assessment problem is a cornerstone issue of the whole process because: i) the analyzed data may hide interesting patterns that the Data Mining methods are called to reveal, and due to the size of the data, the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever; ii) a number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results; iii) the number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data. In this chapter we introduce the main concepts and quality criteria in Data Mining and present an overview of approaches that have been proposed in the literature for evaluating Data Mining results.
Key words: cluster validity, quality assessment, unsupervised learning, clustering
Introduction
Data Mining is mainly concerned with methodologies for extracting patterns from large data repositories. There are many Data Mining methods, each of which accomplishes a limited set of tasks and produces a particular enumeration of patterns over data sets. The main tasks of Data Mining, which have already been discussed in previous sections, are: i) clustering, ii) classification, iii) association rule extraction, iv) time series analysis, v) regression, and vi) summarization.
Since a Data Mining system can generate, under different conditions, thousands or millions of patterns, questions arise about the quality of the Data Mining results, such as which of the extracted patterns are interesting and which of them represent knowledge.
In general terms, a pattern is interesting if it is easily understood, valid, potentially useful and novel. A pattern is also considered interesting if it validates a hypothesis that a user
seeks to confirm. An interesting pattern represents knowledge. The quality of patterns depends both on the quality of the analyzed data and on the quality of the Data Mining results. Thus several techniques have been developed that aim at evaluating and preparing the data used as input to the Data Mining process, and a number of techniques and measures have been developed for evaluating and interpreting the extracted patterns.
Generally, the term ‘Quality’ in Data Mining corresponds to the following issues:
• Representation of the ‘real’ knowledge included in the analyzed data. The analyzed data hides interesting information that the Data Mining methods are called to reveal. The requirement for evaluating the validity of the extracted knowledge and representing it in a form that domain experts can exploit is stronger than ever.
• Algorithm tuning. A number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results. Also, some Data Mining approaches are considered more suitable for specific application domains (e.g., spatial data, business, marketing, etc.). The selection of a suitable method for a specific data analysis task, in terms of its performance and the quality of its results, is one of the major problems in Data Mining.
• Selection of the most interesting and representative patterns for the data. The number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert analyzing the data. Many of the patterns are either irrelevant or obvious and do not provide new knowledge. The selection of the most representative patterns for a data set is another important issue in terms of quality assessment.
Depending on the Data Mining task, the quality assessment approaches aim at estimating different aspects of quality. Thus, in the case of classification, quality refers to i) the ability of the designed classification model to correctly classify new data samples, ii) the ability of an algorithm to define classification models with high accuracy, and iii) the interestingness of the patterns extracted during the classification process. In clustering, the quality of the extracted patterns is estimated in terms of their validity and their fitness to the analyzed data. The number of groups into which the analyzed data can be partitioned is another important problem in the clustering process. On the other hand, the quality of association rules corresponds to the significance and interestingness of the extracted rules; another quality criterion for association rules is the proportion of the data that the extracted rules represent. Since quality assessment is widely recognized as a major issue in Data Mining, techniques for evaluating the relevance and usefulness of discovered patterns attract the interest of researchers. These techniques are broadly referred to as:
• Interestingness measures in the case of classification or association rule applications.
• Cluster validity indices (or measures) in the case of clustering.
In the following section there is a brief discussion of the role of pre-processing in quality assessment. Then we proceed with the presentation of quality assessment techniques related to the Data Mining tasks. These techniques, depending on the Data Mining task they refer to, are organized into the following categories: i) classifier accuracy techniques and related measures, ii) classification rule interestingness measures, iii) association rule interestingness measures, and iv) cluster validity approaches.
31.1 Data Pre-processing and Quality Assessment
Data in the real world tends to be ‘dirty’. Database users frequently report errors, unusual values, and inconsistencies in the stored data. It is therefore usual for the analyzed data to be:
• incomplete, i.e., lacking attribute values, lacking certain attributes of interest, or containing only aggregate data,
• noisy, i.e., containing errors or outliers,
• inconsistent, i.e., containing discrepancies in the codes used to categorize items or in the names used to refer to the same data items.
When the analyzed data lacks quality, the results of the mining process unavoidably tend to be inaccurate and of little interest to the domain expert. In other words, quality decisions must be based on quality data. Data pre-processing is therefore a major step in the knowledge discovery process. Data pre-processing techniques applied prior to the Data Mining step can help to improve the quality of the analyzed data and, consequently, the accuracy and efficiency of the subsequent mining processes.
There are a number of data pre-processing techniques aimed at substantially improving the overall quality of the extracted patterns (i.e., of the information included in the analyzed data). The most widely used are summarized below (Han and Kamber, 2001):
• Data cleaning, which can be applied to remove noise and correct inconsistencies in the data.
• Data transformation. A common transformation technique is normalization. It is applied to improve the accuracy and efficiency of mining algorithms involving distance measurements (a small sketch of normalization follows this list).
• Data reduction. It is applied to reduce the data size by aggregating or eliminating redundant features.
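To make the transformation step concrete, the following is a minimal sketch of min-max normalization in Python; the attribute name and value range are illustrative only, not taken from the chapter.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numeric values to [new_min, new_max] (min-max normalization)."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                        # constant attribute: map every value to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

# Hypothetical 'income' attribute, rescaled before a distance-based mining step.
incomes = [12000, 35000, 47000, 98000, 54000]
print(min_max_normalize(incomes))   # values now lie in [0, 1]
```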
31.2 Evaluation of Classification Methods
Classification is one of the most commonly applied Data Mining tasks, and a number of classification approaches have been proposed in the literature. These approaches can be compared and evaluated based on the following criteria (Han and Kamber, 2001):
• Classification model accuracy: the ability of the classification model to correctly predict the class of new or previously unseen data.
• Speed: the computational cost of building and using a classification model.
• Robustness: the ability of the model to handle noise or data with missing values and still make correct predictions.
• Scalability: the ability of the method to construct the classification model efficiently given large amounts of data.
• Interpretability: the level of understanding that the constructed model provides.
31.2.1 Classification Model Accuracy
The accuracy of a classification model designed according to a set of training data is one of the most important and widely used criteria in the classification process. It allows one to evaluate how accurately the designed model (classifier) will classify future data (i.e., data on which the model has not been trained). Accuracy also helps in the comparison of different classifiers. The most common techniques for assessing the accuracy of a classifier are:
1. Hold-out method. The given data set is randomly partitioned into two independent sets, a training set and a test set. Usually, two thirds of the data are assigned to the training set and the remaining data are allocated to the test set. The training data are used to define the classification model (classifier). Then the classifier's accuracy is estimated on the test data. Since only a proportion of the data is used to derive the model, the estimate of accuracy tends to be pessimistic. A variation of the hold-out method is the random sub-sampling technique, in which the hold-out method is repeated k times and the overall accuracy is estimated as the average of the accuracies obtained in each iteration.
2. k-fold cross-validation. The initial data set is partitioned into k subsets, called ‘folds’, S = {S1, ..., Sk}. These subsets are mutually exclusive and have approximately equal size. The classifier is iteratively trained and tested k times. In iteration i, the subset Si is reserved as the test set while the remaining subsets are used to train the classifier. The accuracy is then estimated as the overall number of correct classifications over the k iterations, divided by the total number of samples in the initial data (a small sketch of this procedure is given below). A variation of this method is stratified cross-validation, in which the subsets are stratified so that the class distribution of the samples in each subset is approximately the same as that in the initial data set.
3. Bootstrapping and leave-one-out. Leave-one-out is k-fold cross-validation with k set to the number of initial samples: in each iteration the classifier is trained on the k − 1 samples that remain when one sample is held out from the set of initial samples, S, and testing is performed on the held-out sample. Bootstrapping instead samples the training instances uniformly with replacement and tests on the remaining samples.
Though the use of the above-discussed techniques for estimating classification model accuracy increases the overall computation time, they are useful for assessing the quality of classification models and for selecting among several classifiers.
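As an illustration of the procedures above, here is a minimal sketch of k-fold cross-validation in Python; `train` and `classify` are placeholders for whatever learning and prediction functions are being evaluated, not part of any particular library.

```python
import random

def k_fold_accuracy(samples, labels, train, classify, k=10, seed=0):
    """Estimate classifier accuracy as correct classifications over all k held-out folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal-sized, disjoint folds
    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = train([samples[j] for j in train_idx], [labels[j] for j in train_idx])
        correct += sum(1 for j in folds[i] if classify(model, samples[j]) == labels[j])
    return correct / len(samples)
```

The hold-out and random sub-sampling estimates differ only in how the train/test indices are drawn and how many repetitions are averaged.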
Alternatives to the Accuracy Measure
There are cases in which the estimated accuracy rate may mislead one about the quality of a derived classifier. For instance, assume a classifier is trained to classify a set of data as ‘positive’ or ‘negative’. A high accuracy rate may not be meaningful, since the classifier could be correctly classifying only the negative samples, giving no indication of its ability to recognize the positive samples. In such cases, the sensitivity and specificity measures can be used as an alternative to the accuracy measure (Han and Kamber, 2001).
Sensitivity assesses how well the classifier can recognize positive samples and is defined as

Sensitivity = true positive / positive    (31.1)

where true positive is the number of true positive samples and positive is the number of positive samples.
Specificity measures how well the classifier can recognize negative samples. It is defined as

Specificity = true negative / negative    (31.2)

where true negative is the number of true negative samples and negative is the number of negative samples.
The measure that assesses the percentage of samples classified as positive that are actually positive is known as precision. That is,

Precision = true positive / (true positive + false positive)    (31.3)

Based on the above definitions, accuracy can be expressed as a function of sensitivity and specificity:

Accuracy = Sensitivity · positive / (positive + negative) + Specificity · negative / (positive + negative)    (31.4)
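The measures of Eqs. 31.1-31.4 follow directly from the counts of correctly and incorrectly classified positive and negative samples; a small sketch (function and variable names are ours, not from the chapter):

```python
def two_class_measures(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity, specificity, precision and accuracy for a two-class problem (Eqs. 31.1-31.4)."""
    positive = true_pos + false_neg              # number of actually positive samples
    negative = true_neg + false_pos              # number of actually negative samples
    sensitivity = true_pos / positive
    specificity = true_neg / negative
    precision = true_pos / (true_pos + false_pos)
    accuracy = (sensitivity * positive + specificity * negative) / (positive + negative)
    return sensitivity, specificity, precision, accuracy
```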
In the classification problem discussed above it is assumed that each training sample belongs to only one class, i.e., the data are uniquely classified. However, there are cases where it is more reasonable to assume that a sample may belong to more than one class. It is then necessary to derive models that assign data to classes with an attached degree of belief, so that classifiers return a probability distribution over classes rather than a single class label. The accuracy measure is not appropriate in this case, since its definition assumes the unique classification of samples. An alternative is to use heuristics in which a class prediction is considered correct if it agrees with the first or second most probable class.
31.2.2 Evaluating the Accuracy of Classification Algorithms
A classification (learning) algorithm is a function that, given a set of examples and their classes, constructs a classifier. A classifier, in turn, is a function that, given an example, assigns it to one of the predefined classes. A variety of classification methods have already been developed (Han and Kamber, 2001). The main question that arises in the development and application of these algorithms concerns the accuracy of the classifiers they produce.
Below we discuss some of the most common statistical methods proposed (Dietterich, 1998) for answering the following question: given two classification algorithms A and B and a data set S, which algorithm will produce more accurate classifiers when trained on data sets of the same size?
McNemar’s Test
Let S be the available set of data, which is divided into a training set R and a test set T. We consider two algorithms A and B trained on the training set, resulting in two classifiers ˆfA and ˆfB. These classifiers are tested on T, and for each example x ∈ T we record how it was classified. Thus the contingency table presented in Table 31.1 is constructed.

Table 31.1. McNemar's test: contingency table
  n00: number of examples misclassified by both classifiers
  n01: number of examples misclassified by ˆfA but not by ˆfB
  n10: number of examples misclassified by ˆfB but not by ˆfA
  n11: number of examples misclassified by neither ˆfA nor ˆfB
Under the null hypothesis, Ho, the two algorithms should have the same error rate. McNemar's test is based on a χ² test for goodness of fit that compares the distribution of counts expected under the null hypothesis to the observed counts. The expected counts under Ho are presented in Table 31.2.

Table 31.2. Expected counts under Ho
  n00              (n01 + n10)/2
  (n01 + n10)/2    n11

The following statistic, s, is distributed as χ² with 1 degree of freedom. It incorporates a "continuity correction" term (of −1 in the numerator) to account for the fact that the statistic is discrete while the χ² distribution is continuous:

s = (|n10 − n01| − 1)² / (n10 + n01)

According to probability theory (Athanasopoulos, 1991), if the null hypothesis is correct, the probability that the value of the statistic s exceeds χ²1,0.95 (the 0.95 quantile of the χ² distribution with 1 degree of freedom) is less than 0.05, i.e., P(|s| > χ²1,0.95) < 0.05. Then, to compare the algorithms A and B, the defined classifiers ˆfA and ˆfB are tested on T and the value of s is estimated as described above. If |s| > χ²1,0.95, the null hypothesis can be rejected in favor of the hypothesis that the two algorithms have different performance when trained on the particular training set R.
The shortcomings of this test are:
1. It does not directly measure variability due to the choice of the training set or the internal randomness of the learning algorithm, since the algorithms are compared using a single training set R. Thus McNemar's test should only be applied if we consider these sources of variability to be small.
2. It compares the performance of the algorithms on training sets that are substantially smaller than the whole data set. Hence we must assume that the relative difference observed on these training sets will still hold for training sets of size equal to the whole data set.
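A minimal sketch of the statistic and the decision rule described above; the 3.841 threshold is the 0.95 quantile of the chi-square distribution with one degree of freedom.

```python
def mcnemar_statistic(n01, n10):
    """McNemar's chi-square statistic with continuity correction.

    n01: examples misclassified by f_A but not by f_B
    n10: examples misclassified by f_B but not by f_A
    """
    return (abs(n10 - n01) - 1) ** 2 / (n10 + n01)

CHI2_1_095 = 3.841   # 0.95 quantile of the chi-square distribution with 1 degree of freedom

def algorithms_differ(n01, n10):
    """Reject the null hypothesis of equal error rates at the 0.05 level."""
    return mcnemar_statistic(n01, n10) > CHI2_1_095
```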
A Test for the Difference of Two Proportions
This statistical test is based on measuring the difference between the error rate of algorithm A and the error rate of algorithm B (Snedecor and Cochran, 1989). More specifically, let pA = (n00 + n01)/n be the proportion of test examples incorrectly classified by algorithm A, and let pB = (n00 + n10)/n be the proportion of test examples incorrectly classified by algorithm B. The assumption underlying this statistical test is that when algorithm A classifies an example x from the test set T, the probability of misclassification is pA. Then the number of misclassifications among n test examples is a binomial random variable with mean n·pA and variance pA(1 − pA)n.
The binomial distribution can be well approximated by a normal distribution for reasonable values of n. The difference between two independent normally distributed random variables is itself normally distributed. Thus, the quantity pA − pB can be viewed as normally distributed if we assume that the measured error rates pA and pB are independent. Under the null hypothesis, Ho, it has a mean of zero and a standard error of
se = √(2p(1 − p)/n)

where p = (pA + pB)/2 and n is the number of test examples.
Based on the above analysis, we obtain the statistic

z = (pA − pB) / √(2p(1 − p)/n)

which has a standard normal distribution. According to probability theory, if the value of |z| is greater than Z0.975, the probability of incorrectly rejecting the null hypothesis is less than 0.05. Thus the null hypothesis can be rejected if |z| > Z0.975 = 1.96, in favor of the hypothesis that the two algorithms have different performances. There are several problems with this statistic, two of the most important being:
1. The probabilities pA and pB are measured on the same test set, and thus they are not independent.
2. The test does not measure variation due to the choice of the training set or the internal variation of the learning algorithm. Also, it measures the performance of the algorithms on training sets of size significantly smaller than the whole data set.
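The corresponding computation is straightforward; a small sketch, assuming pA and pB are the measured error proportions on a common test set of n examples:

```python
import math

def two_proportion_z(p_a, p_b, n):
    """z statistic for the difference between two error proportions under the null hypothesis."""
    p = (p_a + p_b) / 2.0                       # pooled error rate under the null hypothesis
    se = math.sqrt(2.0 * p * (1.0 - p) / n)     # standard error of p_a - p_b under the null
    return (p_a - p_b) / se

# Reject the null hypothesis at the 0.05 level if abs(z) > 1.96.
```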
The Resampled Paired t Test
The resampled paired t test is the most popular in machine learning. Usually, the test conducts a series of 30 trials. In each trial, the available sample S is randomly divided into a training set R (typically two thirds of the data) and a test set T. The algorithms A and B are both trained on R and the resulting classifiers are tested on T. Let p(i)_A and p(i)_B be the observed proportions of test examples misclassified by algorithms A and B, respectively, during the i-th trial. If we assume that the 30 differences p(i) = p(i)_A − p(i)_B were drawn independently from a normal distribution, then we can apply Student's t test by computing the statistic

t = p̄ · √n / √( ∑_{i=1}^{n} (p(i) − p̄)² / (n − 1) )

where p̄ = (1/n) · ∑_{i=1}^{n} p(i). Under the null hypothesis this statistic has a t distribution with n − 1 degrees of freedom. For 30 trials, the null hypothesis can be rejected if |t| > t29,0.975 = 2.045. The main drawbacks of this approach are:
1. Since p(i)_A and p(i)_B are not independent, the difference p(i) will not have a normal distribution.
2. The p(i)'s are not independent, because the test and training sets in the trials overlap.
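Given the observed differences p(i), the statistic can be computed as follows; the loop over the 30 trials that produces these differences is omitted in this sketch.

```python
import math

def resampled_paired_t(differences):
    """Student's t statistic for the paired differences p(i) = p_A(i) - p_B(i)."""
    n = len(differences)
    p_bar = sum(differences) / n
    variance = sum((d - p_bar) ** 2 for d in differences) / (n - 1)   # unbiased sample variance
    return p_bar * math.sqrt(n) / math.sqrt(variance)

# For n = 30 trials, reject the null hypothesis at the 0.05 level if abs(t) > 2.045.
```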
The k-fold Cross-validated Paired t Test
This approach is similar to the resampled paired t test, except that instead of constructing each pair of training and test sets by randomly dividing S, the data set is randomly divided into k disjoint sets of equal size, T1, T2, ..., Tk. Then k trials are conducted. In each trial, the test set is Ti and the training set is the union of all the other sets Tj, j ≠ i. The t statistic is computed as described in Section 31.2.2. The advantage of this approach is that each test set is independent of the others. However, there is the problem that the training sets overlap. This overlap may prevent this statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of the other training sets.