Quality Assessment Approaches in Data Mining
Maria Halkidi1 and Michalis Vazirgiannis2
1 Department of Computer Science and Engineering, University of California at Riverside, USA, and Department of Informatics, Athens University of Economics and Business, Greece, mhalkidi@cs.ucr.edu
2 Department of Informatics, Athens University of Economics and Business, Greece, mvazirg@aueb.gr
Summary. The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators for assessing the quality of the results of the analysis. This quality assessment problem is a cornerstone issue of the whole process because: i) the analyzed data may hide interesting patterns that the Data Mining methods are called to reveal, and due to the size of the data, the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever; ii) a number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results; iii) the number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data. In this chapter we introduce the main concepts and quality criteria in Data Mining and present an overview of approaches that have been proposed in the literature for evaluating Data Mining results.
Key words: cluster validity, quality assessment, unsupervised learning, clustering
Introduction
Data Mining is mainly concerned with methodologies for extracting patterns from large data repositories. There are many Data Mining methods, each of which accomplishes a limited set of tasks and produces a particular enumeration of patterns over data sets. The main tasks of Data Mining, which have already been discussed in previous sections, are: i) clustering, ii) classification, iii) association rule extraction, iv) time series analysis, v) regression, and vi) summarization.
Since a Data Mining system can generate, under different conditions, thousands or millions of patterns, questions arise about the quality of the Data Mining results, such as which of the extracted patterns are interesting and which of them represent knowledge.
In general terms, a pattern is interesting if it is easily understood, valid, potentially useful and novel. A pattern is also considered interesting if it validates a hypothesis that a user
seeks to confirm. An interesting pattern represents knowledge. The quality of patterns depends both on the quality of the analyzed data and on the quality of the Data Mining results. Thus several techniques have been developed that aim at evaluating and preparing the data used as input to the Data Mining process, and a number of techniques and measures have been developed for evaluating and interpreting the extracted patterns.
Generally, the term ‘Quality’ in Data Mining corresponds to the following issues:
• Representation of the ‘real’ knowledge included in the analyzed data. The analyzed data hides interesting information that the Data Mining methods are called to reveal. The requirement for evaluating the validity of the extracted knowledge and representing it in a form that domain experts can exploit is stronger than ever.
• Algorithm tuning. A number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results. Also, some Data Mining approaches are considered more suitable for specific application domains (e.g., spatial data, business, marketing, etc.). The selection of a suitable method for a specific data analysis task, in terms of its performance and the quality of its results, is one of the major problems in Data Mining.
• Selection of the most interesting and representative patterns for the data. The number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert analyzing the data. Many of the patterns are either irrelevant or obvious and do not provide new knowledge. The selection of the most representative patterns for a data set is another important issue in terms of quality assessment.
Depending on the Data Mining task, the quality assessment approaches aim at estimating different aspects of quality. Thus, in the case of classification, quality refers to i) the ability of the designed classification model to correctly classify new data samples, ii) the ability of an algorithm to define classification models with high accuracy, and iii) the interestingness of the patterns extracted during the classification process. In clustering, the quality of the extracted patterns is estimated in terms of their validity and their fitness to the analyzed data. The number of groups into which the analyzed data can be partitioned is another important problem in the clustering process. On the other hand, the quality of association rules corresponds to the significance and interestingness of the extracted rules; another quality criterion for association rules is the proportion of the data that the extracted rules represent. Since quality assessment is widely recognized as a major issue in Data Mining, techniques for evaluating the relevance and usefulness of discovered patterns attract the interest of researchers. These techniques are broadly referred to as:
• Interestingness measures in the case of classification or association rule applications.
• Cluster validity indices (or measures) in the case of clustering.
In the following section there is a brief discussion of the role of pre-processing in quality assessment. Then we proceed with the presentation of quality assessment techniques related to the Data Mining tasks. These techniques, depending on the Data Mining task they refer to, are organized into the following categories: i) classifier accuracy techniques and related measures, ii) classification rule interestingness measures, iii) association rule interestingness measures, and iv) cluster validity approaches.
31.1 Data Pre-processing and Quality Assessment
Data in the real world tends to be ‘dirty’. Database users frequently report errors, unusual values, and inconsistencies in the stored data. It is therefore usual for the analyzed data to be:
• incomplete, i.e., lacking attribute values, lacking certain attributes of interest, or containing only aggregate data,
• noisy, i.e., containing errors or outliers,
• inconsistent, i.e., containing discrepancies in the codes used to categorize items or in the names used to refer to the same data items.
When the analyzed data lacks quality, the results of the mining process unavoidably tend to be inaccurate and of little interest to the domain expert. In other words, quality decisions must be based on quality data. Data pre-processing is therefore a major step in the knowledge discovery process. Data pre-processing techniques applied prior to the Data Mining step can help to improve the quality of the analyzed data and, consequently, the accuracy and efficiency of the subsequent mining processes.
There are a number of data pre-processing techniques aimed at substantially improving the overall quality of the extracted patterns (i.e., of the information included in the analyzed data). The most widely used are summarized below (Han and Kamber, 2001):
• Data cleaning, which can be applied to remove noise and correct inconsistencies in the data.
• Data transformation. A common transformation technique is normalization. It is applied to improve the accuracy and efficiency of mining algorithms involving distance measurements (a small sketch of normalization follows this list).
• Data reduction. It is applied to reduce the data size by aggregating or eliminating redundant features.
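To make the transformation step concrete, the following is a minimal sketch of min-max normalization in Python; the attribute name and value range are illustrative only, not taken from the chapter.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numeric values to [new_min, new_max] (min-max normalization)."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                        # constant attribute: map every value to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

# Hypothetical 'income' attribute, rescaled before a distance-based mining step.
incomes = [12000, 35000, 47000, 98000, 54000]
print(min_max_normalize(incomes))   # values now lie in [0, 1]
```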
31.2 Evaluation of Classification Methods
Classification is one of the most commonly applied Data Mining tasks, and a number of classification approaches have been proposed in the literature. These approaches can be compared and evaluated based on the following criteria (Han and Kamber, 2001):
• Classification model accuracy: the ability of the classification model to correctly predict the class of new or previously unseen data.
• Speed: the computational cost of building and using a classification model.
• Robustness: the ability of the model to handle noise or data with missing values and still make correct predictions.
• Scalability: the ability of the method to construct the classification model efficiently given large amounts of data.
• Interpretability: the level of understanding that the constructed model provides.
31.2.1 Classification Model Accuracy
The accuracy of a classification model designed according to a set of training data is one of the most important and widely used criteria in the classification process. It allows one to evaluate how accurately the designed model (classifier) will classify future data (i.e., data on which the model has not been trained). Accuracy also helps in the comparison of different classifiers. The most common techniques for assessing the accuracy of a classifier are:
1. Hold-out method. The given data set is randomly partitioned into two independent sets, a training set and a test set. Usually, two thirds of the data are assigned to the training set and the remaining data are allocated to the test set. The training data are used to define the classification model (classifier). Then the classifier's accuracy is estimated on the test data. Since only a proportion of the data is used to derive the model, the estimate of accuracy tends to be pessimistic. A variation of the hold-out method is the random sub-sampling technique, in which the hold-out method is repeated k times and the overall accuracy is estimated as the average of the accuracies obtained in each iteration.
2. k-fold cross-validation. The initial data set is partitioned into k subsets, called ‘folds’, S = {S1, ..., Sk}. These subsets are mutually exclusive and have approximately equal size. The classifier is iteratively trained and tested k times. In iteration i, the subset Si is reserved as the test set while the remaining subsets are used to train the classifier. The accuracy is then estimated as the overall number of correct classifications over the k iterations, divided by the total number of samples in the initial data (a small sketch of this procedure is given below). A variation of this method is stratified cross-validation, in which the subsets are stratified so that the class distribution of the samples in each subset is approximately the same as that in the initial data set.
3. Bootstrapping and leave-one-out. Leave-one-out is k-fold cross-validation with k set to the number of initial samples: in each iteration the classifier is trained on the k − 1 samples that remain when one sample is held out from the set of initial samples, S, and testing is performed on the held-out sample. Bootstrapping instead samples the training instances uniformly with replacement and tests on the remaining samples.
Though the use of the above-discussed techniques for estimating classification model accuracy increases the overall computation time, they are useful for assessing the quality of classification models and for selecting among several classifiers.
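As an illustration of the procedures above, here is a minimal sketch of k-fold cross-validation in Python; `train` and `classify` are placeholders for whatever learning and prediction functions are being evaluated, not part of any particular library.

```python
import random

def k_fold_accuracy(samples, labels, train, classify, k=10, seed=0):
    """Estimate classifier accuracy as correct classifications over all k held-out folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal-sized, disjoint folds
    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = train([samples[j] for j in train_idx], [labels[j] for j in train_idx])
        correct += sum(1 for j in folds[i] if classify(model, samples[j]) == labels[j])
    return correct / len(samples)
```

The hold-out and random sub-sampling estimates differ only in how the train/test indices are drawn and how many repetitions are averaged.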
Alternatives to the Accuracy Measure
There are cases in which the estimated accuracy rate may mislead one about the quality of a derived classifier. For instance, assume a classifier is trained to classify a set of data as ‘positive’ or ‘negative’. A high accuracy rate may not be meaningful, since the classifier could be correctly classifying only the negative samples, giving no indication of its ability to recognize the positive samples. In such cases, the sensitivity and specificity measures can be used as an alternative to the accuracy measure (Han and Kamber, 2001).
Sensitivity assesses how well the classifier can recognize positive samples and is defined as

Sensitivity = true positive / positive    (31.1)

where true positive is the number of true positive samples and positive is the number of positive samples.
Specificity measures how well the classifier can recognize negative samples. It is defined as

Specificity = true negative / negative    (31.2)

where true negative is the number of true negative samples and negative is the number of negative samples.
The measure that assesses the percentage of samples classified as positive that are actually positive is known as precision. That is,

Precision = true positive / (true positive + false positive)    (31.3)

Based on the above definitions, accuracy can be expressed as a function of sensitivity and specificity:

Accuracy = Sensitivity · positive / (positive + negative) + Specificity · negative / (positive + negative)    (31.4)
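The measures of Eqs. 31.1-31.4 follow directly from the counts of correctly and incorrectly classified positive and negative samples; a small sketch (function and variable names are ours, not from the chapter):

```python
def two_class_measures(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity, specificity, precision and accuracy for a two-class problem (Eqs. 31.1-31.4)."""
    positive = true_pos + false_neg              # number of actually positive samples
    negative = true_neg + false_pos              # number of actually negative samples
    sensitivity = true_pos / positive
    specificity = true_neg / negative
    precision = true_pos / (true_pos + false_pos)
    accuracy = (sensitivity * positive + specificity * negative) / (positive + negative)
    return sensitivity, specificity, precision, accuracy
```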
In the classification problem discussed above it is assumed that each training sample belongs to only one class, i.e., the data are uniquely classified. However, there are cases where it is more reasonable to assume that a sample may belong to more than one class. It is then necessary to derive models that assign data to classes with an attached degree of belief, so that classifiers return a probability distribution over classes rather than a single class label. The accuracy measure is not appropriate in this case, since its definition assumes the unique classification of samples. An alternative is to use heuristics in which a class prediction is considered correct if it agrees with the first or second most probable class.
31.2.2 Evaluating the Accuracy of Classification Algorithms
A classification (learning) algorithm is a function that, given a set of examples and their classes, constructs a classifier. A classifier, in turn, is a function that, given an example, assigns it to one of the predefined classes. A variety of classification methods have already been developed (Han and Kamber, 2001). The main question that arises in the development and application of these algorithms concerns the accuracy of the classifiers they produce.
Below we discuss some of the most common statistical methods proposed (Dietterich, 1998) for answering the following question: given two classification algorithms A and B and a data set S, which algorithm will produce more accurate classifiers when trained on data sets of the same size?
McNemar’s Test
Let S be the available set of data, which is divided into a training set R and a test set T. We consider two algorithms A and B trained on the training set, resulting in two classifiers ˆfA and ˆfB. These classifiers are tested on T, and for each example x ∈ T we record how it was classified. Thus the contingency table presented in Table 31.1 is constructed.

Table 31.1. McNemar's test: contingency table
  n00: number of examples misclassified by both classifiers
  n01: number of examples misclassified by ˆfA but not by ˆfB
  n10: number of examples misclassified by ˆfB but not by ˆfA
  n11: number of examples misclassified by neither ˆfA nor ˆfB
Under the null hypothesis, Ho, the two algorithms should have the same error rate. McNemar's test is based on a χ² test for goodness of fit that compares the distribution of counts expected under the null hypothesis to the observed counts. The expected counts under Ho are presented in Table 31.2.

Table 31.2. Expected counts under Ho
  n00              (n01 + n10)/2
  (n01 + n10)/2    n11

The following statistic, s, is distributed as χ² with 1 degree of freedom. It incorporates a "continuity correction" term (of −1 in the numerator) to account for the fact that the statistic is discrete while the χ² distribution is continuous:

s = (|n10 − n01| − 1)² / (n10 + n01)

According to probability theory (Athanasopoulos, 1991), if the null hypothesis is correct, the probability that the value of the statistic s exceeds χ²1,0.95 (the 0.95 quantile of the χ² distribution with 1 degree of freedom) is less than 0.05, i.e., P(|s| > χ²1,0.95) < 0.05. Then, to compare the algorithms A and B, the defined classifiers ˆfA and ˆfB are tested on T and the value of s is estimated as described above. If |s| > χ²1,0.95, the null hypothesis can be rejected in favor of the hypothesis that the two algorithms have different performance when trained on the particular training set R.
The shortcomings of this test are:
1. It does not directly measure variability due to the choice of the training set or the internal randomness of the learning algorithm, since the algorithms are compared using a single training set R. Thus McNemar's test should only be applied if we consider these sources of variability to be small.
2. It compares the performance of the algorithms on training sets that are substantially smaller than the whole data set. Hence we must assume that the relative difference observed on these training sets will still hold for training sets of size equal to the whole data set.
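A minimal sketch of the statistic and the decision rule described above; the 3.841 threshold is the 0.95 quantile of the chi-square distribution with one degree of freedom.

```python
def mcnemar_statistic(n01, n10):
    """McNemar's chi-square statistic with continuity correction.

    n01: examples misclassified by f_A but not by f_B
    n10: examples misclassified by f_B but not by f_A
    """
    return (abs(n10 - n01) - 1) ** 2 / (n10 + n01)

CHI2_1_095 = 3.841   # 0.95 quantile of the chi-square distribution with 1 degree of freedom

def algorithms_differ(n01, n10):
    """Reject the null hypothesis of equal error rates at the 0.05 level."""
    return mcnemar_statistic(n01, n10) > CHI2_1_095
```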
A Test for the Difference of Two Proportions
This statistical test is based on measuring the difference between the error rate of algorithm A and the error rate of algorithm B (Snedecor and Cochran, 1989). More specifically, let pA = (n00 + n01)/n be the proportion of test examples incorrectly classified by algorithm A, and let pB = (n00 + n10)/n be the proportion of test examples incorrectly classified by algorithm B. The assumption underlying this statistical test is that when algorithm A classifies an example x from the test set T, the probability of misclassification is pA. Then the number of misclassifications among n test examples is a binomial random variable with mean n·pA and variance pA(1 − pA)n.
The binomial distribution can be well approximated by a normal distribution for reasonable values of n. The difference between two independent normally distributed random variables is itself normally distributed. Thus, the quantity pA − pB can be viewed as normally distributed if we assume that the measured error rates pA and pB are independent. Under the null hypothesis, Ho, it has a mean of zero and a standard error of
se = √(2p(1 − p)/n)

where p = (pA + pB)/2 and n is the number of test examples.
Based on the above analysis, we obtain the statistic

z = (pA − pB) / √(2p(1 − p)/n)

which has a standard normal distribution. According to probability theory, if the value of |z| is greater than Z0.975, the probability of incorrectly rejecting the null hypothesis is less than 0.05. Thus the null hypothesis can be rejected if |z| > Z0.975 = 1.96, in favor of the hypothesis that the two algorithms have different performances. There are several problems with this statistic, two of the most important being:
1. The probabilities pA and pB are measured on the same test set, and thus they are not independent.
2. The test does not measure variation due to the choice of the training set or the internal variation of the learning algorithm. Also, it measures the performance of the algorithms on training sets of size significantly smaller than the whole data set.
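The corresponding computation is straightforward; a small sketch, assuming pA and pB are the measured error proportions on a common test set of n examples:

```python
import math

def two_proportion_z(p_a, p_b, n):
    """z statistic for the difference between two error proportions under the null hypothesis."""
    p = (p_a + p_b) / 2.0                       # pooled error rate under the null hypothesis
    se = math.sqrt(2.0 * p * (1.0 - p) / n)     # standard error of p_a - p_b under the null
    return (p_a - p_b) / se

# Reject the null hypothesis at the 0.05 level if abs(z) > 1.96.
```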
The Resampled Paired t Test
The resampled paired t test is the most popular in machine learning. Usually, the test conducts a series of 30 trials. In each trial, the available sample S is randomly divided into a training set R (typically two thirds of the data) and a test set T. The algorithms A and B are both trained on R and the resulting classifiers are tested on T. Let p(i)_A and p(i)_B be the observed proportions of test examples misclassified by algorithms A and B, respectively, during the i-th trial. If we assume that the 30 differences p(i) = p(i)_A − p(i)_B were drawn independently from a normal distribution, then we can apply Student's t test by computing the statistic

t = p̄ · √n / √( ∑_{i=1}^{n} (p(i) − p̄)² / (n − 1) )

where p̄ = (1/n) · ∑_{i=1}^{n} p(i). Under the null hypothesis this statistic has a t distribution with n − 1 degrees of freedom. For 30 trials, the null hypothesis can be rejected if |t| > t29,0.975 = 2.045. The main drawbacks of this approach are:
1. Since p(i)_A and p(i)_B are not independent, the difference p(i) will not have a normal distribution.
2. The p(i)'s are not independent, because the test and training sets in the trials overlap.
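Given the observed differences p(i), the statistic can be computed as follows; the loop over the 30 trials that produces these differences is omitted in this sketch.

```python
import math

def resampled_paired_t(differences):
    """Student's t statistic for the paired differences p(i) = p_A(i) - p_B(i)."""
    n = len(differences)
    p_bar = sum(differences) / n
    variance = sum((d - p_bar) ** 2 for d in differences) / (n - 1)   # unbiased sample variance
    return p_bar * math.sqrt(n) / math.sqrt(variance)

# For n = 30 trials, reject the null hypothesis at the 0.05 level if abs(t) > 2.045.
```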
The k-fold Cross-validated Paired t Test
This approach is similar to the resampled paired t test, except that instead of constructing each pair of training and test sets by randomly dividing S, the data set is randomly divided into k disjoint sets of equal size, T1, T2, ..., Tk. Then k trials are conducted. In each trial, the test set is Ti and the training set is the union of all the other sets Tj, j ≠ i. The t statistic is computed as described in Section 31.2.2. The advantage of this approach is that each test set is independent of the others. However, there is the problem that the training sets overlap. This overlap may prevent this statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of the other training sets.