Advances in Computer Vision and Pattern Recognition
For further volumes:
www.springer.com/series/4205
T. Ravindra Babu · M. Narasimha Murty · S.V. Subrahmanya
Compression Schemes
for Mining Large Datasets
A Machine Learning Perspective
Prof. Sameer Singh
Rail Vision Europe Ltd
Castle Donington
Leicestershire, UK
Dr. Sing Bing Kang
Interactive Visual Media Group, Microsoft Research
Redmond, WA, USA
ISSN 2191-6586 ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-1-4471-5606-2 ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013954523
© Springer-Verlag London 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

We come across a number of celebrated textbooks on Data Mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain. The book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present schemes that work in the compression domain and demonstrate their working on one or more practical datasets in each case. In this process, we cover important data mining paradigms. This is intended to provide a practitioners' viewpoint of compression schemes in data mining. The work presented is based on the authors' work on related areas over the last few years. We organized each chapter to contain context setting, background work as part of discussion, proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset, and discussion of results. At the end of each chapter, as part of bibliographic notes, we discuss relevant literature and directions for further study.
Data Mining focuses on efficient algorithms to generate abstraction from large datasets. The objective of these algorithms is to find interesting patterns for further use with the fewest possible visits of the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made to arrive at a much smaller subset of the original dataset that is representative of the entire data and contains the attributes characterizing the data. The ability to generate an abstraction from a small representative set of patterns and features that is as accurate as that obtained with the entire dataset leads to efficiency in terms of both space and time. Important data mining paradigms include clustering, classification, association rule mining, etc. We present a discussion on data mining paradigms in Chap. 2.
In our present work, in addition to the data mining paradigms discussed in Chap. 2, we also focus on another paradigm, viz., the ability to generate abstraction in the compressed domain without having to decompress. Such a compression leads to less storage and reduces the computation cost. In the book, we consider both lossy and nonlossy compression schemes. In Chap. 3, we present a nonlossy compression scheme based on run-length encoding of patterns with binary-valued features. The scheme is also applicable to floating-point-valued features that are suitably quantized to binary values. The chapter presents an algorithm that computes the dissimilarity in the compressed domain directly. Theoretical notes are provided for the work. We present applications of the scheme in multiple domains.
It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine this aspect in Chap. 4. The work in the chapter exploits the concept of minimum feature or item support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short length, and identify and eliminate repeating subsequences. We represent the pattern by those unique subsequences, leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more frequent subsequences, thereby achieving further compaction. We demonstrate the working of the scheme on large handwritten digit data.
Pattern clustering can be construed as compaction of data. Feature selection also reduces dimensionality, thereby resulting in pattern compression. It is interesting to explore whether they can be achieved simultaneously. We examine this in Chap. 5.
We consider an efficient clustering scheme that requires a single database visit to generate prototypes. We consider a lossy compression scheme for feature reduction. We also examine whether there is a preferred ordering of prototype selection and feature selection in achieving compaction as well as good classification accuracy on unseen patterns. We examine multiple combinations of such sequencing. We demonstrate the working of the scheme on handwritten digit data and intrusion detection data.
Domain knowledge forms an important input for efficient compaction. Such knowledge could either be provided by a human expert or generated through an appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify 10-class data through a proposed decision tree of depth 4. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of such a scheme.
Dimensionality reduction leads to compaction. With algorithms such as run-length-encoded compression, it is instructive to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy. In Chap. 7, we discuss concepts and methods of feature selection and extraction. We propose an efficient implementation of simple genetic algorithms by integrating compressed data classification and frequent features. We provide an insightful discussion of the sensitivity of various genetic operators and of frequent-item support on the final selection of the optimal feature set.
Divide-and-conquer has been one important direction for dealing with large datasets. With reducing cost and increasing ability to collect and store enormous amounts of data, we have massive databases at our disposal for making sense of them and generating abstractions of potential business value. The term Big Data has become synonymous with streaming multisource data such as numerical data, messages, and audio and video data. There is an increasing need to process such data in real or near-real time and generate business value in this process. In Chap. 8, we propose schemes that exploit multiagent systems to solve these problems. We discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.
The authors would like to express their sincere gratitude to their respective families for their cooperation.
T. Ravindra Babu and S.V. Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R) that enabled them to carry out academic and applied research resulting in articles and books.
T. Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support. He dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M. Narasimha Murty would like to acknowledge the support of his parents. S.V. Subrahmanya would like to thank his wife D.R. Sudha for her patient support. The authors would like to record their sincere appreciation of the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.
T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya
Bangalore, India
Contents

1 Introduction 1
1.1 Data Mining and Data Compression 1
1.1.1 Data Mining Tasks 1
1.1.2 Data Compression 2
1.1.3 Compression Using Data Mining Tasks 2
1.2 Organization 3
1.2.1 Data Mining Tasks 3
1.2.2 Abstraction in Nonlossy Compression Domain 5
1.2.3 Lossy Compression Scheme and Dimensionality Reduction 6
1.2.4 Compaction Through Simultaneous Prototype and Feature Selection 6
1.2.5 Use of Domain Knowledge in Data Compaction 7
1.2.6 Compression Through Dimensionality Reduction 7
1.2.7 Big Data, Multiagent Systems, and Abstraction 8
1.3 Summary 9
1.4 Bibliographical Notes 9
References 9
2 Data Mining Paradigms 11
2.1 Introduction 11
2.2 Clustering 12
2.2.1 Clustering Algorithms 13
2.2.2 Single-Link Algorithm 14
2.2.3 k-Means Algorithm 15
2.3 Classification 17
2.4 Association Rule Mining 22
2.4.1 Frequent Itemsets 23
2.4.2 Association Rules 25
2.5 Mining Large Datasets 26
2.5.1 Possible Solutions 27
2.5.2 Clustering 28
2.5.3 Classification 34
2.5.4 Frequent Itemset Mining 39
2.6 Summary 42
2.7 Bibliographic Notes 43
References 44
3 Run-Length-Encoded Compression Scheme 47
3.1 Introduction 47
3.2 Compression Domain for Large Datasets 48
3.3 Run-Length-Encoded Compression Scheme 49
3.3.1 Discussion on Relevant Terms 49
3.3.2 Important Properties and Algorithm 50
3.4 Experimental Results 55
3.4.1 Application to Handwritten Digit Data 55
3.4.2 Application to Genetic Algorithms 57
3.4.3 Some Applicable Scenarios in Data Mining 59
3.5 Invariance of VC Dimension in the Original and the Compressed Forms 60
3.6 Minimum Description Length 63
3.7 Summary 65
3.8 Bibliographic Notes 65
References 66
4 Dimensionality Reduction by Subsequence Pruning 67
4.1 Introduction 67
4.2 Lossy Data Compression for Clustering and Classification 67
4.3 Background and Terminology 68
4.4 Preliminary Data Analysis 73
4.4.1 Huffman Coding and Lossy Compression 74
4.4.2 Analysis of Subsequences and Their Frequency in a Class 79
4.5 Proposed Scheme 81
4.5.1 Initialization 83
4.5.2 Frequent Item Generation 83
4.5.3 Generation of Coded Training Data 84
4.5.4 Subsequence Identification and Frequency Computation 84
4.5.5 Pruning of Subsequences 85
4.5.6 Generation of Encoded Test Data 85
4.5.7 Classification Using Dissimilarity Based on Rough Set Concept 86
4.5.8 Classification Using k-Nearest Neighbor Classifier 87
4.6 Implementation of the Proposed Scheme 87
4.6.1 Choice of Parameters 87
4.6.2 Frequent Items and Subsequences 88
4.6.3 Compressed Data and Pruning of Subsequences 89
4.6.4 Generation of Compressed Training and Test Data 91
4.7 Experimental Results 91
4.8 Summary 92
4.9 Bibliographic Notes 93
References 94
5 Data Compaction Through Simultaneous Selection of Prototypes and Features 95
5.1 Introduction 95
5.2 Prototype Selection, Feature Selection, and Data Compaction 96
5.2.1 Data Compression Through Prototype and Feature Selection 99
5.3 Background Material 100
5.3.1 Computation of Frequent Features 103
5.3.2 Distinct Subsequences 104
5.3.3 Impact of Support on Distinct Subsequences 104
5.3.4 Computation of Leaders 105
5.3.5 Classification of Validation Data 105
5.4 Preliminary Analysis 105
5.5 Proposed Approaches 107
5.5.1 Patterns with Frequent Items Only 107
5.5.2 Cluster Representatives Only 108
5.5.3 Frequent Items Followed by Clustering 109
5.5.4 Clustering Followed by Frequent Items 109
5.6 Implementation and Experimentation 110
5.6.1 Handwritten Digit Data 110
5.6.2 Intrusion Detection Data 116
5.6.3 Simultaneous Selection of Patterns and Features 120
5.7 Summary 122
5.8 Bibliographic Notes 123
References 123
6 Domain Knowledge-Based Compaction 125
6.1 Introduction 125
6.2 Multicategory Classification 126
6.3 Support Vector Machine (SVM) 126
6.4 Adaptive Boosting 128
6.4.1 Adaptive Boosting on Prototypes for Data Mining Applications 129
6.5 Decision Trees 130
6.6 Preliminary Analysis Leading to Domain Knowledge 131
6.6.1 Analytical View 132
6.6.2 Numerical Analysis 133
6.6.3 Confusion Matrix 134
6.7 Proposed Method 136
6.7.1 Knowledge-Based (KB) Tree 136
6.8 Experimentation and Results 137
6.8.1 Experiments Using SVM 138
6.8.2 Experiments Using AdaBoost 140
6.8.3 Results with AdaBoost on Benchmark Data 141
6.9 Summary 143
6.10 Bibliographic Notes 144
References 144
7 Optimal Dimensionality Reduction 147
7.1 Introduction 147
7.2 Feature Selection 149
7.2.1 Based on Feature Ranking 149
7.2.2 Ranking Features 150
7.3 Feature Extraction 152
7.3.1 Performance 154
7.4 Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms 154
7.4.1 An Overview of Genetic Algorithms 155
7.4.2 Proposed Schemes 158
7.4.3 Preliminary Analysis 161
7.4.4 Experimental Results 163
7.4.5 Summary 170
7.5 Bibliographical Notes 171
References 171
8 Big Data Abstraction Through Multiagent Systems 173
8.1 Introduction 173
8.2 Big Data 173
8.3 Conventional Massive Data Systems 174
8.3.1 Map-Reduce 174
8.3.2 PageRank 176
8.4 Big Data and Data Mining 176
8.5 Multiagent Systems 177
8.5.1 Agent Mining Interaction 177
8.5.2 Big Data Analytics 178
8.6 Proposed Multiagent Systems 178
8.6.1 Multiagent System for Data Reduction 178
8.6.2 Multiagent System for Attribute Reduction 179
8.6.3 Multiagent System for Heterogeneous Data Access 180
8.6.4 Multiagent System for Agile Processing 181
8.7 Summary 182
8.8 Bibliographic Notes 182
References 183
Appendix Intrusion Detection Dataset—Binary Representation 185
A.1 Data Description and Preliminary Analysis 185
A.2 Bibliographic Notes 189
References 189
Glossary 191
Index 193
Acronyms

BIRCH Balanced Iterative Reducing and Clustering using Hierarchies
CART Classification and regression trees
CLARANS Clustering Large Applications based on RANdomized Search
kNNC k-Nearest-Neighbor Classifier
MAD Analysis Magnetic, Agile, and Deep Analysis
PCA Principal Component Analysis
Chapter 1
Introduction
In this book, we deal with data mining and compression; specifically, we deal with using several data mining tasks directly on the compressed data.
1.1 Data Mining and Data Compression
Data mining is concerned with generating an abstraction of the input dataset using a mining task.
1.1.1 Data Mining Tasks
Important data mining tasks are:
1. Clustering: Clustering is the process of grouping data points so that points in each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more representative patterns. So, clustering is a kind of compression where details of the data are ignored and only cluster representatives are used in further processing or decision making.
2. Classification: In classification, a labeled training dataset is used to learn a model or classifier. This learnt model is used to label a test (unlabeled) pattern; this process is called classification.
3. Dimensionality Reduction: A majority of the classification and clustering algorithms fail to produce the expected results when dealing with high-dimensional datasets. Also, computational requirements in the form of time and space can increase enormously with dimensionality. This prompts reduction of the dimensionality of the dataset; it is reduced either by using feature selection or feature extraction. In feature selection, an appropriate subset of features is selected, and in feature extraction, a subset in some transformed space is selected.
4. Regression or Function Prediction: Here a functional form for a variable y is learnt (where y = f(X)) from given pairs (X, y); the learnt function is used to predict the values of y for new values of X. This problem may be viewed as a generalization of the classification problem. In classification, the number of class labels is finite, whereas in the regression setting, y can take infinitely many values, typically y ∈ R.
5. Association Rule Mining: Even though it is of relatively recent origin, it is the earliest introduced task in data mining and is responsible for bringing visibility to the area of data mining. In association rule mining, we are interested in finding out how frequently two subsets of items are associated.
1.1.2 Data Compression
Another important topic in this book is data compression. A compression scheme CS may be viewed as a function from the set of patterns X to a set of compressed patterns X′. It may be viewed as

CS : X ⇒ X′.

Specifically, CS(x) = x′ for x ∈ X and x′ ∈ X′. In a more general setting, we may view CS as giving output x′ using x and some knowledge structure or a dictionary K. So, CS(x, K) = x′ for x ∈ X and x′ ∈ X′. Sometimes, a dictionary is used in compressing and uncompressing the data. Schemes for compressing data are the following:
• Lossless Schemes: These schemes are such that CS(x) = x′ and there is an inverse CS−1 such that CS−1(x′) = x. For example, consider a binary string 00001111 (x) as an input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of 4 zeros, and the second 4 corresponds to a run of 4 ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless as we get x′ from x using run-length encoding and x from x′ using decoding.
• Lossy Schemes: In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.
We show some example compression schemes in Fig. 1.1.

Fig. 1.1 Compression schemes
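As a concrete illustration of the run-length example above (00001111 ↔ 44), here is a minimal Python sketch of lossless run-length coding. It is not from the book; it assumes the simple convention that runs alternate starting from '0' (a zero-length leading run is emitted if the string starts with '1').

```python
def rle_encode(bits: str) -> list[int]:
    """Encode a binary string as run lengths of alternating symbols,
    assuming the first run is a run of '0's."""
    runs = []
    current, count = '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs: list[int]) -> str:
    """Invert rle_encode: reproduce the original binary string exactly."""
    out, bit = [], '0'
    for r in runs:
        out.append(bit * r)
        bit = '1' if bit == '0' else '0'
    return ''.join(out)

x = "00001111"
coded = rle_encode(x)          # [4, 4], the "44" of the text
assert rle_decode(coded) == x  # lossless: x is recovered from x'
print(coded, rle_decode(coded))
```

The decoder inverting the encoder exactly is what makes the scheme lossless in the sense defined above.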
1.1.3 Compression Using Data Mining Tasks
Among the lossy compression schemes, we considered the data mining tasks. Each of them is a compression scheme, as follows:
• Association rule mining deals with generating frequently cooccurring items/patterns from the given data. It ignores the infrequent items. Rules of association are generated from the frequent itemsets. So, association rules in general cannot be used to obtain the original input data points.
• Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data points. For example, in K-means clustering, each cluster is represented by the centroid of the data points in it; it is not possible to get back the original data points from the centroids.
• Classification is lossy as the models learnt from the training data cannot be used to reproduce the input data points. For example, in the case of Support Vector Machines, a subset of the training patterns called support vectors is used to get the classifier; it is not possible to generate the input data points from the support vectors.
• Dimensionality reduction schemes can ignore some of the input features. So, they are lossy because it is not possible to get the training patterns back from the dimensionality-reduced ones.
So, each of the mining tasks is lossy in terms of its output obtained from the given data. In addition, in this book, we deal with data mining tasks working on compressed data, not the original data. We consider data compression schemes that could be either lossy or nonlossy. Some of the nonlossy data compression schemes are also shown in Fig. 1.1. These include run-length coding, Huffman coding, and the zip utility used by operating systems.
1.2 Organization
Material in this book is organized as follows.
1.2.1 Data Mining Tasks
We briefly discuss some data mining tasks here and provide a detailed discussion in Chap. 2.
The data mining tasks considered are the following.
• Clustering: Clustering algorithms generate either a hard or a soft partition of the input dataset. Hard clustering algorithms are either partitional or hierarchical. Partitional algorithms generate a single partition of the dataset. The number of all possible partitions of a set of n points into K clusters can be shown to be equal to

  (1/K!) Σ_{i=1}^{K} (−1)^{K−i} C(K, i) i^n,

where C(K, i) is the binomial coefficient. So, exhaustive enumeration of all possible partitions of a dataset could be prohibitively expensive. For example, even for a small dataset of 19 patterns to be partitioned into four clusters, we may have to consider around 11,259,666,000 partitions (a quick numerical check of this count appears at the end of this section). In order to reduce the computational load, each of the clustering algorithms restricts these possibilities by selecting an appropriate subset of the set of all possible K-partitions. In Chap. 2, we consider two partitional algorithms for clustering. One of them is the K-means algorithm, which is the most popular clustering algorithm; the other is the leader clustering algorithm, which is the simplest possible algorithm for partitional clustering.
A hierarchical clustering algorithm generates a hierarchy of partitions; partitions at different levels of the hierarchy are of different sizes. We describe the single-link algorithm, which has been classically used in a variety of areas including numerical taxonomy. Another hierarchical algorithm discussed is BIRCH, which is a very efficient hierarchical algorithm. Both leader and BIRCH are efficient as they need to scan the dataset only once to generate the clusters.
• Classification: We describe two classifiers in Chap. 2. The nearest-neighbor classifier is the simplest classifier in terms of learning. In fact, it does not learn a model; it employs all the training data points to label a test pattern. Even though it has no training time requirement, it can take a long time to label a test pattern if the training dataset is large in size. Its performance deteriorates as the dimensionality of the data points increases; also, it is sensitive to noise in the training data. A popular variant is the K-nearest-neighbor classifier (KNNC), which labels a test pattern based on the labels of the K nearest neighbors of the test pattern. Even though KNNC is robust to noise, it can fail to perform well in high-dimensional spaces. Also, it takes a longer time to classify a test pattern.
Another efficient and state-of-the-art classifier is based on Support Vector Machines (SVMs) and is popularly used in two-class problems. An SVM learns a subset of the set of training patterns, called the set of support vectors. These correspond to patterns falling on two parallel hyperplanes; these planes, called the support planes, are separated by a maximum margin. One can design the classifier using the support vectors. The decision boundary separating patterns from the two classes is located between the two support planes, one per class. It is commonly used in high-dimensional spaces, and it classifies a test pattern using a single dot product computation.
• Association rule mining: A popular scheme for finding frequent itemsets and association rules based on them is Apriori. This was the first association rule mining algorithm; perhaps, it is responsible for the emergence of the area of data mining itself. Even though it was initiated in market-basket analysis, it can also be used in other pattern classification and clustering applications. We use it in the classification of handwritten digits in the book. We describe the Apriori algorithm in Chap. 2.
Naturally, in data mining, we need to analyze large-scale datasets; in Chap. 2, we discuss three different schemes for dealing with large datasets. These include:
1. Incremental Mining: Here, we use the abstraction A_K and the (K + 1)th point X_{K+1} to generate the abstraction A_{K+1}. Here, A_K is the abstraction generated after examining the first K points. It is useful in stream data mining; in big data analytics, it deals with velocity in the three-V model.
2. Divide-and-Conquer Approach: It is a popular scheme used in designing efficient algorithms. Also, the popular and state-of-the-art Map-Reduce scheme is based on this strategy. It is associated with dealing with the volume requirements in the three-V model.
3. Mining based on an intermediate representation: Here an abstraction is learnt based on accessing the dataset once or twice; this abstraction is an intermediate representation. Once an intermediate representation is available, the mining is performed on this abstraction rather than on the dataset, which reduces the computational burden. This scheme also is associated with the volume feature of the three-V model.
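As a quick numerical check of the partition-count formula quoted in the Clustering item above, the following snippet (ours, not the book's) evaluates the count for n = 19 points and K = 4 clusters. The exact value is 11,259,666,950, consistent with the figure of roughly 11,259,666,000 cited in the text.

```python
from math import comb, factorial

def num_partitions(n: int, K: int) -> int:
    """Number of ways to partition n points into K nonempty clusters,
    computed as (1/K!) * sum_{i=1}^{K} (-1)^(K-i) * C(K, i) * i^n."""
    total = sum((-1) ** (K - i) * comb(K, i) * i ** n for i in range(1, K + 1))
    return total // factorial(K)

print(num_partitions(19, 4))  # 11,259,666,950 -- about 11.26 billion partitions
```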
1.2.2 Abstraction in Nonlossy Compression Domain
In Chap. 3, we provide a nonlossy compression scheme and the ability to cluster and classify data in the compressed domain without having to uncompress.
The scheme employs run-length coding of binary patterns. So, it is useful in dealing with either binary input patterns or even numerical vectors that could be viewed as binary sequences. Specifically, it considers handwritten digits that could be represented as binary patterns and compresses the strings using run-length coding. Now the compressed patterns are input to a KNNC for classification. It requires a definition of the distance d′ between a pair of run-length-coded strings to use the KNNC on the compressed data.
It is shown that the distance d(x, y) between two binary strings x and y and the modified distance d′(x′, y′) between the corresponding run-length-coded (compressed) strings x′ and y′ are equal; that is, d(x, y) = d′(x′, y′). It is shown that the KNNC using the modified distance on the compressed strings reduces the space and time requirements by a factor of more than 3 compared to the application of KNNC on the given original (uncompressed) data.
Such a scheme can be used in a number of applications that involve dissimilarity computation on patterns with binary-valued features. It should be noted that even real-valued features can be quantized into binary-valued features by specifying appropriate range and scale factors. Our earlier experience of such conversion on the intrusion detection dataset is that it does not affect the accuracy. In this chapter, we provide an application of the scheme to the classification of handwritten digit data and compare the improvement obtained in size as well as computation time. A second application is related to efficient implementation of genetic algorithms. Genetic algorithms are robust methods for obtaining near-optimal solutions. The compression scheme can be gainfully employed in situations where the evaluation function in Genetic Algorithms is the classification accuracy of the nearest-neighbor classifier (NNC). NNC involves computation of dissimilarity a number of times depending on the size of the training data or prototype pattern set as well as the test data size. The method can be used for optimal prototype and feature selection. We discuss an indicative example.
The Vapnik–Chervonenkis (VC) dimension characterizes the complexity of a class of classifiers. It is important to control the VC dimension to improve the performance of a classifier. Here, we show that the VC dimension is not affected by using the classifier on compressed data.
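The algorithm for computing the distance d′ directly on run-length-coded strings is given in Chap. 3; as a rough illustration of the idea only, here is our own sketch (its details may differ from the book's algorithm). It computes the Hamming distance between two binary strings by sweeping their run lists in step, never reconstructing the strings.

```python
def rle_encode(bits: str) -> list[int]:
    """Run lengths of alternating symbols, first run assumed to be of '0's."""
    runs, current, count = [], '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_distance(rx: list[int], ry: list[int]) -> int:
    """Hamming distance between the underlying binary strings,
    computed from the run-length codes without decompressing."""
    ix = iy = 0                 # index into each run list
    remx, remy = rx[0], ry[0]   # bits left in the current run of each string
    bitx = bity = 0             # run i encodes bit (i % 2): 0, 1, 0, 1, ...
    dist = 0
    while ix < len(rx) and iy < len(ry):
        step = min(remx, remy)  # overlap of the two current runs
        if bitx != bity:
            dist += step
        remx -= step
        remy -= step
        if remx == 0:
            ix += 1
            if ix < len(rx):
                remx, bitx = rx[ix], 1 - bitx
        if remy == 0:
            iy += 1
            if iy < len(ry):
                remy, bity = ry[iy], 1 - bity
    return dist

x, y = "00111010", "01101010"
assert rle_distance(rle_encode(x), rle_encode(y)) == \
       sum(a != b for a, b in zip(x, y))
```

The assertion checks the d(x, y) = d′(x′, y′) property stated above on one example pair of equal-length strings.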
1.2.3 Lossy Compression Scheme and Dimensionality Reduction
We propose a lossy compression scheme in Chap. 4. Such compressed data can be used in both clustering and classification. The proposed scheme compresses the given data by using frequent items and then considering distinct subsequences. Once the training data is compressed using this scheme, it is also required to appropriately deal with test data; it is possible that some of the subsequences present in the test data are absent in the training data summary. One of the successful schemes employed to deal with this issue is based on replacing a subsequence in the test data by its nearest neighbor in the training data.
The pruning and transformation scheme employed in achieving compression reduces the dataset size significantly. However, the classification accuracy improves because of the possible generalization resulting from the compressed representation. It is possible to integrate rough set theory to put a threshold on the dissimilarity between a test pattern and a training pattern represented in the compressed form. If the distance is below a threshold, then the test pattern is assumed to be in the lower approximation (proper core region) of the class of the training data; otherwise, it is placed in the upper approximation (possible reject region).
1.2.4 Compaction Through Simultaneous Prototype and Feature Selection
Simultaneous selection of prototypical patterns and features is considered in Chap. 5. Here data compression is achieved by ignoring some of the rows and columns in the data matrix; the rows correspond to patterns, and the columns are features in the data matrix. The important directions explored in this chapter include feature selection based on frequent items and prototype selection by clustering. Both schemes are explored in evaluating the resulting simultaneous prototype and feature selection. Here the leader clustering algorithm is used for prototype selection, and frequent itemset-based approaches are used for feature selection.
1.2.5 Use of Domain Knowledge in Data Compaction
Domain knowledge-based compaction is provided in Chap. 6. We make use of domain knowledge of the data under consideration to design efficient pattern classification schemes. We design a domain knowledge-based decision tree of depth 4 that can classify 10-category data with high accuracy. Classification approaches based on support vector machines and AdaBoost are used.
We carry out preliminary analysis on the datasets and demonstrate deriving domain knowledge from the data and from a human expert. In order that the classification be carried out on representative patterns and not on the complete data, we make use of the condensed nearest-neighbor approach and the leader clustering algorithm. We demonstrate the working of the proposed schemes on large datasets and public domain machine learning datasets.
1.2.6 Compression Through Dimensionality Reduction
Optimal dimensionality reduction for lossy data compression is discussed in Chap. 7. Here both feature selection and feature extraction schemes are described. In feature selection, both sequential selection schemes and genetic algorithm (GA) based schemes are discussed. In sequential selection, features are selected one after the other based on some ranking scheme; here each of the remaining features is ranked based on its performance along with the already selected features using some validation data. These sequential schemes are greedy in nature and do not guarantee globally optimal selection. It is possible to show that the GA-based schemes are globally optimal under some conditions; however, most practical implementations may not be able to exploit this global optimality.
Two popular schemes for feature selection are based on Fisher's score and mutual information (MI). Fisher's score could be used to select features that can assume continuous values, whereas the MI-based scheme is the most successful for selecting features that are discrete or categorical; it has been used in selecting features in the classification of documents, where the given set of features is very large.
Another popular set of feature selection schemes employs the performance of classifiers on selected feature subsets. The most popularly used classifiers in such feature selection include the NNC, SVM, and decision tree classifiers. Some of the popular feature extraction schemes are:
• Principal Component Analysis (PCA): Here the extracted features are linear combinations of the given features. The signal processing community has successfully used PCA-based compression in image and speech data reconstruction. It has also been used by search engines for capturing semantic similarity between the query and the documents.
• Nonnegative Matrix Factorization (NMF): Most of the data one typically uses are nonnegative. In such cases, it is possible to use NMF to reduce the dimensionality. This reduction in dimensionality is helpful in building effective classifiers to work on the reduced-dimensional data even though the given data is high-dimensional.
• Random Projections (RP): This is another scheme that extracts features that are linear combinations of the given features; the weights used in the linear combinations are random values here.
In this chapter, it is also shown how to exploit GAs in large-scale feature selection, and the proposed scheme is demonstrated using the handwritten digit data. A problem with a feature vector of about 200 dimensions is considered for obtaining an optimal subset of features. The implementation integrates frequent features and genetic algorithms and brings out the sensitivity of the genetic operators in achieving the optimal set. It is practically shown how the choice of the probability of initialization of the population, which is not often found in the literature, impacts the number of features in the final set with the other control parameters remaining the same.
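The GA-based feature selection just described can be pictured with a toy sketch such as the one below. It is ours, not the book's implementation: each chromosome is a bit mask over features, the fitness function is a placeholder for a wrapper score (in the book's setting it would be classification accuracy on compressed data), and the population size, operator probabilities, and initialization probability p_init are illustrative values only.

```python
import random

def fitness(mask):
    """Placeholder wrapper score; stands in for classifier accuracy
    on the features selected by the mask."""
    selected = [i for i, bit in enumerate(mask) if bit]
    # toy score: pretend even-indexed features help, and penalize large subsets
    return sum(1 for i in selected if i % 2 == 0) - 0.1 * len(selected)

def ga_feature_selection(num_features=20, pop_size=30, generations=50,
                         p_init=0.5, p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # p_init controls how many features start 'on' in the initial population
    pop = [[1 if rng.random() < p_init else 0 for _ in range(num_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:              # one-point crossover
                cut = rng.randrange(1, num_features)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return [i for i, bit in enumerate(best) if bit]

print(ga_feature_selection())
```

Varying p_init, p_cross, and p_mut in a sketch like this is one way to see the operator sensitivity referred to above.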
1.2.7 Big Data, Multiagent Systems, and Abstraction
Chapter 8 contains ways to generate abstraction from massive datasets. Big data is characterized by large volumes of heterogeneous types of datasets that need to be processed to generate abstraction efficiently. Equivalently, big data is characterized by three v's, viz., volume, variety, and velocity. Occasionally, the importance of value is articulated through another v. Big data analytics is multidisciplinary, with a host of topics such as machine learning, statistics, parallel processing, algorithms, data visualization, etc. The contents include a discussion of big data and related topics such as conventional methods of analyzing big data, MapReduce, PageRank, agents, and multiagent systems. A detailed discussion on agents and multiagent systems is provided. Case studies for generating abstraction with big data using multiagent systems are provided.
1.3 Summary
In this chapter, we have provided a brief introduction to data compression and mining compressed data. It is possible to use all the data mining tasks on the compressed data directly. We have then described how the material is organized in the different chapters. Most of the popular and state-of-the-art mining algorithms are covered in detail in the subsequent chapters. The various schemes considered and proposed are applied on two datasets, the handwritten digit dataset and the network intrusion detection dataset. Details of the intrusion detection dataset are provided in the Appendix.
1.4 Bibliographical Notes
A detailed description of the bibliography is presented at the end of each chapter, and notes on the bibliography are provided in the respective chapters. This book deals with data mining and data compression. There has been no major effort so far in dealing with the application of data mining algorithms directly on compressed data. Some of the important books on compression are by Sayood (2000) and Salomon et al. (2009). An early book on Data Mining was by Hand et al. (2001). For a good introduction to data mining, a good source is the book by Tan et al. (2005). A detailed description of various data mining tasks is given by Han et al. (2011). The book by Witten et al. (2011) discusses various practical issues and shows how to use the Weka machine learning workbench developed by the authors. One of the recent books is by Rajaraman and Ullman (2011).
Some of the important journals on data mining are:
1 IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
2 ACM Transactions on Knowledge Discovery from Data (ACM TKDD)
3 Data Mining and Knowledge Discovery (DMKD)
Some of the important conferences on this topic are:
1 Knowledge Discovery and Data Mining (KDD)
2 International Conference on Data Engineering (ICDE)
3 IEEE International Conference on Data Mining (ICDM)
4 SIAM International Conference on Data Mining (SDM)
References
J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)
D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
A. Rajaraman, J.D. Ullman, Mining of Massive Datasets (Cambridge University Press, Cambridge, 2011)
D. Salomon, G. Motta, D. Bryant, Handbook of Data Compression (Springer, Berlin, 2009)
K. Sayood, Introduction to Data Compression, 2nd edn. (Morgan Kaufmann, San Mateo, 2000)
P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson, Upper Saddle River, 2005)
I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)
Chapter 2
Data Mining Paradigms
2.1 Introduction
In data mining, the size of the dataset involved is large. It is convenient to visualize such a dataset as a matrix of size n × d, where n is the number of data points and d is the number of features. Typically, it is possible that either n or d or both are large. In mining such datasets, important issues are:
• The dataset cannot be accommodated in the main memory of the machine. So, we need to store the data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing; such an activity could be time-consuming. Because disk access can be more expensive compared to accessing the data from memory, the number of database scans is an important parameter. So, when we analyze data mining algorithms, it is important to consider the number of database scans required.
• The dimensionality of the data can be very large. In such a case, several of the conventional algorithms that use the Euclidean distance and similar metrics to characterize proximity between a pair of patterns may not play a meaningful role in such high-dimensional spaces, where the data is sparsely distributed. So, different techniques to deal with such high-dimensional datasets become important.
• Three important data mining tasks are:
1. Clustering: Here a collection of patterns is partitioned into two or more clusters. Typically, clusters of patterns are represented using cluster representatives; a centroid of the points in the cluster is one of the most popularly used cluster representatives. Typically, a partition or a clustering is represented by k representatives, where k is the number of clusters; such a process leads to lossy data compression. Instead of dealing with all the n data points in the collection, one can just use the k cluster representatives (where k ≪ n in the data mining context) for further decision making.
2. Classification: In classification, a machine learning algorithm is used on a given collection of training data to obtain an appropriate abstraction of the dataset. Decision trees and probability distributions of points in various classes
are examples of such abstractions. These abstractions are used to classify a test pattern.

Fig. 2.1 Clustering
3. Association Rule Mining: This activity has played a major role in giving a distinct status to the field of data mining itself. By convention, an association rule is an implication of the form A → B, where A and B are two disjoint itemsets. It was initiated in the context of market-basket analysis to characterize how frequently items in A are bought along with items in B. However, generically it is possible to view classification and clustering rules also as association rules.
In order to run these tasks on large datasets, it is important to consider techniques that could lead to scalable mining algorithms. Before we examine these techniques, we briefly consider some of the popular algorithms for carrying out these data mining tasks.
2.2 Clustering
Clustering is the process of partitioning a set of patterns into cohesive groups or clusters. Such a process is carried out so that intra-cluster patterns are similar and inter-cluster patterns are dissimilar. This is illustrated using a set of two-dimensional points shown in Fig. 2.1. There are three clusters in this figure, and patterns are represented as two-dimensional points. The Euclidean distance between a pair of points belonging to the same cluster is smaller than that between any two points chosen from different clusters.
The Euclidean distance between two points X and Y in the p-dimensional space is given by

  d(X, Y) = (Σ_{i=1}^{p} (x_i − y_i)²)^{1/2},

where x_i and y_i are the ith components of X and Y, respectively.
Fig. 2.2 Representing clusters
This notion characterizes similarity; the intra-cluster distance (similarity) is small (high), and the inter-cluster distance (similarity) is large (low). There could be other ways of characterizing similarity.
Clustering is useful in generating data abstraction. The process of data abstraction may be explained using Fig. 2.2. There are two dense clusters; the first has 22 points, and the second has 9 points. Further, there is a singleton cluster in the figure. Here, a cluster of points is represented by its centroid or its leader. The centroid stands for the sample mean of the points in the cluster, and it need not coincide with any one of the input points, as indicated in the figure. There is another point in the figure, which is far off from any of the other points, and it belongs to the third cluster. This could be an outlier. Typically, these outliers are ignored, and each of the remaining clusters is represented by one or more points, called the cluster representatives, to achieve the abstraction. The most popular cluster representative is its centroid.
Here, if each cluster is represented by its centroid, then there is a reduction in the dataset size. One can use only the two centroids for further decision making. For example, in order to classify a test pattern using the nearest-neighbor classifier, one requires 32 distance computations if all the data points are used. However, using the two centroids requires just two distance computations to find the nearest centroid of the test pattern. It is possible that classifiers using the cluster centroids can be optimal under some conditions. The above discussion illustrates the role of clustering in lossy data compression.
2.2.1 Clustering Algorithms
Typically, a grouping of patterns is meaningful when the within-group similarity is high and the between-group similarity is low. This may be illustrated using the groupings of the seven two-dimensional points shown in Fig. 2.3.
Algorithms for clustering can be broadly grouped into hierarchical and partitional categories. A hierarchical scheme forms a nested grouping of patterns. A hierarchical algorithm would result in a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change.

Fig. 2.3 A clustering of the seven two-dimensional points

In this section, we describe two popular clustering algorithms; one of them is hierarchical, and the other is partitional.

2.2.2 Single-Link Algorithm
1 Input: n data points; Output: A dendrogram depicting the hierarchy.
2. Form the n × n proximity matrix by using the Euclidean distance between all pairs of points. Assign each point to a separate cluster; this step results in n singleton clusters.
3. Merge a pair of most similar clusters to form a bigger cluster. The distance between two clusters C_i and C_j to be merged is given by

  Distance(C_i, C_j) = min_{X ∈ C_i, Y ∈ C_j} d(X, Y).
Distance(C i , C j ) = Min X,Y d(X, Y ) where X ∈ C i and Y ∈ C j
4 Repeat step 3 till the partition of required size is obtained; a k-partition is tained if the number of clusters k is given; otherwise, merging continues till a single cluster of all the n points is obtained.
ob-We illustrate the single-link algorithm using the data shown in Fig.2.3 The imity matrix showing the Euclidean distance between each pair of patterns is shown
Trang 29Fig 2.4 The dendrogram
obtained using the single-link
algorithm
in Table2.1 A dendrogram of the seven points in Fig.2.3(obtained from the link algorithm) is shown in Fig.2.4 Note that there are seven leaves with each leafcorresponding to a singleton cluster in the tree structure The smallest distance be-tween a pair of such clusters is 0.8, which leads to merging {F} and {G} to form{F, G} Next merger leads to {D, E} based on a distance of 0.9 units This is fol-lowed by merging {B} and {C}, then {A} and {B, C} at a distance of 1 unit each
single-At this point we have three clusters By merging clusters further we get ultimately asingle cluster as shown in the figure The dendrogram can be broken at different lev-els to yield different clusterings of the data The partition of three clusters obtainedusing the dendrogram is the same as the partition shown in Fig.2.3 A major is-sue with the hierarchical algorithm is that computation and storage of the proximity
matrix requires O(n2)time and space
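To make the single-link steps above concrete, here is a small agglomerative sketch of our own. The seven labeled points are made-up stand-ins, since the coordinates behind Fig. 2.3 and the entries of Table 2.1 are not reproduced in this text; the merge order it prints plays the role of the dendrogram.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(points, k):
    """Agglomerative clustering with single-link (minimum) inter-cluster distance."""
    clusters = [[name] for name in points]          # start with singleton clusters
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} + {clusters[j]} at distance {d:.2f}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# hypothetical 2-D points standing in for A..G of Fig. 2.3
pts = {"A": (1.0, 1.0), "B": (1.5, 2.0), "C": (2.0, 1.0), "D": (5.5, 5.0),
       "E": (6.0, 5.5), "F": (5.0, 8.0), "G": (5.5, 8.5)}
print(single_link(pts, k=3))
```

Recomputing the minimum pairwise distance inside the loop mirrors the O(n²) proximity-matrix cost noted above; practical implementations cache these distances.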
2.2.3 k-Means Algorithm
The k-means algorithm is the most popular clustering algorithm. It is a partitional clustering algorithm and produces clusters by optimizing a criterion function. The most acceptable criterion function is the squared-error criterion, as it can be used to generate compact clusters. The k-means algorithm is the most successfully used squared-error clustering algorithm. The k-means algorithm is popular because it is easy to implement and its time complexity is O(n), where n is the number of patterns. We give a description of the k-means algorithm below.

Fig. 2.5 An optimal clustering of the points
k-Means Algorithm
1. Select k initial centroids. One possibility is to select k out of the n points randomly as the initial centroids. Each of them represents a cluster.
2. Assign each of the remaining n − k points to one of these k clusters; a pattern is assigned to a cluster if the centroid of the cluster is the nearest, among all the k centroids, to the pattern.
3. Update the centroids of the clusters based on the assignment of the patterns.
4. Assign each of the n patterns to the nearest cluster using the current set of centroids.

For the optimal clustering of the points shown in Fig. 2.5, the resulting centroids are:
• centroid1: (1.33, 1.66)^t; centroid2: (6.45, 2)^t; centroid3: (6.4, 2)^t
• The corresponding value of the squared error is around 2 units.
The popularity of the k-means algorithm may be attributed to its simplicity. It requires O(n) time, as it computes nk distances in each pass and the number of passes may be assumed to be a constant. Also, the number of clusters k is a constant. Further, it needs to store k centroids in memory. So, the space requirement is also small.
However, it is possible that the algorithm generates a nonoptimal partition by choosing A, B, and C as the initial centroids, as depicted in Fig. 2.6. In this case, the three centroids are:
• centroid1: (1, 1)^t; centroid2: (1.5, 2)^t; centroid3: (6.4, 4)^t
• The corresponding squared error value is around 17 units.

Fig. 2.6 A nonoptimal clustering of the two-dimensional points
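A compact version of the k-means loop described above is sketched below. This is our code, not the book's, and the sample points are invented, since the coordinates behind Figs. 2.5 and 2.6 are not listed in the text; the two calls at the bottom illustrate the same sensitivity to initial centroids that the nonoptimal example above describes.

```python
from math import dist

def k_means(points, initial_centroids, passes=10):
    """Plain k-means: alternate the assignment and centroid-update steps."""
    centroids = list(initial_centroids)

    def nearest(p):
        return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

    for _ in range(passes):
        clusters = [[] for _ in centroids]
        for p in points:                        # assignment step
            clusters[nearest(p)].append(p)
        for i, members in enumerate(clusters):  # centroid-update step
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    squared_error = sum(dist(p, centroids[nearest(p)]) ** 2 for p in points)
    return centroids, squared_error

# invented 2-D data with three natural groups
data = [(1, 1), (1, 2), (2, 2), (6, 1), (7, 2), (6, 2), (6, 4), (7, 4), (6.5, 5)]
print(k_means(data, initial_centroids=[(1, 1), (6, 1), (6, 4)]))  # good start
print(k_means(data, initial_centroids=[(1, 1), (1, 2), (2, 2)]))  # poor start, larger squared error
```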
2.3 Classification
There are a variety of classifiers. Typically, a set of labeled patterns is used to classify an unlabeled test pattern. Classification involves labeling a test pattern; in the process, either the labeled training dataset is directly used, or an abstraction or model learnt from the training dataset is used. Typically, classifiers learnt from the training dataset are categorized as either generative or discriminative. The Bayes classifier is a well-known generative model, where a test pattern X is classified, or assigned to class C_i, based on the a posteriori probabilities P(C_j/X) for j = 1, ..., C, if

  P(C_i/X) ≥ P(C_j/X) for all j.

These posterior probabilities are obtained using the Bayes rule from the prior probabilities and the probability distributions of patterns in each of the classes. It is possible to show that the Bayes classifier is optimal; it can minimize the average probability of error. The Support Vector Machine (SVM) is a popular discriminative classifier; it learns a weight vector W and a threshold b from the training patterns of two classes. It assigns the test pattern X to class C1 (the positive class) if W^t X + b ≥ 0; otherwise, it assigns X to class C2 (the negative class).
The Nearest-Neighbor Classifier (NNC) is the simplest and most popular classifier; it classifies the test pattern by using the training patterns directly. An important property of the NNC is that its error rate is less than twice the error rate of the Bayes classifier when the number of training patterns is asymptotically large. We briefly describe the NNC, which employs the nearest-neighbor rule for classification.
Table 2.2 Data matrix
Pattern ID feature1 feature2 feature3 feature4 Class label
Output: Class label for the test pattern X.
Decision: Assign X to class C_i if d(X, X_i) = min_j d(X, X_j).
We illustrate the NNC using the four-dimensional dataset shown in Table 2.2. There are eight patterns, X1, ..., X8, from two classes C1 and C2, four patterns from each class. The patterns are four-dimensional, and the dimensions are characterized by feature1, feature2, feature3, and feature4, respectively. In addition to the four features, there is an additional column that provides the class label of each pattern.
Let the test pattern be X = (2.0, 2.0, 2.0, 2.0)^t. The Euclidean distances between X and each of the eight patterns are given by
d(X, X1) = 2.0; d(X, X2) = 8.0; d(X, X3) = 10.0;
d(X, X4) = 1.41; d(X, X5) = 1.0; d(X, X6) = 9.05;
d(X, X7) = 1.41; d(X, X8) = 9.05.
So, the nearest neighbor (NN) of X is X5 because d(X, X5) is the smallest (it is 1.0) among all the eight distances. So, NN(X) = X5, and the class label assigned to X is the class label of X5, which is C1 here; this means that X is assigned to class C1. Note that NNC requires eight distances to be calculated in this example. In general, if there are n training patterns, then the number of distances to be calculated to classify a test pattern is O(n).
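The nearest-neighbor rule just illustrated, and its k-nearest-neighbor variant discussed shortly, fit in a few lines. The sketch below is ours rather than the book's, and the tiny training set is hypothetical (the rows of Table 2.2 are not reproduced in this text); it is only meant to show the decision rule.

```python
from math import dist
from collections import Counter

def nnc(train, x):
    """Nearest-neighbor rule: label of the single closest training pattern."""
    pattern, label = min(train, key=lambda item: dist(item[0], x))
    return label

def knnc(train, x, k=3):
    """k-nearest-neighbor rule: majority label among the k closest patterns."""
    neighbors = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# hypothetical four-dimensional training data in the spirit of Table 2.2
train = [((1.0, 1.0, 1.0, 1.0), "C1"), ((1.0, 2.0, 2.0, 1.0), "C1"),
         ((2.0, 1.0, 2.0, 2.0), "C1"), ((2.5, 2.0, 1.5, 2.0), "C1"),
         ((6.0, 6.0, 6.0, 6.0), "C2"), ((7.0, 6.5, 6.0, 7.0), "C2"),
         ((6.5, 7.0, 7.0, 6.0), "C2"), ((7.0, 7.0, 6.5, 6.5), "C2")]
x = (2.0, 2.0, 2.0, 2.0)
print(nnc(train, x), knnc(train, x, k=3))   # both give C1 for this test pattern
```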
The nearest-neighbor classifier is popular because:
1 It is easy to understand and implement
2 There is no learning or training phase; it uses the whole training data to classifythe test pattern
3 Unlike the Bayes classifier, it does not require the probability structure of theclasses
4. It shows good performance. If the optimal accuracy is 99.99 %, then with large training data, it can give at least 99.80 % accuracy.
Even though it is popular, there are some negative aspects They include:
Trang 333 The distance between a pair of points may not be meaningful in high-dimensionalspaces It is known that, as the dimensionality increases, the distance between a
point X and its nearest neighbor tends toward the distance between X and its
farthest neighbor As a consequence, NNC may perform poorly in the context ofhigh-dimensional spaces
Some of the possible solutions to the above problems are:
1 In order to tolerate noise, a modification to NNC is popularly used; it is called the
k-Nearest Neighbor Classifier (kNNC) Instead of deciding the class label of X using the class label of the NN(X), X is labeled using the class labels of k nearest neighbors of X In the case of kNNC, the class label of X is the label of the class that is the most frequent among the class labels of the k nearest neighbors In other words, X is assigned to the class to which majority of its k nearest neigh- bors belong; the value of k is to be fixed appropriately In the example dataset
shown in Table2.2, the three nearest neighbors of X = (2.0, 2.0, 2.0, 2.0) t are
X5, X4, and X7 All the three neighbors are from class C1; so X is assigned to class C1
2 NNC requires O(n) time to compute the n distances, and also it requires O(n)
space It is possible to reduce the effort by compressing the training data Thereare several algorithms for performing this compression; we consider here a
scheme based on clustering We cluster the n patterns into k clusters using the k-means algorithm and use the k resulting centroids instead of the n training pat-
terns Labeling the centroids is done by using the majority class label in eachcluster
By clustering the example dataset shown in Table2.2using the k-means gorithm, with a value of k = 2, we get the following clusters:
al-• Cluster1: {X1, X4, X5, X7} – Centroid: (1.0, 1.5, 1.75, 1.5) t
• Cluster2: {X2, X3, X6, X8} – Centroid: (6.5, 6.5, 6.5, 6.5) t
Note that Cluster1 contains four patterns from C1and Cluster2 has the four
pat-terns from C2 So, by using these two representatives instead of the eight trainingpatterns, the number of distance computations and memory requirements will
reduce Specifically, Centroid of Cluster1 is nearer to X than the Centroid of Cluster2 So, X is assigned to C1using two distance computations
3. In order to reduce the dimensionality, several feature selection/extraction techniques are used. We use a feature set partitioning scheme that we explain in detail in the sequel.
Another important classifier is based on the Support Vector Machine. We consider it next.
Support Vector Machine The support vector machine (SVM) is a very popular classifier. Some of the important properties of SVM-based classification are:
• The SVM classifier is a discriminative classifier. It can be used to discriminate between two classes. Intrinsically, it supports binary classification.
• It obtains a linear discriminant function of the form W t X + b from the training data Here, W is called the weight vector of the same size as the data points, and
b is a scalar Learning the SVM classifier amounts to obtaining the values of W and b from the training data.
• It is ideally associated with a binary classification problem Typically, one of them
is called the negative class, and the other is called the positive class
• If X is from the positive class, then W t X + b > 0, and if X is from the negative class, then W t X + b < 0.
• It finds the parameters W and b so that the margin between the two classes is
maximized
• It identifies a subset of the training patterns, which are called support vectors.
These support vectors lie on parallel hyperplanes; negative and positive
hyper-planes correspond respectively to the negative and positive classes A point X on the negative hyperplane satisfies W t X + b = −1, and similarly, a point X on the positive hyperplane satisfies W t X + b = 1.
• The margin between the two support planes is maximized in the process of
finding out W and b In other words, the normal distance between the port planes W t X + b = −1 and W t X + b = 1 is maximized The distance is
sup-2
W It is maximized using the constraints that every pattern X from the
pos-itive class satisfies W t X + b ≥ +1 and every pattern X from the negative class satisfies W t X + b ≤ −1 Instead of maximizing the margin, we mini-
mize its inverse This may be viewed as a constrained optimization problem givenby
s.t y i
W t X i + b≥ 1, i = 1, 2, , n, where y i = 1 if X i is in the positive class and y i = −1 if X i is in the negativeclass
• The Lagrangian for the optimization problem is
In order to minimize the Lagrangian, we take the derivative with respect to
b and gradient with respect to W , and equating to 0, we get α is that isfy
Trang 35• It is possible to view the decision boundary as W t X + b = 0 and W is orthogonal
to the decision boundary
We illustrate the working of the SVM using an example in the two-dimensional
space Let us consider two points, X1= (2, 1) t from the negative class and X2=
(6, 3) t from the positive class We have the following:
• Using α1 y1 + α2 y2 = 0 and observing that y1 = −1 and y2 = 1, we get α1 = α2. So, we use α instead of α1 or α2.
• As a consequence, W = −α X1 + α X2 = (4α, 2α)^t.
• We know that W^t X1 + b = −1 and W^t X2 + b = 1; substituting the values of W, X1, and X2, we get

    8α + 2α + b = −1,
    24α + 6α + b = 1.

Solving these, we get 20α = 2, that is, α = 1/10, and from one of the above equations we get b = −2.
• From W = (4α, 2α)^t and α = 1/10, we get W = (2/5, 1/5)^t.
• In this simple example, we started with two support vectors in the two-dimensional case, so it was easy to solve for the α's. In general, there are efficient schemes for finding these values.
• If we consider a point X = (x1, x2)^t on the line x2 = −2x1 + 5, for example the point (1, 3)^t, then W^t (1, 3)^t − 2 = −1 since W = (2/5, 1/5)^t. This line is the support line for the points in the negative class. In a higher-dimensional space, it is a hyperplane.
• In a similar manner, any point on the parallel line x2 = −2x1 + 15, for example (5, 5)^t, satisfies W^t (5, 5)^t − 2 = 1, and this parallel line is the support line for the positive class. Again, in a higher-dimensional space, it becomes a hyperplane parallel to the negative-class hyperplane.
• Note that the decision boundary is given by

    (2/5, 1/5) X − 2 = 0.

So, the decision boundary (2/5) x1 + (1/5) x2 − 2 = 0 lies exactly in the middle of the two support lines and is parallel to both. Note that (4, 2)^t is located on the decision boundary.
• A point (7, 6)^t is in the positive class since W^t (7, 6)^t − 2 = 2 > 0. Similarly, W^t (1, 1)^t − 2 = −1.4 < 0, so (1, 1)^t is in the negative class. A small code sketch verifying these computations is given after this list.
• We have discussed what is known as the linear SVM. If the two classes are linearly separable, then the linear SVM is sufficient.
• If the classes are not linearly separable, then we map the points to a high-dimensional space with the hope of finding linear separability in the new space. Fortunately, one can implicitly make computations in the high-dimensional space without having to work explicitly in it. This is possible by using a class of kernel functions that characterize similarity between patterns.
• However, in large-scale applications involving high-dimensional data, as in text mining, linear SVMs are used by default for their simplicity in training.
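The following is a minimal sketch, assuming NumPy, that merely reproduces the arithmetic of the worked two-point example above; the values of α, W, and b are the ones derived in the text rather than the output of a general-purpose SVM solver.

import numpy as np

X1, y1 = np.array([2.0, 1.0]), -1   # support vector from the negative class
X2, y2 = np.array([6.0, 3.0]), +1   # support vector from the positive class

alpha = 1 / 10
W = alpha * y1 * X1 + alpha * y2 * X2   # -> array([0.4, 0.2]), i.e. (2/5, 1/5)^t
b = y2 - W.dot(X2)                      # -> -2.0, obtained from W^t X2 + b = 1

def decision(x):
    # The sign of W^t x + b decides the class.
    return W.dot(x) + b

print(decision(np.array([2.0, 1.0])))   # -1.0, on the negative support line
print(decision(np.array([6.0, 3.0])))   #  1.0, on the positive support line
print(decision(np.array([4.0, 2.0])))   #  0.0, on the decision boundary
print(decision(np.array([7.0, 6.0])))   #  2.0 > 0, positive class
print(decision(np.array([1.0, 1.0])))   # -1.4 < 0, negative class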
2.4 Association Rule Mining
Conventionally, this activity is not a part of either pattern recognition or machine learning. An association rule is an implication of the form A → B, where A and B are disjoint itemsets; A is called the antecedent, and B is called the consequent. This activity became popular in the context of market-basket analysis, where one is concerned with the set of items available in a supermarket and the transactions made by various customers. In such a context, an association rule provides information on the association between two sets of items that are frequently bought together; this facilitates strategic decisions that may have a positive commercial impact, such as displaying related items on appropriate shelves to avoid congestion or offering incentives to customers on some products/items. Some of the features of the association rule mining activity are:
1. The rule A → B is not like the conventional implication used in a classical logic, for example, propositional logic. Here, the rule does not guarantee the purchase of the items in B in the same transaction in which the items in A are bought; it depicts a kind of frequent association between A and B in terms of buying patterns.
2. It is assumed that there is a global set of items I; in the case of market-basket analysis, I is the set of all items/product lines available for sale in a supermarket. Note that A and B are disjoint subsets of I. So, if the cardinality of I is d, then the number of all possible rules is O(3^d); this is because each of the d items in I can be a part of A, a part of B, or a part of neither. In order to reduce the mining effort, only a subset of the rules, based on frequently bought items, is examined.
3. Popularly, the quantity of an item bought is not used; only whether an item is bought in a transaction or not is considered. For example, if a customer buys 1.2 kilograms of Sugar, 3 loaves of Bread, and a tin of Jam in the same transaction, then the corresponding transaction is represented as {Sugar, Bread, Jam}. Such a representation helps in viewing a transaction as a subset of I.
4. In order to mine useful rules, only rules of the form A → B, where A and B are subsets of frequent itemsets, are explored. So, it is important to consider algorithms for frequent itemset mining. Once all the frequent itemsets are mined, it is required to obtain the corresponding association rules.
Table 2.3 Transaction data (transactions over the items a, b, c, d, e)
Given an itemset X, Support-set(X) is the set of transactions that contain all the items in X. The support of X is given by the cardinality of Support-set(X), that is, |Support-set(X)|. An itemset X is a frequent itemset if Support(X) ≥ Minsup, where Minsup is a user-specified threshold.
If we use a Minsup value of 4, then the itemset {a, d} is frequent. Further, {a, b, c} is not frequent; we call such itemsets infrequent. There is a systematic way of enumerating all the frequent itemsets; this is done by an algorithm called Apriori. This algorithm enumerates a relevant subset of the itemsets for examining whether they are frequent or not. It is based on the following observations.
1. Any subset of a frequent itemset is frequent. This is because if A and B are two itemsets such that A is a subset of B, then Support(A) ≥ Support(B), since Support-set(B) ⊆ Support-set(A). For example, knowing that the itemset {a, d} is frequent, we can infer that the itemsets {a} and {d} are frequent. Note that in the data shown in Table 2.3, Support({a}) = 6 and Support({d}) = 6, and both exceed the Minsup value.
2. Any superset of an infrequent itemset is infrequent. If A and B are two itemsets such that A is a superset of B, then Support(A) ≤ Support(B). In the example, {a, c} is infrequent; one of its supersets, {a, c, d}, is also infrequent. Note that Support({a, c, d}) = 2, which is less than the Minsup value.
Table 2.4 Printed characters (two 3 × 3 patterns of the character 1)
Based on these observations, the Apriori algorithm works level by level, alternating between two steps:
• Generating Candidate itemsets of size k. These itemsets are obtained by looking at the frequent itemsets of size k − 1.
• Generating Frequent itemsets of size k. This is achieved by scanning the transaction database once to check whether each candidate of size k is frequent or not.
It starts with the empty set (φ), which is frequent because the empty set is a subset of every transaction; so, Support(φ) = |T|, where T is the set of transactions. Note that φ is an itemset of size 0, as there are no items in it. The algorithm then generates candidate itemsets of size 1; we call such itemsets 1-itemsets. Note that every 1-itemset is a candidate. In the example data shown in Table 2.3, the candidate 1-itemsets are {a}, {b}, {c}, {d}, {e}. Now it scans the database once to obtain the supports of these 1-itemsets.
Using a Minsup value of 4, we can observe that the frequent 1-itemsets are {a}, {b}, and {d}. From these frequent 1-itemsets we generate candidate 2-itemsets; the candidates are {a, b}, {a, d}, and {b, d}. Note that the other 2-itemsets need not be considered as candidates because they are supersets of infrequent itemsets and hence cannot be frequent. For example, {a, c} is infrequent because {c} is infrequent. A second database scan is used to find the support values of these candidates. The supports are Support({a, b}) = 3, Support({a, d}) = 5, and Support({b, d}) = 3. So, only {a, d} is a frequent 2-itemset, and there cannot be any candidates of size 3. For example, {a, b, d} cannot be frequent because {a, b} is infrequent.
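A minimal sketch of this level-wise procedure is given below, assuming the transactions fit in memory as Python sets; the sample transactions are hypothetical and are not the contents of Table 2.3, so the printed supports need not match the values quoted above.

from itertools import combinations

def apriori(transactions, minsup):
    # Return a dict mapping each frequent itemset (a frozenset) to its support.
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    # Level 1: every 1-itemset is a candidate.
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: n for s, n in level.items() if n >= minsup}
    all_frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # those size-k sets all of whose (k-1)-subsets are frequent.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count candidate supports by subset tests over the in-memory transactions.
        counts = {c: support(c) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(level)
        k += 1
    return all_frequent

# Hypothetical transactions over the items {a, b, c, d, e}.
T = [set('ad'), set('abd'), set('ad'), set('abde'),
     set('acd'), set('abd'), set('bc'), set('cd')]
print(apriori(T, minsup=4))   # here {a}, {b}, {d}, and {a, d} turn out frequent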
It is important to note that transactions need not be associated with supermarket buying patterns only; it is possible to view a wide variety of patterns as transactions. For example, consider the printed characters of size 3 × 3 corresponding to the character 1 shown in Table 2.4; there are two 1s. In the left-side matrix, the character is present in the third column, and in the right-side matrix, it is present in the first column. By labeling the locations in such 3 × 3 matrices using 1 to 9 in a row-major fashion, the two patterns may be viewed as transactions over these 9 items. Specifically, the transactions are t1: {3, 6, 9} and t2: {1, 4, 7}, where t1 corresponds to the left-side pattern and t2 corresponds to the right-side pattern in Table 2.4. Let us call the left-side 1 Type1 1 and the right-side 1 Type2 1.
Table 2.5 Transactions for characters of 1 (columns: TID, cell labels 1 to 9, Class)
There are six transactions, each of them corresponding to a 1. By using a Minsup value of 3, we get the frequent itemset {1, 4, 7} for Type1 1 and the frequent itemset {3, 6, 9} for Type2 1. Naturally, subsets of these frequent itemsets are also frequent.
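The following is a small sketch, assuming NumPy, of the conversion described above from a 3 × 3 binary character matrix to a transaction; the two matrices encode the vertical strokes in the third and first columns as described in the text.

import numpy as np

def matrix_to_transaction(m):
    # Label the cells 1 to 9 in row-major order and return the set of labels
    # of the cells that contain a 1.
    m = np.asarray(m)
    return {r * m.shape[1] + c + 1 for r, c in zip(*np.nonzero(m))}

type1 = [[0, 0, 1],     # Type1 1: stroke in the third column
         [0, 0, 1],
         [0, 0, 1]]
type2 = [[1, 0, 0],     # Type2 1: stroke in the first column
         [1, 0, 0],
         [1, 0, 0]]

print(matrix_to_transaction(type1))   # {3, 6, 9}
print(matrix_to_transaction(type2))   # {1, 4, 7}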
2.4.2 Association Rules
In association rule mining there are two important phases:
1. Generating Frequent Itemsets. This requires one or more dataset scans. Based on the discussion in the previous subsection, Apriori requires k + 1 dataset scans if the largest frequent itemset is of size k.
2. Obtaining Association Rules. This step generates association rules based on frequent itemsets. Once the frequent itemsets are obtained from the transaction dataset, association rules can be derived without any more dataset scans, provided that the support of each frequent itemset is stored. So, this step is computationally simpler.
If X is a frequent itemset, then rules of the form A → B, where A ⊂ X and B = X − A, are considered. Such a rule is accepted if the confidence of the rule exceeds a user-specified confidence value called Minconf. The confidence of the rule A → B is Support(A ∪ B)/Support(A).
For example, in the dataset shown in Table 2.3, {a, d} is a frequent itemset. So, there are two possible association rules. They are:
1. {a} → {d}; its confidence is 5/6.
2. {d} → {a}; its confidence is 5/6.
So, if the Minconf value is 0.5, then both these rules satisfy the confidence threshold.
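A minimal sketch of this rule-generation step is given below; it assumes that the frequent itemsets and their supports have already been mined (for instance, by the Apriori sketch given earlier), and the supports used here are the ones quoted in the running example.

from itertools import combinations

def generate_rules(frequent, minconf):
    # Yield (antecedent, consequent, confidence) for rules A -> X - A, where A
    # is a proper non-empty subset of a frequent itemset X.
    for X, sup_x in frequent.items():
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                if A in frequent:                  # subsets of X are also frequent
                    conf = sup_x / frequent[A]     # Support(X) / Support(A)
                    if conf >= minconf:
                        yield A, X - A, conf

# Supports quoted in the running example.
frequent = {frozenset('a'): 6, frozenset('d'): 6, frozenset('ad'): 5}

for A, B, conf in generate_rules(frequent, minconf=0.5):
    print(sorted(A), '->', sorted(B), 'confidence =', round(conf, 2))
# Prints {a} -> {d} and {d} -> {a}, each with confidence 5/6 (about 0.83).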
In the case of the character data shown in Table 2.5, it is appropriate to consider rules of the form:
• {1, 4, 7} → Type1 1
• {3, 6, 9} → Type2 1
Typically, the antecedent of such an association rule or a classification rule is a disjunction of one or more maximal frequent itemsets. A frequent itemset A is maximal if there is no frequent itemset B such that A is a proper subset of B. This illustrates the role of frequent itemsets in classification.
2.5 Mining Large Datasets
There are several applications where the size of the pattern matrix is large. By large, we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing. For example, a transaction database of a supermarket chain may consist of trillions of transactions, and each transaction is a sparse vector of very high dimensionality; the dimensionality depends on the number of product lines. Similarly, in a network intrusion detection application, the number of connections could be prohibitively large, and the number of packets to be analyzed or classified could be even larger. Another application is the clustering of click-streams; this forms an important part of web usage mining. Other applications include genome sequence mining, where the dimensionality could run into millions, social network analysis, text mining, and biometrics.
An objective way of characterizing the largeness of a dataset is by specifying bounds on the number of patterns and features present. For example, a dataset having more than a billion patterns and/or more than a million features is large. However, such a characterization is not universally acceptable and is bound to change with developments in technology; for example, in the 1960s, “large” meant several hundreds of patterns. So, it is better to consider a more pragmatic characterization: large datasets are those that may not fit in the main memory of the computer, so the largeness of the data varies with technological developments. Such large datasets are typically stored on a disk, and each point in the set is accessed from the disk based on processing needs. Note that disk access can be several orders of magnitude slower than memory access; this property remains intact even though memory and disk sizes at different points of time in the past have been different. So, characterizing largeness using this property could be more meaningful.
The above discussion motivates the need for integrating various algorithmic design techniques with the existing mining algorithms so that they can handle