Advances in Computer Vision and Pattern Recognition
For further volumes:
www.springer.com/series/4205
T. Ravindra Babu · M. Narasimha Murty · S.V. Subrahmanya
Compression Schemes
for Mining Large Datasets
A Machine Learning Perspective
Prof. Sameer Singh
Rail Vision Europe Ltd
Castle Donington
Leicestershire, UK
Dr. Sing Bing Kang
Interactive Visual Media Group, Microsoft Research
Redmond, WA, USA
ISSN 2191-6586 ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-1-4471-5606-2 ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013954523
© Springer-Verlag London 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

We come across a number of celebrated textbooks on Data Mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain. The book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present schemes that work in the compression domain and demonstrate their working on one or more practical datasets in each case. In this process, we cover important data mining paradigms. This is intended to provide a practitioners' viewpoint of compression schemes in data mining. The work presented is based on the authors' work on related areas over the last few years. We organized each chapter to contain context setting, background work as part of discussion, proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset, and discussion of results. At the end of each chapter, as part of bibliographic notes, we discuss relevant literature and directions for further study.
Data Mining focuses on efficient algorithms to generate abstraction from large datasets. The objective of these algorithms is to find interesting patterns for further use with the fewest possible visits of the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made to arrive at a much smaller subset of the original dataset that is representative of the entire data and contains the attributes characterizing the data. The ability to generate an abstraction from a small representative set of patterns and features that is as accurate as that obtained with the entire dataset leads to efficiency in terms of both space and time. Important data mining paradigms include clustering, classification, association rule mining, etc. We present a discussion on data mining paradigms in Chap. 2.
In our present work, in addition to the data mining paradigms discussed in Chap. 2, we also focus on another paradigm, viz., the ability to generate abstraction in the compressed domain without having to decompress. Such a compression leads to less storage and reduces the computation cost. In the book, we consider both lossy and nonlossy compression schemes. In Chap. 3, we present a nonlossy compression scheme based on run-length encoding of patterns with binary-valued features. The scheme is also applicable to floating-point-valued features that are suitably quantized to binary values. The chapter presents an algorithm that computes the dissimilarity in the compressed domain directly. Theoretical notes are provided for the work. We present applications of the scheme in multiple domains.
It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine this aspect in Chap. 4. The work in the chapter exploits the concept of minimum feature or item support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short length, and identify and eliminate repeating subsequences. We represent the pattern by those unique subsequences, leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more frequent subsequences, thereby achieving further compaction. We demonstrate the working of the scheme on large handwritten digit data.
Pattern clustering can be construed as compaction of data. Feature selection also reduces dimensionality, thereby resulting in pattern compression. It is interesting to explore whether they can be achieved simultaneously. We examine this in Chap. 5.
We consider an efficient clustering scheme that requires a single database visit to generate prototypes. We consider a lossy compression scheme for feature reduction. We also examine whether there is a preferred ordering of prototype selection and feature selection in achieving compaction as well as good classification accuracy on unseen patterns. We examine multiple combinations of such sequencing. We demonstrate the working of the scheme on handwritten digit data and intrusion detection data.
Domain knowledge forms an important input for efficient compaction. Such knowledge could either be provided by a human expert or generated through an appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify 10-class data through a proposed decision tree of depth 4. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of such a scheme.
Dimensionality reduction leads to compaction. With algorithms such as run-length-encoded compression, it is instructive to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy. In Chap. 7, we discuss concepts and methods of feature selection and extraction. We propose an efficient implementation of simple genetic algorithms by integrating compressed data classification and frequent features. We provide an insightful discussion of the sensitivity of various genetic operators and of frequent-item support on the final selection of the optimal feature set.
Divide-and-conquer has been one important direction for dealing with large datasets. With reducing cost and increasing ability to collect and store enormous amounts of data, we have massive databases at our disposal for making sense of them and generating abstractions of potential business value. The term Big Data has become synonymous with streaming multisource data such as numerical data, messages, and audio and video data. There is an increasing need to process such data in real or near-real time and generate business value in this process. In Chap. 8, we propose schemes that exploit multiagent systems to solve these problems. We discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.
The authors would like to express their sincere gratitude to their respective families for their cooperation.
T. Ravindra Babu and S.V. Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R) that enabled them to carry out academic and applied research resulting in articles and books.
T. Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support. He dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M. Narasimha Murty would like to acknowledge the support of his parents. S.V. Subrahmanya would like to thank his wife D.R. Sudha for her patient support. The authors would like to record their sincere appreciation of the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.
T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya
Bangalore, India
Contents

1 Introduction 1
1.1 Data Mining and Data Compression 1
1.1.1 Data Mining Tasks 1
1.1.2 Data Compression 2
1.1.3 Compression Using Data Mining Tasks 2
1.2 Organization 3
1.2.1 Data Mining Tasks 3
1.2.2 Abstraction in Nonlossy Compression Domain 5
1.2.3 Lossy Compression Scheme and Dimensionality Reduction 6
1.2.4 Compaction Through Simultaneous Prototype and Feature Selection 6
1.2.5 Use of Domain Knowledge in Data Compaction 7
1.2.6 Compression Through Dimensionality Reduction 7
1.2.7 Big Data, Multiagent Systems, and Abstraction 8
1.3 Summary 9
1.4 Bibliographical Notes 9
References 9
2 Data Mining Paradigms 11
2.1 Introduction 11
2.2 Clustering 12
2.2.1 Clustering Algorithms 13
2.2.2 Single-Link Algorithm 14
2.2.3 k-Means Algorithm 15
2.3 Classification 17
2.4 Association Rule Mining 22
2.4.1 Frequent Itemsets 23
2.4.2 Association Rules 25
2.5 Mining Large Datasets 26
2.5.1 Possible Solutions 27
2.5.2 Clustering 28
2.5.3 Classification 34
2.5.4 Frequent Itemset Mining 39
2.6 Summary 42
2.7 Bibliographic Notes 43
References 44
3 Run-Length-Encoded Compression Scheme 47
3.1 Introduction 47
3.2 Compression Domain for Large Datasets 48
3.3 Run-Length-Encoded Compression Scheme 49
3.3.1 Discussion on Relevant Terms 49
3.3.2 Important Properties and Algorithm 50
3.4 Experimental Results 55
3.4.1 Application to Handwritten Digit Data 55
3.4.2 Application to Genetic Algorithms 57
3.4.3 Some Applicable Scenarios in Data Mining 59
3.5 Invariance of VC Dimension in the Original and the Compressed Forms 60
3.6 Minimum Description Length 63
3.7 Summary 65
3.8 Bibliographic Notes 65
References 66
4 Dimensionality Reduction by Subsequence Pruning 67
4.1 Introduction 67
4.2 Lossy Data Compression for Clustering and Classification 67
4.3 Background and Terminology 68
4.4 Preliminary Data Analysis 73
4.4.1 Huffman Coding and Lossy Compression 74
4.4.2 Analysis of Subsequences and Their Frequency in a Class 79
4.5 Proposed Scheme 81
4.5.1 Initialization 83
4.5.2 Frequent Item Generation 83
4.5.3 Generation of Coded Training Data 84
4.5.4 Subsequence Identification and Frequency Computation 84
4.5.5 Pruning of Subsequences 85
4.5.6 Generation of Encoded Test Data 85
4.5.7 Classification Using Dissimilarity Based on Rough Set Concept 86
4.5.8 Classification Using k-Nearest Neighbor Classifier 87
4.6 Implementation of the Proposed Scheme 87
4.6.1 Choice of Parameters 87
4.6.2 Frequent Items and Subsequences 88
4.6.3 Compressed Data and Pruning of Subsequences 89
4.6.4 Generation of Compressed Training and Test Data 91
4.7 Experimental Results 91
4.8 Summary 92
4.9 Bibliographic Notes 93
References 94
5 Data Compaction Through Simultaneous Selection of Prototypes and Features 95
5.1 Introduction 95
5.2 Prototype Selection, Feature Selection, and Data Compaction 96
5.2.1 Data Compression Through Prototype and Feature Selection 99
5.3 Background Material 100
5.3.1 Computation of Frequent Features 103
5.3.2 Distinct Subsequences 104
5.3.3 Impact of Support on Distinct Subsequences 104
5.3.4 Computation of Leaders 105
5.3.5 Classification of Validation Data 105
5.4 Preliminary Analysis 105
5.5 Proposed Approaches 107
5.5.1 Patterns with Frequent Items Only 107
5.5.2 Cluster Representatives Only 108
5.5.3 Frequent Items Followed by Clustering 109
5.5.4 Clustering Followed by Frequent Items 109
5.6 Implementation and Experimentation 110
5.6.1 Handwritten Digit Data 110
5.6.2 Intrusion Detection Data 116
5.6.3 Simultaneous Selection of Patterns and Features 120
5.7 Summary 122
5.8 Bibliographic Notes 123
References 123
6 Domain Knowledge-Based Compaction 125
6.1 Introduction 125
6.2 Multicategory Classification 126
6.3 Support Vector Machine (SVM) 126
6.4 Adaptive Boosting 128
6.4.1 Adaptive Boosting on Prototypes for Data Mining Applications 129
6.5 Decision Trees 130
6.6 Preliminary Analysis Leading to Domain Knowledge 131
6.6.1 Analytical View 132
6.6.2 Numerical Analysis 133
6.6.3 Confusion Matrix 134
6.7 Proposed Method 136
6.7.1 Knowledge-Based (KB) Tree 136
6.8 Experimentation and Results 137
6.8.1 Experiments Using SVM 138
6.8.2 Experiments Using AdaBoost 140
6.8.3 Results with AdaBoost on Benchmark Data 141
6.9 Summary 143
6.10 Bibliographic Notes 144
References 144
7 Optimal Dimensionality Reduction 147
7.1 Introduction 147
7.2 Feature Selection 149
7.2.1 Based on Feature Ranking 149
7.2.2 Ranking Features 150
7.3 Feature Extraction 152
7.3.1 Performance 154
7.4 Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms 154
7.4.1 An Overview of Genetic Algorithms 155
7.4.2 Proposed Schemes 158
7.4.3 Preliminary Analysis 161
7.4.4 Experimental Results 163
7.4.5 Summary 170
7.5 Bibliographical Notes 171
References 171
8 Big Data Abstraction Through Multiagent Systems 173
8.1 Introduction 173
8.2 Big Data 173
8.3 Conventional Massive Data Systems 174
8.3.1 Map-Reduce 174
8.3.2 PageRank 176
8.4 Big Data and Data Mining 176
8.5 Multiagent Systems 177
8.5.1 Agent Mining Interaction 177
8.5.2 Big Data Analytics 178
8.6 Proposed Multiagent Systems 178
8.6.1 Multiagent System for Data Reduction 178
8.6.2 Multiagent System for Attribute Reduction 179
8.6.3 Multiagent System for Heterogeneous Data Access 180
8.6.4 Multiagent System for Agile Processing 181
8.7 Summary 182
8.8 Bibliographic Notes 182
References 183
Appendix Intrusion Detection Dataset—Binary Representation 185
A.1 Data Description and Preliminary Analysis 185
A.2 Bibliographic Notes 189
References 189
Glossary 191
Index 193
Acronyms

BIRCH Balanced Iterative Reducing and Clustering using Hierarchies
CART Classification and regression trees
CLARANS Clustering Large Applications based on RANdomized Search
kNNC k-Nearest-Neighbor Classifier
MAD Analysis Magnetic, Agile, and Deep Analysis
PCA Principal Component Analysis
Chapter 1
Introduction
In this book, we deal with data mining and compression; specifically, we deal with using several data mining tasks directly on the compressed data.
1.1 Data Mining and Data Compression
Data mining is concerned with generating an abstraction of the input dataset using a mining task.
1.1.1 Data Mining Tasks
Important data mining tasks are:
1. Clustering: Clustering is the process of grouping data points so that points in each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more representative patterns. So, clustering is a kind of compression where details of the data are ignored and only cluster representatives are used in further processing or decision making.
2. Classification: In classification, a labeled training dataset is used to learn a model or classifier. This learnt model is used to label a test (unlabeled) pattern; this process is called classification.
3. Dimensionality Reduction: A majority of the classification and clustering algorithms fail to produce the expected results when dealing with high-dimensional datasets. Also, computational requirements in the form of time and space can increase enormously with dimensionality. This prompts reduction of the dimensionality of the dataset; it is reduced either by using feature selection or feature extraction. In feature selection, an appropriate subset of features is selected, and in feature extraction, a subset in some transformed space is selected.
4. Regression or Function Prediction: Here a functional form for a variable y is learnt (where y = f(X)) from given pairs (X, y); the learnt function is used to predict the values of y for new values of X. This problem may be viewed as a generalization of the classification problem. In classification, the number of class labels is finite, whereas in the regression setting, y can take infinitely many values, typically y ∈ R.
5. Association Rule Mining: Even though it is of relatively recent origin, it is the earliest introduced task in data mining and is responsible for bringing visibility to the area of data mining. In association rule mining, we are interested in finding out how frequently two subsets of items are associated.
1.1.2 Data Compression
Another important topic in this book is data compression. A compression scheme CS may be viewed as a function from the set of patterns X to a set of compressed patterns X′. It may be viewed as

CS : X ⇒ X′.

Specifically, CS(x) = x′ for x ∈ X and x′ ∈ X′. In a more general setting, we may view CS as giving output x′ using x and some knowledge structure or a dictionary K. So, CS(x, K) = x′ for x ∈ X and x′ ∈ X′. Sometimes, a dictionary is used in compressing and uncompressing the data. Schemes for compressing data are the following:
• Lossless Schemes: These schemes are such that CS(x) = x′ and there is an inverse CS−1 such that CS−1(x′) = x. For example, consider a binary string 00001111 (x) as an input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of 4 zeros, and the second 4 corresponds to a run of 4 ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless as we get x′ from x using run-length encoding and x from x′ using decoding.
• Lossy Schemes: In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.
We show some example compression schemes in Fig. 1.1.

Fig. 1.1 Compression schemes
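As a concrete illustration of the run-length example above (00001111 ↔ 44), here is a minimal Python sketch of lossless run-length coding. It is not from the book; it assumes the simple convention that runs alternate starting from '0' (a zero-length leading run is emitted if the string starts with '1').

```python
def rle_encode(bits: str) -> list[int]:
    """Encode a binary string as run lengths of alternating symbols,
    assuming the first run is a run of '0's."""
    runs = []
    current, count = '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs: list[int]) -> str:
    """Invert rle_encode: reproduce the original binary string exactly."""
    out, bit = [], '0'
    for r in runs:
        out.append(bit * r)
        bit = '1' if bit == '0' else '0'
    return ''.join(out)

x = "00001111"
coded = rle_encode(x)          # [4, 4], the "44" of the text
assert rle_decode(coded) == x  # lossless: x is recovered from x'
print(coded, rle_decode(coded))
```

The decoder inverting the encoder exactly is what makes the scheme lossless in the sense defined above.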
1.1.3 Compression Using Data Mining Tasks
Among the lossy compression schemes, we considered the data mining tasks. Each of them is a compression scheme, as follows:
• Association rule mining deals with generating frequently cooccurring items/patterns from the given data. It ignores the infrequent items. Rules of association are generated from the frequent itemsets. So, association rules in general cannot be used to obtain the original input data points.
• Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data points. For example, in K-means clustering, each cluster is represented by the centroid of the data points in it; it is not possible to get back the original data points from the centroids.
• Classification is lossy as the models learnt from the training data cannot be used to reproduce the input data points. For example, in the case of Support Vector Machines, a subset of the training patterns called support vectors is used to get the classifier; it is not possible to generate the input data points from the support vectors.
• Dimensionality reduction schemes can ignore some of the input features. So, they are lossy because it is not possible to get the training patterns back from the dimensionality-reduced ones.
So, each of the mining tasks is lossy in terms of its output obtained from the given data. In addition, in this book, we deal with data mining tasks working on compressed data, not the original data. We consider data compression schemes that could be either lossy or nonlossy. Some of the nonlossy data compression schemes are also shown in Fig. 1.1. These include run-length coding, Huffman coding, and the zip utility used by operating systems.
1.2 Organization
Material in this book is organized as follows.
1.2.1 Data Mining Tasks
We briefly discuss some data mining tasks here and provide a detailed discussion in Chap. 2.
The data mining tasks considered are the following.
• Clustering: Clustering algorithms generate either a hard or a soft partition of the input dataset. Hard clustering algorithms are either partitional or hierarchical. Partitional algorithms generate a single partition of the dataset. The number of all possible partitions of a set of n points into K clusters can be shown to be equal to

  (1/K!) Σ_{i=1}^{K} (−1)^{K−i} C(K, i) i^n,

where C(K, i) is the binomial coefficient. So, exhaustive enumeration of all possible partitions of a dataset could be prohibitively expensive. For example, even for a small dataset of 19 patterns to be partitioned into four clusters, we may have to consider around 11,259,666,000 partitions (a quick numerical check of this count appears at the end of this section). In order to reduce the computational load, each of the clustering algorithms restricts these possibilities by selecting an appropriate subset of the set of all possible K-partitions. In Chap. 2, we consider two partitional algorithms for clustering. One of them is the K-means algorithm, which is the most popular clustering algorithm; the other is the leader clustering algorithm, which is the simplest possible algorithm for partitional clustering.
A hierarchical clustering algorithm generates a hierarchy of partitions; partitions at different levels of the hierarchy are of different sizes. We describe the single-link algorithm, which has been classically used in a variety of areas including numerical taxonomy. Another hierarchical algorithm discussed is BIRCH, which is a very efficient hierarchical algorithm. Both leader and BIRCH are efficient as they need to scan the dataset only once to generate the clusters.
• Classification: We describe two classifiers in Chap. 2. The nearest-neighbor classifier is the simplest classifier in terms of learning. In fact, it does not learn a model; it employs all the training data points to label a test pattern. Even though it has no training time requirement, it can take a long time to label a test pattern if the training dataset is large in size. Its performance deteriorates as the dimensionality of the data points increases; also, it is sensitive to noise in the training data. A popular variant is the K-nearest-neighbor classifier (KNNC), which labels a test pattern based on the labels of the K nearest neighbors of the test pattern. Even though KNNC is robust to noise, it can fail to perform well in high-dimensional spaces. Also, it takes a longer time to classify a test pattern.
Another efficient and state-of-the-art classifier is based on Support Vector Machines (SVMs) and is popularly used in two-class problems. An SVM learns a subset of the set of training patterns, called the set of support vectors. These correspond to patterns falling on two parallel hyperplanes; these planes, called the support planes, are separated by a maximum margin. One can design the classifier using the support vectors. The decision boundary separating patterns from the two classes is located between the two support planes, one per class. It is commonly used in high-dimensional spaces, and it classifies a test pattern using a single dot product computation.
• Association rule mining: A popular scheme for finding frequent itemsets and association rules based on them is Apriori. This was the first association rule mining algorithm; perhaps, it is responsible for the emergence of the area of data mining itself. Even though it was initiated in market-basket analysis, it can also be used in other pattern classification and clustering applications. We use it in the classification of handwritten digits in the book. We describe the Apriori algorithm in Chap. 2.
Naturally, in data mining, we need to analyze large-scale datasets; in Chap. 2, we discuss three different schemes for dealing with large datasets. These include:
1. Incremental Mining: Here, we use the abstraction A_K and the (K + 1)th point X_{K+1} to generate the abstraction A_{K+1}. Here, A_K is the abstraction generated after examining the first K points. It is useful in stream data mining; in big data analytics, it deals with velocity in the three-V model.
2. Divide-and-Conquer Approach: It is a popular scheme used in designing efficient algorithms. Also, the popular and state-of-the-art Map-Reduce scheme is based on this strategy. It is associated with dealing with the volume requirements in the three-V model.
3. Mining based on an intermediate representation: Here an abstraction is learnt based on accessing the dataset once or twice; this abstraction is an intermediate representation. Once an intermediate representation is available, the mining is performed on this abstraction rather than on the dataset, which reduces the computational burden. This scheme also is associated with the volume feature of the three-V model.
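As a quick numerical check of the partition-count formula quoted in the Clustering item above, the following snippet (ours, not the book's) evaluates the count for n = 19 points and K = 4 clusters. The exact value is 11,259,666,950, consistent with the figure of roughly 11,259,666,000 cited in the text.

```python
from math import comb, factorial

def num_partitions(n: int, K: int) -> int:
    """Number of ways to partition n points into K nonempty clusters,
    computed as (1/K!) * sum_{i=1}^{K} (-1)^(K-i) * C(K, i) * i^n."""
    total = sum((-1) ** (K - i) * comb(K, i) * i ** n for i in range(1, K + 1))
    return total // factorial(K)

print(num_partitions(19, 4))  # 11,259,666,950 -- about 11.26 billion partitions
```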
1.2.2 Abstraction in Nonlossy Compression Domain
In Chap. 3, we provide a nonlossy compression scheme and the ability to cluster and classify data in the compressed domain without having to uncompress.
The scheme employs run-length coding of binary patterns. So, it is useful in dealing with either binary input patterns or even numerical vectors that could be viewed as binary sequences. Specifically, it considers handwritten digits that could be represented as binary patterns and compresses the strings using run-length coding. Now the compressed patterns are input to a KNNC for classification. It requires a definition of the distance d′ between a pair of run-length-coded strings to use the KNNC on the compressed data.
It is shown that the distance d(x, y) between two binary strings x and y and the modified distance d′(x′, y′) between the corresponding run-length-coded (compressed) strings x′ and y′ are equal; that is, d(x, y) = d′(x′, y′). It is shown that the KNNC using the modified distance on the compressed strings reduces the space and time requirements by a factor of more than 3 compared to the application of KNNC on the given original (uncompressed) data.
Such a scheme can be used in a number of applications that involve dissimilarity computation on patterns with binary-valued features. It should be noted that even real-valued features can be quantized into binary-valued features by specifying appropriate range and scale factors. Our earlier experience of such conversion on the intrusion detection dataset is that it does not affect the accuracy. In this chapter, we provide an application of the scheme to the classification of handwritten digit data and compare the improvement obtained in size as well as computation time. A second application is related to efficient implementation of genetic algorithms. Genetic algorithms are robust methods for obtaining near-optimal solutions. The compression scheme can be gainfully employed in situations where the evaluation function in Genetic Algorithms is the classification accuracy of the nearest-neighbor classifier (NNC). NNC involves computation of dissimilarity a number of times depending on the size of the training data or prototype pattern set as well as the test data size. The method can be used for optimal prototype and feature selection. We discuss an indicative example.
The Vapnik–Chervonenkis (VC) dimension characterizes the complexity of a class of classifiers. It is important to control the VC dimension to improve the performance of a classifier. Here, we show that the VC dimension is not affected by using the classifier on compressed data.
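The algorithm for computing the distance d′ directly on run-length-coded strings is given in Chap. 3; as a rough illustration of the idea only, here is our own sketch (its details may differ from the book's algorithm). It computes the Hamming distance between two binary strings by sweeping their run lists in step, never reconstructing the strings.

```python
def rle_encode(bits: str) -> list[int]:
    """Run lengths of alternating symbols, first run assumed to be of '0's."""
    runs, current, count = [], '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_distance(rx: list[int], ry: list[int]) -> int:
    """Hamming distance between the underlying binary strings,
    computed from the run-length codes without decompressing."""
    ix = iy = 0                 # index into each run list
    remx, remy = rx[0], ry[0]   # bits left in the current run of each string
    bitx = bity = 0             # run i encodes bit (i % 2): 0, 1, 0, 1, ...
    dist = 0
    while ix < len(rx) and iy < len(ry):
        step = min(remx, remy)  # overlap of the two current runs
        if bitx != bity:
            dist += step
        remx -= step
        remy -= step
        if remx == 0:
            ix += 1
            if ix < len(rx):
                remx, bitx = rx[ix], 1 - bitx
        if remy == 0:
            iy += 1
            if iy < len(ry):
                remy, bity = ry[iy], 1 - bity
    return dist

x, y = "00111010", "01101010"
assert rle_distance(rle_encode(x), rle_encode(y)) == \
       sum(a != b for a, b in zip(x, y))
```

The assertion checks the d(x, y) = d′(x′, y′) property stated above on one example pair of equal-length strings.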
1.2.3 Lossy Compression Scheme and Dimensionality Reduction
We propose a lossy compression scheme in Chap. 4. Such compressed data can be used in both clustering and classification. The proposed scheme compresses the given data by using frequent items and then considering distinct subsequences. Once the training data is compressed using this scheme, it is also required to appropriately deal with test data; it is possible that some of the subsequences present in the test data are absent in the training data summary. One of the successful schemes employed to deal with this issue is based on replacing a subsequence in the test data by its nearest neighbor in the training data.
The pruning and transformation scheme employed in achieving compression reduces the dataset size significantly. However, the classification accuracy improves because of the possible generalization resulting from the compressed representation. It is possible to integrate rough set theory to put a threshold on the dissimilarity between a test pattern and a training pattern represented in the compressed form. If the distance is below a threshold, then the test pattern is assumed to be in the lower approximation (proper core region) of the class of the training data; otherwise, it is placed in the upper approximation (possible reject region).
1.2.4 Compaction Through Simultaneous Prototype and Feature Selection
Simultaneous selection of prototypical patterns and features is considered in Chap. 5. Here data compression is achieved by ignoring some of the rows and columns in the data matrix; the rows correspond to patterns, and the columns are features in the data matrix. The important directions explored in this chapter include feature selection based on frequent items and prototype selection by clustering. Both schemes are explored in evaluating the resulting simultaneous prototype and feature selection. Here the leader clustering algorithm is used for prototype selection, and frequent itemset-based approaches are used for feature selection.
1.2.5 Use of Domain Knowledge in Data Compaction
Domain knowledge-based compaction is provided in Chap. 6. We make use of domain knowledge of the data under consideration to design efficient pattern classification schemes. We design a domain knowledge-based decision tree of depth 4 that can classify 10-category data with high accuracy. Classification approaches based on support vector machines and AdaBoost are used.
We carry out preliminary analysis on the datasets and demonstrate deriving domain knowledge from the data and from a human expert. In order that the classification be carried out on representative patterns and not on the complete data, we make use of the condensed nearest-neighbor approach and the leader clustering algorithm. We demonstrate the working of the proposed schemes on large datasets and public domain machine learning datasets.
1.2.6 Compression Through Dimensionality Reduction
Optimal dimensionality reduction for lossy data compression is discussed in Chap. 7. Here both feature selection and feature extraction schemes are described. In feature selection, both sequential selection schemes and genetic algorithm (GA) based schemes are discussed. In sequential selection, features are selected one after the other based on some ranking scheme; here each of the remaining features is ranked based on its performance along with the already selected features using some validation data. These sequential schemes are greedy in nature and do not guarantee globally optimal selection. It is possible to show that the GA-based schemes are globally optimal under some conditions; however, most practical implementations may not be able to exploit this global optimality.
Two popular schemes for feature selection are based on Fisher's score and mutual information (MI). Fisher's score could be used to select features that can assume continuous values, whereas the MI-based scheme is the most successful for selecting features that are discrete or categorical; it has been used in selecting features in the classification of documents, where the given set of features is very large.
Another popular set of feature selection schemes employs the performance of classifiers on selected feature subsets. The most popularly used classifiers in such feature selection include the NNC, SVM, and decision tree classifiers. Some of the popular feature extraction schemes are:
• Principal Component Analysis (PCA): Here the extracted features are linear combinations of the given features. The signal processing community has successfully used PCA-based compression in image and speech data reconstruction. It has also been used by search engines for capturing semantic similarity between the query and the documents.
• Nonnegative Matrix Factorization (NMF): Most of the data one typically uses are nonnegative. In such cases, it is possible to use NMF to reduce the dimensionality. This reduction in dimensionality is helpful in building effective classifiers to work on the reduced-dimensional data even though the given data is high-dimensional.
• Random Projections (RP): This is another scheme that extracts features that are linear combinations of the given features; the weights used in the linear combinations are random values here.
In this chapter, it is also shown how to exploit GAs in large-scale feature selection, and the proposed scheme is demonstrated using the handwritten digit data. A problem with a feature vector of about 200 dimensions is considered for obtaining an optimal subset of features. The implementation integrates frequent features and genetic algorithms and brings out the sensitivity of the genetic operators in achieving the optimal set. It is practically shown how the choice of the probability of initialization of the population, which is not often found in the literature, impacts the number of features in the final set with the other control parameters remaining the same.
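The GA-based feature selection just described can be pictured with a toy sketch such as the one below. It is ours, not the book's implementation: each chromosome is a bit mask over features, the fitness function is a placeholder for a wrapper score (in the book's setting it would be classification accuracy on compressed data), and the population size, operator probabilities, and initialization probability p_init are illustrative values only.

```python
import random

def fitness(mask):
    """Placeholder wrapper score; stands in for classifier accuracy
    on the features selected by the mask."""
    selected = [i for i, bit in enumerate(mask) if bit]
    # toy score: pretend even-indexed features help, and penalize large subsets
    return sum(1 for i in selected if i % 2 == 0) - 0.1 * len(selected)

def ga_feature_selection(num_features=20, pop_size=30, generations=50,
                         p_init=0.5, p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # p_init controls how many features start 'on' in the initial population
    pop = [[1 if rng.random() < p_init else 0 for _ in range(num_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:              # one-point crossover
                cut = rng.randrange(1, num_features)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return [i for i, bit in enumerate(best) if bit]

print(ga_feature_selection())
```

Varying p_init, p_cross, and p_mut in a sketch like this is one way to see the operator sensitivity referred to above.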
1.2.7 Big Data, Multiagent Systems, and Abstraction
Chapter 8 contains ways to generate abstraction from massive datasets. Big data is characterized by large volumes of heterogeneous types of datasets that need to be processed to generate abstraction efficiently. Equivalently, big data is characterized by three v's, viz., volume, variety, and velocity. Occasionally, the importance of value is articulated through another v. Big data analytics is multidisciplinary, with a host of topics such as machine learning, statistics, parallel processing, algorithms, data visualization, etc. The contents include a discussion of big data and related topics such as conventional methods of analyzing big data, MapReduce, PageRank, agents, and multiagent systems. A detailed discussion on agents and multiagent systems is provided. Case studies for generating abstraction with big data using multiagent systems are provided.
1.3 Summary
In this chapter, we have provided a brief introduction to data compression and mining compressed data. It is possible to use all the data mining tasks on the compressed data directly. We have then described how the material is organized in the different chapters. Most of the popular and state-of-the-art mining algorithms are covered in detail in the subsequent chapters. The various schemes considered and proposed are applied on two datasets, the handwritten digit dataset and the network intrusion detection dataset. Details of the intrusion detection dataset are provided in the Appendix.
1.4 Bibliographical Notes
A detailed description of the bibliography is presented at the end of each chapter, and notes on the bibliography are provided in the respective chapters. This book deals with data mining and data compression. There has been no major effort so far in dealing with the application of data mining algorithms directly on compressed data. Some of the important books on compression are by Sayood (2000) and Salomon et al. (2009). An early book on Data Mining was by Hand et al. (2001). For a good introduction to data mining, a good source is the book by Tan et al. (2005). A detailed description of various data mining tasks is given by Han et al. (2011). The book by Witten et al. (2011) discusses various practical issues and shows how to use the Weka machine learning workbench developed by the authors. One of the recent books is by Rajaraman and Ullman (2011).
Some of the important journals on data mining are:
1 IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
2 ACM Transactions on Knowledge Discovery from Data (ACM TKDD)
3 Data Mining and Knowledge Discovery (DMKD)
Some of the important conferences on this topic are:
1 Knowledge Discovery and Data Mining (KDD)
2 International Conference on Data Engineering (ICDE)
3 IEEE International Conference on Data Mining (ICDM)
4 SIAM International Conference on Data Mining (SDM)
References
J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)
D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
A. Rajaraman, J.D. Ullman, Mining of Massive Datasets (Cambridge University Press, Cambridge, 2011)
D. Salomon, G. Motta, D. Bryant, Handbook of Data Compression (Springer, Berlin, 2009)
K. Sayood, Introduction to Data Compression, 2nd edn. (Morgan Kaufmann, San Mateo, 2000)
P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson, Upper Saddle River, 2005)
I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. (Morgan Kaufmann, San Mateo, 2011)
Chapter 2
Data Mining Paradigms
2.1 Introduction
In data mining, the size of the dataset involved is large. It is convenient to visualize such a dataset as a matrix of size n × d, where n is the number of data points and d is the number of features. Typically, it is possible that either n or d or both are large. In mining such datasets, important issues are:
• The dataset cannot be accommodated in the main memory of the machine. So, we need to store the data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing; such an activity could be time-consuming. Because disk access can be more expensive compared to accessing the data from memory, the number of database scans is an important parameter. So, when we analyze data mining algorithms, it is important to consider the number of database scans required.
• The dimensionality of the data can be very large. In such a case, several of the conventional algorithms that use the Euclidean distance and similar metrics to characterize proximity between a pair of patterns may not play a meaningful role in such high-dimensional spaces, where the data is sparsely distributed. So, different techniques to deal with such high-dimensional datasets become important.
• Three important data mining tasks are:
1. Clustering: Here a collection of patterns is partitioned into two or more clusters. Typically, clusters of patterns are represented using cluster representatives; a centroid of the points in the cluster is one of the most popularly used cluster representatives. Typically, a partition or a clustering is represented by k representatives, where k is the number of clusters; such a process leads to lossy data compression. Instead of dealing with all the n data points in the collection, one can just use the k cluster representatives (where k ≪ n in the data mining context) for further decision making.
2. Classification: In classification, a machine learning algorithm is used on a given collection of training data to obtain an appropriate abstraction of the dataset. Decision trees and probability distributions of points in various classes
are examples of such abstractions. These abstractions are used to classify a test pattern.

Fig. 2.1 Clustering
3. Association Rule Mining: This activity has played a major role in giving a distinct status to the field of data mining itself. By convention, an association rule is an implication of the form A → B, where A and B are two disjoint itemsets. It was initiated in the context of market-basket analysis to characterize how frequently items in A are bought along with items in B. However, generically it is possible to view classification and clustering rules also as association rules.
In order to run these tasks on large datasets, it is important to consider techniques that could lead to scalable mining algorithms. Before we examine these techniques, we briefly consider some of the popular algorithms for carrying out these data mining tasks.
2.2 Clustering
Clustering is the process of partitioning a set of patterns into cohesive groups or clusters. Such a process is carried out so that intra-cluster patterns are similar and inter-cluster patterns are dissimilar. This is illustrated using a set of two-dimensional points shown in Fig. 2.1. There are three clusters in this figure, and patterns are represented as two-dimensional points. The Euclidean distance between a pair of points belonging to the same cluster is smaller than that between any two points chosen from different clusters.
The Euclidean distance between two points X and Y in the p-dimensional space is given by

  d(X, Y) = (Σ_{i=1}^{p} (x_i − y_i)²)^{1/2},

where x_i and y_i are the ith components of X and Y, respectively.
Fig. 2.2 Representing clusters
This notion characterizes similarity; the intra-cluster distance (similarity) is small (high), and the inter-cluster distance (similarity) is large (low). There could be other ways of characterizing similarity.
Clustering is useful in generating data abstraction. The process of data abstraction may be explained using Fig. 2.2. There are two dense clusters; the first has 22 points, and the second has 9 points. Further, there is a singleton cluster in the figure. Here, a cluster of points is represented by its centroid or its leader. The centroid stands for the sample mean of the points in the cluster, and it need not coincide with any one of the input points, as indicated in the figure. There is another point in the figure, which is far off from any of the other points, and it belongs to the third cluster. This could be an outlier. Typically, these outliers are ignored, and each of the remaining clusters is represented by one or more points, called the cluster representatives, to achieve the abstraction. The most popular cluster representative is its centroid.
Here, if each cluster is represented by its centroid, then there is a reduction in the dataset size. One can use only the two centroids for further decision making. For example, in order to classify a test pattern using the nearest-neighbor classifier, one requires 32 distance computations if all the data points are used. However, using the two centroids requires just two distance computations to find the nearest centroid of the test pattern. It is possible that classifiers using the cluster centroids can be optimal under some conditions. The above discussion illustrates the role of clustering in lossy data compression.
2.2.1 Clustering Algorithms
Typically, a grouping of patterns is meaningful when the within-group similarity is high and the between-group similarity is low. This may be illustrated using the groupings of the seven two-dimensional points shown in Fig. 2.3.
Algorithms for clustering can be broadly grouped into hierarchical and partitional categories. A hierarchical scheme forms a nested grouping of patterns. A hierarchical algorithm would result in a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change.

Fig. 2.3 A clustering of the seven two-dimensional points

In this section, we describe two popular clustering algorithms; one of them is hierarchical, and the other is partitional.

2.2.2 Single-Link Algorithm
1 Input: n data points; Output: A dendrogram depicting the hierarchy.
2. Form the n × n proximity matrix by using the Euclidean distance between all pairs of points. Assign each point to a separate cluster; this step results in n singleton clusters.
3. Merge a pair of most similar clusters to form a bigger cluster. The distance between two clusters C_i and C_j to be merged is given by

  Distance(C_i, C_j) = min_{X ∈ C_i, Y ∈ C_j} d(X, Y).
Distance(C i , C j ) = Min X,Y d(X, Y ) where X ∈ C i and Y ∈ C j
4 Repeat step 3 till the partition of required size is obtained; a k-partition is tained if the number of clusters k is given; otherwise, merging continues till a single cluster of all the n points is obtained.
ob-We illustrate the single-link algorithm using the data shown in Fig.2.3 The imity matrix showing the Euclidean distance between each pair of patterns is shown
Trang 29Fig 2.4 The dendrogram
obtained using the single-link
algorithm
in Table2.1 A dendrogram of the seven points in Fig.2.3(obtained from the link algorithm) is shown in Fig.2.4 Note that there are seven leaves with each leafcorresponding to a singleton cluster in the tree structure The smallest distance be-tween a pair of such clusters is 0.8, which leads to merging {F} and {G} to form{F, G} Next merger leads to {D, E} based on a distance of 0.9 units This is fol-lowed by merging {B} and {C}, then {A} and {B, C} at a distance of 1 unit each
single-At this point we have three clusters By merging clusters further we get ultimately asingle cluster as shown in the figure The dendrogram can be broken at different lev-els to yield different clusterings of the data The partition of three clusters obtainedusing the dendrogram is the same as the partition shown in Fig.2.3 A major is-sue with the hierarchical algorithm is that computation and storage of the proximity
matrix requires O(n2)time and space
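To make the single-link steps above concrete, here is a small agglomerative sketch of our own. The seven labeled points are made-up stand-ins, since the coordinates behind Fig. 2.3 and the entries of Table 2.1 are not reproduced in this text; the merge order it prints plays the role of the dendrogram.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(points, k):
    """Agglomerative clustering with single-link (minimum) inter-cluster distance."""
    clusters = [[name] for name in points]          # start with singleton clusters
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} + {clusters[j]} at distance {d:.2f}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# hypothetical 2-D points standing in for A..G of Fig. 2.3
pts = {"A": (1.0, 1.0), "B": (1.5, 2.0), "C": (2.0, 1.0), "D": (5.5, 5.0),
       "E": (6.0, 5.5), "F": (5.0, 8.0), "G": (5.5, 8.5)}
print(single_link(pts, k=3))
```

Recomputing the minimum pairwise distance inside the loop mirrors the O(n²) proximity-matrix cost noted above; practical implementations cache these distances.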
2.2.3 k-Means Algorithm
The k-means algorithm is the most popular clustering algorithm. It is a partitional clustering algorithm and produces clusters by optimizing a criterion function. The most acceptable criterion function is the squared-error criterion, as it can be used to generate compact clusters. The k-means algorithm is the most successfully used squared-error clustering algorithm. The k-means algorithm is popular because it is easy to implement and its time complexity is O(n), where n is the number of patterns. We give a description of the k-means algorithm below.

Fig. 2.5 An optimal clustering of the points
k-Means Algorithm
1. Select k initial centroids. One possibility is to select k out of the n points randomly as the initial centroids. Each of them represents a cluster.
2. Assign each of the remaining n − k points to one of these k clusters; a pattern is assigned to a cluster if the centroid of the cluster is the nearest, among all the k centroids, to the pattern.
3. Update the centroids of the clusters based on the assignment of the patterns.
4. Assign each of the n patterns to the nearest cluster using the current set of centroids.

For the optimal clustering of the points shown in Fig. 2.5, the resulting centroids are:
• centroid1: (1.33, 1.66)^t; centroid2: (6.45, 2)^t; centroid3: (6.4, 2)^t
• The corresponding value of the squared error is around 2 units.
The popularity of the k-means algorithm may be attributed to its simplicity. It requires O(n) time, as it computes nk distances in each pass and the number of passes may be assumed to be a constant. Also, the number of clusters k is a constant. Further, it needs to store k centroids in memory. So, the space requirement is also small.
However, it is possible that the algorithm generates a nonoptimal partition by choosing A, B, and C as the initial centroids, as depicted in Fig. 2.6. In this case, the three centroids are:
• centroid1: (1, 1)^t; centroid2: (1.5, 2)^t; centroid3: (6.4, 4)^t
• The corresponding squared error value is around 17 units.

Fig. 2.6 A nonoptimal clustering of the two-dimensional points
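A compact version of the k-means loop described above is sketched below. This is our code, not the book's, and the sample points are invented, since the coordinates behind Figs. 2.5 and 2.6 are not listed in the text; the two calls at the bottom illustrate the same sensitivity to initial centroids that the nonoptimal example above describes.

```python
from math import dist

def k_means(points, initial_centroids, passes=10):
    """Plain k-means: alternate the assignment and centroid-update steps."""
    centroids = list(initial_centroids)

    def nearest(p):
        return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

    for _ in range(passes):
        clusters = [[] for _ in centroids]
        for p in points:                        # assignment step
            clusters[nearest(p)].append(p)
        for i, members in enumerate(clusters):  # centroid-update step
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    squared_error = sum(dist(p, centroids[nearest(p)]) ** 2 for p in points)
    return centroids, squared_error

# invented 2-D data with three natural groups
data = [(1, 1), (1, 2), (2, 2), (6, 1), (7, 2), (6, 2), (6, 4), (7, 4), (6.5, 5)]
print(k_means(data, initial_centroids=[(1, 1), (6, 1), (6, 4)]))  # good start
print(k_means(data, initial_centroids=[(1, 1), (1, 2), (2, 2)]))  # poor start, larger squared error
```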
2.3 Classification
There are a variety of classifiers. Typically, a set of labeled patterns is used to classify an unlabeled test pattern. Classification involves labeling a test pattern; in the process, either the labeled training dataset is directly used, or an abstraction or model learnt from the training dataset is used. Typically, classifiers learnt from the training dataset are categorized as either generative or discriminative. The Bayes classifier is a well-known generative model, where a test pattern X is classified, or assigned to class C_i, based on the a posteriori probabilities P(C_j/X) for j = 1, ..., C, if

  P(C_i/X) ≥ P(C_j/X) for all j.

These posterior probabilities are obtained using the Bayes rule from the prior probabilities and the probability distributions of patterns in each of the classes. It is possible to show that the Bayes classifier is optimal; it can minimize the average probability of error. The Support Vector Machine (SVM) is a popular discriminative classifier; it learns a weight vector W and a threshold b from the training patterns of two classes. It assigns the test pattern X to class C1 (the positive class) if W^t X + b ≥ 0; otherwise, it assigns X to class C2 (the negative class).
The Nearest-Neighbor Classifier (NNC) is the simplest and most popular classifier; it classifies the test pattern by using the training patterns directly. An important property of the NNC is that its error rate is less than twice the error rate of the Bayes classifier when the number of training patterns is asymptotically large. We briefly describe the NNC, which employs the nearest-neighbor rule for classification.
Table 2.2 Data matrix
Pattern ID feature1 feature2 feature3 feature4 Class label
Output: Class label for the test pattern X.
Decision: Assign X to class C_i if d(X, X_i) = min_j d(X, X_j).
We illustrate the NNC using the four-dimensional dataset shown in Table 2.2. There are eight patterns, X1, ..., X8, from two classes C1 and C2, four patterns from each class. The patterns are four-dimensional, and the dimensions are characterized by feature1, feature2, feature3, and feature4, respectively. In addition to the four features, there is an additional column that provides the class label of each pattern.
Let the test pattern be X = (2.0, 2.0, 2.0, 2.0)^t. The Euclidean distances between X and each of the eight patterns are given by
d(X, X1) = 2.0; d(X, X2) = 8.0; d(X, X3) = 10.0;
d(X, X4) = 1.41; d(X, X5) = 1.0; d(X, X6) = 9.05;
d(X, X7) = 1.41; d(X, X8) = 9.05.
So, the nearest neighbor (NN) of X is X5 because d(X, X5) is the smallest (it is 1.0) among all the eight distances. So, NN(X) = X5, and the class label assigned to X is the class label of X5, which is C1 here; this means that X is assigned to class C1. Note that NNC requires eight distances to be calculated in this example. In general, if there are n training patterns, then the number of distances to be calculated to classify a test pattern is O(n).
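The nearest-neighbor rule just illustrated, and its k-nearest-neighbor variant discussed shortly, fit in a few lines. The sketch below is ours rather than the book's, and the tiny training set is hypothetical (the rows of Table 2.2 are not reproduced in this text); it is only meant to show the decision rule.

```python
from math import dist
from collections import Counter

def nnc(train, x):
    """Nearest-neighbor rule: label of the single closest training pattern."""
    pattern, label = min(train, key=lambda item: dist(item[0], x))
    return label

def knnc(train, x, k=3):
    """k-nearest-neighbor rule: majority label among the k closest patterns."""
    neighbors = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# hypothetical four-dimensional training data in the spirit of Table 2.2
train = [((1.0, 1.0, 1.0, 1.0), "C1"), ((1.0, 2.0, 2.0, 1.0), "C1"),
         ((2.0, 1.0, 2.0, 2.0), "C1"), ((2.5, 2.0, 1.5, 2.0), "C1"),
         ((6.0, 6.0, 6.0, 6.0), "C2"), ((7.0, 6.5, 6.0, 7.0), "C2"),
         ((6.5, 7.0, 7.0, 6.0), "C2"), ((7.0, 7.0, 6.5, 6.5), "C2")]
x = (2.0, 2.0, 2.0, 2.0)
print(nnc(train, x), knnc(train, x, k=3))   # both give C1 for this test pattern
```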
The nearest-neighbor classifier is popular because:
1 It is easy to understand and implement
2 There is no learning or training phase; it uses the whole training data to classifythe test pattern
3 Unlike the Bayes classifier, it does not require the probability structure of theclasses
4. It shows good performance. If the optimal accuracy is 99.99 %, then with large training data, it can give at least 99.80 % accuracy.
Even though it is popular, there are some negative aspects They include:
Trang 333 The distance between a pair of points may not be meaningful in high-dimensionalspaces It is known that, as the dimensionality increases, the distance between a
point X and its nearest neighbor tends toward the distance between X and its
farthest neighbor As a consequence, NNC may perform poorly in the context ofhigh-dimensional spaces
Some of the possible solutions to the above problems are:
1 In order to tolerate noise, a modification to NNC is popularly used; it is called the
k-Nearest Neighbor Classifier (kNNC) Instead of deciding the class label of X using the class label of the NN(X), X is labeled using the class labels of k nearest neighbors of X In the case of kNNC, the class label of X is the label of the class that is the most frequent among the class labels of the k nearest neighbors In other words, X is assigned to the class to which majority of its k nearest neigh- bors belong; the value of k is to be fixed appropriately In the example dataset
shown in Table2.2, the three nearest neighbors of X = (2.0, 2.0, 2.0, 2.0) t are
X5, X4, and X7 All the three neighbors are from class C1; so X is assigned to class C1
2 NNC requires O(n) time to compute the n distances, and also it requires O(n)
space It is possible to reduce the effort by compressing the training data Thereare several algorithms for performing this compression; we consider here a
scheme based on clustering We cluster the n patterns into k clusters using the k-means algorithm and use the k resulting centroids instead of the n training pat-
terns Labeling the centroids is done by using the majority class label in eachcluster
By clustering the example dataset shown in Table2.2using the k-means gorithm, with a value of k = 2, we get the following clusters:
al-• Cluster1: {X1, X4, X5, X7} – Centroid: (1.0, 1.5, 1.75, 1.5) t
• Cluster2: {X2, X3, X6, X8} – Centroid: (6.5, 6.5, 6.5, 6.5) t
Note that Cluster1 contains four patterns from C1and Cluster2 has the four
pat-terns from C2 So, by using these two representatives instead of the eight trainingpatterns, the number of distance computations and memory requirements will
reduce Specifically, Centroid of Cluster1 is nearer to X than the Centroid of Cluster2 So, X is assigned to C1using two distance computations
3. In order to reduce the dimensionality, several feature selection/extraction techniques are used. We use a feature set partitioning scheme that we explain in detail in the sequel.
Another important classifier is based on the Support Vector Machine. We consider it next.
Support Vector Machine The support vector machine (SVM) is a very popular classifier. Some of the important properties of SVM-based classification are:
• The SVM classifier is a discriminative classifier. It can be used to discriminate between two classes. Intrinsically, it supports binary classification.
• It obtains a linear discriminant function of the form W t X + b from the training data Here, W is called the weight vector of the same size as the data points, and
b is a scalar Learning the SVM classifier amounts to obtaining the values of W and b from the training data.
• It is ideally associated with a binary classification problem Typically, one of them
is called the negative class, and the other is called the positive class
• If X is from the positive class, then W t X + b > 0, and if X is from the negative class, then W t X + b < 0.
• It finds the parameters W and b so that the margin between the two classes is
maximized
• It identifies a subset of the training patterns, which are called support vectors.
These support vectors lie on parallel hyperplanes; negative and positive
hyper-planes correspond respectively to the negative and positive classes A point X on the negative hyperplane satisfies W t X + b = −1, and similarly, a point X on the positive hyperplane satisfies W t X + b = 1.
• The margin between the two support planes is maximized in the process of
finding out W and b In other words, the normal distance between the port planes W t X + b = −1 and W t X + b = 1 is maximized The distance is
sup-2
W It is maximized using the constraints that every pattern X from the
pos-itive class satisfies W t X + b ≥ +1 and every pattern X from the negative class satisfies W t X + b ≤ −1 Instead of maximizing the margin, we mini-
mize its inverse This may be viewed as a constrained optimization problem givenby
s.t y i
W t X i + b≥ 1, i = 1, 2, , n, where y i = 1 if X i is in the positive class and y i = −1 if X i is in the negativeclass
• The Lagrangian for the optimization problem is
In order to minimize the Lagrangian, we take the derivative with respect to
b and gradient with respect to W , and equating to 0, we get α is that isfy
Trang 35• It is possible to view the decision boundary as W t X + b = 0 and W is orthogonal
to the decision boundary
We illustrate the working of the SVM using an example in the two-dimensional
space Let us consider two points, X1= (2, 1) t from the negative class and X2=
(6, 3) t from the positive class We have the following:
• Using α1 y1 + α2 y2 = 0 and observing that y1 = −1 and y2 = 1, we get α1 = α2. So, we use α instead of α1 or α2.
• As a consequence, W = −α X1 + α X2 = (4α, 2α)^t.
• We know that W^t X1 + b = −1 and W^t X2 + b = 1; substituting the values of W, X1, and X2, we get

    8α + 2α + b = −1,
    24α + 6α + b = 1.

Solving these, we get 20α = 2, that is, α = 1/10, and from one of the above equations we get b = −2.
• From W = (4α, 2α)^t and α = 1/10, we get W = (2/5, 1/5)^t.
• In this simple example, we started with two support vectors in the two-dimensional case, so it was easy to solve for the α's. In general, there are efficient schemes for finding these values.
• If we consider a point X = (x1, x2)^t on the line x2 = −2x1 + 5, for example the point (1, 3)^t, then W^t (1, 3)^t − 2 = −1 since W = (2/5, 1/5)^t. This line is the support line for the points in the negative class. In a higher-dimensional space, it is a hyperplane.
• In a similar manner, any point on the parallel line x2 = −2x1 + 15, for example (5, 5)^t, satisfies W^t (5, 5)^t − 2 = 1, and this parallel line is the support line for the positive class. Again, in a higher-dimensional space, it becomes a hyperplane parallel to the negative-class hyperplane.
• Note that the decision boundary is given by

    (2/5, 1/5) X − 2 = 0.

So, the decision boundary (2/5) x1 + (1/5) x2 − 2 = 0 lies exactly in the middle of the two support lines and is parallel to both. Note that (4, 2)^t is located on the decision boundary.
• A point (7, 6)^t is in the positive class since W^t (7, 6)^t − 2 = 2 > 0. Similarly, W^t (1, 1)^t − 2 = −1.4 < 0, so (1, 1)^t is in the negative class. A small code sketch verifying these computations is given after this list.
• We have discussed what is known as the linear SVM. If the two classes are linearly separable, then the linear SVM is sufficient.
• If the classes are not linearly separable, then we map the points to a high-dimensional space with the hope of finding linear separability in the new space. Fortunately, one can implicitly make computations in the high-dimensional space without having to work explicitly in it. This is possible by using a class of kernel functions that characterize similarity between patterns.
• However, in large-scale applications involving high-dimensional data, as in text mining, linear SVMs are used by default for their simplicity in training.
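The following is a minimal sketch, assuming NumPy, that merely reproduces the arithmetic of the worked two-point example above; the values of α, W, and b are the ones derived in the text rather than the output of a general-purpose SVM solver.

import numpy as np

X1, y1 = np.array([2.0, 1.0]), -1   # support vector from the negative class
X2, y2 = np.array([6.0, 3.0]), +1   # support vector from the positive class

alpha = 1 / 10
W = alpha * y1 * X1 + alpha * y2 * X2   # -> array([0.4, 0.2]), i.e. (2/5, 1/5)^t
b = y2 - W.dot(X2)                      # -> -2.0, obtained from W^t X2 + b = 1

def decision(x):
    # The sign of W^t x + b decides the class.
    return W.dot(x) + b

print(decision(np.array([2.0, 1.0])))   # -1.0, on the negative support line
print(decision(np.array([6.0, 3.0])))   #  1.0, on the positive support line
print(decision(np.array([4.0, 2.0])))   #  0.0, on the decision boundary
print(decision(np.array([7.0, 6.0])))   #  2.0 > 0, positive class
print(decision(np.array([1.0, 1.0])))   # -1.4 < 0, negative class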
2.4 Association Rule Mining
Conventionally, this activity is not a part of either pattern recognition or machine learning. An association rule is an implication of the form A → B, where A and B are disjoint itemsets; A is called the antecedent, and B is called the consequent. This activity became popular in the context of market-basket analysis, where one is concerned with the set of items available in a supermarket and the transactions made by various customers. In such a context, an association rule provides information on the association between two sets of items that are frequently bought together; this facilitates strategic decisions that may have a positive commercial impact, such as displaying related items on appropriate shelves to avoid congestion or offering incentives to customers on some products/items. Some of the features of the association rule mining activity are:
1. The rule A → B is not like the conventional implication used in a classical logic, for example, propositional logic. Here, the rule does not guarantee the purchase of the items in B in the same transaction in which the items in A are bought; it depicts a kind of frequent association between A and B in terms of buying patterns.
2. It is assumed that there is a global set of items I; in the case of market-basket analysis, I is the set of all items/product lines available for sale in a supermarket. Note that A and B are disjoint subsets of I. So, if the cardinality of I is d, then the number of all possible rules is O(3^d); this is because each of the d items in I can be a part of A, a part of B, or a part of neither. In order to reduce the mining effort, only a subset of the rules, based on frequently bought items, is examined.
3. Popularly, the quantity of an item bought is not used; only whether an item is bought in a transaction or not is considered. For example, if a customer buys 1.2 kilograms of Sugar, 3 loaves of Bread, and a tin of Jam in the same transaction, then the corresponding transaction is represented as {Sugar, Bread, Jam}. Such a representation helps in viewing a transaction as a subset of I.
4. In order to mine useful rules, only rules of the form A → B, where A and B are subsets of frequent itemsets, are explored. So, it is important to consider algorithms for frequent itemset mining. Once all the frequent itemsets are mined, it is required to obtain the corresponding association rules.
Table 2.3 Transaction data (transactions over the items a, b, c, d, e)
Given an itemset X, Support-set(X) is the set of transactions that contain all the items in X. The support of X is given by the cardinality of Support-set(X), that is, |Support-set(X)|. An itemset X is a frequent itemset if Support(X) ≥ Minsup, where Minsup is a user-specified threshold.
If we use a Minsup value of 4, then the itemset {a, d} is frequent. Further, {a, b, c} is not frequent; we call such itemsets infrequent. There is a systematic way of enumerating all the frequent itemsets; this is done by an algorithm called Apriori. This algorithm enumerates a relevant subset of the itemsets for examining whether they are frequent or not. It is based on the following observations.
1. Any subset of a frequent itemset is frequent. This is because if A and B are two itemsets such that A is a subset of B, then Support(A) ≥ Support(B), since Support-set(B) ⊆ Support-set(A). For example, knowing that the itemset {a, d} is frequent, we can infer that the itemsets {a} and {d} are frequent. Note that in the data shown in Table 2.3, Support({a}) = 6 and Support({d}) = 6, and both exceed the Minsup value.
2. Any superset of an infrequent itemset is infrequent. If A and B are two itemsets such that A is a superset of B, then Support(A) ≤ Support(B). In the example, {a, c} is infrequent; one of its supersets, {a, c, d}, is also infrequent. Note that Support({a, c, d}) = 2, which is less than the Minsup value.
Table 2.4 Printed characters (two 3 × 3 patterns of the character 1)
Based on these observations, the Apriori algorithm works level by level, alternating between two steps:
• Generating Candidate itemsets of size k. These itemsets are obtained by looking at the frequent itemsets of size k − 1.
• Generating Frequent itemsets of size k. This is achieved by scanning the transaction database once to check whether each candidate of size k is frequent or not.
It starts with the empty set (φ), which is frequent because the empty set is a subset of every transaction; so, Support(φ) = |T|, where T is the set of transactions. Note that φ is an itemset of size 0, as there are no items in it. The algorithm then generates candidate itemsets of size 1; we call such itemsets 1-itemsets. Note that every 1-itemset is a candidate. In the example data shown in Table 2.3, the candidate 1-itemsets are {a}, {b}, {c}, {d}, {e}. Now it scans the database once to obtain the supports of these 1-itemsets.
Using a Minsup value of 4, we can observe that the frequent 1-itemsets are {a}, {b}, and {d}. From these frequent 1-itemsets we generate candidate 2-itemsets; the candidates are {a, b}, {a, d}, and {b, d}. Note that the other 2-itemsets need not be considered as candidates because they are supersets of infrequent itemsets and hence cannot be frequent. For example, {a, c} is infrequent because {c} is infrequent. A second database scan is used to find the support values of these candidates. The supports are Support({a, b}) = 3, Support({a, d}) = 5, and Support({b, d}) = 3. So, only {a, d} is a frequent 2-itemset, and there cannot be any candidates of size 3. For example, {a, b, d} cannot be frequent because {a, b} is infrequent.
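A minimal sketch of this level-wise procedure is given below, assuming the transactions fit in memory as Python sets; the sample transactions are hypothetical and are not the contents of Table 2.3, so the printed supports need not match the values quoted above.

from itertools import combinations

def apriori(transactions, minsup):
    # Return a dict mapping each frequent itemset (a frozenset) to its support.
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    # Level 1: every 1-itemset is a candidate.
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: n for s, n in level.items() if n >= minsup}
    all_frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # those size-k sets all of whose (k-1)-subsets are frequent.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count candidate supports by subset tests over the in-memory transactions.
        counts = {c: support(c) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(level)
        k += 1
    return all_frequent

# Hypothetical transactions over the items {a, b, c, d, e}.
T = [set('ad'), set('abd'), set('ad'), set('abde'),
     set('acd'), set('abd'), set('bc'), set('cd')]
print(apriori(T, minsup=4))   # here {a}, {b}, {d}, and {a, d} turn out frequent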
It is important to note that transactions need not be associated with supermarket buying patterns only; it is possible to view a wide variety of patterns as transactions. For example, consider the printed characters of size 3 × 3 corresponding to the character 1 shown in Table 2.4; there are two 1s. In the left-side matrix, the character is present in the third column, and in the right-side matrix, it is present in the first column. By labeling the locations in such 3 × 3 matrices using 1 to 9 in a row-major fashion, the two patterns may be viewed as transactions over these 9 items. Specifically, the transactions are t1: {3, 6, 9} and t2: {1, 4, 7}, where t1 corresponds to the left-side pattern and t2 corresponds to the right-side pattern in Table 2.4. Let us call the left-side 1 Type1 1 and the right-side 1 Type2 1.
Table 2.5 Transactions for characters of 1 (columns: TID, cell labels 1 to 9, Class)
There are six transactions, each of them corresponding to a 1. By using a Minsup value of 3, we get the frequent itemset {1, 4, 7} for Type1 1 and the frequent itemset {3, 6, 9} for Type2 1. Naturally, subsets of these frequent itemsets are also frequent.
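The following is a small sketch, assuming NumPy, of the conversion described above from a 3 × 3 binary character matrix to a transaction; the two matrices encode the vertical strokes in the third and first columns as described in the text.

import numpy as np

def matrix_to_transaction(m):
    # Label the cells 1 to 9 in row-major order and return the set of labels
    # of the cells that contain a 1.
    m = np.asarray(m)
    return {r * m.shape[1] + c + 1 for r, c in zip(*np.nonzero(m))}

type1 = [[0, 0, 1],     # Type1 1: stroke in the third column
         [0, 0, 1],
         [0, 0, 1]]
type2 = [[1, 0, 0],     # Type2 1: stroke in the first column
         [1, 0, 0],
         [1, 0, 0]]

print(matrix_to_transaction(type1))   # {3, 6, 9}
print(matrix_to_transaction(type2))   # {1, 4, 7}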
2.4.2 Association Rules
In association rule mining there are two important phases:
1. Generating Frequent Itemsets. This requires one or more dataset scans. Based on the discussion in the previous subsection, Apriori requires k + 1 dataset scans if the largest frequent itemset is of size k.
2. Obtaining Association Rules. This step generates association rules based on frequent itemsets. Once the frequent itemsets are obtained from the transaction dataset, association rules can be derived without any more dataset scans, provided that the support of each frequent itemset is stored. So, this step is computationally simpler.
If X is a frequent itemset, then rules of the form A → B, where A ⊂ X and B = X − A, are considered. Such a rule is accepted if the confidence of the rule exceeds a user-specified confidence value called Minconf. The confidence of the rule A → B is Support(A ∪ B)/Support(A).
For example, in the dataset shown in Table 2.3, {a, d} is a frequent itemset. So, there are two possible association rules. They are:
1. {a} → {d}; its confidence is 5/6.
2. {d} → {a}; its confidence is 5/6.
So, if the Minconf value is 0.5, then both these rules satisfy the confidence threshold.
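A minimal sketch of this rule-generation step is given below; it assumes that the frequent itemsets and their supports have already been mined (for instance, by the Apriori sketch given earlier), and the supports used here are the ones quoted in the running example.

from itertools import combinations

def generate_rules(frequent, minconf):
    # Yield (antecedent, consequent, confidence) for rules A -> X - A, where A
    # is a proper non-empty subset of a frequent itemset X.
    for X, sup_x in frequent.items():
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                if A in frequent:                  # subsets of X are also frequent
                    conf = sup_x / frequent[A]     # Support(X) / Support(A)
                    if conf >= minconf:
                        yield A, X - A, conf

# Supports quoted in the running example.
frequent = {frozenset('a'): 6, frozenset('d'): 6, frozenset('ad'): 5}

for A, B, conf in generate_rules(frequent, minconf=0.5):
    print(sorted(A), '->', sorted(B), 'confidence =', round(conf, 2))
# Prints {a} -> {d} and {d} -> {a}, each with confidence 5/6 (about 0.83).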
In the case of the character data shown in Table 2.5, it is appropriate to consider rules of the form:
• {1, 4, 7} → Type1 1
• {3, 6, 9} → Type2 1
Typically, the antecedent of such an association rule or a classification rule is a disjunction of one or more maximal frequent itemsets. A frequent itemset A is maximal if there is no frequent itemset B such that A is a proper subset of B. This illustrates the role of frequent itemsets in classification.
2.5 Mining Large Datasets
There are several applications where the size of the pattern matrix is large. By large, we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing. For example, a transaction database of a supermarket chain may consist of trillions of transactions, and each transaction is a sparse vector of very high dimensionality; the dimensionality depends on the number of product lines. Similarly, in a network intrusion detection application, the number of connections could be prohibitively large, and the number of packets to be analyzed or classified could be even larger. Another application is the clustering of click-streams; this forms an important part of web usage mining. Other applications include genome sequence mining, where the dimensionality could run into millions, social network analysis, text mining, and biometrics.
An objective way of characterizing the largeness of a dataset is by specifying bounds on the number of patterns and features present. For example, a dataset having more than a billion patterns and/or more than a million features is large. However, such a characterization is not universally acceptable and is bound to change with developments in technology; for example, in the 1960s, “large” meant several hundreds of patterns. So, it is better to consider a more pragmatic characterization: large datasets are those that may not fit in the main memory of the computer, so the largeness of the data varies with technological developments. Such large datasets are typically stored on a disk, and each point in the set is accessed from the disk based on processing needs. Note that disk access can be several orders of magnitude slower than memory access; this property remains intact even though memory and disk sizes at different points of time in the past have been different. So, characterizing largeness using this property could be more meaningful.
The above discussion motivates the need for integrating various algorithmic design techniques with the existing mining algorithms so that they can handle