
A NOVEL APPROACH FOR MINING EMERGING PATTERNS IN DATA STREAMS

Hamad Alhammady

Etisalat University College - UAE hamad@euc.ac.ae

ABSTRACT

Streaming data mining is one of the most difficult tasks in Knowledge Discovery in Databases (KDD). This task is essential in many applications, such as financial applications, network monitoring, and marketing. In this model, data arrives in multiple, continuous, rapid, time-varying data streams. These characteristics make it infeasible for traditional mining techniques to deal with data streams. In this paper, we propose a new approach for mining emerging patterns (EPs) in streaming data. EPs are itemsets whose frequencies in one class are significantly higher than their frequencies in the other classes. Our experiments show that the new method for mining EPs improves the accuracy of classifying data streams.

1 INTRODUCTION

Many recent studies show that the major challenge in streaming data is its unbounded size [1] [2] [3]. This makes it infeasible to store the entire data on disk. There are two important problems arising from this fact. Firstly, multi-pass algorithms, which need the entire data to be stored in conventional relations, cannot deal directly with data streams. Secondly, obtaining exact answers from data streams is too expensive [4].

EPs are a new kind of pattern introduced recently [5]. They have been shown to have a great impact in many applications [6] [7] [8] [9] [10]. EPs can capture significant changes between datasets. They are defined as itemsets whose supports increase significantly from one class to another. The discriminating power of an EP can be measured by its growth rate. The growth rate of an EP is the ratio of its support in a certain class to its support in another class. Usually, the discriminating power of an EP is proportional to its growth rate.

For example, the Mushroom dataset, from the UCI Machine Learning Repository [11], contains a large number of EPs between the poisonous and the edible mushroom classes. Table 1 shows two examples of these EPs. Each of these two EPs consists of 3 items. e1 is an EP from the poisonous mushroom class to the edible mushroom class. It never occurs in the poisonous mushroom class, and occurs in 63.9% of the instances in the edible mushroom class; hence, its growth rate is ∞ (63.9 / 0). It has a very high predictive power to contrast edible mushrooms against poisonous mushrooms. On the other hand, e2 is an EP from the edible mushroom class to the poisonous mushroom class. It occurs in 3.8% of the instances in the edible mushroom class, and in 81.4% of the instances in the poisonous mushroom class; hence, its growth rate is 21.4 (81.4 / 3.8). It has a high predictive power to contrast poisonous mushrooms against edible mushrooms.

Table 1. Examples of emerging patterns

| EP | Support in poisonous mushrooms | Support in edible mushrooms | Growth rate |
|----|--------------------------------|-----------------------------|-------------|
| e1 | 0% | 63.9% | ∞ |
| e2 | 81.4% | 3.8% | 21.4 |

e1 = {(ODOR = none), (GILL_SIZE = broad), (RING_NUMBER = one)}
e2 = {(BRUISES = no), (GILL_SPACING = close), (VEIL_COLOR = white)}

Current approaches for mining EPs [12] [13] cannot be used directly in data streams because they are based on multi-pass algorithms. The work in [4] introduces a new type of EPs, approximate EPs (AEPs). This type of EPs enables current mining techniques to operate on streaming data. AEPs and the AEP-tree method have shown good accuracy in classifying streaming data. In this paper, we propose another new type of EPs, matching EPs (MEPs). MEPs have two advantages over AEPs, in terms of mining complexity and classification accuracy (details are discussed later).

2 RELATED WORK

The data stream model differs from the conventional stored relation model in a number of ways. The most significant difference is that data streams are unbounded in size. Most of the instances from a data stream have to be discarded after processing. However, a certain number of instances can be stored for future analysis. This number is proportional to the available memory. That is, the data stream model does not preclude the presence of some data stored in conventional relations [1].

The idea of storing some instances from a data stream in conventional relations is fundamental to many techniques used in the data stream model. These techniques include sliding windows, sampling, and synopsis data structures. They are the basic features of any Data Stream Management System (DSMS) such as STREAM [14].

Sliding windows [1] have a noticeable power to obtain approximate answers to data stream queries. This technique involves using a sliding window of recent data from the data stream rather than operating over the entire range of data. For example, if an unlabeled instance arrives from a data stream and needs to be classified into one of the classes associated with this data stream, then only a certain number of recent instances (a window) will be used to train the classifier.

Figure 1. Sliding window technique. [Diagram labels: Past Data (Discarded), Sliding Window (Recent Data), Future Data]

Figure 1 sketches the idea behind this technique. The sliding window technique has the advantage of being well-defined. In addition, it is a deterministic method, which avoids the problem of bad approximation caused by random sampling. Most importantly, it emphasizes recent data, which is considered to be the most interesting data in a large number of real-life applications [1]. However, the sliding window technique suffers from the elimination of some important information contained in old (discarded) data. That is, sliding windows do not represent the whole range of knowledge contained in the data, but rather a portion of that knowledge (proportional to the size of the sliding window). This problem may affect the quality of approximation.
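The sliding window idea can be illustrated with a minimal Python sketch. The function name `sliding_window` and the parameter `window_size` are illustrative assumptions, not part of the original work; the sketch only shows that each new instance is paired with a bounded buffer of the most recent instances while older ones are discarded.

```python
from collections import deque

def sliding_window(stream, window_size):
    """Yield (recent_instances, new_instance) pairs for a stream.

    `recent_instances` holds at most `window_size` instances that arrived
    before `new_instance`; anything older has been discarded.
    """
    window = deque(maxlen=window_size)  # the oldest instance is dropped automatically
    for instance in stream:
        yield list(window), instance    # a classifier would be trained on this window
        window.append(instance)

# Example: a stream of six instances and a window of three.
pairs = list(sliding_window(range(6), window_size=3))
```

In this sketch the window is deterministic: for the same stream and window size, every run trains on exactly the same recent instances, which is the property contrasted with random sampling above.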

Sampling [15] is another technique for approximation in the data stream model. In this case, the streaming data is randomly sampled to a certain number of instances. This number is proportional to the available memory. In contrast with the sliding window technique, sampling has the advantage of representing the whole range of old data. The representation level is proportional to the sampling rate. On the other hand, sampling may suffer from problems caused by noisy instances being selected during the random sampling process.
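One standard way to keep a fixed-size random sample of an unbounded stream is reservoir sampling. The text does not name the sampling scheme used, so the following Python sketch is an assumption chosen for illustration; `reservoir_sample`, `k`, and `seed` are hypothetical names.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k instances from a stream.

    This is classic reservoir sampling: memory use is bounded by k,
    regardless of how many instances the stream delivers.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, instance in enumerate(stream):
        if i < k:
            reservoir.append(instance)   # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = instance  # replace a stored instance
    return reservoir

sample = reservoir_sample(range(1000), k=10, seed=0)
```

Unlike a sliding window, old instances keep a chance of staying in the sample, but nothing prevents a noisy instance from being selected.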

Synopsis data structures [1] aim at summarizing the most important characteristics of the whole range of data. These important characteristics play a key role in classifying future unlabeled instances. Synopsis data structures, like the sampling technique, have the advantage of representing the whole range of old data. Moreover, these structures avoid the problem caused by noisy instances. The reason is that they store the important characteristics rather than the data itself. EPs can be thought of as synopsis data structures because they represent the discriminating characteristics of the data they are related to.

Approximate emerging patterns (AEPs) adopt approximation to mine EPs from data streams [4]. Mining AEPs is based on mining EPs from blocks of streaming data and merging the resulting EP sets to get a fixed number of AEPs. These special EPs are described as approximate because they are not mined from the complete range of data. The AEP tree is a new type of decision tree for classifying streaming data. This tree uses AEPs rather than data instances to make decisions on the classes of unlabeled data.

3 EMERGING PATTERNS AND CLASSIFICATION

Let obj = {a1, a2, a3, ..., an} be a data object following the schema {A1, A2, A3, ..., An}. A1, A2, A3, ..., An are called attributes, and a1, a2, a3, ..., an are values related to these attributes. We call each pair (attribute, value) an item. Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I. We say an instance Y contains an itemset X if X ⊆ Y.

Definition 1. Given a dataset D and an itemset X, the support of X in D, sD(X), is defined as

$$ s_D(X) = \frac{count_D(X)}{|D|} \tag{1} $$

where countD(X) is the number of instances in D containing X.

Definition 2. Given two different classes of datasets D1 and D2, let si(X) denote the support of the itemset X in the dataset Di. The growth rate of an itemset X from D1 to D2, grD1→D2(X), is defined as

$$ gr_{D_1 \to D_2}(X) = \begin{cases} 0, & \text{if } s_1(X) = 0 \text{ and } s_2(X) = 0 \\ \infty, & \text{if } s_1(X) = 0 \text{ and } s_2(X) \neq 0 \\ \dfrac{s_2(X)}{s_1(X)}, & \text{otherwise} \end{cases} $$

Definition 3. Given a growth rate threshold ρ > 1, an itemset X is said to be a ρ-emerging pattern (ρ-EP or simply EP) from D1 to D2 if grD1→D2(X) ≥ ρ.
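Definitions 1 to 3 translate almost directly into code. The Python sketch below is illustrative only: instances and itemsets are represented as frozensets of (attribute, value) pairs, the toy data is invented for the example, and returning a support of 0 for an empty dataset is an assumption the definitions do not cover.

```python
import math

def support(itemset, dataset):
    """Definition 1: fraction of instances in `dataset` that contain `itemset`."""
    if not dataset:
        return 0.0  # assumption: an empty dataset gives support 0
    return sum(1 for instance in dataset if itemset <= instance) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Definition 2: growth rate of `itemset` from d1 to d2."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

def is_ep(itemset, d1, d2, rho):
    """Definition 3: `itemset` is a rho-EP from d1 to d2."""
    return growth_rate(itemset, d1, d2) >= rho

# Invented toy data with two classes of four instances each.
d1 = [frozenset({("odor", "foul"), ("bruises", "no")}),
      frozenset({("odor", "foul"), ("bruises", "yes")}),
      frozenset({("odor", "none"), ("bruises", "no")}),
      frozenset({("odor", "foul"), ("bruises", "no")})]
d2 = [frozenset({("odor", "none"), ("bruises", "yes")}),
      frozenset({("odor", "none"), ("bruises", "no")}),
      frozenset({("odor", "none"), ("bruises", "yes")}),
      frozenset({("odor", "foul"), ("bruises", "yes")})]

x = frozenset({("odor", "none")})                      # support 0.25 in d1, 0.75 in d2
y = frozenset({("odor", "none"), ("bruises", "yes")})  # never occurs in d1

gr_x = growth_rate(x, d1, d2)  # 0.75 / 0.25 = 3.0
gr_y = growth_rate(y, d1, d2)  # infinite, as for e1 in Table 1
```

The subset test `itemset <= instance` is the containment relation X ⊆ Y used in the definition of support.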

Let C = {c1, ..., ck} be a set of class labels. A training dataset is a set of data objects such that, for each object obj, there exists a class label cobj ∈ C associated with it. A classifier is a function from attributes {A1, A2, A3, ..., An} to class labels {c1, ..., ck} that assigns class labels to unseen examples.

4 MINING MATCHING EMERGING PATTERNS

We adopt the streaming data model presented in [4]. This model is shown in Figure 2. Assume that the data stream consists of two classes, C1 and C2. Data is received in blocks of size N, where N is decided according to the memory available in the system. Bt,j is a block of instances related to class j (C1 or C2) at time t.

After receiving and processing a number of data blocks, we need to gain information to classify the future unlabeled instances in the data streams. This information can be expressed as EPs. However, mining EPs from a dataset requires the availability of all instances in this dataset. This is infeasible in data streams as data is arriving continuously.

Figure 2. Data stream model. [Diagram labels: Data Stream, N]

Our method is based on mining EPs from selected blocks of data and matching these EPs with the future data. As data is streaming in blocks of size N, blocks of all classes related to period t (Bt,1 and Bt,2) are stored, processed and then discarded. EPs for both classes are mined from data blocks Bt,1 and Bt,2 before discarding them. These EPs are EPt,1 (for C1) and EPt,2 (for C2). These EPs are moved to new sets of EPs called matching EPs (MEPs). That is, MEP1 represents the MEPs of class C1 and MEP2 represents the MEPs of class C2.

In the following stage, new data blocks arrive, Bt+1,1 and Bt+1,2. EPs are not mined from these new blocks. Instead, current MEPs are matched with the new data instances to check if they are still EPs for the new data. If MEPs retain their EP characteristics (existing in one class more than the other), they remain in the set of MEPs. Otherwise, they are eliminated from the set. For example, EPs in MEP1 are matched with the new data blocks. If they exist in Bt+1,1 more frequently than in Bt+1,2, then they are still valid EPs and remain in MEP1; otherwise, they are eliminated. The same is applied to MEP2.

The above process continues for the future blocks of data until the number of EPs in MEP1 (or MEP2) is reduced to a predefined number, α, because of the elimination process. In this case, EPs are mined from the new blocks of data and the best EPs are chosen to refill the MEP1 (or MEP2) set. For example, suppose that the number of EPs in MEP1 has been reduced to α at time t+8; then EPs are mined from block Bt+9,1 to create its set of EPs, EPt+9,1. The strongest¹ EPs in EPt+9,1 are used to refill MEP1 again. Algorithm 1 explains the idea of mining MEPs.

We call the emerging patterns resulting from the above algorithm matching emerging patterns (MEPs). These EPs are mined from selected blocks of data rather than the complete range of data. In spite of that, our approach guarantees that these MEPs are inherited from all the old discarded data. That is, for each class, we have to store only its limited set of MEPs rather than its growing number of data instances.

MEPs are motivated by the following points:

1 If an old EP no longer exists in the future blocks of data, it is eliminated.

2 If an EP exists in the future data blocks, it remains in the MEP set.

¹ The strength of an EP is proportional to its support and growth rate.

Algorithm 1. Mining MEPs from streaming data

MEP1 = ∅, MEP2 = ∅, t = 0
As data is streaming Do
    t = t + 1
    If number of EPs in MEP1 < α
        EPt,1 = mined EPs from Bt,1
        MEP1 = MEP1 ∪ strongest EPs in EPt,1
    Else
        For each EP e in MEP1
            Match e with Bt,1 and Bt,2
            If e is no longer an EP
                Remove e from MEP1
            End if
        End for
    End if
    If number of EPs in MEP2 < α
        EPt,2 = mined EPs from Bt,2
        MEP2 = MEP2 ∪ strongest EPs in EPt,2
    Else
        For each EP e in MEP2
            Match e with Bt,2 and Bt,1
            If e is no longer an EP
                Remove e from MEP2
            End if
        End for
    End if
End do
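Algorithm 1 can be sketched in Python under several stated assumptions. The EP miner is not specified in this section, so `mine_eps` below is a hypothetical brute-force stand-in that enumerates short itemsets; ranking "strongest" patterns by growth rate and then support, the `capacity` limit on refills, the threshold `rho`, and the toy blocks are likewise illustrative choices. The matching test follows the text: a pattern is kept if it occurs more frequently in its own class block than in the other.

```python
from itertools import combinations

def support(itemset, block):
    """Fraction of instances in `block` that contain `itemset`."""
    if not block:
        return 0.0
    return sum(1 for inst in block if itemset <= inst) / len(block)

def growth_rate(itemset, other, own):
    """Growth rate of `itemset` from the other class block to its own class block."""
    s_other, s_own = support(itemset, other), support(itemset, own)
    if s_other == 0:
        return 0.0 if s_own == 0 else float("inf")
    return s_own / s_other

def mine_eps(own, other, rho, max_len=2):
    """Hypothetical brute-force miner: EPs of `own`, strongest first."""
    candidates = set()
    for inst in own:
        for k in range(1, max_len + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(inst), k))
    scored = [(growth_rate(x, other, own), support(x, own), x) for x in candidates]
    eps = [t for t in scored if t[0] >= rho]
    eps.sort(key=lambda t: (t[0], t[1]), reverse=True)  # assumed strength ranking
    return [x for _, _, x in eps]

def update_meps(meps, own, other, alpha, capacity, rho):
    """One iteration of Algorithm 1 for a single class."""
    meps = set(meps)
    if len(meps) < alpha:
        # Refill: mine the current blocks and add the strongest EPs.
        for ep in mine_eps(own, other, rho):
            if len(meps) >= capacity:
                break
            meps.add(ep)
        return meps
    # Match: keep only patterns still more frequent in their own class.
    return {e for e in meps if support(e, own) > support(e, other)}

def mine_meps(blocks, alpha=2, capacity=4, rho=2.0):
    """Process a sequence of (B_t,1, B_t,2) block pairs and return (MEP1, MEP2)."""
    mep1, mep2 = set(), set()
    for b1, b2 in blocks:
        mep1 = update_meps(mep1, b1, b2, alpha, capacity, rho)
        mep2 = update_meps(mep2, b2, b1, alpha, capacity, rho)
    return mep1, mep2

def inst(*items):
    return frozenset(items)

blocks = [
    ([inst(("a", 1), ("b", 0)), inst(("a", 1), ("b", 1))],   # B_1,1
     [inst(("a", 0), ("b", 0)), inst(("a", 0), ("b", 1))]),  # B_1,2
    ([inst(("a", 1), ("b", 1)), inst(("a", 0), ("b", 1))],   # B_2,1
     [inst(("a", 0), ("b", 0)), inst(("a", 1), ("b", 0))]),  # B_2,2
]
mep1, mep2 = mine_meps(blocks)
# After the second block only one pattern per class survives matching:
# mep1 == {{("a", 1), ("b", 1)}} and mep2 == {{("a", 0), ("b", 0)}}
```

In this toy run, mining happens only on the first block pair; the second pair is handled purely by matching, which is the selective behaviour the algorithm relies on to limit mining cost.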

The above two points support the importance of recent data, which is the main advantage of the sliding window technique. Moreover, they show that the MEPs are related to all the previous data, which is the advantage of the sampling technique. Furthermore, MEPs overcome the problem of mining EPs from all blocks of streaming data in the AEP method by applying a selective approach to choose certain blocks of data. This ensures that the mining process is conducted only when necessary. Our approach ensures that at each period of time we have limited sets of MEPs that best represent all the discarded data. These sets can be used at any time to classify unlabeled data instances using the AEP-tree proposed in [4].

5 EXPERIMENTAL EVALUATION

In this section, we apply five techniques to the data streaming model described in Section 4. These techniques are the AEP-tree using MEPs, the AEP-tree using AEPs, random sampling, sliding window, and a traditional classifier (the C4.5 decision tree). Besides being applied on its own, the C4.5 decision tree is the base classifier for the random sampling and sliding window techniques.

The testing method is 10-fold cross-validation. This method is adapted to agree with the data stream model adopted in our experiments. The data is divided into ten folds. For each round of the 10-fold cross-validation, one fold is used for testing and the other nine folds are used for training. The training folds act as the blocks of data explained in the data stream model.
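The adapted cross-validation can be sketched as follows. How the folds are formed is not stated here, so the interleaved split below is an assumption, and `stream_folds` is a hypothetical helper; the point is only that, in each round, the nine training folds are presented one after another as stream blocks while the held-out fold is used for testing.

```python
def stream_folds(instances, n_folds=10):
    """Yield (training_blocks, test_fold) for each cross-validation round.

    The training folds are returned as a list of blocks so that they can be
    fed to a stream learner one block at a time.
    """
    # Assumption: folds are formed by taking every n_folds-th instance.
    folds = [instances[i::n_folds] for i in range(n_folds)]
    for i, test_fold in enumerate(folds):
        training_blocks = [fold for j, fold in enumerate(folds) if j != i]
        yield training_blocks, test_fold

rounds = list(stream_folds(list(range(20))))
training_blocks, test_fold = rounds[0]
```

In the two-class stream model of Section 4, each training block would additionally be split by class label into Bt,1 and Bt,2 before mining or matching.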

Table 2 shows the performance of the previous techniques on 10 real-life datasets from the UCI repository [11]. The last row in the table indicates the average accuracy of each technique. The results show that our proposed method, the AEP-tree using MEPs, has the highest average accuracy. It outperforms the sliding window and sampling techniques on all datasets. It also outperforms the AEP-tree using AEPs on 8 datasets. This indicates that our proposed method for mining MEPs is capable of gaining accurate knowledge from streaming data.

Table 2. Experimental results

Dataset | AEP tree (AEPs) | AEP tree (MEPs)

* C4.5 with complete knowledge

6 CONCLUSIONS

In this paper, we study the mining of emerging patterns in data streams by introducing a special type of emerging patterns, matching emerging patterns (MEPs). This type of EPs can be easily mined from data streams by applying a selective approach to conduct the mining process. Our experiments show that MEPs are capable of gaining important information from streaming data. This information increases the accuracy of classification. Our research opens a wide avenue for the applications of emerging patterns (EPs). EPs can now be used in different data stream applications. Our future work will focus on designing new techniques for mining EPs from data streams. These techniques might be useful for mining other types of patterns which are currently infeasible in the data stream model.

REFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. In Proceedings of the 21st ACM Symposium on Principles of Database Systems (PODS'02), Madison, Wisconsin, USA.

[2] G. Dong, J. Han, L.V.S. Lakshmanan, J. Pei, H. Wang, and P.S. Yu. Online Mining of Changes from Data Streams: Research Problems and Preliminary Results. In Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams, San Diego, CA, USA.

[3] M. Garofalakis, J. Gehrke, and R. Rastogi. Querying and Mining Data Streams: You Only Get One Look. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China.

[4] H. Alhammady and K. Ramamohanarao. Mining Emerging Patterns and Classification in Data Streams. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Compiegne, France, pp. 272-275.

[5] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, USA.

[6] H. Alhammady and K. Ramamohanarao. The Application of Emerging Patterns for Improving the Quality of Rare-class Classification. In Proceedings of the 2004 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia.

[7] H. Alhammady and K. Ramamohanarao. Using Emerging Patterns and Decision Trees in Rare-class Classification. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM'04), Brighton, UK.

[8] H. Alhammady and K. Ramamohanarao. Expanding the Training Data Space Using Emerging Patterns and Genetic Methods. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, USA.

[9] H. Fan and K. Ramamohanarao. A Bayesian Approach to Use Emerging Patterns for Classification. In Proceedings of the 14th Australasian Database Conference (ADC'03), Adelaide, Australia.

[10] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns. In Proceedings of the 2nd International Conference on Discovery Science (DS'99), Tokyo, Japan.

[11] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California at Irvine, CA, 1999.

[12] H. Fan and K. Ramamohanarao. An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification. In Proceedings of the 2002 Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan.

[13] H. Fan and K. Ramamohanarao. Efficiently Mining Interesting Emerging Patterns. In Proceedings of the 4th International Conference on Web-Age Information Management (WAIM'03), Chengdu, China.

[14] Stanford Stream Data Management (STREAM) Project. http://www-db.stanford.edu/stream

[15] B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window Over Streaming Data. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA.
