Fig. 39.5 ANNCAD Framework
The FOCUS framework uses the difference between data mining models as the deviation in data sets.
Ferrer-Troyano et al. (Ferrer-Troyano et al., 2004) have proposed a scalable classification algorithm for numerical data streams. The algorithm has been termed the Scalable Classification Algorithm by Learning decisiOn Patterns (SCALLOP). The algorithm starts by reading a number of user-specified labeled records, from which a number of rules are created for each class. For each record read after creating these rules, there are three cases:
a) Positive covering: a new record that strengthens an existing discovered rule.
b) Possible expansion: a new record that is associated with at least one rule but is not covered by any discovered rule.
c) Negative covering: a new record that weakens an existing discovered rule.
For each of the above cases, a different procedure is used, as follows:
a) Positive covering: the positive support and confidence of the existing rule are updated.
b) Possible expansion: the rule is extended, provided it satisfies two conditions:
1. It remains bounded within user-specified growth bounds, to avoid a possibly wrong expansion of the rule.
2. The expanded rule does not intersect any already discovered rule associated with the same class label.
c) Negative covering: the negative support and confidence are updated. If the confidence falls below a minimum user-specified threshold, a new rule is added.
Having read a user-defined number of records, a rule-refining process takes place. In this process, rules of the same class that lie within a user-defined acceptable distance measure are merged, on condition that the merged rule does not intersect any rule associated with other class labels; the resulting hypercube must also lie within the growth bounds of the rules. The second step of the refining stage releases uninteresting rules from the current model: rules with less than the minimum positive support are released, as are rules not covered by at least one of the records in the last user-defined number of received records. Figure 39.6 shows an illustration of the basic process of
using SCALLOP to build a data stream classifier.
Finally, a voting-based classification technique is used to classify unlabelled records when the model is used. If a rule covers the current record, the label associated with that rule is used as the classifier output; otherwise, voting over the current rules within the growth bounds is used to infer the class label.
Fig. 39.6 Basic SCALLOP Process
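The three covering cases above can be made concrete with a short sketch. The following Python fragment is an illustration only, not the authors' implementation: the hyper-rectangle rule representation, the `Rule` and `process` names, and the simplified handling of case (b) are all assumptions made for exposition.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    low: list       # per-attribute lower bounds of the hyper-rectangle
    high: list      # per-attribute upper bounds
    label: str      # class label the rule predicts
    positive: int = 0   # support from covered records with a matching label
    negative: int = 0   # support from covered records with a conflicting label

    def covers(self, record):
        return all(lo <= v <= hi for lo, v, hi in zip(self.low, record, self.high))

def process(rules, record, label):
    """Dispatch one labelled record to the three SCALLOP cases."""
    for rule in rules:
        if rule.covers(record):
            if rule.label == label:
                rule.positive += 1      # (a) positive covering
            else:
                rule.negative += 1      # (c) negative covering
            return
    # (b) possible expansion: no rule covers the record. A full
    # implementation would try to grow a same-class rule within its
    # growth bounds and check the expansion for intersections; here
    # we simply start a new point-sized rule.
    rules.append(Rule(low=list(record), high=list(record), label=label, positive=1))
```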
Papadimitriou et al. (Papadimitriou et al., 2003) have proposed AWSOM (Arbitrary Window Stream mOdeling Method) for discovering interesting patterns in sensor data. They developed a one-pass algorithm to incrementally update the patterns. Their method requires only O(log N) memory, where N is the length of the sequence, and they conducted experiments with real and synthetic data sets. They use wavelet coefficients as a compact information representation and for correlation structure detection, and then apply a linear regression model in the wavelet domain. The system depends on this compact representation to address the high-speed streaming problem. The experimental results show its efficiency in detecting correlations.
Gaber et al. (Gaber et al., 2005) have developed Lightweight Classification (LWClass), a variation of LWC and another AOG-based technique. The idea is to use K-nearest neighbors while updating the frequency of class occurrence given the data stream features. In case of contradiction between the incoming stream and the stored summary of the cases, the frequency is reduced. Once the frequency reaches zero, all the cases represented by this class are released from memory.
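A minimal sketch of this frequency-update idea is given below. The dictionary-of-cases layout and all names are assumptions made for illustration, not the authors' code.

```python
def update_summary(summary, features, label):
    """summary maps a stored case (a tuple of features) to [label, frequency].
    Agreeing arrivals strengthen the stored case; contradictions weaken it,
    and a case whose frequency reaches zero is released from memory."""
    key = tuple(features)
    if key not in summary:
        summary[key] = [label, 1]
    elif summary[key][0] == label:
        summary[key][1] += 1          # agreement: raise the class frequency
    else:
        summary[key][1] -= 1          # contradiction: reduce the frequency
        if summary[key][1] <= 0:
            del summary[key]          # release the case from memory
```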
39.4 Frequent Pattern Mining Techniques
Frequency counting is the process of identifying the most frequent items. It can be used as a stand-alone technique to discover heavy hitters (Cormode and Muthukrishnan, 2003), or as a step towards finding association rules. The main idea is to find data items whose probability of occurrence is greater than or equal to a pre-specified minimum threshold, known in the context of frequent items as the item support (Dunham, 2003). The item support is calculated by dividing the number of times the observed item appears by the total number of records.
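Over a finite batch, this support computation is a one-liner, as the sketch below shows; the streaming algorithms discussed next exist precisely because the exact form needs memory proportional to the number of distinct items. Names here are illustrative only.

```python
from collections import Counter

def frequent_items(stream, min_support):
    """Exact item support: count / total number of records. Streaming
    algorithms approximate this computation under bounded memory."""
    counts = Counter(stream)
    n = len(stream)
    return {item: count / n for item, count in counts.items()
            if count / n >= min_support}

# e.g. frequent_items(list("abracadabra"), 0.2) -> {'a': 0.4545...}
```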
Giannella et al. (Giannella et al., 2003) have proposed and implemented a frequent itemset mining algorithm over data streams. They have used tilted time windows to calculate the frequent patterns for the most recent transactions, based on the fact that users are more interested in the most recent streaming information than in older data. They have developed an incremental algorithm to maintain the FP-stream, a tree data structure used to represent and discover frequent itemsets in data streams. FP-stream has been developed based on FP-tree, which was first introduced by Han et al. (Han et al., 2000) as a graphical representation for discovering frequent itemsets. A number of experiments have been conducted to demonstrate the algorithm's efficiency. The results show that, with limited memory, the algorithm can discover the frequent itemsets with approximate support.
Manku and Motwani (Manku and Motwani, 2002) have proposed and implemented an approximate frequency counting algorithm for data streams. The implemented algorithm uses all the previous historical data to calculate the frequent patterns incrementally. Two algorithms have been introduced: sticky sampling and lossy counting. Although the first algorithm should analytically perform better because of its better worst-case bound, the experimental studies have shown that lossy counting has better practical performance. The sticky sampling algorithm uses sampling in which new records with already existing entries have a higher probability of being sampled. The other algorithm uses the idea of group testing, using buckets to count items within the same group while maintaining only one counter.
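Lossy counting is compact enough to sketch directly. The following Python version follows the standard published formulation (maintain (count, maximum-error) entries and prune at bucket boundaries); the variable names are mine.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy counting: keep (count, max_error) entries and prune at bucket
    boundaries. Estimated counts undershoot true counts by at most
    epsilon * N after N arrivals."""
    width = math.ceil(1 / epsilon)     # bucket width
    entries = {}                       # item -> (count, max_error)
    bucket = 1                         # current bucket id
    for n, item in enumerate(stream, start=1):
        count, delta = entries.get(item, (0, bucket - 1))
        entries[item] = (count + 1, delta)
        if n % width == 0:             # bucket boundary: prune weak entries
            entries = {i: (c, d) for i, (c, d) in entries.items()
                       if c + d > bucket}
            bucket += 1
    return entries

# Querying with threshold (s - epsilon) * N returns every item whose true
# frequency is at least s * N, plus possibly some false positives.
```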
Cormode and Muthukrishnan (Cormode and Muthukrishnan, 2003) have developed an algorithm for counting frequent items. The algorithm uses group testing to find the hottest k items, and can process the turnstile data stream model, which allows deletion as well as addition of data records. A randomized approximation algorithm has been used to discover the most frequent items approximately; the algorithm can recall the frequent items with a given item support and probability. It is worth mentioning that the turnstile data stream model is the hardest to analyze; the time series and cash register models are easier. The former does not allow increments or decrements, and the latter allows only increments.
Jin et al. (Jin et al., 2003) have proposed the hCount algorithm for discovering frequent items in data streams. This algorithm also deals with the turnstile data stream model, in which insertions to and deletions from the data are allowed. The algorithm works dynamically with any range of data and needs no prior knowledge about it. It is classified as an approximation technique: it simply maintains the number of counters that can analytically guarantee that the final approximated output stays within a user-given error threshold.
Gaber et al. (Gaber et al., 2005) have developed one more AOG-based algorithm: Lightweight Frequency counting (LWF). It has the ability to find an approximate solution to the most frequent items in the incoming stream, using adaptation and regularly releasing the least frequent items in order to count the more frequent ones.
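This adapt-and-release behaviour is reminiscent of classical counter-based summaries. As a point of comparison only — the sketch below is the well-known Misra-Gries summary, not the LWF algorithm itself — regularly releasing the weakest counters looks like this:

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters. When a new item arrives and no counter
    is free, decrement every counter and drop those that reach zero: any
    item occurring more than N / k times is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```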
39.5 Time Series Analysis
Time series analysis is concerned with discovering patterns in attribute values that vary on a temporal basis. Three main functions are performed in time series mining: clustering of similar time series, predicting future values in a time series, and classifying the behavior of a time series (Dunham, 2003).
Indyk et al. (Indyk et al., 2000) have proposed approximate solutions with probabilistic error bounds to two problems in time series analysis: relaxed periods and average trends. The algorithms use dimensionality-reducing sketching techniques. The process starts by computing sketches over an arbitrarily chosen time window, creating what is called a sketch pool; sketching is the process of random projection over a number of attributes. Using this pool of sketches, relaxed periods and average trends are computed. Relaxed periods are periods in a time series that are repeated over time; since exact repetition is rare, similar ones identified using distance functions are acceptable. An average trend is the mean value of a subsequence of observations of a pre-specified length in a time series. The algorithms have experimentally shown efficiency in both running time and accuracy.
Perlman and Java (Perlman and Java, 2003) have proposed an approach to mine astronomical time series streams. The technique starts by handling missing data using interpolation, after which a normalization process takes place, forming a two-phase preprocessing step. Finding frequently occurring shapes in time series using time windows represents the first processing step; clustering the discovered patterns of shapes is the second. Rule extraction and filtering over the created clusters represent the final step of the approach. The limitation of the implemented system is that it can process only one time series at a time. Figure 39.7 shows a simple flow chart of the approach.
Zhu and Shasha (Zhu and Shasha, 2003) have proposed techniques to compute a set of statistical measures over time series data streams. The proposed techniques use the discrete Fourier transform to create a synopsis data structure. The system, called StatStream, is able to compute error-bounded approximate correlations and inner products, and works over an arbitrarily chosen sliding window.
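The central trick can be sketched in a few lines of numpy: normalize each window, keep only its leading DFT coefficients, and approximate correlations from the synopses via Parseval's theorem. This is a simplified illustration under my own assumptions (window handling, coefficient count, and names), not the StatStream implementation.

```python
import numpy as np

def synopsis(window, m=8):
    """Zero-mean, unit-norm normalize a window, then keep its m
    lowest-frequency DFT coefficients as a compact synopsis."""
    w = np.asarray(window, dtype=float)
    w = w - w.mean()
    w = w / (np.linalg.norm(w) + 1e-12)
    return np.fft.fft(w)[:m], len(w)

def approx_corr(syn_x, syn_y):
    """Parseval's theorem: with all n coefficients this recovers the exact
    correlation of the normalized windows; truncating to the leading
    coefficients gives an error-bounded estimate, since most of the
    energy of smooth series sits in the low frequencies."""
    (X, n), (Y, _) = syn_x, syn_y
    weights = np.ones(len(X))
    weights[1:] = 2.0   # account for the conjugate-symmetric upper half
    return float(np.real(np.sum(weights * X * np.conj(Y))) / n)

# e.g. approx_corr(synopsis(window_a), synopsis(window_b))
```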
Fig. 39.7 Astronomical Time Series Analysis

Keogh et al. (Keogh et al., 2003) have shown empirically that the most-cited time series data stream clustering algorithms proposed in the literature produce meaningless results in subsequence clustering. They have proposed a solution using k-motifs to choose the subsequences on which the algorithm works. The 1-motif is the subsequence with the highest count of non-trivial matches in a time series; thus, the k-motifs are the k subsequences with the highest counts of matches. Experimental results show the success of the technique in extracting meaningful time series clustering results.
Lin et al. (Lin et al., 2003) have proposed a symbolic representation of time series data streams termed Symbolic Aggregate approXimation (SAX). This representation allows dimensionality/numerosity reduction, where numerosity reduction refers to reducing the number of records. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection techniques. The approach has two main stages: the first transforms the time series data to a Piecewise Aggregate Approximation, and the second transforms that output into discrete string symbols.
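The two stages fit in a few lines of Python. The sketch below fixes a four-symbol alphabet, whose Gaussian breakpoints come from the published SAX lookup table; segment counts and names are illustrative choices, not the authors' code.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([chunk.mean() for chunk in chunks])

def sax(series, n_segments, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol.
    The breakpoints make the four symbols equiprobable under a standard
    Gaussian (values from the SAX breakpoint table)."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    breakpoints = np.array([-0.67, 0.0, 0.67])
    return "".join(alphabet[np.searchsorted(breakpoints, v)]
                   for v in paa(x, n_segments))

# e.g. sax([1, 2, 3, 4, 10, 12, 11, 9], n_segments=4) -> a 4-symbol string
```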
Chen et al. (Chen et al., 2002) have proposed the application of so-called regression cubes to data streams. Given the success of OnLine Analytical Processing (OLAP) technology in applications over statically stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that can be used to answer aggregate queries over the incoming data streams. This research has been extended for adoption in the ongoing project Mining Alarming Incidents in Data Streams (MAIDS). The technique has experimentally shown efficiency in analyzing time series data streams.
39.6 Systems and Applications
Recently, systems and applications that deal with mining data streams have been developed. These systems are application-oriented, except for MAIDS, developed by Cai et al. (Cai et al., 2004), which represents the first attempt to develop a generic data stream mining system. The following list introduces these systems and applications with short descriptions.
Burl et al. (Burl et al., 1999) have developed Diamond Eye for NASA and JPL. The aim of the project is to enable remote systems, as well as scientists, to extract patterns from spatial objects in real-time image streams. The success of this project will enable "a new era of exploration using highly autonomous spacecraft, rovers, and sensors" (Burl et al., 1999). The system uses a high-performance computational facility to process the data mining requests. The scientist uses a web interface, built with Java applets, to connect to the server, which retrieves the images and performs the image mining process.
Kargupta et al. (Kargupta et al., 2002) have developed the first ubiquitous data stream mining system, termed MobiMine. It is a client/server PDA-based distributed data mining application for financial data streams. The system prototype has been developed using a single data source and multiple mobile clients; however, the system is designed to handle multiple data sources. The server functionalities in the proposed system are data collection from different financial web sites and storage, selection of active stocks using common statistical methods, and application of online data mining techniques to the stock data. The client functionalities are portfolio management, using a mobile micro-database to store portfolio data and information about the user's preferences, and construction of the WatchList, which is the first point of interaction between the client and the server: the server computes the most active stocks in the market, and the client in turn selects a subset of this list to construct the personalized WatchList according to an optimization module. The second point of interaction is that the server performs online data mining, transforms the results using Fourier transformation, and finally sends them to the client, which in turn visualizes the results on the PDA screen. It is worth pointing out that the data mining process in MobiMine is performed at the server side, given the resource constraints of a mobile device. With the increased need for onboard data mining in resource-constrained computing environments, Kargupta et al. (Kargupta, 2004) have developed onboard mining techniques for a different application: mining vehicle sensory data streams.
Kargupta et al. (Kargupta, 2004) have developed the Vehicle Data Stream Mining System (VEDAS). It is a ubiquitous data stream mining system that allows continuous monitoring and pattern extraction from data streams generated on board a moving vehicle. The mining component is located on the PDA. VEDAS uses online incremental clustering to model driving behavior.
Tanner et al. (Tanner et al., 2002) have developed the EnVironment for On-Board Processing (EVE) for astronomical data streams. The system analyzes data streams continuously generated from the measurements of different on-board sensors. Only interesting patterns are sent to the ground stations for further analysis, preserving the limited bandwidth.
Srivastava and Stroeve (Srivastava and Stroeve, 2003) work on a NASA project for onboard detection of geophysical processes, such as snow, ice and clouds, using kernel clustering methods for data compression, preserving the limited bandwidth needed to send image streams to the ground centers. The kernel methods have been chosen due to their low computational complexity.
Cai et al. (Cai et al., 2004) have developed an integrated mining and querying system. The system can classify, cluster, count frequencies, and answer queries over data streams. Mining Alarming Incidents of Data Streams (MAIDS) is currently under development, and the project team has recently demonstrated its prototype implementation. Sequential pattern mining and hidden network mining are currently under development.
Pirttikangas et al. (Pirttikangas et al., 2001) have implemented mobile agent-based ubiquitous data mining for a context-aware health club for cyclists; the system is called Genie of the Net. The process starts by collecting information from sensors and databases in order to recognize the information needed for the specific application. This information includes the user's context and other needed information collected by mobile agents. The main scenario for the health club system is that the user has a plan for an exercise. All the needed health information, such as heart rate, is recorded during the exercise and then analyzed using data mining techniques to advise the user after each exercise.
Having discussed the state of the art in mining data streams, in terms of both developed techniques and systems used in different applications, we can use this review as a basis for classifying these techniques into generic categories.
39.7 Taxonomy of Data Stream Mining Approaches
The research problems and challenges in mining data streams discussed earlier have solutions that draw on well-established statistical and computational approaches. We can categorize these solutions as data-based and task-based. In data-based solutions, the idea is to examine only a subset of the whole dataset, or to transform the data vertically or horizontally to an approximate, smaller data representation. In task-based solutions, on the other hand, techniques from computational theory have been adopted to achieve time- and space-efficient solutions. In this section we review these theoretical foundations.
39.7.1 Data-based Techniques
Data-based techniques refer to either summarizing the whole dataset or choosing a subset of the incoming stream to be analyzed. Sampling, load shedding and sketching techniques represent the latter; synopsis data structures and aggregation represent the former. The following subsections outline the basics of these techniques, with pointers to their applications in the context of data stream mining.
Sampling
Sampling refers to the process of probabilistic choice of a data item to be processed (Toivonen, 1996). Sampling is an old statistical technique that has long been used in conventional data mining for large databases. In the context of data stream mining, bounds on the error rate of the computation are given as a function of the sampling rate or size. Very Fast Machine Learning techniques (Domingos and Hulten, 2000) have used the Hoeffding bound (Hoeffding, 1963) to determine the sample size according to a loss function derived from the running mining algorithm. One problem with using sampling in data stream analysis is the unknown dataset size; thus the treatment of a data stream requires special analysis to find the error bounds. Another problem is that checking for anomalies is important in surveillance applications of mining data streams, and sampling is not the right choice for such an application. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among the three parameters: data rate, sampling rate and error bounds.
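The Hoeffding bound that underpins these sample-size derivations is simple to state: after n independent observations of a variable with range R, the true mean lies within epsilon of the sample mean with probability at least 1 - delta, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). A small helper makes the trade-off concrete (function names are illustrative):

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Error bound after n samples: with probability at least 1 - delta,
    the true mean lies within epsilon of the observed mean."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def samples_needed(value_range, epsilon, delta):
    """Invert the bound: the sample size required for a target epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

# e.g. samples_needed(1.0, 0.05, 1e-3) -> 1382 observations
```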
Load Shedding
Load shedding (Babcock et al., 2003, Tatbul et al., 2003) refers to the process of dropping a sequence of data streams. Load shedding has been used successfully in querying data streams, and it has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of data streams that could be needed for structuring the generated models, or that might represent a pattern of interest in time series analysis. Recently, however, it has been used with acceptable accuracy for the classification problem, in an algorithm developed by Chi et al. (Chi et al., 2005) termed Loadstar. It represents the first attempt to use load shedding in high-speed data stream classification problems.
Sketching
Sketching (Babcock et al., 2002, Muthukrishnan, 2003) is the process of randomly projecting a subset of the features; that is, of vertically sampling the incoming stream. Sketching has been applied in comparing different data streams and in aggregate queries. The major drawback of sketching is its accuracy, which makes it hard to use in the context of data stream mining. Principal Component Analysis (PCA) would be a better solution, and has been applied in streaming applications (Kargupta, 2004).
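A sketch in this sense is just multiplication by a fixed random matrix; the Johnson-Lindenstrauss lemma is what makes the projected inner products trustworthy. A minimal numpy illustration follows, with arbitrary sizes and names of my own choosing:

```python
import numpy as np

def make_sketcher(n_features, sketch_size, seed=0):
    """Fix one random projection matrix; applying it to every arriving
    vector shrinks n_features dimensions down to sketch_size while
    approximately preserving inner products (Johnson-Lindenstrauss)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(sketch_size), size=(sketch_size, n_features))
    return lambda x: R @ np.asarray(x, dtype=float)

sketch = make_sketcher(n_features=1000, sketch_size=32)
# sketch(a) @ sketch(b) approximates a @ b for any two stream items a, b
```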
Synopsis Data Structures
Creating a synopsis of data refers to the process of applying summarization techniques capable of condensing the incoming stream for further analysis. Wavelet analysis (Gilbert et al., 2003), histograms, quantiles and frequency moments (Babcock et al., 2002) have been proposed as synopsis data structures. Since a synopsis does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures.
Aggregation
Aggregation is the process of computing statistical measures, such as means and variances, that summarize the incoming data stream. The aggregated data can then be used by the data mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in (Aggarwal et al., 2003, Aggarwal et al., 2004) for clustering and classification of data streams.
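For means and variances, the aggregation can be done in a single pass with constant memory, for example with Welford's classical update. This is a generic sketch, not tied to any of the cited systems:

```python
class RunningStats:
    """One-pass aggregation with constant memory: Welford's incremental
    update for the mean and variance of an unbounded stream."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```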
Definitions, advantages and disadvantages of all of the above data-based approaches are given in Table 39.2.
39.7.2 Task-based Techniques
Task-based techniques are methods that modify existing techniques, or develop new ones, in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window techniques and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis.
Approximation algorithms
Approximation algorithms (Muthukrishnan, 2003) have their roots in algorithm design, which is concerned with designing algorithms for computationally hard problems. These algorithms can produce an approximate solution with error bounds. The idea is that data stream mining tasks are hard computational problems, given the continuity and speed of streams and the resource-constrained computational environment, so approximation algorithms have attracted researchers as a direct solution to data stream mining problems. However, the problem of data rates relative to the available resources cannot be solved by approximation algorithms alone; other tools should be used along with them in order to adapt to the available resources. Approximation algorithms have been used in (Cormode and Muthukrishnan, 2003, Jin et al., 2003) for discovering frequent items.
Sliding Window
The inspiration behind sliding window techniques is that the user is more concerned with the analysis of the most recent data streams. Thus, detailed analysis is done over the most recent data items, with only summarized versions of the old ones. This idea has been adopted in many techniques of the ongoing comprehensive data stream mining system MAIDS (Dong et al., 2003). The main issue with sliding window techniques is how to remove the expired results from the currently created model.
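For simple statistics, removing expired results is straightforward, as the count-based window below shows; the difficulty the text refers to arises for model structures (clusters, rules, trees) whose dependence on individual expired records is indirect. A minimal sketch, with illustrative names:

```python
from collections import deque

class SlidingWindowMean:
    """Count-based sliding window over a stream: each new arrival evicts
    the oldest value once the window is full, so expired items stop
    contributing to the statistic."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # remove the expired result
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)
```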
Algorithm Output Granularity
The algorithm output granularity (AOG) (Gaber et al., 2005, Gaber et al., 2004) introduces the first resource-aware data analysis approach that can cope with fluctuating, very high data rates according to the available memory and the processing speed, represented as time constraints. The AOG performs the local data analysis on a resource-constrained device.
Table 39.2 Data-based Techniques

Technique | Definition | Advantages | Disadvantages
Sampling | Choosing a subset of a dataset for the sake of analysis, using probability theory | Established techniques; error boundaries guaranteed | Poor for anomaly detection
Load shedding | Ignoring a continuous chunk of streaming data | Proved efficiency with data stream querying; used recently with success in data stream mining | Very poor for anomaly detection
Sketching | Randomly projecting a subset of features to be analyzed | Considerably improves the running time | Some unselected features might be of great importance
Synopsis data structure | Quick transformation of the incoming stream into a summarized, compressed form | Independent of the analysis task | Might not be sufficient with high data rates
Aggregation | Calculating statistical measures that capture the features of the data | Independent of the analysis task | Aggregation measures do not capture all the required features of the data