Fig. 39.5 ANNCAD Framework
The FOCUS framework uses the difference between data mining models as the deviation in data sets.
Ferrer-Troyano et al. (Ferrer-Troyano et al., 2004) have proposed a scalable classification algorithm for numerical data streams. The algorithm has been termed the Scalable Classification Algorithm by Learning decisiOn Patterns (SCALLOP). The algorithm starts by reading a number of user-specified labeled records, from which a number of rules are created for each class. For each record read after creating these rules, there are three cases:
a) Positive covering: a new record that strengthens an existing discovered rule.
b) Possible expansion: a new record that is associated with at least one rule but is not covered by any discovered rule.
c) Negative covering: a new record that weakens an existing discovered rule.
For each of the above cases, a different procedure is used, as follows:
a) Positive covering: the positive support and confidence of the existing rule are updated.
b) Possible expansion: the rule is extended, provided it satisfies two conditions:
1. It remains bounded within user-specified growth bounds, to avoid a possibly wrong expansion of the rule.
2. The expanded rule does not intersect any already discovered rule associated with the same class label.
c) Negative covering: the negative support and confidence are updated. If the confidence falls below a minimum user-specified threshold, a new rule is added.
Having read a user-defined number of records, a rule-refining process takes place. In this process, rules of the same class that lie within a user-defined acceptable distance measure are merged, on condition that the merged rule does not intersect any rule associated with other class labels; the resulting hypercube must also lie within the growth bounds of the rules. The second step of the refining stage releases uninteresting rules from the current model: rules with less than the minimum positive support are released, as are rules not covered by at least one of the records in the last user-defined number of received records. Figure 39.6 shows an illustration of the basic process of
using SCALLOP to build a data stream classifier.
Finally, a voting-based classification technique is used to classify unlabelled records when the model is used. If a rule covers the current record, the label associated with that rule is used as the classifier output; otherwise, voting over the current rules within the growth bounds is used to infer the class label.
Fig. 39.6 Basic SCALLOP Process
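The three covering cases above can be made concrete with a short sketch. The following Python fragment is an illustration only, not the authors' implementation: the hyper-rectangle rule representation, the `Rule` and `process` names, and the simplified handling of case (b) are all assumptions made for exposition.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    low: list       # per-attribute lower bounds of the hyper-rectangle
    high: list      # per-attribute upper bounds
    label: str      # class label the rule predicts
    positive: int = 0   # support from covered records with a matching label
    negative: int = 0   # support from covered records with a conflicting label

    def covers(self, record):
        return all(lo <= v <= hi for lo, v, hi in zip(self.low, record, self.high))

def process(rules, record, label):
    """Dispatch one labelled record to the three SCALLOP cases."""
    for rule in rules:
        if rule.covers(record):
            if rule.label == label:
                rule.positive += 1      # (a) positive covering
            else:
                rule.negative += 1      # (c) negative covering
            return
    # (b) possible expansion: no rule covers the record. A full
    # implementation would try to grow a same-class rule within its
    # growth bounds and check the expansion for intersections; here
    # we simply start a new point-sized rule.
    rules.append(Rule(low=list(record), high=list(record), label=label, positive=1))
```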
Papadimitriou et al. (Papadimitriou et al., 2003) have proposed AWSOM (Arbitrary Window Stream mOdeling Method) for discovering interesting patterns in sensor data. They developed a one-pass algorithm to incrementally update the patterns. Their method requires only O(log N) memory, where N is the length of the sequence, and they conducted experiments with real and synthetic data sets. They use wavelet coefficients as a compact information representation and for correlation structure detection, and then apply a linear regression model in the wavelet domain. The system depends on this compact representation to address the high-speed streaming problem. The experimental results show its efficiency in detecting correlations.
Gaber et al. (Gaber et al., 2005) have developed Lightweight Classification (LWClass), a variation of LWC and another AOG-based technique. The idea is to use K-nearest neighbors while updating the frequency of class occurrence given the data stream features. In case of contradiction between the incoming stream and the stored summary of the cases, the frequency is reduced. Once the frequency reaches zero, all the cases represented by this class are released from memory.
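A minimal sketch of this frequency-update idea is given below. The dictionary-of-cases layout and all names are assumptions made for illustration, not the authors' code.

```python
def update_summary(summary, features, label):
    """summary maps a stored case (a tuple of features) to [label, frequency].
    Agreeing arrivals strengthen the stored case; contradictions weaken it,
    and a case whose frequency reaches zero is released from memory."""
    key = tuple(features)
    if key not in summary:
        summary[key] = [label, 1]
    elif summary[key][0] == label:
        summary[key][1] += 1          # agreement: raise the class frequency
    else:
        summary[key][1] -= 1          # contradiction: reduce the frequency
        if summary[key][1] <= 0:
            del summary[key]          # release the case from memory
```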
39.4 Frequent Pattern Mining Techniques
Frequency counting is the process of identifying the most frequent items. It can be used as a stand-alone technique to discover heavy hitters (Cormode and Muthukrishnan, 2003), or as a step towards finding association rules. The main idea is to find data items whose probability of occurrence is greater than or equal to a pre-specified minimum threshold, known in the context of frequent items as the item support (Dunham, 2003). The item support is calculated by dividing the number of times the observed item appears by the total number of records.
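Over a finite batch, this support computation is a one-liner, as the sketch below shows; the streaming algorithms discussed next exist precisely because the exact form needs memory proportional to the number of distinct items. Names here are illustrative only.

```python
from collections import Counter

def frequent_items(stream, min_support):
    """Exact item support: count / total number of records. Streaming
    algorithms approximate this computation under bounded memory."""
    counts = Counter(stream)
    n = len(stream)
    return {item: count / n for item, count in counts.items()
            if count / n >= min_support}

# e.g. frequent_items(list("abracadabra"), 0.2) -> {'a': 0.4545...}
```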
Giannella et al. (Giannella et al., 2003) have proposed and implemented a frequent itemset mining algorithm over data streams. They have used tilted time windows to calculate the frequent patterns for the most recent transactions, based on the fact that users are more interested in the most recent streaming information than in older data. They have developed an incremental algorithm to maintain the FP-stream, a tree data structure used to represent and discover frequent itemsets in data streams. FP-stream has been developed based on FP-tree, which was first introduced by Han et al. (Han et al., 2000) as a graphical representation for discovering frequent itemsets. A number of experiments have been conducted to demonstrate the algorithm's efficiency. The results show that, with limited memory, the algorithm can discover the frequent itemsets with approximate support.
Manku and Motwani (Manku and Motwani, 2002) have proposed and implemented an approximate frequency counting algorithm for data streams. The implemented algorithm uses all the previous historical data to calculate the frequent patterns incrementally. Two algorithms have been introduced: sticky sampling and lossy counting. Although the first algorithm should analytically perform better because of its better worst-case bound, the experimental studies have shown that lossy counting has better practical performance. The sticky sampling algorithm uses sampling in which new records with already existing entries have a higher probability of being sampled. The other algorithm uses the idea of group testing, using buckets to count items within the same group while maintaining only one counter.
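Lossy counting is compact enough to sketch directly. The following Python version follows the standard published formulation (maintain (count, maximum-error) entries and prune at bucket boundaries); the variable names are mine.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy counting: keep (count, max_error) entries and prune at bucket
    boundaries. Estimated counts undershoot true counts by at most
    epsilon * N after N arrivals."""
    width = math.ceil(1 / epsilon)     # bucket width
    entries = {}                       # item -> (count, max_error)
    bucket = 1                         # current bucket id
    for n, item in enumerate(stream, start=1):
        count, delta = entries.get(item, (0, bucket - 1))
        entries[item] = (count + 1, delta)
        if n % width == 0:             # bucket boundary: prune weak entries
            entries = {i: (c, d) for i, (c, d) in entries.items()
                       if c + d > bucket}
            bucket += 1
    return entries

# Querying with threshold (s - epsilon) * N returns every item whose true
# frequency is at least s * N, plus possibly some false positives.
```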
Cormode and Muthukrishnan (Cormode and Muthukrishnan, 2003) have developed an algorithm for counting frequent items. The algorithm uses group testing to find the hottest k items, and can process the turnstile data stream model, which allows deletion as well as addition of data records. A randomized approximation algorithm has been used to discover the most frequent items approximately; the algorithm can recall the frequent items with a given item support and probability. It is worth mentioning that the turnstile data stream model is the hardest to analyze; the time series and cash register models are easier. The former does not allow increments or decrements, and the latter allows only increments.
Jin et al. (Jin et al., 2003) have proposed the hCount algorithm for discovering frequent items in data streams. This algorithm also deals with the turnstile data stream model, in which insertions to and deletions from the data are allowed. The algorithm works dynamically with any range of data and needs no prior knowledge about it. It is classified as an approximation technique: it simply maintains the number of counters that can analytically guarantee that the final approximated output stays within a user-given error threshold.
Gaber et al. (Gaber et al., 2005) have developed one more AOG-based algorithm: Lightweight Frequency counting (LWF). It has the ability to find an approximate solution to the most frequent items in the incoming stream, using adaptation and regularly releasing the least frequent items in order to count the more frequent ones.
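This adapt-and-release behaviour is reminiscent of classical counter-based summaries. As a point of comparison only — the sketch below is the well-known Misra-Gries summary, not the LWF algorithm itself — regularly releasing the weakest counters looks like this:

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters. When a new item arrives and no counter
    is free, decrement every counter and drop those that reach zero: any
    item occurring more than N / k times is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```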
39.5 Time Series Analysis
Time series analysis is concerned with discovering patterns in attribute values that vary on a temporal basis. Three main functions are performed in time series mining: clustering of similar time series, predicting future values in a time series, and classifying the behavior of a time series (Dunham, 2003).
Indyk et al. (Indyk et al., 2000) have proposed approximate solutions with probabilistic error bounds to two problems in time series analysis: relaxed periods and average trends. The algorithms use dimensionality-reducing sketching techniques. The process starts by computing sketches over an arbitrarily chosen time window, creating what is called a sketch pool; sketching is the process of random projection over a number of attributes. Using this pool of sketches, relaxed periods and average trends are computed. Relaxed periods are periods in a time series that are repeated over time; since exact repetition is rare, similar ones identified using distance functions are acceptable. An average trend is the mean value of a subsequence of observations of a pre-specified length in a time series. The algorithms have experimentally shown efficiency in both running time and accuracy.
Perlman and Java (Perlman and Java, 2003) have proposed an approach to mine astronomical time series streams. The technique starts by handling missing data using interpolation, after which a normalization process takes place, forming a two-phase preprocessing step. Finding frequently occurring shapes in time series using time windows represents the first processing step; clustering the discovered patterns of shapes is the second. Rule extraction and filtering over the created clusters represent the final step of the approach. The limitation of the implemented system is that it can process only one time series at a time. Figure 39.7 shows a simple flow chart of the approach.
Zhu and Shasha (Zhu and Shasha, 2003) have proposed techniques to compute a set of statistical measures over time series data streams. The proposed techniques use the discrete Fourier transform to create a synopsis data structure. The system, called StatStream, is able to compute error-bounded approximate correlations and inner products, and works over an arbitrarily chosen sliding window.
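The central trick can be sketched in a few lines of numpy: normalize each window, keep only its leading DFT coefficients, and approximate correlations from the synopses via Parseval's theorem. This is a simplified illustration under my own assumptions (window handling, coefficient count, and names), not the StatStream implementation.

```python
import numpy as np

def synopsis(window, m=8):
    """Zero-mean, unit-norm normalize a window, then keep its m
    lowest-frequency DFT coefficients as a compact synopsis."""
    w = np.asarray(window, dtype=float)
    w = w - w.mean()
    w = w / (np.linalg.norm(w) + 1e-12)
    return np.fft.fft(w)[:m], len(w)

def approx_corr(syn_x, syn_y):
    """Parseval's theorem: with all n coefficients this recovers the exact
    correlation of the normalized windows; truncating to the leading
    coefficients gives an error-bounded estimate, since most of the
    energy of smooth series sits in the low frequencies."""
    (X, n), (Y, _) = syn_x, syn_y
    weights = np.ones(len(X))
    weights[1:] = 2.0   # account for the conjugate-symmetric upper half
    return float(np.real(np.sum(weights * X * np.conj(Y))) / n)

# e.g. approx_corr(synopsis(window_a), synopsis(window_b))
```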
Fig. 39.7 Astronomical Time Series Analysis

Keogh et al. (Keogh et al., 2003) have shown empirically that the most-cited time series data stream clustering algorithms proposed in the literature produce meaningless results in subsequence clustering. They have proposed a solution using k-motifs to choose the subsequences on which the algorithm works. The 1-motif is the subsequence with the highest count of non-trivial matches in a time series; thus, the k-motifs are the k subsequences with the highest counts of matches. Experimental results show the success of the technique in extracting meaningful time series clustering results.
Lin et al. (Lin et al., 2003) have proposed a symbolic representation of time series data streams termed Symbolic Aggregate approXimation (SAX). This representation allows dimensionality/numerosity reduction, where numerosity reduction refers to reducing the number of records. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection techniques. The approach has two main stages: the first transforms the time series data to a Piecewise Aggregate Approximation, and the second transforms that output into discrete string symbols.
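The two stages fit in a few lines of Python. The sketch below fixes a four-symbol alphabet, whose Gaussian breakpoints come from the published SAX lookup table; segment counts and names are illustrative choices, not the authors' code.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([chunk.mean() for chunk in chunks])

def sax(series, n_segments, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol.
    The breakpoints make the four symbols equiprobable under a standard
    Gaussian (values from the SAX breakpoint table)."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    breakpoints = np.array([-0.67, 0.0, 0.67])
    return "".join(alphabet[np.searchsorted(breakpoints, v)]
                   for v in paa(x, n_segments))

# e.g. sax([1, 2, 3, 4, 10, 12, 11, 9], n_segments=4) -> a 4-symbol string
```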
Chen et al. (Chen et al., 2002) have proposed the application of so-called regression cubes to data streams. Given the success of OnLine Analytical Processing (OLAP) technology in applications over statically stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that can be used to answer aggregate queries over the incoming data streams. This research has been extended for adoption in the ongoing project Mining Alarming Incidents in Data Streams (MAIDS). The technique has experimentally shown efficiency in analyzing time series data streams.
39.6 Systems and Applications
Recently, systems and applications that deal with mining data streams have been developed. These systems are application-oriented, except for MAIDS, developed by Cai et al. (Cai et al., 2004), which represents the first attempt to develop a generic data stream mining system. The following list introduces these systems and applications with short descriptions.
Burl et al. (Burl et al., 1999) have developed Diamond Eye for NASA and JPL. The aim of the project is to enable remote systems, as well as scientists, to extract patterns from spatial objects in real-time image streams. The success of this project will enable "a new era of exploration using highly autonomous spacecraft, rovers, and sensors" (Burl et al., 1999). The system uses a high-performance computational facility to process the data mining requests. The scientist uses a web interface, built with Java applets, to connect to the server, which retrieves the images and performs the image mining process.
Kargupta et al. (Kargupta et al., 2002) have developed the first ubiquitous data stream mining system, termed MobiMine. It is a client/server PDA-based distributed data mining application for financial data streams. The system prototype has been developed using a single data source and multiple mobile clients; however, the system is designed to handle multiple data sources. The server functionalities in the proposed system are data collection from different financial web sites and storage, selection of active stocks using common statistical methods, and application of online data mining techniques to the stock data. The client functionalities are portfolio management, using a mobile micro-database to store portfolio data and information about the user's preferences, and construction of the WatchList, which is the first point of interaction between the client and the server: the server computes the most active stocks in the market, and the client in turn selects a subset of this list to construct the personalized WatchList according to an optimization module. The second point of interaction is that the server performs online data mining, transforms the results using Fourier transformation, and finally sends them to the client, which in turn visualizes the results on the PDA screen. It is worth pointing out that the data mining process in MobiMine is performed at the server side, given the resource constraints of a mobile device. With the increased need for onboard data mining in resource-constrained computing environments, Kargupta et al. (Kargupta, 2004) have developed onboard mining techniques for a different application: mining vehicle sensory data streams.
Kargupta et al. (Kargupta, 2004) have developed the Vehicle Data Stream Mining System (VEDAS). It is a ubiquitous data stream mining system that allows continuous monitoring and pattern extraction from data streams generated on board a moving vehicle. The mining component is located on the PDA. VEDAS uses online incremental clustering to model driving behavior.
Tanner et al. (Tanner et al., 2002) have developed the EnVironment for On-Board Processing (EVE) for astronomical data streams. The system analyzes data streams continuously generated from the measurements of different on-board sensors. Only interesting patterns are sent to the ground stations for further analysis, preserving the limited bandwidth.
Srivastava and Stroeve (Srivastava and Stroeve, 2003) work on a NASA project for onboard detection of geophysical processes, such as snow, ice and clouds, using kernel clustering methods for data compression, preserving the limited bandwidth needed to send image streams to the ground centers. The kernel methods have been chosen due to their low computational complexity.
Cai et al. (Cai et al., 2004) have developed an integrated mining and querying system. The system can classify, cluster, count frequencies, and answer queries over data streams. Mining Alarming Incidents of Data Streams (MAIDS) is currently under development, and the project team has recently demonstrated its prototype implementation. Sequential pattern mining and hidden network mining are currently under development.
Pirttikangas et al. (Pirttikangas et al., 2001) have implemented mobile agent-based ubiquitous data mining for a context-aware health club for cyclists; the system is called Genie of the Net. The process starts by collecting information from sensors and databases in order to recognize the information needed for the specific application. This information includes the user's context and other needed information collected by mobile agents. The main scenario for the health club system is that the user has a plan for an exercise. All the needed health information, such as heart rate, is recorded during the exercise and then analyzed using data mining techniques to advise the user after each exercise.
Having discussed the state of the art in mining data streams, in terms of both developed techniques and systems used in different applications, we can use this review as a basis for classifying these techniques into generic categories.
39.7 Taxonomy of Data Stream Mining Approaches
The research problems and challenges in mining data streams discussed earlier have solutions that draw on well-established statistical and computational approaches. We can categorize these solutions as data-based and task-based. In data-based solutions, the idea is to examine only a subset of the whole dataset, or to transform the data vertically or horizontally to an approximate, smaller data representation. In task-based solutions, on the other hand, techniques from computational theory have been adopted to achieve time- and space-efficient solutions. In this section we review these theoretical foundations.
39.7.1 Data-based Techniques
Data-based techniques refer to either summarizing the whole dataset or choosing a subset of the incoming stream to be analyzed. Sampling, load shedding and sketching techniques represent the latter; synopsis data structures and aggregation represent the former. The following subsections outline the basics of these techniques, with pointers to their applications in the context of data stream mining.
Sampling
Sampling refers to the process of probabilistic choice of a data item to be processed (Toivonen, 1996). Sampling is an old statistical technique that has long been used in conventional data mining for large databases. In the context of data stream mining, bounds on the error rate of the computation are given as a function of the sampling rate or size. Very Fast Machine Learning techniques (Domingos and Hulten, 2000) have used the Hoeffding bound (Hoeffding, 1963) to determine the sample size according to a loss function derived from the running mining algorithm. One problem with using sampling in data stream analysis is the unknown dataset size; thus the treatment of a data stream requires special analysis to find the error bounds. Another problem is that checking for anomalies is important in surveillance applications of mining data streams, and sampling is not the right choice for such an application. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among the three parameters: data rate, sampling rate and error bounds.
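The Hoeffding bound that underpins these sample-size derivations is simple to state: after n independent observations of a variable with range R, the true mean lies within epsilon of the sample mean with probability at least 1 - delta, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). A small helper makes the trade-off concrete (function names are illustrative):

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Error bound after n samples: with probability at least 1 - delta,
    the true mean lies within epsilon of the observed mean."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def samples_needed(value_range, epsilon, delta):
    """Invert the bound: the sample size required for a target epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

# e.g. samples_needed(1.0, 0.05, 1e-3) -> 1382 observations
```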
Load Shedding
Load shedding (Babcock et al., 2003, Tatbul et al., 2003) refers to the process of dropping a sequence of data streams. Load shedding has been used successfully in querying data streams, and it has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of data streams that could be needed for structuring the generated models, or that might represent a pattern of interest in time series analysis. Recently, however, it has been used with acceptable accuracy for the classification problem, in an algorithm developed by Chi et al. (Chi et al., 2005) termed Loadstar. It represents the first attempt to use load shedding in high-speed data stream classification problems.
Sketching
Sketching (Babcock et al., 2002, Muthukrishnan, 2003) is the process of randomly projecting a subset of the features; that is, of vertically sampling the incoming stream. Sketching has been applied in comparing different data streams and in aggregate queries. The major drawback of sketching is its accuracy, which makes it hard to use in the context of data stream mining. Principal Component Analysis (PCA) would be a better solution, and has been applied in streaming applications (Kargupta, 2004).
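A sketch in this sense is just multiplication by a fixed random matrix; the Johnson-Lindenstrauss lemma is what makes the projected inner products trustworthy. A minimal numpy illustration follows, with arbitrary sizes and names of my own choosing:

```python
import numpy as np

def make_sketcher(n_features, sketch_size, seed=0):
    """Fix one random projection matrix; applying it to every arriving
    vector shrinks n_features dimensions down to sketch_size while
    approximately preserving inner products (Johnson-Lindenstrauss)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(sketch_size), size=(sketch_size, n_features))
    return lambda x: R @ np.asarray(x, dtype=float)

sketch = make_sketcher(n_features=1000, sketch_size=32)
# sketch(a) @ sketch(b) approximates a @ b for any two stream items a, b
```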
Synopsis Data Structures
Creating a synopsis of data refers to the process of applying summarization techniques capable of condensing the incoming stream for further analysis. Wavelet analysis (Gilbert et al., 2003), histograms, quantiles and frequency moments (Babcock et al., 2002) have been proposed as synopsis data structures. Since a synopsis does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures.
Aggregation
Aggregation is the process of computing statistical measures, such as means and variances, that summarize the incoming data stream. The aggregated data can then be used by the data mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in (Aggarwal et al., 2003, Aggarwal et al., 2004) for clustering and classification of data streams.
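For means and variances, the aggregation can be done in a single pass with constant memory, for example with Welford's classical update. This is a generic sketch, not tied to any of the cited systems:

```python
class RunningStats:
    """One-pass aggregation with constant memory: Welford's incremental
    update for the mean and variance of an unbounded stream."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```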
Definitions, advantages and disadvantages of all of the above data-based approaches are given in Table 39.2.
39.7.2 Task-based Techniques
Task-based techniques are methods that modify existing techniques, or develop new ones, in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window techniques and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis.
Approximation algorithms
Approximation algorithms (Muthukrishnan, 2003) have their roots in algorithm design, which is concerned with designing algorithms for computationally hard problems. These algorithms can produce an approximate solution with error bounds. The idea is that data stream mining tasks are hard computational problems, given the continuity and speed of streams and the resource-constrained computational environment, so approximation algorithms have attracted researchers as a direct solution to data stream mining problems. However, the problem of data rates relative to the available resources cannot be solved by approximation algorithms alone; other tools should be used along with them in order to adapt to the available resources. Approximation algorithms have been used in (Cormode and Muthukrishnan, 2003, Jin et al., 2003) for discovering frequent items.
Sliding Window
The inspiration behind sliding window techniques is that the user is more concerned with the analysis of the most recent data streams. Thus, detailed analysis is done over the most recent data items, with only summarized versions of the old ones. This idea has been adopted in many techniques of the ongoing comprehensive data stream mining system MAIDS (Dong et al., 2003). The main issue with sliding window techniques is how to remove the expired results from the currently created model.
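For simple statistics, removing expired results is straightforward, as the count-based window below shows; the difficulty the text refers to arises for model structures (clusters, rules, trees) whose dependence on individual expired records is indirect. A minimal sketch, with illustrative names:

```python
from collections import deque

class SlidingWindowMean:
    """Count-based sliding window over a stream: each new arrival evicts
    the oldest value once the window is full, so expired items stop
    contributing to the statistic."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # remove the expired result
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)
```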
Algorithm Output Granularity
The algorithm output granularity (AOG) (Gaber et al., 2005, Gaber et al., 2004) introduces the first resource-aware data analysis approach that can cope with fluctuating, very high data rates according to the available memory and the processing speed, represented as time constraints. The AOG performs the local data analysis on a resource-constrained device.
Table 39.2 Data-based Techniques

Technique | Definition | Advantages | Disadvantages
Sampling | Choosing a subset of a dataset for the sake of analysis, using probability theory | Established techniques; error boundaries guaranteed | Poor for anomaly detection
Load shedding | Ignoring a continuous chunk of streaming data | Proved efficiency with data stream querying; used recently with success in data stream mining | Very poor for anomaly detection
Sketching | Randomly projecting a subset of features to be analyzed | Considerably improves the running time | Some unselected features might be of great importance
Synopsis data structure | Quick transformation of the incoming stream into a summarized, compressed form | Independent of the analysis task | Might not be sufficient with high data rates
Aggregation | Calculating statistical measures that capture the features of the data | Independent of the analysis task | Aggregation measures do not capture all the required features of the data