Data Streams
Models and Algorithms
ADVANCES IN DATABASE SYSTEMS
Series Editor: Ahmed K. Elmagarmid
Purdue University, West Lafayette, IN 47907
Other books in the Series:
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-
Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R.L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic;
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
For a complete listing of books in this series, go to http://www.springer.com
Library of Congress Control Number: 2006934111
DATA STREAMS: Models and Algorithms edited by Charu C Aggarwal
ISBN-10: 0-387-28759-0
ISBN-13: 978-0-387-28759-1
e-ISBN-10: 0-387-47534-6
e-ISBN-13: 978-0-387-47534-9
Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch
utilizing NRL's GIDB® Portal System that can be utilized at
http://dmap.nrlssc.navy.mil
Printed on acid-free paper
© 2007 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
1
An Introduction to Data Streams
Charu C. Aggarwal
1 Introduction
2 Stream Mining Algorithms
3 Conclusions and Summary
References
2
On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
1 Introduction
2 The Micro-clustering Based Stream Mining Framework
3 Clustering Evolving Data Streams: A Micro-clustering Approach
3.1 Micro-clustering Challenges
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
3.3 High Dimensional Projected Stream Clustering
4 Classification of Data Streams: A Micro-clustering Approach
4.1 On-Demand Stream Classification
5 Other Applications of Micro-clustering and Research Directions
6 Performance Study and Experimental Results
7 Discussion
References
3
A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
1 Introduction
2 Research Issues
3 Solution Approaches
4 Classification Techniques
4.1 Ensemble Based Classification
4.2 Very Fast Decision Trees (VFDT)
4.3 On Demand Classification
4.4 Online Information Network (OLIN)
4.5 LWClass Algorithm
4.6 ANNCAD Algorithm
4.7 SCALLOP Algorithm
5 Summary
References
4
Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
1 Introduction
2 Overview
3 New Algorithm
4 Work on Other Related Problems
5 Conclusions and Future Directions
References
5
A Survey of Change Diagnosis Algorithms in Evolving Data Streams
6
Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang
3 Architecture for On-line Analysis of Data Streams
3.3 Partial materialization of stream cube
7
Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
1 Load Shedding for Aggregation Queries
1.1 Problem Formulation
1.2 Load Shedding Algorithm
1.3 Extensions
2 Load Shedding in Aurora
3 Load Shedding for Sliding Window Joins
4 Load Shedding for Classification Queries
5 Summary
References
8
The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
0.1 Motivation and Road Map
1 A Solution to the BASICCOUNTING Problem
1.1 The Approximation Scheme
2 Space Lower Bound for BASICCOUNTING Problem
3 Beyond 0's and 1's
4 References and Related Work
5 Conclusion
References
9
A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
1 Introduction
2 Sampling Methods
2.1 Random Sampling with a Reservoir
2.2 Concise Sampling
3 Wavelets
3.1 Recent Research on Wavelet Decomposition in Data Streams
4 Sketches
4.1 Fixed Window Sketches for Massive Time Series
4.2 Variable Window Sketches of Massive Time Series
4.3 Sketches and their applications in Data Streams
4.4 Sketches with p-stable distributions
4.5 The Count-Min Sketch
4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
4.7 Advantages and Limitations of Sketch Based Methods
5 Histograms
5.1 One Pass Construction of Equi-depth Histograms
5.2 Constructing V-Optimal Histograms
5.3 Wavelet Based Histograms for Query Answering
5.4 Sketch Based Methods for Multi-dimensional Histograms
6 Discussion and Challenges
2 Model and Semantics
3 State Management for Stream Joins
3.1 Exploiting Constraints
3.2 Exploiting Statistical Properties
4 Fundamental Algorithms for Stream Join Processing
5 Optimizing Stream Joins
6 Conclusion
Acknowledgments
References
11
Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K Singh
Introduction
Indexing Streams
2.1 Preliminaries and definitions
2.2 Feature extraction
Future Directions
5.1 Distributed monitoring systems
5.2 Probabilistic modeling of sensor networks
5.3 Content distribution networks
Chapter Summary
References
2 Principal component analysis (PCA)
3 Auto-regressive models and recursive least squares
5 Tracking correlations and hidden variables: SPIRIT
6 Putting SPIRIT to work
7 Experimental case studies
8 Performance and accuracy
9 Conclusion
Acknowledgments
13
A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
14
Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen
2 Motivation: Why Distributed Data Stream Mining?
3 Existing Distributed Data Stream Mining Algorithms
4 A local algorithm for distributed data stream mining
5 Bayesian Network Learning from Distributed Data Streams
5.1 Distributed Bayesian Network Learning Algorithm
5.2 Selection of samples for transmission to global site
5.3 Online Distributed Bayesian Network Learning
15
A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
1 Challenges
2 The Data Collection Model
3 Data Communication
4 Query Processing
4.1 Aggregate Queries
4.2 Join Queries
4.3 Top-k Monitoring
4.4 Continuous Queries
5 Compression and Modeling
5.1 Data Distribution Modeling
5.2 Outlier Detection
6 Application: Tracking of Objects using Sensor Networks
7 Summary
References
Index
List of Figures
Varying Horizons for the classification process
Quality comparison (Network Intrusion dataset, horizon=256,
trusion dataset, Time units=2500, buffer_size=1600, kfit=80,
Accuracy comparison (Synthetic dataset B300kC5D20, stream_speed=100, buffer_size=500, kfit=25, init_number=400)
Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, kfit=25,
Stream Proc. Rate (Charit. Donation data, stream_speed=2000)
Stream Proc. Rate (Ntwk. Intrusion data, stream_speed=2000)
Scalability with Data Dimensionality (stream_speed=2000)
Scalability with Number of Clusters (stream_speed=2000)
The ensemble based classification method
Online Information Network System
Karp et al. Algorithm to Find Frequent Items
Improving Algorithm with An Accuracy Bound
StreamMining-Fixed: Algorithm Assuming Fixed Length
StreamMining-Bounded: Algorithm with a Bound on Accuracy
StreamMining: Final Algorithm
The Forward Time Slice Density Estimate
The Reverse Time Slice Density Estimate
The Temporal Velocity Profile
The Spatial Velocity Profile
A tilted time frame with natural time partition
A tilted time frame with logarithmic time partition
A tilted time frame with progressive logarithmic time partition
Two critical layers in the stream cube
Cube structure from the m-layer to the o-layer
H-tree structure for cube computation
Cube computation: time and memory usage vs # tuples at the m-layer for the data set D5L3C10
Cube computation: time and space vs # of dimensions for the data set L3C10T100K
Cube computation: time and space vs # of levels for the data set D5C10T50K
Data Flow Diagram
Illustration of Example 7.1
Illustration of Observation 1.4
Procedure SetSamplingRate(x, R, )
Sliding window model notation
An illustration of an Exponential Histogram (EH)
Illustration of the Wavelet Decomposition
The Error Tree from the Wavelet Decomposition
Drifting normal distributions
Example ECBs
ECBs for sliding-window joins under the frequency-based model
ECBs under the age-based model
The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data
Exact feature extraction, update rate T = 1
Incremental feature extraction, update rate T = 1
Approximate feature extraction, update rate T = 1
Incremental feature extraction, update rate T = 2
Transforming an MBR using discrete wavelet transform
Transformation corresponds to rotating the axes (the rotation angle = 45° for Haar wavelets)
Aggregate query decomposition and approximation composition for a query window of size w = 26
Subsequence query decomposition for a query window
Wall-clock times (including time to update forecasting models)
Hidden variable tracking accuracy
Centralized Stream Processing Architecture (left); Distributed Stream Processing Architecture (right)
(A) The area inside an ε circle. (B) Seven evenly spaced vectors u_1 ... u_7. (C) The borders of the seven halfspaces û_i · x ≥ ε define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces.
Quality of the algorithm with increasing number of nodes
Cost of the algorithm with increasing number of nodes
ASIA Model
Bayesian network for online distributed parameter learning
Simulation results for online Bayesian learning: (left) KL distance between the conditional probabilities for the networks Bol(k) and Bb, for three nodes; (right) KL distance between the conditional probabilities for the networks Bol(k) and Bb, for three nodes
An instance of dynamic cluster assignment in sensor system according to LEACH protocol. Sensor nodes of the same clusters are shown with same symbol and the cluster heads are marked with highlighted symbols.
Interest Propagation, gradient setup and path reinforcement for data propagation in directed-diffusion paradigm
Event is described in terms of attribute value pairs. The figure illustrates an event detected based on the location of the node and target detection.
Sensors aggregating the result for a MAX query in-network
Error filter assignments in tree topology. The nodes that are shown shaded are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by E_i, i.e., the maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval.
Usage of duplicate-sensitive sketches to allow result propagation to multiple parents, providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (Level 2 in the figure) are received at more than one node in the lower level (Level 1 in the figure).
(a) Two dimensional Gaussian model of the measurements from sensors S1 and S2. (b) The marginal distribution of the values of sensor S1, given S2: new observations from one sensor are used to estimate the posterior density of the other sensors.
Estimation of probability distribution of the measurements over sliding window
Trade-offs in modeling sensor data
Tracking a target. The leader nodes estimate the probability of the target's direction and determine the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted.
List of Tables
An example of snapshots stored for α = 2 and l = 2
A geometric time window
Data Based Techniques
Task Based Techniques
Typical LWClass Training Results
Summary of Reviewed Techniques
Algorithms for Frequent Itemsets Mining over Data Streams
Summary of results for the sliding-window model
An Example of Wavelet Coefficient Computation
Description of notation
Description of datasets
Reconstruction accuracy (mean squared error rate)
Preface
In recent years, progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. Such data sets, which grow continuously and rapidly over time, are referred to as data streams. In addition, the development of sensor technology has made it possible to monitor many events in real time. While data mining has become a fairly well established field, the data stream problem poses a number of unique challenges which are not easily solved by traditional data mining methods.
The topic of data streams is a very recent one. The first research papers on this topic appeared slightly under a decade ago, and since then this field has grown rapidly. A large volume of literature has been published in this field over the past few years. The work is also of great interest to practitioners who have to mine actionable insights from large volumes of continuously growing data. Because of the large volume of literature in the field, practitioners and researchers may often find it an arduous task to isolate the right literature for a given topic. In addition, from a practitioner's point of view, the use of research literature is even more difficult, since much of the relevant material is buried in publications. While handling a real problem, it may often be difficult to know where to look in order to solve the problem.
This book contains contributed chapters from a variety of well known researchers in the data mining field. While the chapters are written by different researchers, the topics and content are organized so as to present the most important models, algorithms, and applications in the data mining field in a structured and concise way. In addition, the book is organized to make it more accessible to application driven practitioners. Given the lack of structurally organized information on the topic, the book provides insights which are not easily accessible otherwise. In addition, the book will be a great help to researchers and graduate students interested in the topic. The popularity and current nature of the topic of data streams is likely to make it an important source of information for researchers interested in the topic.
The data mining community has grown rapidly over the past few years, and the
topic of data streams is one of the most relevant and current areas of interest to