Data Streams: Models and Algorithms, Part 1


Data Streams

Models and Algorithms


ADVANCES IN DATABASE SYSTEMS

Series Editor: Ahmed K. Elmagarmid

Purdue University, West Lafayette, IN 47907

Other books in the Series:

SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6

STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3

FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1

MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5

ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5

ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-

Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4

SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R. L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1

INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0

DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0

MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dad Vrsalovic;

FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6

For a complete listing of books in this series, go to http://www.springer.com


Library of Congress Control Number: 2006934111

DATA STREAMS: Models and Algorithms, edited by Charu C. Aggarwal

ISBN-10: 0-387-28759-0

ISBN-13: 978-0-387-28759-1

e-ISBN-10: 0-387-47534-6

e-ISBN-13: 978-0-387-47534-9

Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch, utilizing NRL's GIDB® Portal System that can be utilized at http://dmap.nrlssc.navy.mil

Printed on acid-free paper

© 2007 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.


1

An Introduction to Data Streams

Charu C. Aggarwal

1 Introduction

2 Stream Mining Algorithms

3 Conclusions and Summary

References

2

On Clustering Massive Data Streams: A Summarization Paradigm

Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu

1 Introduction

2 The Micro-clustering Based Stream Mining Framework

3 Clustering Evolving Data Streams: A Micro-clustering Approach

3.1 Micro-clustering Challenges

3.2 Online Micro-cluster Maintenance: The CluStream Algorithm

3.3 High Dimensional Projected Stream Clustering

4 Classification of Data Streams: A Micro-clustering Approach

4.1 On-Demand Stream Classification

5 Other Applications of Micro-clustering and Research Directions

6 Performance Study and Experimental Results

7 Discussion

References

3

A Survey of Classification Methods in Data Streams

Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy

1 Introduction

2 Research Issues

3 Solution Approaches

4 Classification Techniques

4.1 Ensemble Based Classification

4.2 Very Fast Decision Trees (VFDT)


4.3 On Demand Classification

4.4 Online Information Network (OLIN)

4.5 LWClass Algorithm

4.6 ANNCAD Algorithm

4.7 SCALLOP Algorithm

5 Summary

References

4

Frequent Pattern Mining in Data Streams

Ruoming Jin and Gagan Agrawal

1 Introduction

2 Overview

3 New Algorithm

4 Work on Other Related Problems

5 Conclusions and Future Directions

References

5

A Survey of Change Diagnosis Algorithms in Evolving Data Streams

6

Streams Using Stream Cubes

Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang

3 Architecture for On-line Analysis of Data Streams

3.3 Partial materialization of stream cube


Contents

7

Load Shedding in Data Stream Systems

Brian Babcock, Mayur Datar and Rajeev Motwani

1 Load Shedding for Aggregation Queries

1.1 Problem Formulation

1.2 Load Shedding Algorithm

1.3 Extensions

2 Load Shedding in Aurora

3 Load Shedding for Sliding Window Joins

4 Load Shedding for Classification Queries

5 Summary

References

8

The Sliding-Window Computation Model and Results

Mayur Datar and Rajeev Motwani

0.1 Motivation and Road Map

1 A Solution to the BASICCOUNTING Problem

1.1 The Approximation Scheme

2 Space Lower Bound for BASICCOUNTING Problem

3 Beyond 0's and 1's

4 References and Related Work

5 Conclusion

References

9

A Survey of Synopsis Construction in Data Streams

Charu C. Aggarwal, Philip S. Yu

1 Introduction

2 Sampling Methods

2.1 Random Sampling with a Reservoir

2.2 Concise Sampling

3 Wavelets

3.1 Recent Research on Wavelet Decomposition in Data Streams

4 Sketches

4.1 Fixed Window Sketches for Massive Time Series

4.2 Variable Window Sketches of Massive Time Series

4.3 Sketches and their applications in Data Streams

4.4 Sketches with p-stable distributions

4.5 The Count-Min Sketch

4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements

4.7 Advantages and Limitations of Sketch Based Methods

5 Histograms

5.1 One Pass Construction of Equi-depth Histograms

5.2 Constructing V-Optimal Histograms

5.3 Wavelet Based Histograms for Query Answering

5.4 Sketch Based Methods for Multi-dimensional Histograms

6 Discussion and Challenges


2 Model and Semantics

3 State Management for Stream Joins

3.1 Exploiting Constraints

3.2 Exploiting Statistical Properties

4 Fundamental Algorithms for Stream Join Processing

5 Optimizing Stream Joins

6 Conclusion

Acknowledgments

References

11

Indexing and Querying Data Streams

Ahmet Bulut, Ambuj K. Singh

Introduction

Indexing Streams

2.1 Preliminaries and definitions

2.2 Feature extraction

Future Directions

5.1 Distributed monitoring systems

5.2 Probabilistic modeling of sensor networks

5.3 Content distribution networks

Chapter Summary

References

2 Principal component analysis (PCA)

3 Auto-regressive models and recursive least squares

5 Tracking correlations and hidden variables: SPIRIT

6 Putting SPIRIT to work

7 Experimental case studies


8 Performance and accuracy

9 Conclusion

Acknowledgments

13

A Survey of Distributed Mining of Data Streams

Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey

14

Data Stream Mining

Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen

2 Motivation: Why Distributed Data Stream Mining?

3 Existing Distributed Data Stream Mining Algorithms

4 A local algorithm for distributed data stream mining

5 Bayesian Network Learning from Distributed Data Streams

5.1 Distributed Bayesian Network Learning Algorithm

5.2 Selection of samples for transmission to global site

5.3 Online Distributed Bayesian Network Learning

15

A Survey of Stream Processing Problems and Techniques in Sensor Networks

Sharmila Subramaniam, Dimitrios Gunopulos

1 Challenges


2 The Data Collection Model

3 Data Communication

4 Query Processing

4.1 Aggregate Queries

4.2 Join Queries

4.3 Top-k Monitoring

4.4 Continuous Queries

5 Compression and Modeling

5.1 Data Distribution Modeling

5.2 Outlier Detection

6 Application: Tracking of Objects using Sensor Networks

7 Summary

References

Index


List of Figures

Varying Horizons for the classification process

Quality comparison (Network Intrusion dataset, horizon=256,

trusion dataset, Time units=2500, buffer_size=1600, kfit=80,

Accuracy comparison (Synthetic dataset B300kC5D20, stream_speed=100, buffer_size=500, kfit=25, init_number=400)

Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, kfit=25,

Stream Proc. Rate (Charit. Donation data, stream_speed=2000)

Stream Proc. Rate (Ntwk. Intrusion data, stream_speed=2000)

Scalability with Data Dimensionality (stream_speed=2000)

Scalability with Number of Clusters (stream_speed=2000)

The ensemble based classification method

Online Information Network System

Karp et al. Algorithm to Find Frequent Items

Improving Algorithm with An Accuracy Bound


StreamMining-Fixed: Algorithm Assuming Fixed Length

StreamMining-Bounded: Algorithm with a Bound on Accuracy

StreamMining: Final Algorithm

The Forward Time Slice Density Estimate

The Reverse Time Slice Density Estimate

The Temporal Velocity Profile

The Spatial Velocity Profile

A tilted time frame with natural time partition

A tilted time frame with logarithmic time partition

A tilted time frame with progressive logarithmic time partition

Two critical layers in the stream cube

Cube structure from the m-layer to the o-layer

H-tree structure for cube computation

Cube computation: time and memory usage vs. # tuples at the m-layer for the data set D5L3C10

Cube computation: time and space vs. # of dimensions for the data set L3C10T100K

Cube computation: time and space vs. # of levels for the data set D5C10T50K

Data Flow Diagram

Illustration of Example 7.1

Illustration of Observation 1.4

Procedure SetSamplingRate(x, R, )

Sliding window model notation

An illustration of an Exponential Histogram (EH)

Illustration of the Wavelet Decomposition

The Error Tree from the Wavelet Decomposition

Drifting normal distributions

Example ECBs

ECBs for sliding-window joins under the frequency-based model

ECBs under the age-based model

The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data

Exact feature extraction, update rate T = 1

Incremental feature extraction, update rate T = 1


Approximate feature extraction, update rate T = 1

Incremental feature extraction, update rate T = 2

Transforming an MBR using discrete wavelet transform. Transformation corresponds to rotating the axes (the rotation angle = 45° for Haar wavelets)

Aggregate query decomposition and approximation composition for a query window of size w = 26

Subsequence query decomposition for a query window

Wall-clock times (including time to update forecasting models)

Hidden variable tracking accuracy

Centralized Stream Processing Architecture (left). Distributed Stream Processing Architecture (right)

(A) The area inside an $\epsilon$ circle. (B) Seven evenly spaced vectors $\hat{u}_1 \ldots \hat{u}_7$. (C) The borders of the seven halfspaces $\hat{u}_i \cdot x \geq \epsilon$ define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces

Quality of the algorithm with increasing number of nodes

Cost of the algorithm with increasing number of nodes

ASIA Model

Bayesian network for online distributed parameter learning

Simulation results for online Bayesian learning: (left) KL distance between the conditional probabilities for the networks $B_{ol}(k)$ and $B_{be}$ for three nodes; (right) KL distance between the conditional probabilities for the networks $B_{ol}(k)$ and $B_{be}$ for three nodes

An instance of dynamic cluster assignment in sensor system according to LEACH protocol. Sensor nodes of the same clusters are shown with same symbol and the cluster heads are marked with highlighted symbols


Interest Propagation, gradient setup and path reinforcement for data propagation in directed-diffusion paradigm. Event is described in terms of attribute value pairs. The figure illustrates an event detected based on the location of the node and target detection

Sensors aggregating the result for a MAX query in-network

Error filter assignments in tree topology. The nodes that are shown shaded are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by $E_i$, i.e., maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval

Usage of duplicate-sensitive sketches to allow result propagation to multiple parents providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (level 2 in the figure) are received at more than one node in the lower level (level 1 in the figure)

(a) Two dimensional Gaussian model of the measurements from sensors S1 and S2. (b) The marginal distribution of the values of sensor S1, given S2: new observations from one sensor are used to estimate the posterior density of the other sensors

Estimation of probability distribution of the measurements over sliding window

Trade-offs in modeling sensor data

Tracking a target. The leader nodes estimate the probability of the target's direction and determine the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted


List of Tables

An example of snapshots stored for $\alpha = 2$ and $l = 2$

A geometric time window

Data Based Techniques

Task Based Techniques

Typical LWClass Training Results

Summary of Reviewed Techniques

Algorithms for Frequent Itemsets Mining over Data Streams

Summary of results for the sliding-window model

An Example of Wavelet Coefficient Computation

Description of notation

Description of datasets

Reconstruction accuracy (mean squared error rate)


Preface

In recent years, the progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. Such data sets, which grow continuously and rapidly over time, are referred to as data streams. In addition, the development of sensor technology has made it possible to monitor many events in real time. While data mining has become a fairly well established field, the data stream problem poses a number of unique challenges which are not easily solved by traditional data mining methods.

The topic of data streams is a very recent one. The first research papers on this topic appeared slightly under a decade ago, and since then the field has grown rapidly. A large volume of literature has been published in this field over the past few years. The work is also of great interest to practitioners who have to mine actionable insights from large volumes of continuously growing data. Because of the large volume of literature in the field, practitioners and researchers may often find it arduous to isolate the right literature for a given topic. From a practitioner's point of view, the use of research literature is even more difficult, since much of the relevant material is buried in publications. While handling a real problem, it may often be difficult to know where to look in order to solve the problem.

This book contains contributed chapters from a variety of well known researchers in the data mining field. While the chapters are written by different researchers, the topics and content are organized so as to present the most important models, algorithms, and applications in the data mining field in a structured and concise way. In addition, the book is organized in order to make it more accessible to application driven practitioners. Given the lack of structurally organized information on the topic, the book provides insights which are not easily accessible otherwise. In addition, the book will be a great help to researchers and graduate students interested in the topic. The popularity and current nature of the topic of data streams is likely to make it an important source of information for researchers interested in the topic. The data mining community has grown rapidly over the past few years, and the topic of data streams is one of the most relevant and current areas of interest to

Title	Data streams models and algorithms
Author	Charu C. Aggarwal
Institution	Purdue University
Field	Database Systems
Type	edited book
Year of publication	2007
City	New York
