Data Streams: Models and Algorithms (Part 9)


A Survey of Join Processing in Data Streams

few $S_2$ tuples, while another tuple may behave in the exact opposite way. The first issue is tackled by [5], who provided a family of algorithms for adaptively finding the optimal order to apply a series of filters (joining a tuple with a stream can be regarded as subjecting the tuple to a filter) through runtime profiling. In particular, the A-Greedy algorithm is able to capture correlations among filter selectivities, and is guaranteed to converge to an ordering within a constant factor of the optimal. The theoretical guarantee extends to star joins; for general join graphs, though A-Greedy still can be used, the theoretical guarantee no longer holds. The second issue is recently addressed by an approach called CBR [9], or content-based routing, which makes the choice of query plan dependent on the values of the incoming tuple's "classifier attributes," whose values strongly correlate with operator selectivities. In effect, CBR is able to process each incoming tuple with a customized query plan.

One problem with MJoin is that it may incur a significant amount of recomputation. Consider again the four-way join among $S_1, \ldots, S_4$, now processed by a single MJoin operator. Whenever a new tuple $s_3$ arrives in $S_3$, MJoin in effect executes the query $S_1 \bowtie S_2 \bowtie \{s_3\} \bowtie S_4$; similarly, whenever a new tuple $s_4$ arrives in $S_4$, MJoin executes $S_1 \bowtie S_2 \bowtie S_3 \bowtie \{s_4\}$. The common subquery $S_1 \bowtie S_2$ is processed over and over again for these $S_3$ and $S_4$ tuples. In contrast, the XJoin plan (($S_1$ XJoin $S_2$) XJoin $S_3$) XJoin $S_4$ materializes all its intermediate results in hash tables, including $S_1 \bowtie S_2$; new tuples from $S_3$ and $S_4$ simply have to probe this hash table, thereby avoiding recomputation. The optimal solution may well lie between these two extremes, as pointed out by [6]. They proposed an adaptive caching strategy, A-Caching, which starts with MJoins and adds join subresult caches adaptively. A-Caching profiles cache benefit and cost online, selects caches dynamically, and allocates memory to caches dynamically. With this approach, the entire spectrum of caching options from MJoins to XJoins can be explored.
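The contrast between recomputing a shared subquery and probing a materialized subresult can be illustrated with a small sketch. It is not the MJoin, XJoin, or A-Caching implementation; the three-stream setup, the attribute names `a`, `b`, `x`, `y`, and the helper names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(tuples, attr):
    """Hash index: attribute value -> list of tuples carrying that value."""
    index = defaultdict(list)
    for t in tuples:
        index[t[attr]].append(t)
    return index


# Hypothetical join states: S1 joins S2 on "a", and S2 joins S3 on "b".
s1 = [{"a": 1, "x": "p"}, {"a": 2, "x": "q"}]
s2 = [{"a": 1, "b": 10}, {"a": 2, "b": 10}, {"a": 2, "b": 20}]
s1_by_a = build_index(s1, "a")
s2_by_b = build_index(s2, "b")


def mjoin_probe(s3_tuple):
    """MJoin style: re-derive the matching part of S1 join S2 for every new S3 tuple."""
    results = []
    for t2 in s2_by_b.get(s3_tuple["b"], []):
        for t1 in s1_by_a.get(t2["a"], []):
            results.append({**t1, **t2, **s3_tuple})
    return results


# XJoin style: materialize S1 join S2 once and index it on the next join attribute.
s12 = [{**t1, **t2} for t2 in s2 for t1 in s1_by_a.get(t2["a"], [])]
s12_by_b = build_index(s12, "b")


def cached_probe(s3_tuple):
    """Probe the materialized subresult instead of recomputing it."""
    return [{**t12, **s3_tuple} for t12 in s12_by_b.get(s3_tuple["b"], [])]


new_tuple = {"b": 10, "y": "z"}
assert mjoin_probe(new_tuple) == cached_probe(new_tuple)
```

Both probes return the same answers; the cached variant does a single lookup per arriving tuple, but the materialized subresult consumes memory and must itself be maintained whenever S1 or S2 changes. That trade-off is what an adaptive caching policy has to weigh.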

A number of other papers also consider multi-way stream joins. Golab and Özsu [23] studied processing and optimization of multi-way sliding-window joins. Traditionally, we eagerly remove (expire) tuples that are no longer part of the sliding window, and eagerly generate output tuples whenever input arrives. The authors proposed algorithms supporting lazy expiration and lazy evaluation as alternatives, which achieve higher efficiency at the expense of higher memory requirements and longer response times, respectively. Hammad et al. [27] considered multi-way stream joins where a time-based window constraint can be specified for each pair (or, in general, subset) of input streams. An interesting algorithm called FEW is proposed, which computes a forward point in time before which all arriving tuples can join, thereby avoiding repeated checking of window constraints.

Eddies [3] are a novel approach towards stream query processing and optimization that is markedly different from the standard plan-based approaches. Eddies eliminate query plans entirely by routing each input tuple adaptively across the operators that need to process it. Interestingly, in eddies, the behavior of SteM [36] mimics that of MJoin, while STAIRS [16] is able to emulate XJoin. Note that while eddies provide the mechanisms for adapting the processing strategy on an individual tuple basis, currently their policies typically do not result in plans that change for every incoming tuple. It would be nice to see how features of CBR can be supported in eddies.

6 Conclusion

In this chapter, we have presented an overview of research problems and recent advances in join processing for data streams. Stream processing is a young and exciting research area, yet it also has roots in and connections to well-established areas in databases as well as computer science in general. In Section 3.2, we have already discussed the relationship between stream join state management and classic caching. Now, let us briefly re-examine parts of this chapter in light of their relationship to materialized views [25].

The general connection between stream processing and materialized views has long been identified [8]. This connection is reflected in the way that we specify the semantics of stream joins: by regarding them as views and defining their output as the view update stream resulting from base relation updates (Section 2). Recall that the standard semantics requires the output sequence to reflect the exact sequence of states of the underlying view, which is analogous to the notion of complete and strong consistency of a data warehouse view with respect to its source relations [46]. The connection does not stop at the semantics. The problem of determining what needs to be retained in the state to compute a stream join is analogous to the problem of deriving auxiliary views to make a join view self-maintainable [35]. Just as constraints can be used to reduce stream join state (Section 3.1), they have also been used to help expire data from data warehouses without affecting the maintainability of warehouse views [21]. For a stream join $S_1 \bowtie \cdots \bowtie S_n$, processing an incoming tuple from stream $S_i$ is analogous to maintaining a join view incrementally by evaluating a maintenance query $S_1 \bowtie \cdots \bowtie \Delta S_i \bowtie \cdots \bowtie S_n$. Since there are $n$ different forms of maintenance queries (one for each $i$), it is natural to optimize each form differently, which echoes the intuition behind the asymmetric processing strategy of [30] and MJoin [43]. In fact, we can optimize the maintenance query for each instance of $\Delta S_i$, which would achieve the same goal of supporting a customized query plan for each tuple as CBR [9]. Finally, noticing that the maintenance queries run frequently and share many common subqueries, we may choose to materialize some subqueries as additional views to improve query performance, which is also what A-Caching [6] tries to accomplish.
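The form of the maintenance query can be spelled out as a short derivation. This is a sketch under simplifying assumptions that are not stated in the text: only insertions are considered, a single stream receives new tuples at a time, and the symbols $V$ and $V_{\text{new}}$ for the view before and after the update are introduced here for illustration.

```latex
% View over the current join states
V = S_1 \bowtie \cdots \bowtie S_i \bowtie \cdots \bowtie S_n

% New tuples \Delta S_i arrive on stream i; join distributes over union
V_{\text{new}}
  = S_1 \bowtie \cdots \bowtie (S_i \cup \Delta S_i) \bowtie \cdots \bowtie S_n
  = V \,\cup\, \bigl( S_1 \bowtie \cdots \bowtie \Delta S_i \bowtie \cdots \bowtie S_n \bigr)

% Hence the output produced for the arrival is the maintenance query
\Delta V = S_1 \bowtie \cdots \bowtie \Delta S_i \bowtie \cdots \bowtie S_n ,
\qquad i = 1, \ldots, n
```

Because $\Delta S_i$ is typically a single tuple, each of the $n$ forms starts from a very small input, which is why the order in which the remaining states are probed matters for every form separately.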


Of course, despite high-level similarities, techniques from the two areas (data streams and materialized views) may still differ significantly in actual details. Nonetheless, it would be nice to develop a general framework that unifies both areas, or, less ambitiously, to apply ideas from one area to the other. Many such possibilities exist. For example, methods and insights from the well-studied problems of answering queries using views [26] and view selection [14] could be extended and applied to data streams: Given a set of stream queries running continuously in a system, what materialized views (over join states and database relations) and/or additional stream queries can we create to improve the performance of the system? Another area is distributed stream processing. Distributed stream processing can be regarded as view maintenance in a distributed setting, which has been studied extensively in the context of data warehousing. Potentially applicable in this setting are techniques for making warehouses self-maintainable [35], optimizing view maintenance queries across distributed sources [31], ensuring consistency of multi-source warehouse views [46], etc. Conversely, stream processing techniques can be applied to materialized views as well. In particular, view maintenance could benefit from optimization techniques that exploit update stream statistics (Section 3.2). Also, selection of materialized views for performance can be improved by adaptive caching techniques (Section 5).

Besides the future work directions mentioned above and throughout the chapter, another important direction worth exploring is the connection between data stream processing and distributed event-based systems [19] such as publish/subscribe systems. Such systems need to scale to thousands or even millions of subscriptions, which are essentially continuous queries over event streams. While efficient techniques for handling continuous selections already exist, scalable processing of continuous joins remains a challenging problem. Hammad et al. [28] considered shared processing of stream joins with identical join conditions but different sliding-window durations. We need to consider more general query forms, e.g., joins with different join conditions as well as additional selection conditions on input streams. NiagaraCQ [13] and CACQ [32] are able to group-process selections and share processing of identical join operations. However, there is no group or shared processing of joins with different join conditions, and processing selections separately from joins limits optimization potentials. PSoup [11] treats queries as data, thereby allowing set-oriented processing of queries with arbitrary join and selection conditions. Still, new indexing and processing techniques must be developed for the system to be able to process each event in time sublinear in the number of subscriptions.


Acknowledgments

This work is supported by an NSF CAREER Award under grant IIS-0238386. We would also like to thank Shivnath Babu, Yuguo Chen, Kamesh Munagala, and members of the Duke Database Research Group for their discussions.

References

[1] Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. (2002). Characterizing memory requirements for queries over continuous data streams. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems, pages 221-232, Madison, Wisconsin, USA.

[2] Arasu, A., Babu, S., and Widom, J. (2003). The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, InfoLab, Stanford University.

[3] Avnur, R. and Hellerstein, J. M. (2000). Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 261-272, Dallas, Texas, USA.

[4] Ayad, A. and Naughton, J. F. (2004). Static optimization of conjunctive queries with sliding windows over infinite streams. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 419-430, Paris, France.

[5] Babu, S., Motwani, R., Munagala, K., Nishizawa, I., and Widom, J. (2004a). Adaptive ordering of pipelined stream filters. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 407-418, Paris, France.

[6] Babu, S., Munagala, K., Widom, J., and Motwani, R. (2005). Adaptive caching for continuous queries. In Proceedings of the 2005 International Conference on Data Engineering, Tokyo, Japan.

[7] Babu, S., Srivastava, U., and Widom, J. (2004b). Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Transactions on Database Systems, 29(3):545-580.

[8] Babu, S. and Widom, J. (2001). Continuous queries over data streams. ACM SIGMOD Record.

[9] Bizarro, P., Babu, S., DeWitt, D., and Widom, J. (2005). Content-based routing: Different plans for different data. In Proceedings of the 2005 International Conference on Very Large Data Bases, Trondheim, Norway.

[10] Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., and Zdonik, S. B. (2002). Monitoring streams - a new class of data management applications. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 215-226, Hong Kong, China.

[11] Chandrasekaran, S. and Franklin, M. J. (2003). PSoup: a system for streaming queries over streaming data. The VLDB Journal, 12(2):140-156.

[12] Chaudhuri, S., Motwani, R., and Narasayya, V. R. (1999). On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 263-274, Philadelphia, Pennsylvania, USA.

[13] Chen, J., DeWitt, D. J., Tian, F., and Wang, Y. (2000). NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 379-390, Dallas, Texas, USA.

[14] Chirkova, R., Halevy, A. Y., and Suciu, D. (2001). A formal perspective on the view selection problem. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 59-68, Roma, Italy.

[15] Das, A., Gehrke, J., and Riedewald, M. (2003). Approximate join processing over data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 40-51, San Diego, California, USA.

[16] Deshpande, A. and Hellerstein, J. M. (2004). Lifting the burden of history from adaptive query processing. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 948-959, Toronto, Canada.

[17] Ding, L., Mehta, N., Rundensteiner, E., and Heineman, G. (2004). Joining punctuated streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.

[18] Ding, L. and Rundensteiner, E. A. (2004). Evaluating window joins over punctuated streams. In Proceedings of the 2004 International Conference on Information and Knowledge Management, pages 98-107, Washington DC, USA.

[19] Dingel, J. and Strom, R., editors (2005). Proceedings of the 2005 International Workshop on Distributed Event Based Systems, Columbus, Ohio, USA.

[20] Dittrich, J.-P., Seeger, B., Taylor, D. S., and Widmayer, P. (2002). Progressive merge join: A generic and non-blocking sort-based join algorithm. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 299-310, Hong Kong, China.

[21] Garcia-Molina, H., Labio, W., and Yang, J. (1998). Expiring data in a warehouse. In Proceedings of the 1998 International Conference on Very Large Data Bases, pages 500-511, New York City, New York, USA.


[22] Golab, L., Garg, S., and Özsu, T. (2004). On indexing sliding windows over on-line data streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.

[23] Golab, L. and Özsu, M. T. (2003). Processing sliding window multi-joins in continuous queries over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 500-511, Berlin, Germany.

[24] Golab, L. and Özsu, M. T. (2005). Update-pattern-aware modeling and processing of continuous queries. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 658-669, Baltimore, Maryland, USA.

[25] Gupta, A. and Mumick, I. S., editors (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press.

[26] Halevy, A. Y. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4):270-294.

[27] Hammad, M. A., Aref, W. G., and Elmagarmid, A. K. (2003a). Stream window join: Tracking moving objects in sensor-network databases. In Proceedings of the 2003 International Conference on Scientific and Statistical Database Management, pages 75-84, Cambridge, Massachusetts, USA.

[28] Hammad, M. A., Franklin, M. J., Aref, W. G., and Elmagarmid, A. K. (2003b). Scheduling for shared window joins over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 297-308, Berlin, Germany.

[29] Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. (1999). An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 299-310, Philadelphia, Pennsylvania, USA.

[30] Kang, J., Naughton, J. F., and Viglas, S. (2003). Evaluating window joins over unbounded streams. In Proceedings of the 2003 International Conference on Data Engineering, pages 341-352, Bangalore, India.

[31] Liu, B. and Rundensteiner, E. A. (2005). Cost-driven general join view maintenance over distributed data sources. In Proceedings of the 2005 International Conference on Data Engineering, pages 578-579, Tokyo, Japan.

[32] Madden, S., Shah, M. A., Hellerstein, J. M., and Raman, V. (2002). Continuously adaptive continuous queries over streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA.

[33] Mokbel, M. F., Lu, M., and Aref, W. G. (2004). Hash-merge join: A non-blocking join algorithm for producing fast and early join results. In Proceedings of the 2004 International Conference on Data Engineering, pages 251-263, Boston, Massachusetts, USA.

[34] Olken, F. (1993). Random Sampling from Databases. PhD thesis, University of California at Berkeley.

[35] Quass, D., Gupta, A., Mumick, I. S., and Widom, J. (1996). Making views self-maintainable for data warehousing. In Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems, pages 158-169, Miami Beach, Florida, USA.

[36] Raman, V., Deshpande, A., and Hellerstein, J. M. (2003). Using state modules for adaptive query processing. In Proceedings of the 2003 International Conference on Data Engineering, pages 353-364, Bangalore, India.

[37] Srivastava, U. and Widom, J. (2004). Memory-limited execution of windowed stream joins. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 324-335, Toronto, Canada.

[38] Tao, Y., Yiu, M. L., Papadias, D., Hadjieleftheriou, M., and Mamoulis, N. (2005). RPJ: Producing fast join results on streams through rate-based optimization. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 371-382, Baltimore, Maryland, USA.

[39] Tatbul, N., Cetintemel, U., Zdonik, S. B., Cherniack, M., and Stonebraker, M. (2003). Load shedding in a data stream manager. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 309-320, Berlin, Germany.

[40] Tucker, P. A., Maier, D., Sheard, T., and Fegaras, L. (2003). Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):555-568.

[41] Urhan, T. and Franklin, M. J. (2001). Dynamic pipeline scheduling for improving interactive query performance. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 501-510, Roma, Italy.

[42] Viglas, S. D. and Naughton, J. F. (2002). Rate-based query optimization for streaming information sources. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 37-48, Madison, Wisconsin, USA.

[43] Viglas, S. D., Naughton, J. F., and Burger, J. (2003). Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 285-296, Berlin, Germany.

[44] Wilschut, A. N. and Apers, P. M. G. (1991). Dataflow query execution in a parallel main-memory environment. In Proceedings of the 1991 International Conference on Parallel and Distributed Information Systems, pages 68-77, Miami Beach, Florida, USA.

[45] Xie, J., Yang, J., and Chen, Y. (2005). On joining and caching stochastic streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 359-370, Baltimore, Maryland, USA.

[46] Zhuge, Y., Garcia-Molina, H., and Wiener, J. L. (1998). Consistency algorithms for multi-source warehouse view maintenance. Distributed and Parallel Databases, 6(1):7-40.

Department of Computer Science

University of California Santa Barbara

Santa Barbara, CA 93106

ambuj@cs.ucsb.edu

Abstract: Online monitoring of data streams poses a challenge in many data-centric applications including network traffic management, trend analysis, web-click streams, intrusion detection, and sensor networks. Indexing techniques used in these applications have to be time and space efficient while providing a high quality of answers to user queries: (1) queries that monitor aggregates, such as finding surprising levels ("volatility" of a data stream), and detecting bursts, and (2) queries that monitor trends, such as detecting correlations and finding similar patterns. Data stream indexing becomes an even more challenging task when we take into account the dynamic nature of underlying raw data. For example, bursts of events can occur at variable temporal modalities from hours to days to weeks. We focus on a multi-resolution indexing architecture. The architecture enables the discovery of "interesting" behavior online, provides flexibility in user query definitions, and interconnects registered queries for real-time and in-depth analysis.

Keywords: stream indexing, monitoring real-time systems, mining continuous data flows, multi-resolution index, synopsis maintenance, trend analysis, network traffic analysis


1 Introduction

Raw stream data, such as faults and alarms generated by network traffic monitors and log records generated by web servers, are almost always at a low level and too large to maintain in main memory. One can instead summarize the data and compute synopsis structures at meaningful abstraction levels on the fly. The synopsis is a small-space data structure, and can be updated incrementally as new stream values arrive. Later in the operational cycle, it can be used to discover interesting behavior, which prompts in-depth analysis at lower levels of abstraction [10].

Consider the following application in astrophysics: the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. This yields an unusually high number of detectable events (high-energy photons) over a certain time period, which indicates the existence of a Gamma Ray Burst. If we know the duration of the shower, we can maintain a count on the total number of events over sliding windows of the known window size and raise an alarm when the moving sum is above a threshold. Unfortunately, in many cases, we cannot predict the duration of the burst period. The burst of high-energy photons might last for a few milliseconds, a few hours, or even a few days [31].
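For the simple case in which the burst duration is known, the moving-sum alarm described above can be sketched as follows. The function name, the event counts, and the threshold are assumptions made for illustration, not values from the text.

```python
from collections import deque


def detect_bursts(stream, window_size, threshold):
    """Return (time, moving_sum) pairs where the sum over the last
    window_size values exceeds threshold."""
    window = deque()
    total = 0
    alarms = []
    for t, value in enumerate(stream):
        window.append(value)
        total += value
        if len(window) > window_size:
            total -= window.popleft()  # expire the oldest value in O(1)
        if len(window) == window_size and total > threshold:
            alarms.append((t, total))
    return alarms


# Hypothetical per-time-unit event counts with a short burst in the middle.
counts = [1, 0, 1, 5, 6, 1, 0]
alarms = detect_bursts(counts, window_size=2, threshold=8)
```

The sketch needs the window size in advance. When the burst duration is unknown, one would have to run it for many window sizes at once, which is the cost that a multi-resolution index is meant to reduce.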

Finding similar patterns in a time series database is a well studied problem [1, 13]. The features of a time series sequence are extracted using a sliding window, and inserted into an index structure for query efficiency. However, such an approach is not adequate for data stream applications, since it requires a time consuming feature extraction step with each incoming data item. For this purpose, incremental feature extraction techniques that use the previous feature in computing the new feature have been proposed to accelerate per-item processing [30]. A batch technique can further decrease the per-item processing cost by computing a new feature periodically instead of every time unit [22]. A majority of these techniques assume a priori knowledge on query patterns. However, in a real world situation, a user might want to know all time periods during which the movement of a particular stock follows a certain interesting trend, which itself can be generated automatically by a particular application [26]. In order to address this issue, a multi-resolution indexing scheme has been proposed [16]. This work addresses off-line time series databases, and does not consider how well the proposed scheme extends to a real-time streaming algorithm.

Indexing and Querying Data Streams

Continuous queries that run indefinitely, unless a query lifetime has been specified, fit naturally into the mold of data stream applications. Examples of these queries include monitoring a set of conditions or events to occur, detecting a certain trend in the underlying raw data, or in general discovering relations between various components of a large real time system. The kinds of queries that are of interest from an application point of view can be listed as follows: (1) monitoring aggregates, (2) monitoring or finding patterns, and (3) detecting correlations. Each of these queries requires data management over some history of values, and not just over the most recently reported values [9]. For example, in the case of aggregate queries, the system monitors whether the current window aggregate deviates significantly from that aggregate in most time periods of the same size. In the case of correlation queries, the self-similar nature of sensor measurements may be reflected as temporal correlations at some resolution over the course of the stream [24]. Therefore, the system has to maintain historical data along with the current data in order to be able to answer these queries. A key vision in developing stream management systems of practical value is to interconnect queries in a monitoring infrastructure. For example, an unusual volatility of a stream may trigger an in-depth trend analysis. Unified system solutions can lay the ground for tomorrow's information infrastructures by providing users with a rich set of interconnected querying capabilities [8].

2 Indexing Streams

In this section, we introduce a multi-resolution indexing architecture, and then later in Section 3, show how it can be utilized to monitor user queries efficiently. The multi-resolution approach imposes an inherent restriction on what constitutes a meaningful query. The core part of the scheme is the feature extraction at multiple resolutions. A dynamic index structure is used to index features for query efficiency. The system architecture is shown in Figure 11.1. The key architectural aspects are:

• The features at higher resolutions are computed using the features at lower resolutions; therefore, all features are computed in a single pass.

• The system guarantees the accuracy provided to user queries by provable error bounds.

• The index structure has tunable parameters to trade accuracy for speed and space. The per-item processing cost and the space overhead can be tuned according to the application requirements by varying the update rate and the number of coefficients maintained in the index structure.

2.1 Preliminaries and definitions

We adopt the notation $x[i]$ to refer to the $i$-th entry of stream $x$, and $x[i_1 : i_2]$ to refer to the subsequence of entries at positions $i_1$ through $i_2$.

DEFINITION 2.1 A feature is the result of applying a characteristic function over a possibly normalized set of stream values in order to acquire a higher level information or concept.

Figure 11.1. The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data.

The widely used characteristic functions are (1) aggregate functions, such as summation, maximum, minimum, and average, (2) orthogonal transformations, such as discrete wavelet transform (DWT) and discrete Fourier transform (DFT), and (3) piecewise linear approximations. Normalization is performed in case of DWT, DFT, and linear approximations. The interested reader can refer to Sections 3.2 and 3.3 for more details.

2.2 Feature extraction

The features at a specific resolution are obtained with a sliding window of a fixed length $w$. The sliding window size doubles as we go up a resolution, i.e., a level. In the rest of the paper, we will use the terms "level" and "resolution" interchangeably. We denote a newly computed feature at resolution $i$ as $F_i$. Figure 11.2 shows an example where we have three resolutions with corresponding sliding window sizes of 2, 4 and 8. With each arrival of a new stream value, features $F_0$, $F_1$, and $F_2$, i.e., one for each resolution, can be computed. However, this requires maintaining all the stream values within a time window equal to the size of the largest sliding window, i.e., 8 in our running example. The per-item processing cost and the space required are linear in the size of the largest window.

Figure 11.2. Exact feature extraction, update rate T = 1.

For a given window $w$ of values $y = x[t - w + 1], \ldots, x[t]$, an incremental transformation $F(y)$ is used to compute features. The type of transformation $F$ depends on the monitoring query. For example, $F$ is SUM for burst detection, MAX-MIN for volatility detection, and DWT for detecting correlations and finding surprising patterns. For most real time series, the first $f$ ($f \ll w$) DWT coefficients retain most of the energy of the signal. Therefore, we can safely disregard all but the very first few coefficients to retain the salient features (e.g., the overall trend) of the original signal.
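The energy-compaction claim can be checked on a toy window. The sketch below assumes the orthonormal Haar wavelet and omits normalization, since the text does not name a wavelet here; the function names and the sample window are likewise illustrative.

```python
import math


def haar_dwt(values):
    """Orthonormal Haar DWT of a sequence whose length is a power of two.
    Output order: overall approximation first, then details from coarse to fine."""
    n = len(values)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details  # coarser details are placed in front
    return approx + details


def energy(coeffs):
    """Signal energy: the sum of squared values."""
    return sum(c * c for c in coeffs)


window = [2, 4, 6, 8, 10, 12, 14, 16]  # a window with a clear linear trend
coeffs = haar_dwt(window)
kept = energy(coeffs[:2]) / energy(coeffs)  # share of energy in the first f = 2 coefficients
```

Because the transform is orthonormal, the coefficients carry exactly the energy of the window, and for this trending window the first two of the eight coefficients hold roughly 95% of it. Noisier or oscillating windows compact less well, so the choice of $f$ remains application dependent.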

Figure 11.3. Incremental feature extraction, update rate T = 1.

Using an incremental transformation leads to a more efficient way of computing features at all resolutions. Level-1 features are computed using level-0 features, and level-2 features are computed using level-1 features. In general, we can use lower level features to compute higher level features [3]. Figure 11.3 depicts this new way of computation. This new algorithm has a lower per-item processing cost, since we can compute $F_1$ and $F_2$ in constant time. The following lemma establishes this result.

LEMMA 11.1 The new feature $F_j$ at level $j$ for the subsequence $x[t - w + 1 : t]$ can be computed "exactly" using the features $F'_{j-1}$ and $F_{j-1}$ at level $j - 1$ for the subsequences $x[t - w + 1 : t - w/2]$ and $x[t - w/2 + 1 : t]$, respectively.

Proof. $F_j$ is $\max(F'_{j-1}, F_{j-1})$, $\min(F'_{j-1}, F_{j-1})$, and $F'_{j-1} + F_{j-1}$ for MAX, MIN, and SUM, respectively. For DWT, see Lemma 11.4 in Section 2.4.
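The combination rule of Lemma 11.1 for SUM, MAX, and MIN can be sketched as follows. For clarity the sketch keeps every feature of every level in a dictionary, which is more than the scheme needs to retain; the function and parameter names are assumptions made for illustration.

```python
def multires_features(stream, base_window, levels, kind):
    """Return features[j][t]: the level-j feature of the window ending at time t.
    The level-j window size is base_window * 2**j."""
    direct = {"SUM": sum, "MAX": max, "MIN": min}[kind]
    combine = {"SUM": lambda a, b: a + b, "MAX": max, "MIN": min}[kind]
    features = [dict() for _ in range(levels)]
    for t in range(len(stream)):
        w = base_window
        if t - w + 1 >= 0:
            # Level 0 is computed directly from the raw values.
            features[0][t] = direct(stream[t - w + 1 : t + 1])
        for j in range(1, levels):
            w *= 2
            half = w // 2
            if t - w + 1 >= 0:
                # Combine the level-(j-1) features of the two halves of the window:
                # the one ending at t - w/2 and the one ending at t.
                features[j][t] = combine(features[j - 1][t - half], features[j - 1][t])
    return features


stream = [3, 1, 4, 1, 5, 9, 2, 6]
sums = multires_features(stream, base_window=2, levels=3, kind="SUM")
```

Each higher-level feature costs one combine operation regardless of the window size, whereas direct computation would scan the whole window.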

However, the space required for this scheme is also linear in the size of the largest window. The reason is that we need to maintain half of the features at the lower level to compute the feature at the upper level incrementally. If we can trade accuracy for space, then we can decrease the space overhead by computing features approximately. At each resolution level, every $c$ of the feature vectors are combined in a box, or in other words, a minimum bounding rectangle (MBR). Figure 11.4 depicts this scheme for $c = 2$. Since each MBR $B$ contains $c$ features, it has an extent along each dimension. In case of SUM, $B[1]$ corresponds to the smallest sum, and $B[2]$ corresponds to the largest sum among all $c$ sums. In general, $B[2i]$ denotes the low coordinate and $B[2i + 1]$ denotes the high coordinate along the $i$-th dimension. Note that for SUM, MAX and MIN, $B$ has a single dimension. However, for DWT the number of dimensions $f$ is application dependent.

Figure 11.4. Approximate feature extraction, update rate T = 1.

This new approach decreases the space overhead by a factor of $c$. Since the extent information of the MBRs is used in the computation, the newly computed feature will also be an extent. The following lemma proves this result.

LEMMA 11.2 The new feature $F_j$ at level $j$ can be computed "approximately" using the MBRs $B_1$ and $B_2$ that contain the features $F'_{j-1}$ and $F_{j-1}$ at level $j - 1$, respectively.

Proof. See Lemma 11.5 in Section 2.4.
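For the one-dimensional aggregates, the approximate computation from two MBRs can be sketched with simple interval arithmetic. This is an illustration for SUM and MAX only, under the assumption that an MBR is represented as a `(low, high)` pair; it does not cover the DWT case treated in Section 2.4, and all names are hypothetical.

```python
def mbr(features):
    """One-dimensional MBR (low, high) enclosing a box of features."""
    return (min(features), max(features))


def combine_sum(left_box, right_box):
    """Extent that is guaranteed to contain the exact SUM feature."""
    return (left_box[0] + right_box[0], left_box[1] + right_box[1])


def combine_max(left_box, right_box):
    """Extent that is guaranteed to contain the exact MAX feature."""
    return (max(left_box[0], right_box[0]), max(left_box[1], right_box[1]))


stream = [3, 1, 4, 1, 5, 9, 2, 6]
# Level-0 SUM features over windows of size 2, keyed by the window end time.
level0 = {t: stream[t - 1] + stream[t] for t in range(1, len(stream))}

# Box capacity c = 2: consecutive level-0 features share one MBR.
left_box = mbr([level0[3], level0[4]])   # box holding the feature that ends at t = 3
right_box = mbr([level0[5], level0[6]])  # box holding the feature that ends at t = 5

low, high = combine_sum(left_box, right_box)
exact = sum(stream[2:6])  # exact level-1 SUM over the window ending at t = 5
assert low <= exact <= high
```

Only the two box extents are stored instead of the individual features, so the result is an interval rather than a single value. The width of that interval is the accuracy given up in exchange for the space savings.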

Using MBRs instead of individual features exploits the fact that there is a strong spatio-temporal correlation between the consecutive features. Therefore, it is natural to extend the computation scheme to eliminate this redundancy. Instead of computing a new feature at each data arrival, one can employ a batch computation such that a new feature is computed periodically, every $T$ time units. This allows us to maintain features instead of MBRs. Figure 11.5 shows this scheme with $T = 2$. The new scheme has a clear advantage in terms of accuracy; however, it can dismiss potentially interesting events that may occur between the periods.
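The effect of the update rate can be seen on a toy stream. The sketch below computes SUM features directly rather than incrementally, and the stream values and function name are assumptions chosen so that a short peak falls between two update times.

```python
def window_sums(stream, window, update_rate=1):
    """SUM features keyed by window end time, computed only every
    update_rate time units, starting from the first full window."""
    first = window - 1
    return {
        t: sum(stream[t - window + 1 : t + 1])
        for t in range(first, len(stream))
        if (t - first) % update_rate == 0
    }


stream = [0, 9, 9, 0, 0, 0]
every_item = window_sums(stream, window=2, update_rate=1)
batched = window_sums(stream, window=2, update_rate=2)
```

With `update_rate=1` the window ending at time 2 reports the peak sum of 18. With `update_rate=2` features are computed only at times 1, 3, and 5, the largest reported sum is 9, and a threshold placed between 9 and 18 would never fire.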

Figure 11.5. Incremental feature extraction, update rate T = 2.

Depending on the box capacity and the update rate $T_j$ at a given level $j$ (the rate at which we compute a new feature), there are two general feature computation algorithms:


References

[2] A. Akella, A. Bharambe, M. Reiter, and S. Seshan. Detecting DDoS attacks on ISP networks. In MPDS, 2003.
[3] A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336-347, 2004.
[4] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable Application Layer Multicast. In SIGCOMM, pages 205-217, 2002.
[5] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322-331, 1990.
[6] J. Bentley, B. Weide, and A. Yao. Optimal expected time algorithms for closest point problems. In ACM Trans. on Math. Software, volume 6, pages 563-580, 1980.
[7] A. Bulut and A. Singh. SWAT: Hierarchical stream summarization in large networks. In ICDE, pages 303-314, 2003.
[8] A. Bulut and A. Singh. A unified framework for monitoring data streams in real time. In ICDE, pages 44-55, 2005.
[12] P. Dinda. CMU, Aug 97 Load Trace. In Host Load Data Archive http://www.cs.northwestern.edu/~pdinda/LoadTraces/.
[13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419-429, 1994.
[14] C. Guestrin, P. Bodik, R. Thibaux, M. Paskin, and S. Madden. Distributed regression: an efficient framework for modeling sensor network data. In IPSN, pages 1-10, 2004.
[15] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47-57, 1984.
[16] T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, pages 273-282, 2001.
[17] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 151-162, 2001.
[18] E. Keogh and T. Folias. Time Series Data Mining Archive. In http://www.cs.ucr.edu/~eamonn/TSDMA, 2002.
[19] Y. Law, H. Wang, and C. Zaniolo. Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.
[22] Y. Moon, K. Whang, and W. Han. General match: a subsequence matching method in time-series databases based on generalized windows. In SIGMOD, pages 382-393, 2002.
[23] T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338-349, 2004.