A Survey of Join Processing in Data Streams
few S2 tuples, while another tuple may behave in the exact opposite way. The first issue is tackled by [5], who provided a family of algorithms for adaptively finding the optimal order in which to apply a series of filters (joining a tuple with a stream can be regarded as subjecting the tuple to a filter) through runtime profiling. In particular, the A-Greedy algorithm is able to capture correlations among filter selectivities, and is guaranteed to converge to an ordering within a constant factor of the optimal. The theoretical guarantee extends to star joins; for general join graphs, A-Greedy can still be used, but the theoretical guarantee no longer holds. The second issue was recently addressed by an approach called CBR [9], or content-based routing, which makes the choice of query plan dependent on the values of the incoming tuple's "classifier attributes," whose values strongly correlate with operator selectivities. In effect, CBR is able to process each incoming tuple with a customized query plan.
One problem with MJoin is that it may incur a significant amount of recomputation. Consider again the four-way join among S1, ..., S4, now processed by a single MJoin operator. Whenever a new tuple s3 arrives in S3, MJoin in effect executes the query S1 ⋈ S2 ⋈ {s3} ⋈ S4; similarly, whenever a new tuple s4 arrives in S4, MJoin executes S1 ⋈ S2 ⋈ S3 ⋈ {s4}. The common subquery S1 ⋈ S2 is processed over and over again for these S3 and S4 tuples. In contrast, the XJoin plan ((S1 XJoin S2) XJoin S3) XJoin S4 materializes all its intermediate results in hash tables, including S1 ⋈ S2; new tuples from S3 and S4 simply have to probe this hash table, thereby avoiding recomputation. The optimal solution may well lie between these two extremes, as pointed out by [6]. They proposed an adaptive caching strategy, A-Caching, which starts with MJoins and adds join subresult caches adaptively. A-Caching profiles cache benefit and cost online, selects caches dynamically, and allocates memory to caches dynamically. With this approach, the entire spectrum of caching options from MJoins to XJoins can be explored.
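The contrast between recomputation and subresult caching can be illustrated with a small sketch. It is not the implementation of MJoin, XJoin, or A-Caching: it assumes, for simplicity, that all four streams join on one shared key, that tuples are (key, payload) pairs, and that the cache is built once rather than maintained as S1 and S2 change. All function names are hypothetical.

```python
from itertools import product

def build_index(tuples):
    """Hash index over one stream's state: join key -> list of payloads."""
    index = {}
    for key, payload in tuples:
        index.setdefault(key, []).append(payload)
    return index

def mjoin_on_s3(s3_tuple, idx1, idx2, idx4):
    """MJoin style: S1 ⋈ S2 is re-enumerated for every arriving S3 tuple."""
    key, p3 = s3_tuple
    return [(p1, p2, p3, p4)
            for p1, p2, p4 in product(idx1.get(key, []),
                                      idx2.get(key, []),
                                      idx4.get(key, []))]

def build_s1_s2_cache(idx1, idx2):
    """Cached style: materialize the subresult S1 ⋈ S2 once, keyed by join key."""
    return {key: list(product(idx1[key], idx2[key]))
            for key in idx1.keys() & idx2.keys()}

def cached_join_on_s3(s3_tuple, cache12, idx4):
    """Probe the materialized S1 ⋈ S2 instead of recomputing it."""
    key, p3 = s3_tuple
    return [(p1, p2, p3, p4)
            for (p1, p2) in cache12.get(key, [])
            for p4 in idx4.get(key, [])]
```

Both functions return the same join results; the cached variant trades memory (and, in a real system, cache maintenance cost) for fewer repeated probes, which is the trade-off that A-Caching is described as profiling online.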
A number of other papers also consider multi-way stream joins. Golab and Özsu [23] studied processing and optimization of multi-way sliding-window joins. Traditionally, we eagerly remove (expire) tuples that are no longer part of the sliding window, and eagerly generate output tuples whenever input arrives. The authors proposed algorithms supporting lazy expiration and lazy evaluation as alternatives, which achieve higher efficiency at the expense of higher memory requirements and longer response times, respectively. Hammad et al. [27] considered multi-way stream joins where a time-based window constraint can be specified for each pair (or, in general, subset) of input streams. An interesting algorithm called FEW is proposed, which computes a forward point in time before which all arriving tuples can join, thereby avoiding repeated checking of window constraints.
Eddies [3] are a novel approach towards stream query processing and optimization that is markedly different from the standard plan-based approaches.
DATA STREAMS: MODELS AND ALGORITHMS
Eddies eliminate query plans entirely by routing each input tuple adaptively across the operators that need to process it. Interestingly, in eddies, the behavior of SteM [36] mimics that of MJoin, while STAIRS [16] is able to emulate XJoin. Note that while eddies provide the mechanisms for adapting the processing strategy on an individual tuple basis, currently their policies typically do not result in plans that change for every incoming tuple. It would be nice to see how features of CBR can be supported in eddies.
6 Conclusion
In this chapter, we have presented an overview of research problems and recent advances in join processing for data streams. Stream processing is a young and exciting research area, yet it also has roots in and connections to well-established areas in databases as well as computer science in general. In Section 3.2, we have already discussed the relationship between stream join state management and classic caching. Now, let us briefly re-examine parts of this chapter in light of their relationship to materialized views [25].
The general connection between stream processing and materialized views has long been identified [8]. This connection is reflected in the way that we specify the semantics of stream joins: by regarding them as views and defining their output as the view update stream resulting from base relation updates (Section 2). Recall that the standard semantics requires the output sequence to reflect the exact sequence of states of the underlying view, which is analogous to the notion of complete and strong consistency of a data warehouse view with respect to its source relations [46]. The connection does not stop at the semantics. The problem of determining what needs to be retained in the state to compute a stream join is analogous to the problem of deriving auxiliary views to make a join view self-maintainable [35]. Just as constraints can be used to reduce stream join state (Section 3.1), they have also been used to help expire data from data warehouses without affecting the maintainability of warehouse views [21]. For a stream join S1 ⋈ ... ⋈ Sn, processing an incoming tuple from stream Si is analogous to maintaining a join view incrementally by evaluating a maintenance query S1 ⋈ ... ⋈ ΔSi ⋈ ... ⋈ Sn. Since there are n different forms of maintenance queries (one for each i), it is natural to optimize each form differently, which echoes the intuition behind the asymmetric processing strategy of [30] and MJoin [43]. In fact, we can optimize the maintenance query for each instance of ΔSi, which would achieve the same goal of supporting a customized query plan for each tuple as CBR [9]. Finally, noticing that the maintenance queries run frequently and share many common subqueries, we may choose to materialize some subqueries as additional views to improve query performance, which is also what A-Caching [6] tries to accomplish.
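The analogy with incremental view maintenance rests on a standard identity, written out here as a worked equation. It assumes insert-only updates, with ΔS_i denoting the batch of tuples newly arrived on stream S_i; deletions and window expirations are not covered by this sketch.

```latex
\[
S_1 \bowtie \cdots \bowtie (S_i \cup \Delta S_i) \bowtie \cdots \bowtie S_n
\;=\;
(S_1 \bowtie \cdots \bowtie S_i \bowtie \cdots \bowtie S_n)
\;\cup\;
(S_1 \bowtie \cdots \bowtie \Delta S_i \bowtie \cdots \bowtie S_n)
\]
```

The first term on the right is the view before the update, so only the second term, the maintenance query, has to be evaluated when new tuples arrive.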
Of course, despite high-level similarities, techniques from the two areas (data streams and materialized views) may still differ significantly in actual details. Nonetheless, it would be nice to develop a general framework that unifies both areas, or, less ambitiously, to apply ideas from one area to the other. Many such possibilities exist. For example, methods and insights from the well-studied problems of answering queries using views [26] and view selection [14] could be extended and applied to data streams: Given a set of stream queries running continuously in a system, what materialized views (over join states and database relations) and/or additional stream queries can we create to improve the performance of the system? Another area is distributed stream processing. Distributed stream processing can be regarded as view maintenance in a distributed setting, which has been studied extensively in the context of data warehousing. Potentially applicable in this setting are techniques for making warehouses self-maintainable [35], optimizing view maintenance queries across distributed sources [31], ensuring consistency of multi-source warehouse views [46], etc. Conversely, stream processing techniques can be applied to materialized views as well. In particular, view maintenance could benefit from optimization techniques that exploit update stream statistics (Section 3.2). Also, selection of materialized views for performance can be improved by adaptive caching techniques (Section 5).
Besides the future work directions mentioned above and throughout the chapter, another important direction worth exploring is the connection between data stream processing and distributed event-based systems [19] such as publish/subscribe systems. Such systems need to scale to thousands or even millions of subscriptions, which are essentially continuous queries over event streams. While efficient techniques for handling continuous selections already exist, scalable processing of continuous joins remains a challenging problem. Hammad et al. [28] considered shared processing of stream joins with identical join conditions but different sliding-window durations. We need to consider more general query forms, e.g., joins with different join conditions as well as additional selection conditions on input streams. NiagaraCQ [13] and CACQ [32] are able to group-process selections and share processing of identical join operations. However, there is no group or shared processing of joins with different join conditions, and processing selections separately from joins limits optimization potential. PSoup [11] treats queries as data, thereby allowing set-oriented processing of queries with arbitrary join and selection conditions. Still, new indexing and processing techniques must be developed for the system to be able to process each event in time sublinear in the number of subscriptions.
Acknowledgments
This work is supported by an NSF CAREER Award under grant IIS-0238386. We would also like to thank Shivnath Babu, Yuguo Chen, Kamesh Munagala, and members of the Duke Database Research Group for their discussions.
References
[1] Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. (2002). Characterizing memory requirements for queries over continuous data streams. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems, pages 221-232, Madison, Wisconsin, USA.
[2] Arasu, A., Babu, S., and Widom, J. (2003). The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, InfoLab, Stanford University.
[3] Avnur, R. and Hellerstein, J. M. (2000). Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 261-272, Dallas, Texas, USA.
[4] Ayad, A. and Naughton, J. F. (2004). Static optimization of conjunctive queries with sliding windows over infinite streams. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 419-430, Paris, France.
[5] Babu, S., Motwani, R., Munagala, K., Nishizawa, I., and Widom, J. (2004a). Adaptive ordering of pipelined stream filters. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 407-418, Paris, France.
[6] Babu, S., Munagala, K., Widom, J., and Motwani, R. (2005). Adaptive caching for continuous queries. In Proceedings of the 2005 International Conference on Data Engineering, Tokyo, Japan.
[7] Babu, S., Srivastava, U., and Widom, J. (2004b). Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Transactions on Database Systems, 29(3):545-580.
[8] Babu, S. and Widom, J. (2001). Continuous queries over data streams. ACM SIGMOD Record.
[9] Bizarro, P., Babu, S., DeWitt, D., and Widom, J. (2005). Content-based routing: Different plans for different data. In Proceedings of the 2005 International Conference on Very Large Data Bases, Trondheim, Norway.
[10] Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., and Zdonik, S. B. (2002). Monitoring streams - a new class of data management applications. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 215-226, Hong Kong, China.
[11] Chandrasekaran, S. and Franklin, M. J. (2003). PSoup: a system for streaming queries over streaming data. The VLDB Journal, 12(2):140-156.
[12] Chaudhuri, S., Motwani, R., and Narasayya, V. R. (1999). On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 263-274, Philadelphia, Pennsylvania, USA.
[13] Chen, J., DeWitt, D. J., Tian, F., and Wang, Y. (2000). NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 379-390, Dallas, Texas, USA.
[14] Chirkova, R., Halevy, A. Y., and Suciu, D. (2001). A formal perspective on the view selection problem. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 59-68, Roma, Italy.
[15] Das, A., Gehrke, J., and Riedewald, M. (2003). Approximate join processing over data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 40-51, San Diego, California, USA.
[16] Deshpande, A. and Hellerstein, J. M. (2004). Lifting the burden of history from adaptive query processing. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 948-959, Toronto, Canada.
[17] Ding, L., Mehta, N., Rundensteiner, E., and Heineman, G. (2004). Joining punctuated streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.
[18] Ding, L. and Rundensteiner, E. A. (2004). Evaluating window joins over punctuated streams. In Proceedings of the 2004 International Conference on Information and Knowledge Management, pages 98-107, Washington DC, USA.
[19] Dingel, J. and Strom, R., editors (2005). Proceedings of the 2005 International Workshop on Distributed Event Based Systems, Columbus, Ohio, USA.
[20] Dittrich, J.-P., Seeger, B., Taylor, D. S., and Widmayer, P. (2002). Progressive merge join: A generic and non-blocking sort-based join algorithm. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 299-310, Hong Kong, China.
[21] Garcia-Molina, H., Labio, W., and Yang, J. (1998). Expiring data in a warehouse. In Proceedings of the 1998 International Conference on Very Large Data Bases, pages 500-511, New York City, New York, USA.
[22] Golab, L., Garg, S., and Özsu, T. (2004). On indexing sliding windows over on-line data streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.
[23] Golab, L. and Özsu, M. T. (2003). Processing sliding window multi-joins in continuous queries over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 500-511, Berlin, Germany.
[24] Golab, L. and Özsu, M. T. (2005). Update-pattern-aware modeling and processing of continuous queries. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 658-669, Baltimore, Maryland, USA.
[25] Gupta, A. and Mumick, I. S., editors (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press.
[26] Halevy, A. Y. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4):270-294.
[27] Hammad, M. A., Aref, W. G., and Elmagarmid, A. K. (2003a). Stream window join: Tracking moving objects in sensor-network databases. In Proceedings of the 2003 International Conference on Scientific and Statistical Database Management, pages 75-84, Cambridge, Massachusetts, USA.
[28] Hammad, M. A., Franklin, M. J., Aref, W. G., and Elmagarmid, A. K. (2003b). Scheduling for shared window joins over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 297-308, Berlin, Germany.
[29] Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. (1999). An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 299-310, Philadelphia, Pennsylvania, USA.
[30] Kang, J., Naughton, J. F., and Viglas, S. (2003). Evaluating window joins over unbounded streams. In Proceedings of the 2003 International Conference on Data Engineering, pages 341-352, Bangalore, India.
[31] Liu, B. and Rundensteiner, E. A. (2005). Cost-driven general join view maintenance over distributed data sources. In Proceedings of the 2005 International Conference on Data Engineering, pages 578-579, Tokyo, Japan.
[32] Madden, S., Shah, M. A., Hellerstein, J. M., and Raman, V. (2002). Continuously adaptive continuous queries over streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA.
[33] Mokbel, M. F., Lu, M., and Aref, W. G. (2004). Hash-merge join: A non-blocking join algorithm for producing fast and early join results. In Proceedings of the 2004 International Conference on Data Engineering, pages 251-263, Boston, Massachusetts, USA.
[34] Olken, F. (1993). Random Sampling from Databases. PhD thesis, University of California at Berkeley.
[35] Quass, D., Gupta, A., Mumick, I. S., and Widom, J. (1996). Making views self-maintainable for data warehousing. In Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems, pages 158-169, Miami Beach, Florida, USA.
[36] Raman, V., Deshpande, A., and Hellerstein, J. M. (2003). Using state modules for adaptive query processing. In Proceedings of the 2003 International Conference on Data Engineering, pages 353-364, Bangalore, India.
[37] Srivastava, U. and Widom, J. (2004). Memory-limited execution of windowed stream joins. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 324-335, Toronto, Canada.
[38] Tao, Y., Yiu, M. L., Papadias, D., Hadjieleftheriou, M., and Mamoulis, N. (2005). RPJ: Producing fast join results on streams through rate-based optimization. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 371-382, Baltimore, Maryland, USA.
[39] Tatbul, N., Cetintemel, U., Zdonik, S. B., Cherniack, M., and Stonebraker, M. (2003). Load shedding in a data stream manager. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 309-320, Berlin, Germany.
[40] Tucker, P. A., Maier, D., Sheard, T., and Fegaras, L. (2003). Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):555-568.
[41] Urhan, T. and Franklin, M. J. (2001). Dynamic pipeline scheduling for improving interactive query performance. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 501-510, Roma, Italy.
[42] Viglas, S. D. and Naughton, J. F. (2002). Rate-based query optimization for streaming information sources. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 37-48, Madison, Wisconsin, USA.
[43] Viglas, S. D., Naughton, J. F., and Burger, J. (2003). Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 285-296, Berlin, Germany.
[44] Wilschut, A. N. and Apers, P. M. G. (1991). Dataflow query execution in a parallel main-memory environment. In Proceedings of the 1991 International Conference on Parallel and Distributed Information Systems, pages 68-77, Miami Beach, Florida, USA.
[45] Xie, J., Yang, J., and Chen, Y. (2005). On joining and caching stochastic streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 359-370, Baltimore, Maryland, USA.
[46] Zhuge, Y., Garcia-Molina, H., and Wiener, J. L. (1998). Consistency algorithms for multi-source warehouse view maintenance. Distributed and Parallel Databases, 6(1):7-40.
Department of Computer Science
University of California Santa Barbara
Santa Barbara, CA 93106
ambuj@cs.ucsb.edu
Abstract Online monitoring of data streams poses a challenge in many data-centric applications including network traffic management, trend analysis, web-click streams, intrusion detection, and sensor networks. Indexing techniques used in these applications have to be time and space efficient while providing a high quality of answers to user queries: (1) queries that monitor aggregates, such as finding surprising levels ("volatility" of a data stream) and detecting bursts, and (2) queries that monitor trends, such as detecting correlations and finding similar patterns. Data stream indexing becomes an even more challenging task when we take into account the dynamic nature of the underlying raw data. For example, bursts of events can occur at variable temporal modalities from hours to days to weeks. We focus on a multi-resolution indexing architecture. The architecture enables the discovery of "interesting" behavior online, provides flexibility in user query definitions, and interconnects registered queries for real-time and in-depth analysis.
Keywords: stream indexing, monitoring real-time systems, mining continuous data flows, multi-resolution index, synopsis maintenance, trend analysis, network traffic analysis
1 Introduction
Raw stream data, such as faults and alarms generated by network traffic monitors and log records generated by web servers, are almost always at a low level and too large to maintain in main memory. One can instead summarize the data and compute synopsis structures at meaningful abstraction levels on the fly. The synopsis is a small-space data structure, and can be updated incrementally as new stream values arrive. Later in the operational cycle, it can be used to discover interesting behavior, which prompts in-depth analysis at lower levels of abstraction [10].
Consider the following application in astrophysics: the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. This yields an unusually high number of detectable events (high-energy photons) over a certain time period, which indicates the existence of a Gamma Ray Burst. If we know the duration of the shower, we can maintain a count of the total number of events over sliding windows of the known window size and raise an alarm when the moving sum is above a threshold. Unfortunately, in many cases, we cannot predict the duration of the burst period. The burst of high-energy photons might last for a few milliseconds, a few hours, or even a few days [31].
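The fixed-duration case described above (a moving sum over a window of known size, with an alarm when the sum exceeds a threshold) can be sketched as follows. The function name, the parameters, and the use of per-time-unit event counts as input are assumptions made for illustration; the sketch does not address the harder case of unknown burst duration.

```python
from collections import deque

def detect_bursts(counts, window, threshold):
    """Return the time indices at which the sum of the last `window`
    event counts is above `threshold` (only full windows are checked)."""
    recent = deque()      # counts currently inside the sliding window
    moving_sum = 0
    alarms = []
    for t, count in enumerate(counts):
        recent.append(count)
        moving_sum += count
        if len(recent) > window:
            moving_sum -= recent.popleft()   # expire the oldest count
        if len(recent) == window and moving_sum > threshold:
            alarms.append(t)
    return alarms
```

Each arrival costs constant time, but the window must be chosen in advance; monitoring many candidate durations at once is what motivates the multi-resolution approach.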
Finding similar patterns in a time series database is a well-studied problem [1, 13]. The features of a time series sequence are extracted using a sliding window, and inserted into an index structure for query efficiency. However, such an approach is not adequate for data stream applications, since it requires a time-consuming feature extraction step with each incoming data item. For this purpose, incremental feature extraction techniques that use the previous feature in computing the new feature have been proposed to accelerate per-item processing [30]. A batch technique can further decrease the per-item processing cost by computing a new feature periodically instead of every time unit [22]. A majority of these techniques assume a priori knowledge of query patterns. However, in a real-world situation, a user might want to know all time periods during which the movement of a particular stock follows a certain interesting trend, which itself can be generated automatically by a particular application [26]. In order to address this issue, a multi-resolution indexing scheme has been proposed [16]. This work addresses off-line time series databases, and does not consider how well the proposed scheme extends to a real-time streaming algorithm.
Indexing and Querying Data Streams
Continuous queries that run indefinitely, unless a query lifetime has been specified, fit naturally into the mold of data stream applications. Examples of these queries include monitoring a set of conditions or events to occur, detecting a certain trend in the underlying raw data, or in general discovering relations between various components of a large real-time system. The kinds of queries that are of interest from an application point of view can be listed as follows: (1) monitoring aggregates, (2) monitoring or finding patterns, and (3) detecting correlations. Each of these queries requires data management over some history of values, and not just over the most recently reported values [9]. For example, in the case of aggregate queries, the system monitors whether the current window aggregate deviates significantly from that aggregate in most time periods of the same size. In the case of correlation queries, the self-similar nature of sensor measurements may be reflected as temporal correlations at some resolution over the course of the stream [24]. Therefore, the system has to maintain historical data along with the current data in order to be able to answer these queries.
A key vision in developing stream management systems of practical value is to interconnect queries in a monitoring infrastructure. For example, an unusual volatility of a stream may trigger an in-depth trend analysis. Unified system solutions can lay the groundwork for tomorrow's information infrastructures by providing users with a rich set of interconnected querying capabilities [8].
2 Indexing Streams
In this section, we introduce a multi-resolution indexing architecture, and then later in Section 3, show how it can be utilized to monitor user queries efficiently. The multi-resolution approach imposes an inherent restriction on what constitutes a meaningful query. The core part of the scheme is the feature extraction at multiple resolutions. A dynamic index structure is used to index features for query efficiency. The system architecture is shown in Figure 11.1. The key architectural aspects are:
• The features at higher resolutions are computed using the features at lower resolutions; therefore, all features are computed in a single pass.
• The system guarantees the accuracy provided to user queries by provable error bounds.
• The index structure has tunable parameters to trade accuracy for speed and space. The per-item processing cost and the space overhead can be tuned according to the application requirements by varying the update rate and the number of coefficients maintained in the index structure.
2.1 Preliminaries and definitions
We adopt the notation x[i] to refer to the i-th entry of stream x, and x[i1 : i2] to refer to the subsequence of entries at positions i1 through i2.
DEFINITION 2.1 A feature is the result of applying a characteristic function over a possibly normalized set of stream values in order to acquire higher-level information or a higher-level concept.
Figure 11.1. The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data.
The widely used characteristic functions are (1) aggregate functions, such as summation, maximum, minimum, and average, (2) orthogonal transformations, such as the discrete wavelet transform (DWT) and the discrete Fourier transform (DFT), and (3) piecewise linear approximations. Normalization is performed in the case of DWT, DFT, and linear approximations. The interested reader can refer to Sections 3.2 and 3.3 for more details.
2.2 Feature extraction
The features at a specific resolution are obtained with a sliding window of a fixed length w. The sliding window size doubles as we go up a resolution, i.e., a level. In the rest of the paper, we will use the terms "level" and "resolution" interchangeably. We denote a newly computed feature at resolution i as $F_i$. Figure 11.2 shows an example where we have three resolutions with corresponding sliding window sizes of 2, 4, and 8. With each arrival of a new stream value, features $F_0$, $F_1$, and $F_2$, i.e., one for each resolution, can be computed. However, this requires maintaining all the stream values within a time window equal to the size of the largest sliding window, i.e., 8 in our running example. The per-item processing cost and the space required are linear in the size of the largest window.
Figure 11.2. Exact feature extraction, update rate T = 1.
For a given window w of values y = x[t - w + 1], ..., x[t], an incremental transformation F(y) is used to compute features. The type of transformation F depends on the monitoring query. For example, F is SUM for burst detection, MAX-MIN for volatility detection, and DWT for detecting correlations and finding surprising patterns. For most real time series, the first f (f << w) DWT coefficients retain most of the energy of the signal. Therefore, we can safely disregard all but the very first few coefficients to retain the salient features (e.g., the overall trend) of the original signal.
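The energy-concentration property can be checked on a small example. The sketch below assumes the orthonormal Haar wavelet (the text does not say which wavelet family is used), a window length that is a power of two, and no normalization step; `haar_dwt` and `energy_fraction` are hypothetical helper names.

```python
import math

def haar_dwt(values):
    """Orthonormal Haar DWT of a sequence whose length is a power of two.
    Coefficients are ordered coarse to fine: the overall average term first,
    then detail coefficients from the coarsest to the finest scale."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details   # coarser details go in front
    return approx + details

def energy_fraction(coeffs, f):
    """Fraction of the signal energy held by the first f coefficients."""
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:f]) / total if total else 1.0
```

For the smooth ramp 1, 2, ..., 8, the first two of the eight coefficients hold 194/204 (about 95%) of the energy, so keeping f = 2 coefficients preserves the overall trend while discarding most of the fine detail. Signals with little smoothness would concentrate less energy in the leading coefficients.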
Figure 11.3. Incremental feature extraction, update rate T = 1.
Using an incremental transformation leads to a more efficient way of computing features at all resolutions. Level-1 features are computed using level-0 features, and level-2 features are computed using level-1 features. In general, we can use lower level features to compute higher level features [3]. Figure 11.3 depicts this new way of computation. This new algorithm has a lower per-item processing cost, since we can compute $F_1$ and $F_2$ in constant time. The following lemma establishes this result.
LEMMA 11.1 The new feature $F_j$ at level $j$ for the subsequence x[t - w + 1 : t] can be computed "exactly" using the features $F^1_{j-1}$ and $F^2_{j-1}$ at level $j - 1$ for the subsequences x[t - w + 1 : t - w/2] and x[t - w/2 + 1 : t], respectively.
Proof. $F_j$ is $\max(F^1_{j-1}, F^2_{j-1})$, $\min(F^1_{j-1}, F^2_{j-1})$, and $F^1_{j-1} + F^2_{j-1}$ for MAX, MIN, and SUM, respectively. For DWT, see Lemma 11.4 in Section 2.4.
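The lemma can be illustrated for SUM, MAX, and MIN with a short sketch. It assumes 0-based positions, an update rate of T = 1, and that all features of every level are kept in memory (which is more than a streaming implementation would retain); the function and variable names are hypothetical.

```python
COMBINE = {
    "SUM": lambda left, right: left + right,
    "MAX": max,
    "MIN": min,
}
REDUCE = {"SUM": sum, "MAX": max, "MIN": min}

def multires_features(x, base_window, num_levels, kind):
    """levels[j][t] is the feature over the window of size base_window * 2**j
    ending at position t, or None if that window is not yet full."""
    combine, reduce_fn = COMBINE[kind], REDUCE[kind]
    w = base_window
    # Level 0 is computed directly from the raw values.
    levels = [[reduce_fn(x[t - w + 1 : t + 1]) if t >= w - 1 else None
               for t in range(len(x))]]
    for _ in range(1, num_levels):
        prev = levels[-1]
        cur = []
        for t in range(len(x)):
            # The window of size 2*w ending at t splits into two halves of
            # size w, ending at t - w and at t; combine them in constant time.
            if t - w >= 0 and prev[t - w] is not None:
                cur.append(combine(prev[t - w], prev[t]))
            else:
                cur.append(None)
        levels.append(cur)
        w *= 2
    return levels
```

Only level 0 touches the raw stream values; every higher-level feature is obtained from two features one level below, which is why its per-item cost is constant.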
However, the space required for this scheme is also linear in the size of the largest window. The reason is that we need to maintain half of the features at the lower level to compute the feature at the upper level incrementally. If we can trade accuracy for space, then we can decrease the space overhead by computing features approximately. At each resolution level, every c of the feature vectors are combined in a box, or in other words, a minimum bounding rectangle (MBR). Figure 11.4 depicts this scheme for c = 2. Since each MBR B contains c features, it has an extent along each dimension. In the case of SUM, B[1] corresponds to the smallest sum, and B[2] corresponds to the largest sum among all c sums. In general, B[2i] denotes the low coordinate and B[2i + 1] denotes the high coordinate along the i-th dimension. Note that for SUM, MAX, and MIN, B has a single dimension. However, for DWT the number of dimensions f is application dependent.
Figure 11.4. Approximate feature extraction, update rate T = 1.
This new approach decreases the space overhead by a factor of c. Since the extent information of the MBRs is used in the computation, the newly computed feature will also be an extent. The following lemma establishes this result.
LEMMA 11.2 The new feature $F_j$ at level $j$ can be computed "approximately" using the MBRs $B_1$ and $B_2$ that contain the features $F^1_{j-1}$ and $F^2_{j-1}$ at level $j - 1$, respectively.
Proof. See Lemma 11.5 in Section 2.4.
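For the one-dimensional SUM case, the idea of computing an extent from two MBRs can be sketched with plain interval arithmetic. This is an assumption made for illustration, not a reproduction of Lemma 11.5, which is not shown here; an MBR is represented as a hypothetical (low, high) pair.

```python
def mbr(features):
    """One-dimensional MBR (extent) of a group of scalar features."""
    return (min(features), max(features))

def sum_extent(b1, b2):
    """Extent guaranteed to contain f1 + f2 for any f1 in b1 and f2 in b2."""
    return (b1[0] + b2[0], b1[1] + b2[1])

# Level-0 SUM features (window size 2) over the stream 3, 1, 4, 1, 5, 9, 2, 6,
# listed for window end positions t = 1, ..., 7.
level0 = [4, 5, 5, 6, 14, 11, 8]

# Box capacity c = 2: consecutive features are grouped into MBRs.
box_a = mbr(level0[0:2])   # features ending at t = 1, 2
box_b = mbr(level0[2:4])   # features ending at t = 3, 4

# The level-1 sums ending at t = 3 and t = 4 each combine one feature from
# box_a with one from box_b, so both must fall inside this extent.
level1_extent = sum_extent(box_a, box_b)
```

Here `level1_extent` is (9, 11), and the exact level-1 sums 4 + 5 = 9 and 5 + 6 = 11 both lie inside it. The width of the extent is the accuracy given up in exchange for storing one box per c features.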
Using MBRs instead of individual features exploits the fact that there is a strong spatio-temporal correlation between consecutive features. Therefore, it is natural to extend the computation scheme to eliminate this redundancy. Instead of computing a new feature at each data arrival, one can employ a batch computation such that a new feature is computed periodically, every T time units. This allows us to maintain features instead of MBRs. Figure 11.5 shows this scheme with T = 2. The new scheme has a clear advantage in terms of accuracy; however, it can miss potentially interesting events that may occur between the periods.
Figure 11.5. Incremental feature extraction, update rate T = 2.
Depending on the box capacity and the update rate Tj at a given level j
(the rate at which we compute a new feature), there are two general feature
computation algorithms: