A parallel dimensionality reduction for time-series data and some of its applications
Hoang Chi Thanh*
Department of Informatics, Hanoi University of Science, VNUH,
334 Nguyen Trai Rd., Hanoi, Vietnam
E-mail: thanhhc@vnu.vn
*Corresponding author

Nguyen Quang Thanh
Da Nang Department of Information and Communication,
15 Quang Trung Str., Da Nang, Vietnam
E-mail: thanhnq@dsp.vn
Abstract: Subsequence matching in a large time-series database is an interesting problem. Many methods have been proposed that cope with this problem to an adequate extent. One good idea is to reduce the dimensionality of the time-series data properly. In this paper, we propose a new method for reducing the dimensionality of high-dimensional time-series data. The method is simpler than existing ones based on the discrete Fourier transform and the discrete cosine transform. Furthermore, our dimensionality reduction may be executed in parallel. The method is applied to the time-series data matching problem, where it drastically decreases the complexity of the corresponding algorithm. The method preserves planar geometric blocks and is also applied to minimum bounding rectangles.
Keywords: time-series data; database; dimensionality reduction; matching
problem; minimum bounding rectangle; MBR
Reference to this paper should be made as follows: Thanh, H-C. and Thanh, N-Q. (2011) ‘A parallel dimensionality reduction for time-series data and some of its applications’, Int. J. Intelligent Information and Database Systems, Vol. 5, No. 1, pp.39–48.
Biographical notes: Hoang Chi Thanh is an Associate Professor at Hanoi University of Science, Vietnam. He received his PhD in Computer Science from Warsaw Technical University, Poland and his BSc in Computational Mathematics from The University of Hanoi, Vietnam. Since 1974 he has been working for The University of Hanoi (currently Hanoi University of Science). From 2000 to 2008 he was the Head of the Department of Informatics. Since 2004 he has been the Director of Science Co., Ltd. He has published more than 40 refereed papers and eight books. He is the supervisor of three PhD students. His current research interests include concurrency theory, combinatorics, data mining and knowledge-based systems.
Nguyen Quang Thanh is a PhD student at Hanoi University of Science, Vietnam. He received his MSc in Information Technology and his BSc in Mathematics from Can Tho University, Vietnam. Since 1999 he has been working for Da Nang Department of Information and Communication, Vietnam. His research interests include data mining, knowledge-based systems and network security.
1 Introduction
Time-series data are sequences of real numbers representing values at specific points in time. For example, the bid and ask prices of stock items, exchange rates, weather data and human speech signals are typical time-series data. The data stored in a database are called data sequences. The aim of the subsequence matching problem in a large time-series database is to find the data sequences in the database that are similar to a given query sequence. This problem has attracted a lot of interest because of its applications.
Many methods have been proposed that cope with this problem to an adequate extent (Agrawal et al., 1993; Keogh et al., 2000; Keogh et al., 2001; Faloutsos et al., 2001; Moon et al., 2002). One good idea for increasing the matching speed is a proper dimensionality reduction for high-dimensional time-series data. In 2007, Moon proposed a data transformation based on the discrete Fourier transform, and then Moon and Kim presented a data transformation based on the discrete cosine transform.
In this paper we present another dimensionality reduction for high-dimensional time-series data. The method splits a high-dimensional time-series data into parts as equal in time scale as possible and then takes the average of each part. The reduction is simpler than the existing ones mentioned above and it may be performed in parallel. So this method decreases the time for ‘narrowing’ data and speeds up the matching process in a large time-series database. We also use this dimensionality reduction for a special type of time-series data: minimum bounding rectangles (MBRs).
This paper is organised as follows. In Section 2 we present a dimensionality reduction function for high-dimensional time-series data and point out some of its properties. Section 3 presents applications of the dimensionality reduction function to time-series data matching and to MBRs. When applying this reduction function to MBRs we show that it is safe. Some concluding remarks are given in the last section.
2 Dimensionality reduction for time-series data
Let T[1..n] be a time-series data. It consists of n real numbers, so it is called an n-dimensional data.
The higher the dimensionality n of a time-series data, the more difficult the data is to store, search and match. So the question is how to ‘narrow’ the data. In other words, we have to construct an operation that transforms a high-dimensional time-series data with hundreds or thousands of dimensions into a low-dimensional time-series data with only a few dimensions. Instead of working on the high-dimensional time-series data, one can then do the same work on the low-dimensional time-series data with higher performance. To do so, we construct dimensionality reduction functions for time-series data. Each such function is a mapping $F: \mathbb{R}^n \to \mathbb{R}^m$.
Let F be any dimensionality reduction function transforming n-dimensional time-series data to m-dimensional time-series data, with 0 < m < n. We are interested only in those functions that satisfy the following requirement.
Definition 2.1: A dimensionality reduction function F is proper if for any pair of n-dimensional time-series data X and Y:
$$D_m(F(X), F(Y)) \le D_n(X, Y) \qquad (2.1)$$
where $D_n$ and $D_m$ are the distance functions of the n-dimensional space and the m-dimensional space, respectively.
So each proper dimensionality reduction function on time-series data is a shrinking mapping. The properness of a reduction function guarantees no false dismissals for range queries: if two data lie within a distance ε of each other in the n-dimensional space, their reduced data lie within ε of each other in the m-dimensional space, so every true match survives a search in the reduced space.
Let T[1..n] be an n-dimensional time-series data and let m be a positive integer such that 0 < m << n. Moon (2007) and Moon and Kim (2007) constructed two dimensionality reduction functions, based on the discrete Fourier transform and the discrete cosine transform, that map T[1..n] to the m-dimensional time-series data $T_{RF}[1..m]$ and $T_{RC}[1..m]$ as follows.
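1 the dimensionality reduction function based on the discrete Fourier transform is:
$$T_{RF}[i] = \begin{cases} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{j=1}^{n} T[j] \cos\left(-2\pi \lfloor (i-1)/2 \rfloor (j-1)/n\right), & \text{if } i \text{ is odd;} \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{j=1}^{n} T[j] \sin\left(-2\pi \lfloor (i-1)/2 \rfloor (j-1)/n\right), & \text{if } i \text{ is even,} \end{cases} \qquad 1 \le i \le m \qquad (2.2)$$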
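2 the dimensionality reduction function based on the discrete cosine transform is:
$$T_{RC}[i] = \sqrt{\frac{2}{n}}\; c(i) \sum_{j=1}^{n} T[j] \cos\left(\frac{(2j-1)(i-1)\pi}{2n}\right), \qquad 1 \le i \le m \qquad (2.3)$$
where
$$c(i) = \begin{cases} \sqrt{2}/2, & \text{if } i = 1; \\ 1, & \text{if } 2 \le i \le m. \end{cases}$$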
Let us return to an n-dimensional time-series data T[1..n]. To reduce the dimensionality of the data we split it into m parts as equal in time scale as possible. This can always be done because of the following arithmetic fact:
For two positive integers n and m with 0 < m < n, there exist two non-negative integers q and d such that n = d.(q + 1) + (m – d).q.
The proof of this fact is very simple.
Choose q = n div m and d = n mod m. We get n = m.q + d = d.q + d + m.q – d.q = d.(q + 1) + (m – d).q.
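The arithmetic fact above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `part_sizes` is hypothetical and does not come from the paper.

```python
def part_sizes(n: int, m: int) -> list[int]:
    """Sizes of the m parts: d parts of size q + 1, then m - d parts of size q."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)  # q = n div m, d = n mod m
    return [q + 1] * d + [q] * (m - d)


sizes = part_sizes(10, 3)
print(sizes)       # [4, 3, 3]
print(sum(sizes))  # 10, i.e. n = d.(q + 1) + (m - d).q
```

The part sizes differ by at most one, which is what "as equal in time scale as possible" amounts to.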
The above fact gives us a method for partitioning an n-dimensional time-series data into m parts: the first d parts have size q + 1 and the remaining m – d parts have size q. Then we take the average of each part. In this way we transform an n-dimensional time-series data into an m-dimensional time-series data.
Let q = n div m and d = n mod m.
Definition 2.2: The m-dimensional time-series data $T_R[1..m]$ constructed as follows:
$$T_R[i] = \begin{cases} \dfrac{1}{q+1} \displaystyle\sum_{j=(i-1)(q+1)+1}^{i(q+1)} T[j], & \text{if } 1 \le i \le d; \\[2ex] \dfrac{1}{q} \displaystyle\sum_{j=d+(i-1)q+1}^{d+iq} T[j], & \text{if } d < i \le m \end{cases} \qquad (2.4)$$
is called a reduced m-dimensional time-series data of the n-dimensional time-series data T[1..n].
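A minimal Python sketch of formula (2.4) follows. It assumes the time-series data is given as a plain list of numbers with 0-based indexing; the name `reduce_series` is hypothetical and is used only for illustration.

```python
def reduce_series(t: list[float], m: int) -> list[float]:
    """Reduce an n-dimensional series to m dimensions by averaging parts (2.4)."""
    n = len(t)
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)  # q = n div m, d = n mod m
    reduced = []
    start = 0
    for i in range(m):
        size = q + 1 if i < d else q  # first d parts are one element longer
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


print(reduce_series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))  # [2.5, 6.0, 9.0]
```

Here n = 10 and m = 3 give q = 3 and d = 1, so the parts are [1, 2, 3, 4], [5, 6, 7] and [8, 9, 10].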
Formula (2.4) gives us a function transforming n-dimensional time-series data to m-dimensional time-series data. This transforming function may be used to store large databases of multi-dimensional time-series data. It saves memory and increases the matching speed. Moreover, our dimensionality reduction may be performed in parallel (Thanh, 2007; Thanh, 2009), since the average of each part is computed independently of the other parts. The time for building the reduced database will therefore be drastically decreased.
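The independence of the parts can be illustrated with Python's standard `concurrent.futures` module. This is a sketch under stated assumptions, not the parallel scheme of Thanh (2007, 2009): the names `part_bounds` and `reduce_parallel` and the `workers` parameter are hypothetical, and in CPython a thread pool shows the structure of the computation rather than a real speed-up for such CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def part_bounds(n: int, m: int) -> list[tuple[int, int]]:
    """Half-open index ranges (start, end) of the m parts of a length-n series."""
    q, d = divmod(n, m)
    bounds, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        bounds.append((start, start + size))
        start += size
    return bounds


def reduce_parallel(t: list[float], m: int, workers: int = 4) -> list[float]:
    """Compute the m part averages of formula (2.4) as independent tasks."""
    def average(bound: tuple[int, int]) -> float:
        lo, hi = bound
        return sum(t[lo:hi]) / (hi - lo)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the submitted bounds
        return list(pool.map(average, part_bounds(len(t), m)))


print(reduce_parallel(list(range(1, 11)), 3))  # [2.5, 6.0, 9.0]
```

The same pattern applies one level up: each data sequence in a database can be reduced as a separate task.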
Theorem 2.1: The dimensionality reduction function f constructed as in formula (2.4) is proper.
Proof: Let X[1..n] and Y[1..n] be two n-dimensional time-series data. The distance function used here is the $L_1$ distance, also called the Manhattan distance or city block distance. So,
$$D_n(X, Y) = \sum_{k=1}^{n} |X[k] - Y[k]|$$
and
$$D_m(f(X), f(Y)) = \sum_{i=1}^{m} |X_R[i] - Y_R[i]|,$$
where $X_R[i]$ and $Y_R[i]$ are the corresponding components of the m-dimensional time-series data transformed by formula (2.4).
To prove the properness of the function f we check inequality (2.1) separately on each part of the split in Definition 2.2. On the first part we have:
$$\begin{aligned} & |X[1]-Y[1]| + |X[2]-Y[2]| + \dots + |X[q+1]-Y[q+1]| \\ &\quad \ge \left| X[1]-Y[1] + X[2]-Y[2] + \dots + X[q+1]-Y[q+1] \right| \\ &\quad = \left| (X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1]) \right| \\ &\quad \ge \frac{\left| (X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1]) \right|}{q+1} \\ &\quad = \left| \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} - \frac{Y[1]+Y[2]+\dots+Y[q+1]}{q+1} \right| \\ &\quad = |X_R[1] - Y_R[1]|. \end{aligned}$$
Proving analogously for the remaining parts and then adding up both sides of the inequalities, we get inequality (2.1).
Note that the properness of the dimensionality reduction function f can also be proved when the maximum distance
$$D_n(X, Y) = \max_{1 \le k \le n} |X[k] - Y[k]|$$
is chosen as the distance function.
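Inequality (2.1) can also be checked numerically for both distance functions. The sketch below is an illustration on random data, not a proof; the names `reduce_series`, `l1`, `linf` and `is_proper_on` are hypothetical, and the small tolerance `tol` is an assumption added to absorb floating-point rounding.

```python
import random


def reduce_series(t, m):
    """Reduction (2.4): average d parts of size q + 1, then m - d parts of size q."""
    q, d = divmod(len(t), m)
    out, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        out.append(sum(t[start:start + size]) / size)
        start += size
    return out


def l1(x, y):
    """Manhattan (city block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def linf(x, y):
    """Maximum distance."""
    return max(abs(a - b) for a, b in zip(x, y))


def is_proper_on(x, y, m, dist, tol=1e-9):
    """Check inequality (2.1) for one pair of series; tol absorbs rounding."""
    return dist(reduce_series(x, m), reduce_series(y, m)) <= dist(x, y) + tol


rng = random.Random(0)  # fixed seed so the run is repeatable
all_ok = True
for _ in range(1000):
    n = rng.randint(2, 60)
    m = rng.randint(1, n - 1)
    x = [rng.uniform(-100.0, 100.0) for _ in range(n)]
    y = [rng.uniform(-100.0, 100.0) for _ in range(n)]
    all_ok = all_ok and is_proper_on(x, y, m, l1) and is_proper_on(x, y, m, linf)
print(all_ok)
```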
Furthermore, we show that the dimensionality reduction function f preserves some basic geometric figures.
A line segment in the n-dimensional space is represented by its starting point A and ending point B. Denote the line segment by A – B. Applying the dimensionality reduction function f to the points A and B we get the reduced points $A_R$ and $B_R$. These points form a line segment in the m-dimensional space, denoted by $A_R$ – $B_R$. Our dimensionality reduction function preserves line segments.
Theorem 2.2: Line segments are invariant under the dimensionality reduction function f, i.e.,
$$\forall X \in A - B \;\Rightarrow\; X_R \in A_R - B_R.$$
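Proof: The equation of the line passing through A and B in the n-dimensional space is:
$$\frac{X[1]-A[1]}{B[1]-A[1]} = \frac{X[2]-A[2]}{B[2]-A[2]} = \dots = \frac{X[n]-A[n]}{B[n]-A[n]}.$$
As the point X belongs to the line segment A – B, we have:
$$\frac{X[1]-A[1]}{B[1]-A[1]} = \frac{X[2]-A[2]}{B[2]-A[2]} = \dots = \frac{X[n]-A[n]}{B[n]-A[n]} = k,$$
with 0 ≤ k ≤ 1.
It means,
$$\begin{cases} X[1]-A[1] = k\,(B[1]-A[1]) \\ X[2]-A[2] = k\,(B[2]-A[2]) \\ \quad\vdots \\ X[n]-A[n] = k\,(B[n]-A[n]) \end{cases} \qquad (2.5)$$
To show that $X_R \in A_R - B_R$ we have to prove:
$$\frac{X_R[1]-A_R[1]}{B_R[1]-A_R[1]} = \frac{X_R[2]-A_R[2]}{B_R[2]-A_R[2]} = \dots = \frac{X_R[m]-A_R[m]}{B_R[m]-A_R[m]} \qquad (2.6)$$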
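In fact, replacing the numerator and the denominator of the first fraction according to formula (2.4) and using equalities (2.5) we obtain:
$$\begin{aligned} \frac{X_R[1]-A_R[1]}{B_R[1]-A_R[1]} &= \frac{\dfrac{X[1]+X[2]+\dots+X[q+1]}{q+1} - \dfrac{A[1]+A[2]+\dots+A[q+1]}{q+1}}{\dfrac{B[1]+B[2]+\dots+B[q+1]}{q+1} - \dfrac{A[1]+A[2]+\dots+A[q+1]}{q+1}} \\ &= \frac{(X[1]+X[2]+\dots+X[q+1]) - (A[1]+A[2]+\dots+A[q+1])}{(B[1]+B[2]+\dots+B[q+1]) - (A[1]+A[2]+\dots+A[q+1])} \\ &= \frac{(X[1]-A[1]) + (X[2]-A[2]) + \dots + (X[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} \\ &= \frac{k\,(B[1]-A[1]) + k\,(B[2]-A[2]) + \dots + k\,(B[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} \\ &= k. \end{aligned}$$
Analogously, we show that each fraction in (2.6) is equal to k. So they are all identical.
This proves the theorem.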
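Theorem 2.2 can be illustrated numerically. The sketch below writes a point of the segment in the equivalent form X = A + k(B – A), which is equalities (2.5) rearranged, and checks that the reduced point satisfies the same relation with the same k. The names `reduce_series` and `point_on_segment` and the sample values are assumptions chosen for illustration.

```python
import math


def reduce_series(t, m):
    """Reduction (2.4): average d parts of size q + 1, then m - d parts of size q."""
    q, d = divmod(len(t), m)
    out, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        out.append(sum(t[start:start + size]) / size)
        start += size
    return out


def point_on_segment(a, b, k):
    """Point A + k(B - A) of the segment A - B, for 0 <= k <= 1."""
    return [ai + k * (bi - ai) for ai, bi in zip(a, b)]


a = [0.0, 2.0, 4.0, 6.0, 8.0]
b = [10.0, 0.0, -1.0, 6.0, 3.0]
k, m = 0.25, 2

x = point_on_segment(a, b, k)
a_r, b_r, x_r = reduce_series(a, m), reduce_series(b, m), reduce_series(x, m)

# The reduced point should sit at the same relative position k on A_R - B_R.
expected = point_on_segment(a_r, b_r, k)
on_segment = all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(x_r, expected))
print(on_segment)
```

Working with A + k(B – A) rather than with the fractions in (2.6) also covers coordinates where $A_R[i] = B_R[i]$, for which the fraction form has a zero denominator.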
Corollary 2.3: The dimensionality reduction function f as in (2.4) preserves polygons.
Note that spheres are not preserved by the dimensionality reduction function f.
3 Some applications
3.1 Application to matching problem
Finding all occurrences of a pattern in a database is the purpose of the matching problem. Searching for particular patterns in DNA sequences is a typical example. The problem is formalised in Cormen et al. (2001) as follows.
We assume that the database is an array S[1..l] of length l and that the pattern is an array P[1..k] of length k, with k ≤ l. We further assume that the elements of S and P belong to a finite set A.
We say that the pattern P occurs beginning at position q in the database S if 1 ≤ q ≤ l – k + 1 and S[q..q + k – 1] = P[1..k] (that is, if S[q + j – 1] = P[j] for 1 ≤ j ≤ k).
In the case when the elements of the set A are characters, Rabin and Karp proposed a string-matching algorithm based on the assumption that each character is a digit in radix-d notation, where d = |A|. A string of h consecutive characters can then be viewed as representing a length-h number, and its value can be computed using Horner’s rule. Instead of comparing S[q..q + k – 1] with P[1..k] directly, the algorithm compares their values to find candidates. Each candidate is then compared exactly with the pattern P[1..k] to determine the positions of the pattern’s occurrences.
Assume now that the elements of the database S and the pattern P are time-series data, where each time-series data consists of n real numbers. Thus, we can calculate the sum of each time-series data or the sum of some consecutive time-series data.
The matching process on a time-series database is divided into two steps. In the first step we calculate the sum
$$t = \sum_{i=q}^{q+k-1} \sum_{j=1}^{n} S[i][j]$$
by adding $\sum_{j=1}^{n} S[q+k-1][j]$ to the previous sum and then subtracting $\sum_{j=1}^{n} S[q-1][j]$, and compare the obtained sum with the sum p of the pattern P. This step is called preprocessing.
Figure 1 Calculating the value of S[q..q + k – 1]
If a candidate is found (p = t) we move on to the matching step. In this step we compare the candidate exactly with the pattern and print a notice if the comparison succeeds.
Based on the idea of Rabin and Karp’s algorithm, we propose the following matching algorithm for a time-series database.
Algorithm 3.1 Time-series data matching
Begin
1 l ← length(S)
2 k ← length(P)
3 $p \leftarrow \sum_{i=1}^{k} \sum_{j=1}^{n} P[i][j]$
4 S[0][1..n] ← 0
5 $t \leftarrow \sum_{i=1}^{k-1} \sum_{j=1}^{n} S[i][j]$
6 for q ← 1 to l – k + 1 do
7 begin
8 $t \leftarrow t + \sum_{j=1}^{n} S[q+k-1][j] - \sum_{j=1}^{n} S[q-1][j]$
9 if p = t
10 then if P[1..k] = S[q..q + k – 1]
11 then print ‘pattern occurs beginning at position’ q
12 end
End
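The algorithm above can be sketched in Python as follows. This is an illustrative reading of Algorithm 3.1, not the authors' implementation: the name `match_time_series` is hypothetical, the database and pattern are assumed to be lists of equal-length rows, and the sums are compared with a tolerance (`math.isclose`) instead of strict equality, because a rolling floating-point sum can differ slightly from a directly computed one. The final comparison with the pattern stays exact.

```python
import math


def match_time_series(s, p):
    """Return the 1-based positions q at which pattern p occurs in database s."""
    l, k = len(s), len(p)
    if k == 0 or k > l:
        return []
    row_sums = [sum(row) for row in s]       # independent sums, one per series
    target = sum(sum(row) for row in p)      # p in Algorithm 3.1
    window = sum(row_sums[:k - 1])           # t before the loop (S[0] counts as 0)
    positions = []
    for q in range(1, l - k + 2):
        window += row_sums[q + k - 2]        # add the sum of S[q + k - 1]
        if q > 1:
            window -= row_sums[q - 2]        # subtract the sum of S[q - 1]
        # Preprocessing filter first, exact comparison only for candidates.
        if math.isclose(window, target, rel_tol=1e-9, abs_tol=1e-9):
            if s[q - 1:q - 1 + k] == p:
                positions.append(q)
    return positions


database = [[1, 2], [3, 4], [5, 6], [3, 4], [5, 6]]
pattern = [[3, 4], [5, 6]]
print(match_time_series(database, pattern))  # [2, 4]
```

In this example position 3 passes the sum filter (the window [[5, 6], [3, 4]] has the same total, 18) but is rejected by the exact comparison, which is why the second step is needed.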
Note that the computations of $\sum_{j=1}^{n} S[q+k-1][j]$ and $\sum_{j=1}^{n} S[q-1][j]$ in instruction (8) can be performed in parallel. So can the comparison P[1..k] = S[q..q + k – 1] in instruction (10). The complexity of the algorithm is O(l.n).
We apply the dimensionality reduction transformation f constructed as in formula (2.4) to the time-series matching problem. Firstly, we reduce the dimensionality of the database S[1..l][1..n] and the pattern P[1..k][1..n] from n to m, with 0 < m << n. So we get a new database $S_R[1..l][1..m]$ and a new pattern $P_R[1..k][1..m]$ in the m-dimensional space. Then we substitute $S_R$ for S and $P_R$ for P in the above time-series data matching algorithm. The properness of the transformation f guarantees the correctness of this algorithm and no false dismissals for range queries. Furthermore, the complexity of the time-series data matching algorithm is decreased drastically, by the ratio m/n.
The dimensionality reduction transformation f can also be applied to a matching problem in which the time-series database and the pattern have different dimensions. In that case we first reduce the dimensionalities of the time-series database and the pattern to the same dimension and then do the matching on the new time-series database and the new pattern.
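One way to read the procedure above is as a filter-and-refine scheme: matching in the reduced space cannot miss a true occurrence, but it may report extra candidates, so each candidate is confirmed against the original data. The sketch below makes that assumption explicit. The names `reduce_series`, `candidates_by_reduction` and `match_with_reduction` are hypothetical, and a plain window comparison is used in the reduced space in place of Algorithm 3.1 to keep the example short.

```python
def reduce_series(t, m):
    """Reduction (2.4): average d parts of size q + 1, then m - d parts of size q."""
    q, d = divmod(len(t), m)
    out, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        out.append(sum(t[start:start + size]) / size)
        start += size
    return out


def candidates_by_reduction(s, p, m):
    """1-based positions where the reduced pattern occurs in the reduced database."""
    s_r = [reduce_series(row, m) for row in s]
    p_r = [reduce_series(row, m) for row in p]
    k = len(p)
    return [q for q in range(1, len(s) - k + 2) if s_r[q - 1:q - 1 + k] == p_r]


def match_with_reduction(s, p, m):
    """Confirm each reduced-space candidate against the original n-dimensional data."""
    k = len(p)
    return [q for q in candidates_by_reduction(s, p, m) if s[q - 1:q - 1 + k] == p]


database = [[1, 3, 5, 7], [2, 2, 6, 6], [3, 1, 7, 5], [2, 2, 6, 6]]
pattern = [[2, 2, 6, 6]]
print(candidates_by_reduction(database, pattern, 2))  # [1, 2, 3, 4]
print(match_with_reduction(database, pattern, 2))     # [2, 4]
```

All four series reduce to [2.0, 6.0] for m = 2, so the reduced space yields four candidates, of which only positions 2 and 4 are exact occurrences.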
3.2 Application to MBR
Consider a database consisting of many n-dimensional time-series data. Each n-dimensional time-series data corresponds to a point in the n-dimensional space. Construct the smallest rectangle in the n-dimensional space that contains these points. Such a rectangle is called an MBR (Moon, 2007; Moon and Kim, 2007).
Moreover, for many objects we cannot know exact information about them. We only know that the information belongs to some interval. For example, the price of a stock item is represented by the bid price and the ask price, and the temperature in a region is represented by the lowest and the highest temperature. MBRs may be used for these data.
An MBR has $2^n$ vertex points, where n is the dimensionality of the time-series data. To represent the data rectangle we store only two time-series data corresponding to its lower-left and upper-right points, i.e., the point with the smallest coordinates and the point with the greatest coordinates. Denote these points by L[1..n] and U[1..n]. The corresponding n-dimensional MBR is denoted by [L, U].
Let F be a dimensionality reduction function. Applying the function to an n-dimensional MBR [L, U] by reducing only the two vertex points L and U, we obtain two new points $L^F$ and $U^F$. These points form a new MBR $[L^F, U^F]$. So when does the dimensionality reduction function F transform the high-dimensional MBR [L, U] into the low-dimensional MBR $[L^F, U^F]$? The following definition of an MBR-safe transformation was introduced in Moon (2007).
Definition 3.1: A transformation F is MBR-safe if it satisfies the following requirement: for any n-dimensional time-series data X and any n-dimensional MBR [L, U],
$$X \in [L, U] \;\Rightarrow\; X^F \in [L^F, U^F].$$
Figure 2 An MBR-safe transformation
The safety of the transformation f constructed as in (2.4) is asserted by the following theorem.
Theorem 3.2: The dimensionality reduction transformation f constructed as in formula (2.4) is MBR-safe.
Proof: By the definition of an MBR we have the following n double inequalities:
$$L[i] \le X[i] \le U[i], \quad i = 1, 2, \dots, n.$$
Adding the first q + 1 double inequalities and dividing the totals by q + 1 we get:
$$\frac{L[1]+L[2]+\dots+L[q+1]}{q+1} \le \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} \le \frac{U[1]+U[2]+\dots+U[q+1]}{q+1}.$$
This means $L_R[1] \le X_R[1] \le U_R[1]$.
Analogously for the remaining parts, we obtain:
$$L_R[i] \le X_R[i] \le U_R[i], \quad i = 1, 2, \dots, m.$$
So $X_R \in [L_R, U_R]$.
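Theorem 3.2 can be illustrated on random data: build the MBR of a set of points, reduce only its two corner points, and check that every reduced point lies inside the reduced rectangle. The names `reduce_series` and `in_mbr`, the sample sizes and the tolerance `tol` are assumptions made for this sketch.

```python
import random


def reduce_series(t, m):
    """Reduction (2.4): average d parts of size q + 1, then m - d parts of size q."""
    q, d = divmod(len(t), m)
    out, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        out.append(sum(t[start:start + size]) / size)
        start += size
    return out


def in_mbr(x, lower, upper, tol=1e-12):
    """True if lower[i] <= x[i] <= upper[i] for every coordinate (up to tol)."""
    return all(lo - tol <= v <= hi + tol for v, lo, hi in zip(x, lower, upper))


rng = random.Random(1)  # fixed seed so the run is repeatable
n, m = 12, 5
points = [[rng.uniform(0.0, 100.0) for _ in range(n)] for _ in range(20)]

# MBR of the points: coordinate-wise minimum (L) and maximum (U).
lower = [min(pt[j] for pt in points) for j in range(n)]
upper = [max(pt[j] for pt in points) for j in range(n)]

# Reduce only the two corner points, then test every reduced point.
lower_r, upper_r = reduce_series(lower, m), reduce_series(upper, m)
safe = all(in_mbr(reduce_series(pt, m), lower_r, upper_r) for pt in points)
print(safe)
```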
Corollary 2.3 and Theorem 3.2 show that the dimensionality reduction transformation f preserves planar geometric blocks represented by line segments. They point to an important role for the transformation f in computer graphics and image processing.
4 Conclusions
In this paper we present a dimensionality reduction transformation for multi-dimensional time-series data and some of its applications to the matching problem and to MBRs. The transformation is proper, MBR-safe and simpler than the existing transformations in Moon et al. (2002) and Moon (2007). Therefore, it is well suited to storing large databases of multi-dimensional time-series data, and to searching, matching and data mining. The dimensionality reduction can be performed in parallel, so the time for dimensionality reduction is decreased.
In further research we will apply the MBR-safe transformation to multimedia data retrieval and GIS. Furthermore, the dimensionality reduction preserves planar geometric blocks. Hence, it may be used in computer graphics and image processing as well.
Acknowledgements
A part of this paper was presented at the 1st Asian Conference on Intelligent Information and Database Systems held in Dong Hoi, Vietnam in April 2009.
The authors are thankful to Vietnam National University, Hanoi for providing support for this research (Project QG-09-01).
References
Agrawal, R., Faloutsos, C. and Swami, A. (1993) ‘Efficient similarity search in sequence databases’, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, USA, pp.69–84.
Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. (2001) Introduction to Algorithms, The MIT Press.
Faloutsos, C., Ranganathan, M. and Manolopoulos, Y. (2001) ‘Fast subsequence matching in time-series databases’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.419–429.
Keogh, E., Chakrabarti, K., Pazzani, M. and Mehrotra, S. (2000) ‘Dimensionality reduction for fast similarity search in large time-series database’, Journal of Knowledge and Information Systems, Vol. 3, No. 3, pp.263–286.
Keogh, E., Chakrabarti, K., Mehrotra, S. and Pazzani, M. (2001) ‘Locally adaptive dimensionality reduction for indexing large time-series databases’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.151–162.
Moon, Y.S. (2007) ‘An MBR-safe transformation for high-dimensional MBRs in similar sequence matching’, Proceedings of the International Conference on Database Systems for Advanced Applications, Thailand.
Moon, Y.S. and Kim, J. (2007) ‘A theoretical study on MBR-safe transformations’, Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Italy.
Moon, Y.S., Whang, K.Y. and Han, W.S. (2002) ‘General match: a subsequence matching method in time-series databases based on generalized windows’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.382–393.
Thanh, H.C. (2007) ‘Transforming sequential processes of a net system into concurrent ones’, International Journal of Knowledge-based and Intelligent Engineering Systems, Vol. 11, No. 6, pp.391–397.
Thanh, H.C. (2009) ‘Parallel dimensionality reduction transformation for time-series data’, in Ngoc Thanh Nguyen, Huynh Phan Nguyen and Adam Grzech (Eds.): ACIIDS 2009, IEEE Computer Society, pp.104–108.