Parallel Dimensionality Reduction Transformation for Time-Series Data
Hoang Chi Thanh
Department of Informatics, Hanoi University of Science, VNUH
334 - Nguyen Trai Rd., Hanoi, Vietnam
E-mail: thanhhc@vnu.vn
Abstract
Subsequence matching in large time-series databases is an interesting problem. Many methods have been proposed that cope with this problem to an adequate extent. One good idea is to properly reduce the dimensionality of time-series data.
In this paper, we propose a method to reduce the dimensionality of high-dimensional time-series data. The method is simpler than existing ones based on the discrete Fourier transform and the discrete cosine transform. Furthermore, our dimensionality reduction may be executed in parallel. It preserves planar geometric blocks and may be applied to minimum bounding rectangles as well.
Keywords: Time-series data, dimensionality
reduction, matching problem, minimum bounding
rectangle
1 Introduction
Time-series data are sequences of real numbers representing values at specific points in time. For example, the bid and ask prices of stock items, exchange rates, weather data and human speech signals are typical time-series data. The data stored in a database are called data sequences. The aim of the subsequence matching problem in a large time-series database is to find the data sequences in the database that are similar to a given query sequence. This problem has attracted a lot of interest because of its applications.
Many methods have been proposed that cope with this problem to an adequate extent [1-5]. One good idea for increasing the matching speed is a proper dimensionality reduction for high-dimensional time-series data. In [6] the author proposed a data transformation based on the discrete Fourier transform. The authors of [7] presented a data transformation based on the discrete cosine transform.
In this paper we present another dimensionality reduction for high-dimensional time-series data. The method splits a high-dimensional time series into parts that are as equal in time scale as possible and then takes the average of each part. The reduction is simpler than the existing ones mentioned above, and it may be performed in parallel. The method therefore decreases the time needed for “narrowing” the data and speeds up the matching. We also apply this dimensionality reduction to a special type of time-series data: minimum bounding rectangles.
This paper is organized as follows. In Section 2 we present a dimensionality reduction function for high-dimensional time-series data and some of its properties. Section 3 shows that this reduction function is safe when applied to minimum bounding rectangles. Some concluding remarks are given in the last section.
2 Dimensionality reduction for time-series data
Let T[1..n] be a time series. It consists of n real numbers, so it is called n-dimensional data.
When the dimensionality n of time-series data is high, the data are difficult to store, search and match. The question is therefore how to “narrow” the data. In other words, we have to construct an operation that transforms high-dimensional time-series data with hundreds or thousands of dimensions into low-dimensional time-series data with only a few dimensions. Instead of working on the high-dimensional data, one can then do the same work on the low-dimensional data with higher performance. To do so, we construct dimensionality reduction functions for time-series data. Each such function is a mapping $F : \mathbb{R}^n \to \mathbb{R}^m$.
Let F be any dimensionality reduction function transforming n-dimensional time-series data into m-dimensional time-series data, with 0 < m < n. We are interested only in those functions that satisfy the following requirement.
Definition 1: A dimensionality reduction function F is proper if for any pair of n-dimensional time-series data X and Y:
\[ D_m(F(X), F(Y)) \le D_n(X, Y) \tag{1} \]
2009 First Asian Conference on Intelligent Information and Database Systems
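where $D_n$ and $D_m$ are the distance functions of the n-dimensional space and the m-dimensional space, respectively.
So each proper dimensionality reduction function on time-series data is a shrinking mapping. The properness of a reduction function guarantees no false dismissals for range queries. Indeed, any data sequence X within distance $\varepsilon$ of a query Q also satisfies $D_m(F(X), F(Q)) \le \varepsilon$, so it is never filtered out in the reduced space.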
Let T[1..n] be n-dimensional time-series data and let m be a positive integer such that 0 < m << n. The authors of [6,7] constructed two dimensionality reduction functions for T[1..n], based on the discrete Fourier transform and the discrete cosine transform, that yield the m-dimensional time-series data $T_{RF}[1..m]$ and $T_{RC}[1..m]$ as follows:
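1) The dimensionality reduction function based on the discrete Fourier transform is:
\[
T_{RF}[i] =
\begin{cases}
\dfrac{1}{\sqrt{n}} \sum_{k=1}^{n} T[k] \cos\left(-2\pi \lfloor (i-1)/2 \rfloor (k-1)/n\right), & \text{if } i \text{ is odd;} \\
\dfrac{1}{\sqrt{n}} \sum_{k=1}^{n} T[k] \sin\left(-2\pi \lfloor (i-1)/2 \rfloor (k-1)/n\right), & \text{if } i \text{ is even,}
\end{cases}
\qquad 1 \le i \le m \tag{2}
\]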
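2) The dimensionality reduction function based on the discrete cosine transform is:
\[
T_{RC}[i] = \sqrt{\frac{2}{n}}\, c(i) \sum_{k=1}^{n} T[k] \cos\left(\frac{(2k-1)(i-1)\pi}{2n}\right), \qquad 1 \le i \le m \tag{3}
\]
where
\[
c(i) =
\begin{cases}
\sqrt{2}/2, & \text{if } i = 1; \\
1, & \text{if } 2 \le i \le m.
\end{cases}
\]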
Let us return to n-dimensional time-series data T[1..n]. To reduce the dimensionality of the data, we split it into m parts that are as equal in time scale as possible. This can always be done because of the following arithmetic fact:
For two given positive integers n and m with 0 < m < n, there exist two non-negative integers q and d such that $n = d(q+1) + (m-d)q$.
The proof of this fact is very simple. Choose q = n div m and d = n mod m. We get $n = mq + d = dq + d + mq - dq = d(q+1) + (m-d)q$.
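The arithmetic fact above can be illustrated with a few lines of Python. This is only a sketch: the helper name `part_sizes` is not from the paper and is introduced here for illustration.

```python
def part_sizes(n: int, m: int) -> list[int]:
    """Sizes of the m parts: d parts of size q+1 followed by m-d parts of size q."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)  # q = n div m, d = n mod m
    return [q + 1] * d + [q] * (m - d)


sizes = part_sizes(10, 3)
print(sizes)       # [4, 3, 3]
print(sum(sizes))  # 10
```

For n = 10 and m = 3 we have q = 3 and d = 1, so one part has four elements and two parts have three, and the sizes add up to n.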
This fact gives us a method for splitting n-dimensional time-series data into m parts: the first d parts have size q+1 and the remaining m-d parts have size q. Then we take the average of each part. In this way we transform n-dimensional time-series data into m-dimensional time-series data.
Let q = n div m and d = n mod m.
Definition 2: The m-dimensional time-series data $T_R[1..m]$ constructed as follows:
\[
T_R[i] =
\begin{cases}
\dfrac{1}{q+1} \sum_{k=(i-1)(q+1)+1}^{i(q+1)} T[k], & \text{if } 1 \le i \le d; \\
\dfrac{1}{q} \sum_{k=(i-1)q+d+1}^{iq+d} T[k], & \text{if } d+1 \le i \le m
\end{cases}
\tag{4}
\]
is called the reduced m-dimensional time-series data of the n-dimensional time-series data T[1..n].
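Definition 2 can be sketched in Python as follows. This is a minimal sequential sketch, not the author's implementation; the function name `reduce_series` is hypothetical, and Python lists are 0-based, whereas the paper indexes from 1.

```python
def reduce_series(t: list[float], m: int) -> list[float]:
    """Average the m consecutive parts of t, as in Definition 2.

    The first d parts have q+1 elements and the remaining m-d parts have
    q elements, where q = n div m and d = n mod m.
    """
    n = len(t)
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)
    reduced = []
    start = 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


print(reduce_series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))  # [2.5, 6.0, 9.0]
```

Here the ten values are split into the parts [1, 2, 3, 4], [5, 6, 7] and [8, 9, 10], whose averages form the reduced 3-dimensional data.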
Formula (4) gives us a function transforming n-dimensional time-series data into m-dimensional time-series data. This function may be used when storing large databases of multidimensional time-series data: it saves memory and increases the matching speed.
Moreover, our dimensionality reduction may be executed in parallel [8], because each component of the reduced data depends only on its own part of the original data. The time for building the reduced database can therefore be decreased substantially.
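The independence of the parts can be made concrete with a small sketch that computes each average as a separate task. The paper does not prescribe a parallel framework; the use of Python's `concurrent.futures.ThreadPoolExecutor` and the names `part_bounds` and `reduce_series_parallel` are assumptions made for illustration only. In CPython, threads do not speed up pure-Python arithmetic, so the sketch shows the task structure rather than a tuned implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def part_bounds(n: int, m: int) -> list[tuple[int, int]]:
    """Half-open, 0-based index ranges (start, end) of the m parts."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)
    bounds = []
    start = 0
    for i in range(m):
        size = q + 1 if i < d else q
        bounds.append((start, start + size))
        start += size
    return bounds


def reduce_series_parallel(t, m, workers=4):
    """Compute every part average as an independent task."""

    def average(bound):
        start, end = bound
        return sum(t[start:end]) / (end - start)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(average, part_bounds(len(t), m)))


print(reduce_series_parallel(list(range(1, 11)), 3))  # [2.5, 6.0, 9.0]
```

No task reads a value written by another task, so the m averages can be computed in any order or at the same time. The same holds one level up: different data sequences of a database can be reduced independently of each other.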
Theorem 1: The dimensionality reduction function f constructed as in formula (4) is proper.
Proof:
Let X[1..n] and Y[1..n] be two n-dimensional time-series data. The distance function used here is the Manhattan distance $L_1$:
\[
L_1(X, Y) = \sum_{k=1}^{n} |X[k] - Y[k]|, \qquad L_1(f(X), f(Y)) = \sum_{i=1}^{m} |X_R[i] - Y_R[i]|,
\]
where $X_R[i]$ and $Y_R[i]$ are the corresponding components of the m-dimensional time-series data transformed by (4).
To prove the properness of the function f, we check inequality (1) separately on each part of the split in Definition 2. On the first part we have:
\[
\begin{aligned}
& |X[1]-Y[1]| + |X[2]-Y[2]| + \dots + |X[q+1]-Y[q+1]| \\
&\quad \ge |(X[1]-Y[1]) + (X[2]-Y[2]) + \dots + (X[q+1]-Y[q+1])| \\
&\quad = |(X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1])| \\
&\quad \ge \left| \frac{1}{q+1}(X[1]+X[2]+\dots+X[q+1]) - \frac{1}{q+1}(Y[1]+Y[2]+\dots+Y[q+1]) \right| \\
&\quad = |X_R[1] - Y_R[1]|.
\end{aligned}
\]
Proving analogously for the remaining parts and then adding up both sides of the inequalities, we get inequality (1).
Note that the properness of the dimensionality reduction function f can also be proved when the maximum distance
\[
L_\infty(X, Y) = \max_{1 \le k \le n} |X[k] - Y[k]|
\]
is used as the distance function, because the absolute difference of two averages never exceeds the largest componentwise difference within the part.
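The properness property can also be checked numerically on random data. The following sketch is only a sanity check, not a proof; the names `reduce_series`, `manhattan`, `chebyshev` and `check_properness` are hypothetical, and a small tolerance is added to absorb floating-point rounding.

```python
import random


def reduce_series(t, m):
    """Average the m consecutive parts of t (first d parts have q+1 elements)."""
    q, d = divmod(len(t), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


def manhattan(x, y):
    """Manhattan distance L_1."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Maximum distance L_infinity."""
    return max(abs(a - b) for a, b in zip(x, y))


def check_properness(trials=1000, seed=0, tol=1e-9):
    """Return True if inequality (1) holds on all random trials."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(2, 50)
        m = rng.randint(1, n - 1)
        x = [rng.uniform(-100, 100) for _ in range(n)]
        y = [rng.uniform(-100, 100) for _ in range(n)]
        xr, yr = reduce_series(x, m), reduce_series(y, m)
        if manhattan(xr, yr) > manhattan(x, y) + tol:
            return False
        if chebyshev(xr, yr) > chebyshev(x, y) + tol:
            return False
    return True


print(check_properness())
```

Each trial draws two random sequences of the same length, reduces both with the same m, and compares the distances before and after the reduction under both the Manhattan and the maximum distance.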
Furthermore, we show that the dimensionality reduction function f preserves some basic geometric figures.
A line segment in the n-dimensional space is represented by its starting point A and ending point B. Denote the line segment by A – B. Applying the dimensionality reduction function f to the points A and B, we get the reduced points $A_R$ and $B_R$. These points form a line segment in the m-dimensional space, denoted by $A_R$ – $B_R$. Our dimensionality reduction function preserves line segments.
Theorem 2: Line segments are invariant under the dimensionality reduction function f, i.e. for every point X ∈ A – B we have $X_R$ ∈ $A_R$ – $B_R$.
Proof: The equation of the line passing through A and B in the n-dimensional space is:
\[
\frac{x_1 - A[1]}{B[1] - A[1]} = \frac{x_2 - A[2]}{B[2] - A[2]} = \dots = \frac{x_n - A[n]}{B[n] - A[n]}
\]
As the point X belongs to the line segment A – B, we have:
\[
\frac{X[1] - A[1]}{B[1] - A[1]} = \frac{X[2] - A[2]}{B[2] - A[2]} = \dots = \frac{X[n] - A[n]}{B[n] - A[n]} = k, \qquad \text{with } 0 \le k \le 1.
\]
It means,
\[
\begin{cases}
X[1] - A[1] = k\,(B[1] - A[1]) \\
X[2] - A[2] = k\,(B[2] - A[2]) \\
\quad\vdots \\
X[n] - A[n] = k\,(B[n] - A[n])
\end{cases}
\tag{5}
\]
To show that $X_R$ ∈ $A_R$ – $B_R$, we have to prove:
\[
\frac{X_R[1] - A_R[1]}{B_R[1] - A_R[1]} = \frac{X_R[2] - A_R[2]}{B_R[2] - A_R[2]} = \dots = \frac{X_R[m] - A_R[m]}{B_R[m] - A_R[m]}
\tag{6}
\]
In fact, replacing the numerator and the denominator of the first fraction according to formula (4) and using the equalities (5), we obtain:
\[
\begin{aligned}
\frac{X_R[1] - A_R[1]}{B_R[1] - A_R[1]}
&= \frac{\frac{1}{q+1}(X[1]+X[2]+\dots+X[q+1]) - \frac{1}{q+1}(A[1]+A[2]+\dots+A[q+1])}{\frac{1}{q+1}(B[1]+B[2]+\dots+B[q+1]) - \frac{1}{q+1}(A[1]+A[2]+\dots+A[q+1])} \\
&= \frac{(X[1]+X[2]+\dots+X[q+1]) - (A[1]+A[2]+\dots+A[q+1])}{(B[1]+B[2]+\dots+B[q+1]) - (A[1]+A[2]+\dots+A[q+1])} \\
&= \frac{(X[1]-A[1]) + (X[2]-A[2]) + \dots + (X[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} \\
&= \frac{k\,(B[1]-A[1]) + k\,(B[2]-A[2]) + \dots + k\,(B[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} = k.
\end{aligned}
\]
Analogously, we show that each fraction in (6) is equal to k. So they are all identical, and since 0 ≤ k ≤ 1 the point $X_R$ lies on the segment $A_R$ – $B_R$. This proves the theorem.
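Theorem 2 can be illustrated numerically: reducing a point of the segment A – B gives the point with the same parameter k on the reduced segment. The sketch below uses made-up coordinates, and the helper names `reduce_series` and `point_on_segment` are hypothetical.

```python
import math


def reduce_series(t, m):
    """Average the m consecutive parts of t (first d parts have q+1 elements)."""
    q, d = divmod(len(t), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


def point_on_segment(a, b, k):
    """Point A + k (B - A) of the segment A - B, with 0 <= k <= 1."""
    return [ai + k * (bi - ai) for ai, bi in zip(a, b)]


a = [0.0, 2.0, 4.0, 6.0, 8.0]
b = [10.0, 0.0, -1.0, 6.0, 3.0]
k = 0.25

x = point_on_segment(a, b, k)
a_r, b_r, x_r = (reduce_series(p, 2) for p in (a, b, x))

# The reduced point should equal A_R + k (B_R - A_R).
expected = point_on_segment(a_r, b_r, k)
print(x_r)       # [2.25, 6.375]
print(expected)  # [2.25, 6.375]
print(all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(x_r, expected)))
```

The check relies only on the fact that averaging is a linear operation, which is the step used in the proof when the equalities (5) are summed over a part.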
Corollary 3: The dimensionality reduction function f as in (4) preserves polygons.
Note that spheres are not preserved by this dimensionality reduction function f.
3 Application to minimum bounding rectangles
Consider a database consisting of many n-dimensional time-series data. Each n-dimensional time-series data corresponds to a point in the n-dimensional space. Construct the smallest rectangle in the n-dimensional space that contains these points. Such a rectangle is called a minimum bounding rectangle (MBR, for short) [6,7].
Moreover, for many objects we cannot know exact information. We only know that the information belongs to some interval. For example, the price of a stock item is represented by the bid price and the ask price, and the temperature of a region is represented by the lowest and the highest temperature. MBRs may be used for these data.
An MBR has $2^n$ vertex points, where n is the dimensionality of the time-series data. To represent the rectangle we store only two time-series data corresponding to its lower-left and upper-right points, i.e. the point with the smallest coordinates and the point with the greatest coordinates. Denote these points by L[1..n] and U[1..n]. The corresponding n-dimensional MBR is denoted by [L,U].
Let F be a dimensionality reduction function. Applying this function to an n-dimensional MBR [L,U] by reducing only the two vertex points L and U, we obtain two new points $L_F$ and $U_F$. These points form an MBR $[L_F, U_F]$. So when does the dimensionality reduction function F transform the high-dimensional MBR [L,U] into the low-dimensional MBR $[L_F, U_F]$? The following definition of an MBR-safe transformation was introduced in [6].
Definition 3: A transformation F is MBR-safe if it satisfies the following requirement: for any n-dimensional time-series data X and any n-dimensional MBR [L,U],
\[
X \in [L, U] \Rightarrow X_F \in [L_F, U_F]
\]
The safety of the transformation f constructed as in (4) is asserted by the following theorem.
Theorem 4: The dimensionality reduction transformation f constructed as in (4) is MBR-safe.
Proof:
By the definition of an MBR, we have the following n double inequalities:
\[
L[k] \le X[k] \le U[k], \qquad \forall k = 1, 2, \dots, n
\]
Adding the first q+1 double inequalities and dividing the totals by q+1, we get:
\[
\frac{L[1] + L[2] + \dots + L[q+1]}{q+1} \le \frac{X[1] + X[2] + \dots + X[q+1]}{q+1} \le \frac{U[1] + U[2] + \dots + U[q+1]}{q+1}
\]
This means $L_R[1] \le X_R[1] \le U_R[1]$.
Analogously for the remaining parts, we obtain:
\[
L_R[i] \le X_R[i] \le U_R[i], \qquad \forall i = 1, 2, \dots, m
\]
So $X_R \in [L_R, U_R]$.
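A small example makes the MBR-safety statement concrete. The data below are invented for illustration, and the names `reduce_series` and `in_mbr` are hypothetical; the sketch reduces only the two corner points of the MBR and checks that the reduced point stays inside the reduced rectangle.

```python
def reduce_series(t, m):
    """Average the m consecutive parts of t (first d parts have q+1 elements)."""
    q, d = divmod(len(t), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


def in_mbr(x, low, up):
    """True if low[k] <= x[k] <= up[k] holds in every dimension."""
    return all(l <= v <= u for l, v, u in zip(low, x, up))


low = [1, 1, 2, 0, 3, 2, 1]  # corner L with the smallest coordinates
up = [4, 3, 6, 2, 5, 7, 4]   # corner U with the greatest coordinates
x = [2, 3, 2, 1, 5, 2, 4]    # a point inside [L, U]

m = 3
low_r, up_r, x_r = reduce_series(low, m), reduce_series(up, m), reduce_series(x, m)

print(in_mbr(x, low, up))        # True
print(in_mbr(x_r, low_r, up_r))  # True
```

With n = 7 and m = 3 the parts have sizes 3, 2 and 2, so only two reduced corner points need to be stored instead of the $2^n$ vertices of the original rectangle.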
Corollary 3 and Theorem 4 show that the dimensionality reduction transformation f preserves planar geometric blocks represented by line segments. This suggests that the transformation f may also be useful in computer graphics and image processing.
The dimensionality reduction transformation f can also be applied to a matching problem even when the time-series data and the queries have different dimensions.
4 Conclusions
In this paper we present a dimensionality reduction transformation for multidimensional time-series data. The transformation is proper, MBR-safe and simpler than the existing transformations in [6,7]. Therefore, it can be applied in storing large databases of multidimensional time-series data, and in searching, matching and data mining. The dimensionality reduction can be performed in parallel, so the time it takes will be decreased. In further research we will apply the MBR-safe transformation to multimedia data retrieval and GIS. Furthermore, the dimensionality reduction preserves planar geometric blocks. Hence it may be used in computer graphics and image processing as well.
Acknowledgment
This work was supported by Vietnam National University, Hanoi.
References
[1] R. Agrawal, C. Faloutsos and A. Swami, Efficient similarity search in sequence databases, Proceedings of the 4th Int’l Conference on Foundations of Data Organization and Algorithms, USA, 1993, pp. 69-84.
[2] C. Faloutsos, M. Ranganathan and Y. Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 1994, pp. 419-429.
[3] E. Keogh, K. Chakrabarti, S. Mehrotra and M. Pazzani, Locally adaptive dimensionality reduction for indexing large time-series databases, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 2001, pp. 151-162.
[4] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra, Dimensionality reduction for fast similarity search in large time-series database, Journal of Knowledge and Information Systems, Nr. 3 (3), 2000, pp. 263-286.
[5] Y. S. Moon, K. Y. Whang and W. S. Han, General Match: A subsequence matching method in time-series databases based on generalized windows, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 2002, pp. 382-393.
[6] Y. S. Moon, An MBR-safe transformation for high-dimensional MBRs in similar sequence matching, Proceedings of the Int’l Conference on Database Systems for Advanced Applications, Thailand, 2007.
[7] Y. S. Moon and J. Kim, A theoretical study on MBR-safe transformations, Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Italy, 2007.
[8] H. C. Thanh, Transforming sequential processes of a net system into concurrent ones, International Journal of Knowledge-based and Intelligent Engineering Systems, IOS Press, Amsterdam, Vol. 11, Nr. 6, 2007, pp. 391-397.