Parallel Dimensionality Reduction Transformation for Time-Series Data

Hoang Chi Thanh

Department of Informatics, Hanoi University of Science, VNUH

334 - Nguyen Trai Rd., Hanoi, Vietnam

E-mail: thanhhc@vnu.vn

Abstract

Subsequence matching in large time-series databases is an interesting problem, and many methods have been proposed that cope with it to an adequate extent. One good idea is to reduce the dimensionality of the time-series data properly.

In this paper we propose a method for reducing the dimensionality of high-dimensional time-series data. The method is simpler than existing ones based on the discrete Fourier transform and the discrete cosine transform. Furthermore, our dimensionality reduction may be executed in parallel. It preserves planar geometric blocks and may be applied to minimum bounding rectangles as well.

Keywords: time-series data, dimensionality reduction, matching problem, minimum bounding rectangle

1 Introduction

Time-series data are sequences of real numbers representing values at specific points in time. Typical examples are the bid and ask prices of stock items, exchange rates, weather data and human speech signals. The data stored in a database are called data sequences. The aim of the subsequence matching problem in a large time-series database is to find the data sequences in the database that are similar to a given query sequence. This problem has attracted a lot of interest because of its applications.

Many methods have been proposed that cope with this problem to an adequate extent [1-5]. One good idea for increasing the matching speed is a proper dimensionality reduction of high-dimensional time-series data. In [6] the author proposed a data transformation based on the discrete Fourier transform. The authors of [7] presented a data transformation based on the discrete cosine transform.

In this paper we present another dimensionality reduction for high-dimensional time-series data. The method splits a high-dimensional time series into parts that are as equal in length as possible and then takes the average of each part. The reduction is simpler than the existing ones mentioned above, and it may be performed in parallel. The method therefore decreases the time needed for “narrowing” the data and speeds up the matching. We also apply this dimensionality reduction to a special type of time-series data: minimum bounding rectangles.

This paper is organized as follows. In Section 2 we present a dimensionality reduction function for high-dimensional time-series data and some of its properties. Section 3 shows that this reduction function is safe when applied to minimum bounding rectangles. Some concluding remarks are given in the last section.

2 Dimensionality reduction for time-series data

Let T[1..n] be a time series. It consists of n real numbers, so it is called n-dimensional data.

The higher the dimensionality n of time-series data, the harder the data are to store, search and match. The question is therefore how to “narrow” the data. In other words, we have to construct an operation that transforms high-dimensional time-series data with hundreds or thousands of dimensions into low-dimensional time-series data with only a few dimensions. Instead of working on the high-dimensional data, one can then do the same work on the low-dimensional data with higher performance. To do so, we construct dimensionality reduction functions for time-series data. Each such function is a mapping $F : \mathbb{R}^n \to \mathbb{R}^m$.

Let F be any dimensionality reduction function transforming n-dimensional time-series data to m-dimensional time-series data, with 0 < m < n. We are interested only in those functions that satisfy the following requirement.

Definition 1: A dimensionality reduction function F is proper if for any pair of n-dimensional time-series data X and Y:

$$D_m(F(X), F(Y)) \le D_n(X, Y) \qquad (1)$$

2009 First Asian Conference on Intelligent Information and Database Systems

where $D_n$ and $D_m$ are the distance functions of the n-dimensional space and the m-dimensional space, respectively.

So each proper dimensionality reduction function on time-series data is a shrinking (non-expansive) mapping. The properness of a reduction function guarantees no false dismissals for range queries.
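The link between properness and range queries can be spelled out in one line. The following is a sketch; the query Q and the tolerance $\varepsilon$ are symbols introduced here for illustration and do not appear in the original text.

```latex
% Range query: find all X with D_n(X, Q) <= epsilon.
% By inequality (1), every true answer also passes the test in the reduced space.
\[
D_n(X, Q) \le \varepsilon
\;\Longrightarrow\;
D_m(F(X), F(Q)) \le D_n(X, Q) \le \varepsilon .
\]
```

Sequences that pass the reduced test but fail the original one are false alarms, and a final check in the n-dimensional space would discard them.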

Let T[1..n] be n-dimensional time-series data and let m be a positive integer such that 0 < m << n. The authors of [6,7] constructed two dimensionality reduction functions, based on the discrete Fourier transform and the discrete cosine transform, that map T[1..n] to the m-dimensional time-series data $T^{RF}[1..m]$ and $T^{RC}[1..m]$ as follows:

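1) The dimensionality reduction function based on the discrete Fourier transform is:

$$T^{RF}[i] = \begin{cases} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{k=1}^{n} T[k] \cos\left(-2\pi \lfloor (i-1)/2 \rfloor (k-1)/n\right), & \text{if } i \text{ is odd;} \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{k=1}^{n} T[k] \sin\left(-2\pi \lfloor (i-1)/2 \rfloor (k-1)/n\right), & \text{if } i \text{ is even,} \end{cases} \qquad 1 \le i \le m \qquad (2)$$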

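2) The dimensionality reduction function based on the discrete cosine transform is:

$$T^{RC}[i] = \sqrt{\frac{2}{n}}\; c(i) \sum_{k=1}^{n} T[k] \cos\left(\frac{(2k-1)(i-1)\pi}{2n}\right), \quad \text{where } c(i) = \begin{cases} \sqrt{2}/2, & \text{if } i = 1; \\ 1, & \text{if } 2 \le i \le m, \end{cases} \qquad 1 \le i \le m \qquad (3)$$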
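Formulas (2) and (3) can be sketched in code as follows. This is a minimal, unoptimized illustration that assumes the normalizing factors $1/\sqrt{n}$ and $\sqrt{2/n}$ of the orthonormal DFT and DCT. The function names `dft_reduce` and `dct_reduce` are hypothetical and are not taken from [6,7].

```python
import math

def dft_reduce(T, m):
    """Formula (2): first m DFT-based features of T (i and k are 1-based, as in the text)."""
    n = len(T)
    out = []
    for i in range(1, m + 1):
        freq = (i - 1) // 2                          # floor((i - 1) / 2)
        trig = math.cos if i % 2 == 1 else math.sin  # cos for odd i, sin for even i
        s = sum(T[k - 1] * trig(-2 * math.pi * freq * (k - 1) / n)
                for k in range(1, n + 1))
        out.append(s / math.sqrt(n))                 # assumed normalization 1/sqrt(n)
    return out

def dct_reduce(T, m):
    """Formula (3): first m DCT-based features of T."""
    n = len(T)
    out = []
    for i in range(1, m + 1):
        c = math.sqrt(2) / 2 if i == 1 else 1.0      # c(i) from formula (3)
        s = sum(T[k - 1] * math.cos((2 * k - 1) * (i - 1) * math.pi / (2 * n))
                for k in range(1, n + 1))
        out.append(math.sqrt(2 / n) * c * s)         # assumed normalization sqrt(2/n)
    return out
```

Both functions loop over all n values for each of the m outputs and evaluate a trigonometric function in every step, which is the cost the averaging method of this paper avoids.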

Let us return to n-dimensional time-series data T[1..n]. To reduce the dimensionality of the data, we split it into m parts that are as equal in length as possible. This can always be done because of the following arithmetic fact:

For two given positive integers n and m with 0 < m < n, there exist two non-negative integers q and d such that n = d.(q+1) + (m-d).q.

The proof of this fact is very simple. Choose q = n div m and d = n mod m. We get n = m.q + d = d.q + d + m.q - d.q = d.(q+1) + (m-d).q.

This fact gives us a method for splitting n-dimensional time-series data into m parts: the first d parts have size q+1 and the remaining m-d parts have size q. Then we take the average of each part. In this way we transform n-dimensional time-series data into m-dimensional time-series data.
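As a small worked example of the split, the part sizes can be computed directly from q = n div m and d = n mod m. The helper name `part_sizes` is hypothetical and is used only for illustration.

```python
def part_sizes(n, m):
    """Sizes of the m parts: d parts of size q+1 followed by m-d parts of size q."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)          # q = n div m, d = n mod m
    return [q + 1] * d + [q] * (m - d)

print(part_sizes(10, 3))         # [4, 3, 3]
```

For n = 10 and m = 3 we have q = 3 and d = 1, so 10 = 1.(3+1) + 2.3, in agreement with the arithmetic fact above.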

Let q = n div m and d = n mod m.

Definition 2: The m-dimensional time-series data $T^{R}[1..m]$ constructed as follows:

$$T^{R}[i] = \begin{cases} \dfrac{1}{q+1} \displaystyle\sum_{k=(i-1)(q+1)+1}^{i(q+1)} T[k], & \text{if } 1 \le i \le d; \\[2ex] \dfrac{1}{q} \displaystyle\sum_{k=d+(i-1)q+1}^{d+iq} T[k], & \text{if } d+1 \le i \le m \end{cases} \qquad (4)$$

is called the reduced m-dimensional time-series data of the n-dimensional time-series data T[1..n].
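Formula (4) translates almost line by line into code. The sketch below is one possible implementation; the name `reduce_series` is hypothetical, and the code uses 0-based list indices where the formula uses 1-based ones.

```python
def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    n = len(T)
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)                  # q = n div m, d = n mod m
    reduced = []
    start = 0                            # 0-based start of the current part
    for i in range(m):
        size = q + 1 if i < d else q     # the first d parts are one element longer
        part = T[start:start + size]
        reduced.append(sum(part) / size)
        start += size
    return reduced

print(reduce_series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))   # [2.5, 6.0, 9.0]
```

Each input value is read exactly once, so the reduction needs n additions and m divisions and no trigonometric evaluations.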

Formula (4) gives us a function transforming n-dimensional time-series data to m-dimensional time-series data. This function may be used when storing large databases of multidimensional time-series data: it saves memory and increases the matching speed.

Moreover, our dimensionality reduction may be executed in parallel [8], which drastically decreases the time for building the reduced database.
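The reduction is easy to parallelize because each reduced component depends only on its own part, and each series in a database is reduced independently of the others. The sketch below illustrates the second kind of independence with a thread pool from the Python standard library. It is not the scheme of [8]; the names `reduce_series` and `reduce_database` and the `workers` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    q, d = divmod(len(T), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(T[start:start + size]) / size)
        start += size
    return reduced

def reduce_database(database, m, workers=4):
    """Reduce every series independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda T: reduce_series(T, m), database))

db = [list(range(1, 11)), list(range(10, 0, -1))]
print(reduce_database(db, 3))   # [[2.5, 6.0, 9.0], [8.5, 5.0, 2.0]]
```

In CPython, threads mainly show the structure of the computation. For real speedups on CPU-bound work one would normally use a process pool or a vectorized numerical library.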

Theorem 1: The dimensionality reduction function f constructed as in formula (4) is proper.

Proof:

Let X[1..n] and Y[1..n] be two n-dimensional time-series data. The distance function used here is the Manhattan distance $L_1$:

$$D_n(X, Y) = L_1(X, Y) = \sum_{k=1}^{n} |X[k] - Y[k]|, \qquad D_m(f(X), f(Y)) = \sum_{i=1}^{m} |X^{R}[i] - Y^{R}[i]|,$$

where $X^{R}[i]$ and $Y^{R}[i]$ are the corresponding components of the m-dimensional time-series data transformed by (4).

To prove the properness of the function f, it suffices to check inequality (1) on each part of the split in Definition 2. On the first part (shown here for a part of size q+1; parts of size q are handled identically) we have:

$$\begin{aligned} &|X[1]-Y[1]| + |X[2]-Y[2]| + \dots + |X[q+1]-Y[q+1]| \\ &\quad \ge |(X[1]-Y[1]) + (X[2]-Y[2]) + \dots + (X[q+1]-Y[q+1])| \\ &\quad = |(X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1])| \\ &\quad \ge \frac{1}{q+1}\,|(X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1])| \\ &\quad = \left| \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} - \frac{Y[1]+Y[2]+\dots+Y[q+1]}{q+1} \right| \\ &\quad = |X^{R}[1] - Y^{R}[1]|. \end{aligned}$$

Proving the analogous inequality for the remaining parts and then adding up both sides of the inequalities, we get inequality (1).

Note that the properness of the dimensionality reduction function f can also be proved when the maximum distance

$$L_\infty(X, Y) = \max_{1 \le k \le n} |X[k] - Y[k]|$$

is used as the distance function.
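Properness under both distances can also be checked empirically on random data. The sketch below is a sanity check and not a substitute for the proof; the helper names, the value range and the small floating-point tolerance are assumptions made for illustration.

```python
import random

def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    q, d = divmod(len(T), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(T[start:start + size]) / size)
        start += size
    return reduced

def l1(X, Y):
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(X, Y))

def linf(X, Y):
    """Maximum distance."""
    return max(abs(x - y) for x, y in zip(X, Y))

def check_properness(n, m, trials=1000, seed=0):
    """Return True if inequality (1) held for every random pair tried."""
    rng = random.Random(seed)
    for _ in range(trials):
        X = [rng.uniform(-100, 100) for _ in range(n)]
        Y = [rng.uniform(-100, 100) for _ in range(n)]
        XR, YR = reduce_series(X, m), reduce_series(Y, m)
        if l1(XR, YR) > l1(X, Y) + 1e-9:        # tolerance for rounding errors
            return False
        if linf(XR, YR) > linf(X, Y) + 1e-9:
            return False
    return True

print(check_properness(37, 5))
```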

Furthermore, we show that the dimensionality reduction function f preserves some basic geometric figures.

A line segment in the n-dimensional space is represented by its starting point A and its ending point B. Denote the line segment by A – B. Applying the dimensionality reduction function f to the points A and B, we get the reduced points $A^R$ and $B^R$. These points form a line segment in the m-dimensional space, denoted by $A^R$ – $B^R$. Our dimensionality reduction function preserves line segments.

Theorem 2: Line segments are invariant under the dimensionality reduction function f, i.e.

∀X ∈ A – B ⇒ $X^R$ ∈ $A^R$ – $B^R$.

Proof: The equation of the line passing through A and B in the n-dimensional space is:

$$\frac{x_1 - A[1]}{B[1] - A[1]} = \frac{x_2 - A[2]}{B[2] - A[2]} = \dots = \frac{x_n - A[n]}{B[n] - A[n]}.$$

As the point X belongs to the segment A – B, we have:

$$\frac{X[1] - A[1]}{B[1] - A[1]} = \frac{X[2] - A[2]}{B[2] - A[2]} = \dots = \frac{X[n] - A[n]}{B[n] - A[n]} = k, \quad \text{with } 0 \le k \le 1.$$

It means:

$$\begin{cases} X[1] - A[1] = k\,(B[1] - A[1]) \\ X[2] - A[2] = k\,(B[2] - A[2]) \\ \qquad\vdots \\ X[n] - A[n] = k\,(B[n] - A[n]) \end{cases} \qquad (5)$$

To show that $X^R$ ∈ $A^R$ – $B^R$, we have to prove:

$$\frac{X^{R}[1] - A^{R}[1]}{B^{R}[1] - A^{R}[1]} = \frac{X^{R}[2] - A^{R}[2]}{B^{R}[2] - A^{R}[2]} = \dots = \frac{X^{R}[m] - A^{R}[m]}{B^{R}[m] - A^{R}[m]} \qquad (6)$$

In fact, replacing the numerator and the denominator of the first fraction by formula (4) and using equalities (5), we obtain:

$$\begin{aligned} \frac{X^{R}[1] - A^{R}[1]}{B^{R}[1] - A^{R}[1]} &= \frac{\dfrac{X[1]+X[2]+\dots+X[q+1]}{q+1} - \dfrac{A[1]+A[2]+\dots+A[q+1]}{q+1}}{\dfrac{B[1]+B[2]+\dots+B[q+1]}{q+1} - \dfrac{A[1]+A[2]+\dots+A[q+1]}{q+1}} \\[1ex] &= \frac{(X[1]+X[2]+\dots+X[q+1]) - (A[1]+A[2]+\dots+A[q+1])}{(B[1]+B[2]+\dots+B[q+1]) - (A[1]+A[2]+\dots+A[q+1])} \\[1ex] &= \frac{(X[1]-A[1]) + (X[2]-A[2]) + \dots + (X[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} \\[1ex] &= \frac{k\,(B[1]-A[1]) + k\,(B[2]-A[2]) + \dots + k\,(B[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} = k. \end{aligned}$$

Analogously, each fraction in (6) is equal to k, so they are all equal, with 0 ≤ k ≤ 1. This proves the theorem.
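The same conclusion can be reached without dividing by differences of the form B[j] − A[j], which may be zero for some coordinates. The sketch below relies only on the fact that every component of f in (4) is an average, so f is a linear map. The parametrization X = (1−k)A + kB is the standard description of a segment and is an addition to the original argument.

```latex
% f is linear because each component of (4) is an average of coordinates.
\[
X = (1-k)\,A + k\,B,\quad 0 \le k \le 1
\;\Longrightarrow\;
X^{R} = f\bigl((1-k)\,A + k\,B\bigr)
      = (1-k)\,f(A) + k\,f(B)
      = (1-k)\,A^{R} + k\,B^{R}.
\]
```

Hence $X^R$ lies on the segment $A^R$ – $B^R$ with the same parameter k.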

Corollary 3: The dimensionality reduction function f as in (4) preserves polygons.

Note that spheres are not preserved by this dimensionality reduction function f.
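A small numeric example shows why spheres are not preserved. Two points on the unit sphere centred at the origin in the 4-dimensional space are mapped to points at different distances from the image of the centre. The points, the helper names and the choice n = 4, m = 2 are made up for illustration.

```python
import math

def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    q, d = divmod(len(T), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(T[start:start + size]) / size)
        start += size
    return reduced

def norm(P):
    """Euclidean distance of P from the origin."""
    return math.sqrt(sum(p * p for p in P))

P1 = [1.0, 0.0, 0.0, 0.0]                        # on the unit sphere
P2 = [math.sqrt(0.5), math.sqrt(0.5), 0.0, 0.0]  # also on the unit sphere

r1 = norm(reduce_series(P1, 2))   # image (0.5, 0), distance 0.5
r2 = norm(reduce_series(P2, 2))   # image (sqrt(0.5), 0), distance about 0.707
print(r1, r2)
```

The origin is mapped to the origin, so the images of the two points would have to be equally far from it if the image of the sphere were a sphere around the image of its centre.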

3 Application to minimum bounding rectangles

Consider a database consisting of many n-dimensional time-series data. Each n-dimensional time series corresponds to a point in the n-dimensional space. Construct the smallest rectangle in the n-dimensional space that contains these points. Such a rectangle is called a minimum bounding rectangle (MBR, for short) [6,7].

Moreover, for many objects we cannot know exact information; we only know that the information belongs to some interval. For example, the price of a stock item is represented by the bid price and the ask price, and the temperature of a region is represented by the lowest and the highest temperature. MBRs may be used for such data.

An MBR has $2^n$ vertices, where n is the dimensionality of the time-series data. To represent the rectangle we store only two time-series data corresponding to its lower-left and upper-right points, i.e. the point with the smallest coordinates and the point with the greatest coordinates. Denote these points by L[1..n] and U[1..n]. The corresponding n-dimensional MBR is denoted by [L,U].
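The two corner points of an MBR are the coordinate-wise minimum and maximum of the points it encloses. A minimal sketch follows; the function name `mbr` and the sample data are hypothetical.

```python
def mbr(points):
    """Return (L, U): coordinate-wise minimum and maximum of the given points."""
    L = [min(coords) for coords in zip(*points)]
    U = [max(coords) for coords in zip(*points)]
    return L, U

series = [[3, 7, 2, 9], [4, 5, 6, 1], [2, 8, 3, 4]]
L, U = mbr(series)
print(L, U)   # [2, 5, 2, 1] [4, 8, 6, 9]
```

Storing only L and U takes 2n numbers, whereas listing all vertices of the rectangle would take far more.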

Let F be a dimensionality reduction function. Applying this function to an n-dimensional MBR [L,U] by reducing only the two vertex points L and U, we obtain two new points $L^F$ and $U^F$. These points form an MBR [$L^F$,$U^F$]. So when does the dimensionality reduction function F transform the high-dimensional MBR [L,U] into the low-dimensional MBR [$L^F$,$U^F$]? The following definition of an MBR-safe transformation was introduced in [6].

Definition 3: A transform F is MBR-safe if it satisfies the following requirement: for any n-dimensional time-series data X and any n-dimensional MBR [L,U],

X ∈ [L,U] ⇒ $X^F$ ∈ [$L^F$,$U^F$].

The safety of the transformation f constructed as in (4) is asserted by the following theorem.

Theorem 4: The dimensionality reduction transformation f constructed as in (4) is MBR-safe.

Proof:

Let X ∈ [L,U]. By the definition of an MBR, we have the following n double inequalities:

L[k] ≤ X[k] ≤ U[k], ∀k = 1, 2, …, n.

Adding the first q+1 double inequalities and dividing the totals by q+1 we get:

$$\frac{L[1]+L[2]+\dots+L[q+1]}{q+1} \le \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} \le \frac{U[1]+U[2]+\dots+U[q+1]}{q+1}.$$

This means $L^R[1] \le X^R[1] \le U^R[1]$.

Analogously for the remaining parts, we obtain:

$L^R[i] \le X^R[i] \le U^R[i]$, ∀i = 1, 2, …, m.

So $X^R$ ∈ [$L^R$,$U^R$].
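MBR-safety can be illustrated by reducing only the two corner points and testing that the reduced point stays inside the reduced rectangle. The sketch below draws random rectangles and points; all names, value ranges and the floating-point tolerance are assumptions made for illustration.

```python
import random

def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    q, d = divmod(len(T), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(T[start:start + size]) / size)
        start += size
    return reduced

def contains(L, U, X, tol=1e-9):
    """True if L[k] <= X[k] <= U[k] holds for every coordinate (up to tol)."""
    return all(l - tol <= x <= u + tol for l, x, u in zip(L, X, U))

def check_mbr_safe(n, m, trials=1000, seed=0):
    """Return True if every random X in [L, U] had its image in [L^R, U^R]."""
    rng = random.Random(seed)
    for _ in range(trials):
        L = [rng.uniform(-10, 10) for _ in range(n)]
        U = [l + rng.uniform(0, 5) for l in L]           # U[k] >= L[k]
        X = [rng.uniform(l, u) for l, u in zip(L, U)]    # X inside [L, U]
        LR, UR, XR = (reduce_series(P, m) for P in (L, U, X))
        if not contains(LR, UR, XR):
            return False
    return True

print(check_mbr_safe(50, 6))
```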

Corollary 3 and Theorem 4 show that the dimensionality reduction transformation f preserves planar geometric blocks represented by line segments. They point to an important role for the transformation f in computer graphics and image processing.

The dimensionality reduction transformation f can also be applied to a matching problem in which the time-series data and the queries have different dimensions.
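The text does not say how series and queries of different lengths are to be compared. One plausible reading, sketched here as an assumption rather than as the author's procedure, is to reduce both to the same m and compare the reduced sequences. The name `reduce_series` and the sample data are hypothetical.

```python
def reduce_series(T, m):
    """Formula (4): average each of the m nearly equal parts of T."""
    q, d = divmod(len(T), m)
    reduced, start = [], 0
    for i in range(m):
        size = q + 1 if i < d else q
        reduced.append(sum(T[start:start + size]) / size)
        start += size
    return reduced

def l1(X, Y):
    """Manhattan distance between two sequences of equal length."""
    return sum(abs(x - y) for x, y in zip(X, Y))

data = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]    # 12-dimensional data sequence
query = [1, 1, 2, 2, 3, 3, 4, 4]               # 8-dimensional query sequence

m = 4
print(l1(reduce_series(data, m), reduce_series(query, m)))   # 0.0
```

Here both sequences describe the same staircase shape sampled at different rates, and both reduce to [1.0, 2.0, 3.0, 4.0]. The lower-bounding guarantee of Theorem 1 is stated for sequences of equal dimension n and is not claimed for this mixed-length comparison.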

4 Conclusions

In this paper we have presented a dimensionality reduction transformation for multidimensional time-series data. The transformation is proper, MBR-safe and simpler than the existing transformations in [6,7]. It can therefore be applied to storing large databases of multidimensional time-series data, and to searching, matching and data mining. The dimensionality reduction can be performed in parallel, so the time it takes decreases. In further research we will apply the MBR-safe transformation to multimedia data retrieval and GIS. Furthermore, the dimensionality reduction preserves planar geometric blocks; hence it may be used in computer graphics and image processing as well.

Acknowledgment

This work was supported by Vietnam National University, Hanoi.

References

[1] R. Agrawal, C. Faloutsos and A. Swami, Efficient similarity search in sequence databases, Proceedings of the 4th Int’l Conference on Foundations of Data Organization and Algorithms, USA, 1993, pp. 69-84.

[2] C. Faloutsos, M. Ranganathan and Y. Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 1994, pp. 419-429.

[3] E. Keogh, K. Chakrabarti, S. Mehrotra and M. Pazzani, Locally adaptive dimensionality reduction for indexing large time-series databases, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 2001, pp. 151-162.

[4] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra, Dimensionality reduction for fast similarity search in large time-series database, Journal of Knowledge and Information Systems, Nr. 3 (3), 2000, pp. 263-286.

[5] Y. S. Moon, K. Y. Whang and W. S. Han, General Match: A subsequence matching method in time-series databases based on generalized windows, Proceedings of the Int’l Conference on Management of Data, ACM SIGMOD - 2002, pp. 382-393.

[6] Y. S. Moon, An MBR-safe transformation for high-dimensional MBRs in similar sequence matching, Proceedings of the Int’l Conference on Database Systems for Advanced Applications, Thailand, 2007.

[7] Y. S. Moon and J. Kim, A theoretical study on MBR-safe transformations, Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Italy, 2007.

[8] H. C. Thanh, Transforming sequential processes of a net system into concurrent ones, International Journal of Knowledge-based and Intelligent Engineering Systems, IOS Press, Amsterdam, Vol. 11, Nr. 6, 2007, pp. 391-397.
