DSpace at VNU: A parallel dimensionality reduction for time-series data and some of its applications


A parallel dimensionality reduction for time-series data and some of its applications

Hoang Chi Thanh*

Department of Informatics, Hanoi University of Science, VNUH,

334 Nguyen Trai Rd., Hanoi, Vietnam
E-mail: thanhhc@vnu.vn

*Corresponding author

Nguyen Quang Thanh

Da Nang Department of Information and Communication,

15 Quang Trung Str., Da Nang, Vietnam
E-mail: thanhnq@dsp.vn

Abstract: Subsequence matching in a large time-series database is an interesting problem. Many methods have been proposed that cope with this problem to an adequate extent. One good idea is to properly reduce the dimensionality of time-series data.

In this paper, we propose a new method to reduce the dimensionality of high-dimensional time-series data. The method is simpler than existing ones based on the discrete Fourier transform and the discrete cosine transform.

Furthermore, our dimensionality reduction may be executed in parallel. The method is applied to the time-series data matching problem, where it drastically decreases the complexity of the corresponding algorithm. The method preserves planar geometric blocks and can also be applied to minimum bounding rectangles.

Keywords: time-series data; database; dimensionality reduction; matching problem; minimum bounding rectangle; MBR.

Reference to this paper should be made as follows: Thanh, H-C. and Thanh, N-Q. (2011) ‘A parallel dimensionality reduction for time-series data and some of its applications’, Int. J. Intelligent Information and Database Systems, Vol. 5, No. 1, pp.39–48.

Biographical notes: Hoang Chi Thanh is an Associate Professor at Hanoi University of Science, Vietnam. He received his PhD in Computer Science from Warsaw Technical University, Poland and his BSc in Computational Mathematics from The University of Hanoi, Vietnam. Since 1974 he has been working for The University of Hanoi (currently Hanoi University of Science). From 2000 to 2008 he was the Head of the Department of Informatics. Since 2004 he has been the Director of Science Co., Ltd. He has published more than 40 refereed papers and eight books. He is the supervisor of three PhD students. His current research interests include concurrency theory, combinatorics, data mining and knowledge-based systems.

Nguyen Quang Thanh is a PhD student at Hanoi University of Science, Vietnam. He received his MSc in Information Technology and his BSc in Mathematics from Can Tho University, Vietnam. Since 1999 he has been working for Da Nang Department of Information and Communication, Vietnam. His research interests include data mining, knowledge-based systems and network security.

1 Introduction

Time-series data are sequences of real numbers representing values at specific points in time. For example, the bid and ask prices of stock items, exchange rates, weather data and human speech signals are typical time-series data. The data stored in a database are called data sequences. The aim of the subsequence matching problem in a large time-series database is to find the data sequences in the database that are similar to a given query sequence. This problem has attracted a lot of interest because of its applications.

Many methods have been proposed that cope with this problem to an adequate extent (Agrawal et al., 1993; Keogh et al., 2000; Keogh et al., 2001; Faloutsos et al., 2001; Moon et al., 2002). One good idea for increasing the matching speed is a proper dimensionality reduction for high-dimensional time-series data. In 2007, Moon proposed a data transformation based on the discrete Fourier transform, and then Moon and Kim presented a data transformation based on the discrete cosine transform.

In this paper we present another dimensionality reduction for high-dimensional time-series data. The method splits a high-dimensional time-series data into parts as equal in time scale as possible and then takes the average of each part. The reduction is simpler than the existing ones mentioned above and it may be performed in parallel. This method therefore decreases the time for ‘narrowing’ data and speeds up the matching process in a large time-series database. We also use this dimensionality reduction for a special type of time-series data: minimum bounding rectangles (MBR).

This paper is organised as follows. In Section 2 we present a dimensionality reduction function for high-dimensional time-series data and point out some of its properties. Section 3 presents applications of the dimensionality reduction function to time-series data matching and to MBRs. When applying this reduction function to MBRs we show that it is safe. Some concluding remarks are given in the last section.

2 Dimensionality reduction for time-series data

Let T[1..n] be a time-series data. The time-series data consists of n real numbers, so it is called an n-dimensional data.

The higher the dimensionality n of time-series data, the more difficult the data are to store, search and match. So the question is how to ‘narrow’ the data. In other words, we have to construct an operation which transforms a high-dimensional time-series data with hundreds or thousands of dimensions into a low-dimensional time-series data with only a few dimensions. Instead of working on high-dimensional time-series data, one can then do the same work on low-dimensional time-series data with high performance. To do so, we construct dimensionality reduction functions for time-series data. Each such function is indeed a mapping $F: \mathbb{R}^n \to \mathbb{R}^m$.

Let F be any dimensionality reduction function transforming n-dimensional time-series data to m-dimensional time-series data, with 0 < m < n. We are interested only in those functions that satisfy the following requirement.

Definition 2.1: A dimensionality reduction function F is proper if for any pair of n-dimensional time-series data X and Y:

$$D_m(F(X), F(Y)) \le D_n(X, Y) \tag{2.1}$$

where $D_n$ and $D_m$ are the distance functions of the n-dimensional space and the m-dimensional space, respectively.

So each proper dimensionality reduction function on time-series data is a shrinking mapping. The properness of a reduction function guarantees no false dismissals for range queries: if two data lie within a given distance of each other in the n-dimensional space, their reduced images lie within the same distance in the m-dimensional space, so no true answer is filtered out.

Let T[1..n] be an n-dimensional time-series data and let m be a positive integer such that 0 < m << n. Moon (2007) and Moon and Kim (2007) have constructed two dimensionality reduction functions for T[1..n], based on the discrete Fourier transform and the discrete cosine transform, that give the m-dimensional time-series data $T^{RF}[1..m]$ and $T^{RC}[1..m]$ as follows.

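1 the dimensionality reduction function based on the discrete Fourier transform is:

$$T^{RF}[i] = \begin{cases} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{j=1}^{n} T[j] \cos\left(-2\pi \lfloor (i-1)/2 \rfloor (j-1)/n\right), & \text{if } i \text{ is odd;} \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{j=1}^{n} T[j] \sin\left(-2\pi \lfloor (i-1)/2 \rfloor (j-1)/n\right), & \text{if } i \text{ is even,} \end{cases} \qquad 1 \le i \le m \tag{2.2}$$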

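2 the dimensionality reduction function based on the discrete cosine transform is:

$$T^{RC}[i] = \sqrt{\frac{2}{n}}\, c(i) \sum_{j=1}^{n} T[j] \cos\left(\frac{(2j-1)(i-1)\pi}{2n}\right), \qquad 1 \le i \le m \tag{2.3}$$

where

$$c(i) = \begin{cases} \sqrt{2}/2, & \text{if } i = 1; \\ 1, & \text{if } 2 \le i \le m. \end{cases}$$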

We now return to an n-dimensional time-series data T[1..n]. To reduce the dimensionality of the data we split it into m parts as equal in time scale as possible. This can always be done because of the following arithmetic fact:

For two positive integers n and m with 0 < m < n, there exist two non-negative integers q and d such that n = d·(q + 1) + (m – d)·q.

The proof of this fact is very simple. Choose q = n div m and d = n mod m. We get n = m·q + d = d·q + d + m·q – d·q = d·(q + 1) + (m – d)·q.
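The arithmetic fact can be checked with a few lines of Python. This is only an illustrative sketch: the function name `part_sizes` is hypothetical and does not come from the paper.

```python
def part_sizes(n, m):
    """Return the sizes of the m parts: d parts of size q + 1, then m - d parts of size q."""
    if not 0 < m < n:
        raise ValueError("expected 0 < m < n")
    q, d = divmod(n, m)  # q = n div m, d = n mod m
    return [q + 1] * d + [q] * (m - d)


sizes = part_sizes(10, 3)
print(sizes)       # [4, 3, 3]
print(sum(sizes))  # 10
```

For n = 10 and m = 3 we have q = 3 and d = 1, so one part has four elements and the other two have three each.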

The above fact offers us a method to partition an n-dimensional time-series data into the following m parts: the first d parts have size q + 1 and the remaining m – d parts have size q. Then we take the average of each part. So we are able to transform an n-dimensional time-series data into an m-dimensional time-series data.

Let q = n div m and d = n mod m.

Definition 2.2: The m-dimensional time-series data $T^R[1..m]$ constructed as follows:

$$T^R[i] = \begin{cases} \dfrac{1}{q+1} \displaystyle\sum_{j=(i-1)(q+1)+1}^{i(q+1)} T[j], & \text{if } 1 \le i \le d; \\[2ex] \dfrac{1}{q} \displaystyle\sum_{j=d+(i-1)q+1}^{d+iq} T[j], & \text{if } d+1 \le i \le m \end{cases} \tag{2.4}$$

is called a reduced m-dimensional time-series data of the n-dimensional time-series data T[1..n].
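A minimal Python sketch of the reduction in Definition 2.2 follows. The function name `reduce_series` is hypothetical, and the code uses 0-based list indexing, whereas the paper counts from 1.

```python
def reduce_series(t, m):
    """Piecewise-average reduction of the list t to m values, following formula (2.4)."""
    n = len(t)
    if not 0 < m < n:
        raise ValueError("expected 0 < m < len(t)")
    q, d = divmod(n, m)  # q = n div m, d = n mod m
    reduced = []
    start = 0
    for i in range(m):
        size = q + 1 if i < d else q  # the first d parts are one element longer
        reduced.append(sum(t[start:start + size]) / size)
        start += size
    return reduced


print(reduce_series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))  # [2.5, 6.0, 9.0]
```

Here the ten values are split into the parts [1, 2, 3, 4], [5, 6, 7] and [8, 9, 10], whose averages form the reduced data.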

Formula (2.4) gives us a function transforming n-dimensional time-series data into m-dimensional time-series data. This transformation may be used to store large databases of multi-dimensional time-series data. It saves memory and increases the matching speed. Moreover, our dimensionality reduction may be performed in parallel (Thanh, 2007; Thanh, 2009), since each reduced component depends only on its own part of the original data. The time for building the reduced database will then be drastically decreased.
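The paper does not spell out a parallel scheme at this point, so the following is only one possible sketch: it assumes that the series of a database are reduced independently of each other and distributes them over a thread pool from Python's standard library. The names `reduce_series`, `reduce_database` and the `workers` parameter are hypothetical. In CPython, threads mainly illustrate the independence of the tasks; a process pool or vectorised code would be needed for a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_series(t, m):
    """Piecewise-average reduction following formula (2.4), 0-based indexing."""
    q, d = divmod(len(t), m)
    # bounds[i] is the start index of part i; the first d parts have size q + 1
    bounds = [i * (q + 1) if i <= d else d + i * q for i in range(m + 1)]
    return [sum(t[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def reduce_database(db, m, workers=4):
    """Reduce every series of db to m dimensions; the series are processed independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: reduce_series(t, m), db))


db = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
print(reduce_database(db, 2))  # [[2.0, 4.5], [4.0, 1.5]]
```

The averages of the m parts of a single series are independent as well, so the same idea could be applied inside `reduce_series`.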

Theorem 2.1: The dimensionality reduction function f constructed as in formula (2.4) is proper.

Proof: Let X[1..n] and Y[1..n] be two n-dimensional time-series data. The distance function used here is the $L_1$ distance, also called the Manhattan distance or city block distance. So,

$$D_n(X, Y) = \sum_{k=1}^{n} |X[k] - Y[k]|$$

and

$$D_m(f(X), f(Y)) = \sum_{i=1}^{m} |X^R[i] - Y^R[i]|,$$

where $X^R[i]$ and $Y^R[i]$ are the corresponding components of the m-dimensional time-series data transformed by formula (2.4).

To prove the properness of the function f we check the inequality (2.1) on each part of the split in Definition 2.2. On the first part we have:

$$\begin{aligned}
& |X[1]-Y[1]| + |X[2]-Y[2]| + \dots + |X[q+1]-Y[q+1]| \\
&\ge |X[1]-Y[1] + X[2]-Y[2] + \dots + X[q+1]-Y[q+1]| \\
&= |(X[1]+X[2]+\dots+X[q+1]) - (Y[1]+Y[2]+\dots+Y[q+1])| \\
&\ge \left| \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} - \frac{Y[1]+Y[2]+\dots+Y[q+1]}{q+1} \right| \\
&= |X^R[1] - Y^R[1]|.
\end{aligned}$$

Proving analogously for the remaining parts and then adding up both sides of the inequalities, we get the inequality (2.1).

Note that the properness of the dimensionality reduction function f can also be proved when the maximum distance

$$D_n(X, Y) = \max_{1 \le k \le n} |X[k] - Y[k]|$$

is chosen as the distance function.
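The properness inequality (2.1) can be spot-checked numerically for both the $L_1$ distance and the maximum distance. The sketch below is not a proof; it only tests random inputs. All function names are hypothetical, and a small tolerance `eps` is an added assumption to absorb floating-point rounding.

```python
import random


def reduce_series(t, m):
    """Piecewise-average reduction following formula (2.4), 0-based indexing."""
    q, d = divmod(len(t), m)
    bounds = [i * (q + 1) if i <= d else d + i * q for i in range(m + 1)]
    return [sum(t[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def l1(x, y):
    """Manhattan (city block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def linf(x, y):
    """Maximum distance."""
    return max(abs(a - b) for a, b in zip(x, y))


def is_shrinking(x, y, m, dist, eps=1e-9):
    """Check D_m(f(X), f(Y)) <= D_n(X, Y), up to a rounding tolerance."""
    return dist(reduce_series(x, m), reduce_series(y, m)) <= dist(x, y) + eps


rng = random.Random(0)
ok = all(
    is_shrinking(
        [rng.uniform(-10, 10) for _ in range(n)],
        [rng.uniform(-10, 10) for _ in range(n)],
        m,
        dist,
    )
    for n, m in [(10, 3), (100, 7), (257, 16)]
    for dist in (l1, linf)
)
print(ok)
```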

Furthermore, we show that the dimensionality reduction function f preserves some basic geometric figures.

A line segment in the n-dimensional space is represented by its starting point A and ending point B. Denote the line segment by A – B. Applying the dimensionality reduction function f to the points A and B we get the reduced points $A^R$ and $B^R$. These points form a line segment in the m-dimensional space, denoted by $A^R$ – $B^R$. Our dimensionality reduction function preserves the line.

Theorem 2.2: Line segments are invariant under the dimensionality reduction function f, i.e.,

$$\forall X: \; X \in A - B \;\Rightarrow\; X^R \in A^R - B^R$$

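Proof: The equation of the line passing through A and B in the n-dimensional space is:

$$\frac{X[1]-A[1]}{B[1]-A[1]} = \frac{X[2]-A[2]}{B[2]-A[2]} = \dots = \frac{X[n]-A[n]}{B[n]-A[n]}$$

As the point X belongs to the line segment A – B, we have:

$$\frac{X[1]-A[1]}{B[1]-A[1]} = \frac{X[2]-A[2]}{B[2]-A[2]} = \dots = \frac{X[n]-A[n]}{B[n]-A[n]} = k,$$

with 0 ≤ k ≤ 1.

It means,

$$\begin{cases} X[1]-A[1] = k\,(B[1]-A[1]) \\ X[2]-A[2] = k\,(B[2]-A[2]) \\ \qquad\vdots \\ X[n]-A[n] = k\,(B[n]-A[n]) \end{cases} \tag{2.5}$$

To show that $X^R \in A^R - B^R$ we have to prove:

$$\frac{X^R[1]-A^R[1]}{B^R[1]-A^R[1]} = \frac{X^R[2]-A^R[2]}{B^R[2]-A^R[2]} = \dots = \frac{X^R[m]-A^R[m]}{B^R[m]-A^R[m]} \tag{2.6}$$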

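In fact, replacing the numerator and the denominator of the first fraction by means of formula (2.4) and using the equalities (2.5) we obtain:

$$\begin{aligned}
\frac{X^R[1]-A^R[1]}{B^R[1]-A^R[1]}
&= \frac{\frac{1}{q+1}(X[1]+X[2]+\dots+X[q+1]) - \frac{1}{q+1}(A[1]+A[2]+\dots+A[q+1])}{\frac{1}{q+1}(B[1]+B[2]+\dots+B[q+1]) - \frac{1}{q+1}(A[1]+A[2]+\dots+A[q+1])} \\
&= \frac{(X[1]-A[1]) + (X[2]-A[2]) + \dots + (X[q+1]-A[q+1])}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} \\
&= \frac{k\,[(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])]}{(B[1]-A[1]) + (B[2]-A[2]) + \dots + (B[q+1]-A[q+1])} = k.
\end{aligned}$$

Analogously, we show that each fraction in (2.6) is equal to k. So they are all identical.

This proves the theorem.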
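The statement of Theorem 2.2 can be illustrated numerically: a point X = A + k(B – A) on the segment is reduced to the point $A^R + k(B^R - A^R)$ with the same parameter k. The sketch below uses hypothetical names and sample values chosen only for illustration.

```python
import math


def reduce_series(t, m):
    """Piecewise-average reduction following formula (2.4), 0-based indexing."""
    q, d = divmod(len(t), m)
    bounds = [i * (q + 1) if i <= d else d + i * q for i in range(m + 1)]
    return [sum(t[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


A = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
B = [7.0, 5.0, 3.0, 1.0, -1.0, -3.0, -5.0]
k = 0.25  # position of X on the segment A - B, with 0 <= k <= 1
X = [a + k * (b - a) for a, b in zip(A, B)]

m = 3
AR, BR, XR = (reduce_series(v, m) for v in (A, B, X))

# X^R should coincide with the point at the same parameter k on A^R - B^R.
on_segment = all(
    math.isclose(x, a + k * (b - a), abs_tol=1e-12)
    for x, a, b in zip(XR, AR, BR)
)
print(on_segment)
```

The check works because taking averages is a linear operation, which is exactly what the proof exploits.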

Corollary 2.3: The dimensionality reduction function f as in (2.4) preserves polygons.

Note that spheres are not preserved by the dimensionality reduction function f.

3 Some applications

3.1 Application to matching problem

Finding all occurrences of a pattern in a database is the purpose of the matching problem. Searching for particular patterns in DNA sequences is a typical example. The problem is formalised in Cormen et al. (2001) as follows.

We assume that the database is an array S[1..l] of length l and that the pattern is an array P[1..k] of length k, with k ≤ l. We further assume that the elements of S and P belong to a finite set A.

We say that the pattern P occurs beginning at position q in the database S if 1 ≤ q ≤ l – k + 1 and S[q..q + k – 1] = P[1..k] (that is, if S[q + j – 1] = P[j] for 1 ≤ j ≤ k).

In the case when the elements of the set A are characters, Rabin and Karp have proposed a string-matching algorithm based on the assumption that each character is a digit in radix-d notation, where d = |A|. A string of h consecutive characters can then be viewed as representing a length-h number, and its value can be computed by using Horner’s rule. Instead of comparing S[q..q + k – 1] with P[1..k] directly, the algorithm compares their values to find candidates. Each candidate is then compared exactly with the pattern P[1..k] to determine the positions of the pattern’s occurrences.

Assume now that the elements of the database S and the pattern P are time-series data, where each time-series data consists of n real numbers. Thus, we can calculate the sum of each time-series data or the sum of several consecutive time-series data.

The matching process on a time-series database is divided into two steps. In the first step we calculate the sum of S[q..q + k – 1],

$$t = \sum_{i=q}^{q+k-1} \sum_{j=1}^{n} S[i][j],$$

by adding $\sum_{j=1}^{n} S[q+k-1][j]$ to the previous sum and then subtracting $\sum_{j=1}^{n} S[q-1][j]$, and compare the obtained sum with the sum of the pattern P. This step is called preprocessing.

Figure 1 Calculating the value of S[q..q + k – 1]

If a candidate is found (p = t, where p is the sum of the pattern and t is the sum of the current window) we move forward to the matching step. In this step we have to compare the candidate exactly with the pattern and print a notice if the comparison result is true.

Based on the idea of Rabin and Karp’s algorithm, we propose a matching algorithm for a time-series database as follows.

Algorithm 3.1 Time-series data matching

Begin

1 l ← length(S)

2 k ← length(P)

3 p ← $\sum_{i=1}^{k} \sum_{j=1}^{n} P[i][j]$

4 S[0][1..n] ← 0

5 t ← $\sum_{i=1}^{k-1} \sum_{j=1}^{n} S[i][j]$

6 for q ← 1 to l – k + 1 do

7 begin

8 t ← t + $\sum_{j=1}^{n} S[q+k-1][j]$ – $\sum_{j=1}^{n} S[q-1][j]$

9 if p = t

10 then if P[1..k] = S[q..q + k – 1]

11 then print ‘pattern occurs beginning at position’ q

12 end

End

Note that the computations of $\sum_{j=1}^{n} S[q+k-1][j]$ and $\sum_{j=1}^{n} S[q-1][j]$ in instruction (8) can be performed in parallel. So can the comparison P[1..k] = S[q..q + k – 1] in instruction (10). The complexity of the algorithm is O(l·n).
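A sequential Python sketch of Algorithm 3.1 is given below. It departs from the pseudocode in two labelled ways: the per-series sums are precomputed in a list `row` instead of being recomputed in instruction (8), and the test p = t uses a small tolerance `tol`, an added assumption, because rolling sums of real numbers accumulate rounding errors. The function name `match_time_series` is hypothetical.

```python
import math


def match_time_series(S, P, tol=1e-9):
    """Return the 1-based positions q at which the pattern P occurs in the database S."""
    l, k = len(S), len(P)
    if k == 0 or k > l:
        return []
    row = [sum(s) for s in S]        # sum of each series; the sums are independent
    p = sum(sum(x) for x in P)       # instruction 3: sum of the pattern
    t = sum(row[:k - 1])             # instruction 5: sum of the first k - 1 series
    hits = []
    for q in range(l - k + 1):       # q is 0-based here
        t += row[q + k - 1]          # add the series entering the window
        if q > 0:
            t -= row[q - 1]          # subtract the series leaving the window
        # instruction 9: cheap filter on the sums; instruction 10: exact comparison
        if math.isclose(p, t, rel_tol=tol, abs_tol=tol) and S[q:q + k] == P:
            hits.append(q + 1)       # report 1-based positions as in the paper
    return hits


S = [[1, 2], [3, 4], [1, 2], [3, 4], [5, 6]]
P = [[1, 2], [3, 4]]
print(match_time_series(S, P))  # [1, 3]
```

In this example the window starting at position 2 has the same sum as the pattern but fails the exact comparison, which shows why the second step is needed.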

We apply the dimensionality reduction transformation f constructed as in formula (2.4) to the time-series matching problem. Firstly, we reduce the dimensionality of the database S[1..l][1..n] and the pattern P[1..k][1..n] from n to m, with 0 < m << n. So we get a new database $S^R[1..l][1..m]$ and a new pattern $P^R[1..k][1..m]$ in the m-dimensional space. Then we substitute $S^R$ for S and $P^R$ for P in the above time-series data matching algorithm. The properness of the transformation f guarantees the correctness of this algorithm and no false dismissals for range queries. Furthermore, the complexity of the time-series data matching algorithm is drastically decreased, by the ratio m/n.

The dimensionality reduction transformation f can be applied to a matching problem even when the time-series database and the pattern have different dimensions. In that case we first reduce the dimensionalities of the time-series database and the pattern to the same dimension and then do the matching on the new time-series database and the new pattern.
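The role of properness in range queries can be sketched as a filter-and-refine search. The paper does not give this procedure explicitly, so the following rests on stated assumptions: the distance between a window of the database and the pattern is taken to be the sum of the $L_1$ distances of the aligned series, the query threshold is `eps`, and a small `slack` absorbs floating-point rounding in the filter step. All names are hypothetical.

```python
import random


def reduce_series(t, m):
    """Piecewise-average reduction following formula (2.4), 0-based indexing."""
    q, d = divmod(len(t), m)
    bounds = [i * (q + 1) if i <= d else d + i * q for i in range(m + 1)]
    return [sum(t[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def window_dist(S, P, q):
    """Assumed window distance: sum of L1 distances between aligned series."""
    return sum(l1(S[q + i], P[i]) for i in range(len(P)))


def range_match(S, P, m, eps, slack=1e-9):
    """1-based positions whose window lies within eps of P, found via the reduced data."""
    SR = [reduce_series(s, m) for s in S]
    PR = [reduce_series(p, m) for p in P]
    positions = range(len(S) - len(P) + 1)
    # Filter in m dimensions: by properness no true answer is dismissed here.
    candidates = [q for q in positions if window_dist(SR, PR, q) <= eps + slack]
    # Refine in n dimensions: discard the false alarms.
    return [q + 1 for q in candidates if window_dist(S, P, q) <= eps]


rng = random.Random(42)
S = [[rng.uniform(0, 1) for _ in range(16)] for _ in range(30)]
# The pattern is a slightly perturbed copy of the three series starting at position 10.
P = [[v + rng.uniform(-0.01, 0.01) for v in s] for s in S[9:12]]
print(range_match(S, P, m=4, eps=0.5))
```

The expensive n-dimensional distance is computed only for the candidates that survive the cheaper m-dimensional filter.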

3.2 Application to MBR

Consider a database consisting of many n-dimensional time-series data. Each n-dimensional time-series data corresponds to a point in the n-dimensional space. Construct the smallest rectangle in the n-dimensional space that contains these points. Such a rectangle is called an MBR (Moon, 2007; Moon and Kim, 2007).

Moreover, for many objects we cannot know exact information about them. We only know that the information belongs to some interval. For example, the price of a stock item is represented by the bid price and the ask price, and the temperature of a region is represented by the lowest and the highest temperature. MBRs may be used for these data.

An MBR has $2^n$ vertex points, where n is the dimensionality of the time-series data. To represent the data rectangle we store only two time-series data corresponding to its lower-left and upper-right points, i.e., the point with the smallest coordinates and the point with the greatest coordinates. Denote these points by L[1..n] and U[1..n]. The corresponding n-dimensional MBR is denoted by [L, U].
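Computing the two stored points of an MBR from a set of time-series data takes a coordinate-wise minimum and maximum. The following sketch uses the hypothetical helper names `mbr` and `contains`.

```python
def mbr(points):
    """Return (L, U): the coordinate-wise minimum and maximum of the given points."""
    L = [min(column) for column in zip(*points)]
    U = [max(column) for column in zip(*points)]
    return L, U


def contains(L, U, x):
    """True if the point x lies in the MBR [L, U]."""
    return all(lo <= v <= hi for lo, v, hi in zip(L, x, U))


points = [[1, 5, 3], [2, 4, 6], [0, 7, 5]]
L, U = mbr(points)
print(L, U)                       # [0, 4, 3] [2, 7, 6]
print(contains(L, U, [1, 6, 4]))  # True
```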

Let F be a dimensionality reduction function. Applying the function to an n-dimensional MBR [L, U] by reducing only the two vertex points L and U, we obtain two new points $L^F$ and $U^F$. These points form a new MBR $[L^F, U^F]$. So when does the dimensionality reduction function F transform the high-dimensional MBR [L, U] into the low-dimensional MBR $[L^F, U^F]$? The following definition of an MBR-safe transformation was introduced in Moon (2007).

Definition 3.1: A transformation F is MBR-safe if it satisfies the following requirement: for any n-dimensional time-series data X and any n-dimensional MBR [L, U],

$$X \in [L, U] \;\Rightarrow\; X^F \in [L^F, U^F]$$

Figure 2 An MBR-safe transformation

The safety of the transformation f constructed as in (2.4) is asserted by the following theorem.

Theorem 3.2: The dimensionality reduction transformation f constructed as in formula (2.4) is MBR-safe.

Proof: By the definition of an MBR we have the following n double inequalities:

$$L[i] \le X[i] \le U[i], \quad i = 1, 2, \dots, n.$$

Adding the first q + 1 double inequalities and dividing the totals by q + 1 we get:

$$\frac{L[1]+L[2]+\dots+L[q+1]}{q+1} \le \frac{X[1]+X[2]+\dots+X[q+1]}{q+1} \le \frac{U[1]+U[2]+\dots+U[q+1]}{q+1}.$$

This means $L^R[1] \le X^R[1] \le U^R[1]$.

Analogously for the remaining parts, we obtain:

$$L^R[i] \le X^R[i] \le U^R[i], \quad i = 1, 2, \dots, m.$$

So $X^R \in [L^R, U^R]$.
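MBR-safety can also be spot-checked on random data: every point drawn inside [L, U] should be reduced to a point inside the reduced rectangle. This is a numerical illustration under assumed sample sizes, not a substitute for the proof. The names are hypothetical and the tolerance `eps` is an added assumption for floating-point rounding.

```python
import random


def reduce_series(t, m):
    """Piecewise-average reduction following formula (2.4), 0-based indexing."""
    q, d = divmod(len(t), m)
    bounds = [i * (q + 1) if i <= d else d + i * q for i in range(m + 1)]
    return [sum(t[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def in_mbr(L, U, x, eps=1e-12):
    """True if x lies in [L, U], up to a rounding tolerance."""
    return all(lo - eps <= v <= hi + eps for lo, v, hi in zip(L, x, U))


rng = random.Random(1)
n, m = 50, 6
L = [rng.uniform(-5, 0) for _ in range(n)]
U = [lo + rng.uniform(0, 5) for lo in L]
LR, UR = reduce_series(L, m), reduce_series(U, m)

safe = all(
    in_mbr(LR, UR, reduce_series([rng.uniform(lo, hi) for lo, hi in zip(L, U)], m))
    for _ in range(1000)
)
print(safe)
```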

Corollary 2.3 and Theorem 3.2 show that the dimensionality reduction transformation f preserves planar geometric blocks represented by line segments. They point to an important role for the transformation f in computer graphics and image processing.

4 Conclusions

In this paper we present a dimensionality reduction transformation for multi-dimensional time-series data and some of its applications to the matching problem and to MBRs. The transformation is proper, MBR-safe and simpler than the existing transformations in Moon et al. (2002) and Moon (2007). Therefore, it may be applied to storing large databases of multi-dimensional time-series data and to searching, matching and data mining. The dimensionality reduction can be performed in parallel, so the time needed for it is decreased.

In further research we will apply the MBR-safe transformation to multimedia data retrieval and GIS. Furthermore, the dimensionality reduction preserves planar geometric blocks. Hence, it may be used in computer graphics and image processing as well.

Acknowledgements

A part of this paper was presented at the 1st Asian Conference on Intelligent Information and Database Systems held in Dong Hoi, Vietnam in April 2009.

The authors are thankful to Vietnam National University, Hanoi for providing support for this research (Project QG-09-01).

References

Agrawal, R., Faloutsos, C. and Swami, A. (1993) ‘Efficient similarity search in sequence databases’, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, USA, pp.69–84.

Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. (2001) Introduction to Algorithms, The MIT Press.

Faloutsos, C., Ranganathan, M. and Manolopoulos, Y. (2001) ‘Fast subsequence matching in time-series databases’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.419–429.

Keogh, E., Chakrabarti, K., Pazzani, M. and Mehrotra, S. (2000) ‘Dimensionality reduction for fast similarity search in large time-series database’, Journal of Knowledge and Information Systems, Vol. 3, No. 3, pp.263–286.

Keogh, E., Chakrabarti, K., Mehrotra, S. and Pazzani, M. (2001) ‘Locally adaptive dimensionality reduction for indexing large time-series databases’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.151–162.

Moon, Y.S. (2007) ‘An MBR-safe transformation for high-dimensional MBRs in similar sequence matching’, Proceedings of the International Conference on Database Systems for Advanced Applications, Thailand.

Moon, Y.S. and Kim, J. (2007) ‘A theoretical study on MBR-safe transformations’, Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Italy.

Moon, Y.S., Whang, K.Y. and Han, W.S. (2002) ‘General match: a subsequence matching method in time-series databases based on generalized windows’, Proceedings of the International Conference on Management of Data, ACM SIGMOD, pp.382–393.

Thanh, H.C. (2007) ‘Transforming sequential processes of a net system into concurrent ones’, International Journal of Knowledge-based and Intelligent Engineering Systems, Vol. 11, No. 6, pp.391–397.

Thanh, H.C. (2009) ‘Parallel dimensionality reduction transformation for time-series data’, in Ngoc Thanh Nguyen, Huynh Phan Nguyen and Adam Grzech (Eds.): ACIIDS 2009, IEEE Computer Society, pp.104–108.
