
2016 IEEE International Conference on Knowledge Engineering and Applications

A Novel Hybrid Method for Time Series Subsequence Join

Vo Duc Vinh, Nguyen Phuc

Faculty of Information Technology

Ton Duc Thang University

Vietnam
e-mail: vdvinh@ittdt.edu.vn, 51303240@studenttdt.edu.vn

Abstract: The exact method JOCOR, proposed by Mueen et al., is the first method for joining two time series on subsequence correlation. Although JOCOR has time complexity $O(n^2 \lg n)$, where n is the length of the time series, it is still time-consuming even for medium-size time series. In this paper, we propose a hybrid method which runs faster than JOCOR. Our method consists of four main steps. First, a list of subsequences is extracted from the raw time series based on important extrema. Second, we apply a nested loop join using a sliding window and Dynamic Time Warping distance to find all the matching subsequences in the two time series. Third, we concatenate all matching subsequences whose indexes are adjacent into longer ones and find the pair of subsequences which has the smallest distance between them. Finally, we apply JOCOR to find the most correlated segments in the two time series. In comparison to JOCOR, our hybrid method performs much faster while retaining high accuracy.

Keywords: correlation coefficients, dynamic time warping, important extrema, JOCOR, nested loop join, subsequence join, time series

I. INTRODUCTION

Subsequence join over time series is considered one of the basic problems in time series data mining. This problem appears in many practical applications such as entertainment, meteorology, economy, finance, medicine, and engineering [6]. Subsequence join over time series can be defined in different ways. The first definition of subsequence join is based on a nested-loop algorithm and some distance function. This joining approach returns all pairs of subsequences drawn from two time series that satisfy a given similarity threshold. The second definition of subsequence join is to find join segments with maximum correlation coefficient and requires only one parameter, the minimum length of the segments.

The approach based on the first definition has some disadvantages, such as time overhead, because the distance function is called many times over a large number of iterations. To reduce the runtime of the nested-loop algorithm, some works approximately estimate the similarity between two time series by dividing the time series into segments. Following this approach, Lin and McCool [6] introduced solutions for joining two time series based on a non-uniform segmentation and a similarity function over a feature set. Their method is not only difficult to implement but also has high computational complexity, especially for large time series data. To avoid these drawbacks, in our previous work [11], we proposed a method, called EP-M, for time series subsequence join which uses important extrema to segment the time series. We then apply a nested loop join approach which uses a sliding window and Dynamic Time Warping (DTW) distance to find all the matching subsequences in the two time series within a similarity threshold. Although this method executes very fast, it is an approximate method and may have some false dismissals because it might ignore some data points when shifting the sliding window several points at a time.

Duong Tuan Anh

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Vietnam National University

Vietnam
e-mail: dtanh@cse.hcmut.edu.vn

Another recent work on subsequence join, which uses the second definition and was proposed by Mueen et al. in 2014 [7], can find the exact correlated subsequence. Mueen et al. introduced an exhaustive search method, JOCOR, for discovering the most correlated subsequence based on maximizing Pearson's correlation coefficient in two given time series. Although the authors incorporated several speed-up techniques to reduce the complexity from $O(n^4)$ to $O(n^2 \lg n)$, where n is the length of the two time series, the runtime of JOCOR is still unacceptable even for many time series datasets of moderate size.

In this paper, we propose a hybrid method for time series subsequence join (using the second definition) by combining the EP-M algorithm and the JOCOR algorithm. First, we apply EP-M to find the set of all pairs of matching subsequences in the two time series within a similarity threshold. From this set we derive the pair of subsequences which has the smallest distance between them. This pair of subsequences is used as the candidate for the most correlated subsequences in the two time series. Finally, we apply the JOCOR algorithm to post-process the candidates. The hybrid method helps to speed up the process of finding the most correlated subsequence without causing any false dismissals. The experimental results demonstrate that our new hybrid method is not only nearly as accurate as exact JOCOR but also achieves better time efficiency than exact JOCOR.

II. BACKGROUND

A. Basic Concepts

Definition 1. A time series $T = t_1, t_2, \ldots, t_n$ is a sequence of n data points measured at equal periods, where n is the length of the time series. For most applications, each data point is usually represented by a real value.

Definition 2. Given a time series $T = t_1, t_2, \ldots, t_n$ of length n, a subsequence $T[i : i+m-1] = t_i, t_{i+1}, \ldots, t_{i+m-1}$ is a continuous subsequence of T, starting at position i and having length m ($m \le n$).


This work aims to join two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively. The problem of time series join was defined by Mueen et al. [7] and is described as follows.

Problem (Max-Correlation Join): Given two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively (assume $n_1 \ge n_2$), find the most correlated subsequences of $T_1$ and $T_2$ with length $\ge$ minLength.

When joining two time series, we find the most correlated subsequences by calculating Pearson's correlation coefficient. The correlation coefficient is defined as follows:

$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n \, \sigma_x \sigma_y} \tag{1}$$

where x and y are two given time series of equal length n, with average values $\mu_x$ and $\mu_y$, and standard deviations $\sigma_x$ and $\sigma_y$, respectively.

The value of Pearson's correlation coefficient ranges in [-1, 1]. Besides, the z-normalized Euclidean distance is also a commonly used measure in time series data mining. The distance between two time series $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$ with the same length n is calculated by:

$$d(\hat{x}, \hat{y}) = \sqrt{\sum_{i=1}^{n} (\hat{x}_i - \hat{y}_i)^2} \tag{2}$$

where $\hat{x}_i = \frac{1}{\sigma_x}(x_i - \mu_x)$ and $\hat{y}_i = \frac{1}{\sigma_y}(y_i - \mu_y)$.

Because we only pay attention to maximizing positive correlations and ignore negatively correlated subsequences, we can take advantage of the relationship between the z-normalized Euclidean distance and positive correlation, as follows:

$$\mathrm{corr}(x, y) = 1 - \frac{d^2(\hat{x}, \hat{y})}{2n} \tag{3}$$
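The relationship between correlation and z-normalized distance can be checked numerically. The following is a minimal sketch, assuming the population standard deviation (division by n) and the identity corr = 1 - d^2 / (2n); the helper names `znorm`, `pearson`, and `znorm_dist` are illustrative and do not come from [7].

```python
import numpy as np

def znorm(v):
    """Z-normalize using the population standard deviation (ddof=0)."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def pearson(x, y):
    """Pearson correlation as the mean product of z-normalized values."""
    return float(np.mean(znorm(x) * znorm(y)))

def znorm_dist(x, y):
    """Euclidean distance between the two z-normalized series."""
    return float(np.linalg.norm(znorm(x) - znorm(y)))

# Illustrative data: y is a noisy copy of x, so the correlation is positive.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.3 * rng.normal(size=50)
n = len(x)

corr = pearson(x, y)
corr_from_dist = 1.0 - znorm_dist(x, y) ** 2 / (2 * n)
```

Because the mapping is monotone, a smaller z-normalized distance always corresponds to a larger correlation, which is why a distance-based search can supply candidates for a correlation-based join.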

In this work, we take advantage of sufficient statistics for computing the correlation coefficient, as follows:

$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - n \mu_x \mu_y}{n \, \sigma_x \sigma_y} \tag{4}$$

This approach brings two advantages. First, the algorithm takes just one pass to compute all of these statistics. Second, it enables us to reuse computations and reduce the amortized time complexity to constant instead of linear [7]. As in [7], the above formulas are used for computing the correlation coefficient and the z-normalized Euclidean distance between two subsequences.
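The reuse of computations can be illustrated with running sums. The sketch below assumes the sufficient statistics are the five sums of x, y, x^2, y^2 and xy, and grows a pair of aligned segments one point at a time, so each new correlation costs constant time. The function names are hypothetical and this is not the implementation from [7].

```python
import numpy as np

def corr_from_stats(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation from the five running sums of two aligned segments."""
    mu_x, mu_y = sx / n, sy / n
    sigma_x = np.sqrt(sxx / n - mu_x ** 2)
    sigma_y = np.sqrt(syy / n - mu_y ** 2)
    return (sxy - n * mu_x * mu_y) / (n * sigma_x * sigma_y)

def growing_correlations(x, y, min_len):
    """Correlation of x[:k] and y[:k] for k = min_len .. len(x), O(1) per step."""
    sx = sy = sxx = syy = sxy = 0.0
    out = []
    for k, (a, b) in enumerate(zip(x, y), start=1):
        # Extend the segments by one point: only the sums are updated.
        sx += a
        sy += b
        sxx += a * a
        syy += b * b
        sxy += a * b
        if k >= min_len:
            out.append(corr_from_stats(k, sx, sy, sxx, syy, sxy))
    return out

# Illustrative data only.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)
res = growing_correlations(x, y, min_len=5)
```

Each entry of `res` agrees with a correlation recomputed from scratch on the same prefix, but is obtained without revisiting earlier points.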

B. JOCOR Algorithm

The main idea of JOCOR is to add some improvements to the naive algorithm, Join, which finds the most correlated join segments. The algorithm Join computes the correlation of all possible pairs of segments of all lengths. The pseudo-code of the Join algorithm is given in Fig. 1.

To improve the Join algorithm, JOCOR reuses the sufficient statistics for overlapping correlation computations and then admissibly prunes unnecessary correlation computations (Fig. 2). JOCOR applies the Fast Fourier Transform (FFT) to compute the shifted cross product between two time series. The procedure multiply (Fig. 3) produces an array that contains the sum of the products of the elements in x and y for different shifts of x. The output z of the procedure multiply is expressed more precisely by the formula that follows Fig. 3.

Algorithm Join(x, y)
// return the locations and length of the most correlated segments of x and y
1  x := (x - mean(x)) / stdv(x)
2  y := (y - mean(y)) / stdv(y)
3  n := length(x); m := length(y)
4  best := 0
5  for i := 1 to m - minLength + 1 do
6    for j := 1 to n - minLength + 1 do
7      maxLength := min(m - i + 1, n - j + 1)
8      for len := minLength to maxLength do
9        c := Correlation(x[j : j+len-1], y[i : i+len-1])
10       if c > best then best := c

Figure 1. The brute-force algorithm for subsequence join over time series
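The brute-force procedure of Fig. 1 can be sketched directly in Python. This is an illustrative version using 0-based indices; it additionally records the location and length of the best pair and skips constant segments, neither of which is spelled out in the figure. The name `brute_force_join` is hypothetical.

```python
import numpy as np

def brute_force_join(x, y, min_length):
    """Return (best_corr, start_in_x, start_in_y, length) over all segment pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    best = (0.0, None, None, None)
    for i in range(m - min_length + 1):          # start in y
        for j in range(n - min_length + 1):      # start in x
            max_length = min(m - i, n - j)
            for length in range(min_length, max_length + 1):
                xs, ys = x[j:j + length], y[i:i + length]
                if xs.std() == 0 or ys.std() == 0:
                    continue                     # correlation undefined
                c = np.corrcoef(xs, ys)[0, 1]
                if c > best[0]:
                    best = (c, j, i, length)
    return best

# Illustrative data: y[2:6] is an exact linear transform of x[2:6].
x = np.array([5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0, 0.0])
y = np.array([9.0, 9.5, 9.0, 5.0, 17.0, 7.0, 0.0])
c, j, i, length = brute_force_join(x, y, min_length=4)
```

The three nested loops, each calling a linear-time correlation, make the quartic cost of the naive approach visible; even series of a few hundred points become slow.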

Algorithm JOCOR(x, y)
// return the locations and length of the most correlated segments of x and y
1  x := (x - mean(x)) / stdv(x)
2  y := (y - mean(y)) / stdv(y)
3  n := length(x); m := length(y)
4  for i := 1 to m do
5    Z_i := multiply(x, y[i : m])
6  best := 0
7  for i := 1 to m - minLength + 1 do
8    for j := 1 to n - minLength + 1 do
9      maxLength := min(m - i + 1, n - j + 1)
10     len := minLength
11     while len ≤ maxLength do
12       sumXY := Z_i[m-i+j] - Z_{i+len}[m-i+j]
13       c := (sumXY - len μx μy) / (len σx σy)   // μ, σ: statistics of the two current segments
14       if c > best then best := c
15       compute stepSize
16       if stepSize ≤ 0 or stepSize ≥ len then
17         stepSize := 1
18       len := len + stepSize

Figure 2. The JOCOR algorithm for subsequence join over time series

Procedure multiply(x, y)
// return the shifted dot products for x and y (stored in z)
n' := length(x); m' := length(y)
x := append(x, n' zeros)
y := append(reverse(y), (2n' - m') zeros)
X := FFT(x); Y := FFT(y)
Z := X · Y    // element-wise product
z := iFFT(Z)

Figure 3. The procedure for computing the shifted cross product between two time series


$$z_k = \sum_{i=1}^{m'} y_i \, x_{k-m'+i}$$

Here m' is the length of y and n' > m' is the length of x.
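The FFT-based procedure can be checked against a direct evaluation of the shifted sums. The following is a minimal sketch using NumPy's FFT routines with 0-based indexing, where terms that fall outside x are treated as zero; `shifted_dot` is a hypothetical helper added only for verification.

```python
import numpy as np

def multiply(x, y):
    """Shifted dot products of x and y via FFT (linear convolution of x with reversed y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    size = 2 * n                                  # padded length, assumes m <= n + 1
    xp = np.concatenate([x, np.zeros(size - n)])
    yp = np.concatenate([y[::-1], np.zeros(size - m)])
    return np.fft.ifft(np.fft.fft(xp) * np.fft.fft(yp)).real

def shifted_dot(x, y, k):
    """Direct sum over i of y[i] * x[k - m + 1 + i] for a 0-based shift k."""
    m = len(y)
    total = 0.0
    for i in range(m):
        t = k - m + 1 + i
        if 0 <= t < len(x):
            total += y[i] * x[t]
    return total

# Illustrative data only.
rng = np.random.default_rng(2)
x = rng.normal(size=8)
y = rng.normal(size=5)
z = multiply(x, y)
```

Padding to twice the length of x keeps the circular convolution computed by the FFT identical to the linear one, so every shift is obtained in a single transform instead of one dot product per shift.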

The procedure multiply calculates the term $\sum xy$ in the formula for the correlation coefficient (4). Lines 4-5 in JOCOR populate a set of cross products Z, where $Z_i$ = multiply(x, y[i : m]). The cross products in Z are the most important statistics for any correlation computation between any pair of segments (Fig. 3). The dot product of two subsequences of x and y starting at the j-th and i-th locations, respectively, with length len can be computed as $Z_i[m-i+j] - Z_{i+len}[m-i+j]$.

The second important improvement of JOCOR is a mechanism to skip some of the lengths in the loop of line 8 of the Join algorithm. JOCOR computes the step size dynamically instead of incrementing the len variable by one. Details of how to compute this step size are given in [7].

C. Dynamic Time Warping Measure

In this work, we use the Dynamic Time Warping distance since this distance measure allows non-linear alignments between two time series to accommodate sequences that are similar but locally out of phase.

Regarding the calculation of the DTW distance, the major issue is that, when implemented in the classical way, the comparison of two time series of length l requires the calculation of the entries of an l x l matrix using dynamic programming, and therefore the comparison has a complexity of $O(l^2)$. To speed up the DTW distance calculation, practitioners typically constrain the warping path in a global manner by limiting how far it may stray from the diagonal. The subset of the matrix that the warping path is allowed to visit is called the warping window or band. Two of the most frequently used global constraints in the literature are the Sakoe-Chiba band, proposed by Sakoe and Chiba in 1978 [10], and the Itakura Parallelogram, proposed by Itakura in 1975 [3]. The Sakoe-Chiba band is the area defined by two straight lines parallel to the diagonal, and the Itakura Parallelogram is the area defined by a parallelogram which is symmetric over the diagonal.

In this work, we use the DTW distance with a Sakoe-Chiba band of width r. Two time series subsequences Q and C are similar to each other within threshold th if $DTW(Q, C) \le th$.
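The banded DTW computation can be sketched as follows. This is a minimal illustration, assuming a squared point-wise cost with a square root taken at the end (the paper does not specify the local cost) and a band that is widened to at least the length difference so the end cell stays reachable.

```python
import numpy as np

def dtw_band(q, c, r):
    """DTW distance between q and c restricted to a Sakoe-Chiba band of half-width r."""
    n, m = len(q), len(c)
    r = max(r, abs(n - m))                 # the band must contain the end cell
    D = np.full((n + 1, m + 1), np.inf)    # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        # Only cells within r of the diagonal are filled.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

# Illustrative data: c is q delayed by one sample.
q = [0, 1, 2, 3, 2, 1, 0]
c = [0, 0, 1, 2, 3, 2, 1]
d_no_warp = dtw_band(q, c, 0)   # r = 0 reduces to the Euclidean distance
d_band_1 = dtw_band(q, c, 1)    # one step of warping absorbs the delay
similar = d_band_1 <= 1.5       # similarity test with an example threshold th = 1.5
```

With r = 0 the delayed copy looks far from the original, while a band of width 1 already aligns all but the last point, which illustrates why a small amount of warping is enough for locally out-of-phase sequences.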

D. Important Extrema

Important extrema in a time series mark the important change points of the time series. The algorithm for identifying important extrema was first introduced by Pratt and Fink in 2002 [8]. Fink and Gandhi proposed an improved variant of the algorithm in 2007 [2]. In both papers, the concept of important extrema was used for time series compression. In this work, however, we exploit important extrema for segmenting time series into subsequences.

In [2], Fink and Gandhi give the definition of important extrema and an algorithm that identifies important extrema in a given time series. Intuitively, an important minimum is the minimum value of some segment whose endpoints are much larger than it. Similarly, an important maximum is the maximum value of some segment whose endpoints are much smaller than it. The definition of important extrema requires a positive parameter R, which is called the compression rate. An increase of R leads to the selection of fewer important extreme points.

Given a time series T of length n, starting at the beginning of the time series, all important minima and maxima of the time series are identified by using the algorithm given in [2]. The algorithm takes linear computational time and constant memory.
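The intuitive definition can be made concrete with a direct check. The sketch below is not the linear-time algorithm of [2]; it is a quadratic-time reading of the definition, assuming that "much larger" means the endpoint exceeds the candidate by at least R in absolute difference. The function names are hypothetical.

```python
def important_minima(series, R):
    """Indices i where series[i] is the minimum of a segment whose endpoints exceed it by >= R."""
    n = len(series)
    result = []
    for i in range(n):
        def rises(step):
            # Walk away from i while no smaller value appears; succeed once
            # a value at least R above series[i] is reached.
            k = i + step
            while 0 <= k < n and series[k] >= series[i]:
                if series[k] - series[i] >= R:
                    return True
                k += step
            return False
        if rises(-1) and rises(+1):
            result.append(i)
    return result

def important_maxima(series, R):
    """Important maxima are the important minima of the negated series."""
    return important_minima([-v for v in series], R)

# Illustrative data only.
series = [1, 5, 2, 4, 1, 6, 3, 7, 2]
minima_r3 = important_minima(series, 3)
maxima_r3 = important_maxima(series, 3)
minima_r5 = important_minima(series, 5)
```

Raising R from 3 to 5 removes every minimum in this example, in line with the remark that a larger compression rate selects fewer extreme points.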

III. THE HYBRID METHOD FOR TIME SERIES SUBSEQUENCE JOIN

We now present our new hybrid method for time series subsequence join, which combines our previous method, EP-M [11], and the JOCOR algorithm [7]. The hybrid method exploits the relationship between the two definitions of time series subsequence join. The first definition has the spirit of range search, which finds all pairs of subsequences drawn from two time series that satisfy a given similarity threshold th, while the second definition has the spirit of nearest neighbor search, which finds only the most correlated pair of subsequences. From the result of a range search, we can derive the result of a nearest neighbor search.

Based on the above-mentioned rationale, the hybrid method consists of the following main ideas. Given two time series, we first apply the EP-M approach, which finds all the matching subsequences in the two time series. Second, we concatenate all matching subsequences whose indexes are adjacent into longer ones. Then, we find the pair of subsequences which has the smallest distance between them. Finally, we apply the JOCOR algorithm to the pair of subsequences obtained in the preceding step to find the most correlated pair of subsequences.

A. The Proposed Method

The hybrid algorithm for subsequence join over time series consists of the following steps.

Step 1: We extract all important extrema of the two time series $T_1$ and $T_2$. The results of this step are two lists of important extrema $EP_1 = (ep1_1, ep1_2, \ldots, ep1_{m_1})$ and $EP_2 = (ep2_1, ep2_2, \ldots, ep2_{m_2})$, where $m_1$ and $m_2$ are the numbers of important extrema in $T_1$ and $T_2$, respectively. Afterward, when extracting subsequences from a time series ($T_1$ or $T_2$), we extract the subsequence bounded by the extrema $ep_i$ and $ep_{i+2}$.

Step 2: We keep time series $T_1$ fixed and, for each subsequence s extracted from $T_1$, we find all its matching subsequences in $T_2$ by shifting a sliding window of length equal to the length of s along $T_2$ one data point at a time. We store all the resulting subsequences in the result set S.

Step 3: At this step, we concatenate all the resulting matching pairs from Step 2 if the indexes of these pairs are adjacent. Then, we find the pair of subsequences which has the smallest distance between them.

Step 4: At the final step, we apply the JOCOR algorithm to calculate Pearson's correlation coefficient and find the most correlated subsequence among the candidate subsequences found in Step 3.

Notice that our previous algorithm EP-M consists of Step 1 and Step 2 of the hybrid algorithm. The pseudo-code describing Step 1, Step 2, Step 3 and Step 4 of the hybrid method is given in Fig. 4 and Fig. 5.
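Steps 1 and 2 can be sketched as two small helpers. This is an illustration only: the extrema are assumed to be given as a sorted list of indices, and a plain Euclidean distance stands in for the DTW distance so that the example stays self-contained. All names are hypothetical.

```python
def extract_subsequences(extrema):
    """(start, end) index pairs bounded by extrema ep_i and ep_{i+2}, inclusive."""
    return [(extrema[i], extrema[i + 2]) for i in range(len(extrema) - 2)]

def sliding_matches(s, t, threshold, dist):
    """Start offsets in t where a window of len(s) lies within threshold of s."""
    m = len(s)
    return [k for k in range(len(t) - m + 1)
            if dist(s, t[k:k + m]) <= threshold]

def euclidean(a, b):
    """Stand-in distance; the paper uses DTW with early abandoning here."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# Illustrative data only.
bounds = extract_subsequences([0, 3, 5, 9, 12])
t2 = [0, 1, 2, 1, 0, 1, 2, 1, 0]
matches = sliding_matches([1, 2, 1], t2, threshold=0.0, dist=euclidean)
```

Bounding each subsequence by every second extremum means consecutive subsequences overlap, so each one spans a full rise and fall (or fall and rise) of the series.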

Procedure Subsequence_Matching invokes the procedure DTW_EA, which computes the DTW distance. The DTW_EA procedure applies the Early Abandoning technique, as mentioned in the next subsection.

Algorithm Hybrid(T1[1..n1], T2[1..n2])
Input: Two time series T1 and T2
Output: The pair of subsequences which has the maximum Pearson's correlation coefficient
1  EP1 := Important_Extrema(T1); EP2 := Important_Extrema(T2)
   for i := 1 to length(EP1) - 2 do
     Subsequence_T1(i) := T1[EP1(i) to EP1(i+2)]
   for i := 1 to length(EP2) - 2 do
     Subsequence_T2(i) := T2[EP2(i) to EP2(i+2)]
2  for i := 1 to length(EP1) - 2 do
     s := Subsequence_T1(i)
     Subsequence_Matching(s, T2, threshold)
     store all the resulting pairs of subsequences in S
3  for each resulting pair of subsequences in S
     if the indexes of the subsequences are adjacent then
       concatenate these subsequences into longer subsequences, and update the result set S
4  From the result set S, find the pair of subsequences (s1, s2) which has the smallest DTW distance between them
5  Apply the JOCOR algorithm to find the most correlated subsequences in the pair of candidate subsequences (s1, s2) found in Step 4

Figure 4. The proposed method HYBRID

In order to make the matching between two time series meaningful, both must be normalized. In our subsequence join method, before applying the hybrid algorithm, we use min-max normalization to transform the scale of one time series to the scale of the other, based on the maximum and minimum values in each time series.
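The rescaling step can be sketched as follows. This is one plausible reading of min-max normalization onto the range of the other series; the function name `rescale_to` and the handling of a constant series are assumptions.

```python
import numpy as np

def rescale_to(source, target):
    """Map source linearly so that its min and max match those of target."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    s_min, s_max = source.min(), source.max()
    t_min, t_max = target.min(), target.max()
    if s_max == s_min:
        # A constant series has no range; place it at the middle of the target range.
        return np.full_like(source, (t_min + t_max) / 2)
    return (source - s_min) / (s_max - s_min) * (t_max - t_min) + t_min

# Illustrative data only.
t1 = [10.0, 20.0, 30.0]
t2 = [0.0, 0.5, 1.0]
t2_rescaled = rescale_to(t2, t1)
```

After rescaling, both series occupy the same value range, so a single similarity threshold is meaningful for every pair of subsequences.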

B. Some Other Issues

How to estimate the width of the Sakoe-Chiba band

According to the research by Ratanamahatana and Keogh in 2005 [9], too large a warping window constraint may hurt accuracy rather than improve it. Their claim is summarized in the remark "a little warping is good while too much warping is bad." Following this suggestion, in this work we use 2% as the fixed width of the Sakoe-Chiba band r in the DTW distance calculation. That is, r is computed by the formula:

$$r = 0.02 \cdot \max(\mathrm{length}(subsequence_1), \mathrm{length}(subsequence_2))$$

How to speed up the calculation of DTW distance

To speed up the calculation of the DTW distance, in this work we use the Sakoe-Chiba band to constrain the warping path. Besides, we apply the Early Abandoning technique, proposed by Li and Wang in 2009 [5], to accelerate the computation of the DTW distance.
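Both speed-ups can be combined in a few lines. The sketch below uses a generic early-abandoning rule (stop as soon as every cell in the current row already exceeds the threshold) rather than the specific procedure of [5], assumes a squared point-wise cost, and rounds the band width up to an integer, which the paper does not specify.

```python
import math

def band_width(len1, len2, fraction=0.02):
    """Sakoe-Chiba band width as a fraction of the longer subsequence, rounded up."""
    return math.ceil(fraction * max(len1, len2))

def dtw_early_abandon(q, c, r, threshold):
    """Banded DTW distance, or math.inf once the threshold can no longer be met."""
    n, m = len(q), len(c)
    r = max(r, abs(n - m))                  # keep the end cell inside the band
    limit = threshold ** 2                  # compare on squared costs
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        # Every warping path crosses this row, so its minimum is a lower bound.
        if min(cur) > limit:
            return math.inf
        prev = cur
    return math.sqrt(prev[m])

# Illustrative data only.
q = [0, 1, 2, 3, 2, 1, 0]
near = [0, 0, 1, 2, 3, 2, 1]
far = [10] * 7
d_near = dtw_early_abandon(q, near, r=1, threshold=2.0)
d_far = dtw_early_abandon(q, far, r=1, threshold=2.0)
```

The clearly dissimilar window is rejected after its first row, while the near match is computed in full. When early abandoning rarely triggers, the full banded cost is paid for every window, which is the situation described for the slower datasets in the experiments.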

IV. EXPERIMENTAL EVALUATION

We implemented the two methods in MATLAB and carried out the experiments on a PC with an Intel(R) Core(TM) i7-4790 3.6 GHz CPU and 8 GB RAM. In the experiments, we compare the performance of our hybrid method to that of exact JOCOR on three measurements: the correlation coefficient of the resulting subsequence, the runtime of the algorithm, and the length of the resulting subsequence.

Procedure Subsequence_Matching(s[1..m], T[1..n], threshold)
Input: T is a time series, s is a subsequence, and threshold is the similarity threshold
Output: The set S of all matching subsequences
for i := 1 to n - m + 1 do
  segment_of_T := T[i : i+m-1]
  dtw_distance := DTW_EA(s, segment_of_T, threshold)
  if dtw_distance <= threshold then
    store the pair <s, segment_of_T> in the result set S
endfor
end Procedure

Figure 5. The procedure for subsequence matching

A. Datasets

Our experiments were conducted over datasets from the UCR Time Series Data Mining archive [4] and from [7]. Nine datasets are used in these experiments. Their names and lengths are as follows: Chromosome (999,541 data points), Stock (2,119,415 data points), EEG (10,957,312 data points), Random Walks (RW2, 1,600,002 data points), Ratbp (1,296,000 data points), LFS6 (180,214 data points), LightCurve (8,192,002 data points), Currency (15,000 data points) and Salinity (15,000 data points).

The datasets are categorized into two types. In the first type, each dataset is a long time series. In this case, we divide each time series into two equal halves. The first half is $T_1$ and the second is $T_2$. In the second type, each dataset is a medium-size time series. Datasets of this kind are downloaded from the UCR archive. The downloaded dataset is $T_1$. Based on $T_1$, we randomly generate the synthetic dataset $T_2$ by applying the following rule:

$$x_i = x_{i-1} \pm \lvert x_{i-1} - \ldots \rvert$$

In the above formula, + or - is determined by a random process. The rationale for this way of generating $T_2$ from $T_1$ is that $T_2$ should have a high probability of being correlated with $T_1$. Time series $T_2$ is generated after the corresponding dataset has been normalized; therefore, there is no effect of noise in $T_2$.

B. The Hybrid Method versus Exact JOCOR

When operating on very long time series, the response time is one of the most challenging factors for researchers. In this experiment, we compare the performance of our proposed method with that of exact JOCOR. The performance of each method is evaluated by three measurements: the maximum correlation coefficient of the resulting subsequence, the runtime of the method, and the length of the resulting subsequence. Because the lengths of the resulting subsequences of the two methods are almost the same, we exclude them from our comparison.

From Table I, with the datasets Stock, EEG and Chromosome, our hybrid method produced the same maximum correlation values as exact JOCOR. Averaged over all experiments, the maximum correlation coefficients obtained by the hybrid method are nearly 100% similar to JOCOR's results. These experimental results verify the correctness and accuracy of the EP-M method in time series subsequence join, even though EP-M is just an approximate method.

TABLE I. EXPERIMENTAL RESULTS OF THE HYBRID METHOD AND JOCOR OVER 9 DATASETS (RT: RUNTIME IN SECONDS; MC: MAXIMUM CORRELATION)

Dataset      Method   length=1000      length=4000      length=15000
                      RT      MC       RT      MC       RT       MC
Stock        Hybrid   6       0.995    549     0.998    2349     0.999
             JOCOR    17      0.995    565     0.999    >12hrs   N/A
RW2          Hybrid   12      0.966    74      0.973    2125     0.951
             JOCOR    41      0.966    1026    0.976    10461    0.985
RATBP        Hybrid   7       0.978    136     0.997    10885    0.999
             JOCOR    22      0.984    3140    0.999    9488     0.999
LFS6         Hybrid   22      0.992    352     0.986    1989     0.996
             JOCOR    24      0.992    600     0.997    8076     0.998
EEG          Hybrid   17      0.877    393     0.890    13886    0.908
             JOCOR    41      0.877    1095    0.890    43885    0.908
Currency     Hybrid   4       0.967    135     0.966    5961     0.966
             JOCOR    14      0.977    1057    0.988    7480     0.966
Salinity     Hybrid   5       0.976    47      0.992    1595     0.995
             JOCOR    14      0.983    286     0.996    5670     0.997
Chromosome   Hybrid   14      0.999    350     0.999    13907    0.999
             JOCOR    11      0.999    351     0.999    10030    0.999
LightCurve   Hybrid   9       1.000    561     1.000    1989     0.996
             JOCOR    8       1.000    496     1.000    8078     0.998

Regarding runtime, our hybrid method outperforms JOCOR on seven out of nine datasets. In particular, on the LightCurve dataset with 15,000 data points, our method was more than 4 times faster than JOCOR. With length 4000, the hybrid method runs on average about 6.46 times faster than JOCOR. The difference between the runtimes of the hybrid method and JOCOR grows as the length of the datasets increases. Nevertheless, on the Chromosome and RATBP datasets, JOCOR runs slightly faster than our hybrid method. This is because, for these specific time series, the early abandoning technique [5] applied in the DTW calculation had no effect in pruning costly distance computations, and hence the time series underwent several DTW calculations before actually being joined by the JOCOR algorithm.
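The average speed-up at length 4000 can be re-derived from the runtimes listed in Table I. The sketch below assumes that the reported average is the arithmetic mean of the per-dataset ratios of JOCOR runtime to hybrid runtime; the paper does not state how the average was formed.

```python
# Runtimes in seconds at length 4000, copied from Table I.
hybrid_rt = {
    "Stock": 549, "RW2": 74, "RATBP": 136, "LFS6": 352, "EEG": 393,
    "Currency": 135, "Salinity": 47, "Chromosome": 350, "LightCurve": 561,
}
jocor_rt = {
    "Stock": 565, "RW2": 1026, "RATBP": 3140, "LFS6": 600, "EEG": 1095,
    "Currency": 1057, "Salinity": 286, "Chromosome": 351, "LightCurve": 496,
}

# Speed-up of the hybrid method over JOCOR for each dataset.
ratios = {name: jocor_rt[name] / hybrid_rt[name] for name in hybrid_rt}
mean_speedup = sum(ratios.values()) / len(ratios)
slower_datasets = sorted(name for name, ratio in ratios.items() if ratio < 1)
```

Under this assumption the mean lands close to the reported value of about 6.46. The per-dataset ratios vary widely, so the average is dominated by a few datasets such as RATBP and RW2.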

V. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a new hybrid method to find the most correlated subsequences of two time series with acceptable time efficiency. The experimental results show that our hybrid method runs faster than JOCOR while its accuracy is nearly the same as that of JOCOR. We attribute the high performance of the hybrid method to the effectiveness of EP-M in performing the range search and to the use of JOCOR to post-process the candidate subsequences obtained from EP-M.

As for future work, we intend to apply a new distance, the Complexity Invariant Distance (CID) [1], in the hybrid method instead of DTW in order to improve its time efficiency.

REFERENCES

[1] G. P. A. Batista, X. Wang, and E. J. Keogh, "Complexity Invariant Distance for Time Series," in Proc. of SIAM Intl. Conf. on Data Mining, 2011, pp. 699-710.

[2] E. Fink and H. S. Gandhi, "Important extrema of time series," in Proc. of IEEE Intl. Conf. on Systems, Man and Cybernetics, Montreal, Canada, 2007, pp. 366-372.

[3] F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-23, 1975, pp. 52-72.

[4] E. Keogh, "The UCR Time Series Classification/Clustering," URL www.cs.ucr.edu/~eamonn/time_series_data/, 2015.

[5] J. Li and Y. Wang, "Early Abandon to Accelerate Exact Dynamic Time Warping," The International Arab Journal of Information Technology, vol. 6(2), 2009, pp. 144-152.

[6] Y. Lin and M. D. McCool, "Subseries Join: A Similarity-based Time Series Match Approach," in Proc. of PAKDD Intl. Conf., 2010, Part 1, LNAI 6118, pp. 238-245.

[7] A. Mueen, H. Hamooni, and T. Estrada, "Time Series Join on Subsequence Correlation," in Proc. of ICDM, 2014, pp. 450-459.

[8] K. B. Pratt and E. Fink, "Search for Patterns in Compressed Time Series," International Journal of Image and Graphics, vol. 2(1), 2002, pp. 89-106.

[9] C. A. Ratanamahatana and E. Keogh, "Three Myths about Dynamic Time Warping Data Mining," in Proc. of SDM'05, 2005.

[10] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-26, 1978, pp. 43-49.

[11] V. D. Vinh and D. T. Anh, "Efficient Subsequence Join over Time Series under Dynamic Time Warping," in Recent Developments in Intelligent Information and Database Systems, Studies in Computational Intelligence, vol. 642, D. Krol, L. Madeyski, N. T. Nguyen, Eds., Springer, 2016, pp. 41-52.
