2016 IEEE International Conference on Knowledge Engineering and Applications

A Novel Hybrid Method for Time Series Subsequence Join

Vo Duc Vinh, Nguyen Phuc
Faculty of Information Technology
Ton Duc Thang University
Vietnam
e-mail: vdvinh@it.tdt.edu.vn, 51303240@student.tdt.edu.vn

Duong Tuan Anh
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology, Vietnam National University
Vietnam
e-mail: dtanh@cse.hcmut.edu.vn

Abstract: The exact method JOCOR, proposed by Mueen et al., is the first method for joining two time series on subsequence correlation. Although JOCOR has time complexity $O(n^2 \lg n)$, where $n$ is the length of the time series, it is still time-consuming even for medium-size time series. In this paper, we propose a hybrid method that runs faster than JOCOR. Our method consists of four main steps. First, a list of subsequences is extracted from the raw time series based on important extrema. Second, we apply a nested loop join using a sliding window and Dynamic Time Warping distance to find all the matching subsequences in the two time series. Third, we concatenate all matching subsequences whose indexes are adjacent into longer ones and find the pair of subsequences that has the smallest distance between them. Finally, we apply JOCOR to find the most correlated segments in the two time series. In comparison to JOCOR, our hybrid method performs much faster while high accuracy is guaranteed.

Keywords: correlation coefficients, dynamic time warping, important extrema, JOCOR, nested loop join, subsequence join, time series

I. INTRODUCTION

Subsequence join over time series is considered one of the basic problems in time series data mining. This problem appears in many practical applications such as entertainment, meteorology, economy, finance, medicine, and engineering [6].

Subsequence join over time series can be defined in different ways. The first definition of subsequence join is based on a nested-loop algorithm and some distance function. This joining approach returns all pairs of subsequences drawn from two time series that satisfy a given similarity threshold. The second definition of subsequence join is to find join segments with maximum correlation coefficient and requires only one parameter, the minimum length of the segments.

The approach based on the first definition has some disadvantages, such as time overhead, because the distance function is called many times over a large number of iterations. To reduce the runtime of the nested-loop algorithm, some works estimate the similarity between two time series approximately by dividing the time series into segments. Following this approach, Lin and McCool [6] introduced solutions for joining two time series based on a non-uniform segmentation and a similarity function over a feature set. Their method is not only difficult to implement but also has high computational complexity, especially for large time series data. To avoid these drawbacks, in our previous work [11], we proposed a method, called EP-M, for
time series subsequence join which is based on important extrema to segment time series. We then apply a nested loop join approach which uses a sliding window and Dynamic Time Warping (DTW) distance to find all the matching subsequences in the two time series within a similarity threshold. Although this method executes very fast, it is an approximate method and may have some false dismissals because it might ignore some data points when shifting the sliding window several points at a time.

Another recent work on subsequence join, which uses the second definition and was proposed by Mueen et al. in 2014 [7], can find the exact most correlated subsequence. Mueen et al. introduced an exhaustive search method, JOCOR, for discovering the most correlated subsequence by maximizing Pearson's correlation coefficient in two given time series. Although the authors incorporated several speed-up techniques to reduce the complexity from $O(n^4)$ to $O(n^2 \lg n)$, where $n$ is the length of the two time series, the runtime of JOCOR is still unacceptable even for many time series datasets of moderate size.

In this paper, we propose a hybrid method for time series subsequence join (using the second definition) by combining the EP-M algorithm and the JOCOR algorithm. First, we apply EP-M to find the set of all pairs of matching subsequences in the two time series within a similarity threshold. From this set we derive the pair of subsequences which has the smallest distance between them. This pair of subsequences is used as the candidate for the most correlated subsequences in the two time series. Finally, we apply the JOCOR algorithm to post-process the candidates. The hybrid method helps to speed up the process of finding the most correlated subsequence without causing any false dismissals. The experimental results demonstrate that our new hybrid method is not only nearly as accurate as exact JOCOR but also achieves better time efficiency than exact JOCOR.
II. BACKGROUND

A. Basic Concepts

Definition 1. A time series $T = t_1, t_2, \ldots, t_n$ is a sequence of $n$ data points measured at equal periods, where $n$ is the length of the time series. For most applications, each data point is represented by a real value.

Definition 2. Given a time series $T = t_1, t_2, \ldots, t_n$ of length $n$, a subsequence $T[i : i+m-1] = t_i, t_{i+1}, \ldots, t_{i+m-1}$ is a contiguous subsequence of $T$ starting at position $i$ and having length $m$ ($m \le n$).
This work aims to join two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively. The problem of time series join was defined by Mueen et al. [7] as follows.

Problem (Max-Correlation Join): Given two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively (assume $n_1 \ge n_2$), find the most correlated subsequences of $T_1$ and $T_2$ with length $\ge$ minLength.
When joining two time series, we find the most correlated subsequences by calculating Pearson's correlation coefficient. The correlation coefficient is defined as follows:

$$C(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \mu_x}{\sigma_x}\right) \left(\frac{y_i - \mu_y}{\sigma_y}\right) \tag{1}$$

where $x$ and $y$ are two given time series of equal length $n$, with average values $\mu_x$ and $\mu_y$ and standard deviations $\sigma_x$ and $\sigma_y$, respectively.

The value of Pearson's correlation coefficient lies in $[-1, 1]$. The z-normalized Euclidean distance is also a commonly used measure in time series data mining. The distance between two time series $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$ of the same length $n$ is calculated by:

$$d(\hat{x}, \hat{y}) = \sqrt{\sum_{i=1}^{n} (\hat{x}_i - \hat{y}_i)^2} \tag{2}$$

where $\hat{x}_i = \frac{1}{\sigma_x}(x_i - \mu_x)$ and $\hat{y}_i = \frac{1}{\sigma_y}(y_i - \mu_y)$.

Because we only maximize positive correlations and ignore negatively correlated subsequences, we can take advantage of the relationship between Euclidean distance and positive correlation as follows:

$$C(x, y) = 1 - \frac{d^2(\hat{x}, \hat{y})}{2n} \tag{3}$$
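The relationship between the z-normalized Euclidean distance and the correlation coefficient can be checked numerically. The following is a minimal sketch, assuming population standard deviations (division by $n$); the function names `pearson` and `znorm_dist` and the synthetic data are illustrative and not part of the original method.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation computed from z-normalized values (population std)."""
    xh = (x - x.mean()) / x.std()
    yh = (y - y.mean()) / y.std()
    return float(np.mean(xh * yh))

def znorm_dist(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    xh = (x - x.mean()) / x.std()
    yh = (y - y.mean()) / y.std()
    return float(np.sqrt(np.sum((xh - yh) ** 2)))

# Synthetic, positively correlated pair of equal length (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

c = pearson(x, y)
d = znorm_dist(x, y)
c_from_d = 1.0 - d ** 2 / (2 * len(x))  # correlation recovered from the distance
```

Under these assumptions `c` and `c_from_d` agree up to floating-point error, so a smaller z-normalized distance always corresponds to a larger correlation.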
In this work, we take advantage of sufficient statistics for computing the correlation coefficient as follows:

$$C(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - n \mu_x \mu_y}{n \sigma_x \sigma_y}, \qquad \mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma_x^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \mu_x^2 \tag{4}$$

with $\mu_y$ and $\sigma_y$ defined analogously. This approach brings two advantages. First, the algorithm takes just one pass to compute all of these statistics. Second, it enables us to reuse computations and reduce the amortized time complexity to constant instead of linear [7]. As in [7], the above formulas are used for computing the correlation coefficient and the z-normalized Euclidean distance between two subsequences.
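The reuse of statistics can be illustrated with running sums. The sketch below assumes 0-based indexing and precomputed prefix sums of $x$, $x^2$, $y$ and $y^2$; the helper names `prefix` and `segment_corr` are hypothetical. The cross term is computed directly here, whereas JOCOR obtains it from FFT-based cross products.

```python
import numpy as np

def prefix(a):
    """Cumulative sums with a leading zero, so that sum(a[s:e]) == p[e] - p[s]."""
    return np.concatenate([[0.0], np.cumsum(a)])

def segment_corr(x, y, j, i, length, px, px2, py, py2):
    """Correlation of x[j:j+length] and y[i:i+length] from running sums."""
    n = length
    sx, sy = px[j + n] - px[j], py[i + n] - py[i]
    sxx, syy = px2[j + n] - px2[j], py2[i + n] - py2[i]
    # Cross term computed directly in this sketch (linear time per call).
    sxy = float(np.dot(x[j:j + n], y[i:i + n]))
    mx, my = sx / n, sy / n
    vx, vy = sxx / n - mx ** 2, syy / n - my ** 2
    return (sxy - n * mx * my) / (n * np.sqrt(vx * vy))

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = rng.normal(size=80)
px, px2, py, py2 = prefix(x), prefix(x * x), prefix(y), prefix(y * y)
c = segment_corr(x, y, 3, 5, 20, px, px2, py, py2)
```

With the prefix sums in place, the means and standard deviations of any segment cost a constant number of operations, which is the amortization described above.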
B. JOCOR Algorithm

The main idea of JOCOR is to add some improvements to the naive algorithm, Join, which finds the most correlated join segments. The algorithm Join computes the correlation of all possible pairs of segments of all lengths. The pseudo-code of the Join algorithm is given in Fig. 1.

To improve the Join algorithm, JOCOR reuses the sufficient statistics for overlapping correlation computations and then admissibly prunes unnecessary correlation computations (Fig. 2). JOCOR applies the Fast Fourier Transform (FFT) to compute the shifted cross product between two time series. The procedure multiply (Fig. 3) produces an array that contains the sum of the products of the elements in x and y for different shifts of x. The output z of the procedure multiply is expressed more precisely as follows:
Algorithm Join(x, y)
// return the locations and length of the most correlated segments of x and y
1   x := (x - mean(x)) / stdv(x)
2   y := (y - mean(y)) / stdv(y)
3   n := length(x); m := length(y)
4   best := 0
5   for i := 1 to m - minLength + 1 do
6     for j := 1 to n - minLength + 1 do
7       maxLength := min(m - i + 1, n - j + 1)
8       for len := minLength to maxLength do
9         c := Correlation(x[j : j+len-1], y[i : i+len-1])
10        if c > best then best := c

Figure 1. The brute-force algorithm for subsequence join over time series
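For small inputs, the brute-force procedure of Fig. 1 can be written directly. This is a minimal sketch with 0-based indexing; the function name `join_bruteforce`, the returned tuple layout and the synthetic data are assumptions made for illustration. The global z-normalization of lines 1-2 is omitted because Pearson's correlation of a segment pair does not depend on it.

```python
import numpy as np

def join_bruteforce(x, y, min_length):
    """Return (best correlation, start in x, start in y, length) over all segment pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    best = (0.0, None, None, None)
    for i in range(m - min_length + 1):
        for j in range(n - min_length + 1):
            max_length = min(m - i, n - j)
            for length in range(min_length, max_length + 1):
                xs, ys = x[j:j + length], y[i:i + length]
                sx, sy = xs.std(), ys.std()
                if sx == 0 or sy == 0:
                    continue  # correlation is undefined for constant segments
                c = float(np.mean((xs - xs.mean()) * (ys - ys.mean())) / (sx * sy))
                if c > best[0]:
                    best = (c, j, i, length)
    return best

# A shared pattern is planted at offset 5 in x and, rescaled, at offset 2 in y.
rng = np.random.default_rng(1)
pattern = rng.normal(size=8)
x = np.concatenate([rng.normal(size=5), pattern, rng.normal(size=4)])
y = np.concatenate([rng.normal(size=2), 2 * pattern + 3, rng.normal(size=3)])
best_c, best_j, best_i, best_len = join_bruteforce(x, y, min_length=8)
```

The three nested loops, each with a linear-time correlation, are what make this approach impractical beyond short series.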
Algorithm JOCOR(x, y)
// return the locations and length of the most correlated segments of x and y
1   x := (x - mean(x)) / stdv(x)
2   y := (y - mean(y)) / stdv(y)
3   n := length(x); m := length(y)
4   for i := 1 to m do
5     z_i := multiply(x, y[i : m])
6   best := 0
7   for i := 1 to m - minLength + 1 do
8     for j := 1 to n - minLength + 1 do
9       maxLength := min(m - i + 1, n - j + 1)
10      len := minLength
11      while len ≤ maxLength do
12        sumXY := z_i[m-i+j] - z_{i+len}[m-i+j]
13        c := (sumXY - len·μ_x·μ_y) / (len·σ_x·σ_y)
14        if c > best then best := c
15        compute stepSize
16        if stepSize ≤ 0 or stepSize ≥ len then stepSize := 1
17        len := len + stepSize

Figure 2. The JOCOR algorithm for subsequence join over time series
Procedure multiply(x, y)
// return the shifted dot products for x and y (stored in z)
1   n' := length(x); m' := length(y)
2   x := append(x, n'-zeros)
3   y := append(reverse(y), (2n' - m')-zeros)
4   X := FFT(x); Y := FFT(y)
5   Z := X.Y
6   z := iFFT(Z)

Figure 3. The procedure for computing the shifted cross product between two time series
$$z_k = \sum_{i=1}^{m'} y_i \, x_{k-m'+i}$$

Here $m'$ is the length of $y$ and $n' > m'$ is the length of $x$. The procedure multiply calculates $\sum xy$ in formula (4) for computing the correlation coefficient. Lines 4-5 in JOCOR populate a set of cross products $Z$, where $z_i$ = multiply(x, y[i : m]). The cross products in $Z$ are the most important statistics for any correlation computation between any pair of segments (Fig. 3). The dot product of two subsequences of x and y starting at the j-th and i-th locations, respectively, with length len can be computed as $z_i[m-i+j] - z_{i+len}[m-i+j]$.
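The cross-product identity above can be verified with a short sketch of the multiply procedure based on NumPy's FFT routines. It assumes 0-based indexing, so the index $m-i+j$ of the text becomes `m - i + j - 1`; the helper `segment_dot` and the random data are illustrative only.

```python
import numpy as np

def multiply(x, y):
    """Shifted dot products of x and y via FFT, zero-padded to length 2*len(x)."""
    n, m = len(x), len(y)
    xp = np.concatenate([x, np.zeros(n)])
    yp = np.concatenate([y[::-1], np.zeros(2 * n - m)])
    return np.fft.ifft(np.fft.fft(xp) * np.fft.fft(yp)).real

rng = np.random.default_rng(0)
x = rng.normal(size=12)
y = rng.normal(size=9)
n, m = len(x), len(y)

# Z[i] holds multiply(x, y[i:]); Z[m] corresponds to the empty suffix (all zeros).
Z = [multiply(x, y[i:]) for i in range(m + 1)]

def segment_dot(j, i, length):
    """Dot product of x[j:j+length] and y[i:i+length] from two cross products."""
    k = m - i + j - 1  # 0-based form of the index m - i + j
    return Z[i][k] - Z[i + length][k]
```

Each call to `segment_dot` costs two array lookups, which is why the cross products are computed once up front for every suffix of y.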
The second important improvement of JOCOR is a mechanism to skip some of the lengths in the loop over len (line 8 of Join). JOCOR computes the step size dynamically instead of incrementing the len variable by one. Details of how to compute this step size are given in [7].
C. Dynamic Time Warping Measure

In this work, we use the Dynamic Time Warping distance since this distance measure allows non-linear alignments between two time series to accommodate sequences that are similar but locally out of phase.

Regarding the calculation of the DTW distance, the major issue is that, when it is implemented in the classical way, the comparison of two time series of length $l$ requires calculating the entries of an $l \times l$ matrix using dynamic programming, and therefore the comparison has a complexity of $O(l^2)$. To speed up the DTW distance calculation, practitioners typically constrain the warping path globally by limiting how far it may stray from the diagonal. The subset of the matrix that the warping path is allowed to visit is called the warping window or band. Two of the most frequently used global constraints in the literature are the Sakoe-Chiba band proposed by Sakoe and Chiba, 1978 [10] and the Itakura Parallelogram proposed by Itakura, 1975 [3]. The Sakoe-Chiba band is the area defined by two straight lines parallel to the diagonal, and the Itakura Parallelogram is the area defined by a parallelogram which is symmetric over the diagonal.

In this work, we use DTW distance with a Sakoe-Chiba band of width $r$. Two time series subsequences $Q$ and $C$ are similar to each other within threshold $th$ if $DTW(Q, C) \le th$.
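A banded DTW computation can be sketched as follows. The sketch assumes a squared-difference local cost with a final square root, which is one common formulation and is not stated in the text; the function name `dtw_band` is hypothetical.

```python
import math

def dtw_band(q, c, r):
    """DTW distance between q and c restricted to a Sakoe-Chiba band of half-width r."""
    n, m = len(q), len(c)
    r = max(r, abs(n - m))  # the band must be wide enough to reach the last cell
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        # Only cells with |i - j| <= r are visited.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[m])

q = [0, 0, 1, 2, 1, 0]
c = [0, 1, 2, 1, 0, 0]   # the same bump, shifted one step earlier
no_warp = dtw_band(q, c, 0)   # band of width 0: point-to-point comparison
one_step = dtw_band(q, c, 1)  # one step of warping absorbs the shift
```

In this example a band of width 1 is enough to align the two bumps exactly, while width 0 reduces DTW to the Euclidean distance.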
D. Important Extrema

Important extrema in a time series mark important change points of the time series. The algorithm for identifying important extrema was first introduced by Pratt and Fink, 2002 [8]. Fink and Gandhi, 2007 [2] proposed an improved variant of the algorithm for finding important extrema. In both papers, the concept of important extrema was used for time series compression. In this work, however, we exploit important extrema for segmenting time series into subsequences.

In [2], Fink and Gandhi give the definition of important extrema and an algorithm that identifies important extrema in a given time series. Intuitively, an important minimum is the minimum value of some segment whose endpoints are much larger than it. Similarly, an important maximum is the maximum value of some segment whose endpoints are much smaller than it. The definition of important extrema requires a positive parameter $R$, called the compression rate. An increase of $R$ leads to the selection of fewer important extreme points.

Given a time series $T$ of length $n$, all important minima and maxima are identified by scanning from the beginning of the time series with the algorithm given in [2]. The algorithm takes linear computational time and constant memory.
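The intuition behind important extrema can be illustrated with a simplified one-pass scan. This is not the algorithm of [2]: it assumes the importance test "differs by at least R in absolute value", treats the start of the series loosely, and does not report a trailing extremum that is not followed by a swing of size R. The function name `important_extrema` is hypothetical.

```python
def important_extrema(t, R):
    """Return (index, kind) pairs of alternating extrema separated by swings >= R."""
    extrema = []
    if not t:
        return extrema
    min_i = max_i = 0    # positions of the running minimum and maximum
    looking_for = None   # undecided until the first swing of size R is seen
    for k in range(1, len(t)):
        if t[k] > t[max_i]:
            max_i = k
        if t[k] < t[min_i]:
            min_i = k
        if looking_for in (None, "max") and t[max_i] - t[k] >= R:
            extrema.append((max_i, "max"))  # the series fell by R after this maximum
            min_i = k
            looking_for = "min"
        elif looking_for in (None, "min") and t[k] - t[min_i] >= R:
            extrema.append((min_i, "min"))  # the series rose by R after this minimum
            max_i = k
            looking_for = "max"
    return extrema

zigzag = important_extrema([0, 5, 0, 5, 0], 3)
noisy = important_extrema([0, 1, 0.5, 5, 4.6, 4.9, 0], 3)
```

Small wiggles such as the dip from 1 to 0.5 are ignored, and choosing a larger R leaves fewer extrema, in line with the role of the compression rate described above.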
III. THE HYBRID METHOD FOR TIME SERIES SUBSEQUENCE JOIN

Now we present our new hybrid method for time series subsequence join, which combines our previous method, EP-M [11], and the JOCOR algorithm [7]. The hybrid method exploits the relationship between the two definitions of time series subsequence join. The first definition has the spirit of range search, which finds all pairs of subsequences drawn from two time series that satisfy a given similarity threshold $th$, while the second definition has the spirit of nearest neighbor search, which finds only the most correlated pair of subsequences. From the result of a range search, we can derive the result of a nearest neighbor search.

Based on this rationale, the hybrid method consists of the following main ideas. Given two time series, first we apply the EP-M approach, which finds all the matching subsequences in the two time series. Second, we concatenate all matching subsequences whose indexes are adjacent into longer ones. Then, we find the pair of subsequences which has the smallest distance between them. Finally, we apply the JOCOR algorithm to the pair of subsequences obtained in the preceding step to find the most correlated pair of subsequences.
A. The Proposed Method

The hybrid algorithm for subsequence join over time series consists of the following steps.

Step 1: We extract all important extrema of the two time series $T_1$ and $T_2$. The results of this step are two lists of important extrema $EP_1 = (ep1_1, ep1_2, \ldots, ep1_{m_1})$ and $EP_2 = (ep2_1, ep2_2, \ldots, ep2_{m_2})$, where $m_1$ and $m_2$ are the numbers of important extrema in $T_1$ and $T_2$, respectively. Afterward, when extracting subsequences from a time series ($T_1$ or $T_2$), we extract the subsequence bounded by the extrema $ep_i$ and $ep_{i+2}$.

Step 2: We keep time series $T_1$ fixed, and for each subsequence $s$ extracted from $T_1$ we find all its matching subsequences in $T_2$ by shifting a sliding window of length equal to the length of $s$ along $T_2$ one data point at a time. We store all the resulting subsequences in the result set $S$.

Step 3: We concatenate all the matching pairs resulting from Step 2 if the indexes of these pairs are adjacent. Then, we find the pair of subsequences which has the smallest distance between them.

Step 4: We apply the JOCOR algorithm to calculate Pearson's correlation coefficient and find the most correlated subsequence among the candidate subsequences found in Step 3.

Notice that our previous algorithm EP-M consists of Step 1 and Step 2 of the hybrid algorithm. The pseudo-code for Steps 1 to 4 of the hybrid method is given in Fig. 4 and Fig. 5.

Procedure Subsequence_Matching invokes the procedure DTW_EA, which computes the DTW distance. The DTW_EA procedure applies the Early Abandoning technique, as described in the next subsection.
Algorithm Hybrid(T1[1..n1], T2[1..n2])
Input: Two time series T1 and T2
Output: The pair of subsequences which has the maximum Pearson's correlation coefficient
1. EP1 := Important_Extrema(T1);
   EP2 := Important_Extrema(T2)
   For i := 1 to length(EP1) - 2 do
      Subsequence_T1(i) := T1[EP1(i) to EP1(i+2)]
   For i := 1 to length(EP2) - 2 do
      Subsequence_T2(i) := T2[EP2(i) to EP2(i+2)]
2. For i := 1 to length(EP1) - 2 do
      s := Subsequence_T1(i)
      Subsequence_Matching(s, T2, threshold)
   Store all the resulting pairs of subsequences in S
3. For each resulting pair of subsequences in S
      if the indexes of the subsequences are adjacent then
         concatenate these subsequences into longer subsequences, and update the result set S
4. From the result set S, find the pair of subsequences (s1, s2) which has the smallest DTW distance between them
5. Apply the JOCOR algorithm to find the most correlated subsequences in the pair of candidate subsequences (s1, s2) found in Step 4

Figure 4. The proposed method HYBRID
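Steps 1 and 2 of Fig. 4 can be sketched as follows, taking the extrema positions as given. The distance function is passed in as a parameter; the demo uses a plain Euclidean distance as a stand-in for DTW, and the names `extract_subsequences`, `subsequence_matching` and `euclid`, the toy series and the extrema positions are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

def extract_subsequences(ep: Sequence[int]) -> List[Tuple[int, int]]:
    """Inclusive (start, end) index pairs bounded by extrema ep[i] and ep[i+2]."""
    return [(ep[i], ep[i + 2]) for i in range(len(ep) - 2)]

def subsequence_matching(s: Sequence[float], t: Sequence[float], threshold: float,
                         dist: Callable[[Sequence[float], Sequence[float]], float]):
    """Slide a window of len(s) over t one point at a time; keep windows within threshold."""
    m = len(s)
    matches = []
    for start in range(len(t) - m + 1):
        d = dist(s, t[start:start + m])
        if d <= threshold:
            matches.append((start, d))
    return matches

def euclid(a, b):
    """Stand-in distance for this demo; the method in the text uses DTW."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

t1 = [0, 2, 0, 3, 0, 2, 0]
t2 = [9, 9, 0, 2, 0, 3, 0, 9]
ep1 = [0, 1, 2, 3, 4]  # hypothetical important-extrema positions in t1

result = []  # entries: (start in t1, start in t2, length, distance)
for a, b in extract_subsequences(ep1):
    s = t1[a:b + 1]
    for start, d in subsequence_matching(s, t2, threshold=0.5, dist=euclid):
        result.append((a, start, len(s), d))
```

Because consecutive subsequences from `ep[i]` to `ep[i+2]` overlap, neighbouring matches tend to line up, which is what the concatenation in Step 3 exploits.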
In order to make matching between two time series meaningful, both must be normalized. In our subsequence join method, before applying the hybrid algorithm, we use min-max normalization to transform the scale of one time series to the scale of the other, based on the maximum and minimum values of each time series.
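One way to read this normalization is as a linear map that sends the minimum and maximum of one series to those of the other. The sketch below follows that reading; the exact formula is an assumption, as is the function name `rescale_to`.

```python
import numpy as np

def rescale_to(src, ref):
    """Min-max rescale src so that its range matches the range of ref."""
    src = np.asarray(src, dtype=float)
    ref = np.asarray(ref, dtype=float)
    span = src.max() - src.min()
    if span == 0:
        # A constant series carries no shape information; map it to ref's minimum.
        return np.full_like(src, ref.min())
    unit = (src - src.min()) / span                     # values in [0, 1]
    return unit * (ref.max() - ref.min()) + ref.min()   # values in ref's range

scaled = rescale_to([10, 20, 30], [0.0, 1.0, 2.0])
```

After rescaling, amplitude differences between the two series no longer dominate the DTW distance, so a single threshold can be applied to both.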
B. Some Other Issues

How to estimate the width of the Sakoe-Chiba band

According to the research by Ratanamahatana and Keogh, 2005 [9], too large a warping window constraint may hurt accuracy rather than improve it. Their claim is summarized in the remark "a little warping is good while too much warping is bad." Following this suggestion, in this work, we use 2% as the fixed width of the Sakoe-Chiba band $r$ in the DTW distance calculation. That means $r$ is computed by the formula:

r = 0.02 * max(length(subsequence1), length(subsequence2))

How to speed up the calculation of DTW distance

To speed up the calculation of DTW distance, in this work, we use the Sakoe-Chiba band to constrain the warping path. In addition, we apply the Early Abandoning technique, proposed by Li and Wang, 2009 [5], to accelerate the computation of DTW distance.
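The two speed-ups can be combined in a short sketch. It uses one common form of early abandoning, namely stopping once every cell in the current row already exceeds the threshold; this is not necessarily the exact scheme of [5]. Rounding the 2% band width up to at least one cell, the squared-difference cost, and the names `band_width` and `dtw_ea` are assumptions.

```python
import math

def band_width(len1, len2, fraction=0.02):
    """Band width as a fraction of the longer subsequence (rounded up, at least 1)."""
    return max(1, math.ceil(fraction * max(len1, len2)))

def dtw_ea(q, c, threshold, r):
    """Banded DTW that returns math.inf once no warping path can stay within threshold."""
    n, m = len(q), len(c)
    r = max(r, abs(n - m))
    limit = threshold ** 2  # accumulated costs are squared distances
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        # Every path crosses this row and costs never decrease, so abandon early.
        if min(cur) > limit:
            return math.inf
        prev = cur
    return math.sqrt(prev[m])

q = [0, 0, 1, 2, 1, 0]
close = dtw_ea(q, [0, 1, 2, 1, 0, 0], threshold=0.5, r=band_width(6, 6))
far = dtw_ea(q, [5, 5, 5, 5, 5, 5], threshold=1.0, r=band_width(6, 6))
```

For the second pair the computation stops after the first row, which is the saving that early abandoning provides when most windows are far from the query.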
IV. EXPERIMENTAL EVALUATION

We implemented the two methods in MATLAB and carried out the experiments on a PC with an Intel(R) Core(TM) i7-4790 3.6 GHz CPU and 8 GB RAM. In the experiments, we compare the performance of our hybrid method to that of exact JOCOR on three measurements: the correlation coefficient of the resulting subsequence, the runtime of the algorithm, and the length of the resulting subsequence.
Procedure Subsequence_Matching(s[1..m], T[1..n], threshold)
Input: T is a time series, s is a subsequence and threshold is the similarity threshold
Output: The set S of all matching subsequences
1  for i = 1 to n - m + 1 do
2     segment_of_T = subsequence T[i : i+m-1]
3     dtw_distance = DTW_EA(s, segment_of_T, threshold)
4     if (dtw_distance ≤ threshold) then
5        Store the pair (s, segment_of_T) to the result set S
6  endfor
end Procedure

Figure 5. The procedure for subsequence matching

A. Datasets
Our experiments were conducted over datasets from the UCR Time Series Data Mining archive [4] and from [7]. Nine datasets are used in these experiments. Their names and lengths are as follows: Chromosome (999,541 data points), Stock (2,119,415 data points), EEG (10,957,312 data points), Random Walks (RW2, 1,600,002 data points), Ratbp (1,296,000 data points), LFS6 (180,214 data points), LightCurve (8,192,002 data points), Currency (15,000 data points) and Salinity (15,000 data points).
The datasets are categorized into two types. In the first type, each dataset is a long time series. In this case, we divide each time series into two equal halves. The first half is $T_1$ and the second is $T_2$. In the second type, each dataset is a medium-size time series. Datasets of this kind are downloaded from the UCR archive. The downloaded dataset is $T_1$. Based on $T_1$, we randomly generate the synthetic dataset $T_2$ by applying the following rule:

$$x_i = x_{i-1} \pm |x_{i-1} - \ldots|$$

In the above formula, + or - is determined by a random process. The rationale for generating $T_2$ from $T_1$ in this way is that $T_2$ should have a high probability of being correlated with $T_1$. Time series $T_2$ is generated after the corresponding dataset has been normalized; therefore, there is no effect of noise in $T_2$.

B. The Hybrid Method versus Exact JOCOR

When operating on very long time series, response time is one of the most challenging factors for researchers. In this experiment, we compare the performance of our proposed method with that of exact JOCOR. The performance of each method is evaluated by three measurements: the maximum correlation coefficient of the resulting subsequence, the runtime of the method, and the length of the resulting subsequence. Because the lengths of the resulting subsequences of the two methods are almost the same, we exclude them from our comparison.

From Table I, with datasets Stock, EEG and Chromosome, our hybrid method produced maximum correlation values that match those of exact JOCOR to within 0.001. Averaged over all experiments, the maximum correlation coefficients obtained by the hybrid method are nearly identical to those of JOCOR. These experimental results support the correctness and the accuracy of the EP-M method in time series subsequence join even though EP-M is just an approximate method.
TABLE I. EXPERIMENTAL RESULTS OF THE HYBRID METHOD AND JOCOR OVER 9 DATASETS (RT: RUNTIME IN SECONDS; MC: MAXIMUM CORRELATION)

Dataset     length=1000       length=4000       length=15000      Method
            RT       MC       RT       MC       RT       MC
Stock       6        0.995    549      0.998    2349     0.999    Hybrid
            17       0.995    565      0.999    >12 hrs  N/A      JOCOR
RW2         12       0.966    74       0.973    2125     0.951    Hybrid
            41       0.966    1026     0.976    10461    0.985    JOCOR
RATBP       7        0.978    136      0.997    10885    0.999    Hybrid
            22       0.984    3140     0.999    9488     0.999    JOCOR
LFS6        22       0.992    352      0.986    1989     0.996    Hybrid
            24       0.992    600      0.997    8076     0.998    JOCOR
EEG         17       0.877    393      0.890    13886    0.908    Hybrid
            41       0.877    1095     0.890    43885    0.908    JOCOR
Currency    4        0.967    135      0.966    5961     0.966    Hybrid
            14       0.977    1057     0.988    7480     0.966    JOCOR
Salinity    5        0.976    47       0.992    1595     0.995    Hybrid
            14       0.983    286      0.996    5670     0.997    JOCOR
Chromosome  14       0.999    350      0.999    13907    0.999    Hybrid
            11       0.999    351      0.999    10030    0.999    JOCOR
LightCurve  9        1.000    561      1.000    1989     0.996    Hybrid
            8        1.000    496      1.000    8078     0.998    JOCOR
Regarding the runtime, our hybrid method outperforms JOCOR on seven out of nine datasets. In particular, on the LightCurve dataset with 15,000 data points, our method ran more than 4 times faster than JOCOR. At length 4000, the hybrid method runs about 6.46 times faster than JOCOR on average. The gap between the runtimes of the hybrid method and JOCOR widens as the length of the datasets increases. Nevertheless, on the Chromosome and RATBP datasets, JOCOR runs slightly faster than our hybrid method. This is because, for these specific time series, the early abandoning technique [5] applied to the DTW calculation had no effect in pruning costly distance computations, and hence the time series underwent several DTW calculations before actually being joined by the JOCOR algorithm.
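The average speed-up at length 4000 can be recomputed from the runtimes in Table I. The sketch below assumes that "on average" means the arithmetic mean of the per-dataset ratios of JOCOR runtime to hybrid runtime; under that assumption the mean comes out at roughly 6.47, close to the reported 6.46.

```python
# Runtimes in seconds at length 4000, copied from Table I: (hybrid, JOCOR).
rt_4000 = {
    "Stock": (549, 565),
    "RW2": (74, 1026),
    "RATBP": (136, 3140),
    "LFS6": (352, 600),
    "EEG": (393, 1095),
    "Currency": (135, 1057),
    "Salinity": (47, 286),
    "Chromosome": (350, 351),
    "LightCurve": (561, 496),
}

# Speed-up factor per dataset: values above 1 mean the hybrid method is faster.
speedup = {name: jocor / hybrid for name, (hybrid, jocor) in rt_4000.items()}
mean_speedup = sum(speedup.values()) / len(speedup)
```

The mean is dominated by RATBP and RW2, where the ratio exceeds 10, while on LightCurve the ratio is below 1 at this length.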
V. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a new hybrid method to find the most correlated subsequences of two time series with acceptable time efficiency. The experimental results show that our hybrid method runs faster than JOCOR on most datasets while the accuracy of our approach is nearly the same as that of JOCOR. We attribute the high performance of the hybrid method to the effectiveness of EP-M in performing the range search and to applying JOCOR to post-process the candidate subsequences obtained from EP-M.

As for future work, we intend to apply a different distance, Complexity Invariant Distance (CID) [1], in the hybrid method instead of DTW in order to improve its time efficiency.
REFERENCES

[1] G. P. A. Batista, X. Wang, and E. J. Keogh, "Complexity Invariant Distance for Time Series," in Proc. of SIAM Intl. Conf. on Data Mining, 2011, pp. 699-710.
[2] E. Fink and H. S. Gandhi, "Important extrema of time series," in Proc. of IEEE Intl. Conf. on System, Man and Cybernetics, Montreal, Canada, 2007, pp. 366-372.
[3] F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," in IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-23, 1975, pp. 52-72.
[4] E. Keogh, "The UCR Time Series Classification/Clustering," URL: www.cs.ucr.edu/~eamonn/time_series_data/, 2015.
[5] J. Li and Y. Wang, "Early Abandon to Accelerate Exact Dynamic Time Warping," in The International Arab Journal of Information Technology, vol. 6(2), 2009, pp. 144-152.
[6] Y. Lin and M. D. McCool, "Subseries Join: A Similarity-based Time Series Match Approach," in Proc. of PAKDD Intl. Conf. on, 2010, Part 1, LNAI 6118, pp. 238-245.
[7] A. Mueen, H. Hamooni, and T. Estrada, "Time Series Join on Subsequence Correlation," in Proc. of ICDM, 2014, pp. 450-459.
[8] K. B. Pratt and E. Fink, "Search for Patterns in Compressed Time Series," in International Journal of Image and Graphics, vol. 2(1), 2002, pp. 89-106.
[9] C. A. Ratanamahatana and E. Keogh, "Three myths about Dynamic Time Warping Data Mining," in Proc. of SDM'05, 2005.
[10] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," in IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-26, 1978, pp. 43-49.
[11] V. D. Vinh and D. T. Anh, "Efficient Subsequence Join over Time Series under Dynamic Time Warping," in Recent Developments in Intelligent Information and Database Systems, Studies in Computational Intelligence, vol. 642, D. Krol, L. Madeyski, N. T. Nguyen, Springer, 2016, pp. 41-52.