DSpace at VNU: Discovery of time series k-motifs based on multidimensional index



DOI 10.1007/s10115-014-0814-3

REGULAR PAPER

Discovery of time series k-motifs based on multidimensional index

Nguyen Thanh Son · Duong Tuan Anh

Received: 21 January 2014 / Revised: 16 October 2014 / Accepted: 25 December 2014

Abstract Time series motifs are frequently occurring but previously unknown subsequences of a longer time series. Discovering time series motifs is a crucial task in time series data mining. In a time series motif discovery algorithm, finding the nearest neighbors of a subsequence is the basic operation. To make this basic operation efficient, we can make use of an advanced multidimensional index structure for time series data. In this paper, we propose two novel algorithms for discovering motifs in time series data: the first algorithm is based on the R∗-tree and the early abandoning technique, and the second algorithm makes use of a dimensionality reduction method and the state-of-the-art Skyline index. We demonstrate the effectiveness of our proposed algorithms by experimenting on real datasets from different areas. The experimental results reveal that our two proposed algorithms outperform the most popular method, random projection, in time efficiency while yielding the same accuracy.

Keywords Time series · k-Motifs · Motif discovery · Multidimensional index · R-tree · Skyline index

1 Introduction

Many researchers have been studying the extraction of various characteristics from time series data. One of these challenges, the efficient discovery of 'motifs', has received much attention. Time series motifs are frequently occurring but previously unknown subsequences of a longer time series which are very similar to each other. This motif concept is generalized to the

N. T. Son
Faculty of Information Technology, Ho Chi Minh University of Technical Education, Ho Chi Minh City, Vietnam

D. T. Anh (B)
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
e-mail: dtanh@cse.hcmut.edu.vn


k-motifs problem, where the top k motifs are returned. Since its first formalization by Lin et al. [14], motif discovery has been used to solve problems in several application areas [3,6,9,10,17,19,22,28] and also as a preprocessing step in several higher-level data mining tasks such as time series clustering, time series classification, rule discovery, and summarization.

Among the dozen algorithms for finding motifs that have been proposed in the literature, most work on time series transformed by some dimensionality reduction or discretization method. The most popular algorithm for finding time series motifs is the random projection algorithm proposed by Chiu et al. [5]. This algorithm can find motifs in linear time and is robust to noise. However, it still has some drawbacks: first, if the distribution of the projections is not sufficiently wide, it becomes quadratic in time and space; second, random projection is based on locality-preserving hashing, which is effective only for a relatively small number of projected dimensions (10–20) [4]. Besides random projection, in 2003 and 2005, Tanaka et al. proposed two algorithms, MD and EMD, that apply the minimum description length principle to determine the optimal length of a time series motif during the process of motif discovery. Mueen et al. [18] proposed a tractable exact motif discovery algorithm, called the MK algorithm, which can work directly on the original time series. The MK algorithm improves on the exhaustive brute-force search by using several speedup techniques. Mueen et al. showed that while this exact algorithm is still quadratic in the worst case, it can be up to three orders of magnitude faster than the brute-force algorithm. We note that the two popular approaches, random projection [5] and MK [18], and some other approaches for finding time series motifs (e.g., [6,9,27]) do not employ any index structure, and their computational costs are still high.

In a time series motif discovery algorithm, finding the nearest neighbors of a subsequence is the basic operation. To make this basic operation efficient, we can make use of an advanced index structure for time series data. In our work, we introduce two novel algorithms for discovering approximate k-motifs in a long time series: the first is based on the R∗-tree and the early abandoning technique, and the second makes use of the MP_C dimensionality reduction method [24] and the state-of-the-art Skyline index [16]. Both approaches employ a multidimensional index structure to speed up the search for nearest neighbors of a subsequence. Our proposed algorithms are disk efficient because they require only a single sequential disk scan to read the entire time series. Besides, these methods can work directly on numerical time series data transformed by some dimensionality reduction method, without applying any discretization process.

We carried out several experiments on time series datasets from various areas to compare the two proposed algorithms with random projection. The experimental results show that both proposed algorithms outperform the random projection algorithm in terms of time efficiency while yielding the same accuracy.

The rest of the paper is organized as follows. In Sect. 2, we review related works and basic concepts on time series motifs. Section 3 introduces the motif discovery algorithm based on the R∗-tree and the early abandoning technique. Section 4 describes the motif discovery algorithm that makes use of the MP_C dimensionality reduction method and the Skyline index. Section 5 presents our experimental evaluation on real datasets. In Sect. 6, we include some conclusions and remarks on future works.


2 Background

2.1 Basic concepts

There have been several different definitions of time series motifs. For example, one could choose the nearest neighbor motif definition [18], which defines the motif of a time series database as the unordered pair of time series in the database that is the most similar among all possible pairs. However, this motif definition does not take into account the frequency of the subsequences. Therefore, it is not convenient to use this definition in practical applications of motifs.

In this work, we use the popular and basic definition of time series motifs formalized in [14]. In this subsection, we give the definitions of the terms formally.

Definition 1 A time series is a real-valued sequence of length n over time, i.e., if T is a time series then $T = (t_1, \ldots, t_n)$ where each $t_i$ is a real number.

Time series can be very long. In data mining, subsections of the time series, which are called subsequences, are considered. So the definition of a subsequence is needed.

Definition 2 Given a time series $T = (t_1, \ldots, t_n)$, a subsequence of length m of T is a sequence $S = (t_i, \ldots, t_{i+m-1})$ with $1 \le i \le n - m + 1$.

In discovering motifs, we need to determine whether a given subsequence is similar to others. This match is defined as follows.

Definition 3 Given a threshold R, a positive real number, a time series T, a subsequence $C_i$ of T beginning at position i and a subsequence $C_j$ of T beginning at position j, if $\mathrm{Distance}(C_i, C_j) \le R$ then $C_j$ is called a matching subsequence of $C_i$.

Obviously, the best matches to a subsequence C tend to be the subsequences that begin just one or two points to the left or the right of C. These are called trivial matches. The definition of trivial matches is given as follows.

Definition 4 Given a time series T, a subsequence $C_i$ of T beginning at position i and a matching subsequence $C_j$ of T beginning at position j, $C_j$ is called a trivial match to $C_i$ if either $i = j$ or there does not exist a subsequence $C_k$ beginning at position k such that $\mathrm{Distance}(C_i, C_k) > R$ and either $i < k < j$ or $j < k < i$.
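Definitions 3 and 4 can be made concrete with a short sketch. The function names `euclidean`, `is_match` and `is_trivial_match` are hypothetical, positions are 0-based rather than 1-based, and Euclidean distance is assumed for Distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_match(T, i, j, m, R):
    """Definition 3: C_j matches C_i when their distance is at most R."""
    return euclidean(T[i:i + m], T[j:j + m]) <= R

def is_trivial_match(T, i, j, m, R):
    """Definition 4, for a C_j that already matches C_i.

    The match is trivial if i == j, or if no subsequence starting strictly
    between i and j lies farther than R from C_i.
    """
    if i == j:
        return True
    lo, hi = min(i, j), max(i, j)
    c_i = T[i:i + m]
    return not any(euclidean(c_i, T[k:k + m]) > R for k in range(lo + 1, hi))
```

For example, with `T = [0, 0, 0, 0, 5, 5, 5, 0, 0, 0, 0]`, `m = 3` and `R = 1.0`, the window at position 1 is a trivial match of the window at position 0, while the window at position 8 is a non-trivial match because the windows covering the plateau of fives lie in between.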

The kth most significant motifs in a time series can be defined as follows.

Definition 5 Given a time series T, a subsequence length n and a threshold R, the most significant motif in T (called the 1-motif) is the subsequence $C_1$ that has the highest count of non-trivial matches. The kth most significant motif in T (called the k-motif) is the subsequence $C_k$ that has the highest count of non-trivial matches and satisfies $\mathrm{Distance}(C_i, C_k) > 2R$ for all $1 \le i < k$.

Note that in Definition 5, we force the sets of subsequences in different motifs to be mutually exclusive. This is important because otherwise two motifs could share the same objects. The set of subsequences in each motif is called the instances of that motif.

Lin et al. [14] also introduced the brute-force algorithm to find the 1-motif (see Fig. 1). This brute-force algorithm works directly on the raw time series and requires two user-defined parameters: the threshold R and the length of subsequences n. In the brute-force algorithm, the basic operation in the inner loop is finding the non-trivial matches for the subsequence in question.


Fig. 1 The outline of the brute-force algorithm for 1-motif discovery in time series

Algorithm Find-1-Motif-Brute-Force(T, n, R)
    best_motif_count_so_far = 0;
    best_motif_location_so_far = null;
    for i = 1 to length(T) - n + 1 {
        count = 0; pointers = null;
        for j = 1 to length(T) - n + 1
            if Non_Trivial_Match(C[i: i + n - 1], C[j: j + n - 1], R) {
                count = count + 1;
                pointers = append(pointers, j);
            }
        if count > best_motif_count_so_far {
            best_motif_count_so_far = count;
            best_motif_location_so_far = i;
            motif_matches = pointers;
        }
    }
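The outline in Fig. 1 can be turned into a small runnable sketch. The function name `find_1_motif_brute_force` and the 0-based positions are assumptions made here, and Euclidean distance is assumed for the match test. Instead of calling a separate Non_Trivial_Match routine for every pair, the sketch walks away from position i in both directions and accepts a match only after it has passed a window farther than R, which is the condition of Definition 4.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_1_motif_brute_force(T, n, R):
    """Return (start of the 1-motif, sorted starts of its non-trivial matches)."""
    num = len(T) - n + 1
    best_count, best_loc, best_matches = 0, None, []
    for i in range(num):
        c_i = T[i:i + n]
        matches = []
        # Walk left, then right. A match counts as non-trivial only once some
        # window between it and position i lies farther than R from c_i.
        for step in (-1, 1):
            seen_far = False
            j = i + step
            while 0 <= j < num:
                if euclidean(c_i, T[j:j + n]) > R:
                    seen_far = True
                elif seen_far:
                    matches.append(j)
                j += step
        if len(matches) > best_count:
            best_count, best_loc, best_matches = len(matches), i, sorted(matches)
    return best_loc, best_matches
```

As in Fig. 1, every subsequence is compared with every other one, so the running time is quadratic in the number of subsequences; this is the cost that the index-based algorithms of Sects. 3 and 4 aim to reduce.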

series, is the subsequence that has most non-trivial subsequence matches

Chiu et al. [5] proposed the random projection algorithm for discovering time series motifs. This work is based on research on pattern discovery from the bioinformatics community [2]. The random projection algorithm uses the SAX discretization method [15] to represent time series subsequences, together with a collision matrix. In each iteration, the algorithm randomly selects some positions in each SAX representation to act as a mask and traverses the list of SAX representations. If the two SAX representations corresponding to subsequences i and j match, cell (i, j) in the collision matrix is incremented. After the process is repeated an appropriate number of times, the largest entries in the collision matrix are selected as candidate motifs. At last, the original data corresponding to each candidate motif is checked to verify the result. The complexity of this algorithm is linear in terms of the SAX word length, number of subsequences, number of iterations, and number of collisions. This algorithm can find all the motifs with high probability after an appropriate number of iterations, even in the presence of noise. However, its complexity becomes quadratic if the distribution of the projections is not wide enough, i.e., if a large number of subsequences have the same projection.
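The collision-counting step described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [5]: the SAX words are taken as given strings, the function name `collision_counts` and the parameter `keep_size` (the number of positions left visible by the mask) are hypothetical, and a dictionary stands in for the collision matrix.

```python
import random
from collections import defaultdict
from itertools import combinations

def collision_counts(sax_words, keep_size, iterations, seed=0):
    """Count how often each pair of SAX words agrees on randomly kept positions."""
    rng = random.Random(seed)
    word_len = len(sax_words[0])
    counts = defaultdict(int)          # sparse collision matrix: (i, j) -> count
    for _ in range(iterations):
        # Positions that stay visible in this iteration; the rest are masked.
        keep = sorted(rng.sample(range(word_len), keep_size))
        buckets = defaultdict(list)
        for idx, word in enumerate(sax_words):
            buckets[tuple(word[p] for p in keep)].append(idx)
        # Every pair of words falling into the same bucket collides once.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                counts[(i, j)] += 1
    return counts
```

The inner pair loop also shows where the quadratic behaviour comes from: when many words share the same projection, a single bucket holds many members and the number of pairs to increment grows quadratically with its size.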

Ferreira et al. [6] proposed another approach for discovering approximate motifs from time series. First, this algorithm transforms subsequences from time series of proteins into the SAX representation, then finds clusters of subsequences and expands the length of each retrieved motif until the similarity drops below a user-defined threshold. It can be used to discover motifs in multivariate time series or motifs of different sizes. Its complexity is quadratic, and the whole dataset must be loaded into main memory.


Yankov et al. [29] introduced an algorithm to deal with uniform scaling of time series. This approach uses an improved random projection to discover motifs under uniform scaling. The concept of time series motif is redefined in terms of nearest neighbor: the subsequence motif is a pair of subsequences of a long time series that are nearest to each other. The only parameter that needs to be defined by the user is the motif length (besides SAX's parameters). This approach has the same drawbacks as the random projection algorithm, and its overhead increases because of the need to find the best scaling factors.

Tanaka and Uehara [25] proposed the motif discovery (MD) algorithm, which can find motifs in multidimensional time series data. First, the MD algorithm transforms multidimensional time series data into 1-dimensional data by using PCA (Principal Component Analysis) to reduce the dimensions of the data. Then, it transforms the data into a sequence of symbols. Finally, it discovers the motif by calculating the description length of a pattern based on the minimum description length (MDL) principle. That means the suitable length of the motif is determined automatically by the MD algorithm. The MD algorithm is useful and effective under the assumption that the lengths of all the instances of the motif are identical. However, in the real world, the lengths of the instances of a motif differ slightly from each other. To overcome this limitation, in 2005, Tanaka et al. proposed an extended variant of MD, called the EMD (Extended Motif Discovery) algorithm, which includes the two following modifications. First, EMD transforms the symbol sequence that represents the behavior of a given time series to a form in which motif instances of different lengths can be extracted. Second, it uses a new definition of the description length of a time series to process not only motif instances of the same length but also motif instances of different lengths. Since in the EMD algorithm the lengths of the instances of a motif can differ slightly from each other, Tanaka et al. suggested that dynamic time warping (DTW) distance should be used to calculate the distances between the motif instances in this case. Due to this suggestion, EMD becomes a complicated algorithm with high computational complexity that is not easy to implement in practice.

The first clustering-based method for time series motif discovery is the one proposed by Gruber et al. [9]. This method employs the concept of significant extreme points that was proposed by Pratt and Fink [20]. The algorithm proposed by Gruber et al. for finding time series motifs consists of three steps: extracting significant extreme points, determining motif candidates from the extracted significant extreme points, and clustering the motif candidates. After the clustering step, the cluster with the largest number of instances is the 1-motif of the time series. When Gruber et al. proposed this method, they applied it in signature verification and did not compare it with any previous time series motif discovery algorithm.

Based on the random projection algorithm, Tang and Liao [27] introduced a method that can discover time series motifs with different lengths. The main idea of this method is that it first uses random projection to discover motifs with short lengths, and then applies a technique to concatenate these motifs into longer motifs.

Under the new nearest neighbor motif definition, Mueen et al. [18] proposed a tractable exact motif discovery algorithm, called the MK algorithm, which can work directly on the original time series. The MK algorithm improves on the brute-force algorithm by using several speedup techniques. It is based on the idea of early abandoning the Euclidean distance calculation when the current cumulative sum is greater than the best-so-far. The motif search is guided by heuristic information from the linear ordering of the distances of the objects with respect to a few random reference points. Mueen et al. showed that while this exact algorithm is still quadratic in the worst case, it can be up to three orders of magnitude faster than the brute-force algorithm. However, the nearest neighbor definition adopted by MK is not convenient to use in practice, and applying Euclidean distance directly to the raw data can incur some robustness problems when dealing with noisy data.


From previous algorithms for time series motif discovery, we can identify some typical approaches for tackling this problem: (i) the approach that is based on locality-preserving hashing, such as [6,27,29]; (ii) the MDL-based approach that can automatically determine the optimal length for the 1-motif, such as MD [25] and EMD [26]; (iii) the approach that is based on segmentation and clustering, such as [9]; and (iv) the approach that is based on the brute-force method with some speedup techniques, such as the MK algorithm [18].

3 Discovering time series motifs based on R∗-tree and early abandoning

In this section, we present our first novel algorithm for time series motif discovery. The basic intuition behind this algorithm is that a multidimensional index, such as the R∗-tree [1], can help in efficiently retrieving the nearest neighbors of a subsequence, and the idea of early abandoning introduced in [18] can be used to reduce the cost of Euclidean distance calculation.

In a multidimensional index structure such as the R∗-tree, each node is associated with a minimum bounding rectangle (MBR). If v is an internal node, the MBRs of all entries of its immediate child nodes are covered by its MBR. The MBRs of nodes at the same level might overlap. If v is a leaf node, then its MBR is the minimum bounding rectangle of all the entries contained in v. Each entry in a leaf node contains its MBR and a pointer to the data object represented by this entry.

In the proposed algorithm for motif discovery, we create a minimum bounding rectangle in the m-dimensional space ($m \ll n$) for each subsequence extracted from a longer time series through a sliding window. Then, each subsequence is inserted into the R∗-tree based on its MBR. To find the matching neighbors of a subsequence s by searching the R∗-tree, we need a distance function $D_{region}(s, R)$ between the subsequence s and the MBR R associated with a node in the index structure such that $D_{region}(s, R) \le D(s, C)$ for any subsequence C contained in the MBR R.

Before introducing the definition of $D_{region}(s, R)$, we describe how to define the minimum bounding rectangle for a group of time series in our proposed motif discovery algorithm.

Notice that a time series of length n can be viewed as a point in n-dimensional space.

Assume that we have built an index structure for a time series database by inserting the group of l time series objects of length n, $C = \{c_1, c_2, \ldots, c_l\}$, into the MBR-based multidimensional index structure. Assume also that we approximate each time series of length n by m equal-sized constant-value segments ($m \ll n$). Let U be a leaf node in the index structure and $R = (R_1, R_2, \ldots, R_m)$ be the MBR associated with U, where $R_j = \{L_j, H_j\} = \{(x_{j\min}, y_{j\min}), (x_{j\max}, y_{j\max})\}$. $R_j$ is the minimum bounding rectangle (in the time-value space) containing the jth segments of all the time series indexed under the node U, and $L_j$, $H_j$ are the leftmost lower corner and rightmost upper corner, respectively, of $R_j$. The MBR associated with a non-leaf node is the smallest rectangle that contains all the MBRs of its immediate child nodes [1]. Here, we can view each MBR as two sequences, the lower-bound sequence $L = \{L_1, \ldots, L_m\}$ and the upper-bound sequence $H = \{H_1, \ldots, H_m\}$, of all time series stored at the node U.

In order to calculate the distance $D_{region}(s, R)$ between a time series s and the bounding region R, we accumulate the distances from all data points in the sequence s to R by computing the distance $d(s_{ji}, R_j)$ from each data point $s_{ji}$ in segment j ($1 \le j \le m$) of time series s to the corresponding jth bounding rectangle $R_j$ of the MBR R. The distance $d(s_{ji}, R_j)$ depends on whether $s_{ji}$ is above, inside, or below $R_j$.


Fig. 2 An example of how to calculate $D_{region}(s, R)$

Definition 6 (Group distance function) Given a subsequence s of length n, a group C of subsequences of length n and a corresponding MBR R for C in the m-dimensional space ($m \ll n$), i.e., $R = (R_1, R_2, \ldots, R_m)$, where $R_j = \{(x_{j\min}, y_{j\min}), (x_{j\max}, y_{j\max})\}$ is the pair of lower and higher endpoints of the major diagonal of $R_j$. The distance function $D_{region}(s, R)$ of the subsequence s from the MBR R is defined as

$$D_{region}(s, R) = \sqrt{\sum_{j=1}^{m} D_{region_j}(s_j, R_j)} \qquad (1)$$

where

$$D_{region_j}(s_j, R_j) = \sum_{i=1}^{N} d(s_{ji}, R_j), \qquad d(s_{ji}, R_j) = \begin{cases} (s_{ji} - y_{j\max})^2 & \text{if } s_{ji} > y_{j\max} \\ (s_{ji} - y_{j\min})^2 & \text{if } s_{ji} < y_{j\min} \\ 0 & \text{otherwise} \end{cases}$$

and N is the length of segment j (N = n/m).

Figure 2 illustrates an example of how to calculate $D_{region}(s, R)$. In this example, s is a subsequence consisting of 9 data points, $s = \{s_1, \ldots, s_9\} = \{s_{11}, s_{12}, s_{13}, s_{21}, s_{22}, s_{23}, s_{31}, s_{32}, s_{33}\}$, and each segment consists of three data points. So R is a sequence of three rectangles, $R = (R_1, R_2, R_3)$. Therefore, we have:

$$D_{region}(s, R) = \sqrt{D_{region_1}(s_1, R_1) + D_{region_2}(s_2, R_2) + D_{region_3}(s_3, R_3)} = \sqrt{(s_{11} - y_{1\max})^2 + (s_{21} - y_{2\min})^2 + (s_{32} - y_{3\min})^2}$$

The other remaining terms are equal to zero since the corresponding points are inside the region R.
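A minimal sketch of the group distance may help to check the example above. The helper names `segment_bounds` and `d_region` are hypothetical, n is assumed to be divisible by m, and only the value bounds $y_{j\min}$ and $y_{j\max}$ of each rectangle are kept because the time extent of a segment is fixed by its index. The sketch reads $D_{region}$ as the square root of the accumulated squared offsets so that it is directly comparable with the Euclidean distance D(s, C); treat that reading as an assumption.

```python
import math

def segment_bounds(group, m):
    """Per-segment (y_min, y_max) over all series in the group.

    Returns two lists of length m, the lower and upper value bounds of the
    rectangles R_1..R_m. Assumes every series has a length divisible by m.
    """
    n = len(group[0])
    N = n // m                          # points per segment
    lows, highs = [], []
    for j in range(m):
        values = [c[i] for c in group for i in range(j * N, (j + 1) * N)]
        lows.append(min(values))
        highs.append(max(values))
    return lows, highs

def d_region(s, lows, highs):
    """Distance from sequence s to the region described by lows and highs."""
    m = len(lows)
    N = len(s) // m
    total = 0.0
    for j in range(m):
        for x in s[j * N:(j + 1) * N]:
            if x > highs[j]:            # point above the rectangle
                total += (x - highs[j]) ** 2
            elif x < lows[j]:           # point below the rectangle
                total += (lows[j] - x) ** 2
            # points inside the rectangle contribute nothing
    return math.sqrt(total)             # assumed square root, see lead-in
```

For the group `[[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]]` with `m = 2`, the bounds are `[1, 4]` and `[4, 7]`, and the sequence `[0, 2, 5, 8, 5, 3]` has four points that each lie one unit outside its rectangle, so the distance is 2.0, which does not exceed the Euclidean distance to either member of the group.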

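To ensure the correctness of using $D_{region}(s, R)$ in searching for the k-nearest neighbors of a query based on a multidimensional index, this group distance must satisfy the group lower-bound property as follows.

Lemma 1 For any subsequence C contained in the MBR R, $D_{region}(s, R) \le D(s, C)$.

Proof According to the definition of the MBR associated with a node U in the index structure and the definition of the distance function $D_{region}(s, R)$, for any subsequence C placed under a node U and the MBR R associated with U, we have

$$y_{j\min} \le c_{ji} \le y_{j\max}, \quad \forall i = 1, \ldots, N, \ \forall j = 1, \ldots, m$$

Hence each data point $s_{ji}$ is at least as far from the corresponding point $c_{ji}$ as it is from the rectangle $R_j$, and therefore

$$D_{region}(s, R) \le D(s, C), \quad \forall C \text{ in the MBR } R$$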

Formula (1), which computes the distance function $D_{region}(s, R)$ of the subsequence s from the MBR R, can be applied in k-nearest neighbor search or range search for a given time series s with the support of the R∗-tree. This distance function is crucial for pruning dissimilar subtrees without loss of completeness and for ranking potentially relevant nodes in k-nearest neighbor search (or for discarding nodes exceeding the range threshold in range search).

3.1 Early abandoning technique

Since the complexity of computing the Euclidean distance between two time series of length n is O(n), we need to reduce this cost. In motif discovery, we have to compute the Euclidean distance whenever we need to find the nearest neighbors of a given time series. Therefore, we can apply the idea of early abandoning, which works as follows: when the Euclidean distance is calculated for a pair of time series, if the cumulative sum of squared differences exceeds the square of the current best-so-far distance at some point, we can abandon the calculation since this pair of time series cannot match each other.
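The idea can be sketched in a few lines. The snake_case name `dis_early_aban` mirrors the function DIS_EARLY_ABAN used later in the paper, but the code below is only an illustrative sketch; when the calculation is abandoned it returns the square root of the partial sum, which is already larger than the best-so-far value and is therefore enough to reject the pair.

```python
import math

def dis_early_aban(x, y, best_so_far):
    """Euclidean distance between x and y with early abandoning.

    The loop stops as soon as the running sum of squared differences exceeds
    best_so_far squared; the returned value is then a partial distance that is
    already greater than best_so_far.
    """
    limit = best_so_far * best_so_far
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
        if total > limit:
            break                       # abandon: the pair cannot match
    return math.sqrt(total)
```

For instance, `dis_early_aban([0, 0, 0], [3, 4, 100], 4)` stops after the second point and returns 5.0 without ever touching the third one, while `dis_early_aban([0, 0, 0], [3, 4, 0], 10)` runs to the end and returns the true distance 5.0.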

3.2 The proposed algorithm

Figure 3 presents the algorithm for finding the k-motifs defined in Definition 5 with the support of the R∗-tree and the idea of early abandoning. In the algorithm, the procedure NEAREST_NEIGHBORS_R($s_i$, R∗-tree, R) is used to find the non-trivial matches of sequence $s_i$ within threshold R based on the R∗-tree index structure. The procedure NEAREST_NEIGHBORS_R makes use of $D_{region}(s, R)$, the group distance between a subsequence s and an MBR R in the R∗-tree, given by Definition 6 and satisfying Lemma 1. The procedure NEAREST_NEIGHBORS_R returns the list X, which keeps the positions of all non-trivial nearest neighbors of the subsequence $s_i$ found based on the group distance. When the list X is obtained, each subsequence $s_x$ corresponding to an element x in X is accessed, and the algorithm calls the function DIS_EARLY_ABAN($s_i$, $s_x$, R) to compute the Euclidean distance between the two subsequences $s_i$ and $s_x$.


Algorithm Discovering top k-motifs with the support of R*-tree and the idea of early abandoning
// S is a time series of length n, s_i is a subsequence of length m in S
// L is a list of k-motifs, C_k is the center of the k-motif
// X is the index list of non-trivial nearest neighbors of a subsequence s_i
// R is a threshold for matching
Procedure L = FINDING_TOP_k_MOTIF(S, k, m, R)
for i = 1 to n-m+1
{
    if (R*-tree != null) X = NEAREST_NEIGHBORS_R(s_i, R*-tree, R)
    for j = 1 to length(X) // length(X): the number of items in list X
    if (the number of elements in L < k)
    number of items in each element
    else if (length(X) > number of items in L_k)
    {
    order of the number of items in each element

Fig. 3 The algorithm for discovering top k-motifs with the support of R∗-tree

Notice that the function DIS_EARLY_ABAN applies the idea of early abandoning. If DIS_EARLY_ABAN($s_i$, $s_x$, R) is greater than R, then x is removed from the list X since $s_x$ is not qualified to be a match of $s_i$. If the list X satisfies all the conditions given in Definition 5, X is inserted into the list of top k-motifs in such a way that the elements of this list stay in decreasing order of the number of entries in each element. The process is repeated until no more subsequences need to be examined.

Figure 4 describes the two auxiliary procedures in our proposed algorithm: NEAREST_NEIGHBORS_R($s_i$, R∗-tree, R) and ADD($MBR_i$, R∗-tree). In the procedure NEAREST_NEIGHBORS_R, the trivial matches are rejected by using the relative positions of the subsequences. Two subsequences are non-trivial matches of each other if there is a gap of at least w positions between the two subsequences.
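The overall flow described above can be illustrated with a simplified sketch. It is not the algorithm of Fig. 3: a plain linear scan stands in for the R∗-tree lookup performed by NEAREST_NEIGHBORS_R, positions are 0-based, the function name `find_top_k_motifs` and the gap parameter `w` passed as an argument are assumptions, and the rule used when a candidate lies within 2R of an existing motif center (replace only if it has strictly more instances) is a choice made here for illustration.

```python
import math

def dis_early_aban(x, y, best_so_far):
    """Euclidean distance that stops once it is known to exceed best_so_far."""
    limit = best_so_far * best_so_far
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
        if total > limit:
            break
    return math.sqrt(total)

def find_top_k_motifs(S, k, m, R, w):
    """Return up to k (center, neighbours) pairs, largest neighbour list first."""
    num = len(S) - m + 1
    motifs = []
    for i in range(num):
        s_i = S[i:i + m]
        # Linear scan standing in for NEAREST_NEIGHBORS_R; a gap of at least
        # w positions rejects trivial matches.
        X = [j for j in range(num)
             if abs(i - j) >= w and dis_early_aban(s_i, S[j:j + m], R) <= R]
        if not X:
            continue
        # Definition 5: motif centers must be more than 2R apart.
        close = [c for c, (center, _) in enumerate(motifs)
                 if dis_early_aban(s_i, S[center:center + m], 2 * R) <= 2 * R]
        if close:
            # Assumed rule: replace overlapping motifs only if strictly larger.
            if all(len(X) > len(motifs[c][1]) for c in close):
                motifs = [mo for c, mo in enumerate(motifs) if c not in close]
                motifs.append((i, X))
        elif len(motifs) < k or len(X) > len(motifs[-1][1]):
            motifs.append((i, X))
        motifs.sort(key=lambda mo: -len(mo[1]))   # decreasing number of instances
        del motifs[k:]                            # keep only the top k
    return motifs
```

With `S = [0, 0, 0, 0, 5, 9, 5] * 3 + [0, 0, 0, 0]`, `k = 2`, `m = 3`, `R = 0.5` and `w = 3`, the sketch returns the flat window at position 0 with six non-trivial matches as the 1-motif and the window at position 2 with two matches as the 2-motif. Replacing the linear scan by an index lookup changes only how X is obtained, not the bookkeeping of the motif list.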

Figure 5 describes the function DIS_EARLY_ABAN(x, y, BestSoFar), which implements the idea of early abandoning.

To reduce the computational complexity, we can enhance the above algorithm by discovering motifs in time series which have been transformed by some dimensionality reduction method such as piecewise aggregate approximation (PAA), discrete Fourier transform (DFT), or discrete wavelet transform (DWT).

// Find the non-trivial nearest neighbors of subsequence s_i within threshold R using R*-tree
NEAREST_NEIGHBORS_R(s_i, R*-tree, R)

// Insert the subsequence j into the R*-tree using MBR_j
ADD(MBR_j, R*-tree)
    Insert the new entry into the suitable leaf node of the subtree
    If the leaf node overflows
        - Split this node into two nodes such that the sum of the areas of the two MBRs of the two split nodes is smallest
        - The process of node splitting might be propagated upwards if the parent node also overflows due to the splitting

Fig. 4 Auxiliary procedures for the algorithm that discovers top k-motifs with the support of R∗-tree

// The function for computing Euclidean distance with early abandoning
DIS_EARLY_ABAN(x, y, BestSoFar)
    sum = 0; Bsf = BestSoFar * BestSoFar
    for (i = 0; i < x.length and sum <= Bsf; i++)
        sum = sum + (x_i - y_i) * (x_i - y_i)
    return square_root(sum)

Fig. 5 The function for computing Euclidean distance with early abandoning

One limitation of the above algorithm for discovering k-motifs based on the R∗-tree and early abandoning is that the R∗-tree works well only if the number of dimensions is below 20. When the dimensionality becomes higher than 20, the R∗-tree degenerates and performs worse than a search without the index structure. Due to this limitation, we devise another algorithm for discovering k-motifs which is based on a dimensionality reduction method and a more efficient multidimensional index, the Skyline index [16].

4 Discovering time series motifs based on MP_C method and Skyline index

The core idea of this algorithm for discovering time series k-motifs is to use the MP_C dimensionality reduction method and the state-of-the-art Skyline index in k-nearest neighbor search or range search. We select the Skyline index since this paradigm for indexing time series data performs better than traditional multidimensional index structures, especially for time series data with high dimensionality. Experimental studies in [16] reveal that the Skyline index based on skyline-bounding regions results in a more efficient index than the R∗-tree based on MBRs.

The Skyline index adopts skyline-bounding regions (SBRs) to approximate and represent a group of time series according to their collective shape. An SBR is defined in the same time-value space where the time series data are defined. SBRs allow us to define a distance function that tightly lower-bounds the distance between a query and a group of time series data. SBRs are free of internal overlaps. Hence, using the same amount of space in an index node, an SBR defines a better bounding region.

4.1 MP_C representation

The MP_C dimensionality reduction method used in this work was proposed in our previous work [23]. MP_C (Middle Points and Clipping) is carried out as follows. Given a time series C of length n, C can be seen as a segment and is divided into sub-segments. Some middle points in each sub-segment are chosen. To reduce space consumption, the chosen points are transformed into a sequence of bits, where 1 represents above the segment average and 0 represents below, i.e., if µ is the mean of segment C and $c_t$ is one of the chosen points, then

$$c_t = \begin{cases} 1 & \text{if } c_t > \mu \\ 0 & \text{otherwise} \end{cases}$$

The mean of the segment and the bit sequence are recorded as segment features. For simplicity and for the ability to record the approximate shape of the sequence, in our method we use the following simple algorithm:

– Dividing each segment into sub-segments

– Choosing the middle point of each sub-segment

Figure 6 shows the intuition behind this technique when the number of sub-segments is 6 and the number of middle points selected in each sub-segment is one. In this case, the bit sequence 010111 and the value µ are recorded.

This brings out a clipped representation of middle points, which is called MP_C. Hence, it has all the advantages of the bit-level representation proposed by [21], while it still allows the user to choose the compression ratio by determining the number of middle points chosen to retain the approximate shape of the original time series.
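The two steps of the simple algorithm above can be sketched as follows. The function name `mp_c` and its parameters are hypothetical, the series length is assumed to be divisible by the number of segments and each segment length by the number of sub-segments, and for a sub-segment of even length the "middle point" is taken at offset `sub_len // 2`, which is one possible convention.

```python
def mp_c(series, num_segments, subs_per_segment):
    """Middle Points and Clipping features of a series.

    Returns (means, bits): one mean per segment, and one bit per sub-segment
    that is 1 when the middle point lies above its segment mean, else 0.
    """
    w = len(series) // num_segments        # segment length
    sub_len = w // subs_per_segment        # sub-segment length
    means, bits = [], []
    for k in range(num_segments):
        seg = series[k * w:(k + 1) * w]
        mu = sum(seg) / w
        means.append(mu)
        for s in range(subs_per_segment):
            mid = s * sub_len + sub_len // 2   # assumed middle-point convention
            bits.append(1 if seg[mid] > mu else 0)
    return means, bits
```

For the series `[1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]` with two segments and three sub-segments per segment, both segment means are 3.5 and the recorded bit sequence is `011100`: the rising first half ends above its mean and the falling second half starts above it.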

We need to define the distance function $D_{MP\_C}(Q, C')$ of the query Q from the MP_C representation C′ of a time series C such that it satisfies the lower-bound condition $D_{MP\_C}(Q, C') \le D(Q, C)$.

Definition 7 (MP_C Similarity Measure) Given a query Q and a time series C (of length n) in raw data. Both C and Q are divided into N segments ($N \ll n$). Suppose each segment has length w. Let C′ be the MP_C representation of C. The distance measure between Q and C′ in MP_C space, $D_{MP\_C}(Q, C')$, is computed as follows.


$D_1(Q, C')$ and $D_2(Q, C')$ are defined in terms of the following quantities: $q\mu_i$ is the mean value of the ith segment in Q, $c\mu_i$ is the mean value of the ith segment in C, and $bc_i$ is the binary representation of $c_i$. $d(q'_i, bc_i)$ is computed by the following formula:

$$d(q'_i, bc_i) = \begin{cases} q'_i & \text{if } (q'_i > 0 \text{ and } bc_i = 0) \text{ or } (q'_i \le 0 \text{ and } bc_i = 1) \\ 0 & \text{otherwise} \end{cases}$$

$q'_i$ is defined as $q'_i = q_i - q\mu_k$, where $q_i$ belongs to the kth segment in Q.
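The following sketch shows one way the quantities named in Definition 7 can be combined into a lower bound. The per-point term `d_point` follows the formula for $d(q'_i, bc_i)$ given above. The aggregation is an assumption made for this illustration and is not taken from the text: $D_1$ is assumed to be $\sum_i w\,(q\mu_i - c\mu_i)^2$ over the segments, $D_2$ is assumed to be the sum of $d(q'_i, bc_i)^2$ over the chosen middle points, and the distance is assumed to be $\sqrt{D_1 + D_2}$. Under these assumptions the value never exceeds the Euclidean distance, because the mean offsets and the mean-removed residuals of a segment are orthogonal. The function names and the middle-point convention are hypothetical as well.

```python
import math

def mp_c(series, num_segments, subs_per_segment):
    """Segment means and clipped middle-point bits (hypothetical helper)."""
    w = len(series) // num_segments
    sub_len = w // subs_per_segment
    means, bits = [], []
    for k in range(num_segments):
        seg = series[k * w:(k + 1) * w]
        mu = sum(seg) / w
        means.append(mu)
        for s in range(subs_per_segment):
            bits.append(1 if seg[s * sub_len + sub_len // 2] > mu else 0)
    return means, bits

def d_point(q_prime, bit):
    """d(q'_i, bc_i) as given in Definition 7."""
    if (q_prime > 0 and bit == 0) or (q_prime <= 0 and bit == 1):
        return q_prime
    return 0.0

def d_mp_c(query, means_c, bits_c, subs_per_segment):
    """Assumed form sqrt(D1 + D2); see the lead-in for what is assumed."""
    num_segments = len(means_c)
    w = len(query) // num_segments
    sub_len = w // subs_per_segment
    d1 = d2 = 0.0
    b = 0                                   # index into the bit sequence
    for k in range(num_segments):
        seg = query[k * w:(k + 1) * w]
        q_mu = sum(seg) / w
        d1 += w * (q_mu - means_c[k]) ** 2  # assumed D1 term
        for s in range(subs_per_segment):
            q_prime = seg[s * sub_len + sub_len // 2] - q_mu
            d2 += d_point(q_prime, bits_c[b]) ** 2   # assumed D2 term
            b += 1
    return math.sqrt(d1 + d2)
```

The sketch needs only the stored features of C (its segment means and bits), never the raw points of C, which is what makes the measure usable on the reduced representation.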

The proof that $D_{MP\_C}(Q, C')$ conforms to the lower-bounding condition (that is, $D_{MP\_C}(Q, C') \le D_{true}(Q, C)$) is given in our previous work [23]. The lower-bounding condition, an important result given by [7], aims to guarantee that a dimensionality reduction method for time series yields no false dismissals. In other words, we can guarantee the correctness of a time series dimensionality reduction method if it satisfies the lower-bounding condition. Our MP_C dimensionality reduction method not only satisfies the lower-bounding condition but is also an indexable method, as shown in the next section.

4.2 Skyline index for MP_C

In this subsection, we describe how we can adopt the Skyline index for time series compressed by the MP_C method. First, we introduce the concept of the MP_C Bounding Region (MP_C_BR). Then, we describe the lower-bounding distance function for MP_C_BRs and the use of MP_C_BRs for indexing and searching time series data.

4.2.1 MP_C bounding region

In traditional multidimensional index structures such as the R∗-tree [1], minimum bounding rectangles (MBRs) are used to group time series data which are mapped into points in a low-dimensional feature space. If an MBR is defined in the two-dimensional space in which a time series exists, the overlap between MBRs will be large. Overlapping rectangles could have a negative effect on the search performance. So by using the ideas from the Skyline index [16], we can represent more accurately the collective shape of a group of time series data with tighter bounding regions. To attain this aim, we use MP_C bounding regions (MP_C_BRs) for bounding a group of time series data.

Definition 8 (MP_C Bounding Region) Given a group C consisting of k MP_C sequences in N-dimensional feature space, the MP_C_BR R of C is defined as a two-dimensional region surrounded by the top and bottom skylines:

\[
R = (C_{\max}, C_{\min})
\]

where

\[
C_{\max} = \{c_{1\max}, c_{2\max}, \ldots, c_{N\max}\}, \qquad
C_{\min} = \{c_{1\min}, c_{2\min}, \ldots, c_{N\min}\}
\]

and, for 1 ≤ i ≤ N,

\[
c_{i\max} = \max\{c_{i1}, \ldots, c_{ik}\}, \qquad
c_{i\min} = \min\{c_{i1}, \ldots, c_{ik}\}
\]

where c_ij is the ith mean value of the jth MP_C sequence in the group C.

Fig. 7 An illustration of MP_C_BR. a Two time series C1, C2 and their approximate MP_C representations in four-dimensional space. b The MP_C_BR of two MP_C sequences C1 and C2: C_max = {c11, c21, c32, c42} and C_min = {c12, c22, c31, c41}

Figure 7 illustrates an example of MP_C_BR. In this example, BC_i is the bit sequence of time series C_i, and the number of middle points selected in each sub-segment is one. Based on the MP_C_BRs, we can build a Skyline index by simply inserting the MP_C sequences into an R∗-tree-like structure.
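As a minimal sketch of Definition 8, the top and bottom skylines can be computed coordinate by coordinate from the mean values of the grouped sequences. The function name and the numeric values are illustrative assumptions; the values are chosen only so that the result has the same pattern as the Fig. 7 example (C_max = {c11, c21, c32, c42}, C_min = {c12, c22, c31, c41}).

```python
def mpc_bounding_region(mean_sequences):
    """Return (c_max, c_min) for a group of equal-length sequences of segment means."""
    if not mean_sequences:
        raise ValueError("the group must contain at least one sequence")
    n = len(mean_sequences[0])
    if any(len(seq) != n for seq in mean_sequences):
        raise ValueError("all sequences must have the same number of segments")
    # zip(*...) iterates over the i-th mean value of every sequence in the group.
    c_max = [max(column) for column in zip(*mean_sequences)]
    c_min = [min(column) for column in zip(*mean_sequences)]
    return c_max, c_min


# Two hypothetical MP_C sequences in four-dimensional space.
c1 = [5.0, 4.0, 1.0, 2.0]
c2 = [3.0, 2.0, 4.0, 6.0]
c_max, c_min = mpc_bounding_region([c1, c2])
# c_max == [5.0, 4.0, 4.0, 6.0], c_min == [3.0, 2.0, 1.0, 2.0]
```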

Once the Skyline index for MP_C has been built, we have to define the group distance function D_region(Q, R) of the query Q from the MP_C_BR R associated with a node in the index structure, such that it satisfies the group lower-bound condition D_region(Q, R) ≤ D(Q, C) for any time series C in the MP_C_BR R.

Definition 9 (MP_C_BR Distance Function) Let Q' be the MP_C representation of the query Q in N-dimensional space. The distance function D_region(Q', R) of the query Q from the MP_C_BR R is defined as follows.

Here, w is the length of each segment and qμ_i is the mean value of the ith segment in the query Q; c_imin (c_imax) is the minimum (maximum) value of the ith segment of the group C of MP_C sequences in the MP_C_BR R.

The proof that D_region(Q', R) conforms to the group lower-bound condition (Lemma 1) is given in our previous work [23].
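The sketch below is not the D_region of Definition 9. It illustrates the group lower-bound idea with a simpler bound of the same kind, which uses only the segment means and ignores the clipped bits. For each segment it measures how far qμ_i lies outside the interval [c_imin, c_imax]. That gap never exceeds |qμ_i − cμ_i| for any sequence in the group, and w times the sum of the squared mean differences never exceeds the squared Euclidean distance, so the result lower-bounds the true distance to every member. All function names are hypothetical.

```python
import math


def segment_means(seq, w):
    """Mean value of each consecutive segment of length w."""
    if len(seq) % w != 0:
        raise ValueError("sequence length must be a multiple of w")
    return [sum(seq[i:i + w]) / w for i in range(0, len(seq), w)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_distance(q_means, c_max, c_min, w):
    """Means-only lower bound on the distance from a query to any member of a region."""
    total = 0.0
    for q, hi, lo in zip(q_means, c_max, c_min):
        if q > hi:
            gap = q - hi      # query mean lies above the top skyline
        elif q < lo:
            gap = lo - q      # query mean lies below the bottom skyline
        else:
            gap = 0.0         # query mean lies inside the region
        total += gap * gap
    return math.sqrt(w * total)
```

A node whose region distance already exceeds the search range cannot contain a match, so its whole subtree can be skipped.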

We can index the MP_C representation of time series data by first building a Skyline index which is based on a spatial index structure such as the R∗-tree [1]. Each leaf node in the R∗-tree contains an MP_C sequence and a pointer referring to the original time series data in the database. The MP_C_BR associated with a non-leaf node is the smallest bounding region that spatially contains the MP_C_BRs associated with its immediate child nodes.


4.2.2 Subsequence matching algorithm

The algorithm we use for subsequence matching with the MP_C method and the Skyline index consists of three main steps: index building, index searching, and post-processing.

For simplicity, we assume that the query sequence Q has the same length w as the sliding window. The inputs of the algorithm are the time series C, the query sequence Q, and the threshold R. The output is the set of all the subsequences in C that are in R-match with Q. The algorithm is outlined as follows:

S1 [Index building] Use a sliding window of size w to divide the time series C into subsequences of length w and apply the MP_C transformation to each such subsequence. Store the features transformed from all such subsequences in the Skyline index.

S2 [Index searching] Apply the MP_C transformation to the query sequence Q. Search the index to find the candidate set of the subsequences of C which are in R-match with Q.

S3 [Post-processing] Examine the original subsequences of the time series C which correspond to the candidate set obtained at step S2 to discard the false alarms.
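The three steps can be sketched as a filter-and-refine loop. The sketch makes several simplifying assumptions that are not part of the algorithm above: a plain list stands in for the Skyline index, the reduced representation keeps only segment means (the clipped bits of MP_C are omitted), and "R-match" is taken to mean Euclidean distance at most R. The names range_search and seg_len are hypothetical.

```python
import math


def segment_means(seq, seg_len):
    """Mean value of each consecutive segment of length seg_len."""
    if len(seq) % seg_len != 0:
        raise ValueError("sequence length must be a multiple of seg_len")
    return [sum(seq[i:i + seg_len]) / seg_len
            for i in range(0, len(seq), seg_len)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_search(series, query, r, seg_len):
    """Return start offsets of subsequences of `series` within distance r of `query`."""
    w = len(query)
    # S1: slide a window of size w and store the reduced features.
    index = [(start, segment_means(series[start:start + w], seg_len))
             for start in range(len(series) - w + 1)]
    # S2: keep every subsequence whose lower-bounding distance is within r.
    # The small tolerance guards against rounding causing a false dismissal.
    q_means = segment_means(query, seg_len)
    candidates = [start for start, means in index
                  if math.sqrt(seg_len) * euclidean(q_means, means) <= r + 1e-9]
    # S3: check the original subsequences to discard false alarms.
    return [start for start in candidates
            if euclidean(series[start:start + w], query) <= r]


series = [0, 0, 1, 1, 0, 0, 5, 5]
matches = range_search(series, [1, 1, 0, 0], 1.5, 2)   # [1, 2]
```

Because the filter in S2 uses a lower bound, it can admit false alarms but does not dismiss true matches, which is why S3 only has to verify the candidates.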

4.2.3 Node insertion algorithm

The algorithm which we use for inserting an MP_C sequence into the Skyline index is similar to the insert algorithm introduced in [8]. It includes four main steps.

S1 [Find a position for inserting an MP_C sequence] Descend the tree from the root node to find the best leaf node L for inserting the new entry.

S2 [Add the MP_C sequence to the leaf node] If L has enough space for another entry, insert the sequence. Otherwise, split the node L.

S3 [Propagate changes upward] Ascend from the leaf node L to the root node. Adjust MP_C_BRs and propagate node splits if necessary.

S4 [Grow the tree taller] If the root of the tree is split because of propagation, create a new root whose children are the two resulting nodes.

At each level of the tree, the process of finding a position for a new entry selects the node whose MP_C_BR needs the least enlargement to include this entry. If the new entry has a value which is outside the limits defined by the segment in the MP_C_BR, the value of that segment is updated so that the MP_C_BR can entirely contain the new entry. If a node needs to be split, its entries are redistributed as in Guttman's algorithm [8].
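The child-selection and region-update rules can be sketched as follows. The text does not state how enlargement is measured, so the sketch assumes it is the total amount by which the two skylines must move, summed over segments; this measure and all function names are assumptions made for illustration. Node splitting is not shown.

```python
def enlargement(c_max, c_min, entry):
    """Assumed measure: total skyline movement needed for the region to contain entry."""
    grow = 0.0
    for hi, lo, x in zip(c_max, c_min, entry):
        if x > hi:
            grow += x - hi
        elif x < lo:
            grow += lo - x
    return grow


def choose_child(children, entry):
    """children is a list of (c_max, c_min) regions; return the index of the best one."""
    return min(range(len(children)),
               key=lambda i: enlargement(children[i][0], children[i][1], entry))


def extend_region(c_max, c_min, entry):
    """Update each segment limit that the new entry falls outside of."""
    new_max = [max(hi, x) for hi, x in zip(c_max, entry)]
    new_min = [min(lo, x) for lo, x in zip(c_min, entry)]
    return new_max, new_min


# Illustrative regions and entry (not taken from the paper).
children = [([5.0, 4.0], [3.0, 2.0]), ([1.0, 1.0], [0.0, 0.0])]
entry = [2.0, 3.0]
best = choose_child(children, entry)                  # 0
updated = extend_region(*children[best], entry)       # ([5.0, 4.0], [2.0, 2.0])
```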

4.3 The proposed algorithm

Figure 8 presents our algorithm for finding approximate k-motifs with the support of the Skyline index. In this motif discovery algorithm, subsequences are first extracted from a longer time series through a sliding window and transformed into lower dimensionality by applying the MP_C method. Then, for each MP_C representation s'_i of the subsequence s_i, the algorithm finds all its non-trivial matches within a range R among the subsequences that had been inserted into the Skyline index.
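To make the range-based counting concrete, the following brute-force sketch finds the subsequence with the most non-trivial matches within range R. It is not the algorithm of Fig. 8: it uses no index and no dimensionality reduction, it compares every pair of subsequences, and it assumes that a match is non-trivial when the two windows do not overlap. The function name one_motif is hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_motif(series, w, r):
    """Return (start, matches) for the subsequence with the most non-trivial matches."""
    subs = [series[i:i + w] for i in range(len(series) - w + 1)]
    best_start, best_matches = None, []
    for i, s in enumerate(subs):
        # Assumption: windows that overlap (|i - j| < w) count as trivial matches.
        matches = [j for j, t in enumerate(subs)
                   if abs(i - j) >= w and euclidean(s, t) <= r]
        if len(matches) > len(best_matches):
            best_start, best_matches = i, matches
    return best_start, best_matches


series = [0, 1, 0, 5, 5, 0, 1, 0, 9, 0, 1, 0]
start, matches = one_motif(series, 3, 0.1)   # (0, [5, 9])
```

The inner scan over all subsequences is the nearest-neighbor step that an index-supported range query is meant to replace.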

In this algorithm, procedure NEAREST_NEIGHBORS_SKYLINE(s'_i, Skyline index, R) is invoked to search the non-trivial matches of the MP_C subsequence s'_i within range R. As for a non-leaf node, procedure NEAREST_NEIGHBORS_SKYLINE uses the group distance function D_region(s', R) between an MP_C subsequence s' and a Skyline-bounding

region MP_C_BR R in the index structure, defined by Definition 8 and satisfying the
