DOI 10.1007/s10115-014-0814-3
REGULAR PAPER
Discovery of time series k-motifs based
on multidimensional index
Nguyen Thanh Son · Duong Tuan Anh
Received: 21 January 2014 / Revised: 16 October 2014 / Accepted: 25 December 2014
© Springer-Verlag London 2015
Abstract Time series motifs are frequently occurring but previously unknown subsequences of a longer time series. Discovering time series motifs is a crucial task in time series data mining. In a time series motif discovery algorithm, finding the nearest neighbors of a subsequence is the basic operation. To make this basic operation efficient, we can make use of an advanced multidimensional index structure for time series data. In this paper, we propose two novel algorithms for discovering motifs in time series data: the first algorithm is based on the R∗-tree and the early abandoning technique, and the second algorithm makes use of a dimensionality reduction method and the state-of-the-art Skyline index. We demonstrate the effectiveness of our proposed algorithms by experimenting on real datasets from different areas. The experimental results reveal that our two proposed algorithms outperform the most popular method, random projection, in time efficiency while achieving the same accuracy.
Keywords Time series · k-Motifs · Motif discovery · Multidimensional index · R-tree · Skyline index
N. T. Son
Faculty of Information Technology, Ho Chi Minh University of Technical Education, Ho Chi Minh City, Vietnam
D. T. Anh (B)
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
e-mail: dtanh@cse.hcmut.edu.vn
1 Introduction
Many researchers have been studying the extraction of various characteristics from time series data. One of these challenges, the efficient discovery of ‘motifs’, has received much attention. Time series motifs are frequently occurring but previously unknown subsequences of a longer time series which are very similar to each other. This motif concept is generalized to the k-motifs problem, where the top k motifs are returned. Since its first formalization by Lin et al. [14], motif discovery has been used to solve problems in several application areas [3,6,9,10,17,19,22,28] and also as a preprocessing step in several higher-level data mining tasks such as time series clustering, time series classification, rule discovery, and summarization.
Among the dozen or so algorithms for finding motifs that have been proposed in the literature, most work on time series transformed by some dimensionality reduction or discretization method. The most popular algorithm for finding time series motifs is the random projection algorithm proposed by Chiu et al. [5]. This algorithm can find motifs in linear time and is robust to noise. However, it still has some drawbacks: first, if the distribution of the projections is not sufficiently wide, it becomes quadratic in time and space; second, random projection is based on locality-preserving hashing, which is effective only for a relatively small number of projected dimensions (10–20) [4]. Besides random projection, in 2003 and 2005, Tanaka et al. proposed two algorithms, MD and EMD, that apply the minimum description length principle to determine the optimal length of a time series motif during the process of motif discovery. Mueen et al. [18] proposed a tractable exact motif discovery algorithm, called the MK algorithm, which can work directly on the original time series. The MK algorithm improves on the exhaustive brute-force search by using several speedup techniques. Mueen et al. showed that while this exact algorithm is still quadratic in the worst case, it can be up to three orders of magnitude faster than the brute-force algorithm. Note that the two popular approaches, random projection [5] and MK [18], as well as some other approaches for finding time series motifs (e.g., [6,9,27]), do not employ any index structure, and their computational costs are still high.
In a time series motif discovery algorithm, finding the nearest neighbors of a subsequence is the basic operation. To make this basic operation efficient, we can make use of an advanced index structure for time series data. In our work, we introduce two novel algorithms for discovering approximate k-motifs in a long time series: the first is based on the R∗-tree and the early abandoning technique, and the second makes use of the MP_C dimensionality reduction method [24] and the state-of-the-art Skyline index [16]. Both approaches employ a multidimensional index structure to speed up the search for the nearest neighbors of a subsequence. Our proposed algorithms are disk efficient because they require only a single sequential disk scan to read the entire time series. Besides, these methods work directly on numerical time series data transformed by a dimensionality reduction method, without applying any discretization process.
We carried out several experiments on time series datasets from various areas to compare the two proposed algorithms with random projection. The experimental results show that both proposed algorithms outperform the random projection algorithm in terms of time efficiency while achieving the same accuracy.
The rest of the paper is organized as follows. In Sect. 2, we review related work and basic concepts of time series motifs. Section 3 introduces the motif discovery algorithm based on the R∗-tree and the early abandoning technique. Section 4 describes the motif discovery algorithm that makes use of the MP_C dimensionality reduction method and the Skyline index. Section 5 presents our experimental evaluation on real datasets. In Sect. 6, we include some conclusions and remarks on future work.
2 Background
2.1 Basic concepts
There have been several different definitions of time series motifs. For example, one could choose the nearest neighbor motif definition [18], which defines the motif of a time series database as the unordered pair of time series in the database that is the most similar among all possible pairs. However, this motif definition does not take into account the frequency of the subsequences. Therefore, it is not convenient to use this definition in practical applications of motifs.
In this work, we use the popular and basic definition of time series motifs formalized in [14]. In this subsection, we define the terms formally.
Definition 1 A time series is a sequence of real values of length n ordered in time, i.e., if T is a time series then T = (t_1, ..., t_n), where each t_i is a real number.
Time series can be very long. In data mining, subsections of the time series, which are called subsequences, are considered, so the definition of a subsequence is needed.
Definition 2 Given a time series T = (t_1, ..., t_n), a subsequence of length m of T is a sequence S = (t_i, ..., t_{i+m−1}) with 1 ≤ i ≤ n − m + 1.
In discovering motifs, we need to determine whether a given subsequence is similar to others. This match is defined as follows.
Definition 3 Given a threshold R, a positive real number, and a time series T, let C_i be a subsequence of T beginning at position i and C_j a subsequence of T beginning at position j. If Distance(C_i, C_j) ≤ R, then C_j is called a matching subsequence of C_i.
Obviously, the best matches to a subsequence C tend to be the subsequences that begin just one or two points to the left or the right of C. These are called trivial matches. The definition of trivial matches is given as follows.
Definition 4 Given a time series T, a subsequence C_i of T beginning at position i, and a matching subsequence C_j of T beginning at position j, C_j is called a trivial match to C_i if either i = j or there does not exist a subsequence C_k beginning at position k such that Distance(C_i, C_k) > R and either i < k < j or j < k < i.
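Definitions 3 and 4 can be illustrated with a minimal Python sketch. It assumes the Euclidean distance as Distance and uses 0-based start positions instead of the 1-based positions of the definitions; the function names are hypothetical and chosen only for this illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_match(T, i, j, m, R):
    """Definition 3: the subsequences of length m starting at i and j match."""
    return euclidean(T[i:i + m], T[j:j + m]) <= R

def is_trivial_match(T, i, j, m, R):
    """Definition 4: C_j is a trivial match to C_i if i == j, or if no
    subsequence starting strictly between i and j is farther than R from C_i."""
    if not is_match(T, i, j, m, R):
        return False  # trivial matches are defined only for matching subsequences
    if i == j:
        return True
    lo, hi = min(i, j), max(i, j)
    return all(is_match(T, i, k, m, R) for k in range(lo + 1, hi))
```

On a slowly changing series, neighboring windows are trivial matches of each other, whereas two identical patterns separated by a dissimilar stretch form a non-trivial match.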
The kth most significant motif in a time series can be defined as follows.
Definition 5 Given a time series T, a subsequence length n, and a threshold R, the most significant motif in T (called the 1-motif) is the subsequence C_1 that has the highest count of non-trivial matches. The kth most significant motif in T (called the k-motif) is the subsequence C_k that has the highest count of non-trivial matches and satisfies Distance(C_i, C_k) > 2R for all 1 ≤ i < k.
Note that in Definition 5, we force the sets of subsequences in different motifs to be mutually exclusive. This is important because otherwise two motifs could share the same objects. The set of subsequences in each motif is called the instances of that motif.
Lin et al. [14] also introduced the brute-force algorithm to find the 1-motif (see Fig. 1). This brute-force algorithm works directly on the raw time series and requires two user-defined parameters: the threshold R and the subsequence length n. In the brute-force algorithm, the basic operation in the inner loop is finding the non-trivial matches for the subsequence in question.
Algorithm Find-1-Motif-Brute-Force(T, n, R)
best_motif_count_so_far = 0;
best_motif_location_so_far = null;
for i = 1 to length(T) – n + 1 {
    count = 0; pointers = null;
    for j = 1 to length(T) – n + 1
        if Non_Trivial_Match(C[i: i + n – 1], C[j: j + n – 1], R) {
            count = count + 1;
            pointers = append(pointers, j);
        }
    if count > best_motif_count_so_far {
        best_motif_count_so_far = count;
        best_motif_location_so_far = i;
        motif_matches = pointers;
    }
}
Fig. 1 The outline of the brute-force algorithm for 1-motif discovery in time series
series, is the subsequence that has most non-trivial subsequence matches
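The brute-force outline of Fig. 1 can be sketched in runnable Python as follows. This is only an illustration under stated assumptions: positions are 0-based, Distance is the Euclidean distance, and Non_Trivial_Match is approximated by requiring a gap of at least w positions between the two start positions (w defaults to n, which is our assumption and not part of Fig. 1).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_1_motif_brute_force(T, n, R, w=None):
    """Return (start of the 1-motif, starts of its non-trivial matches)."""
    if w is None:
        w = n  # assumption: windows closer than n positions count as trivial
    best_count, best_location, motif_matches = 0, None, []
    last = len(T) - n + 1              # number of sliding windows
    for i in range(last):
        pointers = []
        for j in range(last):
            # stand-in for Non_Trivial_Match(C[i..], C[j..], R)
            if abs(i - j) >= w and euclidean(T[i:i + n], T[j:j + n]) <= R:
                pointers.append(j)
        if len(pointers) > best_count:
            best_count, best_location, motif_matches = len(pointers), i, pointers
    return best_location, motif_matches
```

For example, in the series [0, 1, 0, 9, 9, 9, 0, 1, 0, 7, 3, 8, 0, 1, 0] with n = 3 and R = 0.5, the pattern (0, 1, 0) starting at position 0 has two non-trivial matches, at positions 6 and 12. Both loops range over all windows, which makes the quadratic cost of the brute-force approach explicit.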
Chiu et al. [5] proposed the random projection algorithm for discovering time series motifs. This work is based on research on pattern discovery from the bioinformatics community [2]. The random projection algorithm uses the SAX discretization method [15] to represent time series subsequences, together with a collision matrix. In each iteration, the algorithm randomly selects some positions in each SAX representation to act as a mask and traverses the list of SAX representations. If the two SAX representations corresponding to subsequences i and j match, cell (i, j) in the collision matrix is incremented. After the process is repeated an appropriate number of times, the largest entries in the collision matrix are selected as candidate motifs. Finally, the original data corresponding to each candidate motif are checked to verify the result. The complexity of this algorithm is linear in the SAX word length, the number of subsequences, the number of iterations, and the number of collisions. This algorithm finds all the motifs with high probability after an appropriate number of iterations, even in the presence of noise. However, its complexity becomes quadratic if the distribution of the projections is not wide enough, i.e., if a large number of subsequences have the same projection.
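The collision-counting step described above can be sketched as follows. The sketch assumes that the SAX words have already been computed and are passed in as equal-length strings; the function name, the sparse dictionary standing in for the collision matrix, and the parameters mask_size, iterations, and seed are hypothetical choices for this illustration and are not taken from [5].

```python
import random
from collections import defaultdict
from itertools import combinations

def collision_counts(sax_words, mask_size, iterations, seed=0):
    """Count how often each pair of SAX words agrees outside a random mask."""
    rng = random.Random(seed)
    word_len = len(sax_words[0])
    counts = defaultdict(int)          # sparse collision matrix: (i, j) -> count
    for _ in range(iterations):
        hidden = set(rng.sample(range(word_len), mask_size))  # masked positions
        buckets = defaultdict(list)
        for idx, word in enumerate(sax_words):
            key = "".join(ch for p, ch in enumerate(word) if p not in hidden)
            buckets[key].append(idx)
        # every pair of words falling into the same bucket collides once
        for members in buckets.values():
            for i, j in combinations(members, 2):
                counts[(i, j)] += 1
    return counts
```

The largest entries of the returned dictionary play the role of the candidate motifs, which would still have to be verified on the raw data. The pair enumeration inside each bucket also shows why the cost degrades toward quadratic when many subsequences share the same projection.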
Ferreira et al. [6] proposed another approach for discovering approximate motifs in time series. First, this algorithm transforms subsequences from time series of proteins into the SAX representation; it then finds clusters of subsequences and expands the length of each retrieved motif until the similarity drops below a user-defined threshold. It can be used to discover motifs in multivariate time series or motifs of different sizes. Its complexity is quadratic, and the whole dataset must be loaded into main memory.
Yankov et al. [29] introduced an algorithm to deal with uniform scaling of time series. This approach uses an improved random projection to discover motifs under uniform scaling. The concept of a time series motif is redefined in terms of nearest neighbors: the subsequence motif is the pair of subsequences of a long time series that are nearest to each other. The only parameter that needs to be defined by the user is the motif length (besides SAX's parameters). This approach has the same drawbacks as the random projection algorithm, and its overhead increases because of the need to find the best scaling factors.
Tanaka and Uehara [25] proposed the motif discovery (MD) algorithm, which can find motifs in multidimensional time series data. First, the MD algorithm transforms the multidimensional time series data into one-dimensional data by using principal component analysis (PCA). Then, it transforms the data into a sequence of symbols. Finally, it discovers the motif by calculating the description length of a pattern based on the minimum description length (MDL) principle. This means that the suitable length of the motif is determined automatically by the MD algorithm. The MD algorithm is useful and effective under the assumption that all instances of the motif have exactly the same length. However, in the real world, the lengths of the instances of a motif differ slightly from each other. To overcome this limitation, in 2005, Tanaka et al. proposed an extended variant of MD, called the EMD (Extended Motif Discovery) algorithm, which includes the following two modifications. First, EMD transforms the symbol sequence that represents the behavior of a given time series into a form from which motif instances of different lengths can be extracted. Second, it uses a new definition of the description length of a time series to process not only motif instances of the same length but also motif instances of different lengths. Since in the EMD algorithm the lengths of the instances of a motif can differ slightly from each other, Tanaka et al. suggested that the dynamic time warping (DTW) distance should be used to calculate the distances between the motif instances. Because of this suggestion, EMD becomes a complicated algorithm with high computational complexity that is not easy to implement in practice.
The first clustering-based method for time series motif discovery is the one proposed by Gruber et al. [9]. This method employs the concept of significant extreme points proposed by Pratt and Fink [20]. The algorithm proposed by Gruber et al. for finding time series motifs consists of three steps: extracting significant extreme points, determining motif candidates from the extracted significant extreme points, and clustering the motif candidates. After the clustering step, the cluster with the largest number of instances is the 1-motif of the time series. When Gruber et al. proposed this method, they applied it to signature verification and did not compare it with any previous time series motif discovery algorithm.
Based on the random projection algorithm, Tang and Liao [27] introduced a method that can discover time series motifs of different lengths. The main idea of this method is that it first uses random projection to discover short motifs and then applies a technique to concatenate these motifs into longer ones.
Under the new nearest neighbor motif definition, Mueen et al. [18] proposed a tractable exact motif discovery algorithm, called the MK algorithm, which can work directly on the original time series. The MK algorithm improves on the brute-force algorithm by using several speedup techniques. It is based on the idea of abandoning the Euclidean distance calculation early when the current cumulative sum exceeds the best-so-far distance. The motif search is guided by heuristic information from the linear ordering of the distances of the objects to a few random reference points. Mueen et al. showed that while this exact algorithm is still quadratic in the worst case, it can be up to three orders of magnitude faster than the brute-force algorithm. However, the nearest neighbor definition adopted by MK is not convenient to use in practice, and using the Euclidean distance directly on the raw data can cause robustness problems when dealing with noisy data.
From the previous algorithms for time series motif discovery, we can identify some typical approaches for tackling this problem: (i) the approach based on locality-preserving hashing, such as [6,27,29]; (ii) the MDL-based approach that can automatically determine the optimal length of the 1-motif, such as MD [25] and EMD [26]; (iii) the approach based on segmentation and clustering, such as [9]; and (iv) the approach based on the brute-force method with some speedup techniques, such as the MK algorithm [18].
3 Discovering time series motifs based on R∗-tree and early abandoning
In this section, we present our first novel algorithm for time series motif discovery. The basic intuition behind this algorithm is that a multidimensional index, such as the R∗-tree [1], can help in efficiently retrieving the nearest neighbors of a subsequence, and that the idea of early abandoning introduced in [18] can be used to reduce the cost of Euclidean distance calculation.
In a multidimensional index structure such as the R∗-tree, each node is associated with a minimum bounding rectangle (MBR). If v is an internal node, the MBRs of all entries of its immediate child nodes are covered by its MBR. The MBRs of nodes at the same level might overlap. If v is a leaf node, then its MBR is the minimum bounding rectangle of all the entries contained in v. Each entry in a leaf node contains its MBR and a pointer to the data object represented by this entry.
In the proposed algorithm for motif discovery, we create a minimum bounding rectangle in the m-dimensional space (m ≪ n) for each subsequence extracted from a longer time series through a sliding window. Then, each subsequence is inserted into the R∗-tree based on its MBR. To find the matching neighbors of a subsequence s by searching the R∗-tree, we need a distance function D_region(s, R) between the subsequence s and the MBR R associated with a node in the index structure such that D_region(s, R) ≤ D(s, C) for any subsequence C contained in the MBR R.
Before introducing the definition of D_region(s, R), we describe how to define the minimum bounding rectangle for a group of time series in our proposed motif discovery algorithm.
Notice that a time series of length n can be viewed as a point in n-dimensional space. Assume that we have built an index structure for a time series database by inserting a group of l time series objects of length n, C = {c_1, c_2, ..., c_l}, into the MBR-based multidimensional index structure, and that we approximate each time series of length n by m equal-sized constant-value segments (m ≪ n). Let U be a leaf node in the index structure and R = (R_1, R_2, ..., R_m) be the MBR associated with U, where R_j = {L_j, H_j} = {(x_jmin, y_jmin), (x_jmax, y_jmax)}. R_j is the minimum bounding rectangle (in the time-value space) containing the jth segments of all the time series indexed under the node U, and L_j and H_j are the lower-left and upper-right corners of R_j, respectively. The MBR associated with a non-leaf node is the smallest rectangle that contains all the MBRs of its immediate child nodes [1]. Here, we can view each MBR as two sequences, the lower-bound sequence L = {L_1, ..., L_m} and the upper-bound sequence H = {H_1, ..., H_m}, of all time series stored at the node U.
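As an illustration of the lower-bound and upper-bound sequences, the following Python sketch computes the value bounds y_jmin and y_jmax of each rectangle R_j for a group of time series stored in one leaf. It assumes that R_j bounds the raw data points of the jth segments (rather than their segment averages) and that n is divisible by m; the function name is hypothetical.

```python
def segment_bounds(group, m):
    """Return (lower, upper): per-segment value bounds of a group of series.

    group: list of equal-length time series; m: number of segments.
    lower[j] and upper[j] correspond to y_jmin and y_jmax of rectangle R_j.
    """
    n = len(group[0])
    if n % m != 0:
        raise ValueError("series length must be divisible by m")
    N = n // m                          # points per segment
    lower, upper = [], []
    for j in range(m):
        # all points of the j-th segment, taken over every series in the group
        values = [c[t] for c in group for t in range(j * N, (j + 1) * N)]
        lower.append(min(values))
        upper.append(max(values))
    return lower, upper
```

For the group [[1, 2, 3, 4], [0, 5, 2, 2]] with m = 2, the first rectangle covers the values 0 to 5 and the second covers 2 to 4.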
In order to calculate the distance D_region(s, R) between a time series s and the bounding region R, we accumulate the distances from all data points in the sequence s to R. That is, we compute the distance d(s_ji, R_j) from each data point s_ji in segment j (1 ≤ j ≤ m) of the time series s to the corresponding jth bounding rectangle R_j of the MBR R. The distance d(s_ji, R_j) depends on whether s_ji lies above, inside, or below R_j.
Fig. 2 An example of how to calculate D_region(s, R)
Definition 6 (Group distance function) Given a subsequence s of length n, a group C of subsequences of length n, and a corresponding MBR R for C in the m-dimensional space (m ≪ n), i.e., R = (R_1, R_2, ..., R_m), where R_j = {(x_jmin, y_jmin), (x_jmax, y_jmax)} is the pair of endpoints (lower and upper) of the major diagonal of R_j.
The distance function D_region(s, R) of the subsequence s from the MBR R is defined as
N is the length of segment j (N = n/m).
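The display formula of Definition 6 is not legible in this copy. The following LaTeX block is a reconstruction from the surrounding text and from the worked example that accompanies Fig. 2, so it should be read as an assumption and not as the authors' exact Formula (1). In particular, the worked example shows the plain sum of squared terms; whether an outer square root is applied cannot be read from this copy, although the lower bound against the Euclidean distance D(s, C) would require either a square root here or a squared distance on the other side.

```latex
\[
D_{region}(s, R) = \sum_{j=1}^{m} D_{region_j}(s_j, R_j),
\qquad
D_{region_j}(s_j, R_j) = \sum_{i=1}^{N} d(s_{ji}, R_j)
\]
\[
d(s_{ji}, R_j) =
\begin{cases}
(s_{ji} - y_{j\max})^2 & \text{if } s_{ji} > y_{j\max} \\
(s_{ji} - y_{j\min})^2 & \text{if } s_{ji} < y_{j\min} \\
0 & \text{otherwise}
\end{cases}
\]
```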
Figure 2 illustrates an example of how to calculate D_region(s, R). In this example, s is a subsequence consisting of 9 data points, s = {s_1, ..., s_9} = {s_11, s_12, s_13, s_21, s_22, s_23, s_31, s_32, s_33}, and each segment consists of three data points. So R is a sequence of three rectangles, R = (R_1, R_2, R_3). Therefore, we have:
D_region(s, R) = D_region1(s_1, R_1) + D_region2(s_2, R_2) + D_region3(s_3, R_3)
= (s_11 − y_1max)^2 + (s_21 − y_2min)^2 + (s_32 − y_3min)^2
The remaining terms are equal to zero since the corresponding points lie inside the region R.
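The calculation in this example can be mirrored by a short Python sketch. It assumes that the MBR is given as two lists of per-segment value bounds (y_jmin and y_jmax) and returns the accumulated sum of squared point-to-rectangle distances exactly as written in the example; taking the square root of the result makes it directly comparable with the Euclidean distance. The function names and the sample numbers are hypothetical.

```python
import math

def d_region_sq(s, lower, upper):
    """Accumulated squared distance from the points of s to the MBR.

    lower[j], upper[j]: value bounds (y_jmin, y_jmax) of rectangle R_j.
    Points lying inside their rectangle contribute zero.
    """
    m = len(lower)
    N = len(s) // m                     # points per segment
    total = 0.0
    for j in range(m):
        for x in s[j * N:(j + 1) * N]:
            if x > upper[j]:            # point above R_j
                total += (x - upper[j]) ** 2
            elif x < lower[j]:          # point below R_j
                total += (lower[j] - x) ** 2
    return total

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With lower = [0, 2, 2], upper = [5, 4, 6] and s = [7, 1, 2, 1, 3, 3, 5, 0, 4], only s_11, s_21 and s_32 fall outside their rectangles, as in Fig. 2, and the accumulated value is 4 + 1 + 4 = 9. For any series C that stays inside the rectangles, math.sqrt(d_region_sq(s, lower, upper)) does not exceed euclidean(s, C).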
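To ensure the correctness of using D_region(s, R) in searching for the k-nearest neighbors of a query based on a multidimensional index, this group distance must satisfy the group lower-bound property as follows.
Lemma 1 Given a subsequence s and the MBR R associated with a node U, D_region(s, R) ≤ D(s, C) for any subsequence C contained in R.
Proof According to the definition of the MBR associated with a node U in the index structure and the definition of the distance function D_region(s, R), for any subsequence C placed under a node U and the MBR R associated with U, we have
y_jmin ≤ c_ji ≤ y_jmax, ∀i = 1, ..., N, ∀j = 1, ..., m
Hence, no data point s_ji is farther from its own counterpart c_ji than from the rectangle R_j, and accumulating over all data points gives
D_region(s, R) ≤ D(s, C), ∀C in the MBR R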
Formula (1), which computes the distance D_region(s, R) of the subsequence s from the MBR R, can be applied in k-nearest neighbor search or range search for a given time series s with the support of the R∗-tree. This distance function is crucial for pruning dissimilar subtrees without loss of completeness and for ranking potentially relevant nodes in k-nearest neighbor search (or for discarding nodes exceeding the range threshold in range search).
3.1 Early abandoning technique
Since the complexity of computing the Euclidean distance between two time series of length n is O(n), we need to reduce this cost. In motif discovery, we have to compute the Euclidean distance whenever we need to find the nearest neighbors of a given time series. Therefore, we can apply the idea of early abandoning, which works as follows: while the Euclidean distance is being calculated for a pair of time series, if the cumulative sum of squared differences exceeds the square of the current best-so-far distance at some point, we can abandon the calculation, since this pair of time series cannot be a match.
3.2 The proposed algorithm
Figure 3 presents the algorithm for finding the k-motifs defined in Definition 5 with the support of the R∗-tree and the idea of early abandoning. In the algorithm, the procedure NEAREST_NEIGHBORS_R(s_i, R∗-tree, R) is used to find the non-trivial matches of the subsequence s_i within threshold R based on the R∗-tree index structure. NEAREST_NEIGHBORS_R makes use of D_region(s, R), the group distance between a subsequence s and an MBR R in the R∗-tree, given by Definition 6 and satisfying Lemma 1. The procedure returns the list X, which keeps the positions of all non-trivial nearest neighbors of the subsequence s_i found based on the group distance. Once the list X is obtained, each subsequence s_x corresponding to an element x in X is accessed, and the algorithm calls the function DIS_EARLY_ABAN(s_i, s_x, R) to compute the Euclidean distance between the two subsequences s_i and s_x.
Algorithm Discovering top k-motifs with the support of R*-tree and the idea of early abandoning
// S is a time series of length n, s_i is a subsequence of length m in S
// L is a list of k-motifs, C_k is the center of the k-motif
// X is the index list of non-trivial nearest neighbors of a subsequence s_i
// R is a threshold for matching
Procedure L = FINDING_TOP_k_MOTIF(S, k, m, R)
for i = 1 to n-m+1
{
    if (R*-tree != null) X = NEAREST_NEIGHBORS_R(s_i, R*-tree, R)
    for j = 1 to length(X) // length(X): the number of items in list X
    if (the number of elements in L < k)
        ... number of items in each element
    else if (length(X) > number of items in L_k)
    {
        ... order of the number of items in each element
Fig. 3 The algorithm for discovering top k-motifs with the support of R∗-tree
Notice that the function DIS_EARLY_ABAN applies the idea of early abandoning. If DIS_EARLY_ABAN(s_i, s_x, R) is greater than R, then x is removed from the list X since s_x does not qualify as a match of s_i. If the list X satisfies all the conditions given in Definition 5, X is inserted into the list of top k-motifs in such a way that the elements of this list remain in decreasing order of the number of entries in each element. The process is repeated until no more subsequences need to be examined.
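The overall flow can be sketched in Python under loud simplifications. A linear scan stands in for NEAREST_NEIGHBORS_R (no R∗-tree is built), plain Euclidean distance stands in for DIS_EARLY_ABAN, non-trivial matches are those at least w positions apart (w defaults to m, an assumption), and the top-k list is selected greedily after all candidates are collected instead of being maintained incrementally as in Fig. 3. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_top_k_motifs(S, k, m, R, w=None):
    """Return up to k (center, matches) pairs following Definition 5.

    Positions are 0-based start indices of subsequences of length m.
    """
    if w is None:
        w = m  # assumption: windows closer than m positions are trivial
    last = len(S) - m + 1
    subs = [S[i:i + m] for i in range(last)]
    candidates = []
    for i in range(last):
        # linear scan standing in for the index-based neighbor search
        X = [x for x in range(last)
             if abs(i - x) >= w and euclidean(subs[i], subs[x]) <= R]
        if X:
            candidates.append((i, X))
    # decreasing order of the number of matches (stable for ties)
    candidates.sort(key=lambda c: len(c[1]), reverse=True)
    motifs = []
    for center, X in candidates:
        if len(motifs) == k:
            break
        # Definition 5: a new center must be farther than 2R from earlier ones
        if all(euclidean(subs[center], subs[c]) > 2 * R for c, _ in motifs):
            motifs.append((center, X))
    return motifs
```

In the series [0, 1, 0, 9, 9, 9, 0, 1, 0, 5, 6, 5, 0, 1, 0, 3, 5, 6, 5] with m = 3 and R = 0.5, the sketch reports the pattern at position 0 (matches at 6 and 12) as the 1-motif and the pattern at position 9 (match at 16) as the 2-motif. Replacing the linear scan by an index lookup pruned with D_region is where the proposed algorithm gains its efficiency.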
Figure 4 describes the two auxiliary procedures in our proposed algorithm: NEAREST_NEIGHBORS_R(s_i, R∗-tree, R) and ADD(MBR_i, R∗-tree). In the procedure NEAREST_NEIGHBORS_R, the trivial matches are rejected by using the relative positions of the subsequences. Two subsequences are non-trivial matches of each other if there is a gap of at least w positions between them.
Figure 5 describes the function DIS_EARLY_ABAN(x, y, BestSoFar), which implements the idea of early abandoning.
To reduce the computational complexity, we can enhance the above algorithm by discovering motifs in time series which have been transformed by a dimensionality reduction method such as piecewise aggregate approximation (PAA), discrete Fourier transform (DFT), or discrete wavelet transform (DWT).
// Find the non-trivial nearest neighbors of subsequence s_i within threshold R using R*-tree
NEAREST_NEIGHBORS_R(s_i, R*-tree, R)
ADD(MBR_j, R*-tree) // insert the subsequence j into the R*-tree using MBR_j
    Insert the new entry into the suitable leaf node of the subtree
    If the leaf node overflows
        - Split this node into two nodes such that the sum of the areas of the two MBRs of the split nodes is smallest
        - The node splitting might be propagated upwards if the parent node also overflows due to the splitting
Fig. 4 Auxiliary procedures for the algorithm that discovers top k-motifs with the support of R∗-tree
// The function for computing Euclidean distance
DIS_EARLY_ABAN(x, y, BestSoFar)
    sum = 0; Bsf = BestSoFar * BestSoFar
    for (i = 0; i < x.length and sum ≤ Bsf; i++)
        sum = sum + (x_i - y_i) * (x_i - y_i)
    return square_root(sum)
Fig. 5 The function for computing Euclidean distance with early abandoning
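The function of Fig. 5 translates almost line by line into Python. The sketch below assumes, as the figure suggests, that an abandoned computation returns the square root of the partial sum, which is already greater than BestSoFar and therefore lets the caller reject the pair.

```python
import math

def dis_early_aban(x, y, best_so_far):
    """Euclidean distance between x and y with early abandoning.

    Stops accumulating as soon as the running sum of squared differences
    exceeds best_so_far squared; the returned value is then larger than
    best_so_far but is not the full distance.
    """
    bsf_sq = best_so_far * best_so_far
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) * (a - b)
        if total > bsf_sq:
            break                      # abandon: the pair cannot be a match
    return math.sqrt(total)
```

For example, dis_early_aban([0, 0, 0], [3, 4, 0], 10) computes the full distance 5.0, whereas dis_early_aban([0, 0, 0, 0], [3, 4, 100, 100], 4) stops after the second point and also returns 5.0, which is enough to see that the threshold 4 is exceeded.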
One limitation of the above algorithm for discovering k-motifs based on the R∗-tree and early abandoning is that the R∗-tree works well only if the number of dimensions is below 20. When the dimensionality becomes higher than 20, the R∗-tree degenerates and performs worse than a search without the index structure. Because of this limitation, we devise another algorithm for discovering k-motifs which is based on a dimensionality reduction method and a more efficient multidimensional index, the Skyline index [16].
4 Discovering time series motifs based on MP_C method and Skyline index
The core idea of this algorithm for discovering time series k-motifs is to use the MP_C dimensionality reduction method and the state-of-the-art Skyline index in k-nearest neighbor search or range search. We select the Skyline index since this paradigm for indexing time series data performs better than traditional multidimensional index structures, especially for time series data with high dimensionality. Experimental studies in [16] reveal that a Skyline index based on skyline-bounding regions is more efficient than an R∗-tree based on MBRs.
The Skyline index adopts skyline-bounding regions (SBRs) to approximate and represent a group of time series according to their collective shape. An SBR is defined in the same time-value space in which the time series data are defined. SBRs allow us to define a distance function that tightly lower-bounds the distance between a query and a group of time series data. SBRs are free of internal overlaps. Hence, using the same amount of space in an index node, an SBR defines a better bounding region.
4.1 MP_C representation
The MP_C dimensionality reduction method used in this work was proposed in our previous work [23]. MP_C (Middle Points and Clipping) is carried out as follows. Given a time series C of length n, C can be seen as a segment and is divided into sub-segments. Some middle points in each sub-segment are chosen. To reduce space consumption, the chosen points are transformed into a sequence of bits, where 1 represents above the segment average and 0 represents below; i.e., if µ is the mean of segment C and c_t is one of the chosen points, then
c_t = 1 if c_t > µ, and c_t = 0 otherwise.
The mean of the segment and the bit sequence are recorded as the segment features. For simplicity and for the ability to record the approximate shape of the sequence, in our method we use the following simple algorithm:
– Dividing each segment into sub-segments
– Choosing the middle point of each sub-segment
Figure 6 shows the intuition behind this technique when the number of sub-segments is 6 and one middle point is selected in each sub-segment. In this case, the bit sequence 010111 and the value of µ are recorded.
This yields a clipped representation of middle points, which is called MP_C. Hence, it has all the advantages of the bit-level representation proposed in [21], while still allowing the user to choose the compression ratio by determining the number of middle points chosen to retain the approximate shape of the original time series.
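The two steps listed above (dividing a segment into sub-segments and clipping the middle point of each) can be sketched in Python as follows. The sketch assumes one middle point per sub-segment, a segment length divisible by the number of sub-segments, and the index size // 2 as the middle of a sub-segment; these choices and the function name are ours and are not prescribed by [23].

```python
def mpc_segment(segment, n_sub):
    """Return (mean, bits) for one segment under the MP_C idea.

    bits[k] is 1 if the middle point of the k-th sub-segment lies above
    the segment mean, and 0 otherwise.
    """
    if len(segment) % n_sub != 0:
        raise ValueError("segment length must be divisible by n_sub")
    mu = sum(segment) / len(segment)
    size = len(segment) // n_sub       # points per sub-segment
    bits = []
    for k in range(n_sub):
        mid = k * size + size // 2     # assumed position of the middle point
        bits.append(1 if segment[mid] > mu else 0)
    return mu, bits
```

For the segment [1, 1, 1, 5, 5, 5, 2, 2, 2, 6, 6, 6] with four sub-segments, the mean is 3.5 and the recorded bit sequence is 0101, so twelve values are reduced to one real number and four bits.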
We need to define the distance function D_MP_C(Q, C′) of the query Q from the MP_C representation C′ of a time series C such that it satisfies the lower-bound condition D_MP_C(Q, C′) ≤ D(Q, C).
Definition 7 (MP_C Similarity Measure) Given a query Q and a time series C (of length n) in raw data, both C and Q are divided into N segments (N ≪ n). Suppose each segment has length w. Let C′ be the MP_C representation of C. The distance measure between Q and C′ in MP_C space, D_MP_C(Q, C′), is computed as follows.
D_1(Q, C′) and D_2(Q, C′) are defined as
where qµ_i is the mean value of the ith segment in Q, cµ_i is the mean value of the ith segment in C, and bc_i is the binary representation of c_i. d(q′_i, bc_i) is computed by the following formula:
d(q′_i, bc_i) = q′_i if (q′_i > 0 and bc_i = 0) or (q′_i ≤ 0 and bc_i = 1), and 0 otherwise.
q′_i is defined as q′_i = q_i − qµ_k, where q_i belongs to the kth segment in Q.
The proof that D_MP_C(Q, C′) satisfies the lower-bounding condition (that is, D_MP_C(Q, C′) ≤ D_true(Q, C)) is given in our previous work [23]. The lower-bounding condition, an important result given in [7], guarantees that a dimensionality reduction method for time series causes no false dismissals. In other words, we can guarantee the correctness of a time series dimensionality reduction method if it satisfies the lower-bounding condition. Our MP_C dimensionality reduction method not only satisfies the lower-bounding condition but is also indexable, as shown in the next section.
4.2 Skyline index for MP_C
In this subsection, we describe how we can adopt the Skyline index for time series compressed by the MP_C method. First, we introduce the concept of the MP_C bounding region (MP_C_BR). Then, we describe the lower-bounding distance function for MP_C_BRs and the use of MP_C_BRs for indexing and searching time series data.
4.2.1 MP_C bounding region
In a traditional multidimensional index structure such as the R∗-tree [1], minimum bounding rectangles (MBRs) are used to group time series data which are mapped into points in a low-dimensional feature space. If an MBR is defined in the two-dimensional space in which a time series exists, the overlap between MBRs will be large. Overlapping rectangles can have a negative effect on search performance. So, by using ideas from the Skyline index [16], we can represent the collective shape of a group of time series data more accurately with tighter bounding regions. To attain this aim, we use MP_C bounding regions (MP_C_BRs) for bounding a group of time series data.
Definition 8 (MP_C Bounding Region) Given a group $C$ consisting of $k$ MP_C sequences in $N$-dimensional feature space, the MP_C_BR $R$ of $C$ is defined as a two-dimensional region surrounded by the top and bottom skylines:

$$R = (C_{max}, C_{min})$$

where

$$C_{max} = \{c_{1max}, c_{2max}, \ldots, c_{Nmax}\}$$

$$C_{min} = \{c_{1min}, c_{2min}, \ldots, c_{Nmin}\}$$
Fig. 7 An illustration of MP_C_BR. a Two time series C1, C2 and their approximate MP_C representations in four-dimensional space. b The MP_C_BR of two MP_C sequences C1 and C2. $C_{max}$ = {c11, c21, c32, c42} and $C_{min}$ = {c12, c22, c31, c41}
and, for $1 \le i \le N$,

$$c_{imax} = \max\{c_{i1}, \ldots, c_{ik}\}$$

$$c_{imin} = \min\{c_{i1}, \ldots, c_{ik}\}$$

where $c_{ij}$ is the $i$th mean value of the $j$th MP_C sequence in the group $C$.
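As a small illustration of Definition 8, the two skylines can be computed segment by segment. The sketch below is hypothetical: the function name `mpc_bounding_region` is not from the paper, each MP_C sequence is reduced to its list of N segment means, and the bit sequences are left out.

```python
def mpc_bounding_region(group):
    """Return (c_max, c_min) for a group of MP_C sequences.

    Assumption: each sequence is a list of N segment means of equal length N.
    """
    if not group:
        raise ValueError("group must contain at least one sequence")
    # zip(*group) iterates over segments: column i holds c_i1, ..., c_ik.
    c_max = [max(column) for column in zip(*group)]
    c_min = [min(column) for column in zip(*group)]
    return c_max, c_min
```

With two sequences in four dimensions, the top skyline can take some of its values from the first sequence and the rest from the second, which is the situation the figure caption describes.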
Figure 7 illustrates an example of MP_C_BR. In this example, $BC_i$ is the bit sequence of time series $C_i$ and the number of middle points selected in each sub-segment is one.
Based on the MP_C_BRs, we can build a Skyline index by simply inserting the MP_C sequences into an R∗-tree-like structure.
Once the Skyline index for MP_C has been built, we have to define the group distance
function D region (Q, R) of the query Q from the MP_C_BR R associated with a node in the index structure such that it satisfies the group lower-bound condition D region (Q, R) ≤ D(Q, C) for any time series C in the MP_C_BR R.
Definition 9 (MP_C_BR Distance Function) Let $Q'$ be the MP_C representation of query $Q$ in $N$-dimensional space. The distance function $D_{region}(Q', R)$ of the query $Q$ from the MP_C_BR $R$ is defined in terms of the following quantities: $w$ is the length of each segment and $q\mu_i$ is the mean value of the $i$th segment in the query $Q$; $c_{imin}$ ($c_{imax}$) is the minimum (maximum) value of the $i$th segment of the group $C$ of MP_C sequences in the MP_C_BR $R$.
The proof that $D_{region}(Q', R)$ satisfies the group lower-bound condition (Lemma 1) is given in our previous work [23].
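To make the group lower-bound idea concrete, the sketch below shows one simple distance from a query to a bounding region built only from w, the query segment means and the two skylines. It is an assumption, not the formula of Definition 9: it uses the usual interval-gap form for mean-based bounding regions and ignores the bit sequences; the name `region_distance` is hypothetical.

```python
import math


def region_distance(q_means, c_max, c_min, w):
    """Interval-gap distance from query segment means to a bounding region.

    Assumed form (not taken from the paper): for each segment, the gap is
    the distance from the query mean to the interval [c_imin, c_imax],
    weighted by the segment length w.
    """
    total = 0.0
    for q, hi, lo in zip(q_means, c_max, c_min):
        if q > hi:
            gap = q - hi
        elif q < lo:
            gap = lo - q
        else:
            gap = 0.0  # the query mean lies between the two skylines
        total += w * gap * gap
    return math.sqrt(total)
```

Because every member of the group has its segment means between the two skylines, the per-segment gap never exceeds the difference between the query mean and that member's mean, which is why a function of this shape can serve as a group lower bound.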
We can index the MP_C representation of time series data by first building a Skyline index which is based on a spatial index structure such as the R∗-tree [1]. Each leaf node in the R∗-tree contains an MP_C sequence and a pointer referring to an original time series in the database. The MP_C_BR associated with a non-leaf node is the smallest bounding region that spatially contains the MP_C_BRs associated with its immediate child nodes.
4.2.2 Subsequence matching algorithm
The algorithm we use for subsequence matching with the MP_C method and the Skyline index consists of three main steps: index building, index searching, and post-processing.
For simplicity, we assume that the query sequence Q has the same length w as the sliding window. The inputs of the algorithm are the time series C, the query sequence Q and the threshold R. The output is the set of all the subsequences in C that are in R-match with Q. The algorithm is outlined as follows:
S1 [Index building] Use a sliding window of size w to extract subsequences of length w from the time series C and apply the MP_C transformation to each such subsequence. Store the features transformed from all such subsequences in the Skyline index.
S2 [Index searching] Apply the MP_C transformation to the query sequence Q. Search the index to find the candidate set of the subsequences of C that are in R-match with Q.
S3 [Post-processing] Examine the original subsequences of the time series C which correspond to the candidate set obtained at step S2 to discard the false alarms.
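The three steps can be sketched end to end in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: plain segment means stand in for the MP_C transformation, a linear scan over the stored features stands in for the Skyline index, and the window length is assumed to be a multiple of the number of segments. All function names are hypothetical.

```python
import math


def segment_means(seq, n_segments):
    """Stand-in feature transform: the mean of each equal-length segment."""
    seg_len = len(seq) // n_segments
    return [sum(seq[i * seg_len:(i + 1) * seg_len]) / seg_len
            for i in range(n_segments)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_match(series, query, r, n_segments):
    """Return start positions of subsequences within distance r of the query."""
    w = len(query)
    seg_len = w // n_segments
    # S1: slide a window of size w and store the features of each subsequence.
    index = [(start, segment_means(series[start:start + w], n_segments))
             for start in range(len(series) - w + 1)]
    # S2: transform the query and keep candidates whose lower-bounding
    # feature distance is within r (linear scan instead of a real index).
    q_feat = segment_means(query, n_segments)
    candidates = [start for start, feat in index
                  if math.sqrt(seg_len) * euclidean(q_feat, feat) <= r]
    # S3: check candidates against the original data to discard false alarms.
    return [start for start in candidates
            if euclidean(series[start:start + w], query) <= r]
```

The filter in S2 may let through subsequences that are not true matches, but because the feature distance never exceeds the true distance it cannot drop a true match; S3 then removes the false alarms.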
4.2.3 Node insertion algorithm
The algorithm which we use for inserting an MP_C sequence into the Skyline index is similar to the insert algorithm introduced in [8]. It includes four main steps.
S1 [Find a position for inserting an MP_C sequence] Descend the tree from the root node to find the best leaf node L for inserting the new entry.
S2 [Add the MP_C sequence to the leaf node] If L has enough space for another entry, insert the sequence. Otherwise, split the node L.
S3 [Propagate changes upward] Ascend from the leaf node L to the root node. Adjust MP_C_BRs and propagate node splits if necessary.
S4 [Grow the tree taller] If the root of the tree is split because of propagation, create a new root whose children are the two resulting nodes.
At each level of the tree, the process of finding a position for a new entry selects the node whose MP_C_BR needs the least enlargement to include this entry. If the new entry has a value which is outside the limits defined by the segment in the MP_C_BR, the value of that segment is updated so that the MP_C_BR can entirely contain the new entry. If a node needs to be split, its entries are redistributed as in Guttman's algorithm [8].
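The node-selection and region-update rules in the paragraph above can be sketched as follows. The text does not say how enlargement is measured, so this sketch assumes it is the total amount by which the two skylines must move to cover the new entry; that measure and the function names are hypothetical, and node splitting is omitted.

```python
def enlargement(c_max, c_min, entry):
    """Assumed measure: total skyline movement needed to cover the entry."""
    return sum(max(0.0, v - hi) + max(0.0, lo - v)
               for v, hi, lo in zip(entry, c_max, c_min))


def choose_child(regions, entry):
    """Pick the index of the region (c_max, c_min) needing least enlargement."""
    return min(range(len(regions)),
               key=lambda i: enlargement(regions[i][0], regions[i][1], entry))


def extend_region(c_max, c_min, entry):
    """Update each segment whose limits the entry falls outside of."""
    new_max = [max(hi, v) for hi, v in zip(c_max, entry)]
    new_min = [min(lo, v) for lo, v in zip(c_min, entry)]
    return new_max, new_min
```

An entry already enclosed by a region has enlargement zero, so it is routed to that region and leaves its skylines unchanged.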
4.3 The proposed algorithm
Figure 8 presents our algorithm for finding approximate k-motifs with the support of the Skyline index. In this motif discovery algorithm, subsequences are first extracted from a longer time series through a sliding window and transformed into a lower dimensionality by applying the MP_C method. Then, for each MP_C representation $s'_i$ of the subsequence $s_i$, the algorithm finds all its non-trivial matches within a range R among the subsequences that have been inserted into the Skyline index.
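The incremental pattern described above (match each new subsequence against those already inserted, then insert it) can be sketched as follows. This is a brute-force illustration under loud assumptions, not the algorithm of Figure 8: raw subsequences replace MP_C representations, a list of start positions replaces the Skyline index, two matches are treated as non-trivial simply when their windows do not overlap, and only the single best motif is reported. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_matches(series, w, r):
    """Map each window start to the starts of its non-overlapping matches."""
    matches = {}
    inserted = []  # start positions already "in the index"
    for i in range(len(series) - w + 1):
        s_i = series[i:i + w]
        matches[i] = []
        for j in inserted:
            # Assumption: non-overlapping windows count as non-trivial matches.
            if i - j >= w and euclidean(s_i, series[j:j + w]) <= r:
                matches[i].append(j)
                matches[j].append(i)
        inserted.append(i)  # insert s_i only after it has been matched
    return matches


def top_motif(matches):
    """Return the start of the window with the most matches."""
    return max(matches, key=lambda start: len(matches[start]))
```

Replacing the inner loop with a range query on an index is what turns this quadratic scan into the indexed search the section describes.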
In this algorithm, procedure NEAREST_NEIGHBORS_SKYLINE($s'_i$, Skyline index, R) is invoked to search for the non-trivial matches of the MP_C subsequence $s'_i$ within range R.
For a non-leaf node, procedure NEAREST_NEIGHBORS_SKYLINE uses the group distance function $D_{region}(s', R)$ between an MP_C subsequence $s'$ and a Skyline-bounding region MP_C_BR R in the index structure, defined by Definition 8 and satisfying the