DOI 10.1007/s00607-013-0333-1
A technique for extracting behavioral sequence patterns from GPS recorded data
Thi Hong Nhan Vu · Yang Koo Lee · The Duy Bui
Received: 29 December 2012 / Accepted: 14 May 2013
© Springer-Verlag Wien 2013
Abstract The mobile wireless market has been attracting many customers. Technically, the paradigm of anytime-anywhere connectivity raises previously unthinkable challenges, including the management of millions of mobile customers, their profiles, the profile-based selective information dissemination, and server-side computing infrastructure design issues to support such a large pool of users automatically and intelligently. In this paper, we propose a data mining technique for discovering frequent behavioral patterns from a collection of trajectories gathered by the Global Positioning System. Although the search space for spatiotemporal knowledge is extremely challenging, imposing spatial and temporal constraints on spatiotemporal sequences makes the computation feasible. Specifically, the mined patterns are subject to syntactic constraints, namely spatiotemporal sequence length restriction, minimum and maximum timing gap between events, time window of occurrence of the whole pattern, inclusion or exclusion event constraints, and frequent movement patterns predictive of one or more classes. The algorithm for mining all frequent constrained patterns is named cAllMOP. Moreover, to control the density of pattern regions, a clustering algorithm is exploited. The proposed method is efficient and scalable. Its efficiency is better than that of the previous algorithms AllMOP and GSP with respect to the compactness of discovered knowledge, execution time, and memory requirement.
T H N Vu (B) · T D Bui
Human Machine Interaction Laboratory, Vietnam National University,
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: vthnhan@gmail.com
Y K Lee
Robot/Cognitive System Research Department, Electronics and Telecommunication
Research Institute, Daejeon, Republic of Korea
Keywords Behavioral sequence patterns · Location-based services · Trajectory mining
Mathematics Subject Classification 68
1 Introduction
The availability and increasingly high accuracy of Global Positioning System (GPS) receivers attached to vehicles in today's transportation technology allow recording all the trajectories that are the traces of moving users as well as their portable devices. These trajectories contain detailed information about personal and vehicular mobile behaviors and therefore reveal interesting practical opportunities to find behavioral patterns to be used, for example, in traffic and sustainable mobility management and to study the accessibility of services.
The continuing miniaturization of electronics in display devices and in wireless communications, as well as improved performance of general computing technologies, enables the deployment of mobile Location-based Services (LBSs). These services integrate data derived from the users' requests with other user information in a multidimensional database [5,8]. Accumulated data are used for later modification of the services and for long-term decision making. LBSs, such as information systems (e.g., shopping, tourist, or traffic information systems supporting queries pertaining to physical user location) or server-side selective information dissemination (e.g., targeted advertisement based on user profiles and location information), are emerging application scenarios with far-reaching implications. To aid decision making and customization, data mining techniques can be applied to discover interesting knowledge about the behaviors of users. For example, classes of users which exhibit similar behaviors can be identified. These classes can be characterized by various attributes of the class members or the services they requested. Sequences of service requests made by users can also be analyzed to discover regularities in such sequences. These regularities can later be applied to make intelligent predictions about users' future behavior given the requests the user made in the past [20].

In this paper, we focus on techniques for discovering frequent movement patterns from a spatiotemporal database. We present a new algorithm called cAllMOP for discovering all frequent movement patterns with the following constraints: (1) length restriction; (2) minimum timing gap between events; (3) maximum timing gap; (4) a time window of occurrence of the whole pattern; (5) inclusion and exclusion event constraints; (6) patterns predictive of one or more classes. To fulfill the task of finding frequent unconstrained patterns, trajectories are frequently modeled as discrete moving points. However, knowledge of movements is limited by the ability of the devices used to measure them. Complete knowledge of a movement is impossible, but movement can be detected, stored, modeled, and analyzed with some degree of accuracy. To reduce the error in the observed locations, trajectories are reconstructed by re-sampling their positions and are then generalized. In the mining process, the use of syntactic constraints and the spatiotemporal proximity feature of the application domain makes the computation feasible. Moreover, because users move in a thematically partitioned space, we take into account the concept of a graph and the transitive property of the similarity measure of paths in the graph during candidate generation, which helps avoid unnecessary candidate pattern production. In addition, to control the density of the regions in patterns and automatically adjust the shape and size of the regions, we employ a grid-based clustering technique. The performance of cAllMOP is better than that of the algorithms AllMOP in [18] and GSP in [11] in terms of the compactness of the discovered knowledge, execution time, and memory requirement. It can therefore be applied well to LBS systems.
2 Related work
Sequential pattern mining is informally described as the discovery of inter-transaction patterns in large customer databases [2,6,11,14,21]. A sequence is a set of temporally ordered itemsets. Since the set of frequent sequences is a superset of frequent itemsets, sequential pattern mining algorithms often utilize some of the ideas proposed for the discovery of association rules [1,19]. One can divide approaches for finding frequent itemsets based on two criteria: their strategy to traverse the search space and their strategy to determine the support values of itemsets. Based on the first criterion, today's common approaches are either breadth-first search or depth-first search. A comparison of these approaches revealed that all of the methods have some types of data for which they perform better than the others [19]. This data mining task, except for transaction time, is in a sense dimensionless. Nevertheless, most data describing events in the real world are associated with space and time. Thus far, work on spatiotemporal data mining has mainly focused on the models and structures for indexing spatiotemporal objects [4,7,10] rather than on discovering movement patterns. Spatiotemporal pattern mining has been treated as a generalization of pattern mining in time-series data [3,5,13,16,17]. The algorithm offered in [9] discovers spatiotemporal periodic patterns from trajectories of equal length, which are then exploited in an index structure to support the execution of spatiotemporal queries. We are concerned with trajectories of arbitrary length and the problem of imprecise sampled points. Besides, another method, DFS_MINE in [12], was proposed to discover spatiotemporal sequential patterns for weather prediction. It seeks the relationships between time-varying attributes for a fixed location, but does not show how to apply them to movement pattern mining, in which one needs to seek the relationships between time-varying locations of objects with stable attributes. This problem was treated by our algorithm maxMOP in [18].
An unconstrained search can produce millions of rules or may not even be tractable in some cases. Discovery of sequences incorporating constraints has already received some attention in categorical domains [11,15], but to the best of our knowledge this problem has not been addressed in continuous domains, especially for spatiotemporal data. The algorithm GSP in [11] was the first to consider minimum and maximum gaps as well as a time window. GSP is an iterative algorithm, which counts candidate sequences of length k in the kth database scan. GSP requires as many full data scans as the length of the longest frequent sequence.
3 Problem definition
Definition 1 (Trajectory) The trajectory of a moving object with identifier $o_j$ is defined as a finite sequence of points $\{(p_1, vt_1), (p_2, vt_2), \ldots, (p_n, vt_n)\}$ in the $X \times Y \times VT$ space, where point $p_i$ is represented by coordinates $(x_i, y_i)$ at the sampled time $vt_i$ for $1 \le i \le n$.
The spatial organization of the map M is represented as a set of regions. A region is related to a specific thematic interpretation of space. So, M is represented as a finite set of regions $\{a_1, \ldots, a_n\}$ such that $\bigcup_{i=1}^{n} a_i = M$ with $a_i \cap a_j = \emptyset$ for $i \ne j$. The possibility of an object moving from region to region is represented by a directed graph. After decomposing M, we get a hierarchical structure as introduced in [3]. However, in this study we assume that a region of the lower level is 'fully contained' in a region of the higher level.
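Let T be the maximal timestamp among the timestamps of the trajectories in the moving object database DB. Let $o^j_i$ denote the position of the moving object $o_j$ at the $i$-th timestamp, for each object index $j$; the trajectory of $o_j$ is then the sequence of points $o^j_1 o^j_2 \ldots o^j_K$ for $1 \le K \le T$.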
Definition 2 (Spatiotemporal sequence) Given a minimal temporal interval $\tau$, a spatiotemporal sequence is a list of temporally ordered region labels $S = \langle (a_1, t_1), (a_2, t_2), \ldots \rangle$. The length of a sequence is determined by the function $length(S)$.
A location at time t is called an event. A sequence composed of k events is denoted as a k-sequence. For example, $\langle (R_1, t_1), (R_2, t_2), (R_2, t_3) \rangle$ is a 3-sequence.
Definition 3 (Subsequence) For a sequence $S_1$, if region $a_1$ occurs before $a_2$, we denote it as $a_1 < a_2$. We say $S_1$ is a subsequence of another sequence $S_2$ if there exists a one-to-one order-preserving function $f$ that maps regions in $S_1$ to regions in $S_2$ such that for every $a_i \in S_1$: (1) $a_i \cap f(a_i) \ne \emptyset$, (2) if $a_i < a_j$ then $f(a_i) < f(a_j)$, and (3) $t_{a_{i+1}} - t_{a_i} = t_{f(a_{i+1})} - t_{f(a_i)}$.
Definition 4 (Frequent movement pattern) A trajectory is said to comply with a moving sequence S if for each region $a_i \in S$ at $vt_i$, the point $o^j_i$ of the trajectory is in $a_i$ at the same time. The support $support(S)$ of the sequence S is defined as the number of trajectories in DB complying with it. If $support(S) \ge min\_sup$, where min_sup is a user-specified minimum support threshold, then S is called a frequent pattern.
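As a minimal sketch of Definition 4, support counting might be expressed in Python as follows. The data representation is an assumption made for illustration only: each trajectory is a dictionary from timestamp to the label of the region containing the object, a pattern is a list of (region, timestamp) events, and min_sup is given as an absolute count. The names `complies`, `support`, and `is_frequent` are hypothetical and do not come from the paper.

```python
from typing import Dict, Hashable, List, Tuple

Event = Tuple[str, Hashable]       # (region label, timestamp)
Trajectory = Dict[Hashable, str]   # timestamp -> region containing the object


def complies(trajectory: Trajectory, pattern: List[Event]) -> bool:
    """True if the object is inside every pattern region at the required time."""
    return all(trajectory.get(t) == region for region, t in pattern)


def support(db: List[Trajectory], pattern: List[Event]) -> int:
    """Number of trajectories in the database that comply with the pattern."""
    return sum(1 for trajectory in db if complies(trajectory, pattern))


def is_frequent(db: List[Trajectory], pattern: List[Event], min_sup: int) -> bool:
    """A pattern is frequent when its support reaches the threshold."""
    return support(db, pattern) >= min_sup


# Toy database of three generalized trajectories (made-up values).
db = [
    {"t1": "R1", "t2": "R0", "t3": "R2"},
    {"t1": "R1", "t2": "R0", "t3": "R3"},
    {"t1": "R4", "t2": "R0", "t3": "R2"},
]
pattern = [("R1", "t1"), ("R0", "t2")]
```

In this toy database two of the three trajectories comply with the pattern, so it is frequent for a minimum support count of 2 but not for 3.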
To control the density of a pattern region, the density-based partitioning method is exploited. Each region $a_i$ of pattern S is dense if the set of positions $A_i = \{o^j_i \mid o^j_i \in a_i\}$ forms a dense cluster. According to the definition in [16], a dense cluster is defined by two parameters, r and MinPts. We apply a modified version of the partitioning method on a multi-level spatiotemporal grid. Progressing from finer to coarser levels, one can find locally dense cells, which can later be combined with nearby dense grid cells to form clusters. The size of a cell at the lowest level is decided based on the imprecision degree of the moving points, which will be presented below. In our case, MinPts is equal to the value $min\_sup * N_o$. So, if all regions in S are dense, then S is frequent.

Fig. 1 Spatiotemporal unit (labels: M, γ, r)
Problem definition: Given a database DB of trajectories along with the maximum speed $v_{max}$ of the moving objects, the sampling rate t, a reference map $M \subseteq \mathbb{R}^2$ decomposed into regions and accompanied by a directed graph, and a minimum support min_sup, the problem is (1) to discover all frequent movement patterns from the database and (2) to discover patterns with syntactic constraints.
4 Process of discovering behavioral movement patterns
4.1 Movement summarization
To make the representation of a trajectory more precise, we need to re-sample the moving points. The sampling error across time was proved to be an ellipse [15], given the object's maximal speed $v_{max}$ and two consecutive moving points. The error ellipse is used as a measure of the size of the sampling error per line segment. In the worst case, the error is a circle, and this is the case we deal with here. To make the operation simpler and more flexible, we operate on its minimum bounding rectangle (MBR), which is also the cell of the map explained below.
For a grid threshold r and without time, the reference map M is decomposed into an $n_x \times n_y$ array of equal-sized cells. When time is included, M is decomposed into uniform spatiotemporal units (see Fig. 1). The choice of cell size r affects the accuracy of the obtained result. In fact, the object's maximum velocity $v_{max}$ and the chosen re-sampling rate $\rho$ influence this choice. The re-sampling rate and cell size must be selected so that a trajectory produces at least one hit in each cell that it visits. As a rule of thumb, the parameters r and $\rho$ must be selected such that $(v_{max}/\rho) \ll r\sqrt{2}$. Additionally, the temporal extent $\gamma$ is determined a priori and may change depending on the application. As a rule of thumb, it should be chosen such that $1 \ll \rho\gamma$, as $\rho\gamma$ is a measure of the expected number of hits per cell [14].
The reference map M, with its origin at the point with coordinates $(x_0, y_0)$, is represented as a regular grid and stored in an array $D[1:n_x, 1:n_y]$. Each element D[i, j] corresponds to one cell $D_{ij}$, which is also a page to which the moving points are assigned.

For a movement, we eliminate all consecutive points falling in the same cell and keep only the first point with its corresponding timestamp. Assume that after projecting all the moving points in the database DB into cells (pages), we obtain the result presented in Fig. 2a. We find that two cells, $D_{20}$ and $D_{12}$, contain more than one point, so we remove the second point in each, (25,7) and (16,29), respectively. Finally, the preprocessed database is shown in Fig. 2b.
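The summarization step described above can be sketched in Python under a few stated assumptions: the cell size is 10 units, the map origin is (0, 0), cell indices are zero-based, and the first point of each cell in the example, (21, 3) and (16, 22), is invented for illustration (only the removed points (25,7) and (16,29) appear in the text). The function names `cell_of` and `summarize` are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, int]   # (x, y, timestamp)
Cell = Tuple[int, int]             # (i, j) index of a grid cell


def cell_of(x: float, y: float, x0: float, y0: float, r: float) -> Cell:
    """Map a coordinate to the index of the square cell of side r containing it."""
    return (int(math.floor((x - x0) / r)), int(math.floor((y - y0) / r)))


def summarize(points: List[Point], x0: float, y0: float, r: float) -> List[Tuple[Cell, int]]:
    """Keep only the first point of every run of consecutive points in one cell."""
    kept: List[Tuple[Cell, int]] = []
    last_cell: Optional[Cell] = None
    for x, y, t in points:
        cell = cell_of(x, y, x0, y0, r)
        if cell != last_cell:          # a new cell is entered: keep this hit
            kept.append((cell, t))
            last_cell = cell
    return kept


# One trajectory; the second point in each cell is dropped.
points = [(21, 3, 1), (25, 7, 2), (16, 22, 3), (16, 29, 4)]
summary = summarize(points, x0=0, y0=0, r=10)
```

With these assumed parameters, (25,7) falls in the same cell as its predecessor (index (2, 0)) and (16,29) in the same cell as its predecessor (index (1, 2)), so both are discarded and only the first hit per cell remains.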
4.2 Data set transformation
Physically, the data structure of each cell in the spatiotemporal sequence is constructed in the form $(D_{ij}, o_j, vt_i)$, in which $D_{ij}$ contains a pointer to the page D[i, j] where the position of object $o_j$ at time $vt_i$ is stored. If the lifespans of all trajectories fall on the same weekdays, we omit the date when representing timestamps. Figure 3 is an example of transforming time series of locations into spatiotemporal sequences with the minimum temporal interval $\gamma = 30$. Ultimately, the database of trajectories is converted into a set of spatiotemporal sequences, each associated with a distinct identifier $o_j$.
4.3 Strategy for mining all frequent patterns with syntactic constraints
The considered constraints include minimum gap, maximum gap, time window of validity of the pattern, and classes of frequent and confident rules.
We directly prune candidates that violate syntactic constraints while finding frequent patterns. The task is accomplished by extending the algorithm AllMOP; the method here is named cAllMOP. It takes as its input the set MS of spatiotemporal sequences. The candidate generating mechanism of our technique is based on the breadth-first search strategy used by GSP, with an additional temporal join operation and a technique for pruning candidates. Moreover, due to the complexity of the data type here, a clustering method to control the dense regions of the patterns is exploited. The concept of a directed graph also helps avoid the creation of redundant candidate patterns.
The algorithm makes multiple passes, producing longer patterns on the basis of shorter ones, until no more patterns can be created. First, we explain how to find the frequent 1-patterns from which longer ones will be generated.

Fig. 3 Example spatiotemporal sequence ⟨…, (D31, 9:00), (D21, 9:30)⟩

Unlike the concept of items defined in GSP, not only the labels of pattern regions but also their shapes and sizes play an important role in the process of frequent pattern discovery. The shape and size of a region change from pattern to pattern; they therefore need to be automatically adjusted at each pass. This is the reason why the prior techniques cannot be directly applied to our problem. The issue is dealt with in the following way. First, the set of generalized trajectories is decomposed into groups of moving points, each $A_i = \{o^j_i \mid o^j_i \text{ is a position inside } a_i\}$ for one timestamp. Frequent 1-patterns are obtained by clustering the points in the groups $A_i$. Specifically, to find them, for each timestamp $vt_i$ we scan the set MS to determine the frequency of each cell and keep only the frequent ones. Next, the consecutive dense cells belonging to the same region $a_i$ are merged into larger regions, which might be merged further to form clusters. The points lying in the sparse cells are assigned to the found clusters by applying a range query with diameter $r/\sqrt{2}$. The points belonging to no cluster are called outliers and are eliminated from the cells as soon as they are found. The empty cells are discarded at the same time. Frequent 1-patterns are maintained in the set $F_1$.
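The search for frequent 1-patterns at a single timestamp can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes that sequences are stored as a dictionary from object identifier to a timestamp-to-cell mapping, that "consecutive" cells means cells sharing an edge in the grid, and that min_sup is an absolute count. The assignment of points in sparse cells by range query is omitted, and the names `dense_cells` and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def dense_cells(sequences: Dict[str, Dict[int, Cell]], t: int,
                min_sup: int) -> Dict[Cell, Set[str]]:
    """Cells visited at time t by at least min_sup distinct objects."""
    visitors: Dict[Cell, Set[str]] = defaultdict(set)
    for oid, seq in sequences.items():
        if t in seq:
            visitors[seq[t]].add(oid)
    return {c: objs for c, objs in visitors.items() if len(objs) >= min_sup}


def merge_adjacent(cells: Set[Cell], region_of: Dict[Cell, str]) -> List[Set[Cell]]:
    """Group dense cells that share an edge and belong to the same region."""
    remaining = set(cells)
    clusters: List[Set[Cell]] = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:                       # flood fill over grid neighbours
            i, j = frontier.pop()
            for n in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if n in remaining and region_of[n] == region_of[seed]:
                    remaining.remove(n)
                    cluster.add(n)
                    frontier.append(n)
        clusters.append(cluster)
    return clusters


# Made-up example: five objects observed at timestamp 4.
sequences = {
    "o1": {4: (3, 1)}, "o2": {4: (3, 1)},
    "o3": {4: (4, 1)}, "o4": {4: (4, 1)},
    "o5": {4: (0, 0)},
}
region_of = {(3, 1): "R3", (4, 1): "R3", (0, 0): "R4"}
dense = dense_cells(sequences, t=4, min_sup=2)
clusters = merge_adjacent(set(dense), region_of)
```

In this example the two neighbouring cells of region R3 are dense and merge into one cluster, while the single point of object o5 lies in a sparse cell and would be treated as an outlier.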
Figure 4a depicts a set of trajectories in a 2D space after passing through the trajectory generalization operation. Assume that the maximal timestamp T is 5 and the min_sup count is 2; these values are used in the illustration throughout the algorithm. In this example, the reference map M consists of six regions denoted by $R_j$ ($0 \le j \le 5$). The numbers marked in each region $R_j$ index the cells belonging to that region. A cell is denoted by the combination of the region's index and a number (e.g., the cell indexed by the number 1 in region R1 is denoted as $R1_1$, and the merge of $R1_1$ and $R1_2$ is denoted as $R1_{12}$). Because we consider only the spatial relationship "full containment", a cell is contained in just one region $R_j$. Therefore, each region $R_j$ can be represented by a set of distinct cells (e.g., R1 is composed of three cells $R1_1$, $R1_2$, $R1_3$).
Moving points are first projected into cells, which have pointers to pages; the trajectories at this point can be obtained by accessing the pages in which they are physically stored. From Fig. 4a, we can see that the starting point of object 1 at time $t_1$ lies logically in the cell $R1_3$ and is physically stored in the element D[2,2]. Its next position at time $t_2$ falls in the cell $R0_1$ and is stored in D[3,2].
The dense regions are then found. Figure 5a shows the groups of moving points obtained after partitioning the trajectories in Fig. 4a. Different groups are denoted by different shapes of points. Consider the example of finding dense regions at time $t_4$. We found three dense cells $R3_1$, $R3_2$, $R3_3$ referring to three different pages, namely D[3,1], D[4,4], and D[5,1]. Since all of them are neighboring cells and belong to the same spatial region R3, they are merged to create a cluster R3. However, the cluster R3 still points to the three pages corresponding to the original cells creating it. That means regions are logically combined while physical pages are preserved. Contrary to R3, at time $t_4$ only one point falls into the cell $R4_1$ of the region R4. This point also cannot be assigned to any existing cluster, so it is an outlier and is discarded. The same operation is performed for the other groups of points; all frequent 1-patterns $F_1$ are finally obtained and displayed in Fig. 5b.
Fig. 5 Example of frequent 1-patterns. (a) The result F1 (b) Clustering points to find dense regions
In the following, we first see how the algorithm AllMOP works without any constraints. This algorithm forms the basis for cAllMOP.
For k = 2, candidate patterns are created by making a temporal join of $F_1$ with itself. Let $(a_i, t_i)$ and $(a_j, t_j)$ be two 1-patterns in $F_1$. A candidate pattern $\langle (a_i, t_i), (a_j, t_j) \rangle$ is created if $t_j > t_i$ and the regions $a_i$ and $a_j$ are neighbors as determined by the map and the graph.
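A minimal sketch of this k = 2 candidate generation is given below. It interprets "neighbors" as the existence of an edge from $a_i$ to $a_j$ in the directed graph, which is an assumption; staying in the same region would therefore require an explicit self-loop. The function name `candidates_2` and the example regions are hypothetical, and the subsequent join on object identifiers and the density validation are not shown.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Event = Tuple[str, int]   # (region label, timestamp)


def candidates_2(f1: List[Event],
                 graph: Dict[str, Set[str]]) -> List[Tuple[Event, Event]]:
    """Temporal self-join on F1.

    (a_i, t_i) is paired with (a_j, t_j) when t_j > t_i and the directed
    graph allows a move from a_i to a_j.
    """
    return [
        (e1, e2)
        for e1, e2 in product(f1, repeat=2)
        if e2[1] > e1[1] and e2[0] in graph.get(e1[0], set())
    ]


# Made-up frequent 1-patterns and region graph.
f1 = [("R1", 1), ("R0", 2), ("R2", 3)]
graph = {"R1": {"R0"}, "R0": {"R2"}}
c2 = candidates_2(f1, graph)
```

Here the pair (R1, 1), (R2, 3) respects time order but is rejected because the graph has no edge from R1 to R2, which is how the graph avoids creating redundant candidates.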
Then, for $k > 2$, candidate patterns are generated in this way: given a set $F_{k-1}$ of $(k-1)$-patterns, the candidates for the next pass are enumerated by making a temporal join of $F_{k-1}$ with itself. A pattern $s_1 = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_{k-1}, t_{k-1}) \rangle$ joins with a second $(k-1)$-pattern. To facilitate candidate generation, the MBRs of the pattern regions are exploited. If all intersections of those pairs are non-empty, the created candidate pattern will be of the form $cand = \langle (a_1, t_1), (c_2, t_2), \ldots, (c_{k-1}, t_{k-1}), (b_k, t_k) \rangle$. The join uses the criteria $R_i.Oid = R_j.Oid$ (or $Cand = R_i \bowtie_{R_i.Oid = R_j.Oid \text{ and } R_i.vt_i = R_j.vt_j} R_j$). That means the points joined must comply with time order and belong to the same trajectory. The support value is then the number of objects $o_j$ that comply with the candidate pattern.
It is worth noting that the regions of the candidate pattern might no longer be dense after such a join operation. This problem can be handled by performing a validation. If the support value is at least min_sup, then the regions of a k-candidate pattern ($k \ge 2$) must be re-clustered to validate whether they are dense. The points of a candidate pattern may be grouped into more than one cluster. If so, a new candidate pattern is created for each cluster. Ultimately, more than one actual frequent pattern may be obtained from a single candidate pattern cand. During the re-clustering process, some points may be removed because they are outliers with respect to the new clusters.
The final aspect, which plays an important role in the discovery of frequent patterns, is how to prune candidates. We exploit the closure property in [1], that is, all subpatterns of a frequent pattern are also frequent.
In addition, to tackle this problem we maintain a list of minimal infrequent patterns in memory, called MinInfreqList. This list is initialized with all infrequent 2-patterns. Each time a new candidate pattern is generated, we check whether it is a superpattern of any pattern in MinInfreqList. If so, it is discarded right away without carrying out the temporal join. An infrequent pattern cand is inserted into this list when all of the following conditions hold: cand does not currently exist in MinInfreqList; cand is not a superpattern of any pattern in this list; and cand, generated after the temporal join operation, was found to be infrequent. After insertion, we remove all superpatterns of cand from the list. The structure is kept as a linked list and sorted in increasing order of pattern length.
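The bookkeeping for MinInfreqList might be sketched as follows. The sketch makes two simplifying assumptions that are not from the paper: a subpattern is taken to be an order-preserving subsequence of identical events (Definition 3 is more general), and a Python list stands in for the linked list. The third insertion condition (that cand was found infrequent after the temporal join) is left to the caller, and the method names `prunes` and `insert` are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, int]          # (region label, timestamp)
Pattern = Tuple[Event, ...]


def is_subpattern(small: Pattern, big: Pattern) -> bool:
    """Order-preserving containment of events (a simplification of Definition 3)."""
    it = iter(big)
    return all(event in it for event in small)


class MinInfreqList:
    """Minimal infrequent patterns, kept sorted by increasing length."""

    def __init__(self) -> None:
        self.patterns: List[Pattern] = []

    def prunes(self, cand: Pattern) -> bool:
        """True if cand equals or is a superpattern of a stored pattern."""
        return any(is_subpattern(p, cand) for p in self.patterns)

    def insert(self, cand: Pattern) -> bool:
        """Insert an infrequent candidate; return False if it is redundant."""
        if self.prunes(cand):
            return False
        # Drop stored superpatterns of cand so that only minimal ones remain.
        self.patterns = [p for p in self.patterns if not is_subpattern(cand, p)]
        self.patterns.append(cand)
        self.patterns.sort(key=len)
        return True


infreq = MinInfreqList()
infreq.insert((("R1", 1), ("R0", 2)))
```

A candidate such as ⟨(R1, 1), (R0, 2), (R2, 3)⟩ would then be rejected by `prunes` before any temporal join is attempted, since it contains a stored infrequent pattern.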
Now we consider an example of generating candidate k-patterns. With the concept of the graph and time order compliance, the total number of candidate 2-patterns generated by AllMOP is 18, as illustrated in Fig. 6.
Next, we perform a temporal join on the regions of every candidate pattern in $C_2$. First, consider the candidate pattern $\langle (R1_{23}, t_1), (R0_1, t_2) \rangle$ shown in Fig. 7a. After joining the points of $R1_{23}$ with those of $R0_1$, the size of cluster $R0_1$ is still maintained but cluster $R1_{23}$ shrinks to $R1_3$. Second, consider the candidate pattern $\langle (R2_3, t_3), (R3, t_4) \rangle$ depicted in Fig. 7b. After joining the points of $R2_3$ with those of R3, the region $R2_3$ is still the same but the remaining points of R3 are grouped into two clusters $R3_1$ and $R3_3$. Therefore, two new patterns are created, namely $\langle (R2_3, t_3), (R3_1, t_4) \rangle$ and $\langle (R2_3, t_3), (R3_2, t_4) \rangle$. With the same operation applied to the rest of $C_2$, all frequent 2-patterns $F_2$ are obtained.
In this example, the process repeats the join operation until all frequent 5-patterns are found.
Algorithm cAllMOP includes an added constraint check in the pattern discovery process. Because the time difference between events is the main concern when checking timing constraints, we simplify the timestamp representation as follows. We fix the hour elements in the events' timestamps and suppose the minute elements are as specified in Table 1.
An issue related to candidate pruning when imposing constraints on the events of a pattern is whether the closure property still holds and whether the candidate pruning strategy presented above remains valid. A modification of this property can be stated for this case.
... result in a combinatorial blow-up in the number of frequent patterns. We therefore need to restrict the maximum allowable length of a pattern to make the task tractable. Imposing a restriction on the lengths of patterns poses no difficulty. We need only add a check that $length(S) \le \xi$, where $\xi$ is the user-defined maximum length. Length has no effect on the closure property; the restriction is thus closure property preserving.
... time difference between consecutive events of a pattern. First, we find the frequent patterns in which events occur after a given interval. The minimum timing constraint is closure property preserving, so handling it is straightforward. Let $\langle (A, t_A) \to S \rangle$ and $\langle S \to (B, t_B) \rangle$ be two frequent patterns in which A and B are two different events, and the symbol → simply denotes the happens-after relationship between an event and a sequence. If the candidate pattern $\langle (A, t_A) \to S \to (B, t_B) \rangle$ is frequent with minimum gap at least $\tau$, then obviously A and S must be $\tau$ apart (i.e., $\forall (a_i, t_i) \in S$, $t_{a_i} - t_A > \tau$). In other words, $\langle (A, t_A) \to S \to (B, t_B) \rangle$ is infrequent if it contains an infrequent subpattern.
Since S remains self-contained, all we need to do here is add a minimum gap $\tau$ check in the join operation. Two cases need to be considered in pattern generation. For k = 2, a candidate pattern $\langle (a_i, t_i), (a_j, t_j) \rangle$ is only generated from joining $(a_i, t_i)$ and $(a_j, t_j)$ if $t_j - t_i > \tau$ and the regions $a_i$ and $a_j$ are neighbors. For $k > 2$, the candidate pattern $\langle (A, t_A) \to S \to (B, t_B) \rangle$ is obtained by joining $\langle (A, t_A) \to S \rangle$ and $\langle S \to (B, t_B) \rangle$ with $S = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_n, t_n) \rangle$ if $t_{a_1} - t_A > \tau$ and $t_B - t_{a_n} > \tau$.
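The minimum gap check inside the join can be illustrated with the following sketch. It simplifies the method in two ways that are assumptions of this example: the shared part S must match exactly (the paper works with region intersections), and the neighbor test and support counting are omitted. Times are given in minutes, and the function names are hypothetical. Checking every consecutive pair is slightly more than the text requires, since the two end gaps suffice when both inputs already satisfy the constraint.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, int]   # (region label, time in minutes)


def satisfies_min_gap(pattern: List[Event], tau: int) -> bool:
    """Every pair of consecutive events must be more than tau apart."""
    return all(t2 - t1 > tau for (_, t1), (_, t2) in zip(pattern, pattern[1:]))


def join_with_min_gap(left: List[Event], right: List[Event],
                      tau: int) -> Optional[List[Event]]:
    """Join <A -> S> with <S -> B> into <A -> S -> B>.

    Returns None when the shared part S differs or the gap constraint fails.
    For 1-patterns S is empty, which covers the k = 2 case.
    """
    if left[1:] != right[:-1]:
        return None
    cand = left + [right[-1]]
    return cand if satisfies_min_gap(cand, tau) else None


tau = 10
ok = join_with_min_gap([("R1", 100)], [("R0", 120)], tau)
rejected = join_with_min_gap([("R2", 120)], [("R3", 130)], tau)
longer = join_with_min_gap([("R1", 100), ("R0", 120)],
                           [("R0", 120), ("R2", 140)], tau)
```

The second call is rejected because a gap of exactly 10 minutes is not strictly greater than τ = 10, so candidates whose events are exactly τ apart are pruned.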
Figure 8 shows examples of the AllMOP-generated candidates that pass or fail the minimum gap constraint, assuming that $\tau = 10$. We observe that two candidates fail the constraint because the time difference between their events is $130 - 120 = 10 = \tau$, which is not strictly greater than $\tau$. Consequently, the number of frequent 2-patterns returned by cAllMOP is two less than that of AllMOP.
... maximum timing constraint on a pattern, the closure property no longer holds. Consider two frequent patterns $\langle (A, t_A) \to S \rangle$ and $\langle S \to (B, t_B) \rangle$ from which the pattern $\langle (A, t_A) \to S \to (B, t_B) \rangle$ is created by the temporal join operation. Assuming