A technique for extracting behavioral sequence patterns from GPS recorded data



DOI 10.1007/s00607-013-0333-1


Thi Hong Nhan Vu · Yang Koo Lee · The Duy Bui

Received: 29 December 2012 / Accepted: 14 May 2013

Abstract The mobile wireless market has been attracting many customers. Technically, the paradigm of anytime-anywhere connectivity raises previously unthinkable challenges, including the management of millions of mobile customers, their profiles, profile-based selective information dissemination, and server-side computing infrastructure design issues to support such a large pool of users automatically and intelligently. In this paper, we propose a data mining technique for discovering frequent behavioral patterns from a collection of trajectories gathered by the Global Positioning System. Although the search space for spatiotemporal knowledge is extremely challenging, imposing spatial and temporal constraints on spatiotemporal sequences makes the computation feasible. Specifically, the mined patterns are incorporated with syntactic constraints, namely spatiotemporal sequence length restriction, minimum and maximum timing gap between events, time window of occurrence of the whole pattern, inclusion or exclusion event constraints, and frequent movement patterns predictive of one or more classes. The algorithm for mining all frequent constrained patterns is named cAllMOP. Moreover, to control the density of pattern regions, a clustering algorithm is exploited. The proposed method is efficient and scalable. Its efficiency is better than that of the previous algorithms AllMOP and GSP with respect to the compactness of discovered knowledge, execution time, and memory requirement.

T. H. N. Vu (✉) · T. D. Bui

Human Machine Interaction Laboratory, Vietnam National University,

144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

e-mail: vthnhan@gmail.com

Y. K. Lee

Robot/Cognitive System Research Department, Electronics and Telecommunication

Research Institute, Daejeon, Republic of Korea


Keywords Behavioral sequence patterns · Location-based services · Trajectory mining

Mathematics Subject Classification 68

1 Introduction

The availability and increasingly high accuracy of Global Positioning System (GPS) receivers attached to vehicles in today’s transportation technology allow recording all the trajectories that are the traces of moving users as well as their portable devices. These trajectories contain detailed information about personal and vehicular mobile behaviors and therefore reveal interesting practical opportunities to find behavioral patterns to be used, for example, in traffic and sustainable mobility management and to study the accessibility of services.

The continuous miniaturization of electronics technologies in display devices and in wireless communications, as well as the improved performance of general computing technologies, enable the deployment of mobile Location-based Services (LBSs). These services integrate data derived from the users’ requests with other user information in a multidimensional database [5,8]. Accumulated data are used for later modification of the services and for long-term decision making. LBSs, such as information systems (e.g., shopping, tourist, and traffic information systems supporting queries pertaining to physical user location) or server-side selective information dissemination (e.g., targeted advertisement based on user profiles and location information), are emerging application scenarios with far-reaching implications. To aid decision making and customization, data mining techniques can be applied to discover interesting knowledge about the behaviors of users. For example, classes of users which exhibit similar behaviors can be identified. These classes can be characterized by various attributes of the class members or the services they requested. Sequences of service requests made by users can also be analyzed to discover regularities in such sequences. These regularities can later be applied to make intelligent predictions about users’ future behavior given the requests the user made in the past [20].

In this paper, we focus on techniques for discovering frequent movement patterns from a spatiotemporal database. We present a new algorithm called cAllMOP for discovering all frequent movement patterns with the following constraints: (1) length restriction; (2) minimum timing gap between events; (3) maximum timing gap; (4) a time window of occurrence of the whole pattern; (5) inclusion and exclusion event constraints; (6) patterns predictive of one or more classes. To fulfill the task of finding frequent unconstrained patterns, trajectories are frequently modeled as discrete moving points. However, knowledge of movements is limited by the ability of the devices used to measure them. Complete knowledge of a movement is impossible, but movement can be detected, stored, modeled, and analyzed with some degree of accuracy. To reduce the error in the observed locations, trajectories are reconstructed by re-sampling their positions and are then generalized. In the mining process, the use of syntactic constraints and the spatiotemporal proximity feature of the application domain makes the computation feasible. Moreover, because users move in a thematically partitioned space, we take into account the concept of a graph and the transitive property of the similarity measure of paths in the graph during candidate generation, which helps avoid unnecessary candidate pattern production. In addition, to control the density of the regions in patterns and automatically adjust the shape and size of the regions, we employ a grid-based clustering technique. The performance of cAllMOP is better than that of the algorithms AllMOP in [18] and GSP in [11] in terms of the compactness of the discovered knowledge, execution time, and memory requirement. It can therefore be applied well to LBS systems.

2 Related work

Sequential pattern mining is informally described as the discovery of inter-transaction patterns in large customer databases [2,6,11,14,21]. A sequence is a set of temporally ordered itemsets. Since the set of frequent sequences is a superset of frequent itemsets, sequential pattern mining algorithms often utilize some of the ideas proposed for the discovery of association rules [1,19]. One can divide approaches for finding frequent itemsets based on two criteria: their strategy to traverse the search space and their strategy to determine the support values of itemsets. Based on the first criterion, today’s common approaches are either breadth-first search or depth-first search. A comparison of these approaches revealed that all of the methods have some types of data for which they perform better than the others [19]. This data mining task, except for transaction time, is in a sense dimensionless. Nevertheless, most data describing events in the real world are associated with space and time. Thus far, work on spatiotemporal data mining has mainly focused on the models and structures for indexing spatiotemporal objects [4,7,10] rather than discovering movement patterns. Spatiotemporal pattern mining has been treated as a generalization of pattern mining in time-series data [3,5,13,16,17]. The algorithm offered in [9] discovers spatiotemporal periodic patterns from trajectories of equal length, which are then exploited in an index structure to support the execution of spatiotemporal queries. We are concerned with trajectories of arbitrary length and the problem of imprecise sampled points. Besides, another method, DFS_MINE in [12], was proposed to discover spatiotemporal sequential patterns for weather prediction. It seeks the relationships between time-varying attributes for fixed locations, but does not show how to apply them to movement pattern mining, in which one needs to seek the relationships between time-varying locations of objects with stable attributes. This problem was treated by our algorithm maxMOP in [18].

An unconstrained search can produce millions of rules or may not even be tractable in some cases. Discovery of sequences incorporating constraints has already received some attention in categorical domains [11,15], but to the best of our knowledge this problem has not been addressed in continuous domains, especially for spatiotemporal data. The algorithm GSP in [11] was the first to consider minimum and maximum gaps as well as a time window. GSP is an iterative algorithm, which counts candidate sequences of length k in the kth database scan. GSP requires as many full data scans as the length of the longest frequent sequence.


3 Problem definition

Definition 1 (Trajectory) The trajectory of a moving object with identifier $o_j$ is defined as a finite sequence of points $\{(p_1, vt_1), (p_2, vt_2), \ldots, (p_n, vt_n)\}$ in the $X \times Y \times VT$ space, where point $p_i$ is represented by coordinates $(x_i, y_i)$ at the sampled time $vt_i$ for $1 \le i \le n$.

The spatial organization of the map $M$ is represented as a set of regions. Each region is related to a specific thematic interpretation of space. So, $M$ is represented as a finite set of regions $\{a_1, \ldots, a_n\}$ such that $\bigcup_{i=1}^{n} a_i = M$ with $a_i \cap a_j = \emptyset$ for $i \neq j$. The possibility of an object moving from region to region is represented by a directed graph. After decomposing $M$, we get a hierarchical structure as introduced in [3]. However, in this study we assume that a region of the lower level is ‘fully contained’ in a region of the higher level.

Let $T$ be the maximal timestamp among the timestamps of the trajectories in the moving object database $DB$. Let $o_i^j$ denote the position of the moving object $o_j$, for $1 \le j \le$ [...] sequence of points $o_1^j o_2^j \ldots o_K^j$ for $1 \le K \le T$.

Definition 2 (Spatiotemporal sequence) Given a minimal temporal interval $\tau$, a spatiotemporal sequence is a list of temporally ordered region labels $S = \langle (a_1, t_1), (a_2, t_2), \ldots \rangle$ [...] this length is determined by the function $length(S)$.

A location at time $t$ is called an event. A sequence composed of $k$ events is denoted as a $k$-sequence. For example, $\langle (R_1, t_1), (R_2, t_2), (R_2, t_3) \rangle$ is a 3-sequence.

Definition 3 (Subsequence) For a sequence $S_1$, if region $a_1$ occurs before $a_2$, we denote it as $a_1 < a_2$. We say $S_1$ is a subsequence of another sequence $S_2$ if there exists a one-to-one order-preserving function $f$ that maps regions in $S_1$ to regions in $S_2$ such that for every $a_i \in S_1$: (1) $a_i \cap f(a_i) \neq \emptyset$, (2) if $a_i < a_j$ then $f(a_i) < f(a_j)$, and (3) $t_{a_{i+1}} - t_{a_i} = t_{f(a_{i+1})} - t_{f(a_i)}$.

Definition 4 (Frequent movement pattern) A trajectory is said to comply with a moving sequence $S$ if for each region $a_i \in S$ at $vt_i$, the point $o_i^j$ of the trajectory is in $a_i$ at the same time. The support $support(S)$ of the sequence $S$ can be defined as the number of trajectories in $DB$ complying with it. If $support(S) \ge min\_sup$, where $min\_sup$ is a user-specified minimum support threshold, then $S$ is called a frequent pattern.
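As a rough illustration of Definition 4, the following sketch counts the trajectories that comply with a pattern. It assumes, purely for illustration, that each region is an axis-aligned rectangle and that a trajectory is a mapping from timestamp to position. The names `Region`, `complies`, and `support` and the sample data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# Hypothetical region model: axis-aligned rectangle (xmin, ymin, xmax, ymax).
Region = Tuple[float, float, float, float]
Pattern = List[Tuple[Region, int]]   # [(region, timestamp), ...]
Trajectory = Dict[int, Point]        # timestamp -> sampled position


def contains(region: Region, p: Point) -> bool:
    """True if point p lies inside the rectangle (borders included)."""
    xmin, ymin, xmax, ymax = region
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def complies(traj: Trajectory, pattern: Pattern) -> bool:
    """True if the trajectory is inside every pattern region at that region's time."""
    return all(t in traj and contains(region, traj[t]) for region, t in pattern)


def support(db: List[Trajectory], pattern: Pattern) -> int:
    """Number of trajectories in db that comply with the pattern."""
    return sum(complies(traj, pattern) for traj in db)


# Made-up data: three trajectories, one 2-event pattern, min_sup given as a count.
db = [
    {1: (1.0, 1.0), 2: (6.0, 1.0)},
    {1: (2.0, 2.0), 2: (7.0, 3.0)},
    {1: (9.0, 9.0), 2: (6.0, 2.0)},
]
pattern = [((0, 0, 5, 5), 1), ((5, 0, 10, 5), 2)]
min_sup = 2
is_frequent = support(db, pattern) >= min_sup
```

Here the first two trajectories comply and the third does not, so the pattern reaches the threshold. Real regions are thematic areas of the map rather than rectangles, so `contains` would have to be replaced accordingly.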

To control the density of a pattern region, the density-based partitioning method is exploited. Each region $a_i$ of pattern $S$ is dense if the set of positions $A_i = \{o_i^j \mid o_i^j \in a_i\}$ forms a dense cluster. According to the definition of [16], a dense cluster is defined with two parameters, $r$ and $MinPts$. We apply a modified version of the partitioning method that considers a multi-level spatiotemporal grid. Progressing from finer to coarser levels, one can find locally dense cells, which can later be combined with dense nearby grid cells to form clusters. The size of a cell at the lowest level is decided based on the imprecision degree of the moving points, which is presented below. In our case, $MinPts$ is equal to the value $min\_sup \times N_o$. So, if all regions in $S$ are dense, then $S$ is frequent.

Fig. 1 Spatiotemporal unit (labels: $M$, $\gamma$, $r$)

Problem definition: Given a database $DB$ of trajectories along with the maximum speed $v_{max}$ of the moving objects, the sampling rate $t$, a reference map $M \subseteq \mathbb{R}^2$ decomposed into regions accompanied by a directed graph $graph$, and a minimum support $min\_sup$, the problem is (1) to discover all frequent movement patterns from the database and (2) to discover patterns with syntactic constraints.

4 Process of discovering behavioral movement patterns

4.1 Movement summarization

To make the representation of a trajectory more precise, we need to re-sample the moving points. The sampling error across time was proved to be an ellipse [15], given the object’s maximal speed $v_{max}$ and two consecutive moving points. The error ellipse is used as a measure of the size of the sampling error per line segment. In the worst case, the error is a circle, and this is the case we deal with here. To make the operation simpler and more flexible, we operate on its minimum bounding rectangle (MBR), which is also the cell of the map explained below.

For a grid threshold $r$ and without time, the reference map $M$ is decomposed into an $n_x \times n_y$ array of equal-sized cells. When time is included, $M$ is decomposed into uniform spatiotemporal units (see Fig. 1). The choice of cell size $r$ affects the accuracy of the obtained result. In fact, the object’s maximum velocity $v_{max}$ and the chosen re-sampling rate $\rho$ influence this choice. The re-sampling rate and cell size must be selected so that a trajectory produces at least one hit in each cell that it visits. As a rule of thumb, the parameters $r$ and $\rho$ must be selected such that $(v_{max}/\rho) \ll r\sqrt{2}$. Additionally, the temporal extent $\gamma$ is determined a priori and may change depending on the application. As a rule of thumb, it should be chosen such that $1 \ll \rho\gamma$, as $\rho\gamma$ is a measure of the expected number of hits per cell [14].


The reference map $M$, whose origin is the point with coordinates $(x_0, y_0)$, is represented as a regular grid and stored in an array $D[1:n_x, 1:n_y]$. Each element $D[i,j]$ corresponds to one cell $D_{ij}$, which is also a page to which the moving points are assigned. For a movement, we eliminate all consecutive points falling in the same cell and keep only the first point with its corresponding timestamp. Assume that after projecting all the moving points in the database $DB$ into cells (pages), we obtain the result presented in Fig. 2a. We find that two cells, $D_{20}$ and $D_{12}$, contain more than one point, so we remove the second point in each, (25,7) and (16,29), respectively. Finally, the preprocessed database is represented in Fig. 2b.
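The summarization step can be sketched as follows: each point is projected onto a grid cell, and consecutive points that fall in the same cell are collapsed to the first one. This is a minimal sketch under stated assumptions. The function names, the 0-based cell indices obtained with `floor`, the default cell size, and the sample coordinates are all illustrative choices rather than details taken from the paper.

```python
from math import floor
from typing import List, Tuple

Cell = Tuple[int, int]


def to_cell(x: float, y: float, x0: float, y0: float, r: float) -> Cell:
    """Map a coordinate to the (column, row) index of the grid cell containing it.

    Assumes square cells of side r and an origin at (x0, y0).
    """
    return (floor((x - x0) / r), floor((y - y0) / r))


def summarize(trajectory: List[Tuple[float, float, int]],
              x0: float = 0.0, y0: float = 0.0, r: float = 10.0
              ) -> List[Tuple[Cell, int]]:
    """Collapse runs of consecutive points in the same cell to their first point.

    trajectory: (x, y, timestamp) triples sorted by timestamp.
    Returns (cell, timestamp) pairs.
    """
    result: List[Tuple[Cell, int]] = []
    for x, y, t in trajectory:
        cell = to_cell(x, y, x0, y0, r)
        if result and result[-1][0] == cell:
            continue  # same cell as the previous kept point: drop it
        result.append((cell, t))
    return result


# Made-up trajectory: two points in cell (0, 0), two in (1, 0), one in (2, 0).
example = [(3, 4, 0), (7, 9, 1), (12, 9, 2), (14, 3, 3), (25, 3, 4)]
summary = summarize(example)
```

Only consecutive duplicates are removed, so an object that leaves a cell and later returns to it keeps both visits.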

4.2 Data set transformation

Physically, the data structure of each cell in the spatiotemporal sequence is constructed in the form $(D_{ij}, o_j, vt_i)$, in which $D_{ij}$ contains a pointer to the page $D[i,j]$ where the position of object $o_j$ at time $vt_i$ is stored. If the lifespans of all trajectories belong to the same weekday, we omit the date when representing timestamps. Figure 3 is an example of transforming time series of locations into spatiotemporal sequences with the minimum temporal interval $\gamma = 30$. Ultimately, the database of trajectories is converted into a set of spatiotemporal sequences, each associated with a distinct identifier $o_j$.

4.3 Strategy for mining all frequent patterns with syntactic constraints

The considered constraints include minimum gap, maximum gap, time window of validity of the pattern, and classes of frequent and confident rules.

We directly prune candidates that violate syntactic constraints while finding frequent patterns. The task is accomplished by extending the algorithm AllMOP; the method here is named cAllMOP. It takes as its input the set $MS$ of spatiotemporal sequences. The candidate generating mechanism of our technique is based on the breadth-first search strategy used by GSP, with an additional temporal join operation and a technique for pruning candidates. Moreover, due to the complexity of the data type here, a clustering method to control the dense regions of the patterns is exploited. The concept of a directed graph also helps avoid the creation of redundant candidate patterns.

The algorithm makes multiple passes, producing longer patterns on the basis of shorter ones, until no more patterns can be created. First, we explain how to find the frequent 1-patterns from which longer ones will be generated.

Unlike the concept of items defined in GSP, not only the labels of pattern regions but also their shapes and sizes play an important role in the process of frequent pattern discovery. The shape and size of a region change from pattern to pattern; they therefore need to be automatically adjusted at each pass. This is why the prior techniques cannot be directly applied to our problem. The issue is dealt with in the following way. First, the set of generalized trajectories is decomposed into groups of moving points, each $A_i = \{o_i^j \mid o_i^j \text{ is a position inside } a_i\}$ for one timestamp.


Fig. 3 Example spatiotemporal sequence: $\langle \ldots, (D_{31}, \text{9:00}), (D_{21}, \text{9:30}) \rangle$

The frequent 1-patterns are obtained by clustering the points in the groups $A_i$. Specifically, for each timestamp $vt_i$, we scan the set $MS$ to determine the frequency of each cell and keep only the frequent ones. Next, the consecutive dense cells belonging to the same region $a_i$ are merged into larger regions, which may be merged further to form clusters. The points lying in the sparse cells are assigned to the found clusters by applying a range query with diameter $r/\sqrt{2}$. The points belonging to no cluster are called outliers and are eliminated from the cells as soon as they are found. The empty cells are discarded at the same time. Frequent 1-patterns are maintained in the set $F_1$.
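The first two steps of this procedure, keeping the frequent cells and merging neighboring dense cells of the same region, might look like the sketch below for a single timestamp. Several things here are assumptions made for illustration: cells are (column, row) pairs, "neighboring" means the 8-neighborhood on the grid, and the names `dense_clusters`, `region_of`, and `min_pts` are hypothetical. The range query for points in sparse cells and the outlier removal are deliberately left out.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Cell = Tuple[int, int]


def dense_clusters(points: Iterable[Tuple[int, Cell]],
                   region_of: Dict[Cell, str],
                   min_pts: int) -> List[Set[Cell]]:
    """Group dense cells of one timestamp into clusters.

    points: (object_id, cell) observations at that timestamp.
    region_of: thematic region label of every cell.
    A cell is dense if at least min_pts distinct objects are in it.
    """
    objects_in: Dict[Cell, Set[int]] = defaultdict(set)
    for oid, cell in points:
        objects_in[cell].add(oid)
    dense = {c for c, objs in objects_in.items() if len(objs) >= min_pts}

    clusters: List[Set[Cell]] = []
    seen: Set[Cell] = set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over adjacent dense cells of the same region.
        cluster: Set[Cell] = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            cluster.add((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if (nb in dense and nb not in seen
                            and region_of.get(nb) == region_of.get((cx, cy))):
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters


# Made-up observations: cells (0,0) and (1,0) are dense and share region R1,
# cell (2,0) is dense but lies in R2, cell (5,5) holds a single point.
points = [(1, (0, 0)), (2, (0, 0)), (3, (1, 0)), (4, (1, 0)),
          (5, (5, 5)), (6, (2, 0)), (7, (2, 0))]
region_of = {(0, 0): "R1", (1, 0): "R1", (2, 0): "R2", (5, 5): "R3"}
clusters = dense_clusters(points, region_of, min_pts=2)
```

The single point in cell (5, 5) ends up in no cluster, which corresponds to the outlier case described above once the range query has also failed to attach it to a cluster.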

Figure 4a depicts a set of trajectories in a 2D space after the trajectory generalization operation. Assume that the maximal timestamp $T$ is 5 and the $min\_sup$ count is 2, which will be used in the illustration throughout the algorithm. In this example, the reference map $M$ consists of six regions denoted by $R_j$ ($0 \le j \le 5$). The numbers marked in each region $R_j$ index the cells belonging to that region. A cell is denoted by the combination of the region’s index and a number (e.g., the cell indexed by the number 1 in region $R_1$ is denoted as $R_{11}$, and the merge of $R_{11}$ and $R_{12}$ is denoted as $R_{112}$). Because we consider only the spatial relationship “full containment”, a cell is contained in just one region $R_j$. Therefore, each region $R_j$ can be represented by a set of distinct cells (e.g., $R_1$ is composed of the three cells $R_{11}$, $R_{12}$, $R_{13}$).

Moving points are first projected into cells, which have pointers to pages [...] trajectories at this point can be obtained by accessing the pages in which they are physically stored. From Fig. 4a, we can see that the starting point of object 1 at time $t_1$ lies logically in the cell $R_{13}$ and is physically stored in the element $D[2,2]$. Its next position at time $t_2$ falls in the cell $R_{01}$ and is stored in $D[3,2]$.

The dense regions are then found. Figure 5a shows the groups of moving points obtained after partitioning the trajectories in Fig. 4a. Different groups are denoted by different shapes of points. Consider the example of finding dense regions at time $t_4$. We found three dense cells $R_{31}$, $R_{32}$, $R_{33}$ referring to three different pages, namely $D[3,1]$, $D[4,4]$, and $D[5,1]$. Since all of them are neighboring cells and belong to the same spatial region $R_3$, they are merged to create a cluster $R_3$. However, the cluster $R_3$ still points to the three pages corresponding to the original cells that created it. That means regions are logically combined while physical pages are preserved. In contrast to $R_3$, at time $t_4$ only one point falls into the cell $R_{41}$ of the region $R_4$. This point also cannot be assigned to any existing cluster, so it is an outlier and is discarded. The same operation is performed for the other groups of points, and the set of all frequent 1-patterns $F_1$ is finally obtained and displayed in Fig. 5b.


Fig. 5 Example of frequent 1-patterns: (a) the result $F_1$; (b) clustering points to find dense regions

In the following, we first see how the algorithm AllMOP works without any constraints. This algorithm forms the basis for cAllMOP.

For $k = 2$, candidate patterns are created by making a temporal join of $F_1$ with itself. Let $(a_i, t_i)$ and $(a_j, t_j)$ be two 1-patterns in $F_1$. A candidate pattern, for example $\langle (a_i, t_i), (a_j, t_j) \rangle$, is created if $t_j > t_i$ and the regions $a_i$ and $a_j$ are neighbors, as determined by the map and the graph.
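The enumeration of candidate 2-patterns can be sketched as a filtered self-join of $F_1$. The sketch below only covers the pairing rule stated above and leaves out the point-level temporal join and the re-clustering. The function name, the representation of the directed graph as a dictionary of successor sets, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, int]  # (region label, timestamp)


def candidate_2_patterns(f1: List[Event],
                         graph: Dict[str, Set[str]]) -> List[Tuple[Event, Event]]:
    """Pair frequent 1-patterns that respect time order and region adjacency.

    graph[a] is the set of regions reachable from region a in one move
    (hypothetical encoding of the directed graph on the map).
    """
    return [
        ((ai, ti), (aj, tj))
        for ai, ti in f1
        for aj, tj in f1
        if tj > ti and aj in graph.get(ai, set())
    ]


# Made-up input: R0 is reachable from R1, and R2 is reachable from R0.
f1 = [("R1", 1), ("R0", 2), ("R2", 2)]
graph = {"R1": {"R0"}, "R0": {"R2"}}
candidates = candidate_2_patterns(f1, graph)
```

In this example only the pair starting in R1 and ending in R0 survives: R2 is not adjacent to R1, and the R0 and R2 events share a timestamp, so they violate the time order.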

Then, for $k > 2$, candidate patterns are generated in this way: given a set $F_{k-1}$ of $(k-1)$-patterns, the candidates for the next pass are enumerated by making a temporal join of $F_{k-1}$ with itself. A pattern $s_1 = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_{k-1}, t_{k-1}) \rangle$ joins with [...]

To facilitate candidate generation, MBRs of the pattern regions are exploited. If all intersections of those pairs are empty, the created candidate pattern will be in the form $cand = \langle (a_1, t_1), (c_2, t_2), \ldots, (c_{k-1}, t_{k-1}), (b_k, t$ [...] criteria $R_i.Oid = R_j.Oid$ (or $Cand = R_i \bowtie_{R_i.Oid = R_j.Oid \text{ and } R_i.vt_i = R_j.vt_j} R_j$). That means the joined points must comply with time order and belong to the same trajectory. The support value is then the number of objects $o_j$ that comply with the candidate pattern.

It is noticeable that the regions of the candidate pattern might not be dense anymore after such a join operation. This problem can be handled by performing a validation. If the support value is at least $min\_sup$, then the regions of a $k$-candidate pattern ($k \ge 2$) must be re-clustered to validate whether they are dense. The points of some candidate pattern may be grouped into more than one cluster. If so, a new candidate pattern is created for each cluster. Ultimately, it is possible to get more than one actual frequent pattern from a candidate pattern $cand$. During the re-clustering process, some points may be removed because they are outliers with respect to the new clusters.

The final aspect that plays an important role in the discovery of frequent patterns is how to prune candidates. We exploit the closure property in [1], that is, all subpatterns of a frequent pattern are also frequent.

In addition, to tackle this problem we maintain a list of minimal infrequent patterns in memory, called MinInfreqList. This list is initialized with all infrequent 2-patterns. Each time a new candidate pattern is generated, we check whether it is a superpattern of any pattern in MinInfreqList. If so, it is discarded right away without carrying out the temporal join. An infrequent pattern $cand$ is inserted into this list when all of the following conditions hold: $cand$ does not currently exist in MinInfreqList; $cand$ is not a superpattern of any pattern in this list; and $cand$, generated after the temporal join operation, was found to be infrequent. After insertion, we remove all superpatterns of $cand$ from the list. The structure is kept as a linked list sorted in increasing order of pattern length.
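The bookkeeping around MinInfreqList can be sketched as below. This is an illustrative simplification, not the paper's implementation: patterns are tuples of (region, timestamp) events, a subpattern is taken to be an order-preserving subsequence of identical events (region overlap is ignored), and a sorted Python list stands in for the linked list. The class and method names are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]
Pattern = Tuple[Event, ...]


def is_subpattern(small: Pattern, big: Pattern) -> bool:
    """True if the events of `small` occur in `big` in the same order."""
    it = iter(big)
    # `ev in it` advances the iterator until ev is found, preserving order.
    return all(ev in it for ev in small)


class MinInfreqList:
    """Minimal infrequent patterns, kept sorted by increasing length."""

    def __init__(self, infrequent_2_patterns: Iterable[Pattern] = ()):
        self.patterns: List[Pattern] = []
        for p in infrequent_2_patterns:
            self.insert(p)

    def prunes(self, cand: Pattern) -> bool:
        """True if cand is a superpattern of (or equal to) a stored pattern."""
        return any(is_subpattern(p, cand) for p in self.patterns)

    def insert(self, cand: Pattern) -> bool:
        """Insert an infrequent candidate unless a stored pattern already covers it."""
        if self.prunes(cand):
            return False
        # Superpatterns of cand are no longer minimal, so drop them.
        self.patterns = [p for p in self.patterns if not is_subpattern(cand, p)]
        self.patterns.append(cand)
        self.patterns.sort(key=len)
        return True


# Usage with made-up patterns.
ab = (("A", 1), ("B", 2))
abc = (("A", 1), ("B", 2), ("C", 3))
mil = MinInfreqList([ab])
skip_join = mil.prunes(abc)  # abc contains the infrequent ab
```

Calling `prunes` before the temporal join is what saves work: a candidate that contains a known infrequent pattern is discarded without touching the point data.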

Now we consider an example of generating candidate $k$-patterns. With the concept of the graph and time order compliance, the total number of candidate 2-patterns generated by AllMOP is 18, as illustrated in Fig. 6.

Next, we perform a temporal join on the regions of every candidate pattern in $C_2$. First, consider the candidate pattern $\langle (R_{123}, t_1), (R_{01}, t_2) \rangle$ shown in Fig. 7a. After joining the points of $R_{123}$ with those of $R_{01}$, the size of cluster $R_{01}$ is maintained, but cluster $R_{123}$ shrinks to $R_{13}$. Second, consider the candidate pattern $\langle (R_{23}, t_3), (R_3, t_4) \rangle$ depicted in Fig. 7b. After joining the points of $R_{23}$ with those of $R_3$, the region $R_{23}$ is still the same, but the remaining points of $R_3$ are grouped into two clusters, $R_{31}$ and $R_{33}$. Therefore, two new patterns are created, namely $\langle (R_{23}, t_3), (R_{31}, t_4) \rangle$ and $\langle (R_{23}, t_3), (R_{32}, t_4) \rangle$. With the same operation applied to the rest of $C_2$, all frequent 2-patterns $F_2$ are obtained.

In this example, the process repeats the join operation until all frequent 5-patterns are found.

Algorithm cAllMOP includes an added constraint check in the pattern discovery process. Because the time difference between events is the main concern when checking timing constraints, we simplify the timestamp representation as follows. We fix the hour elements in the events’ timestamps and suppose the minute elements are as specified in Table 1.

An issue related to candidate pruning when imposing constraints on the events of a pattern is whether the closure property still holds and whether the candidate pruning strategy presented above remains valid. A modification of this property can be stated for this case.

[...] result in a combinatorial blow-up in the number of frequent patterns. We therefore need to restrict the maximum allowable length of a pattern to make the task tractable. Imposing a restriction on the lengths of patterns poses no difficulty. We need only add a check to see whether $length(S) \le \xi$, where $\xi$ is the user-predefined maximum length. Length has no effect on the closure property; the restriction is thus closure property preserving.

[...] time difference between consecutive events of a pattern. First, we find the frequent patterns in which events occur after a given interval. The minimum timing constraint is closure property preserving, so handling it is straightforward. Let $(A, t_A) \to S$ and $S \to (B, t_B)$ be two frequent patterns in which $A$ and $B$ are two different events, and the symbol $\to$ simply denotes the happens-after relationship between an event and a sequence. If the candidate pattern $(A, t_A) \to S \to (B, t_B)$ is frequent with minimum gap at least $\tau$, then obviously $A$ and $S$ must be $\tau$ apart (i.e., $\forall (a_i, t_i) \in S$, [...] the inequality below). In other words, $(A, t_A) \to S \to (B, t_B)$ is infrequent if it contains an infrequent subpattern.

Since $S$ remains self-contained, all we need to do here is add a minimum gap $\tau$ check in the join operation. Two cases need considering in pattern generation. For $k = 2$, a candidate pattern $\langle (a_i, t_i), (a_j, t_j) \rangle$ is only generated from joining $(a_i, t_i)$ and $(a_j, t_j)$ if $t_j - t_i > \tau$ and the regions $a_i$ and $a_j$ are neighbors. For $k > 2$, the candidate pattern $(A, t_A) \to S \to (B, t_B)$ is obtained by joining $(A, t_A) \to S$ and $S \to (B, t_B)$, with $S = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_n, t_n) \rangle$, if $t_{a_1} - t_A > \tau$ and $t_B - t_{a_n} > \tau$.
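For the case $k > 2$, the minimum gap check inside the join might be sketched as follows. The sketch works on pattern level only, with patterns as tuples of (region, timestamp) events and timestamps in minutes. The function name `join_with_min_gap` and the sample values are hypothetical, and the point-level temporal join and re-clustering are not shown.

```python
from typing import Optional, Tuple

Event = Tuple[str, int]
Pattern = Tuple[Event, ...]


def join_with_min_gap(left: Pattern, right: Pattern, tau: int) -> Optional[Pattern]:
    """Join left = (A, tA) -> S with right = S -> (B, tB) under a minimum gap.

    Returns the candidate (A, tA) -> S -> (B, tB), or None if the two patterns
    do not share S or if a boundary gap is not strictly greater than tau.
    """
    if len(left) < 2 or left[1:] != right[:-1]:
        return None  # the patterns do not overlap on a common, non-empty S
    shared = left[1:]
    t_a = left[0][1]
    t_b = right[-1][1]
    if shared[0][1] - t_a > tau and t_b - shared[-1][1] > tau:
        return left + (right[-1],)
    return None


# Made-up 3-patterns sharing S = <(R2, 115), (R3, 130)>, with tau = 10.
left = (("R1", 100), ("R2", 115), ("R3", 130))
right_ok = (("R2", 115), ("R3", 130), ("R4", 145))
right_tight = (("R2", 115), ("R3", 130), ("R4", 140))
joined = join_with_min_gap(left, right_ok, tau=10)
rejected = join_with_min_gap(left, right_tight, tau=10)
```

The second join is rejected because the last gap equals $\tau$ exactly, and the condition as stated is a strict inequality. Gaps inside $S$ are not rechecked, on the grounds that both input patterns already satisfied the constraint.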

Figure 8 shows examples of the AllMOP-generated candidates that pass or fail the minimum gap constraint, assuming that $\tau = 10$. We observe that two candidates [...] $130 - 120 = 10 = \tau$. Consequently, cAllMOP returns two fewer frequent 2-patterns than AllMOP.

[...] maximum timing constraint on a pattern, the closure property no longer holds. Consider two frequent patterns $(A, t_A) \to S$ and $S \to (B, t_B)$ from which the pattern $(A, t_A) \to S \to (B, t_B)$ is created by the temporal join operation. Assuming [...]
