
DOCUMENT INFORMATION

Title: An Efficient Algorithm for Mining Maximal Co-location Pattern Using Instance Trees
Authors: Dai Phong Le, Cao Dai Pham, Van Tuan Luu, Vanha Tran, Dang Hai Nguyen
Institution: Le Quy Don Technical University
Field: Information and Computer Science
Type: Research Paper
Year: 2021
City: Hanoi
Pages: 6


An Efficient Algorithm for Mining Maximal Co-location Pattern Using Instance-trees

Dai Phong Le
Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam
ledaiphong.isi@lqdtu.edu.vn

Cao Dai Pham
Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam
daipc.isi@lqdtu.edu.vn

Van Tuan Luu
Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam
tuanlv.isi@lqdtu.edu.vn

Vanha Tran
Dept. of Information Technology Specialization, FPT University, Hanoi, Vietnam
hatv14@fe.edu.vn

Dang Hai Nguyen
Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam
nguyendanghai.mta@gmail.com

Abstract—Prevalent co-location patterns, which refer to groups of features whose instances frequently appear together in nearby geographic space, are one of the main branches of spatial data mining. As the data volume continues to increase, it is redundant to discover all patterns. Maximal co-location patterns (MCPs) are a compressed representation of all these patterns, and they provide new insight into the interaction among different spatial features for discovering more valuable knowledge from data sets. The increasing volume of spatial data sets makes discovering MCPs very challenging. We dedicate this study to designing an efficient MCP mining algorithm. First, the features in size-2 patterns are regarded as a sparse graph, and MCP candidates are generated by enumerating maximal cliques from this sparse graph. Second, we design two instance-tree structures, star neighbor-based and sibling node-based instance-trees, to store the neighbor relationships of instances. All maximal co-location instances of the candidates are yielded efficiently from these instance-tree structures. Finally, an MCP candidate is marked as prevalent if its participation index, which is calculated based on the maximal co-location instances, is not smaller than a minimum prevalence threshold given by users. The efficiency of the proposed algorithm is demonstrated by comparison with previous algorithms on both synthetic and real data sets.

Index Terms—data mining, maximal co-location pattern, star neighbor, instance-tree

I. INTRODUCTION

With the development of global positioning system (GPS)-enabled mobile and hand-held devices, many applications are designed based on geo-location data, e.g., peer-to-peer ridesharing, ride-hailing services, and food delivery. The valuable knowledge discovered from spatial data makes these application services increasingly accurate and enables them to provide personalized services. Prevalent co-location patterns (PCPs), which are groups of spatial features (e.g., hotels, restaurants, convenience stores in point-of-interest data) with their instances (e.g., a specific hotel, restaurant, or convenience store), are one of the main branches of spatial data mining.

Fig. 1 shows the distribution of a point-of-interest data set in Tokyo, Japan. As can be seen, the instances of hotels, restaurants, and convenience stores are frequently located in each other's neighborhoods. Four PCPs are formed in this data set: {Hotel, Restaurant}, {Hotel, Convenience store}, {Restaurant, Convenience store}, and {Hotel, Restaurant, Convenience store}. {Hotel, Restaurant, Convenience store} is an MCP because it does not have any super-PCPs. PCP mining has been proved to be an effective tool for discovering valuable knowledge from spatial data sets, and it is applied in many fields such as environmental management [1], mobile communications [2], social science [3], and location-based services [4].

Fig. 1: The distribution of a point of interest data set

If a PCP has no super-patterns, it is an MCP. It is challenging to discover MCPs when the numbers of features and instances are large and/or the distribution of data is dense. In this study, we focus on developing an efficient MCP mining algorithm by employing two efficient instance-tree structures. The two structures are designed to store the neighbor relationships of instances. Co-location instances of MCP candidates can be efficiently collected from these structures; therefore, the efficiency of the mining process can be improved.

The remainder of this study is organized as follows. Section 2 presents the problem statement and related work. The proposed algorithms are described in Section 3. Section 4 verifies the improved mining efficiency of our algorithms by experiments. Section 5 concludes this work.

2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Fig. 2: An example of co-location pattern mining. (f.i denotes the i-th instance of feature f; edges denote neighbor relationships. For each candidate c, the figure lists its table instance T(c), the participation ratio PR(c) of each feature, and the participation index PI(c).)

II. PROBLEM STATEMENT AND RELATED WORK

A. Problem statement

Given: (1) a set of spatial features F = {f1, ..., fm} and a set of their instances I = {I1, ..., Im}, where Ii (1 ≤ i ≤ m) corresponds to the instances of feature fi and each instance in Ii is a triple ⟨feature type, instance ID, location⟩; (2) a neighbor relationship R on the instance set I; R normally uses a Euclidean distance metric with a distance threshold d: if the distance between two instances that belong to different feature types is not larger than d, the two instances have a neighbor relationship; and (3) a minimum prevalence threshold minprev to evaluate the prevalence of a pattern.

A subset of F, c = {f1, ..., fk} (1 ≤ k ≤ m), is a size-k co-location pattern. A co-location instance I(c) of c is a set of instances whose members have the neighbor relationship R with each other. The set of all I(c) is called the table instance of c, T(c). The participation ratio of feature fi in c, denoted PR(c, fi), is the fraction of the instances of fi that participate in T(c). The participation index of c is PI(c) = min{PR(c, fi) | fi ∈ c}. If PI(c) is not smaller than minprev, c is marked as a PCP. If a PCP c has no prevalent super-patterns, c is called an MCP.

Fig. 2 shows an example of co-location pattern mining. There are five features, A, B, C, D, and E. The instances of A are A.1, A.2, A.3, and A.4. Assume that c = {A, B, C, D} is a candidate and the co-location instances of c are {A.2, B.2, C.4, D.1} and {A.3, B.1, C.2, D.3}. The participation ratio of each feature in c is PR(c, A) = 2/4, PR(c, B) = 2/4, PR(c, C) = 2/4, and PR(c, D) = 2/3. Thus, PI(c) = min{2/4, 2/4, 2/4, 2/3} = 0.5. If users set minprev = 0.4, then PI(c) > minprev, hence c is a prevalent CP. Similarly, {A, B}, {A, C}, {A, D}, {B, C}, {C, D}, {B, D}, {A, B, C}, {A, C, D}, and {B, C, D} are prevalent. Since {A, B, C, D} has no prevalent super-patterns, it is an MCP, while {A, B, C} is not an MCP since it has a prevalent super-pattern.

The problem of CP mining is discovering all PCPs from a given data set. Furthermore, to represent the patterns in the mining result compactly, the set of MCPs is required.
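For illustration, the participation-index check used in this example can be sketched in Python (the paper's implementation is in C++; the tuple-based encoding of the table instance is our own convention):

```python
from collections import defaultdict

def participation_index(candidate, table_instance, n_instances):
    """Compute PR(c, f) for every feature f in candidate c and
    PI(c) = min PR(c, f), as defined in Section II-A.

    table_instance: list of co-location instances, each a tuple of
    (feature, instance id) pairs; n_instances: total instance count
    per feature.
    """
    participating = defaultdict(set)
    for row in table_instance:
        for feature, inst_id in row:
            participating[feature].add(inst_id)
    ratios = {f: len(participating[f]) / n_instances[f] for f in candidate}
    return ratios, min(ratios.values())

# T({A, B, C, D}) and the instance counts from the Fig. 2 example.
T = [(('A', 2), ('B', 2), ('C', 4), ('D', 1)),
     (('A', 3), ('B', 1), ('C', 2), ('D', 3))]
counts = {'A': 4, 'B': 4, 'C': 4, 'D': 3}
ratios, pi = participation_index(('A', 'B', 'C', 'D'), T, counts)
print(pi)  # 0.5
```

With minprev = 0.4 the candidate passes the prevalence check, matching the worked example.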

B. Related work

Join-based [5] is known as the first algorithm in the PCP mining domain. It uses an expensive join operation to collect table instances. To tackle this weakness, many algorithms that no longer use join operations have been developed [6]–[8]. However, these algorithms have difficulty handling the increase in the volume of data; hence, many PCP mining algorithms for big data have been proposed [9]–[11]. The mining result normally contains a large number of PCPs, which are difficult for users to absorb, understand, and apply. Hence, the notion of MCPs was proposed. Yoo et al. [12] designed an MCP mining algorithm called MAXColoc. It converts instance neighborhood transactions to feature neighborhood transactions and then builds a feature type tree to generate candidates. The table instance of each candidate is collected by using a star instance mechanism. However, this mechanism becomes very time-consuming when data sets are dense or large, since it needs to examine the neighbor relationships of the instances in all subsets of each candidate.

An order-clique-based (OCB) approach for discovering MCPs was also developed [13]. The candidates are generated by using a P2-tree. For collecting co-location instances, it constructs two tree structures: a Neib-tree to save the neighbor relationships of instances and an Ins-tree to collect co-location instances. However, when data sets are dense or big, these trees become very luxuriant, and it takes a lot of time to copy all sub-trees of a candidate from the Neib-tree to the Ins-tree. It also needs to allocate a large amount of memory, since both trees must remain in memory during the whole mining process.

A sparse-graph and condensed tree-based (SGCT) algorithm [14] was developed recently to mine MCPs. The candidates are generated by using a maximal clique enumeration algorithm [15]. The table instance of each candidate is collected by using a hierarchical verification scheme to construct a condensed instance-tree. However, the scheme is a one-by-one inspection; it becomes very expensive when data is dense and the size of the candidates is long, and the performance of SGCT drops sharply.

To summarize the mentioned MCP algorithms, two aspects are of concern: (1) reducing the number of MCP candidates; (2) building various data structures to collect table instances efficiently. However, each algorithm has its own disadvantages when dealing with dense and/or large data sets. Regarding the first aspect, because in practical applications the number of features is small (generally within 100) [13], there is no difference in efficiency between the various methods of generating MCP candidates. We therefore devote our full attention to the second aspect by developing two instance-tree structures.

Fig. 3: The proposed mining framework. (Inputs: (1) a data set, (2) d, (3) minprev; phases: materialize neighbor relationships, find size-2 patterns, generate maximal candidates, construct instance-trees, calculate PIs and filter maximal patterns.)

Fig. 4: The relationship of features in size-2 CPs

III. THE PROPOSED ALGORITHMS

Fig. 3 shows the framework of the proposed algorithm. The first phase requires users to input a spatial data set, a distance threshold d, and a minimum prevalence threshold minprev. The neighbor relationships of instances are materialized under d in the second phase. The third phase finds size-2 PCPs. A set of MCP candidates is generated based on the size-2 patterns in the fourth phase. The fifth phase collects the table instance of each candidate by constructing an instance-tree. The sixth phase calculates participation indexes and filters prevalent MCPs. In this study, we mainly focus on the fifth phase, for which two efficient instance-trees are devised.

A. Star neighbors

Definition 1: The star neighbor (SN) of an instance iq is defined as the set of instances SN(iq) = {jp | jp has a neighbor relationship with iq, jp > iq, p ≠ q, 1 ≤ p, q ≤ m}, and iq is called the center instance.

For example, Table I lists the star neighbor of each instance in the data set shown in Fig. 2.

TABLE I: Star neighbors of instances in Fig. 2

Center instance   Star neighbor instances
A.1               C.3, D.2
A.2               B.2, C.4, D.1, E.2
A.3               B.1, B.4, C.2, D.3
A.4               B.3

∅: empty set.
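Materializing star neighbors under Definition 1 can be sketched as follows. This is a Python illustration, not the paper's implementation, and the coordinates below are hypothetical:

```python
from math import dist  # Python 3.8+

def star_neighbors(points, d):
    """Materialize the star neighbor (Definition 1) of every instance.

    points maps a label like 'A.2' (the 2nd instance of feature A) to
    its (x, y) location. A neighbor must belong to a different feature,
    lie within distance d, and sort after the center instance, so each
    neighbor pair is stored exactly once.
    """
    labels = sorted(points)
    sn = {p: [] for p in labels}
    for i, p in enumerate(labels):
        for q in labels[i + 1:]:
            same_feature = p.split('.')[0] == q.split('.')[0]
            if not same_feature and dist(points[p], points[q]) <= d:
                sn[p].append(q)
    return sn

# Hypothetical coordinates: A.2 lies within d of B.2 but not of B.1.
pts = {'A.2': (0, 0), 'B.2': (1, 0), 'B.1': (10, 10)}
print(star_neighbors(pts, 2))  # {'A.2': ['B.2'], 'B.1': [], 'B.2': []}
```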

B. Generating candidates

According to the anti-monotonicity property of PCPs [6], if a size-k (k > 2) pattern c is prevalent, all size-2 patterns generated by the features in c must be prevalent. Hence, size-k candidates can be generated based on size-2 PCPs. It is easy to see that the relationship of features in size-2 PCPs can be plotted as an undirected graph G2F whose nodes are the features in the size-2 PCPs and whose edges are these size-2 PCPs.

Definition 2: A size-2 feature graph G2F(V, E) consists of a set of vertices V = {fi | fi is a feature of a size-2 PCP} and a set of edges E = {(fi, fj) | {fi, fj} is a size-2 PCP}.

For example, for the data set in Fig. 2, if users set minprev = 0.2, the size-2 PCPs are {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {C, D}, and {C, E}. Fig. 4 illustrates the G2F graph constructed from these size-2 PCPs. It can be seen that the MCP candidates are exactly the maximal cliques enumerated in G2F.

To enumerate all maximal cliques from G2F, we employ a maximal clique enumeration algorithm developed in [12], [15]. Algorithm 1 describes the process of generating MCP candidates from size-2 PCPs, where Γ(fi) is the set of vertices directly connected to fi. For details of Algorithm 1, please refer to [15].

For example, running Algorithm 1 on Fig. 4, two maximal cliques are yielded: {A, B, C, D} and {A, C, E}. The two are considered as candidates for discovering prevalent MCPs.

Algorithm 1: Generating candidate maximal patterns
Input: an undirected graph constructed by the size-2 prevalent patterns, G2F(V, E);
Output: a set of candidate maximal patterns, CMPs;
1  Initialize P = V, Q = ∅, X = ∅;
2  for fi in a degeneracy ordering f1, ..., fm of (V, E) do
3      P = Γ(fi) ∩ {fi+1, ..., fm};
4      X = Γ(fi) ∩ {f1, ..., fi−1};
5      BronKerboschPivot(P, {fi}, X);
7  BronKerboschPivot(P, Q, X):
8      if P ∪ X = ∅ then
9          CMPs.add(Q);
11     Choose a pivot u in P ∪ X with |P ∩ Γ(u)| maximal;
12     for v ∈ P \ Γ(u) do
13         BronKerboschPivot(P ∩ Γ(v), Q ∪ {v}, X ∩ Γ(v));
14         P = P \ {v};
15         X = X \ {v};
17 return CMPs;
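A compact Python sketch of this candidate generation step, following the Bron-Kerbosch-with-pivoting scheme of [15] that Algorithm 1 uses (the adjacency-dict input format is our own assumption):

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph given as
    an adjacency dict {vertex: set of neighbors}, using Bron-Kerbosch
    with pivoting seeded by a degeneracy ordering."""
    cliques = []

    def expand(P, Q, X):
        if not P and not X:
            cliques.append(Q)            # Q is maximal
            return
        u = max(P | X, key=lambda w: len(P & adj[w]))  # pivot
        for v in list(P - adj[u]):
            expand(P & adj[v], Q | {v}, X & adj[v])
            P.remove(v)
            X.add(v)

    # Degeneracy ordering: repeatedly remove a minimum-degree vertex.
    order, remaining = [], {v: set(n) for v, n in adj.items()}
    while remaining:
        v = min(remaining, key=lambda w: len(remaining[w]))
        order.append(v)
        for n in remaining.pop(v):
            remaining[n].discard(v)
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        later = {n for n in adj[v] if pos[n] > pos[v]}
        expand(later, {v}, adj[v] - later)
    return cliques

# The size-2 PCP graph of Fig. 4.
edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'E'),
         ('B', 'C'), ('B', 'D'), ('C', 'D'), ('C', 'E')]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
print(sorted(map(sorted, maximal_cliques(adj))))
# [['A', 'B', 'C', 'D'], ['A', 'C', 'E']]
```

On the Fig. 4 graph this yields exactly the two candidates of the example.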

C. A star neighbor-based instance-tree to collect co-location instances

Definition 3: A star neighbor-based instance-tree (STN-IT) of a candidate is defined as follows: (1) the tree has one root; the root is a center instance determined by an item in the SNs; (2) each node is an instance in the star neighbor of the center instance; (3) a qualified node belongs to the intersection of the star neighbor of its parent with the star neighbor of the center instance; (4) the tree-depth of an STN-IT is equal to k − 1, where k is the size of the candidate.

Algorithm 2 shows the pseudocode for constructing a star neighbor-based instance-tree to collect the table instance of an MCP candidate c. The first phase initializes a star neighbor-based instance-tree STNIT by using an item it whose center instance has the feature type of the first feature in c and whose star neighbor contains all the remaining feature types in c. Then, the instances in the star neighbor whose feature type is equal to the second feature in c are added as children (Step 1). A variable depth is defined as the tree-depth. The second phase iterates over each leaf leaf in STNIT and gets the intersection of it and the star neighbor of leaf (Steps 2-5). The third phase adds the result of the intersection as the children of leaf (Steps 6-7). If the intersection is empty, the leaf is removed (Step 9). The fourth phase deletes all leaves whose depths are smaller than the size of c (Step 10).

For example, SN(A.2) = {B.2, C.4, D.1, E.2}, and it is a satisfied item of c = {A, B, C, D}. The STN-IT of A.2 is plotted in Fig. 5. First, A.2 is added as the root and B.2 is added as a child of A.2 (Fig. 5a). Then, the intersection of it and the star neighbor of B.2 is required: com = it ∩ SN(B.2) = {B.2, C.4, D.1, E.2} ∩ {C.4, D.1} = {C.4, D.1}, thus C.4 and D.1 are added as children of B.2 (Fig. 5b). Fig. 5c shows the tree after appending the children of C.4. Next, D.1, which is a child of B.2, is deleted since it is a leaf whose depth is smaller than the size of c (k = 4). Finally, {A.2, B.2, C.4, D.1} is regarded as a co-location instance of {A, B, C, D}.

Algorithm 2: Constructing a STN-IT tree
Input: a candidate maximal pattern, c; an item it in the SCSNIs of c; the star neighbors, SN;
Output: a star neighbor-based instance-tree, STNIT;
1  STNIT = initialTree(c, it);
2  depth = STNIT.getDepth;
3  while depth < k do
4      for leaf ∈ STNIT.getLeaves do
5          if leaf.getDepth == depth then
6              com = getIntersection(it, SN(leaf));
14     depth = STNIT.getDepth;
16 STNIT = refinementTree(STNIT);
17 return STNIT;
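The branch growing of Algorithm 2 can be approximated by a short Python sketch. It is a simplified reading of Definition 3 that carries the running intersection down each branch rather than materializing tree nodes, and the star neighbors below follow the running example (SN(C.4) is assumed to be {D.1}):

```python
def stn_collect(center, it, candidate, SN):
    """Collect the co-location instances contributed by one center
    instance, in the spirit of the STN-IT: level i of a branch may
    only hold instances of candidate[i], and the children of a node
    come from intersecting the center's star neighbor `it` with the
    star neighbors along the branch."""
    feature = lambda inst: inst.split('.')[0]
    rows = []

    def grow(branch, pool, level):
        if level == len(candidate):      # a depth-k branch is one row
            rows.append(tuple(branch))
            return
        for node in sorted(pool):
            if feature(node) == candidate[level]:
                grow(branch + [node], pool & SN[node], level + 1)

    grow([center], set(it), 1)
    return rows

# The running example: it = SN(A.2) for candidate {A, B, C, D}.
SN = {'B.2': {'C.4', 'D.1'}, 'C.4': {'D.1'}, 'D.1': set()}
print(stn_collect('A.2', {'B.2', 'C.4', 'D.1', 'E.2'},
                  ('A', 'B', 'C', 'D'), SN))
# [('A.2', 'B.2', 'C.4', 'D.1')]
```

The single surviving branch matches the co-location instance found in the worked example.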

D. A sibling node-based instance-tree to collect co-location instances

The star neighbor-based instance-tree is constructed for each instance in the satisfied candidate star neighbor items. In this section, a new instance-tree that deals with all the instances simultaneously is designed.

Fig. 5: The star neighbor-based instance-tree of A.2. (a) Initialize a STN-IT tree. (b) Add the children of B.2. (c) Add the children of C.4. (d) The final STN-IT tree.

Definition 4: A sibling node-based instance-tree (SBN-IT) of a candidate is defined as follows: (1) the tree has one root named Root; (2) the children of the root are the center instances in the SNs; (3) a qualified node belongs to the intersection of its star neighbor with its sibling nodes; (4) the tree-depth of an SBN-IT is equal to the size of the candidate.

Algorithm 3 describes the pseudocode of the process of constructing an SBN-IT. A sibling node-based instance-tree, SBNIT, is initialized in the first phase. The children of the root are all the center instances in the satisfied candidate star neighbor items, and all the star neighbors of the center instances are added as children of them (Step 1). The second phase iterates over each leaf of the SBN-IT to get the intersection between the sibling nodes and the star neighbor of the leaf (Steps 3-6). Note that only a leaf whose feature type is equal to the feature at index (depth − 1) in the candidate, where depth is the tree-depth of the SBN-IT, is processed; other leaves can be directly deleted. In the third phase, the intersection is appended as children of the leaf if it is not empty (Steps 7-8). Finally, a refinement function is called to delete all leaves whose depths are not equal to the size of the candidate (Step 14).

Algorithm 3: Constructing a SBN-IT tree
Input: a candidate maximal pattern, c; an item it in the SCSNIs of c; all items in the SCSNIs, items;
Output: a sibling node-based instance-tree, SBNIT;
1  SBNIT = initialTree(c, items);
2  depth = SBNIT.getDepth;
3  while depth ≤ k do
4      for leaf ∈ SBNIT.getLeaves do
5          if leaf.feature == c[depth − 1] then
6              sibl = getSibling(leaf);
7              com = getIntersection(sibl, SN(leaf));
14         SBNIT.delete(leaf);
17     depth = SBNIT.getDepth;
19 SBNIT = refinementTree(SBNIT);
20 return SBNIT;

Fig. 6: Constructing SBN-IT for {A, B, C, D}. (a) Initialize a SBN-IT tree. (b) Add children. (c) The final SBN-IT tree.

For example, Fig. 6 presents the process of constructing the sibling node-based instance-tree for the candidate pattern c = {A, B, C, D}. First, the satisfied candidate star neighbor items are {A.2: B.2, C.4, D.1, E.2} and {A.3: B.1, B.4, C.2, D.3}. In the first phase, an SBN-IT tree is constructed as shown in Fig. 6a, with a root and A.2, A.3 as the two children of the root. All instances in the star neighbors of A.2 and A.3 are added as children of A.2 and A.3, respectively. Iterating over each leaf in the SBN-IT tree, for example B.2, we get the sibling nodes of B.2, sibl = {C.4, D.1, E.2}, and the star neighbor of B.2, SN(B.2) = {C.4, D.1}; then com = sibl ∩ SN(B.2) = {C.4, D.1}. Thus, C.4 and D.1 are added as children of B.2. In the next iteration, C.4 is considered. Since the feature type of C.4 is C and it is different from c[depth − 1] = B (the tree-depth now is 2), C.4 is directly removed. Fig. 6b shows the result when all sibling nodes of feature B are processed. A complete SBN-IT tree is plotted in Fig. 6c. It can be seen that each branch of the tree is a co-location instance of the candidate.
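The distinguishing point of the SBN-IT, handling all satisfied center instances in one pass, can be sketched with a level-wise frontier (again a Python illustration; the lower-level star neighbors below are assumed values consistent with Fig. 2):

```python
def sbn_collect(sat_items, candidate, SN):
    """Collect co-location instances of all satisfied center instances
    at once, mimicking the SBN-IT: the frontier holds (branch, sibling
    set) pairs; at each depth only leaves of the required feature
    survive, and their children are the intersection of their siblings
    with their own star neighbor."""
    feature = lambda inst: inst.split('.')[0]
    frontier = [([c], set(sn)) for c, sn in sat_items.items()]
    for depth in range(1, len(candidate)):
        nxt = []
        for branch, sibl in frontier:
            for leaf in sorted(sibl):
                if feature(leaf) == candidate[depth]:  # prune other features
                    nxt.append((branch + [leaf], sibl & SN[leaf]))
        frontier = nxt
    return [tuple(branch) for branch, _ in frontier]

# The two satisfied items of the Fig. 6 example.
items = {'A.2': {'B.2', 'C.4', 'D.1', 'E.2'},
         'A.3': {'B.1', 'B.4', 'C.2', 'D.3'}}
SN = {'B.2': {'C.4', 'D.1'}, 'B.1': {'C.2', 'D.3'},
      'B.4': {'C.3', 'D.2', 'D.3'}, 'C.2': {'D.3'}, 'C.4': {'D.1'},
      'D.1': set(), 'D.3': set()}
print(sbn_collect(items, ('A', 'B', 'C', 'D'), SN))
# [('A.2', 'B.2', 'C.4', 'D.1'), ('A.3', 'B.1', 'C.2', 'D.3')]
```

Both branches of the final SBN-IT tree in Fig. 6c are recovered in a single pass; the B.4 branch dies because no feature-C sibling survives its intersection.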

IV. COMPUTATIONAL EXPERIMENTS

A set of experiments is designed to evaluate the performance of the proposed algorithm. When the framework in Fig. 3 uses the star neighbor-based instance-tree and the sibling node-based instance-tree, we name the mining algorithm MCPM-STN-IT and MCPM-SBN-IT, respectively. SGCT [14] is chosen for comparison with our algorithms since it is the most recent MCP mining algorithm and has been proven to be superior to MAXColoc [12] and OCB [13]. All algorithms are implemented in C++ and run on an Intel Core i7-3770 3.40 GHz PC running Windows 7 with 16 GB main memory.

A. Data sets

Two synthetic data sets are generated by a synthetic data generator similar to that of [5]. The numbers of features and instances of the two are 50 and 20,000, respectively. The spatial areas are set to 500×500 for dense data and 1000×1000 for sparse data. Moreover, there are two real POI data sets in our experiments. They are collected from facilities such as banks, parking lots, and hotels in Guangzhou (49,566 instances, 44 features) and Shanghai (67,824 instances, 50 features), China. Their distributions are plotted in Fig. 7.

B. Performance study

Fig. 7: The distribution of (a) Guangzhou (b) Shanghai

a) Effectiveness: Table II lists the execution time of each phase of each algorithm. For the sparse data set, the distance and prevalence thresholds are set to 16 and 0.4, respectively. The two thresholds are set to 13 and 0.6, respectively, when the dense data set is used. It can be seen that: (1) the proportion of time spent generating MCP candidates is very small in the total cost; (2) the largest fraction of the computation time is devoted to constructing instance-trees to collect table instances. The neighbor relationships of instances in SGCT are verified one by one, so it takes more execution time than the proposed algorithms, which effectively reduce the search space. The gap in computation time becomes larger when data sets are dense.

TABLE II: The execution time (s) of each phase of the algorithms

                            SGCT              MCPM-STN-IT       MCPM-SBN-IT
Phase                       Sparse   Dense    Sparse   Dense    Sparse   Dense
Gen. neighbors              0.159    0.247    0.149    0.212    0.195    0.21
Find size-2 patterns        0.321    0.625    0.28     0.498    0.287    0.494
Gen. candidates             0.003    0.003    0.003    0.003    0.003    0.003
Constr. instance-trees      29.541   316.139  1.137    124.912  1.924    19.889
Calc. PIs, filter patterns  0.294    1.537    0.068    0.403    0.081    0.324
Total                       30.318   318.551  1.637    126.028  2.490    20.920

Fig. 8: The execution times on different numbers of instances. (a) Sparse. (b) Dense.

b) Scalability: First, we compare the effect of different numbers of instances. As shown in Fig. 8, as the number of instances increases, the proposed algorithms show better performance.

Fig. 9: The scalability with different distance thresholds on (a) synthetic sparse data, (b) dense data, (c) Guangzhou, (d) Shanghai.

Fig. 10: The scalability with different prevalence thresholds on (a) sparse data, (b) dense data, (c) Guangzhou, (d) Shanghai.

Second, we evaluate the performance of the proposed algorithms under different distance thresholds. Figs. 9a and 9b show the results on the synthetic data sets when the prevalence thresholds are fixed at 0.4 and 0.6 for the sparse and dense data sets, respectively. Figs. 9c and 9d compare the computation time of these algorithms on the two real data sets with the prevalence threshold set to 0.4. As can be seen, MCPM-STN-IT and MCPM-SBN-IT take less execution time.

Third, the scalability of the proposed algorithms in terms of the minimum prevalence threshold is examined. We set the distance thresholds to 20 and 13 for the sparse and dense data sets, respectively. Figs. 10a and 10b show the results. Overall, with the increase of the prevalence threshold, the execution times of all algorithms decrease. However, SGCT takes more execution time at small values of the prevalence threshold. When the proposed algorithms are performed on the two real data sets, the distance thresholds are set to 300 m and 250 m, respectively. The comparison of the execution times is shown in Figs. 10c and 10d. As can be seen, the proposed algorithms show better performance.

V. CONCLUSION AND FUTURE WORK

Two efficient instance-trees named STN-IT and SBN-IT are designed in this study to collect the table instances of candidate maximal co-location patterns. The two instance-tree structures effectively reduce the search space when examining the neighbor relationships of instances. Fast construction of the instance-trees increases the speed of collecting table instances; therefore, the performance of discovering maximal co-location patterns is improved. Experiments on both synthetic and real data sets show that the proposed algorithm is more efficient than the existing algorithms.

REFERENCES

[1] W. Liu, Q. Liu, M. Deng, J. Cai, and J. Yang, "Discovery of statistically significant regional co-location patterns on urban road networks," International Journal of Geographical Information Science, pp. 1–24, 2021.
[2] V. Tran, L. Wang, and H. Chen, "Discovering spatial co-location patterns by automatically determining the instance neighbor," in Fuzzy Systems and Data Mining V. IOS Press, 2019, pp. 583–590.
[3] Z. He, M. Deng, Z. Xie, L. Wu, Z. Chen, and T. Pei, "Discovering the joint influence of urban facilities on crime occurrence using spatial co-location pattern mining," Cities, vol. 99, p. 102612, 2020.
[4] V. Tran and L. Wang, "Delaunay triangulation-based spatial colocation pattern mining without distance thresholds," Statistical Analysis and Data Mining, vol. 13, no. 3, pp. 282–304, 2020.
[5] Y. Huang, S. Shekhar, and H. Xiong, "Discovering colocation patterns from spatial data sets: a general approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 12, pp. 1472–1485, 2004.
[6] J. S. Yoo and S. Shekhar, "A joinless approach for mining spatial colocation patterns," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1323–1337, 2006.
[7] V. Tran, L. Wang, and L. Zhou, "A spatial co-location pattern mining framework insensitive to prevalence thresholds based on overlapping cliques," Distributed and Parallel Databases, pp. 1–38, 2021.
[8] V. Tran, L. Wang, and L. Zhou, "Mining spatial co-location patterns based on overlap maximal clique partitioning," in 20th IEEE International Conference on Mobile Data Management, 2019, pp. 467–472.
[9] A. M. Sainju and Z. Jiang, "Mining colocation from big geo-spatial event data on GPU," 2021.
[10] J. S. Yoo, D. Boulware, and D. Kimmey, "Parallel co-location mining with MapReduce and NoSQL systems," Knowledge and Information Systems, pp. 1–31, 2019.
[11] A. M. Sainju, D. Aghajarian, Z. Jiang, and S. Prasad, "Parallel grid-based colocation mining algorithms on GPUs for big spatial event data," IEEE Transactions on Big Data, vol. 6, no. 1, pp. 107–118, 2018.
[12] J. S. Yoo and M. Bow, "Mining maximal co-located event sets," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2011, pp. 351–362.
[13] L. Wang, L. Zhou, J. Lu, and J. Yip, "An order-clique-based approach for mining maximal co-locations," Information Sciences, vol. 179, no. 19, pp. 3370–3382, 2009.
[14] X. Yao and L. Peng, "A fast space-saving algorithm for maximal co-location pattern mining," Expert Systems with Applications, vol. 63, pp. 310–323, 2016.
[15] D. Eppstein and D. Strash, "Listing all maximal cliques in large sparse real-world graphs," in International Symposium on Experimental Algorithms. Springer, 2011, pp. 364–375.
