for subgraph isomorphism. Procedure Search(𝑖) iterates on the 𝑖-th node to find feasible mappings for that node. Procedure Check(𝑢𝑖, 𝑣) examines whether 𝑢𝑖 can be mapped to 𝑣 by considering their edges. Line 12 maps 𝑢𝑖 to 𝑣. Lines 13–16 continue to search for the next node or, if it is the last node, evaluate the graph-wide predicate. If it is true, then a feasible mapping 𝜙 : 𝑉(𝒫) → 𝑉(𝐺) has been found and is reported (line 15). Line 16 stops searching immediately if only one mapping is required.
The graph pattern and the graph are represented as a vertex set and an edge set, respectively. In addition, adjacency lists of the graph pattern are used to support line 21. For line 22, the edges of graph 𝐺 can be represented in a hashtable whose keys are pairs of end points. To avoid repeated evaluation of edge predicates (line 22), another hashtable can be used to store pairs of edges that have already been evaluated.
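The edge-hashtable representation above can be sketched as follows; this is an illustrative Python sketch with assumed names, not the authors' actual implementation:

```python
def build_edge_table(edges):
    """Hash-based edge index: O(1) lookup of whether an edge exists,
    as needed by the edge check in line 22 of Algorithm 4.1."""
    table = set()
    for a, b in edges:
        # store both orientations so an undirected edge can be probed either way
        table.add((a, b))
        table.add((b, a))
    return table

edge_table = build_edge_table([(1, 2), (2, 3)])
```

A membership test such as `(2, 1) in edge_table` then replaces a scan over the edge set.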
The worst-case time complexity of Algorithm 4.1 is 𝑂(𝑛^𝑘), where 𝑛 and 𝑘 are the sizes of graph 𝐺 and graph pattern 𝒫, respectively. This complexity is a consequence of subgraph isomorphism being NP-hard. In practice, the running time depends on the size of the search space.
We now consider possible ways to accelerate Algorithm 4.1:
1. How to reduce the size of Φ(𝑢𝑖) for each node 𝑢𝑖? How to efficiently retrieve Φ(𝑢𝑖)?
2. How to reduce the overall search space Φ(𝑢1) × ⋯ × Φ(𝑢𝑘)?
3. How to optimize the search order?
We present three techniques that respectively address the above questions. The first technique prunes each Φ(𝑢𝑖) individually and retrieves it efficiently through indexing. The second technique prunes the overall search space by considering all nodes in the pattern simultaneously. The third technique applies ideas from traditional query optimization to find the right search order.
4.2 Local Pruning and Retrieval of Feasible Mates
Node attributes can be indexed directly using traditional index structures such as B-trees. This allows for fast retrieval of feasible mates and avoids a full scan of all nodes. To reduce the size of the feasible mate sets Φ(𝑢𝑖) even further, we can go beyond nodes and consider neighborhood subgraphs of the nodes. The neighborhood information can be exploited to prune infeasible mates at an early stage.
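As a minimal sketch of the attribute index (function and variable names are ours), a hashtable mapping each label to its nodes supports retrieval of feasible mates without a full scan:

```python
from collections import defaultdict

def build_label_index(node_labels):
    """Map each node label to the list of nodes carrying it, so the
    feasible mates of a pattern node can be fetched by label lookup."""
    index = defaultdict(list)
    for node, label in node_labels.items():
        index[label].append(node)
    return index

index = build_label_index({'A1': 'A', 'A2': 'A', 'B1': 'B'})
```

A B-tree would serve the same role when range queries over attribute values are needed.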
Definition 4.10 (Neighborhood Subgraph). Given graph 𝐺, node 𝑣, and radius 𝑟, the neighborhood subgraph of node 𝑣 consists of all nodes within distance 𝑟 (number of hops) from 𝑣 and all edges between those nodes.
Node 𝑣 is a feasible mate of node 𝑢𝑖 only if the neighborhood subgraph of 𝑢𝑖 is sub-isomorphic to that of 𝑣 (with 𝑢𝑖 mapped to 𝑣). Note that if the radius is 0, then the neighborhood subgraphs degenerate to nodes.
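Definition 4.10 can be implemented with a bounded breadth-first search. The following is an illustrative sketch over an adjacency-list dict (names are our own):

```python
from collections import deque

def neighborhood_subgraph(adj, v, r):
    """Definition 4.10: all nodes within r hops of v, plus every edge
    whose both endpoints lie inside that ball. `adj` maps node -> neighbors."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == r:
            continue  # do not expand beyond radius r
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    nodes = set(dist)
    # keep edges between the collected nodes, one orientation each
    edges = {(a, b) for a in nodes for b in adj.get(a, ()) if b in nodes and a < b}
    return nodes, edges
```

With radius 0 the result degenerates to `({v}, set())`, matching the remark above.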
Although neighborhood subgraphs have high pruning power, they incur a large computation overhead. This overhead can be reduced by representing neighborhood subgraphs by light-weight profiles. For instance, one can define the profile as the sequence of node labels in lexicographic order. The pruning condition then becomes whether one profile is a subsequence of the other.
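As an illustrative sketch (our own function names), building profiles and testing the subsequence condition might look like:

```python
def profile(neighborhood_labels):
    """Profile of a neighborhood subgraph: its node labels in lexicographic order."""
    return sorted(neighborhood_labels)

def maybe_feasible(pattern_profile, graph_profile):
    """Necessary condition for feasibility: the pattern node's profile must be
    a subsequence of the candidate's profile. Cheap, but may keep false positives."""
    it = iter(graph_profile)
    return all(label in it for label in pattern_profile)
```

The subsequence test walks both sorted label sequences once, so it costs linear time instead of a subgraph isomorphism test.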
Figure 4.16 A sample graph pattern and graph.

Figure 4.17 Feasible mates using neighborhood subgraphs and profiles. The resulting search spaces are also shown for the different pruning techniques.
Figure 4.16 shows the sample graph pattern 𝒫 and the database graph 𝐺 again for convenience. Figure 4.17 shows the neighborhood subgraphs of radius 1 and their profiles for the nodes of 𝐺. If the feasible mates are retrieved using node attributes, then the search space is {𝐴1, 𝐴2} × {𝐵1, 𝐵2} × {𝐶1, 𝐶2}. If the feasible mates are retrieved using neighborhood subgraphs, then the search space is {𝐴1} × {𝐵1} × {𝐶2}. Finally, if the feasible mates are retrieved using profiles, then the search space is {𝐴1} × {𝐵1, 𝐵2} × {𝐶2}. These are shown on the right side of Figure 4.17.
If the node attributes are selective, e.g., there are many unique attribute values, then one can index the node attributes using a B-tree or hashtable, and store the neighborhood subgraphs or profiles as well. Retrieval is done by indexed access to the node attributes, followed by pruning using neighborhood subgraphs or profiles. Otherwise, if the node attributes are not selective, one may have to index the neighborhood subgraphs or profiles themselves. Recent graph indexing techniques [9, 17, 23, 34, 36, 39–42] or multi-dimensional indexing methods such as R-trees can be used for this purpose.
4.3 Joint Reduction of Search Space
We reduce the overall search space iteratively by an approximation algorithm called Pseudo Subgraph Isomorphism [17]. This prunes the search space by considering the whole pattern and the space Φ(𝑢1) × ⋯ × Φ(𝑢𝑘) simultaneously. Essentially, this technique checks, for each node 𝑢 in pattern 𝒫 and its feasible mate 𝑣 in graph 𝐺, whether the adjacent subtree of 𝑢 is sub-isomorphic to that of 𝑣. The check can be defined recursively on the depth of the adjacent subtrees: the level-𝑙 subtree of 𝑢 is sub-isomorphic to that of 𝑣 only if the level-(𝑙−1) subtrees of 𝑢's neighbors can all be matched to those of 𝑣's neighbors.

To avoid explicit subtree isomorphism tests, a bipartite graph ℬ𝑢,𝑣 is defined between the neighbors of 𝑢 and 𝑣. If the bipartite graph has a semi-perfect matching, i.e., all neighbors of 𝑢 are matched, then 𝑢 is level-𝑙 sub-isomorphic to 𝑣. In the bipartite graph, an edge is present between two nodes 𝑢′ and 𝑣′ only if the level-(𝑙−1) subtree of 𝑢′ is sub-isomorphic to that of 𝑣′, or equivalently, the bipartite graph ℬ𝑢′,𝑣′ at level 𝑙−1 has a semi-perfect matching. A more detailed description can be found in [17].
Algorithm 4.2 outlines the refinement procedure. At each iteration (lines 3–20), a bipartite graph ℬ𝑢,𝑣 is constructed for each 𝑢 and its feasible mate 𝑣 (lines 5–9). If ℬ𝑢,𝑣 has no semi-perfect matching, then 𝑣 is removed from Φ(𝑢), thus reducing the search space (line 13).
The algorithm makes two implementation improvements over the refinement procedure discussed in [17]. First, it avoids unnecessary bipartite matchings. A pair ⟨𝑢, 𝑣⟩ is marked if it needs to be checked for a semi-perfect matching (lines 2, 4). If the semi-perfect matching exists, then the pair is unmarked (lines 10–11). Otherwise, the removal of 𝑣 from Φ(𝑢) (line 13) may affect the existence of semi-perfect matchings of the neighboring ⟨𝑢′, 𝑣′⟩ pairs. As a result,
Algorithm 4.2: Refine Search Space
Input: Graph pattern 𝒫, graph 𝐺, search space Φ(𝑢1) × ⋯ × Φ(𝑢𝑘), level 𝑙
Output: Reduced search space Φ′(𝑢1) × ⋯ × Φ′(𝑢𝑘)
 1 begin
 2   foreach 𝑢 ∈ 𝒫, 𝑣 ∈ Φ(𝑢) do Mark ⟨𝑢, 𝑣⟩;
 3   for 𝑖 ← 1 to 𝑙 do
 4     foreach 𝑢 ∈ 𝒫, 𝑣 ∈ Φ(𝑢), ⟨𝑢, 𝑣⟩ is marked do
 5       // Construct bipartite graph ℬ𝑢,𝑣
 6       𝑁𝒫(𝑢), 𝑁𝐺(𝑣): neighbors of 𝑢, 𝑣;
 7       foreach 𝑢′ ∈ 𝑁𝒫(𝑢), 𝑣′ ∈ 𝑁𝐺(𝑣) do
 8         ℬ𝑢,𝑣(𝑢′, 𝑣′) ← 1 if 𝑣′ ∈ Φ(𝑢′); 0 otherwise;
 9       end
10       if ℬ𝑢,𝑣 has a semi-perfect matching then
11         Unmark ⟨𝑢, 𝑣⟩;
12       else
13         Remove 𝑣 from Φ(𝑢);
14         foreach 𝑢′ ∈ 𝑁𝒫(𝑢), 𝑣′ ∈ 𝑁𝐺(𝑣), 𝑣′ ∈ Φ(𝑢′) do
15           Mark ⟨𝑢′, 𝑣′⟩;
16         end
17       end
18     end
19     if there is no marked ⟨𝑢, 𝑣⟩ then break;
20   end
21 end
these pairs are marked and checked again (line 14). Second, the ⟨𝑢, 𝑣⟩ pairs are stored and manipulated using a hashtable instead of a matrix. This reduces the space and time complexity from 𝑂(𝑘 ⋅ 𝑛) to 𝑂(∑_{𝑖=1}^{𝑘} ∣Φ(𝑢𝑖)∣). The overall time complexity is 𝑂(𝑙 ⋅ ∑_{𝑖=1}^{𝑘} ∣Φ(𝑢𝑖)∣ ⋅ (𝑑1𝑑2 + 𝑀(𝑑1, 𝑑2))), where 𝑙 is the refinement level, 𝑑1 and 𝑑2 are the maximum degrees of 𝒫 and 𝐺 respectively, and 𝑀() is the time complexity of maximum bipartite matching (𝑂(𝑛^2.5) for Hopcroft and Karp's algorithm [19]).
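The core of Algorithm 4.2, without the marking optimization and with a simple augmenting-path matcher in place of Hopcroft and Karp's algorithm, can be sketched as follows. The names and the simplified control flow are our own:

```python
def semi_perfect_matching(left_nodes, edges):
    """True if every left node can be matched (augmenting-path matching)."""
    match = {}  # right node -> matched left node

    def augment(u, seen):
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return all(augment(u, set()) for u in left_nodes)


def refine(pattern_adj, graph_adj, phi, level):
    """Prune v from Phi(u) whenever the bipartite graph between u's and v's
    neighbors has no semi-perfect matching (core idea of Algorithm 4.2)."""
    for _ in range(level):
        changed = False
        for u, mates in phi.items():
            for v in list(mates):
                # edge u' -> v' exists iff v' is still a feasible mate of u'
                bip = {up: [vp for vp in graph_adj.get(v, ())
                            if vp in phi.get(up, set())]
                       for up in pattern_adj.get(u, ())}
                if not semi_perfect_matching(pattern_adj.get(u, ()), bip):
                    mates.discard(v)
                    changed = True
        if not changed:
            break
    return phi
```

Here Φ is kept as a dict of sets; the actual algorithm additionally marks ⟨𝑢, 𝑣⟩ pairs so that matchings are recomputed only when a neighboring removal may have invalidated them.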
Figure 4.18 shows an execution of Algorithm 4.2 on the example in Figure 4.16. At level 1, 𝐴2 and 𝐶1 are removed from Φ(𝐴) and Φ(𝐶), respectively. At level 2, 𝐵2 is removed from Φ(𝐵) since the bipartite graph ℬ𝐵,𝐵2 has no semi-perfect matching (note that 𝐴2 was already removed from Φ(𝐴)).

Figure 4.18 Refinement of the search space. Input search space: {𝐴1, 𝐴2} × {𝐵1, 𝐵2} × {𝐶1, 𝐶2}; output search space: {𝐴1} × {𝐵1} × {𝐶2}.

Whereas the neighborhood subgraphs discussed in Section 4.2 prune infeasible mates by using local information, the refinement procedure in Algorithm 4.2 prunes the search space globally. The global pruning has a larger overhead and depends on the output of the local pruning. Therefore, both pruning methods are indispensable and should be used together.
4.4 Optimization of Search Order
Next, we consider the search order of Algorithm 4.1. The goal here is to find a good search order for the nodes. Since the search procedure is equivalent to multiple joins, it is similar to a typical query optimization problem [7]. Two principal issues need to be considered. One is the cost model for a given search order; the other is the algorithm for finding a good search order. The cost model is used as the objective function of the search algorithm. Since the search algorithm is relatively standard (e.g., dynamic programming, greedy algorithms), we focus on the cost model and illustrate how it can be customized to the domain of graphs.
Cost Model. A search order (a.k.a. a query plan) can be represented as a rooted binary tree whose leaves are nodes of the graph pattern and whose internal nodes are join operations. Figure 4.19 shows two examples of search orders.

We estimate the cost of a join (a node in the query plan tree) as the product of the cardinalities of the collections to be joined. The cardinality of a leaf node is the number of feasible mates. The cardinality of an internal node can be estimated as the product of the cardinalities of its child collections, reduced by a factor 𝛾.
Figure 4.19 Two examples of search orders.
Definition 4.11 (Result size of a join). The result size of join 𝑖 is estimated by

𝑆𝑖𝑧𝑒(𝑖) = 𝑆𝑖𝑧𝑒(𝑖.𝑙𝑒𝑓𝑡) × 𝑆𝑖𝑧𝑒(𝑖.𝑟𝑖𝑔ℎ𝑡) × 𝛾(𝑖)

where 𝑖.𝑙𝑒𝑓𝑡 and 𝑖.𝑟𝑖𝑔ℎ𝑡 are the left and right child nodes of 𝑖, respectively, and 𝛾(𝑖) is the reduction factor.
A simple way to estimate the reduction factor 𝛾(𝑖) is to approximate it by a constant. A more elaborate way is to consider the probabilities of the edges involved in the join: let ℰ(𝑖) be the set of edges involved in join 𝑖; then

𝛾(𝑖) = ∏_{𝑒(𝑢,𝑣) ∈ ℰ(𝑖)} 𝑃(𝑒(𝑢, 𝑣))

where 𝑃(𝑒(𝑢, 𝑣)) is the probability of edge 𝑒(𝑢, 𝑣) conditioned on 𝑢 and 𝑣. This probability can be estimated as

𝑃(𝑒(𝑢, 𝑣)) = 𝑓𝑟𝑒𝑞(𝑒(𝑢, 𝑣)) / (𝑓𝑟𝑒𝑞(𝑢) ⋅ 𝑓𝑟𝑒𝑞(𝑣))

where 𝑓𝑟𝑒𝑞() denotes the frequency of the edge or node in the large graph.
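Under the edge-probability estimate, the reduction factor can be sketched as follows; the function and variable names are ours, and the frequencies would be collected from the large graph:

```python
from math import prod

def reduction_factor(join_edges, edge_freq, node_freq):
    """gamma(i): product over the join's edges e(u, v) of
    P(e(u, v)) = freq(e(u, v)) / (freq(u) * freq(v))."""
    return prod(edge_freq.get((u, v), 0.0) / (node_freq[u] * node_freq[v])
                for (u, v) in join_edges)

# hypothetical label frequencies: 10 A-nodes, 20 B-nodes, 50 A-B edges
gamma = reduction_factor([('A', 'B')], {('A', 'B'): 50}, {'A': 10, 'B': 20})
```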
Definition 4.12 (Cost of a join). The cost of join 𝑖 is estimated by

𝐶𝑜𝑠𝑡(𝑖) = 𝑆𝑖𝑧𝑒(𝑖.𝑙𝑒𝑓𝑡) × 𝑆𝑖𝑧𝑒(𝑖.𝑟𝑖𝑔ℎ𝑡)

Definition 4.13 (Cost of a search order). The total cost of a search order Γ is estimated by

𝐶𝑜𝑠𝑡(Γ) = ∑_{𝑖 ∈ Γ} 𝐶𝑜𝑠𝑡(𝑖)
For example, let the input search space be {𝐴1} × {𝐵1, 𝐵2} × {𝐶2}. If we use a constant reduction factor 𝛾, then 𝐶𝑜𝑠𝑡(𝐴 ⋈ 𝐵) = 1 × 2 = 2, 𝑆𝑖𝑧𝑒(𝐴 ⋈ 𝐵) = 2𝛾, and 𝐶𝑜𝑠𝑡((𝐴 ⋈ 𝐵) ⋈ 𝐶) = 2𝛾 × 1 = 2𝛾. The total cost is 2 + 2𝛾. Similarly, the total cost of (𝐴 ⋈ 𝐶) ⋈ 𝐵 is 1 + 2𝛾. Thus, the search order (𝐴 ⋈ 𝐶) ⋈ 𝐵 is better than (𝐴 ⋈ 𝐵) ⋈ 𝐶.
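The worked example can be reproduced with a small sketch of the cost model (Definitions 4.11–4.13) for left-deep plans; the helper below and its names are illustrative, and a constant reduction factor is assumed:

```python
def plan_cost(sizes, order, gamma):
    """Total cost of a left-deep plan joining pattern nodes in `order`.
    sizes[u] is |Phi(u)|; gamma is the constant reduction factor."""
    size = sizes[order[0]]  # cardinality of the running intermediate result
    total = 0.0
    for node in order[1:]:
        total += size * sizes[node]          # Definition 4.12: cost of this join
        size = size * sizes[node] * gamma    # Definition 4.11: estimated result size
    return total

sizes = {'A': 1, 'B': 2, 'C': 1}
cost_abc = plan_cost(sizes, ['A', 'B', 'C'], gamma=0.5)  # 2 + 2*0.5 = 3.0
cost_acb = plan_cost(sizes, ['A', 'C', 'B'], gamma=0.5)  # 1 + 2*0.5 = 2.0
```

With 𝛾 = 0.5 this gives 3.0 versus 2.0, confirming that (𝐴 ⋈ 𝐶) ⋈ 𝐵 is the cheaper order.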
Search Order. The number of all possible search orders is exponential in the number of nodes, so it is expensive to enumerate all of them. As in many query optimization techniques, we consider only left-deep query plans, i.e., the outer node of each join is always a leaf node. Traditional dynamic programming would take 𝑂(2^𝑘) time for a graph pattern of size 𝑘, which does not scale to large graph patterns. Therefore, we adopt a simple greedy approach in our implementation: at join 𝑖, choose a leaf node that minimizes the estimated cost of the join.
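The greedy construction of a left-deep order might be sketched as follows (assumed names; the choice of the node with the fewest feasible mates as the starting leaf is our own assumption, and with a constant 𝛾 the rule reduces to always picking the smallest remaining cardinality):

```python
def greedy_order(sizes, gamma):
    """Greedy left-deep plan: start from the node with the fewest feasible
    mates, then repeatedly append the leaf minimizing the next join's cost."""
    remaining = dict(sizes)
    first = min(remaining, key=remaining.get)
    order = [first]
    size = remaining.pop(first)
    while remaining:
        nxt = min(remaining, key=lambda n: size * remaining[n])  # cheapest join
        order.append(nxt)
        size = size * remaining.pop(nxt) * gamma
    return order
```

With the per-join reduction factor 𝛾(𝑖) of Definition 4.11, the choice would additionally depend on which pattern edges each candidate join covers.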
5 Experimental Study
In this section, we evaluate the performance of the presented graph pattern matching algorithms on large real and synthetic graphs. The graph-specific optimizations are compared with an SQL-based implementation as described in Figure 4.2. MySQL server 5.0.45 is used and configured with storage engine = MyISAM (non-transactional) and key buffer size = 256M; other parameters are set to their defaults. For each large graph, two tables V(vid, label) and E(vid1, vid2) are created as in Figure 4.2. B-tree indices are built on each field of the tables.
The presented graph pattern matching algorithms were written in Java and compiled with Sun JDK 1.6. All experiments were run on an AMD Athlon 64 X2 4200+ 2.2GHz machine with 2GB memory running MS Windows XP Pro.
5.1 Biological Network
The real dataset is a yeast protein interaction network [2]. This graph consists of 3112 nodes and 12519 edges. Each node represents a unique protein and each edge represents an interaction between proteins.
To allow for meaningful queries, we add Gene Ontology (GO) [14] terms to the proteins. The Gene Ontology is a hierarchy of categories that describes cellular components, biological processes, and molecular functions of genes and their products (proteins). Each GO term is a node in the hierarchy and has one or more parent GO terms. Each protein has one or more GO terms. We use high-level GO terms as labels of the proteins (183 distinct labels in total). We index the node labels using a hashtable, and store the neighborhood subgraphs and profiles of radius 1 as well.
Clique Queries. The clique queries are generated with sizes (numbers of nodes) between 2 and 7 (sizes greater than 7 have no answers). For each size, a complete graph is generated with each node assigned a random label. The random label is selected from the top 40 most frequent labels. A total of 1000 clique queries are generated and the results are averaged. The queries are divided into two groups according to the number of answers returned: low hits (fewer than 100 answers) and high hits (more than 100 answers). Queries having no answers are not counted in the statistics. Queries having too many hits (more than 1000) are terminated immediately and counted in the high-hits group.
To evaluate the pruning power of the local pruning (Section 4.2) and the global pruning (Section 4.3), we define the reduction ratio of a search space as

𝛾(Φ, Φ0) = (∣Φ(𝑢1)∣ × ⋯ × ∣Φ(𝑢𝑘)∣) / (∣Φ0(𝑢1)∣ × ⋯ × ∣Φ0(𝑢𝑘)∣)

where Φ0 refers to the baseline search space.
Figure 4.20 Search space for clique queries: (a) low hits; (b) high hits.

Figure 4.21 Running time for clique queries (low hits): (a) individual steps; (b) total query processing.
Figure 4.20 shows the reduction ratios of the search space for the different methods. "Retrieve by profiles" finds feasible mates by checking profiles, and "Retrieve by subgraphs" finds feasible mates by checking neighborhood subgraphs (Section 4.2). "Refined search space" refers to the global pruning discussed in Section 4.3, where the input search space is generated by "Retrieve by profiles". The maximum refinement level ℓ is set to the size of the query. As can be seen from the figure, the refinement procedure always reduces the search space retrieved by profiles. Retrieval by subgraphs results in the smallest search space. This is due to the fact that the neighborhood subgraph of a clique query is actually the entire clique.
Figure 4.21(a) shows the average processing time of the individual steps under varying clique sizes. The individual steps include retrieval by profiles, retrieval by subgraphs, refinement, search with the optimized order (Section 4.4), and search without the optimized order. The time for finding the optimized order is negligible since we take a greedy approach in our implementation. As shown in the figure, retrieval by subgraphs has a large overhead although it produces a smaller search space than retrieval by profiles. Another observation is that the optimized order improves the search time.
Figure 4.21(b) shows the average total query processing time in comparison to the SQL-based approach on low-hits queries. The "Optimized" processing consists of retrieval by profiles, refinement, optimization of the search order, and search with the optimized order. The "Baseline" processing consists of retrieval by node attributes and search without the optimized order on the baseline space. The query processing time in the "Optimized" case is improved greatly due to the reduced search space.
The SQL-based approach takes much longer and does not scale to large clique queries. This is due to the unpruned search space and the large number of joins involved. Whereas our graph pattern matching algorithm (Section 4.1) is exponential in the number of nodes, the SQL-based approach is exponential in the number of edges. For instance, a clique of size 5 has 10 edges, which requires 20 joins between the node and edge tables (as illustrated in Figure 4.2).
5.2 Synthetic Graphs
The synthetic graphs are generated using a simple Erdős–Rényi [13] random graph model: generate 𝑛 nodes, and then generate 𝑚 edges by randomly choosing two end nodes for each edge. Each node is assigned a label (100 distinct labels in total). The distribution of the labels follows Zipf's law, i.e., the probability 𝑝(𝑥) of the 𝑥-th label is proportional to 𝑥^{-1}. The queries are generated by randomly extracting a connected subgraph from the synthetic graph.

We first fix the size of the synthetic graphs at 𝑛 = 10𝐾 with 𝑚 = 5𝑛, and vary the query size between 4 and 20. Figure 4.22 shows the search space and processing time of the individual steps. Unlike for clique queries, the global pruning produces the smallest search space, outperforming the local pruning by full neighborhood subgraphs.
Figure 4.22 Search space and running time for individual steps (synthetic graphs, low hits): (a) search space; (b) time for individual steps.
Figure 4.23 Running time (synthetic graphs, low hits): (a) varying query sizes (graph size: 10K); (b) varying graph sizes (query size: 4).
Figure 4.23 shows the total time under varying query sizes and graph sizes. As can be seen, the SQL-based approach is not scalable to large queries, though it scales to large graphs with small queries. In either case, the "Optimized" processing yields the smallest running time.
To summarize the experimental results, retrieval by profiles has much less overhead than retrieval by subgraphs. The refinement step (Section 4.3) greatly reduces the search space, and the overhead of this step is well compensated by the extensive reduction of the search space. A practical combination would be retrieval by profiles, followed by refinement, and then search with an optimized order. This combination scales well with various query sizes and graph sizes. SQL-based processing is not scalable to large queries. Overall, the optimized processing performs orders of magnitude better than the SQL-based approach. While small improvements in SQL-based implementations can be