Managing and Mining Graph Data, part 24


Cheng et al. [11, 12] treat 𝐴 ↪ 𝐷 as an R-join (analogous to a 𝜃-join) and process graph pattern matching as a sequence of R-joins. The issue is how to select the join order. In [11] they propose a dynamic programming algorithm to determine the R-join order. In [12] they propose an R-join/R-semijoin approach, whose basic idea is to divide the join-index based approach into two steps, namely filter and fetch. The filter step resembles a semijoin, and the fetch step performs the join. Cheng et al. study how to select the R-join/R-semijoin order by interleaving R-joins with R-semijoins, again using dynamic programming [12].
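As a rough illustration of what a reachability join computes, the following Python sketch evaluates the condition "an A-labeled node reaches a D-labeled node" on a small labeled directed graph. It is only a sketch under stated assumptions: it uses plain breadth-first search instead of the join index of [11, 12], it does not model their cost estimation or dynamic programming, and the names `reachable_from`, `r_semijoin`, and `r_join` as well as the example graph are hypothetical. The semijoin plays the role of the filter step and the join that of the fetch step.

```python
from collections import deque

def reachable_from(adj, src):
    """Return the set of nodes reachable from src by at least one edge."""
    seen, queue = set(), deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def r_semijoin(adj, label, a_label, d_label):
    """Filter step (hypothetical): A-labeled nodes that reach some D-labeled node."""
    return {a for a in label
            if label[a] == a_label
            and any(label[v] == d_label for v in reachable_from(adj, a))}

def r_join(adj, label, a_label, d_label, candidates=None):
    """Fetch step (hypothetical): all pairs (a, d) where a reaches d."""
    if candidates is None:
        candidates = {a for a in label if label[a] == a_label}
    return {(a, d) for a in candidates
            for d in reachable_from(adj, a) if label[d] == d_label}

# Illustrative data: a1 -> b1 -> d1, a2 -> c1, a3 -> d2
adj = {"a1": ["b1"], "b1": ["d1"], "a2": ["c1"], "a3": ["d2"]}
label = {"a1": "A", "a2": "A", "a3": "A",
         "b1": "B", "c1": "C", "d1": "D", "d2": "D"}

survivors = r_semijoin(adj, label, "A", "D")      # a2 is filtered out
pairs = r_join(adj, label, "A", "D", survivors)   # join only the survivors
```

Filtering first means the fetch step only touches nodes that are known to contribute to the result, which is the intuition behind interleaving the two operators.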

Wang et al. [35] process a query graph 𝐺_𝑞 using a hash join approach, and consider how to share the processing cost when several 𝐴𝑙𝑖𝑠𝑡 and 𝐷𝑙𝑖𝑠𝑡 need to be processed simultaneously. They propose three basic join operators, namely IT-HGJoin, T-HGJoin, and Bi-HGJoin. The IT-HGJoin processes a subgraph of a query with one descendant and multiple ancestors, for example 𝐴 ↪ 𝐷 ∧ 𝐵 ↪ 𝐷. The T-HGJoin processes a subgraph of a query with one ancestor and multiple descendants, for example 𝐴 ↪ 𝐶 ∧ 𝐴 ↪ 𝐷. The Bi-HGJoin processes a complete bipartite subgraph of a query with multiple ancestors and multiple descendants, for example 𝐴 ↪ 𝐶 ∧ 𝐴 ↪ 𝐷 ∧ 𝐵 ↪ 𝐶 ∧ 𝐵 ↪ 𝐷. A general query graph 𝐺_𝑞 is processed as a set of subgraph queries using IT-HGJoin, T-HGJoin, and Bi-HGJoin.

11 Conclusions and Summary

In this chapter, we presented a survey on reachability queries. We discussed several coding-based approaches, including traversal, dual-labeling, tree cover, chain cover, path-tree cover, 2-hop cover, and 3-hop cover approaches. We also addressed how to support distance-aware queries, such as finding the shortest distance between two nodes in a large directed graph using the 2-hop cover, and how to support graph pattern matching using the existing graph-based coding schemes. As future work, it will be important to investigate how graph-based coding schemes can support more real applications on large graphs.

References

[1] R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient management of transitive relationships in large data and knowledge bases. In Proceedings of the 1989 ACM SIGMOD international conference on Management of data (SIGMOD 1989), 1989.

[2] K. Anyanwu and A. Sheth. 𝜌-queries: enabling querying for semantic associations on the semantic web. In Proceedings of the 12th international conference on World Wide Web (WWW 2003), 2003.


Graph Reachability Queries: A Survey 213

[3] B. Berendt and M. Spiliopoulou. Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9(1), 2000.

[4] R. Bramandia, J. Cheng, B. Choi, and J. X. Yu. Updating recursive XML views without transitive closure. To appear in VLDB J., 2009.

[5] R. Bramandia, B. Choi, and W. K. Ng. On incremental maintenance of 2-hop labeling of graphs. In Proceedings of the 17th international conference on World Wide Web (WWW 2008), 2008.

[6] D. Brickley and R. V. Guha. Resource Description Framework (RDF) Schema Specification 1.0. W3C Recommendation, 2000.

[7] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD 2002), 2002.

[8] L. Chen, A. Gupta, and M. E. Kurul. Stack-based algorithms for pattern matching on dags. In Proceedings of the 31st international conference on Very large data bases (VLDB 2005), 2005.

[9] Y. Chen and Y. Chen. An efficient algorithm for answering graph reachability queries. In Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), 2008.

[10] J. Cheng and J. X. Yu. On-line exact shortest distance query processing. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT 2009), 2009.

[11] J. Cheng, J. X. Yu, and B. Ding. Cost-based query optimization for multi reachability joins. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2007), 2007.

[12] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), 2008.

[13] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu. Fast computation of reachability labeling for large graphs. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT 2006), 2006.

[14] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu. Fast computing reachability labelings for large graphs with high compression rate. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT 2008), 2008.

[15] J. Cheng, J. X. Yu, and N. Tang. Fast reachability query processing. In Proceedings of the 11th International Conference on Database Systems for Advanced Applications (DASFAA 2006), 2006.


[16] Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.

[17] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In Proceedings of the 13th annual ACM-SIAM symposium on Discrete algorithms (SODA 2002), 2002.

[18] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT Press, 2001.

[19] S. DeRose, E. Maler, and D. Orchard. XML linking language (XLink) version 1.0. 2001.

[20] S. DeRose, E. Maler, and D. Orchard. XML pointer language (XPointer) version 1.0. 2001.

[21] J. Edmonds. Optimum branchings. J. Research of the National Bureau of Standards, 71B:233–240, 1967.

[22] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a web-site management system. SIGMOD Rec., 26(3), 1997.

[23] H. He, H. Wang, J. Yang, and P. S. Yu. Compact reachability labeling for graph-structured data. In Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management (CIKM 2005), pages 594–601, 2005.

[24] H. V. Jagadish. A compression technique to materialize transitive closure. ACM Trans. Database Syst., 15(4):558–598, 1990.

[25] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-HOP: A high-compression indexing scheme for reachability query. In Proceedings of the 2009 ACM SIGMOD international conference on Management of data (SIGMOD 2009), 2009.

[26] R. Jin, Y. Xiang, N. Ruan, and H. Wang. Efficiently answering reachability queries on very large directed graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD 2008), 2008.

[27] D. S. Johnson. Approximation algorithms for combinatorial problems. In Proceedings of the 5th annual ACM symposium on Theory of computing (STOC 1973), 1973.

[28] L. Roditty and U. Zwick. A fully dynamic reachability algorithm for directed graphs with an almost linear update time. In Proceedings of the 36th annual ACM symposium on Theory of computing (STOC 2004), 2004.

[29] R. Schenkel, A. Theobald, and G. Weikum. Hopi: An efficient connection index for complex XML document collections. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT 2004), 2004.


[30] R. Schenkel, A. Theobald, and G. Weikum. Efficient creation and incremental maintenance of the HOPI index for complex XML document collections. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), 2005.

[31] K. Simon. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci., 58(1-3):325–346, 1988.

[32] S. Trißl and U. Leser. Fast and practical indexing and querying of very large graphs. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD 2007), 2007.

[33] J. van Helden, A. Naim, R. Mancuso, M. Eldridge, L. Wernisch, D. Gilbert, and S. Wodak. Representing and analysing molecular and cellular function using the computer. Journal of Biological Chemistry, 381(9-10), 2000.

[34] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual labeling: Answering graph reachability queries in constant time. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), 2006.

[35] H. Wang, J. Li, J. Luo, and H. Gao. Hash-base subgraph query processing method for graph-structured XML documents. Proceedings of the VLDB Endowment, 1(1), 2008.

[36] H. Wang, W. Wang, X. Lin, and J. Li. Labeling scheme and structural joins for graph-structured XML data. In Proceedings of the 7th Asia-Pacific Web Conference on Web Technologies Research and Development (APWeb 2005), 2005.


EXACT AND INEXACT GRAPH MATCHING: METHODOLOGY AND APPLICATIONS

Kaspar Riesen

Institute of Computer Science and Applied Mathematics, University of Bern

Neubrückstrasse 10, CH-3012 Bern, Switzerland

riesen@iam.unibe.ch

Xiaoyi Jiang

Department of Mathematics and Computer Science, University of Münster

Einsteinstrasse 62, D-48149 Münster, Germany

xjiang@math.uni-muenster.de

Horst Bunke

Institute of Computer Science and Applied Mathematics, University of Bern

Neubrückstrasse 10, CH-3012 Bern, Switzerland

bunke@iam.unibe.ch

Abstract: Graphs provide us with a powerful and flexible representation formalism which can be employed in various fields of intelligent information processing. The process of evaluating the similarity of graphs is referred to as graph matching. Two approaches to this task exist, viz. exact and inexact graph matching. The former approach aims at finding a strict correspondence between two graphs to be matched, while the latter is able to cope with errors and measures the difference of two graphs in a broader sense. The present chapter reviews some fundamental concepts of both paradigms and shows two recent applications of graph matching in the fields of information retrieval and pattern recognition.

Keywords: Exact and Inexact Graph Matching, Graph Edit Distance, Information Retrieval by means of Graph Matching, Graph Embedding via Graph Matching

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_7, p. 217


218 MANAGING AND MINING GRAPH DATA

1 Introduction

After many years of research, the fields of pattern recognition, machine learning and data mining have reached a high level of maturity [4]. Powerful methods for classification, clustering, information retrieval, and other tasks have become available. However, the vast majority of these approaches rely on object representations given in terms of feature vectors. Such object representations have a number of useful properties. For instance, the dissimilarity, or distance, of two objects can be easily computed by means of the Euclidean distance. Moreover, a large number of well-established methods for data mining, information retrieval, and related tasks in intelligent information processing are available.

Recently, however, a growing interest in graph-based object representation can be observed [16]. Graphs are powerful and universal data structures able to explicitly model networks of relationships between substructures of a given object. Thereby, the size as well as the complexity of a graph can be adapted to the size and complexity of a particular object (in contrast to vectorial approaches, where the number of features has to be fixed beforehand). Yet, after the initial enthusiasm induced by the “smartness” and flexibility of graph representations in the late seventies, a number of problems became evident. First, working with graphs is considerably more challenging than working with feature vectors, as even basic mathematical operations cannot be defined in a standard way, but must be provided depending on the specific application. Hence, almost none of the common methods for data mining, machine learning, or pattern recognition can be applied to graphs without significant modifications.

Second, graphs suffer from their own flexibility. For instance, computing the distance between a pair of objects, which is an important task in many areas, is linear in the number of data items in the case where vectors are employed. The same task for graphs, however, is much more complex, since one cannot simply compare the sets of nodes and edges, which are generally unordered and of different size. More formally, when computing graph dissimilarity or similarity one has to identify common parts of the graphs by considering all of their subgraphs. Given that there are 𝑂(2ⁿ) subgraphs of a graph with 𝑛 nodes, the inherent difficulty of graph comparisons becomes obvious.
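The exponential growth in the number of subgraphs can be made concrete with a small Python sketch. It enumerates the node subsets of an 𝑛-node graph (including the empty one), each of which induces exactly one subgraph; the helper name `node_subsets` is introduced here for illustration only.

```python
from itertools import combinations

def node_subsets(nodes):
    """Yield every subset of the node set; each subset induces one subgraph."""
    nodes = list(nodes)
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            yield frozenset(subset)

# Number of node subsets for graphs with 1, 2, 4, and 8 nodes.
counts = {n: sum(1 for _ in node_subsets(range(n))) for n in (1, 2, 4, 8)}
```

Each additional node doubles the count, which is why exhaustive comparison of all subgraphs quickly becomes infeasible.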

Despite adverse mathematical and computational conditions in the graph domain, various procedures for evaluating proximity, i.e. similarity or dissimilarity, of graphs have been proposed in the literature [15]. The process of evaluating the similarity of two graphs is commonly referred to as graph matching. The overall aim of graph matching is to find a correspondence between the nodes and edges of two graphs that satisfies some more or less stringent constraints. That is, by means of the graph matching process, similar substructures in one graph are mapped to similar substructures in the other graph. Based on this matching, a dissimilarity or similarity score can eventually be computed indicating the proximity of two graphs.

Graph matching has been the topic of numerous studies in computer science over the last decades. Roughly speaking, there are two categories of tasks in graph matching, viz. exact matching and inexact matching. In the former case, for a matching to be successful, it is required that a strict correspondence is found between the two graphs being matched, or at least among their subparts. In the latter approach this requirement is substantially relaxed, since matchings between completely non-identical graphs are also possible. That is, inexact matching algorithms are endowed with a certain tolerance to errors and noise, enabling them to detect similarities in a more general way than the exact matching approach. Therefore, inexact graph matching is also referred to as error-tolerant graph matching.

For an extensive review of graph matching methods and applications, the reader is referred to [15]. In this chapter, basic notations and definitions are introduced (Sect. 2) and an overview of standard techniques for exact as well as error-tolerant graph matching is given (Sect. 3 and 4). In Sect. 3, dissimilarity models derived from graph isomorphism, subgraph isomorphism, and maximum common subgraph are discussed for exact graph matching. In Sect. 4, inexact graph matching, and in particular the paradigm of edit distance applied to graphs, is discussed. Finally, two recent applications of graph matching are reviewed. First, in Sect. 5 an algorithmic framework for information retrieval based on graph matching is described. This approach is based on both exact and inexact graph matching procedures and aims at querying large database graphs. Secondly, a graph embedding procedure based on graph matching is reviewed in Sect. 6. This framework aims at an explicit embedding of graphs in real vector spaces, which establishes access to the rich repository of algorithmic tools for classification, clustering, regression, and other tasks, originally developed for vectorial representations.

2 Basic Notations

Various definitions for graphs can be found in the literature, depending upon the considered application. It turns out that the definition given below is sufficiently flexible for a large variety of tasks.

Definition 7.1 (Graph). Let 𝐿_𝑉 and 𝐿_𝐸 be finite or infinite label alphabets for nodes and edges, respectively. A graph 𝑔 is a four-tuple 𝑔 = (𝑉, 𝐸, 𝜇, 𝜈), where

𝑉 is the finite set of nodes,

𝐸 ⊆ 𝑉 × 𝑉 is the set of edges,

𝜇 : 𝑉 → 𝐿_𝑉 is the node labeling function, and

𝜈 : 𝐸 → 𝐿_𝐸 is the edge labeling function.

Figure 7.1. Different kinds of graphs: (a) undirected and unlabeled, (b) directed and unlabeled, (c) undirected with labeled nodes (different shades of gray refer to different labels), (d) directed with labeled nodes and edges.

The number of nodes of a graph 𝑔 is denoted by ∣𝑔∣, while 𝒢 represents the set of all graphs over the label alphabets 𝐿_𝑉 and 𝐿_𝐸.
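One possible way to carry the four-tuple of Definition 7.1 into code is sketched below in Python. The class name `Graph`, its field names, and the use of dictionaries for the two labeling functions are assumptions made for this example, not part of the definition.

```python
from dataclasses import dataclass

@dataclass
class Graph:
    nodes: frozenset      # V, the finite set of nodes
    edges: frozenset      # E, a subset of V x V, stored as (u, v) pairs
    node_label: dict      # mu: maps every node to its label
    edge_label: dict      # nu: maps every edge to its label

    def __post_init__(self):
        # Enforce the structural constraints of the definition.
        if not all(u in self.nodes and v in self.nodes for u, v in self.edges):
            raise ValueError("every edge must connect two nodes of V")
        if set(self.node_label) != set(self.nodes):
            raise ValueError("mu must be defined exactly on V")
        if set(self.edge_label) != set(self.edges):
            raise ValueError("nu must be defined exactly on E")

    def __len__(self):
        # |g| is the number of nodes.
        return len(self.nodes)

# A small directed graph with symbolic node labels and numeric edge labels.
g = Graph(
    nodes=frozenset({1, 2, 3}),
    edges=frozenset({(1, 2), (2, 3)}),
    node_label={1: "alpha", 2: "beta", 3: "alpha"},
    edge_label={(1, 2): 0.5, (2, 3): 1.5},
)
```

An unlabeled graph can be represented in the same structure by giving every node and every edge the same placeholder label.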

Definition 7.1 allows us to handle arbitrarily structured graphs with unconstrained labeling functions. For example, the labels for both nodes and edges can be given by the set of integers 𝐿 = {1, 2, 3, …}, the vector space 𝐿 = ℝⁿ, or a set of symbolic labels 𝐿 = {𝛼, 𝛽, 𝛾, …}. Given that the nodes and/or the edges are labeled, the graphs are referred to as labeled graphs. Unlabeled graphs are obtained as a special case by assigning the same label 𝜀 to all nodes and edges, i.e. 𝐿_𝑉 = 𝐿_𝐸 = {𝜀}.

Edges are given by pairs of nodes (𝑢, 𝑣), where 𝑢 ∈ 𝑉 denotes the source node and 𝑣 ∈ 𝑉 the target node of a directed edge. Commonly, the two nodes 𝑢 and 𝑣 connected by an edge (𝑢, 𝑣) are referred to as adjacent. A graph is termed complete if all pairs of nodes are adjacent. Directed graphs directly correspond to the definition above. In addition, the class of undirected graphs can be modeled by inserting a reverse edge (𝑣, 𝑢) ∈ 𝐸 for each edge (𝑢, 𝑣) ∈ 𝐸 with identical labels, i.e. 𝜈(𝑢, 𝑣) = 𝜈(𝑣, 𝑢). In Fig. 7.1 some graphs (directed/undirected, labeled/unlabeled) are shown.
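The reverse-edge construction for undirected graphs can be sketched in a few lines of Python. The function names `make_undirected` and `is_complete` are hypothetical. The sketch assumes each undirected edge is listed once in the input, and it reads "all pairs of nodes" as all pairs of distinct nodes.

```python
def make_undirected(edges, edge_label):
    """Add a reverse edge (v, u) for each (u, v), copying the label."""
    sym_edges = set(edges)
    sym_label = dict(edge_label)
    for (u, v) in edges:
        sym_edges.add((v, u))
        sym_label[(v, u)] = edge_label[(u, v)]   # nu(u, v) = nu(v, u)
    return sym_edges, sym_label

def is_complete(nodes, edges):
    """True if every pair of distinct nodes is adjacent."""
    return all((u, v) in edges or (v, u) in edges
               for u in nodes for v in nodes if u != v)

edges = {(1, 2), (2, 3)}
labels = {(1, 2): "x", (2, 3): "y"}
sym_edges, sym_labels = make_undirected(edges, labels)
```

After the call, `sym_edges` contains both directions of every edge, so algorithms written for directed graphs can be reused unchanged.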

Definition 7.2 (Subgraph). Let 𝑔₁ = (𝑉₁, 𝐸₁, 𝜇₁, 𝜈₁) and 𝑔₂ = (𝑉₂, 𝐸₂, 𝜇₂, 𝜈₂) be graphs. Graph 𝑔₁ is a subgraph of 𝑔₂, denoted by 𝑔₁ ⊆ 𝑔₂, if

(1) 𝑉₁ ⊆ 𝑉₂,

(2) 𝐸₁ ⊆ 𝐸₂,

(3) 𝜇₁(𝑢) = 𝜇₂(𝑢) for all 𝑢 ∈ 𝑉₁, and

(4) 𝜈₁(𝑒) = 𝜈₂(𝑒) for all 𝑒 ∈ 𝐸₁.

By replacing condition (2) in Definition 7.2 by the more stringent condition

(2’) 𝐸₁ = 𝐸₂ ∩ (𝑉₁ × 𝑉₁),

𝑔₁ becomes an induced subgraph of 𝑔₂. If 𝑔₂ is a subgraph of 𝑔₁, graph 𝑔₁ is called a supergraph of 𝑔₂.

Figure 7.2. Graph (b) is an induced subgraph of (a), and graph (c) is a non-induced subgraph of (a).

Obviously, a subgraph 𝑔₁ is obtained from a graph 𝑔₂ by removing some nodes and their incident edges, as well as possibly some additional edges, from 𝑔₂. For 𝑔₁ to be an induced subgraph of 𝑔₂, some nodes and only their incident edges are removed from 𝑔₂, i.e. no additional edge removal is allowed. Fig. 7.2(b) and 7.2(c) show an induced and a non-induced subgraph of the graph in Fig. 7.2(a), respectively.
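The conditions of Definition 7.2 translate almost directly into set operations. The following Python sketch assumes that a graph is given as a tuple `(V, E, mu, nu)` of two sets and two dictionaries; this representation and the function names are choices made for the example.

```python
def is_subgraph(g1, g2):
    """Check conditions (1)-(4) of the subgraph definition."""
    V1, E1, mu1, nu1 = g1
    V2, E2, mu2, nu2 = g2
    return (V1 <= V2 and E1 <= E2
            and all(mu1[u] == mu2[u] for u in V1)
            and all(nu1[e] == nu2[e] for e in E1))

def is_induced_subgraph(g1, g2):
    """Check the stricter condition (2'): E1 equals E2 restricted to V1 x V1."""
    V1, E1, _, _ = g1
    _, E2, _, _ = g2
    induced_edges = {(u, v) for (u, v) in E2 if u in V1 and v in V1}
    return is_subgraph(g1, g2) and E1 == induced_edges

# A triangle-shaped directed graph and two candidate subgraphs.
full = ({1, 2, 3}, {(1, 2), (2, 3), (1, 3)},
        {1: "a", 2: "a", 3: "a"},
        {(1, 2): "e", (2, 3): "e", (1, 3): "e"})
induced = ({1, 2}, {(1, 2)}, {1: "a", 2: "a"}, {(1, 2): "e"})
non_induced = ({1, 2, 3}, {(1, 2), (2, 3)},
               {1: "a", 2: "a", 3: "a"}, {(1, 2): "e", (2, 3): "e"})
```

Here `non_induced` keeps all three nodes but drops the edge (1, 3), so it is a subgraph of `full` without being an induced one.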

3 Exact Graph Matching

The aim in exact graph matching is to determine whether two graphs, or at least parts of them, are identical in terms of structure and labels. A common approach to describe the structure of a graph is to define the 𝑛 × 𝑛 adjacency matrix 𝐀 = (𝑎ᵢⱼ) of graph 𝑔 = (𝑉, 𝐸, 𝜇, 𝜈) (∣𝑔∣ = 𝑛). In this matrix the entry 𝑎ᵢⱼ is equal to 1 if there is an edge (𝑣ᵢ, 𝑣ⱼ) ∈ 𝐸 connecting the 𝑖-th node 𝑣ᵢ ∈ 𝑉 with the 𝑗-th node 𝑣ⱼ ∈ 𝑉, and 0 otherwise.
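A minimal Python sketch of this construction is given below. The function name `adjacency_matrix` and the list-of-lists representation are assumptions made for the example; the node order must be supplied explicitly because the graph itself does not define one.

```python
def adjacency_matrix(node_order, edges):
    """Build the matrix whose entry [i][j] is 1 if (v_i, v_j) is an edge, else 0."""
    n = len(node_order)
    index = {v: i for i, v in enumerate(node_order)}
    A = [[0] * n for _ in range(n)]
    for (u, v) in edges:
        A[index[u]][index[v]] = 1
    return A

# Directed path x -> y -> z, with the nodes ordered as x, y, z.
A = adjacency_matrix(["x", "y", "z"], {("x", "y"), ("y", "z")})
```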

Generally, for the nodes (and also the edges) of a graph there is no unique canonical order. Thus, for a single graph with 𝑛 nodes, up to 𝑛! different adjacency matrices exist, since there are 𝑛! possibilities to order the nodes of 𝑔. Consequently, for checking two graphs for structural identity, we cannot simply compare their adjacency matrices. The identity of two graphs 𝑔₁ and 𝑔₂ is commonly established by defining a function, termed graph isomorphism, that maps 𝑔₁ to 𝑔₂.
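The dependence of the adjacency matrix on the node order can be checked on a tiny example. The Python sketch below builds the matrix of a three-node directed path for every node ordering; the helper `adjacency_matrix` is defined here for illustration. For this particular graph all 3! = 6 orderings give different matrices, whereas a graph with symmetries would give fewer distinct ones.

```python
from itertools import permutations

def adjacency_matrix(node_order, edges):
    """Adjacency matrix for one specific ordering of the nodes."""
    index = {v: i for i, v in enumerate(node_order)}
    n = len(node_order)
    A = [[0] * n for _ in range(n)]
    for (u, v) in edges:
        A[index[u]][index[v]] = 1
    return A

edges = {("x", "y"), ("y", "z")}   # directed path x -> y -> z

# Collect the distinct matrices over all orderings (as hashable tuples).
matrices = {
    tuple(map(tuple, adjacency_matrix(order, edges)))
    for order in permutations(["x", "y", "z"])
}
```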

Definition 7.3 (Graph Isomorphism). Let us consider two graphs denoted by 𝑔₁ = (𝑉₁, 𝐸₁, 𝜇₁, 𝜈₁) and 𝑔₂ = (𝑉₂, 𝐸₂, 𝜇₂, 𝜈₂), respectively. A graph isomorphism is a bijective function 𝑓 : 𝑉₁ → 𝑉₂ satisfying

(1) 𝜇₁(𝑢) = 𝜇₂(𝑓(𝑢)) for all nodes 𝑢 ∈ 𝑉₁,

(2) for each edge 𝑒₁ = (𝑢, 𝑣) ∈ 𝐸₁, there exists an edge 𝑒₂ = (𝑓(𝑢), 𝑓(𝑣)) ∈ 𝐸₂ such that 𝜈₁(𝑒₁) = 𝜈₂(𝑒₂), and

(3) for each edge 𝑒₂ = (𝑢, 𝑣) ∈ 𝐸₂, there exists an edge 𝑒₁ = (𝑓⁻¹(𝑢), 𝑓⁻¹(𝑣)) ∈ 𝐸₁ such that 𝜈₁(𝑒₁) = 𝜈₂(𝑒₂).

Two graphs are called isomorphic if there exists an isomorphism between them.

Figure 7.3. Graph (b) is isomorphic to (a), and graph (c) is isomorphic to a subgraph of (a). Node attributes are indicated by different shades of gray.

Obviously, isomorphic graphs are identical in both structure and labels. That is, a one-to-one correspondence between each node of the first graph and each node of the second graph has to be found such that the edge structure is preserved and node and edge labels are consistent.
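Checking whether a given node mapping satisfies Definition 7.3 is straightforward, in contrast to finding such a mapping. The Python sketch below verifies the three conditions for a candidate mapping `f` supplied as a dictionary. The tuple representation `(V, E, mu, nu)` and the function name `is_isomorphism` are assumptions made for the example.

```python
def is_isomorphism(f, g1, g2):
    """Return True if the dict f is a graph isomorphism from g1 to g2."""
    V1, E1, mu1, nu1 = g1
    V2, E2, mu2, nu2 = g2
    # f must be a bijection from V1 onto V2.
    if set(f) != set(V1) or set(f.values()) != set(V2) or len(V1) != len(V2):
        return False
    # (1) node labels are preserved.
    if any(mu1[u] != mu2[f[u]] for u in V1):
        return False
    # (2) every edge of g1 maps to an equally labeled edge of g2.
    for (u, v) in E1:
        e2 = (f[u], f[v])
        if e2 not in E2 or nu1[(u, v)] != nu2[e2]:
            return False
    # (3) every edge of g2 has an equally labeled preimage in g1.
    inverse = {w: u for u, w in f.items()}
    for (u, v) in E2:
        e1 = (inverse[u], inverse[v])
        if e1 not in E1 or nu1[e1] != nu2[(u, v)]:
            return False
    return True

g1 = ({1, 2, 3}, {(1, 2), (2, 3)},
      {1: "a", 2: "b", 3: "a"}, {(1, 2): "x", (2, 3): "y"})
g2 = ({"p", "q", "r"}, {("r", "q"), ("q", "p")},
      {"r": "a", "q": "b", "p": "a"}, {("r", "q"): "x", ("q", "p"): "y"})

good = is_isomorphism({1: "r", 2: "q", 3: "p"}, g1, g2)
bad = is_isomorphism({1: "p", 2: "q", 3: "r"}, g1, g2)
```

The second mapping preserves the node labels but sends the edge (1, 2) to ("p", "q"), which is not an edge of the second graph, so it is rejected.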

Unfortunately, no polynomial runtime algorithm is known for the problem of graph isomorphism [25]. That is, in the worst case, the computational complexity of any of the available algorithms for graph isomorphism is exponential in the number of nodes of the two graphs. However, since most scenarios encountered in practice differ from the worst case, and furthermore the labels of both nodes and edges very often help to substantially reduce the complexity of the search, the actual computation time can still be manageable. Polynomial algorithms for graph isomorphism have been developed for special kinds of graphs, such as trees [1], ordered graphs [38], planar graphs [34], bounded-valence graphs [45], and graphs with unique node labels [18].

Standard procedures for testing graphs for isomorphism are based on tree search techniques with backtracking. The basic idea is that a partial node matching, which assigns nodes from the two graphs to each other, is iteratively expanded by adding new node-to-node correspondences. This expansion is repeated until either the edge structure constraint is violated or node or edge labels are inconsistent. In this case a backtracking procedure is initiated, i.e. the last node mappings are iteratively undone until a partial node mapping is found for which an alternative extension is possible. Obviously, if there is no further possibility for expanding the partial node matching without violating the constraints, the algorithm terminates, indicating that there is no isomorphism between the considered graphs. Conversely, finding a complete node-to-node correspondence without violating any of the structure or label constraints proves that the investigated graphs are isomorphic. In Fig. 7.3 (a) and (b) two isomorphic graphs are shown.
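The tree search with backtracking described above can be sketched compactly in Python. This is a plain, unoptimized illustration and not the specific algorithm of any cited work. The tuple representation `(V, E, mu, nu)`, the fixed matching order, and the names `find_isomorphism`, `consistent`, and `extend` are assumptions made for the example.

```python
def find_isomorphism(g1, g2):
    """Return a node mapping g1 -> g2 as a dict, or None if none exists."""
    V1, E1, mu1, nu1 = g1
    V2, E2, mu2, nu2 = g2
    if len(V1) != len(V2) or len(E1) != len(E2):
        return None                      # quick rejection
    order = list(V1)                     # order in which g1 nodes are matched

    def consistent(f, u, w):
        """Can the pair u -> w extend the partial mapping f?"""
        if mu1[u] != mu2[w]:
            return False
        # Compare edges between u and every matched node (and u itself).
        for x, y in list(f.items()) + [(u, w)]:
            for e1, e2 in (((u, x), (w, y)), ((x, u), (y, w))):
                if (e1 in E1) != (e2 in E2):
                    return False         # edge structure violated
                if e1 in E1 and nu1[e1] != nu2[e2]:
                    return False         # edge labels inconsistent
        return True

    def extend(f, used):
        if len(f) == len(order):
            return dict(f)               # complete correspondence found
        u = order[len(f)]
        for w in V2:
            if w not in used and consistent(f, u, w):
                f[u] = w
                used.add(w)
                result = extend(f, used)
                if result is not None:
                    return result
                del f[u]                 # backtrack: undo the last mapping
                used.discard(w)
        return None

    return extend({}, set())

g1 = ({1, 2, 3}, {(1, 2), (2, 3)},
      {1: "a", 2: "b", 3: "a"}, {(1, 2): "x", (2, 3): "y"})
g2 = ({"p", "q", "r"}, {("r", "q"), ("q", "p")},
      {"r": "a", "q": "b", "p": "a"}, {("r", "q"): "x", ("q", "p"): "y"})
g3 = ({"p", "q", "r"}, {("r", "q"), ("q", "p")},
      {"r": "a", "q": "b", "p": "a"}, {("r", "q"): "x", ("q", "p"): "z"})

mapping = find_isomorphism(g1, g2)
no_mapping = find_isomorphism(g1, g3)
```

The label tests in `consistent` prune the search tree early, which reflects the remark above that node and edge labels often reduce the search effort substantially.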

A well-known, and despite its age still very popular, algorithm implementing the idea of a tree search with backtracking for graph isomorphism is described in [89]. A more recent algorithm for graph isomorphism, also based on the idea of tree search, is the VF algorithm and its successor VF2 [17]. Here the
