Keyword Search in Databases, Part 16



3 GRAPH-BASED KEYWORD SEARCH

memory, and the edges between innernodes are stored in cache or on disk in the form of adjacency lists; the edges between a supernode and an innernode do not need to be stored explicitly. The weights of the different kinds of edges are defined as follows:

• supernode → supernode (S → S): The edge weight of $s_1 \to s_2$ is defined as the minimum weight of the edges between the innernodes of $s_1$ and those of $s_2$, i.e., $w_e((s_1, s_2)) = \min_{v_1 \in s_1, v_2 \in s_2} w_e((v_1, v_2))$, where the weight of edge $(v_1, v_2)$ is defined to be $\infty$ if it does not exist.

• supernode → innernode (S → I): The edge weight of $s_1 \to v_2$ is defined as $w_e((s_1, v_2)) = \min_{v_1 \in s_1} w_e((v_1, v_2))$. These edges need not be explicitly represented. During the graph traversal, if $s_1$ is an unexpanded supernode, and there is a supernode $s_2$ in the adjacency list of $s_1$, and $s_2$ is expanded, such edges can be enumerated by locating all innernodes $\{v_2 \in s_2 \mid \text{the adjacency list of } v_2 \text{ contains some innernode in } s_1\}$.

• innernode → supernode (I → S): The edge weight in this case is defined in a fashion analogous to the previous case.

• innernode → innernode (I → I): Edge weight is the same as in the original graph.
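The four edge-weight rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `w` of innernode edge weights, the representation of a supernode as a set of innernode names, and all function names are hypothetical and not part of the original description.

```python
import math

def supernode_edge_weight(s1, s2, w):
    """S -> S: minimum weight over all innernode pairs; inf if no edge exists."""
    return min((w.get((v1, v2), math.inf) for v1 in s1 for v2 in s2),
               default=math.inf)

def supernode_to_innernode_weight(s1, v2, w):
    """S -> I: minimum weight from any innernode of s1 to v2."""
    return min((w.get((v1, v2), math.inf) for v1 in s1), default=math.inf)

def innernode_to_supernode_weight(v1, s2, w):
    """I -> S: defined analogously to the S -> I case."""
    return min((w.get((v1, v2), math.inf) for v2 in s2), default=math.inf)

# Hypothetical toy data: directed innernode edges with weights (the I -> I case).
w = {("a", "c"): 4.0, ("b", "c"): 1.5, ("b", "d"): 3.0}
s1, s2 = {"a", "b"}, {"c", "d"}

print(supernode_edge_weight(s1, s2, w))           # weight of s1 -> s2
print(supernode_to_innernode_weight(s1, "d", w))  # weight of s1 -> d
```

A missing edge is treated as having weight infinity, so a pair of supernodes with no connecting innernode edge also gets weight infinity.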

When searching the multi-granular graph, the answers generated may contain supernodes; such answers are called supernode answers. If an answer does not contain any supernodes, it is called a pure answer. The final answers returned to users must be pure answers. The Iterative Expansion Search algorithm (IES) [Dalvi et al., 2008] is a multi-stage algorithm that is applicable to multi-granular graphs, as shown in Algorithm 27. Each iteration of IES can be broken up into two phases.

• Explore phase: Run an in-memory search algorithm on the current state of the multi-granular graph. The multi-granular graph is entirely in memory: the supernode graph is stored in main memory, and the details of expanded supernodes are stored in cache. When the search reaches an expanded supernode, it searches the corresponding innernodes in cache.

• Expand phase: Expand the supernodes found in the top-n (n > k) results of the previous phase and add them to the input graph to produce an expanded multi-granular graph, by loading all the corresponding innernodes into cache.

The graph produced at the end of the Expand phase of iteration i acts as the graph for iteration i + 1. Any in-memory graph search algorithm that treats all nodes (unexpanded supernodes and innernodes) in the same way can be used in the Explore phase. The multi-granular graph is maintained as a “virtual memory view”, i.e., when visiting an expanded supernode, the algorithm looks up its expansion in the cache, loading it into the cache first if it is not already there. The algorithm stops when all top-k results are pure. Other termination heuristics can be used to reduce the time taken for query execution, at the potential cost of missed results.
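The Explore/Expand loop described above can be sketched as follows. This is a minimal sketch, not the implementation of Dalvi et al.: the three callables `search`, `is_supernode`, and `expand` are hypothetical hooks, the `max_iters` guard is an added safety assumption, and the toy search is a stub that simply substitutes an innernode for each expanded supernode.

```python
def iterative_expansion_search(search, is_supernode, expand, k, n, max_iters=50):
    """Explore/Expand loop with caller-supplied (hypothetical) hooks.

    search(n)       -> up to n results, best first; each result is a tuple of node ids
    is_supernode(v) -> True if v is a still-unexpanded supernode
    expand(snodes)  -> makes the innernodes of the given supernodes available
    """
    assert n > k
    for _ in range(max_iters):
        results = search(n)                      # Explore phase
        top_k = results[:k]
        if all(not is_supernode(v) for r in top_k for v in r):
            return top_k                         # all top-k results are pure
        # Expand phase: collect every supernode appearing in the top-n results.
        snodes = {v for r in results for v in r if is_supernode(v)}
        expand(snodes)
    raise RuntimeError("no pure top-k results within max_iters iterations")

# Hypothetical toy setting: two supernodes and a stub search.
members = {"S1": ["a", "b"], "S2": ["c"]}        # supernode -> innernodes
expanded = set()

def toy_search(n):
    raw = [("k1", "S1"), ("k1", "S2"), ("k1", "x")]
    # Stub: an expanded supernode is replaced by its first innernode.
    return [tuple(members[v][0] if v in expanded else v for v in r) for r in raw][:n]

def toy_is_supernode(v):
    return v in members and v not in expanded

answers = iterative_expansion_search(toy_search, toy_is_supernode,
                                     expanded.update, k=2, n=3)
print(answers)
```

Each iteration calls `search` again from the beginning, which mirrors the restart behaviour of the iterative scheme.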

Algorithm 27 restarts the search (Explore phase) from scratch every time, which can lead to significantly increased CPU time. Dalvi et al. [2008] propose an alternative approach, called


Algorithm 27 Iterative Expansion Search(G, Q)

Input: a multi-granular graph G, and an l-keyword query Q = {k_1, k_2, ..., k_l}
Output: top-k pure results.

1: while stopping criteria not satisfied do
2:     /* Explore phase */
3:     Run any in-memory search algorithm on G to generate the top-n results
4:     /* Expand phase */
5:     for each result R in top-n results do
6:         SNodeSet ← SNodeSet ∪ {all supernodes from R}
7:     Expand all supernodes in SNodeSet and add them to G
8: output top-k pure results

Algorithm 28 Incremental Expansion Backward Search(G, Q)

Input: a multi-granular graph G, and an l-keyword query Q = {k_1, k_2, ..., k_l}
Output: top-k pure results.

1: while less than k pure results generated do
2:     Result ← BackwardSearch.GetNextResult()
3:     if Result contains a supernode then
4:         Expand one or more supernodes in Result and update the SPI trees that contain those expanded supernodes
5: output top-k pure results

incremental expansion. When a supernode answer is generated, one or more supernodes in the answer are expanded. However, instead of restarting each time supernodes are expanded, incremental expansion updates the state of the search algorithm. Once the state is updated, the search continues from where it left off, on the modified graph. Algorithm 28 shows the Incremental Expansion Backward search (IEB), where the in-memory search is implemented by a backward search algorithm. There is one shortest path iterator (SPI) tree per keyword $k_i$, which contains all nodes “touched” by Dijkstra’s algorithm, including explored nodes and fringe nodes, starting from $k_i$ (or, more precisely, $S_i$). More accurately, the SPI tree does not contain graph nodes; rather, each tree-node of an SPI tree contains a pointer to a graph node. From the SPI tree, the shortest path from an explored node to a keyword node can be identified. The backward search algorithm expands each SPI tree using Dijkstra’s algorithm. When an answer is output by the backward search algorithm, if it contains any supernode, one or more supernodes from the answer are expanded; otherwise it is output. When a supernode is expanded, the SPI trees that contain this supernode should be updated to include all of its innernodes and exclude the supernode itself.
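To make the role of an SPI tree concrete, the following sketch builds one tree for a single keyword by running Dijkstra's algorithm backward from the keyword nodes and then reads off the shortest path from an explored node to a keyword node. It is an illustration only: it runs Dijkstra to completion rather than incrementally, it does not model supernode expansion, and the names `build_spi_tree`, `path_to_keyword`, and the `reverse_adj` layout are assumptions made here.

```python
import heapq

def build_spi_tree(reverse_adj, sources):
    """Backward Dijkstra from the keyword nodes `sources`.

    reverse_adj[u] lists (v, weight) for every data-graph edge v -> u.
    Returns {node: (distance, next_hop)}, where next_hop is the next node
    on the shortest path toward a keyword node (None for a keyword node).
    """
    tree = {s: (0.0, None) for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    explored = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in explored:
            continue                      # stale heap entry
        explored.add(u)
        for v, weight in reverse_adj.get(u, []):
            nd = d + weight
            if v not in tree or nd < tree[v][0]:
                tree[v] = (nd, u)         # v is now a fringe node reached via u
                heapq.heappush(heap, (nd, v))
    return tree

def path_to_keyword(tree, node):
    """Follow next-hop pointers from `node` to a keyword node."""
    path = [node]
    while tree[path[-1]][1] is not None:
        path.append(tree[path[-1]][1])
    return path

# Hypothetical data graph: a -> b (1), b -> k (1), a -> k (5); k is the keyword node.
reverse_adj = {"k": [("b", 1.0), ("a", 5.0)], "b": [("a", 1.0)]}
tree = build_spi_tree(reverse_adj, ["k"])
print(path_to_keyword(tree, "a"))
```

Storing only a distance and a next-hop pointer per touched node is what keeps the tree separate from the graph itself, in the spirit of the tree-nodes that point to graph nodes.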


3.5 SUBGRAPH-BASED KEYWORD SEARCH

The previous sections define the answer of a keyword query as a Q-subtree, which is a directed subtree. In the following, we show two subgraph-based notions of an answer to a keyword query, namely, the r-radius steiner graph and the multi-center induced graph.

3.5.1 r-RADIUS STEINER GRAPH

Li et al. [2008a] define the result of an l-keyword query as an r-radius steiner subgraph. The graph is unweighted and undirected, and the length of a path is defined as the number of edges in it. The definition of r-radius steiner graph is based on the following concepts.

Definition 3.11 Centric Distance. Given a graph G and any node $v \in V(G)$, the centric distance of v in G, denoted as $\mathrm{CD}(v)$, is the maximum among the shortest distances between v and any node $u \in V(G)$, i.e., $\mathrm{CD}(v) = \max_{u \in V(G)} \mathrm{dist}(u, v)$.

Definition 3.12 Radius. The radius of a graph G, denoted as $R(G)$, is the minimum value among the centric distances of every node in G, i.e., $R(G) = \min_{v \in V(G)} \mathrm{CD}(v)$. G is called an r-radius graph if its radius is exactly r.
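Since the graph is unweighted and undirected, both quantities can be computed with breadth-first search. The sketch below assumes an adjacency-list dictionary and treats a disconnected graph as having infinite centric distances; the function names and the toy path graph are illustrative only.

```python
import math
from collections import deque

def bfs_distances(adj, src):
    """Edge-count distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centric_distance(adj, v):
    """CD(v): the largest shortest distance from v to any node of the graph."""
    dist = bfs_distances(adj, v)
    if len(dist) < len(adj):          # assumption: unreachable nodes give infinity
        return math.inf
    return max(dist.values())

def radius(adj):
    """R(G): the smallest centric distance over all nodes."""
    return min(centric_distance(adj, v) for v in adj)

# Hypothetical path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(centric_distance(adj, "a"), centric_distance(adj, "b"), radius(adj))
```

On the path graph the end node a is three edges from d, while the inner node b is at most two edges from any node, so the path is a 2-radius graph.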

Definition 3.13 r-Radius Steiner Graph. Given an r-radius graph G and a keyword query Q, a node v in G is called a content node if it contains some of the input keywords. A node s is called a steiner node if there exist two content nodes, u and v, such that s is on a simple path between u and v. The subgraph of G composed of the steiner nodes and associated edges is called an r-radius steiner graph (SG). The radius of an r-radius steiner graph can be smaller than r.

Example 3.14 Figure 3.9(a) shows two subgraphs, SG1 and SG2, of the data graph shown in Figure 3.1(e). In SG1, the centric distances of t1 and t8 are CD(t1) = 2 and CD(t8) = 3, respectively. In SG2, the centric distances of t1 and t8 are CD(t1) = 3 and CD(t8) = 3, respectively. The radii of SG1 and SG2 are R(SG1) = 2 and R(SG2) = 3, respectively. For a keyword query Q = {Brussels, EU}, one 2-radius steiner graph is shown in Figure 3.9(b), where t6 contains the keyword “Brussels” and t3 contains the keyword “EU”; it is obtained by removing the non-steiner nodes from SG1.

Note that the definition of r-radius steiner graph is based on the r-radius subgraph. A more general definition of r-radius steiner graph would be any induced subgraph satisfying the following two properties: (1) the radius is no more than r, and (2) every node is either a content node or a steiner node. The actual problem of a keyword query in this setting is to find r-radius subgraphs; the corresponding r-radius steiner graph is obtained in a post-processing step, as described by the definition.


Figure 3.9: 2-radius steiner graph for Q = {Brussels, EU}. (a) Two subgraphs; (b) 2-radius steiner graph.

The approaches to find r-radius subgraphs are based on the adjacency matrix, $M = (m_{ij})_{n \times n}$, with respect to $G_D$, which is an $n \times n$ Boolean matrix. An element $m_{ij}$ is 1 if and only if there is an edge between $v_i$ and $v_j$; $m_{ii}$ is 1 for all i. $M^r = M \times M \times \cdots \times M = (m^r_{ij})_{n \times n}$ is the r-th power of the adjacency matrix M. An element $m^r_{ij}$ is 1 if and only if the length of the shortest path between $v_i$ and $v_j$ is less than or equal to r. $N_i^r = \{v_j \mid m^r_{ij} = 1\}$ is the set of nodes that have a path to $v_i$ with distance no larger than r. $G_i^r$ denotes the subgraph induced by the node set $N_i^r$. $G^r_{v_i}$ ($N^r_{v_i}$) can be used interchangeably with $G_i^r$ ($N_i^r$). We use $G_i \subseteq G_j$ to denote that $G_i$ is a subgraph of $G_j$. The r-radius subgraph is defined based on the $G_i^r$'s. The following lemma is used to find all the r-radius subgraphs [Li et al., 2008a].
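The Boolean matrix power and the neighborhoods $N_i^r$ can be sketched with plain nested lists. The code is an illustration under simple assumptions (a small dense matrix, repeated multiplication rather than fast exponentiation, r at least 1); the helper names are chosen here and do not come from Li et al.

```python
def bool_matmul(a, b):
    """Boolean product: entry (i, j) is True if some k has a[i][k] and b[k][j]."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def bool_power(m, r):
    """r-th Boolean power of m by repeated multiplication (assumes r >= 1)."""
    result = m
    for _ in range(r - 1):
        result = bool_matmul(result, m)
    return result

def neighborhood(m, r, i):
    """N_i^r: indices j whose entry in row i of M^r is set."""
    row = bool_power(m, r)[i]
    return {j for j, bit in enumerate(row) if bit}

# Hypothetical path graph v0 - v1 - v2 - v3, with the diagonal set to 1.
M = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
print(neighborhood(M, 1, 0), neighborhood(M, 2, 0), neighborhood(M, 2, 1))
```

Setting the diagonal to 1 is what makes the r-th power capture all paths of length at most r rather than exactly r.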

Lemma 3.15 [Li et al., 2008a] Given a graph G, with $R(G) \ge r > 1$, $\forall i$, $1 \le i \le |V(G)|$, $G_i^r$ is an r-radius subgraph, if, $\forall v_k \in N_i^r$, $N_i^r \not\subseteq N_k^{r-1}$.
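Read as "no node $v_k$ of $N_i^r$ reaches all of $N_i^r$ within r − 1 steps", the condition of the lemma can be checked directly. The sketch below uses breadth-first search instead of matrix powers to obtain the neighborhoods; this reading of the condition, the function names, and the toy graph are assumptions made for illustration.

```python
from collections import deque

def within(adj, src, r):
    """Nodes at distance <= r from src, i.e., the set N^r of src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue                  # do not step beyond distance r
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def passes_lemma(adj, i, r):
    """True if no v_k in N_i^r has N_i^r contained in N_k^(r-1)."""
    n_i = within(adj, i, r)
    return all(not n_i <= within(adj, k, r - 1) for k in n_i)

# Hypothetical path graph 0 - 1 - 2 - 3 (radius 2).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(passes_lemma(adj, 0, 2), passes_lemma(adj, 1, 2))
```

For node 0 the neighborhood {0, 1, 2} is already covered by node 1 within one step, so the induced subgraph has radius 1 and the check fails; for node 1 the neighborhood is the whole path, whose radius is exactly 2.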

Note that the above lemma is a sufficient condition for identifying r-radius subgraphs, but it is not a necessary condition. In principle, there can be exponentially many r-radius subgraphs of G. Li et al. [2008a] only consider n = |V(G)| subgraphs, each uniquely determined by one node in G, although other r-radius subgraphs are possible.

An r-radius subgraph $G_i^r$ is maximal if and only if there is no other r-radius subgraph $G_j^r$ that is a supergraph of $G_i^r$, i.e., $G_i^r \subseteq G_j^r$. Li et al. [2008a] consider those maximal r-radius subgraphs $G_i^r$ as the subgraphs that will generate r-radius steiner subgraphs. All these maximal r-radius subgraphs $G_i^r$ can be pre-computed and indexed on disk, because the maximal r-radius subgraphs are query independent.


The objective here is to find the top-k r-radius steiner subgraphs, and ranking functions are introduced to rank the r-radius steiner subgraphs. Each keyword term $k_i$ has an IR-style score:

$$\mathrm{Score}_{IR}(k_i, SG) = \frac{\mathrm{ntf}(k_i, G) \times \mathrm{idf}(k_i)}{\mathrm{ndl}(G)}$$

that is a normal TF-IDF score, where $\mathrm{idf}(k_i)$ indicates the relative importance of keyword $k_i$ and $\mathrm{ntf}(k_i, G)$ measures the relevance of G to keyword $k_i$. Here G is the subgraph from which the r-radius steiner subgraph is generated. Each keyword pair $(k_i, k_j)$ has a structural score, which measures the compactness of the two keywords in SG:

$$\mathrm{Sim}(k_i, k_j \mid SG) = \frac{1}{|C_{k_i} \cup C_{k_j}|} \sum_{v_i \in C_{k_i},\, v_j \in C_{k_j}} \mathrm{Sim}(v_i, v_j \mid SG)$$

where $C_{k_i}$ ($C_{k_j}$) is the set of keyword nodes in SG that contain $k_i$ ($k_j$), and $\mathrm{Sim}(v_i, v_j \mid SG) = \sum_{p \in \mathrm{path}(v_i, v_j \mid SG)} \frac{1}{(\mathrm{len}(p) + 1)^2}$, where $\mathrm{path}(v_i, v_j \mid SG)$ denotes the set of all paths between $v_i$ and $v_j$ in SG and $\mathrm{len}(p)$ is the length of path p. Intuitively, $\mathrm{Sim}(k_i, k_j \mid SG)$ measures how closely the two keywords, $k_i$ and $k_j$, are connected to each other. The final score of SG is defined as follows,

$$\mathrm{Score}(\{k_1, \cdots, k_l\}, SG) = \sum_{1 \le i < j \le l} \mathrm{Score}(k_i, k_j \mid SG)$$

where

$$\mathrm{Score}(k_i, k_j \mid SG) = \mathrm{Sim}(k_i, k_j \mid SG) \times (\mathrm{Score}_{IR}(k_i, SG) + \mathrm{Score}_{IR}(k_j, SG))$$
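The structural score and the final score can be sketched for a small graph by enumerating paths explicitly. Several assumptions are made here and are not stated in the text: paths are taken to be simple paths, a node that contains both keywords contributes a single path of length 0, and the IR scores are given as precomputed hypothetical numbers rather than derived from ntf, idf, and ndl.

```python
from itertools import combinations

def simple_path_lengths(adj, src, dst):
    """Yield the edge count of every simple path from src to dst."""
    stack = [(src, frozenset([src]), 0)]
    while stack:
        u, seen, length = stack.pop()
        if u == dst:
            yield length
            continue
        for v in adj[u]:
            if v not in seen:
                stack.append((v, seen | {v}, length + 1))

def node_sim(adj, vi, vj):
    """Sim(v_i, v_j | SG): sum of 1 / (len(p) + 1)^2 over the paths p."""
    return sum(1.0 / (n + 1) ** 2 for n in simple_path_lengths(adj, vi, vj))

def keyword_sim(adj, c_i, c_j):
    """Sim(k_i, k_j | SG) for the keyword-node sets c_i and c_j."""
    total = sum(node_sim(adj, vi, vj) for vi in c_i for vj in c_j)
    return total / len(c_i | c_j)

def total_score(adj, content, ir):
    """Score({k_1..k_l}, SG): sum of the pair scores over all keyword pairs."""
    return sum(keyword_sim(adj, content[a], content[b]) * (ir[a] + ir[b])
               for a, b in combinations(sorted(content), 2))

# Hypothetical steiner graph x - y - z with one content node per keyword.
path_graph = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
content = {"Brussels": {"x"}, "EU": {"z"}}
ir = {"Brussels": 0.5, "EU": 1.0}          # made-up IR scores
print(total_score(path_graph, content, ir))
```

Enumerating all simple paths is exponential in general, so a sketch like this is only practical on the small subgraphs that the radius bound produces.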

According to the definition of r-radius steiner subgraph, $\mathrm{Score}(k_i, k_j \mid SG)$ can be directly computed on G. Then $\mathrm{Score}(k_i, k_j \mid SG)$ can be pre-computed for each possible keyword pair and each maximal r-radius subgraph $G_i^r$. An index can be built by storing, for each possible keyword pair, a list of maximal r-radius subgraphs G in decreasing order of $\mathrm{Score}(k_i, k_j \mid SG)$. When a keyword query arrives, it can directly use these lists, and by applying the Threshold Algorithm [Fagin, 1998], the top-k maximal r-radius subgraphs can be obtained; the top-k r-radius steiner subgraphs can then be computed by refining the corresponding r-radius subgraphs.
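The use of the score-sorted lists can be illustrated with a compact version of the Threshold Algorithm for a sum aggregate. This is a sketch of the general technique, not the index of Li et al.: each list is assumed to be one keyword pair's list of (subgraph id, score) entries in decreasing score order, a subgraph missing from a list is assumed to score 0 there, and the ids and scores are made up.

```python
import heapq

def threshold_algorithm(lists, k):
    """Top-k objects by summed score over several score-sorted lists."""
    lookup = [dict(lst) for lst in lists]          # random access by object id
    seen = {}                                      # object id -> total score
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0
        for lst in lists:                          # one sorted access per list
            if depth < len(lst):
                obj, score = lst[depth]
                threshold += score
                if obj not in seen:
                    seen[obj] = sum(t.get(obj, 0.0) for t in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best total cannot be beaten by any unseen object.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical lists for two keyword pairs over subgraphs G1, G2, G3.
list_a = [("G1", 0.9), ("G2", 0.8), ("G3", 0.1)]
list_b = [("G2", 0.7), ("G3", 0.6), ("G1", 0.2)]
print(threshold_algorithm([list_a, list_b], k=1))
```

With k = 1 the scan stops after two rounds of sorted access, because the best total seen so far already meets the threshold formed by the scores at the current depth.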

3.5.2 MULTI-CENTER INDUCED GRAPH

In contrast to tree-based results that are single-center (root) induced trees, in this section, we consider query answers that are multi-centered induced subgraphs of $G_D$. These are referred to as communities [Qin et al., 2009b]. The vertex set of a community $R(V, E)$, $V(R)$, is a union of three subsets, $V = V_c \cup V_l \cup V_p$, where $V_l$ represents a set of keyword nodes (knodes), $V_c$ represents a set of center nodes (cnodes) (for every cnode $v_c \in V_c$, there exists at least a single path such that $\mathrm{dist}(v_c, v_l) \le R_{max}$ for any $v_l \in V_l$, where $R_{max}$ is introduced to control the size of a community), and $V_p$ represents a
