Keyword Search in Databases (P10)



as dis(t, c). If dis(t, k_i) + dis(t, c) ≤ Dmax, the tuple t will be projected from the RDB. Here, both dis(t, c) and dis(t, k_i) are in the range [0, Dmax]. In this phase, all such tuples t are projected. They are sufficient to compute all multi-center communities, because the set of such tuples contains every keyword-tuple, center-tuple, and path-tuple needed to compute all communities. This is illustrated in Figure 2.21(c), for l = 2.
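The projection condition can be illustrated with a minimal Python sketch. It assumes the two distances have already been computed and are available as dictionaries; the function name `project_tuples`, the dictionary layout, and the sample distances are hypothetical and only serve the illustration.

```python
def project_tuples(dist_to_keyword, dist_to_center, d_max):
    """Keep tuples t with dis(t, k_i) + dis(t, c) <= d_max.

    Both arguments map a tuple id to a precomputed shortest distance.
    A tuple missing from either map is treated as unreachable within d_max.
    """
    kept = set()
    for t, dk in dist_to_keyword.items():
        dc = dist_to_center.get(t)
        if dc is not None and dk + dc <= d_max:
            kept.add(t)
    return kept


# Hypothetical distances for one keyword k_i and the nearest center c.
dist_k = {"t1": 0, "t2": 1, "t3": 3, "t4": 2}
dist_c = {"t1": 2, "t2": 1, "t3": 1, "t5": 0}
projected = project_tuples(dist_k, dist_c, d_max=3)
print(sorted(projected))
```

Here t3 is dropped because 3 + 1 exceeds Dmax = 3, and t4 is dropped because it has no distance to a center.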

The new DC() algorithm to compute communities under distinct core semantics is given in Algorithm 13. Suppose that there are n relations in an RDB for an l-keyword query. The first reduction phase is in lines 1-7. The second and third reduction phases are done in a for-loop (lines 8-14), in which the second reduction phase is line 9 and the third reduction phase is lines 10-14. Lines 15-17 are similar to DC-Naive(): they compute communities using Pair_i, for 1 ≤ i ≤ l, and the S relation.

For the first reduction, it computes G_{j,i} for every keyword k_i and every relation R_j separately by calling a procedure PairRoot() (Algorithm 14). PairRoot() is designed in a similar fashion to Pair(). The main difference is that PairRoot() computes tuples t that are in shortest distance to a virtual node (keyword or center) within Dmax. Take keyword-nodes as an example. The shortest distance to a tuple containing a keyword is more important than which tuple contains the keyword. Therefore, we only maintain the shortest distance (line 9 in Algorithm 14). PairRoot() returns a collection of G_{j,i}, for a given keyword k_i, for 1 ≤ j ≤ n. Note that G_i = ⋃_{j=1}^{n} G_{j,i}.

In lines 3-4, it projects R_j using the semijoin R_{j,i} ← R_j ⋉ G_{j,i}. Here, R_{j,i} (⊆ R_j) is the set of tuples that are within Dmax from a virtual keyword-node k_i. Note that Y = ⋃_{j=1}^{n} Y_j. X_j (⊆ R_j) is the set of centers in relation R_j (line 7). In line 9, starting from all center nodes (X_1, ..., X_n), it computes W_{j,i} for keyword k_i, for 1 ≤ j ≤ n. Note that W_i = ⋃_{j=1}^{n} W_{j,i}. In lines 10-14, it further projects R'_{j,i} out of R_{j,i}, for keyword k_i, for 1 ≤ j ≤ n. In line 16, it computes Pair_i using the projected relations R'_{1,i}, R'_{2,i}, ..., R'_{n,i}.

The new algorithm DR() to compute distinct roots is given in Algorithm 15.
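As a rough illustration of the idea behind PairRoot(), keeping only the shortest distance from a tuple to a virtual node and discarding anything beyond Dmax, the following Python sketch runs a Dijkstra-style search from all tuples that contain a keyword at once. This is not Algorithm 14, which is not reproduced in this text; the in-memory adjacency-list graph, the function name `shortest_within`, and the sample data are assumptions made for the illustration.

```python
import heapq


def shortest_within(graph, sources, d_max):
    """Shortest distance from the nearest source to every node within d_max.

    graph:   dict mapping a node to a list of (neighbor, weight) pairs
    sources: nodes at distance 0 (e.g., all tuples containing keyword k_i),
             which together play the role of one virtual keyword node
    """
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in dist]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter distance is already known
        for v, w in graph.get(u, []):
            nd = d + w
            # Only the shortest distance is kept, and only if it is within d_max.
            if nd <= d_max and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical graph; "a" and "e" are the tuples that contain the keyword.
graph = {
    "a": [("b", 1)],
    "b": [("c", 1), ("d", 5)],
    "e": [("c", 1)],
    "c": [("d", 1)],
}
dist = shortest_within(graph, ["a", "e"], d_max=2)
print(dist)
```

Note that the result records how far each tuple is from the nearest keyword tuple, not which keyword tuple that is, which mirrors the observation that only the shortest distance needs to be maintained.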


CHAPTER 3

Graph-Based Keyword Search

In this chapter, we show how to answer keyword queries on a general data graph using graph algorithms. It is worth noting that an XML database or the World Wide Web can also be modeled as a graph, and that there also exist general graphs with textual information stored on the nodes. In the previous chapter, we discussed keyword search on a relational database (RDB) using the underlying relational schema that specifies how tuples are connected to each other. Based on the primary and foreign key references defined on a relational schema, an RDB can be modeled as a data graph where nodes represent tuples and edges represent the foreign key references between tuples.

In Section 3.1, we discuss graph models and define the problem precisely. In Section 3.2, we introduce two algorithms that will be used in the subsequent discussions. One concerns enumeration with polynomial delay, and the other is Dijkstra's single-source shortest-path algorithm. In Section 3.3, we discuss several algorithms, both exact and approximate, that find Steiner trees as answers for l-keyword queries. In Section 3.4, we discuss algorithms that find tree-structured answers which have a distinct root. Some indexing approaches and algorithms that deal with external graphs on disk will also be discussed. In Section 3.5, we discuss algorithms that find subgraphs.

Abstract directed weighted graph: As an abstraction, we consider a general directed graph in this chapter, G_D(V, E), where edges have weight w_e(⟨u, v⟩). For an undirected graph, backward edges with the same weights can be added to make it a directed graph. In some definitions, the nodes also have weights to reflect their prestige, such as the PageRank value [Brin and Page, 1998]. The algorithms remain the same with little modification, so we assume that only edges have weights, for ease of presentation. We use V(G) and E(G) to denote the set of nodes and the set of edges of a given graph G, respectively. We also denote the number of nodes and the number of edges in graph G by n = |V(G)| and m = |E(G)|. In the following, we discuss how to model an RDB and an XML database as a graph, and how weights are assigned to edges.
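One way to picture this abstraction is the short Python sketch below. It stores a directed weighted graph as a dictionary of adjacency dictionaries and turns an undirected edge list into a directed graph by adding a backward edge with the same weight. The representation and the name `to_directed` are choices made for this illustration, not something prescribed by the text.

```python
from collections import defaultdict


def to_directed(undirected_edges):
    """Build a directed weighted graph from undirected (u, v, weight) triples.

    Each undirected edge yields a forward edge and a backward edge
    carrying the same weight.
    """
    adj = defaultdict(dict)
    for u, v, w in undirected_edges:
        adj[u][v] = w  # forward edge <u, v>
        adj[v][u] = w  # backward edge <v, u> with the same weight
    return adj


g = to_directed([("a", "b", 2.0), ("b", "c", 1.5)])
n = len(g)                                       # n = |V(G)|
m = sum(len(nbrs) for nbrs in g.values())        # m = |E(G)|
print(n, m)
```

For this two-edge undirected input, the directed graph has n = 3 nodes and m = 4 edges.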

The (structural and textual) information stored in an RDB can be captured by a weighted directed graph, G_D = (V, E). Each tuple t_v in the RDB is modeled as a node v ∈ V in G_D, associated with the keywords contained in the corresponding tuple. For any two nodes u, v ∈ V, there is a directed edge ⟨u, v⟩ (or u → v) if and only if there exists a foreign key on tuple t_u that refers to the primary key in tuple t_v. This can be easily extended to other types of connections; for example, the model can be extended to include edges corresponding to inclusion dependencies [Bhalotia et al., 2002],


(a) Countries

TID   Code   Name      Capital   Government
t1    B      Belgium   BRU       Monarchy
t2    NOR    Norway    OSL       Monarchy

(b) Organizations

TID   Name   Headq   #members
t3    EU     BRU     25
t4    ESA    PAR     17

(c) Cities

TID   Code   Name       Country   Population
t5    ANT    Antwerp    B         455,148
t6    BRU    Brussels   B         141,312
t7    OSL    Oslo       NOR       533,050

(d) Members

TID   Country   Organization
t8    B         ESA
t9    B         EU
t10   NOR       ESA

(e) Data Graph [graph not reproduced: the tuple nodes t1 to t10 of G_D inside a dotted rectangle, together with the keyword nodes Oslo, Norway, ESA, Brussels, Belgium, Antwerp, and EU of the augmented graph G_D^A]

Figure 3.1: A small portion of the Mondial RDB and its data graph [Golenberg et al., 2008]

where the values in the referencing column of the referencing relation are contained in the referred column of the referred relation, but the referred column need not be a key of the referred relation.

Example 3.1 Figure 3.1 shows a small portion of the Mondial relational database. The Name attributes of the first three relations store text information where keywords can be matched. The directed graph transformed from the RDB, G_D, is depicted in the dotted rectangle in Figure 3.1(e). In Figure 3.1(e), there are keyword nodes for all words appearing in the text attributes of the database. The edge from t_i to keyword node w_j means that the node t_i contains word w_j.
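The transformation in this example can be sketched in Python as follows. The rows are copied from Figure 3.1, but the foreign keys are an assumption inferred from the column values (for instance, that Capital and Headq refer to the Code of a city); the helper name `fk_edges` is hypothetical. The sketch produces one directed edge per foreign key reference and one keyword per Name value.

```python
# Fragment of the Mondial tables from Figure 3.1, keyed by tuple id.
countries = {
    "t1": {"Code": "B", "Name": "Belgium", "Capital": "BRU"},
    "t2": {"Code": "NOR", "Name": "Norway", "Capital": "OSL"},
}
organizations = {
    "t3": {"Name": "EU", "Headq": "BRU"},
    "t4": {"Name": "ESA", "Headq": "PAR"},
}
cities = {
    "t5": {"Code": "ANT", "Name": "Antwerp", "Country": "B"},
    "t6": {"Code": "BRU", "Name": "Brussels", "Country": "B"},
    "t7": {"Code": "OSL", "Name": "Oslo", "Country": "NOR"},
}
members = {
    "t8": {"Country": "B", "Organization": "ESA"},
    "t9": {"Country": "B", "Organization": "EU"},
    "t10": {"Country": "NOR", "Organization": "ESA"},
}


def fk_edges(referencing, fk_attr, referenced, key_attr):
    """Directed edges (u, v) where tuple u's foreign key matches tuple v's key."""
    key_to_tid = {row[key_attr]: tid for tid, row in referenced.items()}
    return [
        (tid, key_to_tid[row[fk_attr]])
        for tid, row in referencing.items()
        if row[fk_attr] in key_to_tid  # e.g., PAR has no city in this fragment
    ]


# ASSUMPTION: these foreign keys are inferred from the column values.
edges = (
    fk_edges(countries, "Capital", cities, "Code")
    + fk_edges(organizations, "Headq", cities, "Code")
    + fk_edges(cities, "Country", countries, "Code")
    + fk_edges(members, "Country", countries, "Code")
    + fk_edges(members, "Organization", organizations, "Name")
)

# Keywords come from the Name attribute of the first three relations.
keywords = {
    tid: row["Name"]
    for table in (countries, organizations, cities)
    for tid, row in table.items()
}
print(len(edges), keywords["t6"])
```

Under these assumed foreign keys, the tuple t9 (Belgium is a member of the EU) gets edges to both t1 and t3, while t4 has no outgoing edge because its headquarters city is not part of the fragment.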

Weights are assigned to edges to reflect the (directional) proximity of the corresponding tuples, denoted as w_e(⟨u, v⟩). A commonly used weighting scheme [Bhalotia et al., 2002; Ding et al., 2007] is as follows. For a foreign key reference from t_u to t_v, the weight of the directed edge ⟨u, v⟩ is given by Eq. 3.1, and the weight of the backward edge ⟨v, u⟩ is given by Eq. 3.2, where N_in(v) is the number of tuples that refer to t_v, which is the tuple corresponding to node v.
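Eq. 3.1 and Eq. 3.2 are not reproduced in this text, so the following Python sketch rests on an assumption: it uses a scheme of the kind commonly associated with the cited works, in which a forward edge has weight 1 and a backward edge has weight log2(1 + N_in(v)). Treat both formulas, the function name `assign_weights`, and the rule of keeping the smaller weight when two references produce the same directed edge as illustrative choices, not as the definitions from the text.

```python
import math
from collections import Counter


def assign_weights(fk_edges):
    """Assign weights to forward and backward edges of foreign key references.

    ASSUMED scheme (the original equations are not shown in the text):
      forward  edge <u, v>: 1
      backward edge <v, u>: log2(1 + N_in(v))
    where N_in(v) is the number of tuples that refer to t_v.
    """
    n_in = Counter(v for _, v in fk_edges)
    weights = {}

    def keep_min(edge, w):
        # If an edge is produced twice, keep the smaller weight (a design choice).
        weights[edge] = min(w, weights.get(edge, math.inf))

    for u, v in fk_edges:
        keep_min((u, v), 1.0)
        keep_min((v, u), math.log2(1 + n_in[v]))
    return weights


# Hypothetical references: three tuples refer to t1, one refers to t3.
refs = [("t5", "t1"), ("t8", "t1"), ("t9", "t1"), ("t9", "t3")]
w = assign_weights(refs)
print(w[("t8", "t1")], w[("t1", "t8")], w[("t3", "t9")])
```

The effect of such a scheme is that following a reference backward from a heavily referenced tuple costs more than following it forward, so a popular tuple does not make everything that points to it look closely related.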

An XML document can be naturally represented as a directed graph. Each element is modeled as a node, and the sub-element relationships and ID/IDREF reference relationships are modeled as directed edges. One possible weighting scheme [Golenberg et al., 2008] is as follows. First, consider the edges corresponding to sub-element relationships. Let out(v → t) denote the number of edges that lead from v to nodes that have the tag t. Similarly, in(t → v) denotes the number of edges that lead to v from nodes with tag t. The weight of an edge ⟨v_1, v_2⟩, where the tags of v_1 and v_2 are t_1 and t_2, respectively, is defined as follows.

w_e(⟨v_1, v_2⟩) = log(1 + α · out(v_1 → t_2) + (1 − α) · in(t_1 → v_2))

The general idea is that an edge carries more information if only a few edges emanate from v_1 and lead to nodes that have the same tag as v_2, or only a few edges enter v_2 and emanate from nodes with the same tag as v_1. The weights of edges that correspond to ID references are set to 0, as they represent strong semantic connections.
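The formula for sub-element edges can be computed as in the Python sketch below. The text does not state the base of the logarithm or a value for α, so the natural logarithm and α = 0.5 are assumptions here, as are the function name `sub_element_weights` and the tiny sample document.

```python
import math
from collections import Counter


def sub_element_weights(edges, tag, alpha=0.5):
    """Weight each sub-element edge <v1, v2> by
    log(1 + alpha * out(v1 -> t2) + (1 - alpha) * in(t1 -> v2)).

    edges: list of (parent, child) node pairs
    tag:   dict mapping a node to its element tag
    ASSUMPTIONS: natural logarithm and alpha = 0.5 by default.
    """
    out_count = Counter((u, tag[v]) for u, v in edges)  # out(v1 -> t2)
    in_count = Counter((tag[u], v) for u, v in edges)   # in(t1 -> v2)
    return {
        (u, v): math.log(
            1 + alpha * out_count[(u, tag[v])] + (1 - alpha) * in_count[(tag[u], v)]
        )
        for u, v in edges
    }


# Hypothetical document: a country element with one name and two provinces.
tag = {"c1": "country", "n1": "name", "p1": "province", "p2": "province"}
edges = [("c1", "n1"), ("c1", "p1"), ("c1", "p2")]
weights = sub_element_weights(edges, tag)
print(weights[("c1", "n1")], weights[("c1", "p1")])
```

The edge to the single name child gets a smaller weight than the edges to the two province children, which matches the intuition that an edge is more informative when few sibling edges lead to the same tag.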

The web can also be modeled as a directed graph [Li et al., 2001], G_D = (V, E), where V is the set of physical pages and E is the set of hyperlinks or semantic links connecting these pages. For a keyword query, this approach finds connected trees called "information units," each of which can be viewed as a logical web document consisting of multiple physical pages that form one atomic retrieval unit. Other data models, e.g., RDF and OWL, which are two major W3C standards for the Semantic Web, also conform to the node-labeled graph model.

Given a directed weighted data graph G_D, an l-keyword query consists of a set of l ≥ 2 keywords, i.e., Q = {k_1, k_2, ..., k_l}. The problem is to find a set of subgraphs of G_D, R(G_D, Q) = {R_1(V, E), R_2(V, E), ...}, where each R_i(V, E) is a connected subgraph of G_D that contains all the l keywords. Different requirements for the properties of the subgraphs that should be returned have been proposed in the literature. There are mainly two different structural requirements: (1) a reduced tree that contains all the keywords, which we refer to as tree-based semantics; (2) a subgraph, such as an r-radius Steiner graph [Li et al., 2008a] or a multi-center induced graph [Qin et al., 2009b], which we call subgraph-based semantics. In the following, we present the tree-based semantics; we study the subgraph-based semantics in detail in Section 3.5.

Tree Answer: In the tree-based semantics, an answer to Q (called a Q-subtree) is defined as any subtree T of G_D that is reduced with respect to Q. Formally, there exists a sequence of l nodes in T, v_1, ..., v_l, where v_i ∈ V(T) and v_i contains keyword term k_i for 1 ≤ i ≤ l, such that the leaves of T can only come from those nodes, i.e., leaves(T) ⊆ {v_1, v_2, ..., v_l}, and the root of T must also be one of those nodes if it has only one child, i.e., root(T) ∈ {v_1, v_2, ..., v_l}.


[Figure not reproduced: five subgraphs, T1, A1, A2, A3, and T2, built from the tuple nodes t1, t3, t6, and t9 and the keyword nodes Belgium, Brussels, and EU]

Figure 3.2: Subtrees [Golenberg et al., 2008]

Example 3.2 Consider the five subgraphs in Figure 3.2. Ignoring all the leaf nodes (which are keyword nodes), four of them are directed rooted subtrees, namely T1, A1, A2, and A3, and the subgraph T2 is not a directed rooted subtree. For a 2-keyword query Q = {Brussels, EU}: (1) T1 is not a Q-subtree, because the root t9 has only one child and t9 does not contain any keyword; (2) A1 is a Q-subtree; (3) A2 is also a Q-subtree: although the root t3 has only one child, t3 contains the keyword "EU". Subtree A3 is not a Q-subtree for Q, but it is one for the query Q′ = {Belgium, Brussels, EU}.
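The definition of a Q-subtree can be checked mechanically, as the Python sketch below suggests. It tries every way of picking one node per keyword and tests the two conditions on the leaves and the root. Because the figure is not reproduced here, the tree shapes used for T1, A2, and A3 are assumptions chosen to agree with the statements of the example; the function name `is_q_subtree` and the data layout are likewise hypothetical.

```python
from itertools import product


def is_q_subtree(root, children, contains, query):
    """Check whether a rooted tree is reduced with respect to the query.

    children: dict mapping a node to the list of its children
    contains: dict mapping a node to the set of keywords it contains
    """
    nodes = {root} | set(children) | {c for cs in children.values() for c in cs}
    leaves = {v for v in nodes if not children.get(v)}
    # For each keyword k_i, the nodes that could serve as v_i.
    candidates = [[v for v in nodes if k in contains.get(v, set())] for k in query]
    for choice in product(*candidates):
        chosen = set(choice)
        if not leaves.issubset(chosen):
            continue  # some leaf is not one of the chosen keyword nodes
        if len(children.get(root, [])) == 1 and root not in chosen:
            continue  # a single-child root must itself be a chosen node
        return True
    return False


contains = {"t1": {"Belgium"}, "t3": {"EU"}, "t6": {"Brussels"}, "t9": set()}
q2 = ["Brussels", "EU"]
q3 = ["Belgium", "Brussels", "EU"]

# ASSUMED shapes, consistent with the example but not taken from the figure.
t1_tree = ("t9", {"t9": ["t3"], "t3": ["t6"]})  # T1: t9 -> t3 -> t6
a2_tree = ("t3", {"t3": ["t6"]})                # A2: t3 -> t6
a3_tree = ("t3", {"t3": ["t6"], "t6": ["t1"]})  # A3: t3 -> t6 -> t1

results = {
    "T1/Q": is_q_subtree(*t1_tree, contains, q2),
    "A2/Q": is_q_subtree(*a2_tree, contains, q2),
    "A3/Q": is_q_subtree(*a3_tree, contains, q2),
    "A3/Q'": is_q_subtree(*a3_tree, contains, q3),
}
print(results)
```

With these assumed shapes, the checker rejects T1 (single-child root without a keyword), accepts A2, rejects A3 for Q because its leaf t1 matches no query keyword, and accepts A3 for the three-keyword query.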

From the above definition of a tree answer, it is not straightforward to distinguish a Q-subtree from a non-Q-subtree, and the definition also makes the description of algorithms very complex. In this chapter, we adopt a different data graph model [Golenberg et al., 2008; Kimelfeld and Sagiv, 2006b], by virtually adding a keyword node for every word w that appears in the data and by adding a directed edge from each node v to w with weight 0 if v contains w. Denote the augmented graph as G_D^A = (V^A, E^A). Figure 3.1(e) shows the augmented graph of the graph in the dotted rectangle. Although there is only one incoming edge for each keyword node in Figure 3.1(e), multiple incoming edges into keyword nodes are allowed in general. Note that there is only one keyword node for each word w in G_D^A, and the augmented graph does not need to be materialized; it can be built on-the-fly using the inverted index of keywords. In G_D^A, an answer to a keyword query is well defined and captured by the following lemma.

Lemma 3.3 [Kimelfeld and Sagiv, 2006b] A subtree T of G_D^A is a Q-subtree, for a keyword query Q = {k_1, ..., k_l}, if and only if the set of leaves of T is exactly Q, i.e., leaves(T) = Q, and the root of T has at least two children.
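In the augmented graph, the test of Lemma 3.3 reduces to two simple checks, sketched below in Python. Keyword nodes are represented by the word itself, and the two sample trees are assumed shapes (the figure is not reproduced in this text); the function name `satisfies_lemma` is hypothetical.

```python
def satisfies_lemma(root, children, query):
    """Lemma 3.3 test on a subtree of the augmented graph:
    leaves(T) must equal Q and the root must have at least two children.

    children: dict mapping a node to the list of its children; keyword
              nodes appear as plain words and have no children.
    """
    nodes = {root} | set(children) | {c for cs in children.values() for c in cs}
    leaves = {v for v in nodes if not children.get(v)}
    return leaves == set(query) and len(children.get(root, [])) >= 2


query = {"Brussels", "EU"}

# ASSUMED shapes. Root t3 points to its own keyword node EU and to t6,
# which in turn points to the keyword node Brussels.
accepted = satisfies_lemma("t3", {"t3": ["EU", "t6"], "t6": ["Brussels"]}, query)

# Root t9 contains no keyword, so in the augmented graph it still has one child.
rejected = satisfies_lemma(
    "t9", {"t9": ["t3"], "t3": ["EU", "t6"], "t6": ["Brussels"]}, query
)
print(accepted, rejected)
```

The first tree shows why a root with a single child in the original graph can still qualify: its edge to a keyword node counts as a second child in the augmented graph.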

The last three subtrees in Figure 3.2 all satisfy the requirements of Lemma 3.3, so they are Q-subtrees. In the following, we also use G_D to denote the augmented graph G_D^A when the context is clear, and we use the above lemma to characterize Q-subtrees. Although the Q-subtree is popularly used to describe answers to keyword queries, two different weight functions are proposed in the
