Keyword Search in Databases


3.5 SUBGRAPH-BASED KEYWORD SEARCH

Algorithm 29 GetCommunity(G_D, C, R_max)

Input: a data graph G_D, a core C = [c_1, ..., c_l], and a radius threshold R_max

Output: the community uniquely determined by C

1: Find the set of cnodes, V_c, by running |C| copies of Dijkstra's single-source shortest path algorithm

2: Run a single copy of Dijkstra's algorithm to find, for each node v ∈ V(G_D), the shortest distance to the nearest knode, i.e., dist_k(v) = min_{c ∈ C} dist(v, c)

3: Run a single copy of Dijkstra's algorithm to find, for each node v ∈ V(G_D), the shortest distance from the nearest cnode, i.e., dist_c(v) = min_{v_c ∈ V_c} dist(v_c, v)

4: V ← {u ∈ V(G_D) | dist_c(u) + dist_k(u) ≤ R_max}

5: Construct the subgraph R of G_D induced by V and return it

set of path nodes (pnodes) that includes all the nodes that appear on any path from a cnode v_c ∈ V_c to a knode v_l ∈ V_l with dist(v_c, v_l) ≤ R_max. E(R) is the set of edges induced by V(R).

A community, R, is uniquely determined by the set of knodes, V_l, which is called the core of the community and denoted as core(R). The weight of a community R, w(R), is defined as the minimum value among the total edge weights from a cnode to every knode; more precisely,

\[ w(R) = \min_{v_c \in V_c} \sum_{v_l \in V_l} \mathrm{dist}(v_c, v_l). \tag{3.8} \]
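As a small illustration of Equation (3.8), the following Python sketch computes w(R) from a table of shortest distances. The function name `community_weight`, the input format, and the toy distances are assumptions made for this example only; they do not come from Qin et al.

```python
def community_weight(dist, cnodes, knodes):
    """Eq. (3.8): minimum, over all cnodes, of the total distance to every knode.

    `dist` maps (cnode, knode) pairs to shortest-path distances
    (a hypothetical input format chosen for this sketch).
    """
    return min(sum(dist[(c, k)] for k in knodes) for c in cnodes)


# Toy distances (assumed values): two candidate centers, two keyword nodes.
dist = {("a", "k1"): 1, ("a", "k2"): 2, ("b", "k1"): 2, ("b", "k2"): 3}
w = community_weight(dist, cnodes=["a", "b"], knodes=["k1", "k2"])
print(w)  # center a totals 1 + 2 = 3, center b totals 2 + 3 = 5, so w = 3
```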

For simplicity, we use C to represent a core as a list of l nodes, C = [c_1, c_2, ..., c_l], and we use C[i] to denote c_i ∈ C, where c_i contains the keyword term k_i. Based on the definition of community, once the core C is provided, the community is uniquely determined, and it can be found by Algorithm 29, which is self-explanatory.
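The five steps of Algorithm 29 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the graph is assumed to be directed and stored as an adjacency dictionary `{node: [(neighbor, weight), ...]}`, a cnode is assumed to be a node from which every knode of the core is reachable within R_max, and the helper names `dijkstra`, `reverse`, and `get_community` are hypothetical.

```python
import heapq
from math import inf


def dijkstra(adj, sources, limit=inf):
    """Multi-source Dijkstra; returns {node: distance} for nodes within `limit`."""
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:  # stale heap entry
            continue
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd <= limit and nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def reverse(adj):
    """Reverse every edge, so that distances *to* a node can be computed."""
    rev = {}
    for u, edges in adj.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev


def get_community(adj, core, r_max):
    rev = reverse(adj)
    # Step 1: one Dijkstra per knode; cnodes reach every knode within r_max.
    reach = [set(dijkstra(rev, [c], r_max)) for c in core]
    cnodes = set.intersection(*reach)
    # Step 2: distance from each node to its nearest knode.
    dist_k = dijkstra(rev, core, r_max)
    # Step 3: distance to each node from its nearest cnode.
    dist_c = dijkstra(adj, cnodes, r_max)
    # Step 4: keep nodes lying on a short enough cnode-to-knode path.
    nodes = {u for u in dist_c if u in dist_k and dist_c[u] + dist_k[u] <= r_max}
    # Step 5: induced subgraph.
    edges = [(u, v, w) for u in nodes for v, w in adj.get(u, ()) if v in nodes]
    return nodes, edges, cnodes


# Toy graph (assumed): c reaches only k1, and x lies too far beyond k1.
adj = {"a": [("k1", 1), ("k2", 2)], "b": [("a", 1)],
       "c": [("k1", 1)], "k1": [("x", 5)]}
nodes, edges, cnodes = get_community(adj, ["k1", "k2"], r_max=3)
print(sorted(nodes), sorted(cnodes))
```

In the toy graph, node c is left out because it cannot reach k2, and node x is left out because it is farther than R_max from any cnode.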

Qin et al. [2009b] enumerate all (or the top-k) communities with polynomial delay by adopting Lawler's procedure [Lawler, 1972]. The general idea is the same as that of EnumTreePD (Algorithm 19). It is much easier here, however, because EnumTreePD enumerates trees, which have structure, while in this case only the cores are enumerated, and each core is just a set of l keyword nodes. In this problem, the answer space is S_1 × S_2 × ... × S_l, where each S_i is the set of nodes in G_D that contain keyword k_i. A subspace is described by V_1 × V_2 × ... × V_l, where V_i ⊆ S_i, and it can also be compactly described by a set of inclusion constraints and exclusion constraints. Based on Lawler's procedure, it is straightforward to obtain an algorithm that enumerates the communities in increasing cost order with delay O(l · c(l)), where c(l) is the time complexity of computing the best community.
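To make the subspace idea concrete, the sketch below shows one standard way to apply Lawler's partitioning to a product space of cores: once the best core of a subspace is known, the rest of that subspace is split into at most l disjoint subspaces, each fixing a prefix of the core (inclusion constraints) and excluding one core node (an exclusion constraint). The function name `partition`, the 0-based `pos` argument, and the toy sets are assumptions for illustration.

```python
from itertools import product


def partition(subspace, core, pos=0):
    """Split `subspace` minus {core} into disjoint product subspaces.

    subspace: list of sets [V_1, ..., V_l]; core: one element from each V_i.
    Positions before `pos` are assumed to be fixed to the core already.
    """
    parts = []
    for i in range(pos, len(core)):
        part = [{core[j]} for j in range(i)]            # inclusion: fix C[1..i-1]
        part.append(set(subspace[i]) - {core[i]})       # exclusion: drop C[i]
        part.extend(set(v) for v in subspace[i + 1:])   # later positions stay free
        if all(part):                                   # skip empty subspaces
            parts.append(part)
    return parts


space = [{"a1", "a2"}, {"b1", "b2"}, {"c1"}]
core = ("a1", "b1", "c1")
parts = partition(space, core)

# The new subspaces are disjoint and cover everything except the core itself.
covered = [c for p in parts for c in product(*p)]
print(len(parts), len(covered))
```

Each of the up to l new subspaces needs its own best-core computation, which is where the factor l in the O(l · c(l)) delay comes from.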

Two algorithms are proposed for enumerating communities with delay O(c(l)): one enumerates all communities in arbitrary order with polynomial delay, and the other enumerates the top-k communities in increasing weight order with polynomial delay. In the following, we discuss the second algorithm.


Algorithm 30 COMM-K(G_D, Q, R_max)

Input: a data graph G_D, a keyword set Q = {k_1, ..., k_l}, and a radius threshold R_max

Output: the top-k communities, enumerated in increasing weight order

1: Find the sets of knodes {S_1, ..., S_l} and their corresponding neighborhood node sets {N_1, ..., N_l}

2: Find the best core (with the lowest weight) and the corresponding weight from {N_1, ..., N_l}, denoted (C, weight)

3: Initialize H ← ∅; H.insert(C, weight, 1, ∅)

4: while H ≠ ∅ and fewer than k communities have been output do

5:     g ← H.pop(); {g = (C, weight, pos, prev)}

6:     R ← GetCommunity(G_D, g.C, R_max), and output R

7:     ∀i ∈ [1, l]: update N_i to be the neighborhood nodes of g.C[i], V_i ← S_i

8:     update {V_1, ..., V_l} by following the links g.prev recursively

9:     for i = l downto g.pos do

10:         V_i ← V_i − {g.C[i]}, update N_i to be the neighborhood nodes of V_i

11:         Find the best core from the current {N_1, ..., N_l}, denoted (C, weight)

12:         H.insert(C, weight, i, g) if C exists

13:         V_i ← V_i ∪ {g.C[i]}, update N_i to be the neighborhood nodes of V_i

Algorithm 30 shows the high-level pseudocode. H is a priority heap, used to store the intermediate and potential cores with additional information. The general idea is to consider the entire set of potential cores as an l-dimensional space S_1 × S_2 × ... × S_l and, at each step, to divide a subspace into smaller subspaces and find the best core in each newly generated subspace. At any intermediate step, the subspaces are pairwise disjoint, and their union is guaranteed to cover the whole space. Each time the core with the lowest weight is removed from H (line 5), it is guaranteed to be the core of the next community in order.

The best core of a subspace V_1 × V_2 × ... × V_l, where V_i ⊆ S_i, is found as follows (lines 2 and 11). First, a neighborhood node set N_i is found for each set V_i; it consists of all the nodes with a shortest distance no greater than R_max to at least one of the nodes in V_i. This can be done by running a shortest path algorithm. Second, a linear scan of the nodes finds the best core together with its best center and weight.

When the next best core g.C is found, the subspace from which g.C was found is partitioned into several subspaces (lines 9-13); the best core of each newly generated subspace is found (line 11) and inserted into H (line 12). Each entry in H consists of four fields, (C, weight, pos, prev), where C is the core, weight is the corresponding weight, and pos and prev are used to efficiently reconstruct the subspace from which C was computed (without storing the description of the subspace explicitly).
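The two-step search for the best core of a subspace (neighborhood sets first, then a linear scan) can be sketched as follows. This is an illustrative sketch, not the implementation of Qin et al.: it assumes a directed graph stored as `{node: [(neighbor, weight), ...]}`, runs one multi-source Dijkstra per V_i on the reversed graph, and records for every reached node both its distance to the nearest node of V_i and which node that is. The names `reverse`, `nearest`, and `best_core` are hypothetical.

```python
import heapq
from math import inf


def reverse(adj):
    """Reverse every edge, so that distances *to* the sources can be computed."""
    rev = {}
    for u, edges in adj.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev


def nearest(rev, sources, r_max):
    """Neighborhood set N_i of V_i = sources.

    Returns {u: (distance from u to its nearest source, that source)}
    for every node u within r_max of at least one source.
    """
    best = {s: (0, s) for s in sources}
    heap = [(0, s, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u, src = heapq.heappop(heap)
        if d > best[u][0]:  # stale heap entry
            continue
        for v, w in rev.get(u, ()):
            nd = d + w
            if nd <= r_max and nd < best.get(v, (inf, None))[0]:
                best[v] = (nd, src)
                heapq.heappush(heap, (nd, v, src))
    return best


def best_core(rev, subspace, r_max):
    """Return (weight, core, center) of the lowest-weight core, or None."""
    tables = [nearest(rev, v_i, r_max) for v_i in subspace]
    candidates = set.intersection(*(set(t) for t in tables))
    best = None
    for u in sorted(candidates):  # linear scan over possible centers
        weight = sum(t[u][0] for t in tables)
        if best is None or weight < best[0]:
            best = (weight, [t[u][1] for t in tables], u)
    return best


# Toy graph (assumed): x1, x2 contain keyword 1 and y1 contains keyword 2.
adj = {"a": [("x1", 1), ("y1", 2)], "b": [("x2", 1), ("y1", 1)]}
rev = reverse(adj)
print(best_core(rev, [{"x1", "x2"}, {"y1"}], r_max=3))
print(best_core(rev, [{"x1"}, {"y1"}], r_max=3))  # after excluding x2
```

In the toy graph, the first call picks center b with core [x2, y1] and weight 2; once x2 is excluded, the best remaining core is [x1, y1] with center a and weight 3.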

Algorithm 30 enumerates the top-k communities in increasing weight order with delay O(l(n log n + m)), using space O(l^2 · k + l · n + m) [Qin et al., 2009b]. Note that finding the best core in a subspace (under inclusion constraints and exclusion constraints) also takes time c(l) = O(l(n log n + m)). According to the discussion of EnumTreePD, it is easy to get an enumeration


algorithm with delay l · c(l). However, information can be shared during consecutive executions of Line 11 of EnumTreePD, so Algorithm 30 can enumerate communities with delay c(l).


CHAPTER 4

Keyword Search in XML Databases

In this chapter, we focus on keyword search in XML databases, where an XML database is treated as a large data tree. We introduce various semantics for answering a keyword query on an XML tree, and we discuss efficient algorithms to find the answers under such semantics. The main difference between this chapter and the previous chapters is that the underlying data structure is a large tree instead of a large graph.

In Section 4.1, we introduce several important concepts and definitions, such as Lowest Common Ancestor (LCA), Smallest LCA (SLCA), Exclusive LCA (ELCA), and Compact LCA (CLCA). Their properties and the relationships among LCA, SLCA, and ELCA will be discussed. In Section 4.2, we discuss the algorithms that find answers based on SLCA. In Section 4.3, we discuss the algorithms that focus on identifying meaningful return information. We discuss algorithms that find answers based on ELCA in Section 4.4. In Section 4.5, we briefly present several approaches based on meaningful LCA, interconnection, and relevance-oriented ranking.

XML is modeled as a rooted and labeled tree, such as the one shown in Figure 4.1. Each internal node v in the tree corresponds to an XML element, called an element node, and is labeled with a tag/label tag(v). Each leaf node of the tree corresponds to a data value, called a value node. For example, in Figure 4.1, “Dean” and “Title” are element nodes, and “John” and “Ben” are value nodes. In this model, attribute nodes are modeled as children of the associated element node, and we do not distinguish them from element nodes.

Each node (element node or value node) in the XML tree is assigned a unique Dewey ID. Dewey IDs are assigned in the following way: the relative position of each node among its siblings is recorded, and the concatenation of these relative positions along the path from the root, separated by dots ('.'), composes the Dewey ID of the node. For example, the node with Dewey ID 0.1.2.1 (Students) is the second child of its parent node 0.1.2 (Class). We denote the Dewey ID of a node v as pre(v), as it is compatible with the preorder numbering, i.e., a node v_1 precedes another node v_2 in the preorder left-to-right depth-first traversal of the tree if and only if pre(v_1) < pre(v_2). The < relationship between two Dewey IDs is the same as the comparison between two sequences. Besides preserving the order information, the Dewey ID can also be used to detect sibling and ancestor-descendant relationships between nodes.
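The following Python sketch illustrates these uses of Dewey IDs. It assumes IDs are given as dot-separated strings of non-negative integers; the function names are hypothetical, and the ID 0.1.2.0 is an assumed sibling that does not appear in the text. Parsing each ID into a tuple of integers lets Python's built-in tuple comparison act as the sequence comparison, which a plain string comparison would get wrong for multi-digit positions.

```python
def parse(dewey):
    """Turn a Dewey ID such as "0.1.2.1" into the integer sequence (0, 1, 2, 1)."""
    return tuple(int(part) for part in dewey.split("."))


def precedes(a, b):
    """True if node a comes before node b in preorder (sequence comparison)."""
    return parse(a) < parse(b)


def is_ancestor(a, b):
    """True if a is a proper ancestor of b, i.e. a's ID is a proper prefix of b's."""
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def are_siblings(a, b):
    """True if a and b are distinct nodes sharing the same parent."""
    pa, pb = parse(a), parse(b)
    return pa != pb and len(pa) == len(pb) and pa[:-1] == pb[:-1]


print(is_ancestor("0.1.2", "0.1.2.1"))     # Class is an ancestor of Students
print(precedes("0.1.2", "0.1.2.1"))        # an ancestor precedes its descendants
print(are_siblings("0.1.2.0", "0.1.2.1"))  # same parent 0.1.2
print(precedes("0.9", "0.10"))             # integer comparison, not string comparison
```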
