Keyword Search in Databases



4.2.1 PROPERTIES OF LCA AND SLCA

Property 4.9 Given a set S and two nodes vi and vj with vi < vj, then closest(vi, S) ≤ closest(vj, S).

Proof. Assume the contrary, that closest(vi, S) > closest(vj, S). The only possibility is that closest(vi, S) = rm(vi, S), closest(vj, S) = lm(vj, S), and rm(vi, S) > lm(vj, S). Recall that closest(v, S) is chosen from lm(v, S) and rm(v, S), and that lm(vi, S) ≤ lm(vj, S) and rm(vi, S) ≤ rm(vj, S) if all of them exist. If lm(vj, S) < rm(vi, S), then lm(vj, S) ≤ lm(vi, S); therefore lm(vi, S) = lm(vj, S), by the fact that lm(vi, S) ≤ lm(vj, S). Similarly, we get rm(vi, S) = rm(vj, S). Also, lm(vi, S) ≠ rm(vi, S), since otherwise closest(vi, S) = lm(vi, S). Let lm denote lm(vi, S) and rm denote rm(vi, S). It holds that lm < vi < vj < rm. According to Property 4.2, lca(lm, vj) ⪯ lca(lm, vi) and lca(rm, vi) ⪯ lca(rm, vj). According to the definition of closest, lca(lm, vi) ≺ lca(rm, vi) and lca(rm, vj) ⪯ lca(lm, vj). Chaining these relations gives lca(lm, vi) ≺ lca(lm, vi), which is a contradiction. □

Property 4.10 Let V and U be lists of nodes, i.e., V = {v1, · · · , vl} and U = {u1, · · · , ul}, such that V ≤ U, i.e., vi ≤ ui for 1 ≤ i ≤ l. Let lca(V) and lca(U) be the LCA of the nodes in V and U, respectively. Then,

1. if lca(V) ≥ lca(U), then lca(U) ⪯ lca(V),

2. if lca(V) < lca(U), then

• either lca(V) ≺ lca(U),

• or lca(V) ⊀ lca(U), and then for any W with U ≤ W, lca(V) ⊀ lca(W).

Proof. This is an extension of Property 4.3 to more than two nodes. The proof is by induction. When V and U contain only two nodes, it is proven in Property 4.3. Assume that it is true for V, U, and W; we prove that it is true for V′, U′, and W′, where V′ = V ∪ {vl} and U′ = U ∪ {ul}, with vl ≤ ul. One important property of lca is that lca(V′) = lca(lca(V), vl). If lca(U) ⪯ lca(V), then either lca(U′) ⪯ lca(V′) or lca(V′) ≺ lca(U′). Otherwise, lca(V) < lca(U); according to Property 4.3, there are three cases for lca(V′) and lca(U′), and we only need to prove the last case, i.e., the case that lca(V′) < lca(U′) and lca(V′) ⊀ lca(U′). Then, for any W′ = W ∪ {wl}, if lca(U′) ≤ lca(W′), then we are done; otherwise lca(W′) ≺ lca(U′), and then lca(V′) ⊀ lca(W′), because lca(W′) ⪯ lca(W). □
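The induction step relies on the fact that the LCA of a list of nodes can be computed incrementally, by folding in one node at a time. The following is a minimal Python sketch of that fact, assuming Dewey IDs are represented as tuples of integers so that the LCA of two nodes is their longest common prefix. The function names `lca` and `lca_all` are illustrative and do not come from the text.

```python
from functools import reduce

def lca(u, v):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def lca_all(nodes):
    """LCA of a non-empty list of Dewey IDs, folded pairwise from the left."""
    return reduce(lca, nodes)

# Adding one more node to a list only requires one extra pairwise LCA.
V = [(0, 0, 1), (0, 0, 2)]
v_new = (0, 1, 0)
assert lca_all(V + [v_new]) == lca(lca_all(V), v_new)
```

Under this representation, each pairwise LCA costs time proportional to the depth d of the tree.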


Figure 4.3: Stack Data Structure (each stack entry has an id component and a keyword array k1, k2, · · · , kl; the entries run from id1 at the bottom to idm at the top)

4.2.2 EFFICIENT ALGORITHMS FOR SLCAS

In this section, we consider three algorithms, namely StackAlgorithm, IndexedLookupEager, and ScanEager [Xu and Papakonstantinou, 2005], that find all the slca(S1, · · · , Sl) efficiently. Each algorithm has different characteristics and works efficiently in some situations. MultiwaySLCA further improves the performance of IndexedLookupEager by proposing some heuristics, but it has the same worst-case time complexity as IndexedLookupEager. Note that these algorithms only compute the SLCAs; they do not keep the match nodes for the SLCAs. Finding the match nodes for all the SLCAs can be done efficiently by one scan of the SLCAs and one scan of S1, · · · , Sl, provided that the nodes in the SLCAs are in increasing Dewey ID order.

Stack Algorithm: This is an adaptation of the stack-based sort-merge algorithm [Guo et al., 2003] to compute all the SLCAs. It uses a stack in which each entry has a pair of components (id, keyword), as shown in Figure 4.3. Assume the id components from the bottom entry to a stack entry en are id1, · · · , idm, respectively; then the stack entry en denotes the node with the Dewey ID id1.id2. · · · .idm. The keyword component is an array of l Boolean values, where keyword[i] = true means that the subtree rooted at the node denoted by the entry contains keyword ki directly or indirectly.

The general idea of StackAlgorithm is to use a stack to simulate the postorder traversal of a virtual XML tree formed by the union of the paths from the root to each node in S1, · · · , Sl, while the nodes are read in a preorder fashion. When an entry en is popped, which means that all the descendant-or-self nodes of en in S1, · · · , Sl have been visited, it is known whether or not each keyword appears in the subtree. StackAlgorithm merges all keyword lists and computes the longest common prefix of the node with the smallest Dewey ID from the input lists and the node denoted by the top entry of the stack. Then it pops all top entries until the longest common prefix is reached. If the keyword component of a popped entry en contains all the keywords, then the node denoted by en is a SLCA node. Based on the definition of SLCA, none of the ancestors of a SLCA node can be a SLCA, so this information is recorded. Otherwise, the keyword containment information of en is used to update the keyword array of its parent entry. Also, a stack entry is created for each Dewey component of the currently visited node that is not part of the common prefix, where each new entry corresponds to one node on the path from the longest common prefix to the current node. Essentially, the node represented by the top entry of the stack is the node visited in the preorder traversal.

Algorithm 31 StackAlgorithm (S1, · · · , Sl)

Input: l lists of Dewey IDs; Si is the list of Dewey IDs of the nodes containing keyword ki.

Output: All the SLCAs

1: stack ← ∅
2: while has not reached the end of all Dewey lists do
3:     v ← getSmallestNode()
4:     p ← lca(stack, v)
5:     while stack.size > p do
6:         en ← stack.pop()
7:         if en.keyword[i] = true, ∀i (1 ≤ i ≤ l) then
8:             output en as a SLCA
9:             mark all the entries in stack so that they can never be SLCA nodes
10:        else
11:            ∀i (1 ≤ i ≤ l): stack.top().keyword[i] ← true, if en.keyword[i] = true
12:    ∀i (p < i ≤ v.length): stack.push(v[i], [])
13:    stack.top().keyword[i] ← true, where v ∈ Si
14: check entries of the stack and return any SLCA node if exists

StackAlgorithm is shown in Algorithm 31. It first initializes the stack to be empty (line 1). As long as there are Dewey lists that have not been fully visited (line 2), it reads the next node with the smallest Dewey ID (line 3) and performs the necessary operations. Essentially, reading nodes in this order is equivalent to a preorder traversal of the original XML tree, ignoring irrelevant nodes. Let stack[i] denote the node represented by the i-th entry of stack starting from the bottom, and v[i] denote the i-th component of the Dewey ID of v. After getting v, it computes the LCA of v and the node represented by the top of stack (line 4), which is stack[p]. This means that all the keyword nodes that are descendants of stack[p + 1], if they exist, have been read, and the keyword containment information has been stored in the corresponding stack entries. Then all the nodes represented by stack[i] (p < i ≤ stack.size) are popped (lines 5-11). For each popped entry en (line 6), it first checks whether it is a SLCA node (line 7); if en is indeed a SLCA node, then it is output (line 8), and the information is recorded that none of its ancestors can be a SLCA (line 9). Otherwise, the keyword containment information of its parent node is updated (line 11). After popping all the non-ancestor nodes from stack, v and those of its ancestors that are not yet on stack are pushed onto stack (line 12), and the keyword containment information is stored (line 13). At this moment, the node represented by the top entry of stack is v, the whole stack represents all the nodes on the path from the root to v, and the keyword containment information is stored compactly. After all the Dewey lists have been read, all the entries need to be popped from stack, and a check is performed to see whether there exists any SLCA node (line 14).
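The walk-through above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation. It assumes that Dewey IDs are non-empty tuples of integers, that each input list is sorted, and that the lists are merged with the standard-library `heapq.merge`. The function name `stack_slca` and the per-entry `blocked` flag are hypothetical, and the flag is propagated to the parent lazily on each pop instead of marking the whole stack at once as in line 9.

```python
import heapq

def stack_slca(keyword_lists):
    """Return the SLCAs of the given sorted lists of Dewey IDs (tuples)."""
    l = len(keyword_lists)
    # Stream of (dewey_id, keyword_index) in increasing Dewey ID order.
    merged = heapq.merge(*[[(v, i) for v in s]
                           for i, s in enumerate(keyword_lists)])
    stack = []    # each entry: [id_component, keyword_flags, blocked]
    result = []

    def pop_entry():
        comp, flags, blocked = stack.pop()
        if not blocked and all(flags):
            # The remaining stack spells the path to the parent of this node.
            result.append(tuple(e[0] for e in stack) + (comp,))
            blocked = True
        if stack:
            parent = stack[-1]
            if blocked:
                parent[2] = True    # no ancestor of an SLCA can be an SLCA
            else:
                parent[1] = [a or b for a, b in zip(parent[1], flags)]

    for v, i in merged:
        # p = length of the common prefix of v and the path on the stack
        p = 0
        while p < min(len(stack), len(v)) and stack[p][0] == v[p]:
            p += 1
        while len(stack) > p:
            pop_entry()
        for comp in v[p:]:
            stack.append([comp, [False] * l, False])
        stack[-1][1][i] = True
    while stack:                    # final check of the remaining entries
        pop_entry()
    return result

S1 = [(0, 0, 0), (0, 1, 0)]
S2 = [(0, 0, 1), (0, 1, 1), (0, 2)]
print(stack_slca([S1, S2]))
```

In this small example the nodes 0.0 and 0.1 each contain both keywords in their subtrees, so the root is blocked and not reported.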

StackAlgorithm outputs all the SLCA nodes, i.e., slca(S1, · · · , Sl), in time O(d Σ_{i=1}^{l} |Si|), or O(ld|S|) [Xu and Papakonstantinou, 2005]. Note that the above time complexity does not take into account the time to merge S1, · · · , Sl, which takes time O(d log l · Σ_{i=1}^{l} |Si|). getSmallestNode (line 3) just retrieves the next node with the smallest Dewey ID from the merged list.
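One possible way to realize the merged list is a heap-based l-way merge. The sketch below uses Python's standard-library `heapq.merge` and assumes Dewey IDs are tuples and each list is already sorted. The name `merged_nodes` is illustrative, and tagging each node with the index of its keyword list is an assumption made so that the caller knows which keyword a node matches.

```python
import heapq

def merged_nodes(keyword_lists):
    """Yield (dewey_id, keyword_index) pairs in increasing Dewey ID order."""
    tagged = [[(v, i) for v in s] for i, s in enumerate(keyword_lists)]
    return heapq.merge(*tagged)

stream = merged_nodes([[(0, 0, 0), (0, 1, 0)], [(0, 0, 1), (0, 2)]])
smallest = next(stream)   # plays the role of one getSmallestNode() call
```

Each step of the merge performs O(log l) comparisons of Dewey IDs, and each comparison costs O(d), which matches the merge cost stated above.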

Indexed Lookup Eager: StackAlgorithm treats all the Dewey lists S1, · · · , Sl equally, but sometimes |S1|, · · · , |Sl| vary dramatically. Xu and Papakonstantinou [2005] propose IndexedLookupEager to compute all the SLCA nodes in the situation where |S1| is much smaller than |S|. It is based on the following properties of the slca function.

Property 4.11 slca({v}, S) = lca(v, closest(v, S)), and slca({v}, S2, · · · , Sl) = slca(slca({v}, S2, · · · , Sl−1), Sl) = lca(v, closest(v, S2), · · · , closest(v, Sl)) for l > 2.

Property 4.11 suggests that we can find the SLCA node of a node v and a set of nodes S by first finding the node in S that is closest to v, and then finding the LCA of v and that closest node. The definition of closest is given in Section 4.1.2. Based on Property 4.11, we can compute slca({v1}, S2, · · · , Sl) by first finding the node closest to v1 in each set Si, denoted as closest(v1, Si); slca({v1}, S2, · · · , Sl) then consists of the single node lca(v1, closest(v1, S2), · · · , closest(v1, Sl)). The computation of slca({v1}, S2, · · · , Sl) takes time O(d Σ_{i=2}^{l} log |Si|). Then, for arbitrary S1, · · · , Sl, we have the following property.
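The computation of slca({v1}, S2, · · · , Sl) described above can be sketched in Python with binary search. The sketch assumes that Dewey IDs are tuples, that every set is a non-empty sorted list, that lm(v, S) is the largest node in S not greater than v and rm(v, S) the smallest node in S not less than v, and that closest picks whichever of the two has the deeper LCA with v, preferring lm on ties. These definitions are given in Section 4.1.2, which is not reproduced here, so they should be read as assumptions. The name `slca_single` is illustrative.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def lm(v, S):
    """Left match: largest node in sorted S that is <= v, or None."""
    k = bisect_right(S, v)
    return S[k - 1] if k > 0 else None

def rm(v, S):
    """Right match: smallest node in sorted S that is >= v, or None."""
    k = bisect_left(S, v)
    return S[k] if k < len(S) else None

def closest(v, S):
    """Whichever of lm(v, S) and rm(v, S) has the deeper LCA with v."""
    left, right = lm(v, S), rm(v, S)
    if left is None:
        return right
    if right is None:
        return left
    return right if len(lca(right, v)) > len(lca(left, v)) else left

def slca_single(v, *sets):
    """The single node lca(v, closest(v, S2), ..., closest(v, Sl))."""
    node = v
    for S in sets:
        node = lca(node, closest(v, S))
    return node

S2 = [(0, 0, 1), (0, 1, 1), (0, 2)]
print(slca_single((0, 1, 0), S2))
```

Each call to `closest` performs two binary searches whose comparisons cost O(d), in line with the O(d Σ_{i=2}^{l} log |Si|) bound.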

Property 4.12 slca(S1, · · · , Sl) = removeAncestor(⋃_{v1∈S1} slca({v1}, S2, · · · , Sl))

Property 4.12 shows that, in order to find the SLCA nodes of S1, · · · , Sl, we can first find slca({v1}, S2, · · · , Sl) for each v1 ∈ S1, and then remove every node that is an ancestor of another node in the result. Its correctness follows from the fact that slca(S1, · · · , Sl) = removeAncestor(lca(S1, · · · , Sl)). The definition of removeAncestor is given in Section 4.1.2.

The above two properties directly lead to an algorithm to compute slca(S1, · · · , Sl): (1) first compute {xi} = slca({vi}, S2, · · · , Sl) for each vi ∈ S1 (1 ≤ i ≤ |S1|); (2) removeAncestor({x1, · · · , x|S1|}) is the answer. The time complexity of the algorithm is O(|S1| Σ_{i=2}^{l} d log |Si| + |S1| d log |S1|), or O(|S1| l d log |S|). The first step, computing slca({vi}, S2, · · · , Sl) for each vi ∈ S1, takes time O(|S1| Σ_{i=2}^{l} d log |Si|). The second step takes time O(|S1| d log |S1|); it can be implemented by first sorting {x1, · · · , x|S1|} in increasing Dewey ID order and then finding the SLCA nodes by a linear scan. Note that this time complexity is different from the one given by Xu and Papakonstantinou [2005], which is O(|S1| Σ_{i=2}^{l} d log |Si| + |S1|^2). Although it has the same time complexity as IndexedLookupEager, the above algorithm is a blocking algorithm, while IndexedLookupEager is non-blocking.
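The two-step blocking algorithm can be sketched in Python as follows, under the same assumptions as before: Dewey IDs are tuples, every set is a non-empty sorted list, and closest prefers the left match on ties. The names `remove_ancestor` and `slca_blocking` are illustrative. The linear scan uses the fact that, in sorted Dewey order, the descendants of a node immediately follow it, so a node has a descendant in the list exactly when it is a prefix of its successor.

```python
from bisect import bisect_left

def lca(u, v):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Node of sorted S adjacent to v with the deepest LCA (ties go left)."""
    k = bisect_left(S, v)
    candidates = S[max(k - 1, 0):k + 1]   # left and right neighbours of v
    return max(candidates, key=lambda c: len(lca(c, v)))

def remove_ancestor(nodes):
    """Drop every node that is a proper ancestor of another node."""
    xs = sorted(set(nodes))               # step 2a: sort by Dewey ID
    return [x for k, x in enumerate(xs)   # step 2b: one linear scan
            if k + 1 == len(xs) or xs[k + 1][:len(x)] != x]

def slca_blocking(S1, *others):
    """slca(S1, ..., Sl): one candidate per node of S1, then filtering."""
    candidates = []
    for v in S1:                          # step 1
        x = v
        for S in others:
            x = lca(x, closest(v, S))
        candidates.append(x)
    return remove_ancestor(candidates)

print(slca_blocking([(0, 0, 0), (0, 3)], [(0, 0, 1)]))
```

The sketch is blocking in the sense used above: no result can be returned before every node of S1 has been processed and the candidates have been sorted.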

Lemma 4.13 Given any two nodes vi and vj with pre(vi) < pre(vj), and a set S of Dewey IDs:

1. if slca({vi}, S) ≥ slca({vj}, S), then slca({vj}, S) ⪯ slca({vi}, S).

2. if slca({vi}, S) < slca({vj}, S), then

• either slca({vi}, S) is an ancestor of slca({vj}, S),

• or slca({vi}, S) is not an ancestor of slca({vj}, S), and then for any v such that pre(v) > pre(vj), slca({vi}, S) ⊀ slca({v}, S).

The correctness of the above lemma directly follows from Property 4.3 and Property 4.11. It leads straightforwardly to a non-blocking algorithm to compute slca(S1, S2) by removing ancestor nodes on the fly, which is shown as the subroutine getSLCA in IndexedLookupEager. The above lemma can be directly applied to multiple sets with the first set as a singleton, i.e., by replacing S by S2, · · · , Sl in the lemma. The correctness directly follows from Property 4.10, Property 4.9, and Property 4.11.
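A non-blocking computation of slca(S1, S2) along the lines of Lemma 4.13 can be sketched as a Python generator. This is a sketch under the earlier assumptions (Dewey IDs as tuples, non-empty sorted lists, closest preferring the left match on ties) and is not the getSLCA listing of Algorithm 32. The name `get_slca` is illustrative. It keeps one pending candidate u and emits it as soon as a later candidate shows that u has no descendant among the candidates.

```python
from bisect import bisect_left

def lca(u, v):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Node of sorted S adjacent to v with the deepest LCA (ties go left)."""
    k = bisect_left(S, v)
    candidates = S[max(k - 1, 0):k + 1]
    return max(candidates, key=lambda c: len(lca(c, v)))

def get_slca(S1, S2):
    """Yield the nodes of slca(S1, S2) in increasing Dewey ID order."""
    u = None                          # pending candidate
    for v in S1:
        x = lca(v, closest(v, S2))    # slca({v}, S2) by Property 4.11
        if u is not None:
            if x < u:
                continue              # case 1: x is an ancestor of u, drop x
            if x[:len(u)] != u:
                yield u               # case 2: u is not an ancestor of x
        u = x                         # x replaces u (u was emitted or covered)
    if u is not None:
        yield u

S1 = [(0, 0, 0), (0, 1, 0), (0, 3)]
S2 = [(0, 0, 1), (0, 1, 1)]
print(list(get_slca(S1, S2)))
```

Because the generator yields a result as soon as the next candidate rules out the pending one, the first SLCA can be reported without reading all of S1.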

Property 4.14 slca(S1, · · · , Sl ) = slca(slca(S1, · · · , Sl−1), Sl) for l > 2.

IndexedLookupEager, as shown in Algorithm 32, directly follows from Lemma 4.13, Property 4.11, Property 4.12, and Property 4.14. The value p in line 3 is the buffer size; it can be any value ranging from 1 to |S1|, and the smaller p is, the faster the algorithm produces the first SLCA. It first computes X2 = slca(X1, S2), where X1 is the next p nodes from S1 (line 3). Then it computes X3 = slca(X2, S3), and so on, until it computes Xl = slca(Xl−1, Sl) (lines 4-5). Note that at any step, the nodes in Xi are in increasing Dewey ID order, and there is no ancestor-descendant relationship between any two nodes in Xi. All nodes in Xl except the first and the last one are guaranteed to be SLCA nodes (line 9). The first node of Xl is checked at line 6. The last node of Xl is carried over to the next iteration (line 9), where it is determined whether or not it is a SLCA (line 7). IndexedLookupEager outputs all the SLCA nodes, i.e., slca(S1, · · · , Sl), in time O(|S1| Σ_{i=2}^{l} d log |Si|), or O(|S1| l d log |S|) [Xu and Papakonstantinou, 2005].

Scan Eager: When the keyword frequencies, i.e., |S1|, · · · , |Sl|, do not differ significantly, the total cost of finding matches by lookups using binary search may exceed the total cost of finding the matches by scanning the keyword lists, i.e., O(|S1| l d log |S|) > O(ld|S|). ScanEager (Algorithm 33) [Xu and Papakonstantinou, 2005] modifies line 15 of IndexedLookupEager by using a linear scan to find lm() and rm(). It takes advantage of the fact that the accesses to any keyword list are strictly in increasing order in IndexedLookupEager. Consider the getSLCA(S1, S2) subroutine in IndexedLookupEager: in order to find lm(v, S2) and rm(v, S2), ScanEager maintains a cursor for each keyword list, and it advances the cursor of S2 until it finds the node that is closest to v from the left or the right side. Note that if rm(v, S2) exists, then it should be the node that follows lm(v, S2) in S2, or the first node in S2 if lm(v, S2) = ⊥. The main idea is based on the fact that, for any vi and vj in S1 with pre(vi) < pre(vj), lm(vi, S2) ≤ lm(vj, S2) and rm(vi, S2) ≤ rm(vj, S2), assuming that all lm() and rm() are not equal to ⊥. Note that, in order to ensure the correctness of
