Keyword Search in Databases- P20 pdf


Algorithm 32 IndexedLookupEager(S1, ···, Sl)

Input: l lists of Dewey IDs, where Si is the list of Dewey IDs of the nodes containing keyword ki
Output: All the SLCAs

1: v ← ⊥
2: while there are more nodes in S1 do
3:     Read p nodes of S1 into buffer B
4:     for i ← 2 to l do
5:         B ← getSLCA(B, Si)
6:     removeFirstNode(B), if v ≠ ⊥ and getFirstNode(B) ⪯ v
7:     output v as a SLCA, if v ≠ ⊥, B ≠ ∅ and v ⊀ getFirstNode(B)
8:     if B ≠ ∅ then
9:         v ← removeLastNode(B)
10:        output B as SLCA nodes
11: output v as a SLCA

12: Procedure getSLCA(S1, S2)
13: Result ← ∅; u ← root with Dewey ID 0
14: for each node v ∈ S1 in increasing Dewey ID order do
15:     x ← lca(v, closest(v, S2))
16:     if pre(u) < pre(x) then
17:         Result ← Result ∪ {u}, if u ⊀ x
18:         u ← x
19: return Result ∪ {u}
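The getSLCA procedure, and the pairwise folding slca(slca(S1, S2), S3) that it supports, can be sketched in Python. This is a minimal sketch rather than the authors' implementation. It assumes that Dewey IDs are tuples of integers (so Python's tuple ordering coincides with pre-order), that every list is sorted and non-empty, and that closest(v, S) is the neighbour of v in S that shares the longest prefix with v. The names lca, is_ancestor, closest, get_slca and slca are illustrative.

```python
from bisect import bisect_left
from functools import reduce

ROOT = (0,)  # assumption: the root carries Dewey ID 0


def lca(a, b):
    """Longest common prefix of two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def is_ancestor(u, x):
    """True if u is a proper ancestor of x (u ≺ x)."""
    return len(u) < len(x) and x[:len(u)] == u


def closest(v, s):
    """Neighbour of v in the sorted list s sharing the longest prefix with v."""
    i = bisect_left(s, v)
    candidates = s[max(i - 1, 0):i + 1]
    return max(candidates, key=lambda c: len(lca(v, c)))


def get_slca(s1, s2):
    """Lines 12-19 of Algorithm 32 for two sorted lists."""
    result, u = [], ROOT
    for v in s1:
        x = lca(v, closest(v, s2))
        if u < x:                      # pre(u) < pre(x)
            if not is_ancestor(u, x):  # u ⊀ x, so u is confirmed
                result.append(u)
            u = x
    return result + [u]


def slca(lists):
    """Fold get_slca over the lists: slca(slca(S1, S2), S3), and so on."""
    return reduce(get_slca, lists)


# Hypothetical data: two subtrees (0, 0) and (0, 1) under the root.
a = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
b = [(0, 0, 2), (0, 1, 1)]
c = [(0, 0, 3)]
print(slca([a, b]))     # [(0, 0), (0, 1)]
print(slca([a, b, c]))  # [(0, 0)]
```

Folding over whole lists corresponds to choosing p no smaller than |S1|. A smaller buffer would need the bookkeeping of lines 6-10, which this sketch leaves out.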

ScanEager, p at Line (3) must be no smaller than |S1|, i.e., it must first compute slca(S1, S2), then slca(slca(S1, S2), S3), and continue. ScanEager directly follows from Property 4.10, Property 4.9, Property 4.11.

ScanEager outputs all the SLCA nodes, i.e., slca(S1, ···, Sl), in time $O(ld|S_1| + d\sum_{i=2}^{l}|S_i|)$, or $O(ld|S|)$ [Xu and Papakonstantinou, 2005]. Although ScanEager has the same time complexity as StackAlgorithm, it has two advantages. First, ScanEager starts from the smallest keyword list, so it does not have to scan to the end of every keyword list and may terminate much earlier. Second, the number of lca operations of ScanEager is $O(l|S_1|)$, which is usually much less than that of StackAlgorithm, which performs $O(\sum_{i=1}^{l}|S_i|)$ lca operations.

Multiway SLCA: MultiwaySLCA [Sun et al., 2007] further improves the performance of IndexedLookupEager, but with the same worst case time complexity. The motivation and general idea of MultiwaySLCA are shown by the following example.


4.2. SLCA-BASED SEMANTICS

Algorithm 33 ScanEager(S1, ···, Sl)

1: u ← root with Dewey ID 0
2: for each node v1 ∈ S1 in increasing Dewey ID order do
3:     move the cursor in each list Si to closest(v1, Si), for 1 ≤ i ≤ l
4:     v ← lca(v1, closest(v1, S2), ···, closest(v1, Sl))
5:     if pre(u) < pre(v) then
6:         if u ⊀ v then
7:             output u as a SLCA
8:         u ← v
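The loop of ScanEager can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' code: Dewey IDs are tuples of integers, every list is sorted and non-empty, and a binary search stands in for the cursor movement of line 3, which changes the cost but not the output. The listing above stops at line 8, so emitting the last candidate after the loop (mirroring line 19 of getSLCA) is also an assumption. All function names are illustrative.

```python
from bisect import bisect_left


def lca(a, b):
    """Longest common prefix of two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def is_ancestor(u, x):
    """True if u is a proper ancestor of x (u ≺ x)."""
    return len(u) < len(x) and x[:len(u)] == u


def closest(v, s):
    """Neighbour of v in the sorted list s sharing the longest prefix with v."""
    i = bisect_left(s, v)
    candidates = s[max(i - 1, 0):i + 1]
    return max(candidates, key=lambda c: len(lca(v, c)))


def scan_eager(lists):
    """Lines 1-8 of Algorithm 33, with lists[0] playing the role of S1."""
    s1, others = lists[0], lists[1:]
    u, out = (0,), []                 # line 1: root with Dewey ID 0
    for v1 in s1:
        v = v1
        for s in others:              # line 4: lca over all closest nodes
            v = lca(v, closest(v1, s))
        if u < v:                     # line 5: pre(u) < pre(v)
            if not is_ancestor(u, v):
                out.append(u)         # line 7
            u = v                     # line 8
    out.append(u)                     # assumption: flush the last candidate
    return out


# Hypothetical data: two subtrees (0, 0) and (0, 1) under the root.
a = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
b = [(0, 0, 2), (0, 1, 1)]
c = [(0, 0, 3)]
print(scan_eager([a, b]))     # [(0, 0), (0, 1)]
print(scan_eager([a, b, c]))  # [(0, 0)]
```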

[Figure 4.4: XML tree rooted at r1. Node labels visible in the figure: a1, a100, b1, a101, a200, b2, a901, a1000, b10, b11, b1001, x10.]

Example 4.15 Consider a keyword query Q = {a, b} on the XML tree shown in Figure 4.4. Sa = {a1, ···, a1000} and Sb = {b1, ···, b1001}, slca(Sa, Sb) = {x1, ···, x10}. Since |Sa| < |Sb|, IndexedLookupEager will enumerate each of the "a" nodes in Sa in increasing Dewey ID order to compute a potential SLCA. This results in a total of 1000 slca computations to produce a result of size 10. Many of these computations are redundant, e.g., the SLCA of ai and Sb gives the same result x1 for 1 ≤ i ≤ 100.

Conceptually, each potential SLCA computed by IndexedLookupEager can be thought of as being driven by some node from Sa (or S1 in general). MultiwaySLCA instead picks an "anchor" node among the l keyword lists to drive the multiway SLCA computation at each individual step. In this example, MultiwaySLCA will first consider the first node in each keyword list and select the one with the largest Dewey ID as the anchor node. Thus, between a1 ∈ Sa and b1 ∈ Sb, it chooses b1 as the anchor node. Next, using b1 as an anchor, it will select the closest node from each other keyword list, i.e., a100 ∈ Sa, and will compute the lca of those chosen nodes, i.e., lca(a100, b1) = x1. The next anchor node is selected in the same way by removing all those nodes with Dewey ID smaller than pre(b1) from each keyword list. Then b2 is selected, and slca(b2, Sa) = x2. Clearly, MultiwaySLCA is able to skip many unnecessary computations.

Definition 4.16 Anchor Node. Given l lists S1, ···, Sl, a sequence of nodes, L = ⟨v1, ···, vl⟩ where vi ∈ Si, is said to be anchored by a node va ∈ L, if for each vi ∈ L, vi = closest(va, Si). We refer to va as the anchor node of L.
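A match anchored by a node can be computed directly from the definition. The Python sketch below does so on a scaled-down, hypothetical tree in the spirit of Example 4.15, not the tree of Figure 4.4 itself. It assumes that Dewey IDs are tuples of integers, that each list is sorted and non-empty, and that closest(v, S) is the neighbour of v in S sharing the longest prefix with v. The function names are illustrative.

```python
from bisect import bisect_left
from functools import reduce


def lca(a, b):
    """Longest common prefix of two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def closest(v, s):
    """Neighbour of v in the sorted list s sharing the longest prefix with v."""
    i = bisect_left(s, v)
    candidates = s[max(i - 1, 0):i + 1]
    return max(candidates, key=lambda c: len(lca(v, c)))


def anchored_match(anchor, lists):
    """Return the match anchored by `anchor` and the lca of that match."""
    match = [closest(anchor, s) for s in lists]
    return match, reduce(lca, match)


# Hypothetical tree: root (0,), subtrees x_j = (0, j) for j = 1..3,
# two "a" nodes and one "b" node under each x_j, plus a stray "b" at (0, 4).
s_a = [(0, j, k) for j in (1, 2, 3) for k in (0, 1)]
s_b = [(0, j, 2) for j in (1, 2, 3)] + [(0, 4)]

# First anchor: the largest Dewey ID among the first nodes of the lists.
first_anchor = max(s_a[0], s_b[0])
match, x = anchored_match(first_anchor, [s_a, s_b])
print(first_anchor)  # (0, 1, 2), the "b" node under x_1
print(match, x)      # [(0, 1, 1), (0, 1, 2)] (0, 1)
```

As in the example, anchoring on the first "b" node pairs it with the last "a" node of the same subtree, so a single lca computation yields x_1.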

Lemma 4.17 If lca(L) is a SLCA and v ∈ L, then lca(L) = lca(L′), where L′ is the set of nodes anchored by v in each Si.

Thus, it only needs to consider anchored sets, where a set is called anchored if it is anchored by some node, for computing potential SLCAs. In fact, from the definition of Compact LCA and its equivalence to ELCA, if a node u is a SLCA, then there must exist a set {v1, ···, vl}, where vi ∈ Si for 1 ≤ i ≤ l, such that u = lca(v1, ···, vl) and every vi is an anchor node.

Lemma 4.18 Consider two matches L = ⟨v1, ···, vl⟩ and L′ = ⟨u1, ···, ul⟩, where L < L′, i.e., vi ≤ ui for 1 ≤ i ≤ l, and L is anchored by some node vi. If L′ contains some node uj with pre(uj) ≤ pre(vi), then lca(L′) is either equal to lca(L) or an ancestor of lca(L).

Lemma 4.18 provides a useful property to find the next anchor node. Specifically, if we have considered a match L that is anchored by a node va, then we can skip all the nodes v ≤ va.

Lemma 4.19 Let L and L′ be two matches. If L′ contains two nodes, where one is a descendant of lca(L), while the other is not, then lca(L′) ⪯ lca(L).

Lemma 4.19 provides another useful property to optimize the next anchor node. Specifically, if we have considered a match L and lca(L) is guaranteed to be a SLCA, then we can skip all the nodes that are descendants of lca(L).

Lemma 4.20 Let L be a list of nodes, then lca(L) = lca(first(L), last(L)).

Note that, if the nodes in L are not in order, then first(L) and last(L) will take time O(ld), while directly using the definition also takes time O(ld), i.e., lca(v1, ···, vl) = lca(lca(v1, ···, vl−1), vl), where l is the number of nodes in L.
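Lemma 4.20 is easy to check on Dewey IDs. The sketch below assumes Dewey IDs are tuples of integers, so that min and max under Python's tuple ordering give first(L) and last(L). It compares the pairwise fold from the definition with the first/last shortcut. The function names are illustrative.

```python
from functools import reduce


def lca(a, b):
    """Longest common prefix of two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def lca_fold(nodes):
    """Definition: lca(v1, ..., vl) = lca(lca(v1, ..., vl-1), vl)."""
    return reduce(lca, nodes)


def lca_first_last(nodes):
    """Lemma 4.20: only the smallest and largest Dewey IDs matter."""
    return lca(min(nodes), max(nodes))


# An unordered list of hypothetical Dewey IDs.
nodes = [(0, 2, 1, 3), (0, 2, 0), (0, 2, 1), (0, 2, 4, 0)]
print(lca_fold(nodes), lca_first_last(nodes))  # (0, 2) (0, 2)
```

Both variants scan all l nodes when the list is unordered, which is the O(ld) cost noted above. The shortcut only pays off when first(L) and last(L) are already known.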

Two algorithms, namely, Basic Multiway-SLCA (BMS) and Incremental Multiway-SLCA (IMS), are proposed in [Sun et al., 2007] to compute all the SLCA nodes. The BMS algorithm implements the general idea above. IMS introduces one further optimization aimed to reduce the lca computation of BMS. However, lca takes the same time as comparing two Dewey IDs, and BMS needs to retrieve nodes in order from an unordered set, and this will incur extra time. So in the following, we will show the BMS algorithm, denoted as MultiwaySLCA, and only show the further optimization of IMS.

Algorithm 34 MultiwaySLCA(S1, ···, Sl)

1: vm ← last({first(Si) | 1 ≤ i ≤ l}), where the index m is also recorded
2: u ← root with Dewey ID 0
3: while vm ≠ ⊥ do
4:     if m ≠ 1 then
5:         v1 ← closest(vm, S1)
6:         vm ← v1, if vm < v1
7:     vi ← closest(vm, Si), for each 1 ≤ i ≤ l, i ≠ m
8:     x ← lca(first(v1, ···, vl), last(v1, ···, vl))
9:     if u ≤ x then
10:        output u as a SLCA, if u ⋠ x
11:        u ← x
12:    vm ← last({rm(vm, Si) | 1 ≤ i ≤ l, vi ≤ vm})
13:    if vm ≠ ⊥ and u ⋠ vm then
14:        vm ← last({vm} ∪ {out(u, Si) | 1 ≤ i ≤ l, i ≠ m})

MultiwaySLCA is shown in Algorithm 34. It computes the SLCAs iteratively. At each iteration, an anchor node vm is selected to compute the match anchored by vm and its LCA, where index m is also stored, and vm is initialized at line 1. Let u denote the potential SLCA node that is most recently computed, and it is initialized to be the root node with Dewey ID 0 (line 2). When vm is not ⊥, more potential SLCAs can be found (lines 3-13). Lines 4-6 further optimize the anchor node to be a node with a larger Dewey ID if one exists. After an anchor node vm is chosen, line 7 finds the match anchored by vm, and line 8 computes the LCA x of this match. If x ⪯ u (line 9), then x is ignored. Line 10 outputs u as a SLCA if it is not an ancestor-or-self of x. Then u is updated to be the recently computed potential SLCA. Lines 12-14 select the next anchor node by choosing the furthest possible node that maximizes the number of skipped nodes, where line 12 corresponds to Lemma 4.18, and lines 13-14 correspond to Lemma 4.19.

Theorem 4.21 Let u and x be the two variables in MultiwaySLCA. If u ≥ x then x ⪯ u. Otherwise either u ≺ x or u <⊀ x.¹ If u <⊀ x, then u is guaranteed to be a SLCA.

¹ u <⊀ x means that u < x but u ⊀ x.
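The case analysis of Theorem 4.21 is what lines 9-11 of Algorithm 34 act on. A minimal Python sketch of that single update step follows. It assumes Dewey IDs are tuples of integers and that the sequence of x values comes from the anchored matches, which is not computed here. The names step and is_ancestor_or_self are illustrative.

```python
def is_ancestor_or_self(u, x):
    """True if u ⪯ x, i.e., u is x itself or an ancestor of x."""
    return x[:len(u)] == u


def step(u, x, out):
    """Lines 9-11 of Algorithm 34: return the new candidate u."""
    if u <= x:                             # otherwise x ⪯ u and x is ignored
        if not is_ancestor_or_self(u, x):
            out.append(u)                  # u < x but u ⊀ x: u is a SLCA
        u = x
    return u


# Hypothetical sequence of LCAs x produced by successive anchored matches.
out, u = [], (0,)
for x in [(0, 1), (0, 1), (0,), (0, 2, 0)]:
    u = step(u, x, out)
print(out, u)  # [(0, 1)] (0, 2, 0)
```

In this run the third value, (0,), is an ancestor of the current candidate (0, 1) and is ignored, while (0, 2, 0) follows (0, 1) without being its descendant, so (0, 1) is emitted.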


IMS [Sun et al., 2007] further optimizes lines 7-8. Let L denote the match anchored by vm, i.e., L = ⟨v1, ···, vl⟩. Note that each call of closest requires two LCA computations. IMS reduces the number of LCA computations by enumerating all the possible L's whose LCA can be a potential SLCA; there are at most l possible choices. By the definition of a match L anchored by vm, it must satisfy the following three conditions:

• L ⊆ {vm} ∪ P ∪ N, where P = {lm(vm, Si) | i ∈ [1, l], i ≠ m, lm(vm, Si) ≠ ⊥} and N = {rm(vm, Si) | i ∈ [1, l], i ≠ m, rm(vm, Si) ≠ ⊥}

• L ∩ Si ≠ ∅, ∀i ∈ [1, l]

• vm ∈ L

Without loss of generality, we assume that all lm(vm, Si) and rm(vm, Si) are not equal to ⊥, P = ⟨u1, ···, ul−1⟩, where pre(ui) ≤ pre(ui+1) ∀i ∈ [1, l − 2], N = ⟨u′1, ···, u′l−1⟩ is the list corresponding to P, and vm ∈ Sl. Then all the possible L's whose LCA can be a potential SLCA are of the form ⟨ui, ···, ul−1, vm, u′1, ···, u′i−1⟩, denoted as Li. This is because, if first(L) = ui, then u′1, ···, u′i−1 must be in L, and Li is the one with the smallest last(L) among all those matches with first(L) = ui, which results in the largest LCA. Note that all the LCAs are on the path from the root to vm, as vm must be in L. Then we can enumerate L in the order L1, ···, Ll, where first(Li) ≤ first(Li+1) and last(Li) ≤ last(Li+1). Therefore, if lca(Li) ⪯ lca(Li+1) ∀i < j, and lca(Lj) ⋠ lca(Lj+1), then Lj is the match anchored by vm. Note that the above discussion assumes that the nodes in P are in increasing Dewey ID order, but usually this is not the case, so we have to sort P first.

BMS (MultiwaySLCA) and IMS correctly output all the SLCA nodes, i.e., slca(S1, ···, Sl), in time O(|S1| ld log |S|) [Sun et al., 2007].

4.3 IDENTIFY MEANINGFUL RETURN INFORMATION

The algorithms shown in the previous section study the efficiency aspect of keyword search. They can find and output all the SLCA nodes (or the whole subtrees rooted at SLCA nodes) efficiently, but they do not consider the user's intention for a keyword query. The information returned is either too little (only SLCAs are returned) or too much (the whole subtree rooted at each SLCA is returned). Two approaches have been proposed to identify meaningful return information for a keyword query. One alternative is representing the whole subtree rooted at a SLCA node compactly and presenting it to users, so that it will not overwhelm users [Liu and Chen, 2007]. Another alternative is returning only those subtrees that satisfy two novel properties, which capture desirable changes to a query result upon a change to the query or data in a general framework [Liu and Chen, 2008b]. Both works are based on the following definition of query result.

Definition 4.22 Keyword Query Results. Processing keyword query Q on XML tree T returns a set of query results, denoted as R(T, Q), where each query result is a subtree (defined by a pair
