Keyword Search in Databases- P7 ppt

SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES


[Figure 2.14: Lattice and Its Inputs from a Stream. Panels: A Partial Lattice; Inputs.]

v, in L, on Rlist specified for Ri{K}. Each deletion may notify some father nodes of v to be moved from Rlist or Wlist to Slist, and v may also be moved from Rlist to Wlist.

We have discussed several effective ranking strategies in Section 2.1. In this section, we discuss how to answer top-k keyword queries efficiently. A naive approach is to first generate all MTJNTs using the algorithms proposed in Section 2.3.1, then calculate the score of each MTJNT, and finally output the k MTJNTs with the highest scores. In DISCOVER-II [Hristidis et al., 2003a] and SPARK [Luo et al., 2007], several algorithms are proposed to get the top-k MTJNTs efficiently. The aim of all these algorithms is to find a proper order of generating MTJNTs so that they can stop early, before all MTJNTs are generated.

In DISCOVER-II, three algorithms are proposed to get the top-k MTJNTs, namely, the Sparse algorithm, the Single-Pipelined algorithm, and the Global-Pipelined algorithm. All three are based on the attribute level ranking function given in Eq. 2.1. Given a keyword query Q, for any tuple t, let the tuple score be $\mathit{score}(t, Q) = \sum_{a \in t} \mathit{score}(a, Q)$, where score(a, Q) is the score for attribute a of t as defined in Eq. 2.2. The score function in Eq. 2.1 has the property of tuple monotonicity, defined as follows. For any two MTJNTs $T = t_1 \bowtie t_2 \bowtie \cdots \bowtie t_l$ and $T' = t'_1 \bowtie t'_2 \bowtie \cdots \bowtie t'_l$ generated from the same CN C, if $\mathit{score}(t_i, Q) \le \mathit{score}(t'_i, Q)$ for every $1 \le i \le l$, then $\mathit{score}(T, Q) \le \mathit{score}(T', Q)$.

For a keyword query Q and a CN C, let the set of keyword relations in C that contain at least one keyword be $C.M = \{M_1, M_2, \ldots, M_s\}$. Suppose the tuples in each $M_i$ ($1 \le i \le s$) are sorted in non-increasing order of their scores, and let $M_i.t_j$ be the j-th tuple in $M_i$. In each $M_i$, we use $M_i.cur$ to denote the current tuple, such that all tuples before its position have been accessed, and we use $M_i.cur \leftarrow M_i.cur + 1$ to move $M_i.cur$ to the next position. We use $\mathit{eval}(t_1, t_2, \ldots, t_s)$ (where $t_i$ is a tuple and $t_i \in M_i$) to denote the MTJNTs of C obtained by fixing each $M_i$ to be $t_i$; it can be computed by issuing an SQL statement to the RDBMS. We use score(C, Q) to denote the upper bound score for all MTJNTs in C, defined as follows:

$$\mathit{score}(C, Q) = \sum_{i=1}^{s} \mathit{score}(M_i.t_1, Q)$$

Algorithm 7 Sparse(the keyword query Q, the top-k value k)

1: topk ← ∅

2: for all CNs C ranked in decreasing order of score(C, Q) do

3: if score(topk[k], Q) ≥ score(C, Q) then

4: break

5: evaluate C and update topk

6: output topk

The Sparse Algorithm: The Sparse algorithm avoids evaluating unnecessary CNs that cannot possibly generate results ranked in the top-k. The algorithm is shown in Algorithm 7. It first sorts all CNs by their upper bound value score(C, Q); then, for each CN, it generates all its MTJNTs and uses them to update topk (line 5). If the upper bound of the next CN is no larger than the k-th largest score, score(topk[k], Q), in the topk list, it can safely stop and output topk (lines 3-4).
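The early-termination idea of the Sparse algorithm can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory representation that is not part of the original text: each CN is a pair of its upper bound score(C, Q) and a callable `evaluate` that returns the (score, result) pairs of all its MTJNTs (standing in for the SQL evaluation).

```python
import heapq

def sparse_top_k(cns, k):
    """Return the k best (score, result) pairs over all candidate networks.

    cns: list of (upper_bound, evaluate) pairs (hypothetical representation);
         evaluate() returns (score, result) pairs for every MTJNT of that CN.
    """
    if k <= 0:
        return []
    topk = []      # min-heap of (score, tie_breaker, result); topk[0] is the k-th best
    counter = 0    # tie breaker so that results themselves are never compared
    for upper_bound, evaluate in sorted(cns, key=lambda cn: cn[0], reverse=True):
        # Lines 3-4 of Algorithm 7: no remaining CN can beat the k-th best score.
        if len(topk) == k and topk[0][0] >= upper_bound:
            break
        # Line 5 of Algorithm 7: evaluate the CN and merge its results into topk.
        for score, result in evaluate():
            counter += 1
            if len(topk) < k:
                heapq.heappush(topk, (score, counter, result))
            elif score > topk[0][0]:
                heapq.heapreplace(topk, (score, counter, result))
    return [(score, result) for score, _, result in sorted(topk, reverse=True)]
```

Because the CNs are visited in decreasing order of their upper bounds, the first CN whose bound does not exceed the current k-th best score also rules out every CN after it.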

The Single-Pipelined Algorithm: Given a keyword query Q, the Single-Pipelined algorithm first gets the top-k MTJNTs for each CN, and then combines them to get the final result. Suppose $C.M = \{M_1, M_2, \ldots, M_s\}$ for a given CN C, and let score(C.M, i) denote the upper bound score for any MTJNT that includes an unseen tuple in $M_i$. We have:

$$\mathit{score}(C.M, i) = \sum_{1 \le j \le s,\ j \ne i} \mathit{score}(M_j.t_1, Q) + \mathit{score}(M_i.cur + 1, Q) \qquad (2.21)$$

The Single-Pipelined algorithm (Algorithm 8) works as follows. Initially, all tuples in $M_i$ ($1 \le i \le s$) are unseen except for the first one, which is used for upper bounding the other unseen tuples (lines 2-4). Then, it iteratively chooses the list $M_p$ that maximizes the upper bound score, and moves $M_p.cur$ to the next unseen tuple (lines 6-7). It processes $M_p.cur$ using all the seen tuples in the other lists $M_i$ ($i \ne p$) and uses the results to update topk (lines 8-9). Once the maximum upper bound score over all unseen tuples, $\max_{1 \le i \le s} \mathit{score}(C.M, i)$, is no larger than the k-th largest score in the topk list, it can safely stop and output topk (line 5).
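The procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each list $M_i$ is a Python list of (score, tuple_id) pairs sorted by non-increasing score, the score of an MTJNT is taken to be the sum of its tuple scores (one simple monotonic choice that matches the bound in Eq. 2.21), and `evaluate` is a hypothetical callable standing in for the SQL statement behind eval(...).

```python
import heapq
from itertools import count, product

def single_pipelined(lists, evaluate, k):
    """Top-k results of one CN.

    lists[i]: tuples of M_i as (score, tuple_id), sorted by non-increasing score.
    evaluate(combo): hypothetical stand-in for eval(...); returns a list of
                     (score, result) pairs for one tuple chosen from each list.
    """
    s = len(lists)
    if k <= 0 or s == 0 or any(len(m) == 0 for m in lists):
        return []
    cur = [0] * s          # 0-based position of the last seen tuple in each list
    topk = []              # min-heap of (score, tie_breaker, result)
    tie = count()

    def offer(results):
        for score, result in results:
            if len(topk) < k:
                heapq.heappush(topk, (score, next(tie), result))
            elif score > topk[0][0]:
                heapq.heapreplace(topk, (score, next(tie), result))

    def bound(i):
        # Eq. 2.21: best tuple of every other list plus the next unseen tuple of M_i.
        nxt = cur[i] + 1
        if nxt >= len(lists[i]):
            return float("-inf")   # M_i has no unseen tuples left
        return sum(lists[j][0][0] for j in range(s) if j != i) + lists[i][nxt][0]

    offer(evaluate(tuple(m[0] for m in lists)))              # line 4
    while True:
        p = max(range(s), key=bound)                         # line 6
        kth = topk[0][0] if len(topk) == k else float("-inf")
        if bound(p) <= kth:                                  # line 5
            break
        cur[p] += 1                                          # line 7
        # Lines 8-9: join the new tuple of M_p with all seen tuples of the other lists.
        seen = [[lists[i][cur[i]]] if i == p else lists[i][:cur[i] + 1]
                for i in range(s)]
        for combo in product(*seen):
            offer(evaluate(combo))
    return [(score, result) for score, _, result in sorted(topk, reverse=True)]
```

The inner `product` loop is where the cost grows: every newly accessed tuple is combined with all seen tuples of the other lists.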

Algorithm 8 Single-Pipelined(the keyword query Q, the top-k value k, the CN C)

1: topk ← ∅

2: let C.M = {M1, M2, ..., Ms}

3: initialize Mi.cur ← Mi.t1 for 1 ≤ i ≤ s

4: update topk using eval(M1.t1, M2.t1, ..., Ms.t1)

5: while max1≤i≤s score(C.M, i) > score(topk[k], Q) do

6: suppose score(C.M, p) = max1≤i≤s score(C.M, i)

7: Mp.cur ← Mp.cur + 1

8: for all t1, t2, ..., tp−1, tp+1, ..., ts where ti is seen and ti ∈ Mi for 1 ≤ i ≤ s, i ≠ p do

9: update topk using eval(t1, t2, ..., tp−1, Mp.cur, tp+1, ..., ts)

10: output topk

The Global-Pipelined Algorithm: The Single-Pipelined algorithm introduced above considers each CN individually before combining their top-k results to get the final top-k results. The Global-Pipelined algorithm considers all the CNs together. It uses procedures similar to those of the Single-Pipelined algorithm. The only differences are that there is a single topk list, and that each time it selects the CN $C_p$ for which $\max_{1 \le i \le s} \mathit{score}(C_p.M, i)$ is maximized before processing lines 6-9 of the Single-Pipelined algorithm. Once the upper bound value for all unseen tuples, $\max_{1 \le i \le s,\ C_j \in \mathcal{C}} \mathit{score}(C_j.M, i)$, is no larger than the k-th largest value in the topk list, it can stop and output the global top-k results.

In SPARK [Luo et al., 2007], the authors study the tree level ranking function in Eq. 2.11. This ranking function does not satisfy tuple monotonicity. As a result, the top-k algorithms discussed earlier that stop early (e.g., the Global-Pipelined algorithm) cannot be guaranteed to output correct top-k results. In order to handle such non-monotonic score functions, a new monotonic upper bound function is introduced. The intuition behind the upper bound function is that, if the upper bound score is already smaller than the score of a certain result, then the upper bound scores of all unseen tuples will also be smaller than the score of this result, due to the monotonicity of the upper bound function. The upper bound score uscore(T, Q) is defined as follows:

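$$\mathit{uscore}(T, Q) = \mathit{uscore}_a(T, Q) \cdot \mathit{score}_b(T, Q) \cdot \mathit{score}_c(T, Q) \qquad (2.22)$$

where

$$\mathit{uscore}_a(T, Q) = \frac{1}{1 - s} \cdot \min(A(T, Q), B(T, Q))$$

$$A(T, Q) = \mathit{sumidf}(T, Q) \cdot \Bigl(1 + \ln\Bigl(1 + \ln\Bigl(\sum_{t \in T} \mathit{wantf}(t, T, Q)\Bigr)\Bigr)\Bigr)$$

$$B(T, Q) = \mathit{sumidf}(T, Q) \cdot \sum_{t \in T} \mathit{wantf}(t, T, Q)$$

$$\mathit{sumidf}(T, Q) = \sum_{w \in T \cap Q} \mathit{idf}(T, w)$$

$$\mathit{wantf}(t, T, Q) = \frac{\sum_{w \in t \cap Q} \mathit{tf}(t, w) \cdot \mathit{idf}(T, w)}{\mathit{sumidf}(T, Q)}$$

$\mathit{score}_b(T, Q)$ and $\mathit{score}_c(T, Q)$ can be determined given the CN of T. We have the following theorem.

uscore(T, Q) ≥ score(T, Q), where score(T, Q) is defined in Eq. 2.11.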
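As a small worked illustration of the $\mathit{uscore}_a$ component, the following Python sketch evaluates the formulas above on plain dictionaries. The input layout (one dictionary of term frequencies per tuple, one dictionary of idf values) is an assumption made for this example only, and the sketch further assumes that every query keyword occurs in T at least once, so that the inner logarithm is non-negative.

```python
import math

def uscore_a(tuple_tfs, idf, s):
    """Compute uscore_a(T, Q) = min(A, B) / (1 - s).

    tuple_tfs: one dict per tuple t in T, mapping keyword w -> tf(t, w)
               (assumed layout; only query keywords are listed).
    idf:       dict mapping each keyword w in T ∩ Q -> idf(T, w).
    s:         the constant s of the ranking function, with 0 <= s < 1.
    """
    sumidf = sum(idf.values())
    # wantf(t, T, Q): idf-weighted average term frequency of tuple t.
    wantf = [sum(tf * idf[w] for w, tf in tfs.items()) / sumidf
             for tfs in tuple_tfs]
    total = sum(wantf)
    a = sumidf * (1 + math.log(1 + math.log(total)))
    b = sumidf * total
    return min(a, b) / (1 - s)

# Two tuples: one contains "xml" once, the other contains "michelle" twice.
value = uscore_a([{"xml": 1}, {"michelle": 2}],
                 {"xml": 2.0, "michelle": 3.0}, s=0.2)
```

Both A and B grow with each wantf value, which is what makes uscore usable as a monotonic bound when tuples are accessed in decreasing order of wantf.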


Algorithm 9 Skyline-Sweeping(the keyword query Q, the top-k value k, the CN C)

1: topk ← ∅; 𝒬 ← ∅

2: 𝒬.push((1, 1, ..., 1), uscore(1, 1, ..., 1))

3: while 𝒬.max-uscore > score(topk[k], Q) do

4: c ← 𝒬.popmax()

5: update topk using eval(c)

6: for i = 1 to s do

7: c′ ← c

8: c′[i] ← c′[i] + 1

9: 𝒬.push(c′, uscore(c′))

10: if c[i] > 1 then

11: break

12: output topk

Another problem with the Global-Pipelined algorithm is that when a new tuple $M_p.cur$ is processed, it tries all the combinations of seen tuples $(t_1, t_2, \ldots, t_{p-1}, t_{p+1}, \ldots, t_s)$ to test whether each combination can be joined with $M_p.cur$. This operation is costly because the number of combinations can be extremely large when the number of seen tuples becomes large.

The Skyline-Sweeping Algorithm: Skyline-Sweeping has been proposed in SPARK to handle two problems: (1) dealing with the non-monotonic score function in Eq. 2.11, and (2) significantly reducing the number of combinations tested. Suppose that in $M_1, M_2, \ldots, M_s$ of CN C, tuples are ranked in decreasing order of their wantf values. For simplicity, we use $c = (i_1, i_2, \ldots, i_s)$ to denote the combination of tuples $(M_1.t_{i_1}, M_2.t_{i_2}, \ldots, M_s.t_{i_s})$, and we use $\mathit{uscore}(i_1, i_2, \ldots, i_s)$ to denote the uscore (Eq. 2.22) of the MTJNTs that include the tuples $(M_1.t_{i_1}, M_2.t_{i_2}, \ldots, M_s.t_{i_s})$. The Skyline-Sweeping algorithm is shown in Algorithm 9.

The algorithm processes a single CN C. A priority queue 𝒬 keeps the set of seen but not yet tested combinations, ordered by uscore. Iteratively, the combination c with the largest uscore is selected from 𝒬 (line 4). Every time a combination is selected, it is evaluated to update the topk list (line 5). Then all of its adjacent combinations are tried in a non-redundant way (lines 6-11), and each adjacent combination is pushed into 𝒬. Lines 10-11 ensure that each combination is enumerated only once. If the maximum uscore of the combinations in 𝒬 is no larger than the k-th largest score in the topk list, the algorithm can stop and output the topk list as the final result (line 3). The comparison between the combinations processed by the Single-Pipelined algorithm and those processed by the Skyline-Sweeping algorithm is shown in Figure 2.15.

When there are multiple CNs, the Skyline-Sweeping algorithm can be adapted using methods similar to those introduced for the Global-Pipelined algorithm, i.e., 𝒬 and topk are made global so that they maintain the combinations from multiple CNs.
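The single-CN procedure described above can be sketched in Python with the standard `heapq` module. The sketch makes several assumptions for illustration: combinations are 1-based index tuples, `sizes[i]` is the number of tuples in $M_i$, `uscore` is a caller-supplied function that does not increase when any index grows, and `evaluate` is a hypothetical stand-in for eval(c) that returns the real scores of the MTJNTs of a combination.

```python
import heapq

def skyline_sweeping(sizes, uscore, evaluate, k):
    """Top-k (score, combination) pairs for a single CN.

    sizes[i]:     number of tuples in M_i (assumed in-memory view of the lists).
    uscore(c):    upper bound for combination c (a tuple of 1-based indexes).
    evaluate(c):  hypothetical stand-in for eval(c); returns a list of real scores.
    """
    s = len(sizes)
    if k <= 0 or s == 0 or min(sizes) == 0:
        return []
    start = (1,) * s
    queue = [(-uscore(start), start)]   # max-heap via negated uscore
    topk = []                           # min-heap of (score, combination)
    while queue:
        kth = topk[0][0] if len(topk) == k else float("-inf")
        if -queue[0][0] <= kth:         # line 3: no queued combination can improve topk
            break
        _, c = heapq.heappop(queue)     # line 4
        for score in evaluate(c):       # line 5
            if len(topk) < k:
                heapq.heappush(topk, (score, c))
            elif score > topk[0][0]:
                heapq.heapreplace(topk, (score, c))
        for i in range(s):              # lines 6-11: push adjacent combinations
            if c[i] < sizes[i]:
                nxt = c[:i] + (c[i] + 1,) + c[i + 1:]
                heapq.heappush(queue, (-uscore(nxt), nxt))
            if c[i] > 1:                # lines 10-11: keeps the enumeration duplicate-free
                break
    return sorted(topk, reverse=True)
```

The break on `c[i] > 1` means a combination is only ever reached from the neighbor that decrements its first index greater than 1, so no combination is pushed twice.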


[Figure 2.15: Saving computational cost using the Skyline-Sweeping algorithm. Label: Processed Area.]

The Block-Pipelined Algorithm: The upper bound score function in Eq. 2.22 plays two roles in the algorithm: (1) its monotonicity ensures that the algorithm outputs the correct top-k results when stopping early, and (2) it serves as an estimate of the real score of the results. The tighter the estimate, the earlier the algorithm stops. The upper bound score function in Eq. 2.22 may sometimes be very loose, which causes many unnecessary combinations to be tested. In order to reduce such unnecessary combinations, a new Block-Pipelined algorithm is proposed in SPARK. A new upper bound score function, bscore, is introduced; it is tighter than the uscore function in Eq. 2.22, but it is not monotonic. The aim of the Block-Pipelined algorithm is to use both the uscore and the bscore functions such that (1) the uscore function makes sure that the topk results are correctly output, and (2) the bscore function decreases the gap between the estimated value and the real value of the results, and thus reduces the computational cost. The bscore is defined as follows:

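$$\mathit{bscore}(T, Q) = \mathit{bscore}_a(T, Q) \cdot \mathit{score}_b(T, Q) \cdot \mathit{score}_c(T, Q) \qquad (2.23)$$

where

$$\mathit{bscore}_a(T, Q) = \sum_{w \in T \cap Q} \bigl(1 + \ln(1 + \ln(\mathit{tf}(T, w)))\bigr)$$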

The Block-Pipelined algorithm is shown in Algorithm 10; it is similar to the Skyline-Sweeping algorithm. The difference is that it assigns a status to each enumerated combination c. The first time c is enumerated, the algorithm calculates its uscore, sets its status to USCORE, and inserts it into the queue 𝒬 (lines 9-14). When c is taken from the queue with status USCORE, the algorithm calculates its bscore, sets its status to BSCORE, and reinserts it into 𝒬 (lines 6-8) before enumerating its neighbors (lines 9-14). If its status is already BSCORE, the algorithm evaluates it and updates the topk list (line 16). The Block-Pipelined algorithm deals with the single-CN case. When there are multiple CNs, it can use the same methods for handling multiple CNs as the Skyline-Sweeping algorithm.
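The two-status scheme described above can be sketched in Python as follows. This sketch is reconstructed from the prose description only and is not a transcription of Algorithm 10; in particular, the stopping test (stop when the largest queued bound is no larger than the k-th best score) is assumed to mirror the Skyline-Sweeping algorithm. The functions `uscore`, `bscore` and `evaluate` are hypothetical caller-supplied callables, with `bscore(c)` assumed to be an upper bound on the real scores of combination c.

```python
import heapq

def block_pipelined(sizes, uscore, bscore, evaluate, k):
    """Top-k (score, combination) pairs for a single CN using two bounds.

    sizes[i]:    number of tuples in M_i.
    uscore(c):   monotonic upper bound for combination c (1-based index tuple).
    bscore(c):   tighter, possibly non-monotonic upper bound for c itself.
    evaluate(c): hypothetical stand-in for eval(c); returns a list of real scores.
    """
    s = len(sizes)
    if k <= 0 or s == 0 or min(sizes) == 0:
        return []
    start = (1,) * s
    queue = [(-uscore(start), start, "USCORE")]   # max-heap via negated bound
    topk = []                                     # min-heap of (score, combination)
    while queue:
        kth = topk[0][0] if len(topk) == k else float("-inf")
        if -queue[0][0] <= kth:                   # assumed stopping test
            break
        _, c, status = heapq.heappop(queue)
        if status == "USCORE":
            # Re-queue c under its tighter bound, then enumerate its neighbors.
            heapq.heappush(queue, (-bscore(c), c, "BSCORE"))
            for i in range(s):
                if c[i] < sizes[i]:
                    nxt = c[:i] + (c[i] + 1,) + c[i + 1:]
                    heapq.heappush(queue, (-uscore(nxt), nxt, "USCORE"))
                if c[i] > 1:                      # duplicate-free enumeration
                    break
        else:
            # Status BSCORE: c survived both bounds, so evaluate it for real.
            for score in evaluate(c):
                if len(topk) < k:
                    heapq.heappush(topk, (score, c))
                elif score > topk[0][0]:
                    heapq.heapreplace(topk, (score, c))
    return sorted(topk, reverse=True)
```

A combination is evaluated only after it has reached the head of the queue twice, once under each bound, so combinations with a loose uscore but a low bscore can be skipped without issuing a join.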
