SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES
Figure 2.14: Lattice and Its Inputs from a Stream (panels: A Partial Lattice; Inputs)
v, in L, on Rlist specified for Ri{K}. Each deletion may notify some father nodes of v to be moved from Rlist or Wlist to Slist, and v may also be moved from Rlist to Wlist.
We have discussed several effective ranking strategies in Section 2.1. In this section, we discuss how to answer top-k keyword queries efficiently. A naive approach is to first generate all MTJNTs using the algorithms proposed in Section 2.3.1, then calculate the score of each MTJNT, and finally output the k MTJNTs with the highest scores. In DISCOVER-II [Hristidis et al., 2003a] and SPARK [Luo et al., 2007], several algorithms are proposed to compute the top-k MTJNTs efficiently. All of these algorithms aim to generate MTJNTs in an order that allows them to stop early, before all MTJNTs have been generated.
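As a point of reference, the naive approach can be sketched in a few lines of Python. This is only an illustration: the MTJNTs are assumed to be already materialized, each one is modeled (hypothetically) as a tuple of per-tuple scores, and `naive_top_k` is a name introduced here, not one used by DISCOVER-II or SPARK.

```python
import heapq

def naive_top_k(mtjnts, score, k):
    """Score every MTJNT and keep the k best ones (no early stopping)."""
    return heapq.nlargest(k, mtjnts, key=score)

# Hypothetical toy data: each MTJNT is a tuple of per-tuple scores,
# and the MTJNT score is simply their sum.
results = [(0.9, 0.1), (0.5, 0.6), (0.2, 0.2), (0.8, 0.7)]
top2 = naive_top_k(results, score=sum, k=2)
```

Every MTJNT is generated and scored here, which is exactly the cost the algorithms in this section try to avoid.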
In DISCOVER-II, three algorithms are proposed to get the top-k MTJNTs, namely, the Sparse algorithm, the Single-Pipelined algorithm, and the Global-Pipelined algorithm. All three are based on the attribute level ranking function given in Eq. 2.1. Given a keyword query Q, for any tuple t, let the tuple score be

\[ \mathit{score}(t, Q) = \sum_{a \in t} \mathit{score}(a, Q) \]

where score(a, Q) is the score for attribute a of t as defined in Eq. 2.2. The score function in Eq. 2.1 has the property of tuple monotonicity, defined as follows. For any two MTJNTs $T = t_1 \bowtie t_2 \bowtie \cdots \bowtie t_l$ and $T' = t'_1 \bowtie t'_2 \bowtie \cdots \bowtie t'_l$ generated from the same CN C, if $\mathit{score}(t_i, Q) \le \mathit{score}(t'_i, Q)$ for every $1 \le i \le l$, then $\mathit{score}(T, Q) \le \mathit{score}(T', Q)$.
For a keyword query Q and a CN C, let the set of keyword relations in C that contain at least one keyword be $C.M = \{M_1, M_2, \ldots, M_s\}$. Suppose the tuples in each $M_i$ ($1 \le i \le s$) are sorted in non-increasing order of their scores. Let $M_i.t_j$ be the j-th tuple in $M_i$. In each $M_i$, we use $M_i.cur$ to denote the current tuple, such that all tuples before its position have been accessed, and we use $M_i.cur \leftarrow M_i.cur + 1$ to move $M_i.cur$ to the next position. We use $eval(t_1, t_2, \ldots, t_s)$ (where $t_i$ is a tuple and $t_i \in M_i$) to denote the MTJNTs of C obtained by fixing each $M_i$ to be $t_i$. It can be computed by issuing an SQL statement to the RDBMS. We use score(C, Q) to denote the upper bound score for all MTJNTs in C, defined as follows:

\[ \mathit{score}(C, Q) = \sum_{i=1}^{s} \mathit{score}(M_i.t_1, Q) \]

Algorithm 7 Sparse(the keyword query Q, the top-k value k)
1: topk ← ∅
2: for all CNs C ranked in decreasing order of score(C, Q) do
3:   if score(topk[k], Q) ≥ score(C, Q) then
4:     break
5:   evaluate C and update topk
6: output topk
The Sparse Algorithm: The Sparse algorithm avoids evaluating unnecessary CNs that cannot possibly generate results ranked in the top k. The algorithm is shown in Algorithm 7. It first sorts all CNs by their upper bound score(C, Q); then, for each CN in that order, it generates all of its MTJNTs and uses them to update topk (line 5). If the upper bound of the next CN is no larger than the k-th largest score in the topk list, score(topk[k], Q), it can safely stop and output topk (lines 3-4).
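The early-stopping test of the Sparse algorithm can be illustrated with a small Python sketch. It is a simplification under stated assumptions: each CN is modeled as a hypothetical `(upper_bound, evaluate)` pair, where `evaluate()` stands in for the SQL evaluation of the CN and returns the scores of its MTJNTs; the counter `evaluated` is added only to make the saving visible.

```python
import heapq

def sparse(cns, k):
    """cns: list of (upper_bound, evaluate) pairs (hypothetical model).
    evaluate() returns the scores of all MTJNTs of that CN."""
    topk = []       # min-heap holding the k best scores seen so far
    evaluated = 0   # number of CNs actually evaluated
    for bound, evaluate in sorted(cns, key=lambda cn: cn[0], reverse=True):
        # Lines 3-4: stop once the k-th best score reaches the next bound.
        if len(topk) == k and topk[0] >= bound:
            break
        evaluated += 1
        # Line 5: evaluate the CN and update topk.
        for s in evaluate():
            if len(topk) < k:
                heapq.heappush(topk, s)
            elif s > topk[0]:
                heapq.heapreplace(topk, s)
    return sorted(topk, reverse=True), evaluated

# Toy input: four CNs with their upper bounds and MTJNT scores.
cns = [
    (3.0, lambda: [2.8, 2.5, 1.0]),
    (2.6, lambda: [2.6, 0.4]),
    (2.0, lambda: [1.9, 1.8]),
    (1.2, lambda: [1.1]),
]
top, n_evaluated = sparse(cns, k=2)
```

In this toy run the last two CNs are never evaluated, because their upper bounds cannot beat the second-best score already found.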
The Single-Pipelined Algorithm: Given a keyword query Q, the Single-Pipelined algorithm first gets the top-k MTJNTs for each CN, and then combines them to get the final result. Suppose $C.M = \{M_1, M_2, \ldots, M_s\}$ for a given CN C, and let score(C.M, i) denote the upper bound score for any MTJNT that includes an unseen tuple in $M_i$. We have:

\[ \mathit{score}(C.M, i) = \sum_{1 \le j \le s,\ j \ne i} \mathit{score}(M_j.t_1, Q) + \mathit{score}(M_i.cur + 1, Q) \qquad (2.21) \]
The Single-Pipelined algorithm (Algorithm 8) works as follows. Initially, all tuples in $M_i$ ($1 \le i \le s$) are unseen except for the first one, which is used for upper bounding the other unseen tuples (lines 2-4). Then, it iteratively chooses the list $M_p$ that maximizes the upper bound score, and it moves $M_p.cur$ to the next unseen tuple (lines 6-7). It processes $M_p.cur$ using all the seen tuples in the other lists $M_i$ ($i \ne p$) and uses the results to update topk (lines 8-9). Once the maximum possible upper bound score for all unseen tuples, $\max_{1 \le i \le s} \mathit{score}(C.M, i)$, is no larger than the k-th largest score in the topk list, it can safely stop and output topk (line 5).
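The procedure described above can be sketched in Python as follows. The sketch makes several assumptions that are not part of the original algorithm listing: each list is a Python list of hypothetical `(id, score)` pairs, the SQL-based eval is replaced by a `joins(combo)` predicate, and an MTJNT score is taken to be the plain sum of its tuple scores (which satisfies tuple monotonicity).

```python
import heapq
from itertools import product

def single_pipelined(lists, joins, k):
    """lists[i]: tuples of M_i as (id, score), in non-increasing score order.
    joins(combo): hypothetical stand-in for eval(); True if combo is an MTJNT."""
    s = len(lists)
    cur = [0] * s    # index of the last accessed tuple in each M_i
    topk = []        # min-heap of (score, ids)
    tested = 0       # number of combinations sent to joins()

    def offer(combo):
        nonlocal tested
        tested += 1
        if joins(combo):
            item = (sum(t[1] for t in combo), tuple(t[0] for t in combo))
            if len(topk) < k:
                heapq.heappush(topk, item)
            elif item > topk[0]:
                heapq.heapreplace(topk, item)

    def bound(i):
        # Eq. 2.21: best tuples of the other lists plus the next unseen of M_i.
        if cur[i] + 1 >= len(lists[i]):
            return float("-inf")
        rest = sum(lists[j][0][1] for j in range(s) if j != i)
        return rest + lists[i][cur[i] + 1][1]

    offer(tuple(m[0] for m in lists))                       # line 4
    while True:
        p = max(range(s), key=bound)                        # line 6
        kth = topk[0][0] if len(topk) == k else float("-inf")
        if bound(p) <= kth:                                 # line 5
            break
        cur[p] += 1                                         # line 7
        seen = [[lists[p][cur[p]]] if i == p else lists[i][: cur[i] + 1]
                for i in range(s)]
        for combo in product(*seen):                        # lines 8-9
            offer(combo)
    return sorted(topk, reverse=True), tested

M1 = [("a1", 1.0), ("a2", 0.5)]
M2 = [("p1", 0.75), ("p2", 0.5), ("p3", 0.25)]
joinable = {("a1", "p2"), ("a2", "p1"), ("a1", "p3")}
top, n_tested = single_pipelined(
    [M1, M2], lambda combo: tuple(t[0] for t in combo) in joinable, k=1)
```

With these toy lists only two of the six possible combinations are tested before the bound check stops the loop.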
Algorithm 8 Single-Pipelined(the keyword query Q, the top-k value k, the CN C)
1: topk ← ∅
2: let C.M = {M1, M2, ..., Ms}
3: initialize Mi.cur ← Mi.t1 for 1 ≤ i ≤ s
4: update topk using eval(M1.t1, M2.t1, ..., Ms.t1)
5: while max1≤i≤s score(C.M, i) > score(topk[k], Q) do
6:   suppose score(C.M, p) = max1≤i≤s score(C.M, i)
7:   Mp.cur ← Mp.cur + 1
8:   for all t1, t2, ..., tp−1, tp+1, ..., ts where ti is seen and ti ∈ Mi for 1 ≤ i ≤ s do
9:     update topk using eval(t1, t2, ..., tp−1, Mp.cur, tp+1, ..., ts)
10: output topk

The Global-Pipelined Algorithm: The Single-Pipelined algorithm introduced above considers each CN individually before combining their top-k results in order to get the final top-k results. The Global-Pipelined algorithm considers all the CNs together. It uses procedures similar to those of the Single-Pipelined algorithm. The only difference is that there is only one topk list, and each time, it selects a CN $C_p$ such that $\max_{1 \le i \le s} \mathit{score}(C_p.M, i)$ is maximized before processing lines 6-9 of the Single-Pipelined algorithm. Once the upper bound value for all unseen tuples, $\max_{1 \le i \le s,\ C_j \in \mathcal{C}} \mathit{score}(C_j.M, i)$, is no larger than the k-th largest value in the topk list, it can stop and output the global top-k results.
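A minimal Python sketch of the global variant is given below. It assumes the same simplified model as a per-CN pipeline would use: hypothetical `(id, score)` tuples, a set `joinable` of id combinations standing in for the SQL-based eval, and the sum of tuple scores as the MTJNT score. The class `CNState` and its methods are names introduced here for illustration only.

```python
import heapq
from itertools import product

NEG_INF = float("-inf")

class CNState:
    """Cursor state of one CN (hypothetical model of C.M and M_i.cur)."""
    def __init__(self, name, lists, joinable):
        self.name, self.lists, self.joinable = name, lists, joinable
        self.cur = [0] * len(lists)

    def bound(self, i):
        # Eq. 2.21 for list i of this CN.
        if self.cur[i] + 1 >= len(self.lists[i]):
            return NEG_INF
        rest = sum(m[0][1] for j, m in enumerate(self.lists) if j != i)
        return rest + self.lists[i][self.cur[i] + 1][1]

    def best(self):
        # (largest bound over the lists of this CN, negated list index)
        return max((self.bound(i), -i) for i in range(len(self.lists)))

def global_pipelined(cns, k):
    topk = []    # single global min-heap of (score, CN name, ids)

    def offer(cn, combo):
        ids = tuple(t[0] for t in combo)
        if ids in cn.joinable:
            item = (sum(t[1] for t in combo), cn.name, ids)
            if len(topk) < k:
                heapq.heappush(topk, item)
            elif item > topk[0]:
                heapq.heapreplace(topk, item)

    for cn in cns:
        offer(cn, tuple(m[0] for m in cn.lists))
    while True:
        cn = max(cns, key=CNState.best)      # CN with the largest bound
        b, neg_p = cn.best()
        kth = topk[0][0] if len(topk) == k else NEG_INF
        if b <= kth:
            break
        p = -neg_p
        cn.cur[p] += 1
        seen = [[m[cn.cur[p]]] if i == p else m[: cn.cur[i] + 1]
                for i, m in enumerate(cn.lists)]
        for combo in product(*seen):
            offer(cn, combo)
    return sorted(topk, reverse=True)

cn_a = CNState("A", [[("a1", 1.0), ("a2", 0.5)],
                     [("p1", 0.75), ("p2", 0.5)]], {("a1", "p2")})
cn_b = CNState("B", [[("w1", 0.5)],
                     [("x1", 0.25), ("x2", 0.125)]], {("w1", "x1")})
top = global_pipelined([cn_a, cn_b], k=2)
```

The unseen tuple of the second CN is never fetched in this run, since its bound falls below the second-best score found so far.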
In SPARK [Luo et al., 2007], the authors study the tree level ranking function in Eq. 2.11. This ranking function does not satisfy tuple monotonicity. As a result, the top-k algorithms discussed earlier that stop early (e.g., the Global-Pipelined algorithm) cannot be guaranteed to output correct top-k results. In order to handle such non-monotonic score functions, a new monotonic upper bound function is introduced. The intuition behind the upper bound function is that, if the upper bound score is already smaller than the score of a certain result, then all the upper bound scores of unseen tuples will be smaller than the score of this result, due to the monotonicity of the upper bound function. The upper bound score uscore(T, Q) is defined as follows:
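\[ \mathit{uscore}(T, Q) = \mathit{uscore}_a(T, Q) \cdot \mathit{score}_b(T, Q) \cdot \mathit{score}_c(T, Q) \qquad (2.22) \]

where

\[ \mathit{uscore}_a(T, Q) = \frac{1}{1 - s} \cdot \min(A(T, Q), B(T, Q)) \]
\[ A(T, Q) = \mathit{sumidf}(T, Q) \cdot \Bigl(1 + \ln\bigl(1 + \ln\bigl(\sum_{t \in T} \mathit{wantf}(t, T, Q)\bigr)\bigr)\Bigr) \]
\[ B(T, Q) = \mathit{sumidf}(T, Q) \cdot \sum_{t \in T} \mathit{wantf}(t, T, Q) \]
\[ \mathit{sumidf}(T, Q) = \sum_{w \in T \cap Q} \mathit{idf}(T, w) \]
\[ \mathit{wantf}(t, T, Q) = \frac{\sum_{w \in t \cap Q} \mathit{tf}(t, w) \cdot \mathit{idf}(T, w)}{\mathit{sumidf}(T, Q)} \]

$\mathit{score}_b(T, Q)$ and $\mathit{score}_c(T, Q)$ can be determined given the CN of T. We have the following theorem: uscore(T, Q) ≥ score(T, Q), where score(T, Q) is defined in Eq. 2.11.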
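To make the pieces of the upper bound concrete, the following Python sketch computes the $\mathit{uscore}_a$ factor for a toy input. It rests on assumptions that are not stated in the text: term frequencies and idf values are passed in as plain dictionaries, the constant s is taken to satisfy 0 ≤ s < 1, and every keyword in `idf` occurs in at least one tuple, so that the summed wantf is at least 1 and the nested logarithm is defined.

```python
import math

def wantf(tf, idf, sumidf):
    """Weighted term frequency of one tuple.
    tf: {keyword: frequency in the tuple}; idf: {keyword: idf(T, w)}."""
    return sum(tf[w] * idf[w] for w in tf if w in idf) / sumidf

def uscore_a(tuple_tfs, idf, s):
    """tuple_tfs: one tf dictionary per tuple of T.
    idf: idf(T, w) for the query keywords w contained in T.
    s: the constant of the ranking function (assumed 0 <= s < 1)."""
    sumidf = sum(idf.values())
    total = sum(wantf(tf, idf, sumidf) for tf in tuple_tfs)
    a = sumidf * (1 + math.log(1 + math.log(total)))   # A(T, Q)
    b = sumidf * total                                 # B(T, Q)
    return min(a, b) / (1 - s)

# Hypothetical values for a two-tuple MTJNT and a two-keyword query.
idf = {"xml": 2.0, "michelle": 3.0}
tuple_tfs = [{"xml": 2}, {"michelle": 1}]
bound = uscore_a(tuple_tfs, idf, s=0.2)
```

The full uscore would multiply this factor by $\mathit{score}_b$ and $\mathit{score}_c$, which depend only on the CN and are left out of the sketch.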
Algorithm 9 Skyline-Sweeping(the keyword query Q, the top-k value k, the CN C)
1: topk ← ∅; Q ← ∅
2: Q.push((1, 1, ..., 1), uscore(1, 1, ..., 1))
3: while Q.max-uscore > score(topk[k], Q) do
4:   c ← Q.popmax()
5:   update topk using eval(c)
6:   for i = 1 to s do
7:     c′ ← c
8:     c′[i] ← c′[i] + 1
9:     Q.push(c′, uscore(c′))
10:    if c[i] > 1 then
11:      break
12: output topk
Another problem with the Global-Pipelined algorithm is that when a new tuple $M_p.cur$ is processed, it tries all the combinations of seen tuples $(t_1, t_2, \ldots, t_{p-1}, t_{p+1}, \ldots, t_s)$ to test whether each combination can be joined with $M_p.cur$. This operation is costly because the number of combinations can be extremely large when the number of seen tuples becomes large.
The Skyline-Sweeping Algorithm: Skyline-Sweeping has been proposed in SPARK to handle two problems: (1) dealing with the non-monotonic score function in Eq. 2.11, and (2) significantly reducing the number of combinations tested. Suppose that in $M_1, M_2, \ldots, M_s$ of CN C, tuples are ranked in decreasing order of their wantf values. For simplicity, we use $c = (i_1, i_2, \ldots, i_s)$ to denote the combination of tuples $(M_1.t_{i_1}, M_2.t_{i_2}, \ldots, M_s.t_{i_s})$, and we use $\mathit{uscore}(i_1, i_2, \ldots, i_s)$ to denote the uscore (Eq. 2.22) for the MTJNTs that include the tuples $(M_1.t_{i_1}, M_2.t_{i_2}, \ldots, M_s.t_{i_s})$. The Skyline-Sweeping algorithm is shown in Algorithm 9.
The algorithm processes a single CN C. A priority queue Q (not to be confused with the keyword query Q) is used to keep the set of seen but not yet tested combinations, ordered by uscore. Iteratively, the combination c with the largest uscore is selected from Q (line 4). Every time a combination is selected, it is evaluated to update the topk list. Then all of its adjacent combinations are tried in a non-redundant way (lines 6-11), and each adjacent combination is pushed into Q. Lines 10-11 ensure that each combination is enumerated only once. If the maximum uscore of the combinations in Q is no larger than the k-th largest score in the topk list, the algorithm can stop and output the topk list as the final result. The comparison between the combinations processed by the Single-Pipelined algorithm and those processed by the Skyline-Sweeping algorithm is shown in Figure 2.15.

When there are multiple CNs, the Skyline-Sweeping algorithm can be adapted using methods similar to those introduced for the Global-Pipelined algorithm, i.e., Q and topk can be made global so that they maintain the combinations of multiple CNs.
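The enumeration order of Skyline-Sweeping can be illustrated with a short Python sketch built on `heapq`. It is a simplified model rather than SPARK's implementation: `uscore` and `evaluate` are hypothetical callbacks supplied by the caller, combinations are 1-based index tuples, and an explicit size check is added so that indices never run past the end of a list.

```python
import heapq

def skyline_sweeping(sizes, uscore, evaluate, k):
    """sizes[i] = number of tuples in M_i.
    uscore(c): monotonic upper bound for combination c (hypothetical callback).
    evaluate(c): scores of the MTJNTs produced by c (hypothetical callback)."""
    s = len(sizes)
    start = (1,) * s
    queue = [(-uscore(start), start)]    # max-heap via negated bounds
    topk = []                            # min-heap of the k best scores
    tested = 0
    while queue:
        kth = topk[0] if len(topk) == k else float("-inf")
        if -queue[0][0] <= kth:          # line 3: no remaining bound can win
            break
        _, c = heapq.heappop(queue)      # line 4
        tested += 1
        for score in evaluate(c):        # line 5
            if len(topk) < k:
                heapq.heappush(topk, score)
            elif score > topk[0]:
                heapq.heapreplace(topk, score)
        for i in range(s):               # lines 6-11
            if c[i] < sizes[i]:
                nxt = c[:i] + (c[i] + 1,) + c[i + 1:]
                heapq.heappush(queue, (-uscore(nxt), nxt))
            if c[i] > 1:                 # keeps each combination unique
                break
    return sorted(topk, reverse=True), tested

# Toy model: the bound is the sum of two per-list scores; only a few
# combinations actually join.
A = [1.0, 0.5, 0.25]
B = [0.75, 0.5, 0.25]
joinable = {(1, 2), (2, 1), (3, 3)}

def bound(c):
    return A[c[0] - 1] + B[c[1] - 1]

def evaluate(c):
    return [bound(c)] if c in joinable else []

top, n_tested = skyline_sweeping((3, 3), bound, evaluate, k=1)
```

In this toy run two of the nine combinations are evaluated. The break at `c[i] > 1` mirrors lines 10-11: every combination is pushed from exactly one predecessor, so no visited set is needed.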
Figure 2.15: Saving computational cost using the Skyline-Sweeping algorithm (region label: Processed Area)
The Block-Pipelined Algorithm: The upper bound score function in Eq. 2.22 plays two roles in the algorithm: (1) its monotonicity ensures that the algorithm can output the correct top-k results when stopping early, and (2) it is an estimate of the real score of the results. The tighter the bound is, the earlier the algorithm stops. The upper bound score function in Eq. 2.22 may sometimes be very loose, which causes many unnecessary combinations to be tested. In order to reduce such unnecessary combinations, a new Block-Pipelined algorithm is proposed in SPARK. A new upper bound score function bscore is introduced, which is tighter than the uscore function in Eq. 2.22 but is not monotonic. The aim of the Block-Pipelined algorithm is to use both the uscore and the bscore functions such that (1) the uscore function makes sure that the top-k results are correctly output, and (2) the bscore function decreases the gap between the estimated value and the real value of results, and thus reduces the computational cost. The bscore is defined as follows:
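\[ \mathit{bscore}(T, Q) = \mathit{bscore}_a(T, Q) \cdot \mathit{score}_b(T, Q) \cdot \mathit{score}_c(T, Q) \qquad (2.23) \]

where

\[ \mathit{bscore}_a(T, Q) = \sum_{w \in T \cap Q} \bigl(1 + \ln(1 + \ln(\mathit{tf}(T, w)))\bigr) \]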
The Block-Pipelined algorithm is shown in Algorithm 10; it is similar to the Skyline-Sweeping algorithm. The difference is that it assigns a status to each enumerated combination c. The first time c is enumerated, the algorithm calculates its uscore, sets its status to USCORE, and inserts it into the queue Q (lines 9-14). When c is later selected with status USCORE, the algorithm calculates its bscore, sets its status to BSCORE, and reinserts it into the queue Q (lines 6-8) before enumerating its neighbors (lines 9-14). If the status of c is already BSCORE when it is selected, the algorithm evaluates it and updates the topk list (line 16). The Block-Pipelined algorithm deals with the case of a single CN. When there are multiple CNs, it can use the same methods as the Skyline-Sweeping algorithm does for handling multiple CNs.
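The two-status scheme can be sketched in Python as shown below. This follows the prose description rather than the listing of Algorithm 10, and it makes the same modeling assumptions as a heap-based Skyline-Sweeping sketch would: `uscore`, `bscore`, and `evaluate` are hypothetical callbacks, combinations are 1-based index tuples, and the stopping test compares the largest key in the queue with the k-th best score.

```python
import heapq

USCORE, BSCORE = 0, 1

def block_pipelined(sizes, uscore, bscore, evaluate, k):
    """uscore(c): monotonic but loose bound; bscore(c): tighter bound;
    evaluate(c): scores of the MTJNTs of c. All three are hypothetical."""
    s = len(sizes)
    start = (1,) * s
    queue = [(-uscore(start), start, USCORE)]   # max-heap via negation
    topk = []                                   # min-heap of best scores
    evaluated = 0
    while queue:
        kth = topk[0] if len(topk) == k else float("-inf")
        if -queue[0][0] <= kth:
            break
        _, c, status = heapq.heappop(queue)
        if status == USCORE:
            # Re-queue c under its tighter bound, then expand neighbors.
            heapq.heappush(queue, (-bscore(c), c, BSCORE))
            for i in range(s):
                if c[i] < sizes[i]:
                    nxt = c[:i] + (c[i] + 1,) + c[i + 1:]
                    heapq.heappush(queue, (-uscore(nxt), nxt, USCORE))
                if c[i] > 1:
                    break
        else:
            # c survived under its bscore: evaluate it now.
            evaluated += 1
            for score in evaluate(c):
                if len(topk) < k:
                    heapq.heappush(topk, score)
                elif score > topk[0]:
                    heapq.heapreplace(topk, score)
    return sorted(topk, reverse=True), evaluated

# Toy model: bscore is the exact sum, uscore is deliberately loose.
A = [1.0, 0.5, 0.25]
B = [0.75, 0.5, 0.25]
joinable = {(1, 2), (2, 1), (3, 3)}

def tight(c):
    return A[c[0] - 1] + B[c[1] - 1]

def loose(c):
    return tight(c) + 1.0

def evaluate(c):
    return [tight(c)] if c in joinable else []

top, n_evaluated = block_pipelined((3, 3), loose, tight, evaluate, k=1)
```

In this toy run only two combinations reach the evaluation step, even though the loose bound forces almost the whole grid to be enumerated. A combination is evaluated only after it has been popped twice, so a loose uscore costs extra queue operations but not extra evaluations.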