Keyword Search in Databases - P6



2 SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES

[Figure 2.11: L-Lattice. (a) Structure: a partial lattice over nodes such as W{}, C{}, A{XML}, A{Michelle}, P{XML}, and P{Michelle}. (b) Storage: the L-Node relation (Vid, Rname, KSet, l) and the L-Edge relation (Fid, Cid, Attr).]

[Figure 2.12: Join vs Semijoin/Join. (a) Semijoin. (b) Join.]

CN $C_i$, we attempt to find the largest subtrees in $\mathcal{L}$ that $C_i$ can share with, using the index, and we link to the roots of such subtrees. Figure 2.11(a) illustrates a partial lattice. The entire lattice, $\mathcal{L}$, is maintained in two relations: the L-Node relation and the L-Edge relation (Figure 2.11(b)). Let a bit-string represent a set of keywords, $\{k_1, k_2, \cdots, k_l\}$. The L-Node relation maintains, for any node in $\mathcal{L}$, a unique Vid in $\mathcal{L}$, the corresponding relation name (Rname) that appears in the given database schema, $G_S$, a bit-string (KSet) that indicates the keywords associated with the node in $\mathcal{L}$, and the size of the bit-string ($l$). The L-Edge relation maintains the parent/child relations among all the nodes in $\mathcal{L}$, with the parent Vid and child Vid (Fid/Cid) plus the join attribute, Attr (either a primary key or a foreign key). The two relations can be maintained in memory or on disk. Several indexes are built on the relations to quickly search for given nodes in $\mathcal{L}$.
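As a minimal sketch of the storage layout described above, the two relations can be modelled as in-memory tables with a few lookup indexes. Only the column names (Vid, Rname, KSet, l, Fid, Cid, Attr) come from the text; the Python names (`LNode`, `LEdge`, `Lattice`, `keyset`), the choice of indexes, and the sample rows are hypothetical and are not taken from KDynamic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LNode:
    vid: int    # unique node id in the lattice
    rname: str  # relation name from the database schema
    kset: int   # bit-string: bit i is set if the i-th query keyword is associated
    l: int      # size of the bit-string (number of query keywords)


@dataclass(frozen=True)
class LEdge:
    fid: int   # parent Vid
    cid: int   # child Vid
    attr: str  # join attribute (a primary key or a foreign key)


def keyset(keywords, query):
    """Encode a subset of the query keywords as a bit-string stored in an int."""
    bits = 0
    for i, k in enumerate(query):
        if k in keywords:
            bits |= 1 << i
    return bits


class Lattice:
    def __init__(self):
        self.nodes = {}                     # Vid -> LNode
        self.children = defaultdict(list)   # index: Fid -> outgoing edges
        self.parents = defaultdict(list)    # index: Cid -> incoming edges
        self.by_label = defaultdict(list)   # index: (Rname, KSet) -> Vids

    def add_node(self, node):
        self.nodes[node.vid] = node
        self.by_label[(node.rname, node.kset)].append(node.vid)

    def add_edge(self, edge):
        self.children[edge.fid].append(edge)
        self.parents[edge.cid].append(edge)


# Hypothetical rows: a W{} node whose child is an A{Michelle} node.
query = ["XML", "Michelle"]
lat = Lattice()
lat.add_node(LNode(1, "W", keyset(set(), query), len(query)))
lat.add_node(LNode(2, "A", keyset({"Michelle"}, query), len(query)))
lat.add_edge(LEdge(fid=1, cid=2, attr="AID"))
```

The `by_label` index reflects the observation that several lattice nodes may carry the same label, so a label maps to a list of Vids rather than to a single one.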

There are three main differences between the two execution graphs, the Mesh and the L-Lattice. (1) The maximum depth of a Mesh is $T_{max} - 1$, and the maximum depth of an L-Lattice is $T_{max}/2 + 1$. (2) In a Mesh, only the left part of two CNs can be shared (except for the leaf nodes), while in an L-Lattice multiple parts of two CNs can be shared. (3) The number of leaf nodes in a Mesh is $O((|V(G_S)| \cdot 2^l)^2)$, because there are $O(|V(G_S)| \cdot 2^l)$ clusters in a Mesh and each cluster may contain $O(|V(G_S)| \cdot 2^l)$ leaf nodes. The number of leaf nodes in an L-Lattice is $O(2^l)$.

After sharing computational cost using either the Mesh or the L-Lattice, all CNs are evaluated using joins in DISCOVER or S-KWS. A join plan is shown in Figure 2.9(b) to process the CN in Figure 2.9(a) using 5 projections and 4 joins. The resulting relation, the output of the join (j4), is a temporal relation with 5 TIDs from the 5 projected relations, where a resulting tuple represents an MTJNT. The rightmost two connected trees in Figure 2.3 are the two results of the operator tree in Figure 2.9(b): (p2, c5, p4, w5, a3) and (p3, c4, p4, w5, a3).

In KRDBMS [Qin et al., 2009a], the authors observe that evaluating all CNs using only joins can generate a large number of temporal tuples. They propose to use semijoin/join sequences to compute a CN. A semijoin between $R$ and $S$ is defined in Eq. 2.18, which is to project ($\Pi$) the tuples from $R$ that can possibly join at least one tuple in $S$.

$$R \ltimes S = \Pi_R(R \bowtie S) \tag{2.18}$$

Based on semijoin, a join $R \bowtie S$ can be supported by a semijoin and a join, as given in Eq. 2.19.

$$R \bowtie S = (R \ltimes S) \bowtie S \tag{2.19}$$

Recall that semijoin/joins were proposed to join relations in a distributed RDBMS, in order to reduce high communication cost at the expense of I/O cost and CPU cost. But there is no communication in a centralized RDBMS. In other words, there is no obvious reason to use $(R \ltimes S) \bowtie S$ to process a single join $R \bowtie S$, since the former needs to access the same relation $S$ twice. Below, we address the significant cost saving of semijoin/joins over joins when the number of joins is large, in a centralized RDBMS.
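The two identities can be checked on a small example. The sketch below is illustrative only: relations are modelled as lists of Python dicts, the helper names `join` and `semijoin` and the sample data are hypothetical, and the join is a plain nested loop rather than anything a real RDBMS would run.

```python
def join(r, s, attr):
    """Nested-loop equijoin of two lists of dicts on one shared attribute."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


def semijoin(r, s, attr):
    """R semijoin S: the tuples of R that join at least one tuple of S (Eq. 2.18)."""
    keys = {y[attr] for y in s}
    return [x for x in r if x[attr] in keys]


R = [{"pid": 1, "title": "a"}, {"pid": 2, "title": "b"}, {"pid": 3, "title": "c"}]
S = [{"pid": 1, "cid": 10}, {"pid": 1, "cid": 11}, {"pid": 3, "cid": 12}]

reduced = semijoin(R, S, "pid")

# Eq. 2.19: joining the reduced R with S gives the same result as R join S.
same = join(reduced, S, "pid") == join(R, S, "pid")
```

Here the tuple with `pid` 2 is dropped by the semijoin because it has no partner in `S`, and the final join result is unchanged. Note that `S` is read twice, once by the semijoin and once by the join.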

When evaluating all CNs, the number of temporal tuples generated can be very large, and the majority of the generated temporal tuples do not appear in any MTJNT. When evaluating all CNs using the semijoin/join based strategy, computing $R \bowtie (S \bowtie T)$ is done as $S \leftarrow S \ltimes T$, $R \leftarrow R \ltimes S$, with semijoins, in the reduction phase, followed by $T \bowtie (S \bowtie R)$ in the join phase. For the CN given in Figure 2.9(a), in the reduction phase (Figure 2.12(a)), $C \leftarrow C\{\} \ltimes P\{XML\}$, $W \leftarrow W\{\} \ltimes A\{Michelle\}$, $P \leftarrow P\{\} \ltimes C$, and $P \leftarrow P \ltimes W$; in the join phase (Figure 2.12(b)), $P \bowtie C$ is joined first because $P$ is fully reduced, such that every tuple in $P$ must appear in an MTJNT. The join order is shown in Figure 2.12(b).
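The reduction phase and join phase described above can be sketched on a three-relation chain. This is a toy illustration under stated assumptions: the relations `R`, `S`, `T`, their attributes, and the helper names are hypothetical, and the join-only plan is deliberately one that joins `R` and `S` first, to show how intermediate results can grow.

```python
def join(r, s, attr):
    """Nested-loop equijoin of two lists of dicts on one shared attribute."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


def semijoin(r, s, attr):
    """Tuples of r that join at least one tuple of s."""
    keys = {y[attr] for y in s}
    return [x for x in r if x[attr] in keys]


def canon(rows):
    """Order-independent form of a result, used to compare two plans."""
    return sorted(tuple(sorted(d.items())) for d in rows)


# Hypothetical chain R -(a)- S -(b)- T, where T is very selective.
R = [{"r": i, "a": i % 3} for i in range(6)]
S = [{"a": a, "b": b} for a in range(3) for b in range(3)]
T = [{"b": 0, "t": "x"}]

# Join-only plan that joins R and S first.
rs = join(R, S, "a")             # intermediate (temporal) tuples
plain = join(rs, T, "b")         # final results

# Semijoin/join plan.
S_red = semijoin(S, T, "b")      # reduction phase: S <- S semijoin T
R_red = semijoin(R, S_red, "a")  # reduction phase: R <- R semijoin S
sr = join(S_red, R_red, "a")     # join phase: S join R
reduced = join(T, sr, "b")       # join phase: T join (S join R)

print(len(rs), len(sr), canon(plain) == canon(reduced))
```

In this toy instance the join-only plan materializes 18 intermediate tuples while the semijoin/join plan materializes 6, and both plans return the same 6 results.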

Figure 2.13 shows the number of temporal tuples generated using a real database, DBLP, on IBM DB2. Five 3-keyword queries with different keyword selectivity (the probability that a tuple contains a keyword in DBLP) were randomly selected with $T_{max} = 5$. The number of generated temporal tuples is shown in Figure 2.13(a): the semijoin/join approach generates significantly fewer tuples than the join approach. Similarly, for a 3-keyword query, the semijoin/join approach generates significantly fewer temporal tuples than the join approach as $T_{max}$ increases (Figure 2.13(b)).

When processing a large number of joins for keyword search on RDBMSs, it is best practice to process a large number of small joins, in order to avoid intermediate join results becoming very large and dominant, if it is difficult to find an optimal query processing plan or the cost of finding an optimal query processing plan is high.

[Figure 2.13: # of Temporal Tuples (Default Tmax = 5, l = 3). (a) Vary Keyword Selectivity. (b) Vary l. Both panels compare Join and SemiJoin-Join, with tuple-count ticks from 10K to 10M; panel (a) covers keyword selectivities from 4E-4 to 2E-3.]

Besides evaluating all CNs in a static environment, S-KWS and KDynamic focus on monitoring all MTJNTs in a relational data stream, where tuples can be inserted and deleted frequently. In this situation, it is necessary to find new MTJNTs and expire old MTJNTs in order to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given $l$-keyword query. More precisely, the system reports new MTJNTs when new tuples are inserted and, in addition, reports the MTJNTs that become invalid when tuples are deleted. A sliding window (time interval), $W$, is specified. A tuple, $t$, has a lifespan from its insertion into the window at time $t.start$ to $W + t.start - 1$, if $t$ is not deleted before then. Two tuples can be joined if their lifespans overlap.
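The lifespan rule can be written down directly. The following is a minimal sketch, assuming integer timestamps and closed intervals; the class `StreamTuple`, the field `deleted_at`, and the convention that a tuple is no longer alive at its deletion time are assumptions made for the example and are not specified in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTuple:
    start: int                        # time the tuple enters the window (t.start)
    deleted_at: Optional[int] = None  # explicit deletion time, if any (assumed field)


def lifespan(t, W):
    """Closed interval [start, end] during which t is alive for window size W."""
    end = t.start + W - 1
    if t.deleted_at is not None:
        # Assumption: the tuple is already gone at time deleted_at.
        end = min(end, t.deleted_at - 1)
    return t.start, end


def can_join(t1, t2, W):
    """Two tuples can be joined only if their lifespans overlap."""
    s1, e1 = lifespan(t1, W)
    s2, e2 = lifespan(t2, W)
    return max(s1, s2) <= min(e1, e2)
```

For example, with `W = 5` a tuple inserted at time 0 is alive during [0, 4], so it can join a tuple inserted at time 4 but not one inserted at time 5.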

S-KWS processes a keyword query in a relational data stream using the Mesh introduced above. The authors observe that, in a data stream environment, a join needs to be processed when new tuples arrive on its inputs, but not all joins need to be processed all the time; therefore, they propose a demand-driven operator execution. A join operator has two inputs and is associated with an output buffer. The output buffer of a join operator becomes an input to the many other join operators that share the join operator (as indicated in the Mesh). A tuple that is newly output by a join operator into its output buffer will be a newly arrived input to the joins that share the join operator. A join operator will be in a running state if it has newly arrived tuples from both inputs. A join operator will be in a sleeping state if either it has no newly arrived tuples from its inputs or all the join operators that share it are currently sleeping. The demand-driven operator execution noticeably reduces the query processing cost.

KDynamic processes a keyword query in a relational data stream using the L-Lattice. Although S-KWS can significantly reduce the computational cost, scalability remains a problem, especially when $T_{max}$, $|G_S|$, $l$, $W$, or the stream speed is high. This is because a large number of intermediate tuples, computed by many join operators in the Mesh at a high processing cost, will eventually not be output. S-KWS cannot avoid computing such a large number of unnecessary intermediate tuples because it is not known beforehand whether an intermediate tuple will appear in an MTJNT. The probability of generating a large number of unnecessary intermediate results increases when either the size of the sliding window, $W$, is large, or new data arrive at high speed. It is challenging to reduce the processing cost by reducing the number of intermediate results.

In KDynamic, an algorithm, CNEvalDynamic, is proposed, which works as follows. We can maintain $|V(G_S)|$ relations in total to process an $l$-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, due to the lattice structure that is used. A node, $v$, in lattice $\mathcal{L}$ is uniquely identified with a node id. The node $v$ represents a sub-relation $R_i\{K\}$. By utilizing the unique node id, it is easy to maintain all the $2^l$ sub-relations for a relation $R_i$ together. Let us denote such a relation as $\overline{R}_i$. The schema of $\overline{R}_i$ is the same as that of $R_i$ plus an additional attribute (Vid) to keep the node id in $\mathcal{L}$. When we need to obtain a sub-relation $R_i\{K\}$ for $K \subseteq Q$ associated with a node, $v$, in the lattice, we use the node id to select and project $R_i\{K\}$ from $\overline{R}_i$. Therefore, a relation $R_i\{K\}$ can be maintained virtually. Below, we use $\overline{R}_i\{K\}$ to denote such a sub-relation. It is fast to obtain $\overline{R}_i\{K\}$ if an index is built on the additional attribute Vid on relation $\overline{R}_i$.
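The idea of keeping all sub-relations of one relation together, tagged by node id, can be sketched as follows. This is only an illustration: the class `TaggedRelation`, its methods, and the sample Vid values and rows are hypothetical, and a dictionary stands in for the index on Vid.

```python
from collections import defaultdict


class TaggedRelation:
    """One stored relation holding every sub-relation of R_i, tagged by lattice node id."""

    def __init__(self, name):
        self.name = name
        self.rows = []                   # (vid, row) pairs: the schema of R_i plus Vid
        self.by_vid = defaultdict(list)  # stand-in for an index on the Vid attribute

    def insert(self, vid, row):
        self.by_vid[vid].append(len(self.rows))
        self.rows.append((vid, row))

    def sub_relation(self, vid):
        """Select on Vid, then project the Vid column away."""
        return [self.rows[i][1] for i in self.by_vid.get(vid, [])]


# Hypothetical node ids: 7 stands for a P{XML} node and 8 for a P{} node.
P = TaggedRelation("P")
P.insert(7, {"pid": "p2", "title": "XML search"})
P.insert(8, {"pid": "p4", "title": "joins"})
P.insert(7, {"pid": "p3", "title": "XML streams"})
```

A call such as `P.sub_relation(7)` then returns only the rows tagged with node 7, without scanning rows that belong to other lattice nodes. Deletion is omitted to keep the sketch short.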

CNEvalDynamic is outlined in Algorithm 6. When a new update operation, $op(t, R_i)$, arrives, it is processed in lines 3-9 if the operation is an insertion, or in lines 11-14 if it is a deletion. The procedure EvalPath joins all the needed tuples in a top-down fashion. EvalPath is implemented similarly to the semijoin/join based static evaluation discussed above, using an additional path, which records where the join sequence comes from, to reduce join cost. The two procedures, insert and delete, maintain a list of tuples for each node in the lattice using only selections (lines 17-18, lines 26-27, and lines 34-35). A selected tuple can join at least one tuple from the list of each child node in the lattice. If the list of one node in the lattice is changed, it will trigger the parent nodes to change their lists accordingly (lines 24-27 and lines 32-35). If the root node is changed, the results should be updated. At this time, we use joins to report the updated MTJNTs. When we join, all the tuples that participate in joins will contribute to the results. In this way, we can achieve full reduction when joining.

As the number of results itself can be exponentially large, we analyze the extra cost for the algorithms to evaluate all CNs. The extra cost is defined to be the number of tuples generated by the algorithm minus the number of tuples in the result. Suppose the number of tuples in every relation is $n$. Given a CN with size $t$, the extra cost for the algorithm using the left-deep tree proposed in S-KWS to evaluate the CN is $O(n^{t-1})$, and the extra cost for the CNEvalDynamic algorithm to evaluate the CN is $O(n \cdot t)$.
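To give a sense of the gap between the two bounds, take the illustrative values $n = 10^3$ and $t = 5$. These values are chosen here only for the arithmetic, and the constant factors hidden by the $O$-notation are ignored:

```latex
\begin{align*}
n^{t-1}   &= (10^{3})^{4} = 10^{12}, \\
n \cdot t &= 10^{3} \cdot 5 = 5 \times 10^{3}.
\end{align*}
```

Under these assumed values, the bound for the left-deep join plan is larger than the bound for CNEvalDynamic by a factor of $2 \times 10^{8}$.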

Algorithm 6 CNEvalDynamic($\mathcal{L}$, $Q$, )

2: let $K$ be the set of all keywords appearing in tuple $t$
16: let the label of node $v$ be $R_i\{K\}$
17: if $t \in \overline{R}_i\{K\}$ and $t$ can join at least one tuple in every relation represented by all $v$'s children in $\mathcal{L}$ then
18: insert tuple $t$ into the sub-relation $R_i\{K\}$
19: if $t \in \overline{R}_i\{K\}$ then
20: push $(v, t)$ to path
23: else
25: let the label of node $u$ be $R_j\{K''\}$
$K''(r(R_j))$ that can join $t$ do
30: let the label of node $v$ be $R_i\{K\}$
31: delete tuple $t$ from the sub-relation $R_i\{K\}$
33: let the label of node $u$ be $R_j\{K''\}$
$_j\{K''\}$ that can join $t$ only do
35: delete($u$, $t$)

Finally, we discuss how to implement the event-driven evaluation. As shown in Figure 2.14, there are multiple nodes labeled with identical $R_i\{K\}$. For example, $W\{\}$ appears in two different nodes in the lattice. For each $R_i\{K\}$, we maintain 3 lists named Rlist (Ready), Wlist (Wait), and Slist (Suspend). The three lists together contain all the node ids in the lattice. A node in the lattice $\mathcal{L}$ labeled $R_i\{K\}$ can only appear in one of the three lists for $R_i\{K\}$. A node $v$ in $\mathcal{L}$ appears in Wlist if the sub-relations represented by all child nodes of $v$ in $\mathcal{L}$ are non-empty, but the sub-relation represented by $v$ is empty. A node $v$ in $\mathcal{L}$ appears in Rlist if the sub-relations represented by all child nodes of $v$ in $\mathcal{L}$ are non-empty, and the sub-relation represented by $v$ itself is non-empty too. Otherwise, $v$ appears in Slist. When a new tuple $t$ of relation $R_i$ with keyword set $K$ is inserted, we only insert it into the relations of the nodes $v$ in $\mathcal{L}$ that are on the Rlist and Wlist specified for $R_i\{K\}$. Each insertion may notify some parent nodes of $v$ to move from Wlist or Slist to Rlist. Node $v$ may also be moved from Wlist to Rlist. When a tuple $t$ of relation $R_i$ with keyword set $K$ is about to be deleted, we only remove it from all relations associated with node
