Trang 124 2 SCHEMA-BASED KEYWORD SEARCH ON RELATIONAL DATABASES
Figure 2.11: L-Lattice. (a) Structure; (b) Storage: the L-Node relation (Vid, Rname, KSet, l) and the L-Edge relation (Fid, Cid, Attr).
Figure 2.12: Join vs Semijoin/Join. (a) Semijoin; (b) Join.
CN $C_i$, we attempt to find the largest subtrees in $\mathcal{L}$ that $C_i$ can share, using the index, and we link $C_i$ to the roots of such subtrees. Figure 2.11(a) illustrates a partial lattice. The entire lattice, $\mathcal{L}$, is maintained in two relations: the L-Node relation and the L-Edge relation (Figure 2.11(b)). Let a bit-string represent a set of keywords, $\{k_1, k_2, \cdots, k_l\}$. The L-Node relation maintains, for any node in $\mathcal{L}$, a unique Vid in $\mathcal{L}$, the corresponding relation name (Rname) that appears in the given database schema, $G_S$, a bit-string (KSet) that indicates the keywords associated with the node in $\mathcal{L}$, and the size of the bit-string ($l$). The L-Edge relation maintains the parent/child relationships among all the nodes in $\mathcal{L}$, with the parent Vid and child Vid (Fid/Cid) plus the join attribute, Attr (either a primary key or a foreign key). The two relations can be maintained in memory or on disk. Several indexes are built on the relations to quickly search for given nodes in $\mathcal{L}$.
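As a rough illustration of the storage scheme described above, the following Python sketch models the L-Node and L-Edge relations as in-memory records, with KSet encoded as an integer bit-string. The field names follow Figure 2.11(b); the helper names (`keyset_to_bits`, `children_of`), the two-keyword query, the join attribute name `PID`, and the sample rows are assumptions made only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class LNode(NamedTuple):
    vid: int    # unique node id in the lattice
    rname: str  # relation name from the schema graph
    kset: int   # bit-string: bit i is set if the i-th query keyword is associated with the node
    l: int      # size of the bit-string (number of query keywords)


class LEdge(NamedTuple):
    fid: int    # parent (father) Vid
    cid: int    # child Vid
    attr: str   # join attribute (primary key or foreign key)


def keyset_to_bits(keywords, query):
    """Encode a subset of the query keywords as an integer bit-string."""
    bits = 0
    for i, k in enumerate(query):
        if k in keywords:
            bits |= 1 << i
    return bits


query = ["XML", "Michelle"]  # hypothetical 2-keyword query

# Made-up sample rows: node 2 (C{}) is the parent of node 1 (P{XML}).
nodes = [
    LNode(1, "P", keyset_to_bits({"XML"}, query), len(query)),
    LNode(2, "C", keyset_to_bits(set(), query), len(query)),
]
edges = [LEdge(2, 1, "PID")]

# Index on (Rname, KSet) to find nodes quickly, and on Fid to list children.
node_index = defaultdict(list)
for n in nodes:
    node_index[(n.rname, n.kset)].append(n.vid)

child_index = defaultdict(list)
for e in edges:
    child_index[e.fid].append(e.cid)


def children_of(vid):
    """Return the Vids of the children of the given node."""
    return child_index.get(vid, [])
```

With the bit-string encoding, testing whether a node's keyword set covers a given keyword is a single bitwise AND, which is one reason a compact KSet column is convenient to index.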
There are three main differences between the two execution graphs, the Mesh and the L-Lattice. (1) The maximum depth of a Mesh is $T_{max} - 1$ and the maximum depth of an L-Lattice is $T_{max}/2 + 1$. (2) In a Mesh, only the left part of two CNs can be shared (except for the leaf nodes), while in an L-Lattice multiple parts of two CNs can be shared. (3) The number of leaf nodes in a Mesh is $O((|V(G_S)| \cdot 2^l)^2)$ because there are $O(|V(G_S)| \cdot 2^l)$ clusters in a Mesh and each cluster may contain $O(|V(G_S)| \cdot 2^l)$ leaf nodes. The number of leaf nodes in an L-Lattice is $O(2^l)$.
After sharing computational cost using either the Mesh or the L-Lattice, all CNs are evaluated using joins in DISCOVER or S-KWS. A join plan is shown in Figure 2.9(b) to process the CN in Figure 2.9(a) using 5 projections and 4 joins. The resulting relation, the output of the join ($j_4$), is a temporal relation with 5 TIDs from the 5 projected relations, where a resulting tuple represents an MTJNT. The rightmost two connected trees in Figure 2.3 are the two results of the operator tree in Figure 2.9(b), $(p_2, c_5, p_4, w_5, a_3)$ and $(p_3, c_4, p_4, w_5, a_3)$.
In KRDBMS [Qin et al., 2009a], the authors observe that evaluating all CNs using only joins can generate a large number of temporal (intermediate) tuples. They propose to use semijoin/join sequences to compute a CN. A semijoin between $R$ and $S$ is defined in Eq. 2.18: it projects ($\Pi$) the tuples from $R$ that can join at least one tuple in $S$.

$R \ltimes S = \Pi_R(R \bowtie S)$ (2.18)

Based on semijoin, a join $R \bowtie S$ can be supported by a semijoin and a join, as given in Eq. 2.19.

$R \bowtie S = (R \ltimes S) \bowtie S$ (2.19)

Recall that semijoin/joins were proposed to join relations in a distributed RDBMS, in order to reduce high communication cost at the expense of I/O cost and CPU cost. But there is no communication in a centralized RDBMS. In other words, there is no obvious reason to use $(R \ltimes S) \bowtie S$ to process a single join $R \bowtie S$, since the former needs to access the same relation $S$ twice. Below, we address the significant cost saving of semijoin/joins over joins when the number of joins is large, in a centralized RDBMS.
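The identity in Eq. 2.19 can be checked on a small example. The sketch below is only an illustration: it represents relations as Python lists of dictionaries and joins on a single shared attribute, and the function names `join` and `semijoin` as well as the sample data are assumptions, not part of KRDBMS.

```python
def join(r, s, attr):
    """Join two lists of dict-tuples on one shared attribute."""
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]


def semijoin(r, s, attr):
    """R semijoin S: the tuples of R that join at least one tuple of S."""
    keys = {u[attr] for u in s}
    return [t for t in r if t[attr] in keys]


# Made-up sample relations R(a, b) and S(b, c).
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
S = [{"b": 10, "c": "x"}, {"b": 10, "c": "y"}, {"b": 40, "c": "z"}]

reduced = semijoin(R, S, "b")         # only the tuple with b = 10 survives
direct = join(R, S, "b")              # R join S
via_semijoin = join(reduced, S, "b")  # (R semijoin S) join S
```

Both `direct` and `via_semijoin` contain the same two tuples, but the second plan reads `S` twice, which is the overhead noted above for a single join.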
When evaluating all CNs, the number of temporal tuples generated can be very large, and the majority of the generated temporal tuples do not appear in any MTJNTs. When evaluating all CNs using the semijoin/join based strategy, computing $R \bowtie (S \bowtie T)$ is done as $S \leftarrow S \ltimes T$, $R \leftarrow R \ltimes S$, with semijoins, in the reduction phase, followed by $T \bowtie (S \bowtie R)$ in the join phase. For the CN given in Figure 2.9(a), in the reduction phase (Figure 2.12(a)), $C \leftarrow C\{\} \ltimes P\{XML\}$, $W \leftarrow W\{\} \ltimes A\{Michelle\}$, $P \leftarrow P\{\} \ltimes C$, and $P \leftarrow P \ltimes W$, and in the join phase (Figure 2.12(b)), $P \bowtie C$ is joined first because $P$ is fully reduced, such that every tuple in $P$ must appear in an MTJNT. The join order is shown in Figure 2.12(b).
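The two-phase strategy for a chain $R \bowtie (S \bowtie T)$ can be sketched as follows. This is a minimal illustration under assumed data, not the KRDBMS implementation: relations are lists of dictionaries, the helper names `join` and `semijoin` are hypothetical, and the attribute names `x` and `y` are made up. It compares the size of the intermediate result of a join-only plan with that of the semijoin/join plan.

```python
def join(r, s, attr):
    """Join two lists of dict-tuples on one shared attribute."""
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]


def semijoin(r, s, attr):
    """Tuples of r that join at least one tuple of s."""
    keys = {u[attr] for u in s}
    return [t for t in r if t[attr] in keys]


# Made-up chain R(a, x) - S(x, y) - T(y, z); only y = 0 survives in T.
R = [{"a": i, "x": i % 3} for i in range(6)]
S = [{"x": 0, "y": 0}, {"x": 1, "y": 1}, {"x": 2, "y": 2}]
T = [{"y": 0, "z": "t0"}]

# Join-only plan: every R tuple joins some S tuple, so the
# intermediate result has 6 tuples, most of which are discarded later.
rs = join(R, S, "x")
plain_result = join(rs, T, "y")

# Reduction phase: S <- S semijoin T, then R <- R semijoin S.
S_red = semijoin(S, T, "y")
R_red = semijoin(R, S_red, "x")

# Join phase: start from the fully reduced R.
rs_red = join(R_red, S_red, "x")
reduced_result = join(rs_red, T, "y")
```

In this example both plans return the same two tuples, while the intermediate result shrinks from 6 tuples (`rs`) to 2 (`rs_red`), because every tuple that enters the join phase contributes to a final result.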
Figure 2.13 shows the number of temporal tuples generated using a real database, DBLP, on IBM DB2. Five 3-keyword queries with different keyword selectivity (the probability that a tuple contains a keyword in DBLP) were randomly selected with $T_{max} = 5$. The numbers of generated temporal tuples are shown in Figure 2.13(a). The number of tuples generated by the semijoin/join approach is significantly less than that generated by the join approach. Similarly, the number of temporal tuples generated by the semijoin/join approach is significantly less than that generated by the join approach when $T_{max}$ increases (Figure 2.13(b)) for a 3-keyword query.
When processing a large number of joins for keyword search on RDBMSs, it is best practice to process a large number of small joins in order to avoid intermediate join results becoming very large and dominant, if it is difficult to find an optimal query processing plan or the cost of finding an optimal query processing plan is high.

Figure 2.13: # of Temporal Tuples (Default $T_{max} = 5$, $l = 3$). (a) Vary Keyword Selectivity; (b) Vary $l$. Both panels compare Join with SemiJoin-Join.
Besides evaluating all CNs in a static environment, S-KWS and KDynamic focus on monitoring all MTJNTs in a relational data stream where tuples can be inserted/deleted frequently. In this situation, it is necessary to find new MTJNTs or expire MTJNTs in order to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given $l$-keyword query. More precisely, the system reports new MTJNTs when new tuples are inserted and, in addition, reports the MTJNTs that become invalid when tuples are deleted. A sliding window (time interval), $W$, is specified. A tuple, $t$, has a lifespan from its insertion into the window at time $t.start$ to $W + t.start - 1$, if $t$ is not deleted before then. Two tuples can be joined if their lifespans overlap.
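The lifespan rule can be expressed as a small interval-overlap test. The sketch below is an assumption-laden illustration: the class name `StreamTuple`, the field `deleted_at`, and the convention that a deleted tuple is no longer alive from its deletion time onward are not taken from the source. It checks only whether lifespans overlap, not whether the tuples satisfy a join condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTuple:
    tid: str
    start: int                        # insertion time, t.start
    deleted_at: Optional[int] = None  # explicit deletion time, if any


def lifespan(t, window):
    """Closed interval [t.start, end] during which t is alive."""
    end = t.start + window - 1
    if t.deleted_at is not None:
        # Assumption: the tuple is gone from its deletion time onward.
        end = min(end, t.deleted_at - 1)
    return t.start, end


def lifespans_overlap(t1, t2, window):
    """True if the two tuples are alive at some common time point."""
    s1, e1 = lifespan(t1, window)
    s2, e2 = lifespan(t2, window)
    return max(s1, s2) <= min(e1, e2)


W = 5
a = StreamTuple("a", start=1)                # alive during [1, 5]
b = StreamTuple("b", start=5)                # alive during [5, 9]
c = StreamTuple("c", start=6)                # alive during [6, 10]
d = StreamTuple("d", start=1, deleted_at=3)  # alive during [1, 2]
```

Here `a` and `b` share time point 5 and can be joined, while `a` and `c` cannot, and the early deletion of `d` prevents it from joining `b`.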
S-KWS processes a keyword query in a relational data stream using the mesh introduced above. The authors observe that, in a data stream environment, a join needs to be processed only when new tuples arrive from its inputs, and not all joins need to be processed all the time; therefore, they propose a demand-driven operator execution. A join operator has two inputs and is associated with an output buffer. The output buffer of a join operator becomes input to the other join operators that share it (as indicated in the mesh). A tuple that is newly output by a join operator into its output buffer becomes a newly arrived input to the joins that share the join operator. A join operator is in a running state if it has newly arrived tuples from both inputs. A join operator is in a sleeping state if either it has no newly arriving tuples from its inputs or all the join operators that share it are currently sleeping. The demand-driven operator execution noticeably reduces the query processing cost.
KDynamic processes a keyword query in a relational data stream using the L-Lattice. Although S-KWS can significantly reduce the computational cost, scalability is still a problem, especially when $T_{max}$, $|G_S|$, $l$, $W$, or the stream speed is high. This is because a large number of intermediate tuples that are computed by many join operators in the mesh at high processing cost will eventually not be output. S-KWS cannot avoid computing such a large number of unnecessary intermediate tuples because it is unknown beforehand whether an intermediate tuple will appear in an MTJNT. The probability of generating a large number of unnecessary intermediate results increases when either the size of the sliding window, $W$, is large, or new data arrive at high speed. It is challenging to reduce the processing cost by reducing the number of intermediate results.
In KDynamic, an algorithm CNEvalDynamic is proposed, which works as follows. We can maintain $|V(G_S)|$ relations in total to process an $l$-keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, due to the lattice structure that is used. A node, $v$, in lattice $\mathcal{L}$ is uniquely identified with a node id. The node $v$ represents a sub-relation $R_i\{K\}$. By utilizing the unique node id, it is easy to maintain all the $2^l$ sub-relations for a relation $R_i$ together. Let us denote such a relation as $\overline{R}_i$. The schema of $\overline{R}_i$ is the same as $R_i$ plus an additional attribute (Vid) to keep the node id in $\mathcal{L}$. When we need to obtain a sub-relation $R_i\{K\}$ for $K \subseteq Q$ associated with a node, $v$, in the lattice, we use the node id to select and project $R_i\{K\}$ from $\overline{R}_i$. Therefore, a relation $R_i\{K\}$ can be virtually maintained. Below, we use $\overline{R}_i\{K\}$ to denote such a sub-relation. It is fast to obtain $\overline{R}_i\{K\}$ if an index is built on the additional attribute Vid on relation $\overline{R}_i$.
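One way to picture a relation that carries the extra Vid attribute is a single container indexed by node id. The following Python sketch is an assumption for illustration only (the class name `TaggedRelation`, its methods, and the sample tuples are hypothetical); the dictionary keyed by Vid plays the role of the index on the Vid attribute.

```python
from collections import defaultdict


class TaggedRelation:
    """One physical relation holding all sub-relations, tagged by lattice node id (Vid)."""

    def __init__(self, name):
        self.name = name
        # Index on Vid: vid -> {tuple id: attribute values}
        self._by_vid = defaultdict(dict)

    def insert(self, vid, tid, values):
        self._by_vid[vid][tid] = values

    def delete(self, vid, tid):
        self._by_vid[vid].pop(tid, None)

    def sub_relation(self, vid):
        """Select on Vid and project away the Vid attribute."""
        return list(self._by_vid.get(vid, {}).values())


# Made-up example: the same tuple p2 is stored under two lattice nodes.
P = TaggedRelation("P")
P.insert(7, "p2", {"title": "XML basics"})
P.insert(8, "p2", {"title": "XML basics"})
P.insert(7, "p3", {"title": "XML streams"})
```

Calling `P.sub_relation(7)` returns the two tuples stored for node 7 without scanning the tuples of other nodes, which mirrors the benefit of the Vid index mentioned above.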
CNEvalDynamic is outlined in Algorithm 6. When a new update operation, $op(t, R_i)$, arrives, the algorithm processes it in lines 3-9 if the operation is an insertion or in lines 11-14 if it is a deletion. The procedure EvalPath joins all the needed tuples in a top-down fashion. EvalPath is implemented similarly to the semijoin/join based static evaluation discussed above, using an additional path that records where the join sequence comes from, to reduce join cost. The two procedures, insert and delete, maintain a list of tuples for each node in the lattice using only selections (lines 17-18, lines 26-27, and lines 34-35). The selected tuples can join at least one tuple from each list of its child nodes in the lattice. If the list of one node in the lattice is changed, it will trigger the father nodes to change their lists accordingly (lines 24-27 and lines 32-35). If the root node is changed, the results should be updated. At this time, we use joins to report the updated MTJNTs. When we join, all the tuples that participate in joins will contribute to the results. In this way, we can achieve full reduction when joining.
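The list maintenance described above can be approximated by the following simplified Python sketch. It is not Algorithm 6: it omits the path and EvalPath, and every name in it (`Node`, `link`, `joinable`, `supported`, `activate`, `retract`) is hypothetical. It assumes tuples are dictionaries that join on all attribute names they share, and it keeps, for each node, only the tuples that join at least one kept tuple in every child.

```python
class Node:
    """A lattice node; all names in this sketch are illustrative."""

    def __init__(self, label):
        self.label = label
        self.children = []
        self.parents = []
        self.candidates = []  # tuples of the node's sub-relation in the window
        self.alive = []       # candidates joining >= 1 alive tuple in every child


def link(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def joinable(t, u):
    """Assumption: tuples join on every attribute name they share."""
    shared = t.keys() & u.keys()
    return bool(shared) and all(t[a] == u[a] for a in shared)


def supported(node, t):
    """True if t joins at least one alive tuple in each child of node."""
    return all(any(joinable(t, u) for u in c.alive) for c in node.children)


def activate(node, t):
    if t in node.alive or not supported(node, t):
        return
    node.alive.append(t)
    # A changed list may make waiting tuples in the father nodes valid.
    for p in node.parents:
        for cand in p.candidates:
            if joinable(cand, t):
                activate(p, cand)


def retract(node, t):
    if t not in node.alive:
        return
    node.alive.remove(t)
    # Father tuples that could join only t lose their support.
    for p in node.parents:
        for cand in list(p.alive):
            if not supported(p, cand):
                retract(p, cand)


def insert(node, t):
    node.candidates.append(t)
    activate(node, t)


def delete(node, t):
    node.candidates.remove(t)
    retract(node, t)


# Made-up two-node lattice: W{} is the father of A{Michelle}.
a_node, w_node = Node("A{Michelle}"), Node("W{}")
link(w_node, a_node)
w1 = {"wid": 1, "aid": 3}
a3 = {"aid": 3, "name": "Michelle"}
```

Inserting `w1` alone leaves `w_node.alive` empty; inserting `a3` afterwards activates `w1` in the father node, and deleting `a3` retracts it again. Only selections over the lists are used; the joins that produce results would be run once a change reaches the root.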
As the number of results itself can be exponentially large, we analyze the extra cost for the algorithms to evaluate all CNs. The extra cost is defined to be the number of tuples generated by the algorithm minus the number of tuples in the result. Suppose the number of tuples in every relation is $n$. Given a CN with size $t$, the extra cost for the algorithm using the left-deep tree proposed in S-KWS to evaluate the CN is $O(n^{t-1})$, and the extra cost for the CNEvalDynamic algorithm to evaluate the CN is $O(n \cdot t)$.
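To make the gap between the two bounds concrete, consider a worked instance with assumed values $n = 10^3$ tuples per relation and a CN of size $t = 5$. Both numbers are chosen only for illustration, and the constant factors hidden by the $O(\cdot)$ notation are ignored:

$$n^{t-1} = (10^3)^{4} = 10^{12}, \qquad n \cdot t = 10^3 \cdot 5 = 5 \times 10^3.$$

Under these assumptions, the bound for the left-deep tree is about eight orders of magnitude larger than the bound for CNEvalDynamic.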
Algorithm 6 CNEvalDynamic($\mathcal{L}$, $Q$, …)
...
2: let $K$ be the set of all keywords appearing in tuple $t$
...
16: let the label of node $v$ be $R_i\{K\}$
17: if $t \in \overline{R}_i\{K\}$ and $t$ can join at least one tuple in every relation represented by all $v$'s children in $\mathcal{L}$ then
18: insert tuple $t$ into the sub-relation $R_i\{K\}$
19: if $t \in \overline{R}_i\{K\}$ then
20: push $(v, t)$ to path
...
23: else
...
25: let the label of node $u$ be $R_j\{K''\}$
... $K''(r(R_j))$ that can join $t$ do
...
30: let the label of node $v$ be $R_i\{K\}$
31: delete tuple $t$ from the sub-relation $R_i\{K\}$
...
33: let the label of node $u$ be $R_j\{K''\}$
... $_j\{K''\}$ that can join $t$ only do
35: delete($u$, $t$)

Finally, we discuss how to implement the event-driven evaluation. As shown in Figure 2.14, there are multiple nodes labeled with identical $R_i\{K\}$. For example, $W\{\}$ appears in two different nodes in the lattice. For each $R_i\{K\}$, we maintain 3 lists named Rlist (Ready), Wlist (Wait), and Slist (Suspend). The three lists together contain all the node ids in the lattice. A node in the lattice $\mathcal{L}$ labeled $R_i\{K\}$ can only appear in one of the three lists for $R_i\{K\}$. A node $v$ in $\mathcal{L}$ appears in Wlist if the sub-relations represented by all child nodes of $v$ in $\mathcal{L}$ are non-empty, but the sub-relation represented by $v$ is empty. A node $v$ in $\mathcal{L}$ appears in Rlist if the sub-relations represented by all child nodes of $v$ in $\mathcal{L}$ are non-empty, and the sub-relation represented by $v$ itself
is non-empty too. Otherwise, $v$ appears in Slist. When a new tuple $t$ of relation $R_i$ with keyword set $K$ is inserted, we only insert it into all relations in the nodes $v$, in $\mathcal{L}$, on Rlist and Wlist specified for $R_i\{K\}$. Each insertion may notify some father nodes of $v$ to move from Wlist or Slist to Rlist. Node $v$ may also be moved from Wlist to Rlist. When a tuple $t$ of relation $R_i$ with keyword set $K$ is about to be deleted, we only remove it from all relations associated with node