Trang 14.4. ELCA-BASED SEMANTICS 109
Figure 4.9: v and child_elcacan(v) [Xu and Papakonstantinou, 2008] (node labels in the figure: v, u_i, u_{i+1}, u_l, u_m, F_{i+1}, y, p, p.c, p.(c + 1))
Algorithm 37 isELCA(v, ch)
Input: a node v and ch = child_elcacan(v).
Output: true if v is an ELCA, false otherwise.
 1: for i ← 1 to l do
 2:     x ← v
 3:     for j ← 1 to |ch| do
 4:         x ← rm(x, S_i)
 5:         if x = ⊥ or pre(x) < pre(ch[j]) then
 6:             break
 7:         else
 8:             x ← nextSibling(ch[j])
 9:     if j = |ch| + 1 then
10:         return false, if v ⊀ rm(x, S_i)
11: return true
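The check performed by isELCA can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Dewey IDs are modeled as tuples of integers (so tuple comparison gives document order), each S_i is a sorted list, ch holds child nodes of v in document order, and each node in ch is assumed to contain every keyword. The helper names `rm`, `next_sibling`, and `is_ancestor_or_self` are hypothetical, and the final test uses ancestor-or-self so that v itself may serve as a witness when ch is empty.

```python
from bisect import bisect_left


def rm(x, s):
    """Right match: smallest Dewey ID in the sorted list s that is >= x, or None."""
    i = bisect_left(s, x)
    return s[i] if i < len(s) else None


def next_sibling(u):
    """Dewey ID p.(c + 1) for u = p.c, i.e. the first position after u's subtree."""
    return u[:-1] + (u[-1] + 1,)


def is_ancestor_or_self(a, b):
    """True if Dewey ID a is a prefix of Dewey ID b."""
    return b[:len(a)] == a


def is_elca(v, ch, keyword_lists):
    """Return True if v has, for every keyword, a witness outside the subtrees in ch."""
    for s in keyword_lists:
        x = v
        witness_found = False
        for u in ch:
            x = rm(x, s)
            if x is None or x < u:
                # A match lies before the subtree of u, so it is a witness.
                witness_found = True
                break
            # Otherwise skip the whole subtree rooted at u.
            x = next_sibling(u)
        if not witness_found:
            # All child subtrees were skipped; look for a witness after them.
            w = rm(x, s)
            if w is None or not is_ancestor_or_self(v, w):
                return False
    return True
```

For example, with S1 = [(0,0,0), (0,1)] and S2 = [(0,0,1), (0,2)], the root (0,) has the child (0,0) containing an ELCA_CAN, and still finds the witnesses (0,1) and (0,2) outside that subtree.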
After the first step, which yields the ELCA_CANs, if we can find child_elcacan(v) efficiently for each ELCA_CAN v, then we can find the ELCAs in time O(|S1| l d log |S|). If we assign each ELCA_CAN u to be the child of its ancestor ELCA_CAN node v with the largest Dewey ID, then u corresponds to exactly one node in child_elcacan(v), and the node in child_elcacan(v) corresponding to u can be found in O(d) time from the Dewey ID. In the following, we use child_elcacan(v) to denote the set of ELCA_CAN nodes u that are descendants of v such that there does not exist any ELCA_CAN node x with v ≺ x ≺ u, i.e.,
$$\mathit{child\_elcacan}(v) = \{\, u \in \mathit{elca\_can}(S_1, \cdots, S_l) \mid v \prec u \,\wedge\, \nexists\, x \in \mathit{elca\_can}(S_1, \cdots, S_l)\ (v \prec x \prec u) \,\}$$

There is a one-to-one correspondence between the two definitions of child_elcacan(v). It is easy to see that

$$\sum_{v \in \mathit{elca\_can}(S_1, \cdots, S_l)} |\mathit{child\_elcacan}(v)| = O(|\mathit{elca\_can}(S_1, \cdots, S_l)|) = O(|S_1|).$$
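The set-based definition of child_elcacan(v) can be illustrated with a short Python sketch. It assumes the ELCA_CANs are already available as Dewey IDs modeled as integer tuples, and it sorts them first; the function name `child_elcacan_map` is hypothetical. Because every ELCA_CAN is attached to at most one nearest ELCA_CAN ancestor, the child lists together never hold more entries than there are ELCA_CANs, which is the counting argument behind the bound above.

```python
def is_ancestor(a, b):
    """True if Dewey ID a is a proper prefix of Dewey ID b."""
    return len(a) < len(b) and b[:len(a)] == a


def child_elcacan_map(elca_cans):
    """Map each ELCA_CAN v to the ELCA_CANs below it with no ELCA_CAN in between."""
    nodes = sorted(set(elca_cans))          # document (Dewey) order
    children = {v: [] for v in nodes}
    path = []                               # chain of ELCA_CAN ancestors of the current node
    for u in nodes:
        # Drop entries that are not ancestors of u.
        while path and not is_ancestor(path[-1], u):
            path.pop()
        if path:
            # The remaining top is the nearest ELCA_CAN ancestor of u.
            children[path[-1]].append(u)
        path.append(u)
    return children
```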
Now the problem becomes how to compute child_elcacan(v) efficiently for all v ∈ elca_can(S1, · · · , Sl). Note that the nodes in elca_can(S1, · · · , Sl), as computed by ∪_{v1∈S1} elca_can(v1), are not sorted in Dewey ID order. Similar to DeweyInvertedList, a stack-based algorithm is used to compute child_elcacan(v), but it works on the set elca_can(S1, · · · , Sl), while DeweyInvertedList works on the set S1 ∪ S2 ∪ · · · ∪ Sl. Each stack entry created for a node v1 ∈ S1 has the following three components:
• elca_can is elca_can(v1)
• CH is child_elcacan(v1)
• SIB is the list of ELCA_CANs before elca_can, which is used to compute CH
IndexedStack [Xu and Papakonstantinou, 2007, 2008] is shown in Algorithm 38. For each node v1 ∈ S1, it computes elca_can_v1 = elca_can(v1) (line 3), and a stack entry en is created for elca_can_v1 (line 4). If the stack is empty (line 5), we simply push en onto the stack (line 6). Otherwise, different operations are applied based on the relationship between elca_can_v1 and elca_can_v2, which is the node at the top of the stack.
• elca_can_v1 = elca_can_v2: en is discarded (lines 8-9).
• elca_can_v2 ≺ elca_can_v1: en is simply pushed onto the stack (lines 10-11).
• elca_can_v2 < elca_can_v1, but elca_can_v2 ⊀ elca_can_v1: the entries in the stack that are not ancestors of elca_can_v1 are popped, and each popped entry is checked to see whether it is an ELCA (procedure popStack, lines 23-30), because all of its descendant match nodes have been read and its child_elcacan information has been stored in popEntry.CH (lines 27-28). After the non-ancestor nodes have been popped (line 13), it may be necessary to store the sibling nodes of en in en.SIB. Note that, in this case, there may exist a potential ELCA that is an ancestor of en and a descendant of the top entry of the stack (or of the root of the XML tree if the stack is empty). If this is possible (line 15), then the sibling information is stored in en.SIB (line 16).
• elca_can_v1 ≺ elca_can_v2: the entries in the stack that are not ancestors of elca_can_v1 are popped, and each popped entry is checked to see whether it is an ELCA (line 19), and en.CH is stored (line 20). Note that there cannot be any more potential ELCA nodes that are descendants of the popped entries.
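The case analysis above can be expressed as a small Python sketch that classifies the new candidate against the candidate on top of the stack. Dewey IDs are modeled as integer tuples, and the function name `classify` and the returned labels are hypothetical. The remaining combination (the new candidate precedes the top without being its ancestor) cannot occur when S1 is scanned in increasing Dewey ID order, because the new candidate's subtree would then lie entirely before a node that has already been read; the sketch raises an error for it.

```python
def is_ancestor(a, b):
    """True if Dewey ID a is a proper prefix of Dewey ID b."""
    return len(a) < len(b) and b[:len(a)] == a


def classify(elca_can_v1, elca_can_v2):
    """Relate the new candidate (v1) to the candidate on top of the stack (v2)."""
    if elca_can_v1 == elca_can_v2:
        return "equal"              # lines 8-9: discard the new entry
    if is_ancestor(elca_can_v2, elca_can_v1):
        return "top_is_ancestor"    # lines 10-11: push the new entry
    if is_ancestor(elca_can_v1, elca_can_v2):
        return "new_is_ancestor"    # lines 18-21: pop, then record en.CH
    if elca_can_v2 < elca_can_v1:
        return "top_precedes"       # lines 12-17: pop, then maybe record en.SIB
    raise ValueError("candidates must arrive in increasing Dewey ID order of v1")
```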
Note that these are the only four possible cases of the relationship between elca_can_v1 and elca_can_v2. IndexedStack outputs all the ELCA nodes, i.e., elca(S1, · · · , Sl), in time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$, or $O(|S_1|\, l\, d \log |S|)$ [Xu and Papakonstantinou, 2008].
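The step from the first bound to the second can be spelled out as follows. This assumes, as is not stated in this passage, that |S| denotes the size of the largest keyword list, i.e., |S| = max_i |S_i|, and that d bounds the length of a Dewey ID (the depth of the tree):

```latex
\begin{align*}
|S_1| \sum_{i=2}^{l} d \log |S_i|
  &\le |S_1| \sum_{i=2}^{l} d \log |S| \\
  &= |S_1| \,(l-1)\, d \log |S| \\
  &= O\bigl(|S_1|\, l\, d \log |S|\bigr).
\end{align*}
```

As a purely illustrative instance with base-2 logarithms, l = 3, d = 10, |S_1| = 100, and |S_2| = |S_3| = 2^{20} give 100 × (10 × 20 + 10 × 20) = 40,000 for the first expression, ignoring constant factors.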
Algorithm 38 IndexedStack(S1, · · · , Sl)
Input: l lists of Dewey IDs, where S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: all ELCAs.
 1: stack ← ∅
 2: for each node v1 ∈ S1, in increasing Dewey ID order do
 3:     elca_can_v1 ← slca({v1}, S2, · · · , Sl)
 4:     en ← [elca_can ← elca_can_v1; SIB ← []; CH ← []]
 5:     if stack = ∅ then
 6:         stack.push(en); continue
 7:     topEntry ← stack.top(); elca_can_v2 ← topEntry.elca_can
 8:     if elca_can_v1 = elca_can_v2 then
 9:         ⊥    (en is discarded)
10:     else if elca_can_v2 ≺ elca_can_v1 then
11:         stack.push(en)
12:     else if elca_can_v2 < elca_can_v1 then
13:         popEntry ← popStack(elca_can_v1)
14:         top_elcacan ← stack.top().elca_can
15:         if stack ≠ ∅ and top_elcacan ≺ lca(elca_can_v1, popEntry.elca_can) then
16:             en.SIB ← [popEntry.SIB, popEntry.elca_can]
17:         stack.push(en)
18:     else if elca_can_v1 ≺ elca_can_v2 then
19:         popEntry ← popStack(elca_can_v1)
20:         en.CH ← [popEntry.SIB, popEntry.elca_can]
21:         stack.push(en)
22: popStack(0)

23: Procedure popStack(elca_can_v1)
24:     popEntry ← ⊥
25:     while stack ≠ ∅ and stack.top().elca_can ⊀ elca_can_v1 do
26:         popEntry ← stack.pop()
27:         if isELCA(popEntry.elca_can, toChild_elcacan(popEntry.elca_can, popEntry.CH)) then
28:             output popEntry.elca_can as an ELCA
29:         stack.top().CH ← stack.top().CH + popEntry.elca_can
30:     return popEntry
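Line 3 of Algorithm 38 computes the ELCA candidate of a single node v1 as slca({v1}, S2, · · · , Sl). One way to sketch this step in Python is shown below; it is an illustration under stated assumptions rather than the authors' code. Dewey IDs are integer tuples sharing a common root component, each S_i is a sorted list, and the helper names are hypothetical. For each list, the deepest ancestor-or-self of v1 that contains a node of that list is obtained from the two neighbors of v1 in sorted order, found by binary search; the candidate is the shallowest of these ancestors.

```python
from bisect import bisect_left


def lca(a, b):
    """Lowest common ancestor of two Dewey IDs: their longest common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def deepest_ancestor_with_match(v, s):
    """Deepest ancestor-or-self of v whose subtree holds a node of sorted list s."""
    i = bisect_left(s, v)
    best = None
    for j in (i - 1, i):                 # left and right neighbors of v in s
        if 0 <= j < len(s):
            cand = lca(v, s[j])
            if best is None or len(cand) > len(best):
                best = cand
    return best                          # None if s is empty


def elca_can(v1, other_lists):
    """Sketch of slca({v1}, S2, ..., Sl); returns None if some list is empty."""
    result = v1
    for s in other_lists:
        a = deepest_ancestor_with_match(v1, s)
        if a is None:
            return None
        if len(a) < len(result):         # all candidates are ancestors of v1: keep the shallowest
            result = a
    return result
```

Each binary search costs O(log |S_i|) comparisons of Dewey IDs of length at most d, which matches the d log |S_i| term per list in the stated running time.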
4.4.2 IDENTIFYING MEANINGFUL ELCAS
Kong et al. [2009] extend the definition of contributor [Liu and Chen, 2008b] to valid contributor, and they propose an algorithm similar to MaxMatch to compute relevant matches based on ELCA semantics, i.e., the root t can be an ELCA node.
Definition 4.35 Valid Contributor. Given an XML data T and a keyword query Q, a node v in T is called a valid contributor to Q if either one of the following two conditions holds:
1. v has a unique label tag(v) among its sibling nodes.
2. v has several siblings v_1, · · · , v_m (m ≥ 1) with the same label as tag(v), but the following conditions hold:
• ∄ v_i such that dMatch(v) ⊂ dMatch(v_i)
• ∀ v_i > v, if dMatch(v) = dMatch(v_i), then TC_v ≠ TC_{v_i}, where TC_v denotes the set of words (among the match nodes in M) that appear in the subtree rooted at v
A node is compared only with its sibling nodes that have the same label. If a node v has a unique label among its sibling nodes, then it is a valid contributor. Otherwise, only those nodes whose dMatch is not subsumed by that of any sibling node with the same label are valid contributors. Also, if the subtrees rooted at two sibling nodes contain exactly the same set of words (TC_v), then only one of them is a valid contributor.
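The rule can be sketched in Python for one group of siblings. This is an illustrative reading, not the algorithm of Kong et al.: the `Node` record and function name are hypothetical, dMatch and TC are modeled as frozensets, "subsumed" is read as proper subset, and among same-label siblings with equal dMatch and equal TC the last one in document order is kept (one possible reading of the condition on siblings v_i > v).

```python
from typing import FrozenSet, List, NamedTuple


class Node(NamedTuple):
    label: str
    dmatch: FrozenSet[str]   # keywords matched in the subtree
    tc: FrozenSet[str]       # words appearing in the subtree


def valid_contributors(siblings: List[Node]) -> List[bool]:
    """For siblings in document order, flag which ones are valid contributors."""
    flags = []
    for i, v in enumerate(siblings):
        same = [(j, s) for j, s in enumerate(siblings)
                if j != i and s.label == v.label]
        if not same:
            flags.append(True)            # unique label among siblings
            continue
        # dMatch(v) strictly contained in a same-label sibling's dMatch
        subsumed = any(v.dmatch < s.dmatch for _, s in same)
        # a later same-label sibling with identical dMatch and identical TC
        duplicate = any(j > i and s.dmatch == v.dmatch and s.tc == v.tc
                        for j, s in same)
        flags.append(not subsumed and not duplicate)
    return flags
```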
Definition 4.36 Relevant Match. For an XML tree T and a query Q, a match node v in T is relevant if v is a witness node of u ∈ elca(Q), and all the nodes on the path from u to v are valid contributors.
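The path condition in this definition can be sketched in Python on Dewey IDs modeled as integer tuples. The function names are hypothetical, the valid-contributor test is passed in as a predicate, and two assumptions are made that the definition does not settle: the ELCA node u itself is not tested (only the nodes below it, down to and including v), and v is already known to be a witness of u.

```python
def path_nodes(u, v):
    """Dewey IDs strictly below u down to v inclusive; u must be an ancestor-or-self of v."""
    if v[:len(u)] != u:
        raise ValueError("u is not an ancestor-or-self of v")
    return [v[:k] for k in range(len(u) + 1, len(v) + 1)]


def is_relevant(u, v, is_valid_contributor):
    """True if every node on the path from u (exclusive) to v is a valid contributor."""
    return all(is_valid_contributor(n) for n in path_nodes(u, v))
```

For instance, with u = (0, 1) and v = (0, 1, 2, 0), the nodes tested are (0, 1, 2) and (0, 1, 2, 0).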
Based on this definition of valid contributor and relevant match, all the subtrees formed by an ELCA node and its corresponding relevant match nodes satisfy the four properties discussed earlier [Liu and Chen, 2008b], namely, data monotonicity, data consistency, query monotonicity, and query consistency. An algorithm to find the relevant matches for each ELCA node exists [Kong et al., 2009]; it consists of three steps: (1) find all ELCAs using DeweyInvertedList or IndexedStack, (2) group match nodes to each ELCA node, and (3) prune irrelevant matches from each group. The algorithm uses ideas similar to MaxMatch to find relevant matches according to the definition of valid contributor.
4.5 OTHER APPROACHES
There exist several semantics other than SLCA and ELCA for keyword search on XML databases, namely, meaningful LCA (MLCA) [Li et al., 2004, 2008b], interconnection [Cohen et al., 2003], Compact Valuable LCA (CVLCA) [Li et al., 2007a], and relevance oriented ranking [Bao et al., 2009]. The difference between MLCA and interconnection is that MLCA is based on SLCA, whereas interconnection is not, i.e., the root nodes of the subtrees returned by interconnection may not be SLCA nodes. CVLCA is a combination of ELCA semantics and the interconnection semantics.
Another approach to keyword search on XML databases is to make use of the schema information, where results are minimal connected trees of XML fragments that contain all the keywords [Balmin et al., 2003; Hristidis et al., 2003b]. Hristidis et al. study keyword search on XML trees, and propose efficient algorithms to find minimum connecting trees [Hristidis et al., 2006]. Al-Khalifa et al. integrate the IR-styled ranking function into XQuery, and they propose a bulk-algebra which is the basis for integrating information retrieval techniques into a standard pipelined database query evaluation engine [Al-Khalifa et al., 2003]. NaLIX (Natural Language Interface to XML) is a system in which an arbitrary English language sentence is translated into an XQuery expression that can be evaluated against an XML database [Li et al., 2007b]. The problem of keyword search on XML using a minimal number of materialized views is also studied, where the answer definition is based on SLCA semantics [Liu and Chen, 2008a]. Some works study the problem of keyword search over virtual (unmaterialized) XML views [Shao et al., 2007, 2009a]. eXtract is a system that generates snippets for tree results of querying on XML databases, highlighting the most dominant features [Huang et al., 2008a,b]. Answer differentiation is studied to find a limited number of valid features in a result so that they can maximally differentiate this result from the others [Liu et al., 2009a].