Keyword Search in Databases- P23 ppt


4.4 ELCA-BASED SEMANTICS

Figure 4.9: v and child_elcacan(v) [Xu and Papakonstantinou, 2008]

Algorithm 37 isELCA(v, ch)

Input: a node v and ch = child_elcacan(v).
Output: true if v is an ELCA, false otherwise.

 1: for i ← 1 to l do
 2:     x ← v
 3:     for j ← 1 to |ch| do
 4:         x ← rm(x, S_i)
 5:         if x = ⊥ or pre(x) < pre(ch[j]) then
 6:             break
 7:         else
 8:             x ← nextSibling(ch[j])
 9:     if j = |ch| + 1 then
10:         return false if v ⊀ rm(x, S_i)
11: return true
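The scan in Algorithm 37 can be tried out on small inputs. The following is a minimal Python sketch, not the authors' implementation. It assumes that Dewey IDs are tuples of integers, so that tuple comparison coincides with preorder; that each keyword list is a sorted Python list; that ch holds child nodes of v in Dewey order, as drawn in Figure 4.9; and that the final ancestor test also accepts v itself. The helper names rm, next_sibling, and is_ancestor_or_self are illustrative.

```python
from bisect import bisect_left

def rm(x, s):
    """Right match: first node of sorted list s at or after x, or None."""
    i = bisect_left(s, x)
    return s[i] if i < len(s) else None

def next_sibling(u):
    """Dewey ID p.(c + 1) for a node p.c (the node itself need not exist)."""
    return u[:-1] + (u[-1] + 1,)

def is_ancestor_or_self(a, b):
    """True if b is a (assumed: a or a descendant of a); False for b = None."""
    return b is not None and b[:len(a)] == a

def is_elca(v, ch, keyword_lists):
    """Sketch of Algorithm 37: look for a witness of every keyword under v
    that lies outside all subtrees listed in ch."""
    for s in keyword_lists:
        x = v
        for c in ch:
            x = rm(x, s)
            if x is None or x < c:
                break                      # witness found before subtree c
            x = next_sibling(c)            # skip the whole subtree of c
        else:
            # Loop ran to completion (j = |ch| + 1): the next match after
            # the last skipped subtree must still lie under v.
            if not is_ancestor_or_self(v, rm(x, s)):
                return False
    return True

S1 = [(0, 0, 0), (0, 1)]
S2 = [(0, 0, 1), (0, 2)]
print(is_elca((0,), [(0, 0)], [S1, S2]))   # True
```

In this example the root (0,) keeps the witnesses (0, 1) and (0, 2) after the subtree of (0, 0) is skipped, so it is reported as an ELCA.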

Once the first step has produced the ELCA_CANs, if child_elcacan(v) can be found efficiently for each ELCA_CAN v, then all ELCAs can be found in O(|S_1| l d log |S|) time. If each ELCA_CAN u is assigned as a child of its ancestor ELCA_CAN node v with the largest Dewey ID, then u corresponds to exactly one node in child_elcacan(v), and that node can be found from the Dewey ID in O(d) time. In the following, child_elcacan(v) denotes the set of ELCA_CAN nodes u that are descendants of v such that no ELCA_CAN node x satisfies v ≺ x ≺ u, i.e.,

    child_elcacan(v) = { u ∈ elca_can(S_1, ..., S_l) | v ≺ u ∧ ∄ x ∈ elca_can(S_1, ..., S_l) (v ≺ x ≺ u) }

There is a one-to-one correspondence between the two definitions of child_elcacan(v). Since each ELCA_CAN is a child of at most one other ELCA_CAN, and each node of S_1 yields one ELCA_CAN, it is easy to see that

    Σ_{v ∈ elca_can(S_1, ..., S_l)} |child_elcacan(v)| = O(|elca_can(S_1, ..., S_l)|) = O(|S_1|).
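To make the second definition concrete, here is a brute-force Python sketch that evaluates it directly on a small set of ELCA_CANs. It is only an illustration of the definition under the assumption that Dewey IDs are integer tuples; it is quadratic and is not the stack-based method the text goes on to describe. The names is_ancestor, child_elcacan, and to_child are hypothetical.

```python
def is_ancestor(a, b):
    """True if a is a proper ancestor of b (a ≺ b) for Dewey ID tuples."""
    return len(a) < len(b) and b[:len(a)] == a

def child_elcacan(v, cands):
    """ELCA_CAN descendants u of v with no ELCA_CAN x such that v ≺ x ≺ u."""
    below = [u for u in cands if is_ancestor(v, u)]
    return sorted(u for u in below
                  if not any(is_ancestor(x, u) for x in below))

def to_child(v, u):
    """Child of v on the path to its descendant u, read off the Dewey ID."""
    return u[:len(v) + 1]

cands = {(0,), (0, 0), (0, 0, 1), (0, 2, 3)}
for v in sorted(cands):
    print(v, child_elcacan(v, cands))
```

Here (0, 0, 1) is attached to (0, 0) rather than to the root, so every ELCA_CAN appears in at most one child list, which is what bounds the total size of all lists by the number of ELCA_CANs.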


110 4 KEYWORD SEARCH IN XML DATABASES

Now the problem becomes how to compute child_elcacan(v) efficiently for all v ∈ elca_can(S_1, ..., S_l). Note that the nodes in elca_can(S_1, ..., S_l), as computed by ∪_{v_1 ∈ S_1} elca_can(v_1), are not sorted in Dewey ID order. Similar to DeweyInvertedList, a stack-based algorithm is used to compute child_elcacan(v), but it works on the set elca_can(S_1, ..., S_l), whereas DeweyInvertedList works on the set S_1 ∪ S_2 ∪ ... ∪ S_l. Each stack entry created for a node v_1 ∈ S_1 has the following three components:

• elca_can is elca_can(v_1)

• CH is child_elcacan(elca_can(v_1))

• SIB is the list of ELCA_CANs before elca_can, which is used to compute CH

IndexedStack [Xu and Papakonstantinou, 2007, 2008] is shown in Algorithm 38. For each node v_1 ∈ S_1, it computes elca_can_v1 = elca_can(v_1) (line 3) and creates a stack entry en for elca_can_v1 (line 4). If the stack is empty (line 5), en is simply pushed onto the stack (line 6). Otherwise, the operation applied depends on the relationship between elca_can_v1 and elca_can_v2, the node at the top of the stack:

• elca_can_v1 = elca_can_v2: en is discarded (lines 8-9).

• elca_can_v2 ≺ elca_can_v1: en is simply pushed onto the stack (lines 10-11).

• elca_can_v2 < elca_can_v1, but elca_can_v2 ⊀ elca_can_v1: the stack entries that are not ancestors of elca_can_v1 are popped, and each popped entry is checked for being an ELCA (procedure popStack, lines 23-30). The check is possible at this point because all descendant match nodes of the popped entry have been read and its child_elcacan information has been stored in popEntry.CH (lines 27-28). After the non-ancestor nodes have been popped (line 13), it may be necessary to store the sibling nodes of en in en.SIB. In this case there may exist a potential ELCA that is an ancestor of en and a descendant of the top entry of the stack (or of the root of the XML tree if the stack is empty). If this is possible (line 15), the sibling information is stored in en.SIB (line 16).

• elca_can_v1 ≺ elca_can_v2: the stack entries that are not ancestors of elca_can_v1 are popped, each is checked for being an ELCA (line 19), and en.CH is stored (line 20). No further potential ELCA nodes can be descendants of the popped entries.

These are the only four possible relationships between elca_can_v1 and elca_can_v2: because the nodes of S_1 are read in increasing Dewey ID order, elca_can_v1 cannot precede elca_can_v2 unless it is one of its ancestors. IndexedStack outputs all the ELCA nodes, i.e., elca(S_1, ..., S_l), in O(|S_1| Σ_{i=2}^{l} d log |S_i|) time, or O(|S_1| l d log |S|) [Xu and Papakonstantinou, 2008].
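The four-way case analysis can be written as a small dispatch on Dewey IDs. The sketch below is an illustration under the assumption that Dewey IDs are integer tuples (so tuple order equals document order). The function name classify and the returned labels are hypothetical, and the function only names the branch of Algorithm 38 that would be taken, without performing the stack operations.

```python
def is_ancestor(a, b):
    """True if a is a proper ancestor of b (a ≺ b) for Dewey ID tuples."""
    return len(a) < len(b) and b[:len(a)] == a

def classify(elca_can_v1, elca_can_v2):
    """Name the branch of Algorithm 38 taken for a new candidate
    elca_can_v1 against the candidate elca_can_v2 on top of the stack."""
    if elca_can_v1 == elca_can_v2:
        return "discard"                   # lines 8-9
    if is_ancestor(elca_can_v2, elca_can_v1):
        return "push"                      # lines 10-11
    if is_ancestor(elca_can_v1, elca_can_v2):
        return "pop_then_set_CH"           # lines 18-20
    if elca_can_v2 < elca_can_v1:
        return "pop_then_maybe_set_SIB"    # lines 12-16
    # The new candidate would precede the stack top without being its
    # ancestor; this cannot happen when S_1 is read in Dewey ID order.
    raise ValueError("input not processed in increasing Dewey ID order")

print(classify((0, 1, 2), (0, 1)))   # push
print(classify((0,), (0, 1)))        # pop_then_set_CH
```

The ancestor tests are placed before the plain order comparison because an ancestor also precedes its descendants in document order, which mirrors the order of the tests in lines 10 and 12.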


Algorithm 38 IndexedStack(S_1, ..., S_l)

Input: l lists of Dewey IDs; S_i is the list of Dewey IDs of the nodes containing keyword k_i.
Output: all ELCAs.

 1: stack ← ∅
 2: for each node v_1 ∈ S_1, in increasing Dewey ID order do
 3:     elca_can_v1 ← slca({v_1}, S_2, ..., S_l)
 4:     en ← [elca_can ← elca_can_v1; SIB ← []; CH ← []]
 5:     if stack = ∅ then
 6:         stack.push(en); continue
 7:     topEntry ← stack.top(); elca_can_v2 ← topEntry.elca_can
 8:     if elca_can_v1 = elca_can_v2 then
 9:         ⊥
10:     else if elca_can_v2 ≺ elca_can_v1 then
11:         stack.push(en)
12:     else if elca_can_v2 < elca_can_v1 then
13:         popEntry ← popStack(elca_can_v1)
14:         top_elcacan ← stack.top().elca_can
15:         if stack ≠ ∅ and top_elcacan ≺ lca(elca_can_v1, popEntry.elca_can) then
16:             en.SIB ← [popEntry.SIB, popEntry.elca_can]
17:         stack.push(en)
18:     else if elca_can_v1 ≺ elca_can_v2 then
19:         popEntry ← popStack(elca_can_v1)
20:         en.CH ← [popEntry.SIB, popEntry.elca_can]
21:         stack.push(en)
22: popStack(0)

23: Procedure popStack(elca_can_v1)
24:     popEntry ← ⊥
25:     while stack ≠ ∅ and stack.top().elca_can ⊀ elca_can_v1 do
26:         popEntry ← stack.pop()
27:         if isELCA(popEntry.elca_can, toChild_elcacan(popEntry.elca_can, popEntry.CH)) then
28:             output popEntry.elca_can as an ELCA
29:         stack.top().CH ← stack.top().CH + popEntry.elca_can
30:     return popEntry
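Line 3 of Algorithm 38 computes one ELCA_CAN per node of S_1. A minimal Python sketch of that step follows. It assumes Dewey IDs are integer tuples, every keyword list is sorted and non-empty, and it uses the common observation that the node of a sorted list closest to v_1 is either its left or its right neighbour in document order. This is one way to evaluate slca({v_1}, S_2, ..., S_l), not necessarily the exact procedure of the cited papers. The names lcp_len, closest_depth, and elca_can are illustrative.

```python
from bisect import bisect_left

def lcp_len(a, b):
    """Length of the common prefix of two Dewey IDs (depth of their LCA)."""
    n = 0
    for p, q in zip(a, b):
        if p != q:
            break
        n += 1
    return n

def closest_depth(v, s):
    """Depth of the deepest LCA of v with any node of the sorted list s,
    checking only the left and right neighbours of v in document order."""
    i = bisect_left(s, v)
    best = 0
    if i < len(s):                       # right match
        best = max(best, lcp_len(v, s[i]))
    if i > 0:                            # left match
        best = max(best, lcp_len(v, s[i - 1]))
    return best

def elca_can(v1, other_lists):
    """Lowest ancestor-or-self of v1 whose subtree contains every keyword."""
    depth = min([len(v1)] + [closest_depth(v1, s) for s in other_lists])
    return v1[:depth]

S1 = [(0, 0, 0), (0, 1)]
S2 = [(0, 0, 1), (0, 2)]
print([elca_can(v1, [S2]) for v1 in S1])   # [(0, 0), (0,)]
```

The output also shows why the candidates need further processing: reading S_1 in Dewey ID order yields (0, 0) before its ancestor (0,), so the candidates do not come out sorted.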

4.4.2 IDENTIFYING MEANINGFUL ELCAS

Kong et al. [2009] extend the definition of contributor [Liu and Chen, 2008b] to valid contributor, and they propose an algorithm similar to MaxMatch to compute relevant matches based on ELCA semantics, i.e., the root t can be an ELCA node.

Definition 4.35 Valid Contributor. Given an XML data T and a keyword query Q, a node v in T is called a valid contributor to Q if either of the following two conditions holds:

1. v has a unique label tag(v) among its sibling nodes.

2. v has several siblings v_1, ..., v_m (m ≥ 1) with the same label as tag(v), but the following conditions hold:

    • ∄ v_i such that dMatch(v) ⊂ dMatch(v_i)

    • ∀ v_i > v, if dMatch(v) = dMatch(v_i), then TC_v ≠ TC_vi, where TC_v denotes the set of words (among the match nodes in M) that appear in the subtree rooted at v

A valid contributor is compared only with its sibling nodes that have the same label. If a node v has a unique label among its sibling nodes, then it is a valid contributor. Otherwise, only those nodes whose dMatch is not subsumed by that of any sibling node with the same label are valid contributors. Also, if the subtrees rooted at two sibling nodes contain exactly the same set of words (TC_v), then only one of them is a valid contributor.
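The sibling test of Definition 4.35 can be sketched in Python as follows. This is an illustration only, not the algorithm of Kong et al. It assumes that each sibling is given as a (label, dMatch, TC) triple with the two sets stored as frozensets, that the siblings are listed in document order, and that "v_i > v" refers to siblings that come later in that order, so that the last of several identical siblings is the one kept. The function name valid_contributors is hypothetical.

```python
def valid_contributors(siblings):
    """siblings: list of (label, dmatch, tc) triples in document order.
    Returns one boolean per sibling: True if it is a valid contributor."""
    result = []
    for k, (label, dmatch, tc) in enumerate(siblings):
        same = [(j, s) for j, s in enumerate(siblings)
                if j != k and s[0] == label]
        if not same:
            result.append(True)          # condition 1: unique label
            continue
        # Condition 2, first bullet: no same-label sibling strictly subsumes v.
        subsumed = any(dmatch < s[1] for _, s in same)
        # Condition 2, second bullet (assumed reading): no later same-label
        # sibling has both the same dMatch and the same word set TC.
        duplicate = any(j > k and s[1] == dmatch and s[2] == tc
                        for j, s in same)
        result.append(not subsumed and not duplicate)
    return result

sibs = [
    ("paper", frozenset({"xml"}), frozenset({"xml"})),
    ("paper", frozenset({"xml", "search"}), frozenset({"xml", "search"})),
    ("title", frozenset({"xml"}), frozenset({"xml"})),
]
print(valid_contributors(sibs))   # [False, True, True]
```

The first paper node is pruned because its dMatch is a proper subset of that of the second paper node, while the title node survives because no sibling shares its label.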

Definition 4.36 Relevant Match. For an XML tree T and a query Q, a match node v in T is relevant if v is a witness node of some u ∈ elca(Q), and all the nodes on the path from u to v are valid contributors.

Based on these definitions of valid contributor and relevant match, all the subtrees formed by an ELCA node and its corresponding relevant match nodes satisfy the four properties discussed earlier [Liu and Chen, 2008b], namely data monotonicity, data consistency, query monotonicity, and query consistency. Kong et al. [2009] give an algorithm that finds the relevant matches for each ELCA node in three steps: (1) find all ELCAs using DeweyInvertedList or IndexedStack, (2) group match nodes to each ELCA node, and (3) prune irrelevant matches from each group. The algorithm uses ideas similar to MaxMatch to find relevant matches according to the definition of valid contributor.

4.5 OTHER APPROACHES

There exist several semantics other than SLCA and ELCA for keyword search on XML databases, namely, meaningful LCA (MLCA) [Li et al., 2004, 2008b], interconnection [Cohen et al., 2003], Compact Valuable LCA (CVLCA) [Li et al., 2007a], and relevance oriented ranking [Bao et al., 2009]. The difference between MLCA and interconnection is that MLCA is based on SLCA, whereas interconnection is not, i.e., the root nodes of the subtrees returned by interconnection may not be SLCA nodes. CVLCA is a combination of ELCA semantics and the interconnection semantics.

Another approach to keyword search on XML databases is to make use of the schema information, where results are minimal connected trees of XML fragments that contain all the keywords [Balmin et al., 2003; Hristidis et al., 2003b]. Hristidis et al. study keyword search on XML trees and propose efficient algorithms to find minimum connecting trees [Hristidis et al., 2006]. Al-Khalifa et al. integrate an IR-style ranking function into XQuery, and they propose a bulk algebra that is the basis for integrating information retrieval techniques into a standard pipelined database query evaluation engine [Al-Khalifa et al., 2003]. NaLIX (Natural Language Interface to XML) is a system in which an arbitrary English language sentence is translated into an XQuery expression that can be evaluated against an XML database [Li et al., 2007b]. The problem of keyword search on XML using a minimal number of materialized views has also been studied, where the answer definition is based on SLCA semantics [Liu and Chen, 2008a]. Some works study the problem of keyword search over virtual (unmaterialized) XML views [Shao et al., 2007, 2009a]. eXtract is a system that generates snippets for the tree results of querying an XML database, highlighting the most dominant features [Huang et al., 2008a,b]. Answer differentiation is studied to find a limited number of valid features in a result so that they can maximally differentiate this result from the others [Liu et al., 2009a].
