Figure 4.5: Sample XML Document [Liu and Chen, 2008b] (tree diagram with Dewey-labeled nodes, e.g., 0 (team), 0.0 (name), 0.1 (players); panels T1 and T2)
t, M) rooted at t, with nodes (M) corresponding to the matches that are considered relevant to Q. Every keyword in Q has at least one match in M.
Note that one query result should not be subsumed by another; therefore, the root nodes
and S1, · · · , Sl. In the following, we mainly focus on identifying meaningful information based on
t, M.
4.3.1 XSEEK
XSeek [Liu and Chen, 2007; Liu et al., 2009b, 2007] is a system that represents the whole subtree
is likely that the user is interested in information about “Grizzlies.” But by the definition of SLCA,
only the node 0.0.0 (Grizzlies) is returned, which is not informative. Ideally, the subtree rooted at
interested in information about the player whose name is “Gasol” and who is a “forward” in the
team for Q2, and the user is interested in a particular piece of information: the “position” of “Gasol”
for Q3. To process Q5, XSeek outputs the name of players and provides a link to its player children,
which provides information about all the players in the team.
<!ELEMENT team (name, players)>
<!ELEMENT players (player*)>
Figure 4.7: Queries for XSeek
In order to find meaningful return nodes, XSeek analyzes both XML data structure and keyword match patterns. Three types of information are represented in XML data: entities in the
real world, attributes of entities, and connection nodes. The input keywords are categorized into two types: the ones that specify search predicates, and the ones that indicate return information. Then, based on the data and keyword analysis, XSeek generates meaningful return nodes.
In order to differentiate the three types of information represented in XML data, XML
schema information is needed, which is either provided or inferred from the data. An example
schema fragment of the XML tree shown in Figure 4.5 is shown in Figure 4.6. For each XML node,
it specifies the names of its sub-elements and attributes using regular expressions with operators
such as “*” and “?”. For example, “Element players (player*)” indicates that “players” can have zero or more “player” children, and
“Element player (name, nationality, position?)” indicates that a “player” has exactly one “name” and one “nationality”, and may optionally have a “position”. “Element name (#PCDATA)” specifies that “name” has a value child. In the following, we refer to the nodes that can have siblings of the same name as
*-nodes, as they are followed by “*” in the schema, e.g., the “player” node.
Analyzing XML Data Structure: Similar to the E-R model used in relational databases, XSeek
differentiates nodes in an XML tree into three categories.
• A node represents an entity if it corresponds to a *-node in the schema.
• A node denotes an attribute if it does not correspond to a *-node, and only has one child,
which is a value.
• A node is a connection node if it represents neither an entity nor an attribute. A connection
node can have a child that is an entity, an attribute, or another connection node.
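The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XSeek's implementation: the Node class, the STAR_NODES set (tags followed by "*" in the schema), and the function name categorize are introduced here for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the tags of *-nodes have been read off the schema ("player*").
STAR_NODES = {"player"}

@dataclass
class Node:
    tag: str                                  # element name, e.g. "player"
    value: Optional[str] = None               # text of the value child, if any
    children: List["Node"] = field(default_factory=list)  # element children

def categorize(node: Node) -> str:
    """Return 'entity', 'attribute', or 'connection' for an element node."""
    if node.tag in STAR_NODES:
        return "entity"
    # Not a *-node, and its only child is a value.
    if node.value is not None and not node.children:
        return "attribute"
    return "connection"

gasol = Node("player", children=[
    Node("name", "Gasol"),
    Node("nationality", "Spain"),
    Node("position", "forward"),
])
players = Node("players", children=[gasol])

print(categorize(gasol))              # entity
print(categorize(gasol.children[0]))  # attribute
print(categorize(players))            # connection
```

Keeping the *-node tags in a separate set mirrors the fact that the entity test depends on the schema, while the attribute test depends only on the node itself.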
For example, consider the schema shown in Figure 4.6, where “player” is a *-node, indicating
a many-to-one relationship with its parent node “players”. It is inferred to be an entity, while “name”,
“nationality”, and “position” are considered attributes of a “player” entity. Since “players” is not a *-node and does not have a value child, it is considered to be a connection node. Although the above inferences do not always hold, they provide heuristics in the absence of an E-R model. When the schema information is not available, it can be inferred based on data summarization [Yu and Jagadish, 2006].
Analyzing Keyword Match Patterns: The input keywords can be classified into two categories:
search predicates, which correspond to the where clause in XQuery or SQL, and return nodes, which
correspond to the return clause in XQuery or the select clause in SQL. They are inferred as follows:
• If a keyword k1 matches a node name (tag) u, and there is no keyword k2 matching a node value v such that u is an ancestor of v, then k1 specifies a return node.
• A keyword that does not indicate a return node is treated as a predicate specification. In other
words, if a keyword matches a node value, or it matches a node name (tag) that has a value descendant matching another keyword, then this keyword specifies a predicate.
since they match value nodes. In Q3, in contrast, “position” is inferred as a return node since it matches the
name of two nodes, neither of which has any descendant value node matching the other keyword.
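The two inference rules can be illustrated with a small sketch. It assumes a fragment of the tree in Figure 4.5 stored as a dictionary keyed by Dewey labels; the TREE contents, the helper names, and the sample keyword lists are assumptions made for this example. When a keyword matches several node names, the sketch treats it as a return node only if none of those nodes has a value descendant matched by another keyword.

```python
# Dewey label -> (kind, text); kind is "tag" for node names, "value" for values.
# Assumption: a fragment of the sample tree, restricted to one player.
TREE = {
    "0": ("tag", "team"),
    "0.0": ("tag", "name"),
    "0.0.0": ("value", "Grizzlies"),
    "0.1": ("tag", "players"),
    "0.1.0": ("tag", "player"),
    "0.1.0.0": ("tag", "name"),
    "0.1.0.0.0": ("value", "Gasol"),
    "0.1.0.2": ("tag", "position"),
    "0.1.0.2.0": ("value", "forward"),
}

def is_ancestor(u, v):
    """True if Dewey label u is a proper ancestor of Dewey label v."""
    return v.startswith(u + ".")

def matches(keyword, kind):
    """Labels of nodes of the given kind whose text equals the keyword."""
    return [d for d, (k, text) in TREE.items()
            if k == kind and text.lower() == keyword.lower()]

def classify(keywords):
    """Map each keyword to 'return' or 'predicate'."""
    roles = {}
    for k1 in keywords:
        tag_hits = matches(k1, "tag")
        # Value nodes matched by the other keywords.
        value_hits = [v for k2 in keywords if k2 != k1
                      for v in matches(k2, "value")]
        is_return = bool(tag_hits) and all(
            not any(is_ancestor(u, v) for v in value_hits) for u in tag_hits
        )
        roles[k1] = "return" if is_return else "predicate"
    return roles

print(classify(["Gasol", "position"]))
print(classify(["Gasol", "forward"]))
```

With the keywords "Gasol" and "position", the sketch labels "position" as a return node and "Gasol" as a predicate; with "Gasol" and "forward", both are predicates because both match only node values.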
Generating Search Results: XSeek generates a subtree for each (t, M) independently, where t =
lca(M) and t ∈ slca(Q). Sometimes return nodes can be found by analyzing the keyword match patterns; otherwise, they can be inferred implicitly by analyzing the XML data and the match M.
Definition 4.23 Master Entity. If an entity e is the lowest ancestor-or-self of the LCA node t of a match M, then e is named the master entity of match M. If such an e cannot be found, the root of the XML tree is considered to be the master entity.
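As a rough sketch of Definition 4.23, the master entity can be found by computing the LCA of the match nodes from their Dewey labels and then walking upward until an entity is reached. The TAGS table, the STAR_NODES set, and the function names are assumptions made for this example, and the root is assumed to carry the label "0".

```python
# Assumption: tags of the element nodes in a fragment of the sample tree.
TAGS = {
    "0": "team", "0.0": "name", "0.1": "players",
    "0.1.0": "player", "0.1.0.0": "name", "0.1.0.2": "position",
}
STAR_NODES = {"player"}  # entities are the *-nodes of the schema

def lca(labels):
    """Lowest common ancestor: the longest common Dewey prefix."""
    common = []
    for components in zip(*(label.split(".") for label in labels)):
        if len(set(components)) != 1:
            break
        common.append(components[0])
    return ".".join(common)

def master_entity(match_labels):
    """Lowest ancestor-or-self entity of lca(M); the root if there is none."""
    t = lca(match_labels).split(".")
    for depth in range(len(t), 0, -1):
        label = ".".join(t[:depth])
        if TAGS.get(label) in STAR_NODES:
            return label
    return "0"

print(master_entity(["0.1.0.0.0", "0.1.0.2.0"]))  # 0.1.0
print(master_entity(["0.0.0"]))                   # 0
```

In the second call the only match is the value “Grizzlies”, none of whose ancestors is an entity, so the root is returned.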
Based on the previous analysis, meaningful return information can be found in two steps. First, output all the predicate matches. Second, output the return nodes based on their node categories.
Output Predicate Matches: The predicate matches are output, so that the user can check
output as part of search results, indicating how the keywords are matched and connected to each other.
Output Return Nodes: The return nodes are output based on their node categories: entity, attribute, and connection node. If a return node is an attribute node, then its name and value child are output. The subtree rooted at an entity node or connection node is output compactly, by providing the most relevant information first, with expansion links for browsing more details. First, the name
of this node and all of its attribute children are output. Then a link is generated to each group of child entities that have the same name (tag), and a link is generated to each child connection node.
name “team”, the names and values of its attributes are output, e.g., 0.0 (name) and 0.0.0 (Grizzlies).
An expansion link to its connection child 0.1 (players) is generated.
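The compact output of an entity or connection node can be sketched as follows. The Node class, the STAR_NODES set, and the functions categorize and compact_output are assumptions made for this illustration; expansion links are represented simply by the tag they would lead to.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STAR_NODES = {"player"}  # assumption: *-node tags taken from the schema

@dataclass
class Node:
    tag: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

def categorize(node):
    """Entity if a *-node, attribute if it only has a value, else connection."""
    if node.tag in STAR_NODES:
        return "entity"
    if node.value is not None and not node.children:
        return "attribute"
    return "connection"

def compact_output(node):
    """Return (attribute lines, expansion links) for an entity or connection node."""
    attributes, links = [], []
    for child in node.children:
        category = categorize(child)
        if category == "attribute":
            attributes.append(f"{child.tag}: {child.value}")
        elif category == "entity":
            # One link per group of child entities that share a tag.
            if child.tag not in links:
                links.append(child.tag)
        else:
            # One link per child connection node.
            links.append(child.tag)
    return attributes, links

team = Node("team", children=[
    Node("name", "Grizzlies"),
    Node("players", children=[Node("player"), Node("player")]),
])
print(team.tag, *compact_output(team))
print("players", *compact_output(team.children[1]))
```

For the “team” node the sketch yields the attribute “name: Grizzlies” and a single link to “players”; expanding that link yields one link for the whole group of “player” entities.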
4.3.2 MAX MATCH
each match node in M (as well as its value child, if any). The number of query results is denoted
2008b]
Definition 4.24 Delta Result Tree (δ) Let R be the set of query results of query Q on data
at node v in a query result tree r′ ∈ R′ is a delta result tree if desc-or-self(v, r′) ∩ R = ∅ and
desc-or-self(parent(v, r′), r′) ∩ R ≠ ∅, where parent(v, r′) and desc-or-self(v, r′) denote
result trees is denoted as δ(R, R′).
We show the four properties that a query (t, M) should satisfy, namely, data monotonicity,
query monotonicity, data consistency and query consistency.
Definition 4.25 Data Monotonicity and Data Consistency For a query Q and two XML
that on T, i.e., |R(Q, T)| ≤ |R(Q, T′)|.
• An algorithm satisfies data consistency if every delta result tree in δ(R(Q, T), R(Q, T′))
contains v. So there can be either 0 or 1 delta result tree.
Example 4.26 Consider query Q4 on T1 and T2, respectively. Ideally, R(Q4, T1) should contain
one query result rooted at 0.1.0 (player) with matches 0.1.0.0 (name) and 0.1.0.2.0 (forward). Then consider an insertion of a position node with its value forward that results in T2. Ideally, R(Q4, T2)
should contain one more query result: a subtree rooted at 0.1.2 (player) that matches 0.1.2.0 (name) and 0.1.2.2.0 (forward). Then it will satisfy both data monotonicity and data consistency, because
|R(Q4, T1)| = 1 and |R(Q4, T2)| = 2, and the delta result tree is the new result rooted at 0.1.2
(player), which contains the newly added node.
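The check in Example 4.26 can be reproduced with a short sketch. It reads a delta result tree as a maximal subtree of a new result none of whose nodes occurs in the old results, and it treats the root of a result (which has no parent inside that result) as a delta root whenever the whole result is new; both readings are assumptions of this sketch. Each result is represented as a set of Dewey labels, and the node sets below are modeled on the example rather than copied from the original figure.

```python
def desc_or_self(v, tree):
    """Nodes of the result tree that are v or descendants of v."""
    return {n for n in tree if n == v or n.startswith(v + ".")}

def parent(v):
    """Dewey label of the parent of v, or None for the document root."""
    return v.rsplit(".", 1)[0] if "." in v else None

def delta_roots(old_results, new_results):
    """Roots of the delta result trees between two sets of query results."""
    old_nodes = set().union(*old_results)
    roots = []
    for r in new_results:
        for v in sorted(r):
            if desc_or_self(v, r) & old_nodes:
                continue  # the subtree at v overlaps the old results
            p = parent(v)
            # Maximal: v is the result root, or the parent's subtree overlaps.
            if p is None or p not in r or desc_or_self(p, r) & old_nodes:
                roots.append(v)
    return roots

# Results of the example query on T1 and on T2 (node sets are illustrative).
r_gasol = {"0.1.0", "0.1.0.0", "0.1.0.2", "0.1.0.2.0"}
r_brown = {"0.1.2", "0.1.2.0", "0.1.2.2", "0.1.2.2.0"}
R_T1, R_T2 = [r_gasol], [r_gasol, r_brown]
inserted = "0.1.2.2"  # the position node added to obtain T2

roots = delta_roots(R_T1, R_T2)
monotonic = len(R_T1) <= len(R_T2)
consistent = all(inserted == v or inserted.startswith(v + ".") for v in roots)
print(roots, monotonic, consistent)
```

Under these assumptions the only delta root is 0.1.2, the number of results grows from 1 to 2, and the delta result tree contains the inserted node, in line with the example.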
Figure 4.8: Sample Queries and Results [Liu and Chen, 2008b]. Panels: (a) Results of Q1 and Q2; (b) Undesirable Results of Q2 on T1; (c) Results of Q3 on T1 and T2; (d) Sample Queries.
i.e., R(Q3, T1) and R(Q3, T2) each contain only one result, and the delta result tree is the subtree
rooted at 0.1.2.2 (position), which contains the newly added node.
Definition 4.27 Query Monotonicity and Query Consistency For two queries Q and Q′ and
that of Q, i.e., |R(Q, T)| ≥ |R(Q′, T)|.
contains at least one match to k.