Keyword Search in Databases - P21



Figure 4.5: Sample XML Document [Liu and Chen, 2008b]. The tree is rooted at 0 (team), with children 0.0 (name: Grizzlies) and 0.1 (players). The players node has three player children: 0.1.0 (name Gasol, nationality Spain, position forward), 0.1.1 (name Miller, nationality USA, position guard), and 0.1.2 (name Brown, nationality USA, position forward). T1 and T2 label the two versions of the document used in the examples.

(t, M) rooted at t, with nodes (M) corresponding to the matches that are considered relevant to Q. Every keyword in Q has at least one match in M.

Note that one query result should not be subsumed by another; therefore, the root nodes

and S1, ..., Sl. In the following, we mainly focus on identifying meaningful information based on (t, M).

4.3.1 XSEEK

XSeek [Liu and Chen, 2007; Liu et al., 2009b, 2007] is a system that represents the whole subtree

is likely that the user is interested in information about “Grizzlies.” But by the definition of SLCA, only the node 0.0.0 (Grizzlies) is returned, which is not informative. Ideally, the subtree rooted at

interested in information about the player whose name is “Gasol” and who is a “forward” in the team for Q2, and the user is interested in a particular piece of information: the “position” of “Gasol” for Q3. To process Q5, XSeek outputs the name of players and provides a link to its player children, which provides information about all the players in the team.


<!ELEMENT team (name, players)>

<!ELEMENT players (player*)>

Figure 4.7: Queries for XSeek

In order to find meaningful return nodes, XSeek analyzes both XML data structure and keyword match patterns. Three types of information are represented in XML data: entities in the real world, attributes of entities, and connection nodes. The input keywords are categorized into two types: the ones that specify search predicates, and the ones that indicate return information. Then, based on the data and keyword analysis, XSeek generates meaningful return nodes.

In order to differentiate the three types of information represented in XML data, XML schema information is needed; it is either provided or inferred from the data. An example schema fragment of the XML tree shown in Figure 4.5 is shown in Figure 4.6. For each XML node, it specifies the names of its sub-elements and attributes using regular expressions with operators

For example, “Element players (player*)” indicates that “players” can have zero or more “player” children, “Element player (name, nationality, position?)” indicates that a “player” should have one “name”, one “nationality”, and optionally a “position”, and “Element name (#PCDATA)” specifies that “name” has a value child. In the following, we refer to the nodes that can have siblings of the same name as *-nodes, as they are followed by “*” in the schema, e.g., the “player” node.

Analyzing XML Data Structure: Similar to the E-R model used in relational databases, XSeek

differentiates nodes in an XML tree into three categories.

• A node represents an entity if it corresponds to a *-node in the schema.

• A node denotes an attribute if it does not correspond to a *-node, and only has one child, which is a value.

• A node is a connection node if it represents neither an entity nor an attribute. A connection node can have a child that is an entity, an attribute, or another connection node.


For example, consider the schema shown in Figure 4.6, where “player” is a *-node, indicating a many-to-one relationship with its parent node “players”. It is inferred to be an entity, while “name”, “nationality”, and “position” are considered attributes of a “player” entity. Since “players” is not a *-node and does not have a value child, it is considered to be a connection node. Although the above inferences do not always hold, they provide heuristics in the absence of an E-R model. When the schema information is not available, it can be inferred based on data summarization [Yu and Jagadish, 2006].

Analyzing Keyword Match Patterns: The input keywords can be classified into two categories: search predicates, which correspond to the where clause in XQuery or SQL, and return nodes, which correspond to the return clause in XQuery or the select clause in SQL. They are inferred as follows:

• If a keyword k1 matches a node name (tag) u, and there does not exist a keyword k2 matching a node value v such that u is an ancestor of v, then k1 specifies a return node.

• A keyword that does not indicate a return node is treated as a predicate specification. In other words, if a keyword matches a node value, or it matches a node name (tag) that has a value descendant matching another keyword, then this keyword specifies a predicate.

since they match value nodes. In Q3, by contrast, “position” is inferred as a return node since it matches the name of two nodes, neither of which has any descendant value node matching the other keyword.
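The predicate/return-node rule can be illustrated with a small Python sketch over Dewey IDs. The two dictionaries below are a hypothetical encoding of T1 as far as it can be read from Figure 4.5 (Brown has no “position” in T1), the keyword list used for Q3 is inferred from the description of Q3 in the text, and exact string matching is a simplifying assumption.

```python
# Hypothetical encoding of T1: Dewey ID -> tag, and Dewey ID -> text value.
TAGS = {
    "0": "team", "0.0": "name", "0.1": "players",
    "0.1.0": "player", "0.1.0.0": "name",
    "0.1.0.1": "nationality", "0.1.0.2": "position",
    "0.1.1": "player", "0.1.1.0": "name",
    "0.1.1.1": "nationality", "0.1.1.2": "position",
    "0.1.2": "player", "0.1.2.0": "name", "0.1.2.1": "nationality",
}
VALUES = {
    "0.0.0": "Grizzlies",
    "0.1.0.0.0": "Gasol", "0.1.0.1.0": "Spain", "0.1.0.2.0": "forward",
    "0.1.1.0.0": "Miller", "0.1.1.1.0": "USA", "0.1.1.2.0": "guard",
    "0.1.2.0.0": "Brown", "0.1.2.1.0": "USA",
}


def is_ancestor(u, v):
    """True if Dewey ID u is a proper ancestor of Dewey ID v."""
    return v.startswith(u + ".")


def classify_keywords(keywords, tags=TAGS, values=VALUES):
    """Label each keyword as 'return' or 'predicate'."""
    roles = {}
    for k in keywords:
        name_matches = [u for u, tag in tags.items() if tag == k]
        others = [k2 for k2 in keywords if k2 != k]
        value_hits = [v for v, text in values.items() if text in others]
        # Return node: k matches a tag, and no matched tag has a descendant
        # value that matches another keyword.
        is_return = bool(name_matches) and not any(
            is_ancestor(u, v) for u in name_matches for v in value_hits)
        roles[k] = "return" if is_return else "predicate"
    return roles


print(classify_keywords(["Gasol", "position"]))
# {'Gasol': 'predicate', 'position': 'return'}
```

Here “position” matches the tags of 0.1.0.2 and 0.1.1.2, and neither has a descendant value equal to “Gasol”, so it is treated as a return node, while “Gasol” matches only a value and is treated as a predicate.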

Generating Search Results: XSeek generates a subtree for each (t, M) independently, where t = lca(M) and t ∈ slca(Q). Sometimes return nodes can be found by analyzing the keyword match patterns; otherwise, they are inferred implicitly by analyzing the XML data and the match M.

Definition 4.23 Master Entity. If an entity e is the lowest ancestor-or-self of the LCA node t of a match M, then e is named the master entity of match M. If such an e cannot be found, the root of the XML tree is considered as the master entity.
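With Dewey IDs, both the LCA of a match and its master entity reduce to prefix operations. The following is a minimal sketch under stated assumptions: nodes are identified by Dewey ID strings, the set of entity nodes is given up front (here the three “player” nodes of Figure 4.5), and the function names are hypothetical.

```python
def lca(dewey_ids):
    """Lowest common ancestor of Dewey IDs: their longest common prefix."""
    paths = [d.split(".") for d in dewey_ids]
    prefix = []
    for parts in zip(*paths):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def master_entity(match, entities, root="0"):
    """Lowest ancestor-or-self of lca(match) that is an entity, else the root."""
    path = lca(match).split(".")
    while path:
        candidate = ".".join(path)
        if candidate in entities:
            return candidate
        path.pop()  # move one level up
    return root


ENTITIES = {"0.1.0", "0.1.1", "0.1.2"}  # the "player" nodes

print(master_entity(["0.1.0.0.0", "0.1.0.2"], ENTITIES))  # 0.1.0
print(master_entity(["0.0.0"], ENTITIES))                 # 0
```

The first call corresponds to a match on Gasol's name value and his “position” node, whose LCA is the player node 0.1.0. The second corresponds to a match on “Grizzlies” alone, which has no entity ancestor, so the root is used.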

Based on the previous analysis, we can find the meaningful return information in two steps. First, output all the predicate matches. Second, output the return nodes based on the node category.

Output Predicate Matches: The predicate matches are output, so that the user can check

output as part of search results, indicating how the keywords are matched and connected to each other.

Output Return Nodes: The return nodes are output based on their node categories: entity, attribute, and connection node. If it is an attribute node, then its name and value child are output. The subtree rooted at an entity node or connection node is output compactly, by providing the most relevant information at the first stage, with expansion links for browsing more details. First, the name of this node and all the attribute children should be output. Then a link is generated to each group of child entities that have the same name (tag), and a link is generated to each child connection node.

name “team”, the names and values of its attributes are output, e.g., 0.0 (name) and 0.0.0 (Grizzlies).

An expansion link to its connection child 0.1 (players) is generated.
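The compact output of a return node can be sketched as follows. This is an illustrative sketch only: the `Node` class, the dictionary shape returned by `render`, and the representation of an expansion link as a bare tag name are assumptions made for the example, not XSeek's actual output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    category: str                 # "entity", "attribute", or "connection"
    value: Optional[str] = None   # value child, for attributes only
    children: List["Node"] = field(default_factory=list)


def render(node):
    """Compact first-stage output for a return node."""
    if node.category == "attribute":
        return {"name": node.tag, "value": node.value}
    # Entity or connection node: its name plus all attribute children.
    attributes = {c.tag: c.value for c in node.children
                  if c.category == "attribute"}
    # One link per group of same-named child entities,
    # and one link per child connection node.
    entity_links = sorted({c.tag for c in node.children
                           if c.category == "entity"})
    connection_links = [c.tag for c in node.children
                        if c.category == "connection"]
    return {"name": node.tag, "attributes": attributes,
            "links": entity_links + connection_links}


players = Node("players", "connection",
               children=[Node("player", "entity") for _ in range(3)])
team = Node("team", "connection",
            children=[Node("name", "attribute", "Grizzlies"), players])

print(render(team))
# {'name': 'team', 'attributes': {'name': 'Grizzlies'}, 'links': ['players']}
print(render(players))
# {'name': 'players', 'attributes': {}, 'links': ['player']}
```

For the “team” node this mirrors the example in the text: the attribute “name” with value “Grizzlies” is shown directly, and “players” appears only as an expansion link. The three “player” entities under “players” collapse into a single link.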

4.3.2 MAXMATCH

each match node in M (as well as its value child, if any). The number of query results is denoted

2008b]

Definition 4.24 Delta Result Tree (δ). Let R be the set of query results of query Q on data

at node v in a query result tree r′ ∈ R′ is a delta result tree if desc-or-self(v, r′) ∩ R = ∅ and desc-or-self(parent(v, r′), r′) ∩ R ≠ ∅, where parent(v, r′) and desc-or-self(v, r′) denote

result trees is denoted as δ(R, R′).

We show the four properties that a query (t, M) should satisfy, namely, data monotonicity,

query monotonicity, data consistency and query consistency.

Definition 4.25 Data Monotonicity and Data Consistency. For a query Q and two XML

that on T, i.e., |R(Q, T)| ≤ |R(Q, T′)|.

• An algorithm satisfies data consistency if every delta result tree in δ(R(Q, T), R(Q, T′)) contains v. So there can be either 0 or 1 delta result tree.

Example 4.26 Consider query Q4 on T1 and T2, respectively. Ideally, R(Q4, T1) should contain one query result rooted at 0.1.0 (player) with matches 0.1.0.0 (name) and 0.1.0.2.0 (forward). Then consider an insertion of a position node with its value forward that results in T2. Ideally, R(Q4, T2) should contain one more query result: a subtree rooted at 0.1.2 (player) that matches 0.1.2.0 (name) and 0.1.2.2.0 (forward). Then it will satisfy both data monotonicity and data consistency, because |R(Q4, T1)| = 1 and |R(Q4, T2)| = 2, and the delta result tree is the new result rooted at 0.1.2 (player), which contains the newly added node.
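The check in Example 4.26 can be reproduced with a short Python sketch. Several assumptions are made here and should be read as such: each result tree is represented as a set of Dewey IDs, a delta result tree is taken to be a maximal subtree of a new result that shares no node with the old results, and a brand-new result (whose root has no parent inside the result) counts as a delta result tree as a whole, which agrees with the example. All function names are hypothetical.

```python
def parent(v):
    """Dewey ID of the parent of v, or None for a top-level ID."""
    return v.rsplit(".", 1)[0] if "." in v else None


def subtree(v, r):
    """Nodes of result tree r that are v or descendants of v."""
    return {n for n in r if n == v or n.startswith(v + ".")}


def delta_result_trees(old_results, new_results):
    """Maximal subtrees of new results that share no node with old results."""
    old_nodes = set().union(*old_results)
    deltas = []
    for r in new_results:
        for v in sorted(r):
            if subtree(v, r) & old_nodes:
                continue  # v's subtree overlaps the old results
            p = parent(v)
            # Keep v only if it is maximal: it is the root of r, or its
            # parent's subtree does overlap the old results.
            if p not in r or subtree(p, r) & old_nodes:
                deltas.append(subtree(v, r))
    return deltas


def data_monotonic(old_results, new_results):
    return len(old_results) <= len(new_results)


def data_consistent(old_results, new_results, new_node):
    return all(new_node in d
               for d in delta_result_trees(old_results, new_results))


# Result trees from Example 4.26, written as sets of Dewey IDs.
r_gasol = {"0.1.0", "0.1.0.0", "0.1.0.2", "0.1.0.2.0"}
r_brown = {"0.1.2", "0.1.2.0", "0.1.2.2", "0.1.2.2.0"}
R_T1 = [r_gasol]
R_T2 = [r_gasol, r_brown]

print(data_monotonic(R_T1, R_T2))              # True
print(delta_result_trees(R_T1, R_T2) == [r_brown])  # True
print(data_consistent(R_T1, R_T2, "0.1.2.2"))  # True
```

The single delta result tree is the new result rooted at 0.1.2, and it contains the inserted position node 0.1.2.2, so both properties hold for this pair of result sets.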


Figure 4.8: Sample Queries and Results [Liu and Chen, 2008b]. Panels: (a) Results of Q1 and Q2; (b) Undesirable Results of Q2 on T1; (c) Results of Q3 on T1 and T2; (d) Sample Queries.

i.e., R(Q3, T1) and R(Q3, T2) each contain only one result, and the delta result tree is the subtree rooted at 0.1.2.2 (position), which contains the newly added node.

Definition 4.27 Query Monotonicity and Query Consistency. For two queries Q and Q′ and

that of Q, i.e., |R(Q, T)| ≥ |R(Q′, T)|.

contains at least one match to k.
