CHAPTER 5
Other Topics for Keyword
Search on Databases
In this chapter, we discuss several interesting research issues regarding keyword search on databases. In Section 5.1, we discuss approaches proposed to select a subset of relational databases (RDBs) from among many to answer a keyword query. In Section 5.2, we discuss keyword search in a spatial database. In Section 5.3, we introduce ObjectRank, a PageRank-based approach for RDBs, and an approach that projects a database so that it contains only the tuples related to a keyword query.
There are two main issues to be considered in keyword search across multiple databases:
1. When the number of databases is large, a proper subset of databases that are most suitable to answer a keyword query needs to be selected. This is the problem of keyword-based selection of the top-k databases, and it is studied in M-KS [Yu et al., 2007] and G-KS [Vu et al., 2008].
2. The keyword query needs to be executed across the databases that are selected. This problem is studied in Kite [Sayyadian et al., 2007].
In order to rank a set of databases $\mathcal{D} = \{D_1, D_2, \cdots\}$ according to their suitability to answer a certain keyword query $Q$, a score function $score(D, Q)$ is defined for each database $D \in \mathcal{D}$. In the ideal case, if the keyword query is evaluated in each database individually, the best database to answer the query is the one that can generate high-quality results. Suppose $\mathcal{T} = \{T_1, T_2, \cdots\}$ is the set of results (MTJNTs, see Chapter 2) for query $Q$ over database $D$. The following equation can be used to score database $D$:
$$score(D, Q) = \sum_{T \in \mathcal{T}} score(T, Q)$$
where $score(T, Q)$ can be any scoring function for the MTJNT $T$ as discussed in Chapter 2.
In practice, it is inefficient to evaluate $Q$ on every database $D \in \mathcal{D}$. A straightforward way to solve the problem efficiently is to calculate the keyword statistics for each $k_i \in Q$ on each database $D \in \mathcal{D}$ and summarize the statistics as a score reflecting the relevance of $Q$ to $D$. There are two drawbacks to this solution. First, the keyword statistics cannot reveal the importance of a keyword to a database. For example, a term in a primary key attribute of a table may be referred to by a large number of foreign keys. Such a term may be very important in answering the keyword query, but its frequency in the database can be very low. Second, two different keywords can be connected through a sequence of foreign key references in a relational database. The length and number of such connections may largely reveal the capability of the database to answer a certain keyword query. The statistics of single keywords cannot capture such relationships between keywords, and thus they may lead to choosing a database that has high keyword frequency but does not generate any MTJNT.
Suppose the keyword space is $\mathcal{K} = \{w_1, w_2, \cdots, w_s\}$. For each database $D \in \mathcal{D}$, we can construct a keyword relationship matrix (KRM) $R = (R_{i,j})_{s \times s}$, which is an $s$ by $s$ matrix where each element is defined as follows:
$$R_{i,j} = \sum_{d=0}^{\delta} \varphi_d \cdot \omega_d(w_i, w_j)$$
Here, $\omega_d(w_i, w_j)$ is the number of joining sequences of length $d$: $t_0 \bowtie t_1 \bowtie \cdots \bowtie t_d$, where each $t_i \in D$ ($0 \le i \le d$) is a tuple, $t_0$ contains keyword $w_i$, and $t_d$ contains keyword $w_j$. $\delta$ is a parameter to control the maximum length of the joining sequences, because a connection is meaningless if the two tuples $t_0$ and $t_d$ are too far away from each other. $\varphi_d$ is a function of $d$ that measures the importance of a joining sequence of length $d$; it can be specified based on different requirements.
One such specification is
$$\varphi_d = \frac{1}{d+1}$$
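In M-KS [Yu et al., 2007], the value $\omega_d(w_i, w_j)$ increases exponentially with respect to $d$, so another control parameter $M$ is set such that if $\sum_{d=0}^{\delta} \omega_d(w_i, w_j) > M$, the value of $R_{i,j}$ is changed to:
$$R_{i,j} = \sum_{d=0}^{\delta'-1} \varphi_d \cdot \omega_d(w_i, w_j) + \varphi_{\delta'} \cdot \Big(M - \sum_{d=0}^{\delta'-1} \omega_d(w_i, w_j)\Big)$$
where $\delta'$ is a value such that $\delta' \le \delta$, $\sum_{d=0}^{\delta'} \omega_d(w_i, w_j) \ge M$, and $\sum_{d=0}^{\delta'-1} \omega_d(w_i, w_j) < M$, i.e.,
$$\delta' = \min\Big\{\delta_p \;\Big|\; \sum_{d=0}^{\delta_p} \omega_d(w_i, w_j) \ge M\Big\}.$$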
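The capped computation of a single KRM entry can be illustrated with a minimal Python sketch. It is not taken from M-KS itself: the function names `phi` and `krm_entry`, the list-based input `omega` (where `omega[d]` is the number of joining sequences of length d), and the default weight 1/(d+1) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def phi(d: int) -> float:
    """Assumed importance weight of a joining sequence of length d."""
    return 1.0 / (d + 1)


def krm_entry(omega: Sequence[int], M: int,
              weight: Callable[[int], float] = phi) -> float:
    """Compute one KRM entry R[i][j].

    omega[d] is the number of joining sequences of length d
    (d = 0..delta) that connect keywords w_i and w_j.
    """
    if sum(omega) <= M:
        # The cap is not exceeded: plain weighted sum over all lengths.
        return sum(weight(d) * n for d, n in enumerate(omega))

    # The cap is exceeded: count sequences in order of increasing length
    # until M of them have been counted in total.
    score = 0.0
    seen = 0
    for d, n in enumerate(omega):
        if seen + n >= M:
            # d plays the role of delta': only M - seen sequences of
            # this length still fit under the cap.
            return score + weight(d) * (M - seen)
        score += weight(d) * n
        seen += n
    return score  # not reached when sum(omega) > M
```

For instance, with `omega = [1, 2, 4]` and `M = 5`, the sequences of lengths 0 and 1 are counted in full, and only two of the four sequences of length 2 contribute.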
Given the KRM of database $D$ and a keyword query $Q$, $score(D, Q)$ can be calculated as follows:
$$score(D, Q) = \sum_{w_i \in Q,\, w_j \in Q,\, i<j} R_{i,j}$$
In place of summation, it is possible to use other aggregate functions, such as min, max, or product, according to different requirements.
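As a rough illustration of this pairwise aggregation, the following Python sketch scores a database from a precomputed KRM. The function name `score_database`, the nested-list matrix layout, and the `index` dictionary mapping each keyword to its row and column are hypothetical choices, and the sketch assumes every query keyword belongs to the keyword space.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Sequence


def score_database(R: Sequence[Sequence[float]],
                   index: Dict[str, int],
                   query: Iterable[str],
                   aggregate: Callable[[List[float]], float] = sum) -> float:
    """Aggregate the KRM entries of all keyword pairs (i < j) in the query.

    R is the s-by-s keyword relationship matrix of one database and
    index maps each keyword to its position in R. A query keyword that
    is missing from index raises KeyError.
    """
    positions = sorted({index[w] for w in query})
    pair_values = [R[i][j] for i, j in combinations(positions, 2)]
    # A query with fewer than two distinct keywords has no pairs to score.
    return aggregate(pair_values) if pair_values else 0.0
```

Passing `min` or `max` as `aggregate` switches from summation to the corresponding alternative aggregate.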
Figure 5.1: KRG and one of its JKT for query $Q = \{w_1, w_2, w_3, w_4\}$

A number of drawbacks of KRM have been identified [Vu et al., 2008]. First, KRM only considers the pairwise relationships between keywords in a query, and this may generate many false positives because each real result (MTJNT) connects all keywords in the shape of a tree rather than a pairwise graph. Second, considering only the connections between keywords in a relational database is not enough to rank databases; it is also important to integrate an IR-style score into the scoring function. These drawbacks can be addressed as follows.
Suppose the keyword space is $\mathcal{K} = \{w_1, w_2, \cdots\}$. For each database $D \in \mathcal{D}$, a keyword relationship graph (KRG), $G(V, E)$, can be constructed as follows. For each keyword $w_i \in \mathcal{K}$, there is a node $w_i \in V(G)$, and for every two keywords $w_i \in \mathcal{K}$ and $w_j \in \mathcal{K}$, if $w_i$ and $w_j$ can be connected through at least one joining sequence of tuples in $D$, then an edge $(w_i, w_j) \in E(G)$ is added. Each edge $(w_i, w_j) \in E(G)$ is assigned a set of weights. More precisely, whenever there is a joining sequence of tuples of length $d$ that connects $w_i$ and $w_j$ at its two ends, the weight $d$ is added to the edge $(w_i, w_j)$ in $G$.
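The construction can be sketched in Python under a simplifying assumption: the joining sequences have already been enumerated and summarized as triples (w_i, w_j, d). That input format, the function name `build_krg`, and the dictionary representation of the graph are illustrative choices, not part of G-KS.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def build_krg(connections: Iterable[Tuple[str, str, int]]) -> Dict[Edge, Set[int]]:
    """Build a KRG as a map from an undirected keyword pair to its weight set.

    Each input triple (w_i, w_j, d) states that some joining sequence of
    length d in the database connects keywords w_i and w_j.
    """
    graph: Dict[Edge, Set[int]] = defaultdict(set)
    for w_i, w_j, d in connections:
        if w_i == w_j:
            continue  # edges relate two distinct keywords
        # frozenset makes (w_i, w_j) and (w_j, w_i) the same edge.
        graph[frozenset((w_i, w_j))].add(d)
    return dict(graph)
```

An edge exists exactly when its weight set is non-empty, so a keyword pair with no connecting joining sequence simply has no entry in the dictionary.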
Given the KRG $G$ for database $D$ and a keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, a Join Keyword Tree (JKT) is a tree that satisfies the following conditions.
• Each node in the tree contains at least one keyword.
• The tree contains all the keywords (total), and there exist no subtrees that contain all the keywords (minimal).
• Each edge of the tree has a positive integer weight, and the total weight of all edges in the tree is smaller than $T_{max}$.¹
• For any two keywords $w_i$ and $w_j$ contained in nodes $v_1$ and $v_2$, respectively, suppose the distance (total weight of edges) between $v_1$ and $v_2$ in the tree is $d$; then there exists an edge $(w_i, w_j)$ in $G$ whose weight is $d$.

¹ $T_{max}$ is the maximum number of nodes allowed in a tree.
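The four conditions above can be checked mechanically for a candidate tree. The Python sketch below is one possible reading, not the G-KS procedure: the representation (a list of keyword sets for nodes, index triples for weighted edges, and a dictionary from keyword pairs to weight sets for the KRG) and the name `is_jkt` are assumptions. Minimality is tested by checking that no leaf can be dropped without losing a query keyword.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def is_jkt(nodes: Sequence[Set[str]],
           edges: Sequence[Tuple[int, int, int]],
           krg: Dict[FrozenSet[str], Set[int]],
           query: Set[str],
           t_max: int) -> bool:
    """Return True if the candidate tree satisfies the JKT conditions.

    nodes[v] is the keyword set of node v; each edge is (u, v, weight)
    with u and v indexing into nodes.
    """
    n = len(nodes)
    if n == 0 or len(edges) != n - 1:
        return False  # a tree on n nodes has exactly n - 1 edges
    if any(not keywords for keywords in nodes):
        return False  # condition 1: no empty nodes
    weights = [w for _, _, w in edges]
    if any(w <= 0 for w in weights) or sum(weights) >= t_max:
        return False  # condition 3: positive weights, total below t_max

    adj: Dict[int, List[Tuple[int, int]]] = {v: [] for v in range(n)}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(src: int) -> Dict[int, int]:
        # Depth-first traversal; in a tree each path is unique.
        dist = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    dist = [distances_from(v) for v in range(n)]
    if len(dist[0]) != n:
        return False  # not connected, hence not a tree

    if not query <= set().union(*nodes):
        return False  # condition 2 (total): every query keyword is covered
    for leaf in range(n):
        if n > 1 and len(adj[leaf]) == 1:
            rest = set().union(*(nodes[v] for v in range(n) if v != leaf))
            if query <= rest:
                return False  # condition 2 (minimal): this leaf is redundant

    # Condition 4: every keyword pair at tree distance d needs weight d
    # on the corresponding KRG edge.
    for u in range(n):
        for v in range(u, n):
            for a in nodes[u]:
                for b in nodes[v]:
                    if a != b and dist[u][v] not in krg.get(frozenset((a, b)), set()):
                        return False
    return True
```

The sketch only validates a given tree; searching for a JKT in a KRG is a separate problem that it does not address.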
An example of a KRG is shown in Figure 5.1(a), where there are five keywords $w_1, w_2, \cdots, w_5$. For edge $(w_1, w_5)$, the two keywords are contained in a certain tuple in database $D$, so we add weight 0. There also exists a joining sequence of length 2 that connects $w_1$ and $w_5$ at the two ends, so we add weight 2. A JKT of the KRG is shown in Figure 5.1(b). For the two keywords $w_1$ and $w_2$ in the JKT, their distance in the tree is 0 because they are contained in the same node. The edge $(w_2, w_3)$ of the KRG has weight 0. The distance between the two keywords $w_2$ and $w_4$ is 3, and we can also find that edge $(w_2, w_4)$ has weight 3. Given a database $D$ and its KRG $G$, we have the following theorem.
Theorem 5.1 Given a keyword query $Q$, for a database $D \in \mathcal{D}$, if its KRG does not contain a JKT for $Q$, the results (MTJNTs) for the keyword query $Q$ over the database $D$ will be empty.
Theorem 5.1 can be used to prune databases that do not contain a JKT for the keyword query $Q$. For the other databases, a new scoring function is defined in order to rank them. The scoring function considers both the IR ranking score and the structural score (distances between keywords). The score consists of two parts, namely the node score and the edge score. For a database $D$ that is not pruned and for the keyword space $\mathcal{K}$, the node score and the edge score are as follows:
• The node score: The score of each keyword $w_i \in \mathcal{K}$ is
$$score(D, w_i) = \frac{\sum_{t \in D \text{ and } t \text{ contains } w_i} score(t, D, w_i)}{N(D, w_i)}$$
where $N(D, w_i)$ is the number of tuples in $D$ that contain keyword $w_i$, and the score for each tuple $t$ with respect to $w_i$, $score(t, D, w_i)$, is defined as follows:
$$score(t, D, w_i) = \frac{tf(t, w_i)}{\sum_{w \in t} tf(t, w)} \cdot \ln \frac{N(D)}{N(D, w_i) + 1} \tag{5.7}$$
where $tf(t, w_i)$ is the term frequency of $w_i$ in the tuple $t$, and $N(D)$ is the total number of tuples in $D$.
• The edge score: For any two keywords $w_i \in \mathcal{K}$ and $w_j \in \mathcal{K}$, the edge score $score(D, w_i, w_j)$ is defined as follows:
$$score(D, w_i, w_j) = \sum_{d=1}^{\delta} score_d(D, w_i, w_j) \tag{5.8}$$
Here, $\delta$ is a parameter to control the maximum distance between two keywords, and
$$score_d(D, w_i, w_j) = \sum_{(t, t') \in P_d(w_i, w_j, D)} tf(t, w_i) \cdot tf(t', w_j) \cdot \ln \frac{N_d(D)}{N_d(w_i, w_j, D) + 1}$$