Keyword Search in Databases - P24


CHAPTER 5

Other Topics for Keyword Search on Databases

In this chapter, we discuss several interesting research issues regarding keyword search on databases. In Section 5.1, we discuss approaches that select a subset of relational databases (RDBs), among many, to answer a keyword query. In Section 5.2, we discuss keyword search in a spatial database. In Section 5.3, we introduce a PageRank-based approach for RDBs called ObjectRank, and an approach that projects a database so that it only contains tuples relating to a keyword query.

There are two main issues to be considered in keyword search across multiple databases:

1. When the number of databases is large, a proper subset of the databases that are most suitable to answer a keyword query needs to be selected. This is the problem of keyword-based selection of the top-k databases, and it is studied in M-KS [Yu et al., 2007] and G-KS [Vu et al., 2008].

2. The keyword query needs to be executed across the databases that are selected. This problem is studied in Kite [Sayyadian et al., 2007].

In order to rank a set of databases $\mathcal{D} = \{D_1, D_2, \cdots\}$ according to their suitability to answer a certain keyword query $Q$, a score function $score(D, Q)$ is defined for each database $D \in \mathcal{D}$. In the ideal case, if the keyword query is evaluated in each database individually, the best database to answer the query is the one that can generate high-quality results. Suppose $\mathcal{T} = \{T_1, T_2, \cdots\}$ is the set of results (MTJNTs, see Chapter 2) for query $Q$ over database $D$. The following equation can be used to score database $D$:

$$score(D, Q) = \sum_{T \in \mathcal{T}} score(T, Q)$$

where $score(T, Q)$ can be any scoring function for the MTJNT $T$ as discussed in Chapter 2.

In practice, it is inefficient to evaluate $Q$ on every database $D \in \mathcal{D}$. A straightforward way to solve the problem efficiently is to calculate the keyword statistics for each $k_i \in Q$ on each database $D \in \mathcal{D}$ and summarize the statistics as a score reflecting the relevance of $Q$ to $D$. There are two drawbacks to this solution. First, the keyword statistics cannot reveal the importance of a keyword to a database. For example, a term in a primary key attribute of a table may be referred to by a large number of foreign keys. Such a term may be very important in answering the keyword query, but its frequency in the database can be very low. Second, two different keywords can be connected through a sequence of foreign key references in a relational database. The length and number of such connections largely reveal the capability of the database to answer a certain keyword query. The statistics of single keywords cannot capture such relationships between keywords, so they may lead to choosing a database that has a high keyword frequency but cannot generate any MTJNT.

Suppose the keyword space is $\mathcal{K} = \{w_1, w_2, \cdots, w_s\}$. For each database $D \in \mathcal{D}$, we can construct a keyword relationship matrix (KRM) $R = (R_{i,j})_{s \times s}$, which is an $s$ by $s$ matrix where each element is defined as follows:

$$R_{i,j} = \sum_{d=0}^{\delta} \varphi_d \cdot \omega_d(w_i, w_j)$$

Here, $\omega_d(w_i, w_j)$ is the number of joining sequences of length $d$, $t_0 \bowtie t_1 \bowtie \cdots \bowtie t_d$, where each $t_i \in D$ ($1 \le i \le d$) is a tuple, $t_0$ contains keyword $w_i$, and $t_d$ contains keyword $w_j$. $\delta$ is a parameter that controls the maximum length of the joining sequences, because a connection is meaningless if the two tuples $t_0$ and $t_d$ are too far away from each other. $\varphi_d$ is a function of $d$ that measures the importance of joining sequences of length $d$; it can be specified based on different requirements. One choice is

$$\varphi_d = \frac{1}{d + 1}$$
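The definition of a KRM entry can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the text: tuples are modelled as nodes of an undirected graph whose edges are foreign key joins, $\omega_d$ is read as the number of tuple pairs whose shortest join distance is exactly $d$ (one possible reading of "joining sequences of length $d$"), and the names `omega_counts`, `krm_entry`, `tuples`, and `joins` are hypothetical.

```python
from collections import deque


def omega_counts(tuples, joins, wi, wj, delta):
    """Return [omega_0, ..., omega_delta] for the keyword pair (wi, wj).

    tuples: dict mapping a tuple id to the set of keywords it contains.
    joins:  dict mapping a tuple id to the tuple ids it joins with
            (assumed undirected foreign key links).
    Assumption: omega_d counts pairs (t0, td) whose shortest join
    distance is exactly d, found by a breadth-first search from t0.
    """
    counts = [0] * (delta + 1)
    for start, keywords in tuples.items():
        if wi not in keywords:
            continue
        dist = {start: 0}
        queue = deque([start])
        while queue:
            t = queue.popleft()
            if wj in tuples[t]:
                counts[dist[t]] += 1
            if dist[t] == delta:
                continue  # do not expand beyond the maximum length
            for nxt in joins.get(t, ()):
                if nxt not in dist:
                    dist[nxt] = dist[t] + 1
                    queue.append(nxt)
    return counts


def krm_entry(tuples, joins, wi, wj, delta, phi=lambda d: 1.0 / (d + 1)):
    """R_ij = sum over d of phi(d) * omega_d(wi, wj)."""
    counts = omega_counts(tuples, joins, wi, wj, delta)
    return sum(phi(d) * c for d, c in enumerate(counts))


# Toy database: an author tuple joined to a paper tuple through a
# "writes" tuple, plus one paper that contains both keywords itself.
tuples = {"a1": {"smith"}, "w1": set(), "p1": {"xml"}, "p2": {"smith", "xml"}}
joins = {"a1": ["w1"], "w1": ["a1", "p1"], "p1": ["w1"]}

print(omega_counts(tuples, joins, "smith", "xml", delta=2))  # [1, 0, 1]
print(krm_entry(tuples, joins, "smith", "xml", delta=2))
```

With the weighting $\varphi_d = 1/(d+1)$, the pair found at distance 0 contributes 1 and the pair at distance 2 contributes 1/3, so close connections dominate the entry.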

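In M-KS [Yu et al., 2007], the value $\omega_d(w_i, w_j)$ can increase exponentially with respect to $d$, so another control parameter $M$ is set such that if $\sum_{d=0}^{\delta} \omega_d(w_i, w_j) > M$, the value of $R_{i,j}$ is changed to:

$$R_{i,j} = \sum_{d=0}^{\delta' - 1} \varphi_d \cdot \omega_d(w_i, w_j) + \varphi_{\delta'} \cdot \Bigl(M - \sum_{d=0}^{\delta' - 1} \omega_d(w_i, w_j)\Bigr)$$

where $\delta'$ is a value such that $\delta' \le \delta$, $\sum_{d=0}^{\delta'} \omega_d(w_i, w_j) \ge M$, and $\sum_{d=0}^{\delta' - 1} \omega_d(w_i, w_j) < M$, i.e.,

$$\delta' = \min\Bigl\{\delta_p \Bigm| \sum_{d=0}^{\delta_p} \omega_d(w_i, w_j) \ge M\Bigr\}.$$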
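The effect of the control parameter $M$ can be sketched in Python as follows. This is a sketch under stated assumptions: the per-distance counts $\omega_0, \ldots, \omega_\delta$ are given as a list, $\varphi_d = 1/(d+1)$ is used only as an example weighting, and the function name `capped_krm_entry` is hypothetical.

```python
def capped_krm_entry(omega, M, phi=lambda d: 1.0 / (d + 1)):
    """Compute R_ij with the number of counted joining sequences capped at M.

    omega: list [omega_0, ..., omega_delta] of per-distance counts.
    The loop stops at the smallest distance d (delta' in the text) where
    the running total reaches M, and only the remaining M - total
    sequences are counted at that distance.
    """
    total = 0      # joining sequences counted so far
    score = 0.0
    for d, count in enumerate(omega):
        if total + count >= M:
            return score + phi(d) * (M - total)
        total += count
        score += phi(d) * count
    return score   # the total never reached M: plain weighted sum


# With omega = [1, 3, 10] and M = 5, the cap is reached at distance 2,
# so only one of the ten distance-2 sequences is counted.
print(capped_krm_entry([1, 3, 10], M=5))
# With a large M the cap is never reached.
print(capped_krm_entry([1, 3, 10], M=100))
```

When the total number of joining sequences equals $M$ exactly, the capped and uncapped formulas give the same value, so the single loop covers both cases.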

Given the KRM of database $D$ and a keyword query $Q$, $score(D, Q)$ can be calculated as follows:

$$score(D, Q) = \sum_{w_i \in Q,\, w_j \in Q,\, i < j} R_{i,j}$$

In place of summation, it is possible to use the aggregate functions min, max, or product, according to different requirements.
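The KRM-based database score, including the choice of aggregate, can be sketched in Python as follows. The representation is an assumption made for illustration: each KRM is a dictionary keyed by an unordered keyword pair (which presumes $R_{i,j} = R_{j,i}$), missing pairs count as 0, and the names `database_score` and `krm_by_db` are hypothetical.

```python
from itertools import combinations
from math import prod


def database_score(R, query, aggregate=sum):
    """Aggregate R_ij over all unordered keyword pairs of the query.

    R: dict mapping frozenset({wi, wj}) to the KRM entry for that pair.
    aggregate: sum (the default), min, max, or math.prod.
    """
    keywords = sorted(set(query))
    values = [R.get(frozenset(pair), 0.0) for pair in combinations(keywords, 2)]
    return aggregate(values) if values else 0.0


# Two toy databases with precomputed KRM entries for keywords a, b, c.
krm_by_db = {
    "D1": {frozenset("ab"): 2.0, frozenset("ac"): 0.5, frozenset("bc"): 0.0},
    "D2": {frozenset("ab"): 1.0, frozenset("ac"): 1.0, frozenset("bc"): 1.0},
}
query = ["a", "b", "c"]

# Rank databases by descending score.
ranked = sorted(
    krm_by_db,
    key=lambda name: database_score(krm_by_db[name], query),
    reverse=True,
)
print(ranked)  # ['D2', 'D1']
print(database_score(krm_by_db["D1"], query, aggregate=prod))  # 0.0
```

Using `min` or `prod` penalizes a database in which some pair of query keywords is not connected at all, whereas `sum` lets one strongly connected pair compensate for a missing one.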

A number of drawbacks of KRM have been identified [Vu et al., 2008]. First, KRM only considers the pairwise relationships between keywords in a query, and this may generate many false positives, because each real result (MTJNT) connects all the keywords in the shape of a tree rather than a pairwise graph. Second, considering only the connections between keywords in a relational database is not enough to rank databases; it is also important to integrate an IR-style score into the scoring function. These drawbacks can be addressed as follows.

Figure 5.1: KRG and one of its JKT for query $Q = \{w_1, w_2, w_3, w_4\}$

Suppose the keyword space is $\mathcal{K} = \{w_1, w_2, \cdots\}$. For each database $D \in \mathcal{D}$, a keyword relationship graph (KRG) $G(V, E)$ can be constructed, where, for each keyword $w_i \in \mathcal{K}$, there is a node $w_i \in V(G)$, and for every two keywords $w_i$ and $w_j$, if $w_i$ and $w_j$ can be connected through at least one joining sequence of tuples in $D$, then an edge $(w_i, w_j) \in E(G)$ is added. For each edge $(w_i, w_j) \in E(G)$, a set of weights is assigned. More precisely, whenever there is a joining sequence of tuples of length $d$ that connects $w_i$ and $w_j$ at its two ends, the weight $d$ is added to the edge $(w_i, w_j)$ in $G$.
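The assignment of edge weights can be sketched in Python as follows, assuming the number of joining sequences of each length is already known for every keyword pair. The input dictionary `omega_by_pair` and the function name `build_krg` are hypothetical, introduced only for this illustration.

```python
def build_krg(omega_by_pair):
    """Build the KRG edge set with its weight sets.

    omega_by_pair: dict mapping frozenset({wi, wj}) to a list whose
    d-th element is the number of joining sequences of length d that
    connect wi and wj.
    Returns a dict mapping each connected pair to the set of distances d
    for which at least one joining sequence exists. Pairs with no
    connection get no edge.
    """
    krg = {}
    for pair, counts in omega_by_pair.items():
        weights = {d for d, count in enumerate(counts) if count > 0}
        if weights:
            krg[pair] = weights
    return krg


# The keywords w1 and w5 co-occur in one tuple (distance 0) and are also
# connected by a joining sequence of length 2; w1 and w3 are not connected.
krg = build_krg({
    frozenset({"w1", "w5"}): [1, 0, 1],
    frozenset({"w1", "w3"}): [0, 0, 0],
})
print(krg[frozenset({"w1", "w5"})])  # {0, 2}
```

Only the existence of a connection at each distance is kept, not how many connections there are, which keeps the graph compact.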

Given the KRG $G$ for database $D$ and a keyword query $Q = \{k_1, k_2, \cdots, k_l\}$, a Join Keyword Tree (JKT) is a tree that satisfies the following conditions.

• Each node in the tree contains at least one keyword.

• The tree contains all the keywords (total), and no proper subtree contains all the keywords (minimal).

• Each edge of the tree has a positive integer weight, and the total weight of all edges in the tree is smaller than $T_{max}$.¹

• For any two keywords $w_i$ and $w_j$ contained in nodes $v_1$ and $v_2$, respectively, if the distance (total weight of edges) between $v_1$ and $v_2$ in the tree is $d$, then there exists an edge $(w_i, w_j)$ in $G$ whose weight is $d$.

¹ $T_{max}$ is the maximum number of nodes allowed in a tree.
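The four conditions above can be checked for a candidate tree with a short Python sketch. It assumes a particular representation that is not in the text: the KRG is a dictionary from unordered keyword pairs to sets of weights, the tree is given as a node dictionary plus a weighted edge list, and nodes hold query keywords only. The name `is_jkt` is hypothetical, and the sketch validates one given tree; it does not search for a JKT.

```python
def is_jkt(krg, query, nodes, edges, t_max):
    """Check whether a candidate tree is a JKT of the KRG for the query.

    krg:   dict mapping frozenset({wi, wj}) to a set of integer weights.
    nodes: dict mapping a node id to the set of query keywords it contains.
    edges: list of (u, v, weight) triples between node ids.
    """
    query = set(query)
    # A tree on n nodes has n - 1 edges (connectivity is checked below).
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    # Condition 1: every node contains at least one (query) keyword.
    if any(not kws or not kws <= query for kws in nodes.values()):
        return False
    # Condition 3: positive integer weights, total weight below t_max.
    if any(not isinstance(w, int) or w <= 0 for _, _, w in edges):
        return False
    if sum(w for _, _, w in edges) >= t_max:
        return False

    adj = {v: [] for v in nodes}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def covers(node_ids):
        return set().union(*(nodes[v] for v in node_ids)) >= query

    # Condition 2: total, and minimal (no leaf can be dropped).
    if not covers(nodes):
        return False
    for v in nodes:
        is_leaf = len(adj[v]) <= 1
        if is_leaf and len(nodes) > 1 and covers(set(nodes) - {v}):
            return False

    # Condition 4: every tree distance between two keywords must appear
    # as a weight on the corresponding KRG edge.
    for src in nodes:
        dist = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        if len(dist) != len(nodes):
            return False  # not connected, hence not a tree
        for dst, d in dist.items():
            for wi in nodes[src]:
                for wj in nodes[dst]:
                    if wi != wj and d not in krg.get(frozenset((wi, wj)), ()):
                        return False
    return True


krg = {
    frozenset({"w1", "w2"}): {0, 2},
    frozenset({"w1", "w3"}): {1},
    frozenset({"w2", "w3"}): {1, 3},
}
nodes = {"A": {"w1", "w2"}, "B": {"w3"}}
print(is_jkt(krg, ["w1", "w2", "w3"], nodes, [("A", "B", 1)], t_max=4))  # True
print(is_jkt(krg, ["w1", "w2", "w3"], nodes, [("A", "B", 2)], t_max=4))  # False
```

In the second call the tree distance between $w_1$ and $w_3$ is 2, but the KRG edge $(w_1, w_3)$ only carries weight 1, so the fourth condition fails.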


An example of a KRG is shown in Figure 5.1(a), where there are five keywords $w_1, w_2, \cdots, w_5$. For edge $(w_1, w_5)$, the two keywords are contained in a certain tuple in database $D$, so we add weight 0. There also exists a joining sequence of length 2 that connects $w_1$ and $w_5$ at the two ends, so we add weight 2. A JKT of the KRG is shown in Figure 5.1(b). For the two keywords $w_1$ and $w_2$ in the JKT, their distance in the tree is 0 because they are contained in the same node. The edge $(w_2, w_3)$ of the KRG has weight 0. The distance between the two keywords $w_2$ and $w_4$ is 3, and we can also find that edge $(w_2, w_4)$ has weight 3. Given a database $D$ and its KRG $G$, we have the following theorem.

Theorem 5.1 Given a keyword query $Q$, for a database $D \in \mathcal{D}$, if its KRG does not contain a JKT for $Q$, the results (MTJNTs) for the keyword query $Q$ over the database $D$ will be empty.

Using Theorem 5.1, databases whose KRG does not contain a JKT for the keyword query $Q$ can be pruned. For the other databases, a new scoring function is defined in order to rank them. The scoring function considers both the IR ranking score and the structural score (distances between keywords). The score consists of two parts, namely the node score and the edge score. For a database $D$ that is not pruned and for the keyword space $\mathcal{K}$, the node score and the edge score are as follows:

• The node score: The score of each keyword $w_i \in \mathcal{K}$ is

$$score(D, w_i) = \frac{\sum_{t \in D \text{ and } t \text{ contains } w_i} score(t, D, w_i)}{N(D, w_i)}$$

where $N(D, w_i)$ is the number of tuples in $D$ that contain keyword $w_i$, and the score for each tuple $t$ with respect to $w_i$, $score(t, D, w_i)$, is defined as follows:

$$score(t, D, w_i) = \frac{tf(t, w_i)}{\sum_{w \in t} tf(t, w)} \cdot \ln \frac{N(D)}{N(D, w_i) + 1} \tag{5.7}$$

where $tf(t, w_i)$ is the term frequency of $w_i$ in the tuple $t$, and $N(D)$ is the total number of tuples in $D$.

• The edge score: For any two keywords $w_i \in \mathcal{K}$ and $w_j \in \mathcal{K}$, the edge score $score(D, w_i, w_j)$ is defined as follows:

$$score(D, w_i, w_j) = \sum_{d=1}^{\delta} score_d(D, w_i, w_j) \tag{5.8}$$

Here $\delta$ is a parameter to control the maximum distance between two keywords, and

$$score_d(D, w_i, w_j) = \sum_{(t, t') \in P_d(w_i, w_j, D)} tf(t, w_i) \cdot tf(t', w_j) \cdot \ln \frac{N_d(D)}{N_d(w_i, w_j, D) + 1}$$
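The node score from the first bullet can be sketched in Python as follows. The sketch reads the tuple score as the normalized term frequency multiplied by $\ln\bigl(N(D) / (N(D, w_i) + 1)\bigr)$ and the node score as the average tuple score over the tuples containing $w_i$. Modelling a tuple as a list of terms, and the names `tuple_score` and `node_score`, are assumptions made for illustration.

```python
from collections import Counter
from math import log


def tuple_score(terms, keyword, n_total, n_with_keyword):
    """Score one tuple with respect to a keyword.

    terms: list of terms in the tuple (repetitions give the term frequency).
    n_total: N(D), the number of tuples in the database.
    n_with_keyword: N(D, w_i), the number of tuples containing the keyword.
    """
    tf = Counter(terms)
    normalized_tf = tf[keyword] / sum(tf.values())
    return normalized_tf * log(n_total / (n_with_keyword + 1))


def node_score(db, keyword):
    """Average tuple score over the tuples of db that contain the keyword."""
    matching = [terms for terms in db if keyword in terms]
    if not matching:
        return 0.0
    total = sum(tuple_score(t, keyword, len(db), len(matching)) for t in matching)
    return total / len(matching)


# Four tuples, two of which contain "xml".
db = [["xml", "query", "xml"], ["xml"], ["graph"], ["join", "tree"]]
print(node_score(db, "xml"))
```

Here $N(D) = 4$ and $N(D, \text{xml}) = 2$, so both matching tuples share the factor $\ln(4/3)$ and differ only in their normalized term frequencies, 2/3 and 1.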

Title	Keyword Search in Databases
Institution	University of Information Technology
Major	Computer Science
Type	Thesis
Year of publication	2023
City	Ho Chi Minh City
