5.3 VARIATIONS OF KEYWORD SEARCH ON DATABASES
(maybe empty). For example, for the keyword query Q = {author, number, paper, XML}, one of the possible CIs is (C = {author.TID, paper.TID, paper.title contains “XML”}, a = paper.TID, F = count, w = author.TID). A CI is trivial if one of the following is satisfied: (1) C contains an attribute c = a such that c functionally determines a, or (2) C contains two attributes ci and cj that refer to the same attribute, or ci is a foreign key of cj. The set of non-trivial CIs can be enumerated by using the full-text index enabled in the RDBMS.
After enumerating all non-trivial CIs, for each CI = (C, a, F, w), it enumerates a set of Simple Query Networks (SQNs), where each SQN is a connected subgraph of the schema graph that satisfies the following conditions:
• Total - All tables in C are contained in the SQN.
• Minimal - The SQN is no longer total if any node is removed from it.
• Node Clarity - Each node in SQN has at most one incoming edge.
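The three conditions can be checked mechanically on a candidate subgraph. The following is a minimal sketch, not SQAK's own code: the function names and the representation (a set of table names plus a list of directed `(from, to)` edges) are assumptions made for illustration. It also assumes that "total" includes connectivity, so that removing an intermediate join table breaks totality.

```python
def is_connected(nodes, edges):
    """True if the nodes form one connected component, ignoring edge direction."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def is_total(nodes, edges, tables):
    """Total: every table of C is present (assumed: in one connected subgraph)."""
    return set(tables) <= set(nodes) and is_connected(nodes, edges)


def is_minimal(nodes, edges, tables):
    """Minimal: removing any single node destroys totality."""
    for n in nodes:
        rest = set(nodes) - {n}
        rest_edges = [(u, v) for u, v in edges if n not in (u, v)]
        if is_total(rest, rest_edges, tables):
            return False
    return True


def has_node_clarity(edges):
    """Node clarity: each node has at most one incoming edge."""
    incoming = {}
    for _, v in edges:
        incoming[v] = incoming.get(v, 0) + 1
    return all(c <= 1 for c in incoming.values())


def is_valid_sqn(nodes, edges, tables):
    return (is_total(nodes, edges, tables)
            and is_minimal(nodes, edges, tables)
            and has_node_clarity(edges))
```

For instance, with hypothetical tables `author`, `writes` and `paper`, the subgraph with edges `writes -> author` and `writes -> paper` passes all three checks for C = {author, paper}, while adding an unrelated fourth table makes it non-minimal.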
Suppose the cost of an SQN is the sum of all its edge costs and node costs. For each CI, it needs to find the SQN with the smallest cost, which is an NP-complete problem. A heuristic greedy algorithm is proposed in SQAK. For a CI (C, a, F, w), it starts at the table o that contains the attribute a. For each of the other tables (nodes) v ∈ C, it finds the shortest path from v to o in a backtracking manner: if, after adding the path from v to o, the node clarity condition is violated, it backtracks to find the next shortest path from v to o. This continues until all nodes in C are successfully added. It then outputs the current result as a good SQN for the CI. After finding the SQN for each CI, it can get the top-k SQNs with the smallest cost, and each of the top-k SQNs is translated into an SQL query to be output.
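The greedy heuristic can be sketched as follows. This is an illustrative simplification rather than the SQAK implementation: all function names are hypothetical, node costs are ignored, paths are found by brute-force enumeration (adequate only for small schema graphs), and a schema edge is assumed to be traversable in either direction while keeping its stored direction for the node clarity test.

```python
def simple_paths(edges, src, dst):
    """All simple paths src..dst as (cost, [schema edges]), cheapest first.

    `edges` maps a directed schema edge (u, w) to its cost.
    """
    adj = {}
    for (u, w), c in edges.items():
        adj.setdefault(u, []).append((w, (u, w), c))
        adj.setdefault(w, []).append((u, (u, w), c))
    found = []

    def dfs(node, visited, used, cost):
        if node == dst:
            found.append((cost, list(used)))
            return
        for nxt, edge, c in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                used.append(edge)
                dfs(nxt, visited, used, cost + c)
                used.pop()
                visited.remove(nxt)

    dfs(src, {src}, [], 0.0)
    return sorted(found, key=lambda p: p[0])


def has_node_clarity(sqn_edges):
    """Each node has at most one incoming edge."""
    incoming = {}
    for _, w in sqn_edges:
        incoming[w] = incoming.get(w, 0) + 1
    return all(c <= 1 for c in incoming.values())


def greedy_sqn(edges, tables, o):
    """Connect every table in `tables` to the start table `o`.

    For each table, try its paths to `o` from cheapest to most expensive and
    keep the first one that preserves node clarity. Returns the set of chosen
    schema edges, or None if some table cannot be connected.
    """
    sqn = set()
    for v in tables:
        if v == o:
            continue
        for _, path in simple_paths(edges, v, o):
            candidate = sqn | set(path)
            if has_node_clarity(candidate):
                sqn = candidate
                break
        else:
            return None
    return sqn


def sqn_cost(edges, sqn):
    """Sum of edge costs (node costs are omitted in this sketch)."""
    return sum(edges[e] for e in sqn)
```

Because each table is connected independently and earlier choices are never revisited, the result is a good SQN in the heuristic sense only, not necessarily the one with the smallest cost.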
5.3.5 SMALL DATABASE AS RESULT
Précis [Koutrika et al., 2006; Simitsis et al., 2008] returns a small database that contains only the tuples relevant to a given keyword query Q. The schema of a relational database D is modeled as a weighted graph GS(V, E), where each relation is modeled as a node in GS, and each foreign key reference between relations is modeled as an edge in GS. Each edge also has a weight, defining the tightness of the relationship between the two relations. Given a keyword query Q = {k1, k2, ..., kl}, the result of applying Q on D is a small database D′ satisfying the following conditions:
1. The set of relation names in D′ is a subset of the set of relation names in D.
2. For each relation R′ ∈ D′ that corresponds to relation R ∈ D, we have att(R′) ⊆ att(R) and tup(R′) ⊆ tup(R), where att(R) denotes the attributes of R and tup(R) denotes the tuples of R.
3. The tuples in D′ can be generated by expanding from the tuples that contain keywords in the query, following the foreign key references. They must satisfy the degree constraints and cardinality constraints. Degree constraints define the attributes and relations in D′. They include (1) the maximum number of attributes in D′, and (2) the minimum weight of projection paths in the database schema graph GS. Cardinality constraints define the set of tuples in D′. They include (1) the maximum number of tuples in D′, and (2) the maximum number of tuples for each relation in D′.
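Conditions 1 and 2 and the size-related constraints above are simple set and counting checks. The sketch below assumes a hypothetical in-memory representation in which a database is a dict from relation name to a record with an `attrs` set and a `tids` set of tuple identifiers; none of these names come from Précis itself.

```python
def is_sub_database(d_small, d_full):
    """Conditions 1 and 2: relation names, attributes and tuples are subsets."""
    for name, rel in d_small.items():
        if name not in d_full:
            return False  # condition 1 violated
        full = d_full[name]
        if not rel["attrs"] <= full["attrs"]:
            return False  # att(R') is not a subset of att(R)
        if not rel["tids"] <= full["tids"]:
            return False  # tup(R') is not a subset of tup(R)
    return True


def satisfies_max_attributes(d_small, max_attrs):
    """Degree constraint (1): bound on the total number of attributes."""
    return sum(len(rel["attrs"]) for rel in d_small.values()) <= max_attrs


def satisfies_cardinality(d_small, max_total, max_per_relation):
    """Cardinality constraints: bounds on tuples overall and per relation."""
    sizes = [len(rel["tids"]) for rel in d_small.values()]
    return sum(sizes) <= max_total and all(s <= max_per_relation for s in sizes)
```

The minimum-weight constraint on projection paths depends on the schema graph and is not covered by these checks.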
For example, for the DBLP database shown in Figure 2.2, consider a keyword query Q = {algorithms}, with the constraint that the distance from any tuple to the tuple that contains the keyword in Q must be no larger than 2. Then, the result is a database with the same schema as the original database. Tuples such as p2 and p3 will be contained in the result because they are all at distance 2 from the tuple p4 that contains the keyword “algorithms”. Tuples such as a1, a2 and p4 will not be contained in the result because they are all at a distance larger than 2 from any tuple that contains the keyword “algorithms”.
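A distance constraint of this kind amounts to a bounded breadth-first search from the tuples that contain the keyword. The following sketch uses a made-up tuple graph with hypothetical identifiers `t1` to `t5`; it does not reproduce Figure 2.2, and the function name is an assumption.

```python
from collections import deque


def tuples_within(graph, seeds, max_dist):
    """Return {tuple_id: distance} for all tuples within max_dist of a seed.

    `graph` is an undirected adjacency dict over tuple identifiers, where an
    edge stands for a foreign key reference between two tuples.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        if dist[t] == max_dist:
            continue  # do not expand beyond the distance bound
        for nb in graph.get(t, ()):
            if nb not in dist:
                dist[nb] = dist[t] + 1
                queue.append(nb)
    return dist


# Hypothetical chain of tuples: t1 - t2 - t3 - t4 - t5
toy_graph = {
    "t1": ["t2"],
    "t2": ["t1", "t3"],
    "t3": ["t2", "t4"],
    "t4": ["t3", "t5"],
    "t5": ["t4"],
}
reachable = tuples_within(toy_graph, ["t1"], max_dist=2)
```

With `t1` as the only keyword tuple and a bound of 2, `reachable` holds `t1`, `t2` and `t3`; `t4` and `t5` are left out of the result database.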
In Précis, a keyword query is processed in two steps. In the first step, the schema of the database D′ is generated, such that all of the degree constraints are satisfied. This can be done easily by expanding from the relations that may contain the user-given keywords to the adjacent relations, following the foreign key references, until all degree constraints are satisfied. In the second step, it evaluates each join edge defined in the schema of D′ in order to satisfy all the cardinality constraints.
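The first step can be sketched as a best-first expansion over the schema graph. Several points here are assumptions rather than details given in the text: the weight of a path is taken to be the product of its edge weights (with weights in (0, 1]), a relation is added with all of its attributes, expansion stops at the first relation that would exceed the attribute budget, and every name in the code is hypothetical.

```python
import heapq


def expand_schema(schema_edges, attrs, seed_relations, min_path_weight, max_attrs):
    """Pick the relations of the result schema under two degree constraints.

    schema_edges: {(rel1, rel2): weight}, treated as undirected FK edges.
    attrs: {relation: list of attribute names}.
    Returns {relation: weight of the best path from a seed relation}.
    """
    adj = {}
    for (r1, r2), w in schema_edges.items():
        adj.setdefault(r1, []).append((r2, w))
        adj.setdefault(r2, []).append((r1, w))

    # Max-heap on path weight, implemented with negated weights.
    heap = [(-1.0, r) for r in sorted(seed_relations)]
    heapq.heapify(heap)
    chosen = {}
    used_attrs = 0
    while heap:
        neg_w, rel = heapq.heappop(heap)
        weight = -neg_w
        if rel in chosen:
            continue
        if weight < min_path_weight:
            break  # every remaining path is at least as weak
        if used_attrs + len(attrs[rel]) > max_attrs:
            break  # attribute budget exhausted
        chosen[rel] = weight
        used_attrs += len(attrs[rel])
        for nb, w in adj.get(rel, []):
            if nb not in chosen:
                heapq.heappush(heap, (-(weight * w), nb))
    return chosen
```

The second step, evaluating the join edges of the chosen schema while respecting the cardinality constraints, is not covered by this sketch.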
5.3.6 OTHER RELATED ISSUES
Jagadish et al. [2007] assert that the usability of a database is an important issue to address in database research. Enabling keyword queries on databases is one way to improve usability.
Goldman et al. [1998] propose the notion of proximity search, which searches for objects in a database that are “near” other relevant objects. Here the database is represented as a graph, where objects are represented by nodes and edges represent relationships between the corresponding objects.
Su and Widom [2005] propose to construct virtual documents offline, which serve as the answer units for a keyword query. Virtual documents are interconnected tuples from multiple relations. Query answering is in a traditional IR style, where virtual documents satisfying the query are returned. Nandi and Jagadish [2009] propose to represent the database, conceptually, as a collection of independent “queried units”, each of which represents the desired result of some query against the database. Jayapandian and Jagadish [2008] present an automated technique to generate a good set of forms that can express all possible queries, where each form is capable of expressing only a very limited range of queries. Talukdar et al. [2008] present a system with which a non-expert user can author new query templates and Web forms, to be used by anyone with related information needs. The query templates and Web forms are generated by a keyword query against interlinked source relations.
Ji et al. [2009] study interactive keyword search on RDB, where the interaction is provided by autocompletion, which predicts a word or phrase that a user may type based on the partial query the user has entered. An answer defined in [Ji et al., 2009] is a single record in RDB. Li et al. [2009a] extend the autocompletion framework to the Steiner tree based semantics for a keyword query.
Chaudhuri and Kaushik [2009] study autocompletion with tolerated errors in a general framework, in which only autocompletions are computed without query evaluation. Pu and Yu [2008, 2009] study the problem of query cleaning for keyword queries in RDB, where query cleaning involves semantic linkage and spelling corrections followed by segmenting nearby query words into high quality data terms.
Guo et al. [2007] present an efficient algorithm to conduct topology search over biological databases. Shao et al. [2009b] present an effective workflow search engine, WISE, to find informative and concise search results, defined as the minimal views of the most specific workflow hierarchies containing keywords for a keyword query.
Bibliography
Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: A system for keyword-based
search over relational databases. In Proc. 18th Int. Conf. on Data Engineering, pages 5–16, 2002.
DOI: 10.1109/ICDE.2002.994693 2.1, 2.3
Shurug Al-Khalifa, Cong Yu, and H. V. Jagadish. Querying structured text in an xml
database. In Proc. 2003 ACM SIGMOD Int. Conf. On Management of Data, pages 4–15, 2003.
DOI: 10.1145/872757.872761 4.5
Sihem Amer-Yahia, Pat Case, Thomas Rölleke, Jayavel Shanmugasundaram, and Gerhard
Weikum. Report on the db/ir panel at sigmod 2005. SIGMOD Record, 34(4):71–74, 2005.
DOI: 10.1145/1107499.1107514 (document)
Sihem Amer-Yahia and Jayavel Shanmugasundaram. Xml full-text search: Challenges and
opportunities. In Proc. 31st Int. Conf. on Very Large Data Bases, page 1368, 2005. (document)
Andrey Balmin, Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou, Divesh Srivastava, and
Tianqiu Wang. A system for keyword proximity search on xml databases. In Proc. 29th Int. Conf.
on Very Large Data Bases, pages 1069–1072, 2003. 4.5
Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-based
keyword search in databases. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 564–575,
2004. 5.3.1
Zhifeng Bao, Tok Wang Ling, Bo Chen, and Jiaheng Lu. Effective xml keyword search with
relevance oriented ranking. In Proc. 25th Int. Conf. on Data Engineering, pages 517–528, 2009.
DOI: 10.1109/ICDE.2009.16 4.5
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword
searching and browsing in databases using BANKS. In Proc. 18th Int. Conf. on Data Engineering,
pages 431–440, 2002. DOI: 10.1109/ICDE.2002.994756 3.1, 3.1, 3.3.1, 3.3.1
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine.
Computer Networks, 30(1-7):107–117, 1998. DOI: 10.1016/S0169-7552(98)00110-X 3.1, 4.4.1
Kaushik Chakrabarti, Venkatesh Ganti, Jiawei Han, and Dong Xin. Ranking objects based on
relationships. In Proc. 2006 ACM SIGMOD Int. Conf. On Management of Data, pages 371–382,
2006. DOI: 10.1145/1142473.1142516 5.3.1