STATUTORY DECLARATION
I declare under oath that I have written this dissertation independently and without outside help, that I have used no sources or aids other than those indicated, and that I have marked as such all passages taken verbatim or in substance from other sources.
Linz, April 2003
Dang Tran Khanh
SUMMARY
A weakness of traditional query answering in today’s database systems is the lack of flexibility in interpreting user queries. This has led to the development of so-called “Flexible Query Answering Systems” (FQASs), which extend existing databases with similarity query functionality. Realizing such functionality for the database management systems (DBMSs) currently on the market is not trivial, however, because the semantics to which similarity queries must refer are missing. On the other hand, the concept of document similarity/relevance plays a central role in Information Retrieval (IR) systems. These aim at satisfying a user’s information need rather than the data need of a DBMS user. An appropriate integration of IR concepts into existing database systems to close this semantic gap could lead to semantic based FQASs. Nevertheless, developing such systems is by no means simple and brings further challenges, such as (1) modeling such a system, (2) developing its flexibility, and (3) creating a transparent and reliable system.
This thesis analyzes the challenges listed above and designs, implements, and evaluates methods and techniques with the goal of supporting the design and construction of efficient semantic based FQASs in general. In particular, it concentrates on methods for similarity search, complex similarity search, similarity joins, and approximate similarity queries, as well as on the optimization and integration of all these approaches into conventional DBMSs. In addition, the experience gained in developing semantic based FQASs is discussed. The results can be applied not only to traditional DBMSs but also in a wider setting (e.g., modern IR systems, data mining).
Among the results achieved, the following stand out: (1) the invention of a novel multidimensional index structure, the SH-tree (Super Hybrid tree), which scales to high-dimensional spaces; (2) a new approach, called the Incremental hyper-Sphere Approach (ISA), which efficiently processes complex multi-feature nearest neighbor (M-FNN) queries; (3) an innovative approach, the ε-ISA, for fast processing of approximate complex M-FNN queries (the ε-ISA is one of the first solutions to this problem); (4) efficient approaches for complex similarity joins and approximate complex similarity joins; (5) a discussion and resolution of critical issues in integrating the SH-tree into DBMSs.
ABSTRACT
A critical weakness of the traditional query processing model in existing database systems is its lack of flexibility in interpreting and answering users’ queries. This has initiated a new research trend toward flexible query answering systems (FQASs), which extend existing database systems with similarity retrieval capabilities in order to fulfill a user’s requirements, expressed in a formal query language such as SQL, more “intelligently” and effectively. However, realizing such similarity retrieval capabilities in state-of-the-art conventional database management systems (DBMSs) is non-trivial because these systems lack the semantics that form the basis for similarity searches. On the other hand, the concept of similarity/relevance search is at the center of Information Retrieval (IR) systems, which mainly aim at satisfying user information needs rather than user data needs as in conventional DBMSs. A suitable integration of this concept into an existing database system, in order to overcome the lack of semantics, could lead to the desired semantic based FQAS. Even then, developing such a FQAS is far from simple, and it introduces many new challenges that relate to (1) modeling the system, (2) developing flexibilities for the system, and (3) building a transparent system that guarantees correct results.
In this thesis, we analyze the above challenges and design, implement, and evaluate techniques to facilitate the design and construction of efficient semantic based FQASs in general. In particular, we concentrate on similarity search techniques, complex similarity query/join processing and optimization, approximate similarity queries of various types, and the integration of all these facilities into DBMSs. Moreover, although we present the research results in the context of developing semantic based FQASs, our achievements are also applicable to other modern database applications (e.g., modern IR systems, data mining), not only to traditional DBMSs.
Among the achieved results, the most important ones are: (1) inventing a multidimensional index structure, named the SH-tree (Super Hybrid tree), that can scale to high-dimensional data spaces; (2) introducing a novel approach called the Incremental hyper-Sphere Approach (ISA) to efficiently address complex multi-feature nearest neighbor (M-FNN) queries; (3) introducing an innovative approach named the ε-ISA to efficiently solve approximate complex M-FNN queries (the ε-ISA is one of a few vanguard solutions to the problem of approximate complex similarity query answering); (4) proposing efficient approaches to deal with complex and approximate complex similarity joins; and (5) discussing and solving critical issues towards integrating the SH-tree into DBMSs.
ACKNOWLEDGEMENTS
First of all, I must thank my parents, who gave me life and have supported me throughout it in every kind of need.
I thank my direct supervisor, Prof. Küng Josef, for his great support and many fruitful discussions. I am also very grateful to his family members, who helped me get to know more about Austria and Austrians. In particular, I must thank Gabriela a lot for the dinners and jokes.
I would like to express my great gratitude for all the help of my supervisor, Prof. Wagner Roland. He helped a lot with many important decisions during my research and study in Austria.
I thank my younger sister, Dang Viet Ha, and my youngest brother, Dang Tran Khanh. They and my parents are indispensable people in my life. They all gave me the spirit to finish this thesis. My younger sister and brother helped me to forget the work whenever it was at a standstill and I needed some time to relax.
For the cost of living and part of the funding for my research work, I thank all the people from the Austrian Exchange Service (OeAD). In particular, I would like to thank Mr. Szelegowitz Andreas, who manages the OeAD branch in Linz, for all of his help during my stay in Austria.
I thank my old teacher, Dr. Nguyen Thanh Son, for his support and useful advice.
I am also thankful to all of my friends and colleagues for their support and encouragement.
Last but not least, I give my special thanks to my fiancée, Huyen Trang, who is always with me and gives me great encouragement. Her support has been a lever for me to accomplish this thesis.
Linz, April 2003 Dang Tran Khanh
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Challenges
1.3 Contributions and Structure of Thesis
CHAPTER 2 FLEXIBLE QUERY ANSWERING SYSTEMS
2.1 Introduction
2.2 Supporting Flexible Retrieval Capabilities in DBMSs
2.2.1 ARES
2.2.2 VAGUE
2.2.3 FLEX
2.2.4 CoBase
2.2.5 VQS
2.3 Modern Information Retrieval Systems
2.3.1 QBIC
2.3.2 Photobook
2.3.3 MARS
2.4 Discussion
2.5 Conclusions
CHAPTER 3 VQS - VAGUE QUERY SYSTEM: APPROACH AND ISSUES
3.1 Introduction
3.2 Semantic Based Similarity Searches
3.3 Basic Ideas and Overall Architecture of VQS
3.3.1 The Basic Ideas
Approach
3.6 Issues and Discussion
3.7 Conclusions
CHAPTER 4 AN OVERVIEW OF MULTIDIMENSIONAL ACCESS METHODS
4.1 Preliminary
4.2 Introduction and Fundamental Definitions
4.2.1 Basic Operations Related to Building MAMs
4.2.2 Basic and Advanced Query Types Related to MAMs
4.3 Prevailing Search Algorithms
4.4 Developing a Taxonomy for Multidimensional Access Methods
4.5 A Survey of Some Typical Multidimensional Access Methods
4.5.1 KD-Tree Based Indexing Techniques
4.5.1.1 The KD-tree
4.5.1.2 The VAMSplit-tree
4.5.1.3 The LSDh-tree
4.5.2 R-Tree Based Indexing Techniques
4.5.2.1 The R-tree
4.5.2.2 The TV-tree
4.5.2.3 The SS-tree
4.5.2.4 The X-tree
4.5.2.5 The SR-tree
4.5.2.6 The M-tree and MAMs for Metric Spaces
4.5.2.7 The A-tree
4.5.3 Hybrid Techniques of both KD-tree and R-tree
4.5.3.1 The Hybrid tree
4.5.3.2 The SH-tree
4.5.4 Other Techniques
4.6 The Generalized Search Tree (GiST): A Framework for Search Trees in Database Systems
4.7 Comparative Studies
4.8 Remarks and Conclusions
CHAPTER 5 THE SH-TREE: A SUPER HYBRID INDEX STRUCTURE FOR MULTIDIMENSIONAL DATA
5.1 Preliminary Remarks
5.2 Introduction
5.3 Motivations
5.4 The SH-tree Structure
5.4.1 Multidimensional Space Partitioning and Basic Structure of the SH-tree
5.4.2 Splitting Nodes in the SH-tree
5.4.2.1 Leaf Node Splitting
5.4.2.2 Balanced Node Splitting
5.4.2.3 Internal Node Splitting
5.4.3 The Extended Balanced SH-tree
5.5 The SH-tree’s Basic Operations
5.5.1 Insertion
5.5.2 Deletion
5.5.3 Search
5.6 Evaluating performance of the SH-tree
5.6.1 Implementation Details
5.6.2 The SH-tree Performance with k-Nearest Neighbor Queries
5.7 Towards Integrating the SH-tree into DBMSs
5.7.1 An Approach to Cost Estimation for Searching on the SH-tree
5.7.2 Local Dynamic Bulk-Loading of the SH-tree
CHAPTER 6 SOLVING COMPLEX MULTI-FEATURE NEAREST NEIGHBOR QUERIES
6.1 Introduction
6.2 Addressing Complex Vague Queries
6.3 Incremental Hyper-Sphere Approach (ISA)
6.3.1 Basic Incremental hyper-Sphere Approach Version
6.3.2 An Incremental Algorithm Adapted for hyper-Sphere Range Queries
6.3.3 Enhanced Incremental hyper-Sphere Approach Version
6.3.4 Discussion
6.4 Finding k Nearest Neighbors for Complex Vague Queries
6.5 Experimental Results
6.6 Remarks and Conclusions
CHAPTER 7 SOLVING APPROXIMATE SIMILARITY QUERIES
7.1 A Preliminary Declaration of Problems
7.2 Solving Approximate Complex Multi-Feature Nearest Neighbor Queries
7.2.1 Introduction and Related Research
7.2.2 The ISA: An Efficient Approach for Solving Complex Vague Queries
7.2.3 The ε-ISA: Towards Efficiently Solving Approximate M-FNN Problem
7.2.3.1 Finding Approximate Nearest Neighbor
7.2.3.2 A Generalized Algorithm for Finding Approximate k-Nearest Neighbors
7.5 Conclusions and Future Work
CHAPTER 8 EFFICIENT PROCESSING OF COMPLEX AND APPROXIMATE COMPLEX SIMILARITY JOINS
8.1 Introduction and A Classification Scheme of Similarity Joins
8.2 Efficient Complex Similarity Join Processing in the VQS
8.2.1 Complex Similarity Join Processing: An Approach to the Best Match Joins
8.2.2 Approximate Complex Similarity Join Processing
8.2.3 Evaluation Results and Discussions
8.3 A Generalization of Complex and Approximate Complex Similarity Join Processing Problem
8.4 Remarks and Conclusions
CHAPTER 9 CONCLUSIONS AND FUTURE WORK
9.1 Concluding Summary
9.2 Future Work
REFERENCES
Figure 2.1 Overall architecture of FLEX
Figure 2.2 An example runway length TAH
Figure 2.3 Overall architecture of QBIC
Figure 2.4 A classification scheme for flexible query answering systems
Figure 3.1 Normalization using the effective diameter
Figure 3.2 Formal description of the Vague Query Language - VQL
Figure 3.3 Overall architecture of primitive VQS
Figure 3.4 Formal description of the extended Vague Query Language
Figure 3.5 An example of complex vague query processing with the Incremental hyper-Cube Approach
Figure 3.6 New architecture of VQS
Figure 4.1 A schema of the query type classification with MAMs
Figure 4.2 The dilation (R+) and the erosion (R-) of an example range R
Figure 4.3 MINDIST, MINMAXDIST, and MAXDIST
Figure 4.4 Evolution schema of MAMs in recent years
Figure 5.1 Some problems with coding actual data region
Figure 5.2 A possible partition of an example data space and the corresponding mapping to the SH-tree
Figure 5.3 Problem with leaf node splitting in the Hybrid tree: No suitable split position satisfies the storage utilization constraint
Figure 5.4 Steps of splitting a leaf node in the SH-tree
Figure 5.5 Choose an internal node to insert a new data object into
Figure 5.6 Split propagation in the SH-tree
Figure 5.7 An algorithm for answering range queries in the SH-tree
Figure 5.8 Pseudo-code of the adapted k-NN algorithm 1
Figure 5.12 Variety in data size of 16-d uniformly distributed data set
Figure 5.13 Variety in data size of 9-d real data set
Figure 5.14 Variety in number of NN (real data set)
Figure 5.15 Variety in number of NN (16-d synthetic data set)
Figure 5.16 An algorithm for estimating costs of range queries in the SH-tree
Figure 5.17 Experimental results of cost estimation (16-d synthetic data set)
Figure 5.18 Experimental results of cost estimation (9-d real data set)
Figure 5.19 Local Dynamic Bulk-Loading Algorithm for the SH-tree
Figure 5.20 Incorrectness when operations are executed concurrently
Figure 5.21 Example of the extended SH-tree structure (for leaf nodes)
Figure 5.22 Example of the extended SH-tree structure (for balanced nodes)
Figure 6.1 An example illustration of defects of the Incremental hyper-Cube Approach (ICA)
Figure 6.2 Example feature spaces for a 2-condition complex vague query with Incremental hyper-Sphere Approach (ISA)
Figure 6.3 The incremental nearest neighbor algorithm for a recursive, conservative, and hierarchical index structure
Figure 6.4 The incremental nearest neighbor algorithm for a recursive, conservative, and hierarchical index structure
Figure 6.5 Cost reduction of the ISA compared to the ICA
Figure 6.6 Incrementally reducing the extended radius: the searched space is qi3 instead of qi4
Figure 6.7 The enhanced ISA vs the ICA (complex k-NN queries)
Figure 6.8 The optimal ISA vs the ICA (complex 2-condition NN queries)
Figure 6.9 The optimal ISA vs the ICA (complex 3-condition NN queries)
Figure 6.10 The optimal ISA vs the enhanced ISA and the ICA (complex k-NN queries)
Complex M-FNN Queries
Figure 7.4 Overview of problem solved by the ISA
Figure 7.5 Incremental algorithm adapted for range queries
Figure 7.6 The ISA performance still decreases when the number of query conditions increases
Figure 7.7 The ε-ISA: Finding (1+ε)-approximate nearest neighbor of a CVQ
Figure 7.8 Lower Bound Total Distance of a M-FNN query
Figure 7.9 Two-condition (4-d and 8-d) NN queries, different ε values
Figure 7.10 Two-condition (4-d) k-NN queries, ε = 0.2
Figure 7.11 Three-condition (2-d) NN queries, different ε values
Figure 7.12 Pseudo-code of the adapted approximate k-NN algorithm
Figure 7.13 Approximate NN results and correct NN results
Figure 7.14 Pseudo-code of the adapted approximate range query algorithm
Figure 7.15 Approximate bounding sphere range queries
Figure 8.1 A similarity join classification scheme
Figure 8.2 Complex similarity joins in the VQS
Figure 8.3 Solving the complex similarity join (CSJ) problem in the VQS
Figure 8.4 Solving the approximate complex similarity join (ACSJ) problem in the VQS
Table 2.1 An example of similarity relation in ARES
Table 3.1 An example NCR-Table for color names
Table 3.2 An example of the use of NCR-Tables
Table 7.1 Experimental results of approximate k-NN queries
CHAPTER 1 INTRODUCTION
1.1 Motivation
Databases have been successful in traditional applications such as banking or airline scheduling and reservation systems, which are mainly concerned with numeric and/or ASCII data, and database technology is one of the major contributions of computer science to the commercial world. Nevertheless, databases now face new challenges as the domains and requirements of database applications evolve in step with the fast-paced developments in computer science. One of the emerging challenges in the commercial world is how to process users’ queries not only efficiently and effectively but also flexibly. The traditional query processing model of conventional relational database management systems (RDBMSs), which returns a result set that matches a user’s query exactly, is insufficient and significantly inflexible. Nevertheless, almost all existing RDBMSs still do not support vague retrieval capabilities directly. That means that when the available data in a relational database do not match a user’s query precisely, the RDBMS returns only an empty result set. This limits the applicability of RDBMSs to domains where only crisp answers are meaningful. In many other application domains, however, users expect not only crisp results but also other results that are relevant or close to the query in a certain sense [DKW2002c]. Such applications frequently appear in real-world domains such as image/multimedia processing, CAD/CAM systems, Geographical Information Systems (GISs), tourist information systems, time-series databases, digital libraries, modern Information Retrieval (IR), electronic commerce (e-commerce), and so forth.
We consider a simple example from tourist information systems. When a tourist looks for a hotel that costs 100 EUR a day and is located in the city center, he will fail to find one by means of a conventional RDBMS if no hotel in the city center is offered at that price. In fact, the user might accept a hotel near the city center at a price a little lower or higher than 100 EUR per day. A flexible search system should solve this problem efficiently. We call a database or an information system that supports this aspect a Flexible Query Answering System (FQAS). In today’s e-commerce systems especially, such FQASs become more and more important because customers need not touch the goods (e.g., cars, clothes, real estate) in advance, but can view information about them on a computer before deciding whether to purchase. If the system does not directly support vague retrieval capabilities, its users are forced to retry a particular query repeatedly with minor modifications until they get satisfactory data. If the users do not know any alternative modifications with which to retry their queries, this solution is infeasible [Mot1988]. In a system that does support vague retrieval, even when the data items that the user really wanted cannot be found in the list of outputs, the output data items still provide the user with enough information to make the next trial effective [IcH1986]. Consequently, developing effective FQASs that solve such problems efficiently is essential for the growth of computer science.
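The hotel example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Hotel` record, the sample data, the scaling constants, and the averaged per-attribute distance are hypothetical choices made for this example and are not the VQS’s actual similarity model. The sketch contrasts an exact-match query, which returns an empty result set, with a ranked vague query that returns the closest hotels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    name: str
    price_eur: float        # rent per day
    km_from_center: float   # 0.0 means "in the city center"


# Hypothetical sample data: no hotel in the center costs exactly 100 EUR.
HOTELS = [
    Hotel("Hotel A", 140.0, 0.0),
    Hotel("Hotel B", 105.0, 0.8),
    Hotel("Hotel C", 60.0, 6.0),
]


def exact_match(hotels, price_eur, km_from_center):
    """Crisp query: every condition must hold exactly."""
    return [h for h in hotels
            if h.price_eur == price_eur and h.km_from_center == km_from_center]


def vague_match(hotels, price_eur, km_from_center, k=2,
                price_scale=100.0, km_scale=10.0):
    """Vague query: rank hotels by an averaged, normalized distance.

    The scales are illustrative assumptions that bring both attributes
    into a comparable range before averaging.
    """
    def distance(h):
        dp = abs(h.price_eur - price_eur) / price_scale
        dk = abs(h.km_from_center - km_from_center) / km_scale
        return (dp + dk) / 2.0

    return sorted(hotels, key=distance)[:k]


exact = exact_match(HOTELS, 100.0, 0.0)    # empty result set
ranked = vague_match(HOTELS, 100.0, 0.0)   # closest hotels first
print(exact, [h.name for h in ranked])
```

With this data the crisp query finds nothing, whereas the vague query ranks the hotel slightly outside the center at 105 EUR ahead of the central hotel at 140 EUR.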
Even then, building a FQAS for an existing database system is far from simple: many data types and constraints in the database must be taken into account, and the FQAS must not conflict with the existing functions of the system. Additionally, the capability of a FQAS can be judged by different aspects and factors that depend on its main aims. For example, some FQASs only deal with vagueness in the queries, whereas others support vague retrieval capabilities even for the CDCQ-FR (Crisp Data, Crisp Query and Fuzzy Result) model (see chapter 3). Therefore, evaluating a FQAS is in general not easy either. In fact, FQASs could be useful for a variety of application domains, but not for all of them. Moreover, with the increasing production and exchange of multimedia information through the Internet, the need for FQASs is not confined to RDBMSs but extends to modern IR systems [BMP2001], a research topic of great current interest. Section 1.2 points out the challenges that we must cope with while developing a FQAS over an existing database system. Finally, section 1.3 presents our objectives as well as the contributions and structure of this thesis.
1.2 Challenges
Although flexibility in FQASs can carry different meanings, in our context it is interpreted as the capabilities that provide easy, informative, and intuitive access to data for every type of need. To reach this objective over centralized relational database systems, one must extend the RDBMSs with flexible retrieval capabilities (also called “vague/similarity retrieval capabilities” in this thesis), which are unavailable in most existing RDBMSs. This work poses the following main challenges:
• Modeling the system: This challenge requires the developers to select a suitable model for the FQAS to be built. First, the developers have to decide which functions the FQAS intends to extend, as well as how to realize them over an existing database system. Notice that the requirements might differ between application domains or even between applications in the same domain. Second, they must consider how the FQAS will cooperate with the system. Ideally, the FQAS should be integrated into the RDBMS kernel. This gains many benefits from the existing database system because of the tight coupling with the query optimizer, which allows optimal usage of the index structures in the execution plans. However, the integration is in general a very complex and costly task; in particular, when the database system cannot be modified, it must be developed from scratch. Another possibility is to implement the vague retrieval capabilities “on top” of the database system. This approach allows the FQAS to be extended easily without side effects on the database system. However, its main disadvantage is that query processing is not cost-effective: the additional query processing costs are often much higher than usual, and this can degrade the performance of the system.
• Developing flexibilities for the system: Regardless of the selected model, a full-fledged FQAS should be able to cooperate with general database systems, not only with one specific system, and should benefit naive users as well as professional users. Additionally, the application area independence of a FQAS should be taken into account: the system should be applicable to a variety of application domains and must not be confined to one specific area. Therefore, FQASs must be easily extensible and maintainable at a low cost. These last two characteristics directly support the ones above, and they are particularly important in dynamic environments.
• Building a transparent system that guarantees correct results: Modifications to any part of the database system, e.g. the extension of SQL, should not trouble the users. That means the FQAS should keep the inherent user-friendliness of the original system as much as possible, and the technical details inside the system should be invisible to naive users. Moreover, the FQAS has to guarantee that the returned results are correct, as expected. This means that the extended functions provided by a FQAS must not compromise the accuracy of the standard functions in the existing database system.
The challenges mentioned above are, in theory, common to all FQASs. In practice, there might be compromises between the conflicting requirements for simplicity, flexibility, and efficiency when building a real system. Section 1.3 also presents the concrete issues related to a FQAS under development, which to some degree inspired the author to carry out this thesis.
1.3 Contributions and Structure of Thesis
Given the need for FQASs in modern database applications and the lack of flexible retrieval capabilities in existing database systems, a project called Vague Searches in Information Systems (VASIS) [FAW1998] has been conducted at the FAW institute (part of the Johannes Kepler University of Linz, Austria/Europe; URL: http://www.faw.uni-linz.ac.at/). The motivation for the project is to realize common concepts and methodologies for modeling semantic meta-information in conventional database systems. These concepts should enable the database systems to supply semantic based query processing capabilities that can be used to find data objects semantically close to a given query object. The main objectives of the project are to reach application domain independence and to provide powerful instruments for processing vague retrieval in real-world applications. A prototype of the system, named Vague Query System (VQS) (see chapter 3), was implemented. The VQS provides semantic based query processing functionality for the CDCQ-FR model: given a query, if the RDBMS fails to retrieve any record that matches the query precisely, the VQS can employ the semantic meta-information to find a set of records semantically close/similar to the query. The VQS’s approach is very promising and similar to that of modern IR systems. In chapter 3, we elaborate on the work completed on the VQS before the research in this thesis was carried out. Nevertheless, much remained to be done on that VQS prototype. In particular, the VQS lacks sophisticated techniques to manage the storage and retrieval of multidimensional semantic meta-information and to process queries efficiently. This dramatically degrades the performance of the VQS, so that it cannot be applied to existing database systems as efficiently as expected.
In this thesis, we analyze the issues posed by the above challenges, and design, implement, and evaluate techniques to facilitate the design and construction of efficient semantic based FQASs (see chapter 3 for more details about a semantic based FQAS in our context). In particular, we concentrate on similarity search techniques, complex vague query/join processing and optimization, approximate queries, and the integration of these facilities into DBMSs. Our case-study system is the VQS: the research results can be integrated into the VQS to make it a full-fledged FQAS. Moreover, although we present the research results in the context of developing semantic based FQASs, e.g. the VQS, our results can also be applied to general FQASs, e.g. modern IR systems, as well as to other modern database applications in general, as we will see throughout this thesis. The main contributions of the thesis focus on the following topics:
• Multidimensional Index Structures: Multidimensional data are often encountered in modern database applications such as data mining/OLAP, GISs, tourist information systems, spatio-temporal databases, multimedia content-based retrieval, etc. Therefore, building a FQAS for a database system in such domains requires managing (large) multidimensional datasets efficiently. Although several efforts have been made, RDBMSs are still far from supporting multidimensional data and cannot serve such applications efficiently. One of the main problems is inadequate support for multidimensional access methods (MAMs). MAMs accelerate the storage and retrieval of data items by selectively accessing a small number of them in a large collection. Traditional multidimensional index structures like the R-tree [Gut1984] are insufficient because they work well only in the low-dimensional spaces for which they were designed, and are not suitable for the high-dimensional spaces that occur in modern database applications. To meet this need, we have developed a new structure called the SH-tree, a Super Hybrid Index Structure for Multidimensional Data. It is a well-combined structure that carries the positive aspects of both space partitioning based (SP-based) and data partitioning based (DP-based) techniques. We developed cost-effective algorithms for the basic operations of the SH-tree, such as insertion, deletion, update, and search (see chapter 4 for details about the basic and advanced search operation types of typical multidimensional index structures). Through theoretical analyses as well as experimental results, we have shown that the SH-tree is one of the most flexible and efficient index structures for supporting the storage and retrieval of medium- and high-dimensional data.
• Solving Complex Multi-Feature Nearest Neighbor Queries: In this thesis, we also call a complex multi-feature nearest neighbor query a complex vague query. In practice, most users’ queries consist of more than one query condition. To answer such queries, a semantic based FQAS must search several feature spaces of the data items individually and then combine the returned results to give the final ones to the user. In this thesis, we introduce a novel, efficient, and general approach, called ISA (Incremental hyper-Sphere Approach), for dealing with multi-feature nearest neighbor queries. Our approach is general in that it has been designed to solve general complex vague queries, which differ from the ones encountered and quite well discussed in some modern application areas (e.g. in multimedia databases [Fag1999, NeR1999, ORC+1998]). Experimental results have demonstrated the efficiency of the ISA.
• Solving Approximate Similarity Queries: A data object P is called a (1+ε)-approximate nearest neighbor of a given query object Q, with ε > 0, if for all other data objects P’: dist(P, Q) ≤ (1+ε)dist(P’, Q), where dist(X, Y) represents the distance between objects X and Y. The feature spaces are usually multidimensional and may contain a vast amount of data. Therefore, the search costs, including I/O cost and CPU cost, can be prohibitively expensive. Alleviating these costs during query processing is one of our main purposes when developing FQASs. For a single multidimensional feature space, the problem of answering nearest neighbor and approximate nearest neighbor queries has been posed and quite well addressed in the literature. In this thesis, we introduce approaches that deal efficiently not only with approximate single-feature queries but also with approximate multi-feature queries. In particular, our research aims at finding approximate results for both nearest neighbor queries and range queries. To the best of our knowledge, the work presented in this thesis is one of a few vanguard solutions to the problem of answering approximate multi-feature nearest neighbor queries (other solutions have been introduced for multimedia databases [FLN2001] and for supporting incremental join queries on ranked inputs [NCS+2001]). Our experimental results have shown that our approaches are very promising in terms of cost-efficiency and wide applicability.
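The (1+ε) condition stated in the item above can be checked directly by brute force. The following is a minimal sketch, assuming Euclidean distance over small in-memory point tuples; the function names `is_approx_nn` and `exact_nn` and the sample data are hypothetical and serve only to illustrate the definition. It is not the ε-ISA algorithm, which is designed to avoid such exhaustive scans.

```python
import math


def dist(x, y):
    """Euclidean distance between two points (an assumption of this sketch)."""
    return math.dist(x, y)  # available since Python 3.8


def exact_nn(q, data):
    """Exact nearest neighbor of q by exhaustive scan."""
    return min(data, key=lambda p: dist(p, q))


def is_approx_nn(p, q, data, eps):
    """Check dist(p, q) <= (1 + eps) * dist(p2, q) for every p2 in data."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    d = dist(p, q)
    return all(d <= (1.0 + eps) * dist(p2, q) for p2 in data)


# Hypothetical sample data: the exact nearest neighbor of q is (1.0, 0.0).
data = [(1.0, 0.0), (0.0, 1.1), (3.0, 4.0)]
q = (0.0, 0.0)

print(exact_nn(q, data))
print(is_approx_nn((0.0, 1.1), q, data, eps=0.2))   # within 20% of the best
print(is_approx_nn((0.0, 1.1), q, data, eps=0.05))  # not within 5%
```

A larger ε admits more candidates as acceptable answers, which is what allows an approximate search to stop early and save I/O and CPU cost.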
• Efficient Processing of Complex and Approximate Complex Similarity Joins: Join processing is the most expensive of the three most frequently used relational operations (i.e., selection, projection, and join) in RDBMSs [YuM1998]. Join processing within a semantic based FQAS is much more expensive than in RDBMSs because we have to consider not only the conventional relations participating in the join operation, but also the semantic meta-information of the attributes of these relations, which is usually multidimensional data as mentioned above. To the best of our knowledge, only one work has addressed the same problem [KuP1998], i.e. for a semantic based FQAS like the VQS. However, the result of that work is confined to returning “good matches” for complex join operations rather than “best matches”, and it still has many shortcomings that make it impractical. In this thesis, we present general and efficient approaches and algorithms to address the problem. Our research results concentrate on solving both complex and approximate complex similarity joins efficiently and effectively (the top-k “best matches” satisfying a complex join or an approximate complex join are returned). Additionally, a generalized solution to the problem is proposed.
Towards Integrating Multidimensional Index Structures into DBMSs: As we have seen, MAMs play a critical role in the success of modern database applications in general and semantic based FQASs in particular. Therefore, supporting multidimensional index structures as access methods in RDBMSs is an indispensable requirement. The main challenges related to this topic include developing efficient concurrency control and recovery techniques for the index structures so that they can reliably and efficiently participate in transaction processing, developing cost estimation models for query processing that can be made available to the query optimizer, and speeding up the tree building process (especially needed in interactive processing environments and batch processing). In this thesis, we discuss and introduce some preliminary proposals for the SH-tree with respect to these matters.
More concretely, the rest of the thesis is organized as follows:
Chapter 2: Flexible Query Answering Systems. In this chapter, we further discuss the importance of supporting vague retrieval capabilities in existing database systems as well as modern database applications, and then briefly introduce some prominent previous research related to this topic in different application domains of both conventional RDBMSs and modern IR systems. For each of these systems, we elaborate on the most important features that make it work efficiently, its viewpoint on the users’ information/data needs, and how it defines and creates the information relevant to a given query. Finally, we summarize and propose a classification scheme for the FQASs introduced.
Chapter 3: VQS - Vague Query System: Approach and Issues. This chapter details the VQS, an implemented prototype of a semantic based FQAS. We present the achievements of its developers and the issues that remained open before the research results of this thesis were applied to it.
Chapter 4: An Overview of Multidimensional Access Methods. This chapter surveys recently published multidimensional index structures and introduces the basic operations that a typical multidimensional index structure should support, as well as the prevailing state-of-the-art search algorithms. We also develop a taxonomy and give a classified evolution schema for these index structures.
Chapter 5: The SH-Tree: A Super Hybrid Index Structure for Multidimensional Data. In the first main part of this chapter, we present the SH-tree in detail as well as the basic operations that it supports. Experimental results with both uniformly distributed and real datasets are given to compare the SH-tree with some of the most prominent index structures. Furthermore, as mentioned above, integrating Multidimensional Access Methods into DBMSs is an indispensable requirement for modern database applications, and this is a growing tendency. Hence, in the second main part of this chapter, we present a preliminary cost estimation model for the SH-tree and a dynamic bulk-loading technique for the SH-tree, and discuss several theoretical issues related to concurrency control and recovery techniques for the SH-tree. In particular, we give an algorithm that preserves the consistency of the SH-tree in the presence of concurrent operations such as insertions, deletions, and updates. These are vitally important issues to resolve before the SH-tree can be supported as an access method in commercial RDBMSs.
Chapter 6: Solving Complex Multi-Feature Nearest Neighbor Queries. This chapter introduces the Incremental hyper-Sphere Approach (ISA) for solving complex vague queries efficiently and generally. In addition, we modify a state-of-the-art algorithm proposed for the k-nearest neighbor problem to establish an incremental algorithm for range queries. The modified algorithm contributes substantially to the improvement of search performance.
Chapter 7: Solving Approximate Similarity Queries. In this chapter, we present approaches to approximate query answering for the most important query types in an FQAS: single-feature and multi-feature nearest neighbor queries and range queries. The proposed approaches not only lessen the costs but also retain an acceptably high accuracy of the answers.
Chapter 8: Efficient Processing of Complex and Approximate Complex Similarity Joins. Although the work in this chapter is particularly devoted to the VQS, we first introduce novel and efficient approaches to complex and approximate complex similarity join processing for this system, and then present a generalization of the approach to complex similarity join processing, in which the generalized join condition can be a set of arbitrary user-defined predicates on the input records/tuples. An approximate approach to this generalized problem is also outlined there.
Chapter 9: Conclusions and Future Work. This last chapter gives conclusions and discusses relevant issues of great interest for future research.
This chapter is mainly dedicated to giving an overview of prominent research in the area of FQASs, completed or in progress, so that we can appreciate the importance of the problem more precisely and observe its growth and development over the last two decades.
In principle, the area of developing FQASs is concerned with enhancing inquiry or query answering systems into technology that can be experienced as "intelligent" or "flexible". The common emphasis is on the problems of users posing queries and systems producing answers. This requires searching for more advanced techniques for analyzing and processing users’ queries than those normally used in database systems and search engines. In other words, FQASs are concerned with the vital need to provide easy, flexible, informative, and intuitive access to information for every type of need [DKW2002c, BMP2001]. Such systems are mainly intended to facilitate retrieval from information repositories such as databases, multimedia collections, libraries, and the World Wide Web. These repositories are typically equipped with standard query systems, which are often rigid and inadequate and need to be improved.
Although this topic, as just mentioned, draws on several research areas, including databases, information retrieval, knowledge representation, soft computing, multimedia, human-computer interaction, etc. [BMP2001], in this chapter we focus only on attempts at supporting flexible retrieval capabilities in DBMSs and modern IR/multimedia systems. The main reason is that the relevant research in these two areas is very close to our work, which aims at developing an efficient semantic based FQAS like the VQS [KuP1997] (cf. chapter 3). Even so, note that there are basic differences between database management systems and information retrieval systems [Rij1979, SaM1983, Ing1992, BaR1999]. They are briefly pointed out below before we go into the details of previous research corresponding to each of them.
As we know, with the increasing production and exchange of multimedia information through the Internet, the need for effective IR systems is a crucial issue nowadays. The main aim of an IR system is to serve the information needs of a user. This is in contrast to database management systems, e.g. RDBMSs, which mainly aim at serving the data needs of the users. Database management systems are built on a data model (relational or object-oriented) that expresses the data of the universe concerned in a given database. The data are then exploited as they are stored in the system by means of a query language, e.g. SQL. Such a data retrieval language aims at retrieving all objects satisfying clearly defined conditions such as those in a regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single erroneous object among a thousand retrieved objects means total failure [BaR1999]. Supporting flexible retrieval capabilities in DBMSs should address this data need issue efficiently.
Information retrieval aims at modeling, designing, and implementing systems able to provide efficient and effective content-based access to a large amount of information. This information is extracted from the data objects, e.g. images or text documents. The difficulty lies not only in knowing how to extract this information but also in knowing how to use it to decide which information is relevant to the user’s query. Thus, the notion of relevance is at the center of information retrieval [BaR1999]. In fact, the primary goal of an IR system is to retrieve all the documents that are relevant to a user’s query while retrieving as few non-relevant documents as possible. The key difference among systems supporting flexible retrieval capabilities is the method they use to understand and process users’ queries in pursuit of these goals. In other words, it is their viewpoint on users’ information/data needs as well as how they define and create the information/data relevant to a user’s query.
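The two halves of this goal are commonly quantified as recall (the share of relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). These two measures are standard in IR but are not named in the text above; the function name and document identifiers below are purely illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical answer set and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Under the all-or-nothing view of data retrieval, anything short of precision 1.0 counts as failure, whereas an IR system trades the two measures against each other.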
From this perspective, in the following sections we introduce some of the most prominent systems and approaches related to the topic under discussion. As mentioned above and for the sake of clarity, they are classified into two main categories: (1) supporting flexible retrieval capabilities in DBMSs (section 2.2) and (2) modern IR systems (section 2.3). Section 2.4 discusses the problem from a wider view. Finally, section 2.5 gives conclusions.
2.2 Supporting Flexible Retrieval Capabilities in DBMSs
Generally, in the context of DBMSs, when the data stored in the database do not match a user’s query precisely, the system returns an empty result set to the user. To get satisfactory data, the user then has to retry the query repeatedly with alternative values for certain query conditions until it matches some data. This solution, however, is applicable only if the user is aware of the close alternatives; otherwise it is infeasible. There are plenty of proposals to solve this problem. Most of them are concerned either with the queries that can be posed by the users or with the nature of the data managed by the system, which can be “imperfect”, e.g. fuzzy database systems [Pet1996]. With respect to the former problem, a system is called flexible if it allows users to express preferences and/or the importance of criteria (i.e. query conditions) in order to rank the retrieved data objects according to their adequacy [BMP2001]. Many mechanisms, such as distances [IcH1986, Mot1988, etc.] or fuzzy sets [BoP1994, BLP1995, BDP+1997, etc.], have been introduced to accomplish this objective. The latter problem can be seen as a kind of refinement of NULL values in conventional databases. Different frameworks and tools have been proposed to represent this kind of information, as presented in [MoS1997]. Besides, flexibility can also be understood in many other ways, such as the capability of a system to answer a query in a cooperative fashion [BMP2001]. Such cooperative database systems are concerned with techniques that enhance database management systems with cooperative behavior that usually imitates some aspects of human behavior [Mot2000, GGM1992].
Many of the techniques proposed for supporting such flexible retrieval capabilities in DBMSs have not progressed beyond modest proof-of-concept demonstrations. A critical challenge is to construct viable prototypes that will convince both users and software developers of their usefulness and practicality [Mot2000]. In the subsections below, we briefly describe some of the most prominent prototype systems that enhance database management systems with flexibility in answering users’ queries.
2.2.1 ARES
ARES (Associative Information Retrieval System) [IcH1986] was introduced to equip a relational database with the capability of interpreting queries flexibly. Here, a flexible interpretation means that data items that are semantically close to an exact match for the query conditions can also be obtained when the user so wishes.
The core of ARES is the concept of a similarity relation, which represents the similarity between elements of a domain. This special type of relation specifies inner-domain data similarity and differs from conventional relations, which are formally defined as a subset of the Cartesian product of n domains and express relationships among elements of those domains. Table 2.1b illustrates part of an example similarity relation for the field Job of a table Employee (Table 2.1a), in which each pair of attribute values in the job domain is associated with a certain similarity degree (a smaller value means the two jobs are more similar).
Employee
Name      Job          Salary
Josef     Assistant    3000
Roland    Director     5000
Based on such similarity relations and a new operator “similar to”, which means “approximately equal to”, ARES functionally extends the relational algebra operations select, project, and join to allow a certain degree of ambiguity in the interpretation of the associated conditions. The extended relational algebra operations are called ambiguous select, ambiguous project, and ambiguous join, respectively.
The extended relational algebra operations are executed after being translated into a combination of conventional ones. Therefore, no special techniques are needed to execute them, and ARES can easily be added to an existing relational database (see [IcH1986] for more details about the system organization and several useful graphics-based tools of ARES).
When a user asks ARES to interpret query conditions flexibly, the user must supply a threshold for each of these query conditions. The global distance, called TOTAL_SIMILARITY in ARES, is defined as the sum of the elementary similarity values tied to the flexible query conditions of the query. The tuples are then sorted according to their global distances, and the system outputs as many tuples as possible within the limit specified by the user (see also [IcH1986] for the implementation of ARES).
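The threshold test, the summation into TOTAL_SIMILARITY, and the final sort can be illustrated with a small in-memory sketch. It does not reproduce how ARES translates ambiguous operations into conventional relational ones. The similarity degrees, the Salary similarity relation, the third employee, and all function names are hypothetical.

```python
def make_relation(pairs):
    """Build a symmetric similarity relation from {(a, b): degree} entries."""
    table = {}
    for (a, b), degree in pairs.items():
        table[(a, b)] = table[(b, a)] = degree
    return table

def similarity(relation, a, b):
    """Look up the similarity degree of two values (0 for identical values)."""
    return 0 if a == b else relation[(a, b)]

# Hypothetical similarity relations; smaller degrees mean more similar values.
SIMILARITY = {
    "Job": make_relation({("Assistant", "Secretary"): 1,
                          ("Assistant", "Director"): 4,
                          ("Secretary", "Director"): 3}),
    "Salary": make_relation({(2500, 3000): 1, (3000, 5000): 3, (2500, 5000): 4}),
}

EMPLOYEE = [
    {"Name": "Josef", "Job": "Assistant", "Salary": 3000},
    {"Name": "Roland", "Job": "Director", "Salary": 5000},
    {"Name": "Maria", "Job": "Secretary", "Salary": 2500},  # hypothetical row
]

def ambiguous_select(rows, conditions, limit):
    """conditions: list of (attribute, wanted value, threshold) triples."""
    answers = []
    for row in rows:
        degrees = [similarity(SIMILARITY[attr], row[attr], wanted)
                   for attr, wanted, _ in conditions]
        # Every flexible condition must stay within its own threshold.
        if all(d <= thr for d, (_, _, thr) in zip(degrees, conditions)):
            answers.append((sum(degrees), row["Name"]))  # sum = TOTAL_SIMILARITY
    answers.sort()            # best (smallest) global distance first
    return answers[:limit]    # at most `limit` tuples are output

result = ambiguous_select(
    EMPLOYEE, [("Job", "Secretary", 1), ("Salary", 2500, 2)], limit=5)
```

Here Roland is filtered out by the Job threshold, and the remaining tuples are ranked by the sum of their two similarity degrees.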
Although ARES addresses the basic issue of similarity matching and is a useful system that can benefit its users, it has some flaws that dramatically limit its usefulness [Mot1988]. First, each similarity relation needs n² entries for n different attribute values in the corresponding conventional relation, which leads to a high storage cost. The maintenance cost of similarity relations is also high because, when a new attribute value is added, 2n+1 additional entries are necessary in the corresponding similarity relation [DKW2002b]. Second, ARES does not allow similarity to be defined between attribute values of infinite domains, for example the similarity between two given strings, because similarities can only be defined by means of tables. Third, ARES does not allow multiple metrics for the same domain and cannot adapt itself to the views and priorities of its individual users. Last, the extensions of the relational data model in ARES are unnecessarily complex. Altogether, ARES is a notable idea, but these flaws make it less promising than expected.
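The two cost figures quoted above are consistent with each other if one assumes that the similarity relation stores one entry for every ordered pair of attribute values. The value n = 100 is only an illustrative choice:

```latex
\[
(n+1)^2 - n^2 = 2n + 1,
\qquad \text{e.g. } n = 100:\quad 101^2 - 100^2 = 10201 - 10000 = 201 .
\]
```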
2.2.2 VAGUE
VAGUE [Mot1988] is a system that resembles ARES in its overall goals. It extends the relational data model with data metrics and the standard query language with a comparator similar-to. To express a query, users need only know about this new comparator. In fact, the “metricized” database model introduced in VAGUE is a generalization of the relational database model, since a “nonmetricized” database is a particular type of metricized one, as we will see below. In “metricized” databases, each attribute belongs to a domain, which is a set of values (possibly infinite). Each domain has at least one data metric associated with it that defines the distance between its values. VAGUE provides four types of data metrics [Mot1988]:
Computational metric: A data metric is computational if it derives its distances by computation only (i.e. no retrievals are required). An example of a computational metric on a numerical domain, such as DATE, SALARY, etc., is the absolute value of the difference between two numbers. An example of a computational metric on a non-numerical domain, such as PERSON_NAME, FILM_TITLE, etc., is a procedure that determines the degree of similarity between two strings of characters¹.
Tabular metric: A data metric is tabular if it derives its distances by retrieval only (i.e. no computations are required). The distance between every two values of the domain is stored in a table, and the metric simply looks it up in this table. An example of a tabular metric is the geographic distance between locations. These tables are equivalent to the similarity relations in ARES.
1 A well-known metric for determining the degree of similarity between two arbitrary strings was devised by Vladimir Levenshtein [Lev1966].
Referential metric: If an attribute is mapped to another relation in which this attribute is the key, then the distance between two arbitrary values of this attribute can be derived by a certain combination of the individual distances between the corresponding elements in the mapped relation. Such metrics are called referential metrics. Note that some individual distances may be given more weight than others, and some distances may be ignored altogether. Motro gave the following example to illustrate this metric: Consider a relation FILM with the attributes TITLE, DIRECTOR, CATEGORY, and RATING. Assume that TITLE is the key attribute, and let (Psycho, Hitchcock, Suspense, 3.5) and (Modern_Times, Chaplin, Comedy, 4.0) be two tuples from this relation. The distance between the titles Psycho and Modern_Times may be derived by some combination of the individual distances between Hitchcock and Chaplin, Suspense and Comedy, and 3.5 and 4.0.
Default metric: When a domain cannot be provided with a suitable metric, the following DEFAULT metric should be employed, and the domain then behaves as in conventional relational databases:
\[ \mathrm{DEFAULT}(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases} \]
In addition, VAGUE allows multiple metrics over each domain, with the ability to select the appropriate metric for each query. During query processing, if the operator “~” (similar-to) occurs, the query processor selects the appropriate metric to calculate the results. The results contain the best matches, i.e. the values nearest under the selected metric, and they can be sorted if the user wishes. Besides, the author also addressed the problem of incomplete information (NULL values) and proposed ways to deal with it.
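The four kinds of data metrics can be sketched as follows. This is an illustration under stated assumptions rather than Motro's implementation: the category distance table, the value 1 returned by the default metric for unequal values, and the weighted sum used as the “combination” in the referential metric are all assumptions, and the function names are invented.

```python
def levenshtein(s, t):
    """Computational metric on strings: edit distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def numeric_metric(x, y):
    """Computational metric on numbers: absolute difference."""
    return abs(x - y)

def default_metric(x, y):
    """Default metric: only equality matters (1 for unequal values is an assumption)."""
    return 0 if x == y else 1

# Tabular metric: distances are looked up, not computed (made-up value).
CATEGORY_TABLE = {frozenset(["Suspense", "Comedy"]): 5}

def category_metric(x, y):
    return 0 if x == y else CATEGORY_TABLE[frozenset([x, y])]

# FILM relation keyed by TITLE, holding (DIRECTOR, CATEGORY, RATING).
FILM = {
    "Psycho": ("Hitchcock", "Suspense", 3.5),
    "Modern_Times": ("Chaplin", "Comedy", 4.0),
}

def title_metric(t1, t2, weights=(1.0, 1.0, 1.0)):
    """Referential metric: weighted sum of distances between the referenced tuples."""
    d1, c1, r1 = FILM[t1]
    d2, c2, r2 = FILM[t2]
    parts = (default_metric(d1, d2), category_metric(c1, c2), numeric_metric(r1, r2))
    # A weight of 0 ignores the corresponding individual distance altogether.
    return sum(w * p for w, p in zip(weights, parts))
```

Because each metric is an ordinary function, several of them could be registered for the same domain and one selected per query, which is the multiple-metric capability described above.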
Although VAGUE is a useful system, its design represents a compromise between the conflicting requirements of simplicity, flexibility, and efficiency. For example, the users of VAGUE cannot provide their own similarity thresholds for each vague qualification; instead, when a vague query does not match any data, VAGUE doubles all search radii simultaneously, and search performance can thus deteriorate considerably. In other words, the performance problem has been neglected in the VAGUE system.
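The radius-doubling behavior criticized here can be sketched as a simple loop. The sketch scans the whole relation on every retry, which hints at why repeated relaxation is costly. The Age attribute, the initial radii, the retry cap, and the function name are assumptions for illustration only.

```python
def vague_select(rows, conditions, max_rounds=10):
    """conditions: {attribute: (target, metric, initial radius)}.

    Returns (matching rows, radii in effect when the search stopped).
    """
    radii = {attr: radius for attr, (_, _, radius) in conditions.items()}
    for _ in range(max_rounds):
        matches = [row for row in rows
                   if all(metric(row[attr], target) <= radii[attr]
                          for attr, (target, metric, _) in conditions.items())]
        if matches:
            return matches, radii
        # No data matched: double every search radius at once and retry.
        radii = {attr: 2 * radius for attr, radius in radii.items()}
    return [], radii

def absolute_difference(a, b):
    return abs(a - b)

# Hypothetical data: the Age column is invented for the example.
rows = [
    {"Name": "Josef", "Salary": 3000, "Age": 30},
    {"Name": "Roland", "Salary": 5000, "Age": 50},
]
matches, radii = vague_select(rows, {
    "Salary": (4200, absolute_difference, 100),
    "Age": (45, absolute_difference, 2),
})
```

In this run, three doublings are needed before Roland matches, and the Age radius has grown to 16 although a radius of 5 would have sufficed for that condition.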
2.2.3 FLEX
FLEX [Mot1990] is a cooperative query answering system. It is a single user interface to relational databases that can be used satisfactorily by users with different levels of expertise. FLEX is based on a formal query language but is tolerant of incorrect input. It never rejects users’ queries because its design is highly modular, consisting of several techniques for processing requests of decreasing well-formedness. In other words, each input is passed through this series of techniques until an interpretation is established. As a result, FLEX can serve a wider variety of users because it adapts flexibly and transparently to their level of well-formedness and provides an interpretation at that level. It also never delivers empty answers without explanation or help, which makes FLEX a cooperative system. Figure 2.1 below illustrates the overall architecture of FLEX². For detailed descriptions of each component in the architecture, the reader is referred to [Mot1990]. Here we briefly emphasize the different levels of query interpretation capability in FLEX. Specifically, four levels can be distinguished in this system:
First, at the top level, the parser parses a user’s query, which is composed in a simple editor and submitted to it. If the parser succeeds, the query is passed to the query processor and the user receives a standard response. Expert users are expected to interact with FLEX mostly at this level, which is indistinguishable from standard formal query interfaces [Mot2000].
2 This figure is reproduced from [Mot1990].
Figure 2.1 Overall architecture of FLEX
Second, if the query cannot be parsed successfully, it is given to the corrector, where FLEX employs various techniques to address common formalism errors. If the correction succeeds, the corrected query is processed and an answer is obtained. This level can assist users with moderate experience or experienced users who make occasional mistakes. Note that at this level, the corrected query might not match the user’s original intention exactly.
Third, if the query cannot be corrected, an attempt is made to synthesize a common query from the tokens that are recognized in the input. This is done by the synthesizer. The synthesized query is then presented to the user for approval or editing. This level is intended to support naive users who may have difficulties constructing queries from scratch but are able to comprehend the synthesized queries and possibly even alter them in some ways [Mot2000]. At this level, the risk of departing from the user’s original intention is even greater.
Finally, if the query cannot be synthesized or the synthesized query is rejected by the user (see the interaction component in Figure 2.1), a browser is engaged to display frames of information extracted from the database about the recognized input tokens. Hence, FLEX is able to adapt its interpretation to the varying correctness and well-formedness of the input.
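The cascade of four levels can be pictured as a chain of fallbacks in which each stage runs only when the previous one fails. The sketch below is a toy: the one-line query grammar, the schema, the typo table, and every function in it are hypothetical stand-ins for the FLEX components, which are far richer in [Mot1990].

```python
import re

SCHEMA = {"employee": {"name", "job", "salary"}}  # hypothetical schema
TYPOS = {"selct": "select", "form": "from"}       # hypothetical common errors
QUERY = re.compile(r"select (\w+) from (\w+)$")   # toy formal query language

def parse(text):
    """Level 1: accept only well-formed queries over the known schema."""
    m = QUERY.match(text.strip().lower())
    if m and m.group(2) in SCHEMA and m.group(1) in SCHEMA[m.group(2)]:
        return {"column": m.group(1), "table": m.group(2)}
    return None

def correct(text):
    """Level 2: repair known formalism errors, then parse again."""
    fixed = " ".join(TYPOS.get(word, word) for word in text.lower().split())
    return parse(fixed)

def synthesize(text):
    """Level 3: build a plausible query from recognized tokens."""
    words = set(re.findall(r"\w+", text.lower()))
    for table, columns in SCHEMA.items():
        hits = sorted(words & columns)
        if table in words and hits:
            return {"column": hits[0], "table": table}
    return None

def interpret(text, approve=lambda query: True):
    """Run the levels in order; fall back to browsing recognized tokens."""
    query = parse(text)
    if query:
        return "parsed", query
    query = correct(text)
    if query:
        return "corrected", query
    query = synthesize(text)
    if query and approve(query):  # the user may reject a synthesized query
        return "synthesized", query
    known = set(SCHEMA) | set().union(*SCHEMA.values())
    return "browse", sorted(set(re.findall(r"\w+", text.lower())) & known)
```

For instance, `interpret("selct job form employee")` is resolved at the corrector level, while `interpret("what about salary")` falls through to browsing on the single recognized token.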
In addition to the four processing levels, FLEX also deals with empty result sets and partial answers (see [Mot2000] for more details about kinds of queries in cooperative systems). In other words, it exhibits cooperative behavior as well³. If the answer is empty, the input is passed to the query generalizer. The generalizer then suggests related queries that have nonempty answers, or it points out erroneous presuppositions.
3 See [Mot2000, GGM1992] for overviews of cooperative database systems.
runway length TAH in which the Medium-Range (i.e., from 4000 to 8000 ft.) is a more abstract representation than a specific runway length in the same TAH (e.g., 6000 ft.) (see [CYC+1996] for the details and other examples).
Figure 2.2 An example runway length TAH
In addition, CoBase is equipped with knowledge discovery tools that generate TAHs automatically from data sources. Some of the cooperative techniques implemented in CoBase are as follows: query relaxation for queries that have empty answers (a query can be modified by relaxing its conditions through operations such as generalization (moving up the TAH) and specialization (moving down the TAH)); conceptual-level queries (containing concepts that may not be expressed in the database schema, e.g., “runway-length = Medium-Range”) for users who do not know the precise schema; and the augmentation of queries with associative (relevant/related) information.
CoBase provides an extension of SQL named CoSQL that allows its users to explicitly specify relaxation operations and controls. With CoSQL, for example, users can specify “approximate” values and ask “similar-to/near-to” queries. Approximate values (e.g., find flights arriving at about 18:00) are translated into intervals of values, and the results of similar-to queries are ranked according to the degree of similarity of their attributes to the specified value. Moreover, to provide relevant information with the query answers, a pattern-based framework is used for deriving associative (relevant) information from past cases.
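Generalization along a TAH and the translation of an approximate value into an interval can be sketched together. Only the Medium-Range interval (4000 to 8000 ft) is taken from the text. The other two ranges, the half-open interval boundaries, the runway data, the 30-minute tolerance, and the function names are assumptions, and nothing here reflects actual CoSQL syntax.

```python
# One level of a runway-length TAH. Only Medium-Range comes from the text;
# the other concepts and the half-open boundaries are assumptions.
TAH = {
    "Short-Range": (0, 4000),
    "Medium-Range": (4000, 8000),
    "Long-Range": (8000, 12000),
}

def generalize(length_ft):
    """Move up the TAH: replace a specific value by the concept covering it."""
    for concept, (low, high) in TAH.items():
        if low <= length_ft < high:
            return concept, (low, high)
    raise ValueError(f"no concept covers {length_ft} ft")

def relaxed_select(runways, length_ft):
    """Exact match first; if the answer is empty, relax the condition via the TAH."""
    exact = sorted(name for name, length in runways.items() if length == length_ft)
    if exact:
        return exact, None
    concept, (low, high) = generalize(length_ft)
    relaxed = sorted(name for name, length in runways.items() if low <= length < high)
    return relaxed, concept

def about(value, tolerance):
    """Translate an approximate value into an interval of values."""
    return value - tolerance, value + tolerance

RUNWAYS = {"A": 5500, "B": 7000, "C": 9500}  # hypothetical runway lengths in ft
```

A query for exactly 6000 ft finds nothing and is relaxed to Medium-Range, returning runways A and B. An arrival at about 18:00, expressed in minutes after midnight with an assumed 30-minute tolerance, becomes the interval `about(18 * 60, 30)`.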
2.2.5 VQS
Building on the basic ideas of ARES and VAGUE, VQS (Vague Query System) [KuP1997] has been designed as an “add-on” component on top of an existing database management system to serve both the data needs and the information needs of users. VQS is a semantic based FQAS: it can retrieve tuples that do not match the query exactly by employing semantic meta-information about the attributes of the query relation in conjunction with an extension of SQL called VQL (Vague Query Language).
In fact, the design of VQS resembles an integrated Information Retrieval/Database Management system [SaR1990, ScP1982, etc.], in which both the data needs and the information needs of users can be satisfied. The user data need is a well-studied and well-understood problem handled by database management systems, and the user information need is fulfilled by information retrieval systems. Integrating such systems can bring users many benefits. Nevertheless, building such an integrated system is not simple: it involves dealing with all the obstacles faced when developing each of them separately as well as the difficulties arising during the integration process. The extensions of VQS do not conflict with or damage the existing functionality of database management systems while still equipping them with the desired vague retrieval capabilities. We defer the details of VQS until chapter 3, where we describe and discuss its system architecture and achievements concretely.
2.3 Modern Information Retrieval Systems
Since the 1940s, the problem of information storage and retrieval has attracted increasing attention. From a conceptual point of view, information storage and retrieval is simple. Suppose there is a collection of documents stored in a repository and a user issues a request (e.g., by means of a formal query language such as SQL) whose answer is the subset of that collection satisfying the information need expressed by the query. These documents could simply be obtained by accessing all the available documents in the repository, retaining the relevant documents, and discarding all the others. Although this solution yields perfect retrieval, meaning that all the relevant documents are guaranteed to be obtained, it is impractical. The main reason is that accessing all the available documents in most cases leads to an unacceptably slow response time. This is obviously undesirable, especially for applications dealing with large data repositories.
In most traditional information retrieval systems, i.e. text document retrieval systems, the problem has been solved by means of indices. Documents in a collection are frequently represented through a set of index terms or keywords. Such keywords are extracted directly from the text of the document automatically and/or manually. No matter whether these representative keywords are derived automatically or generated by a specialist, they provide a logical view of the document [BaR1999]. Based on the extracted keywords, one can build an index for the document collection. An index is a critical data structure because it allows fast searching over large volumes of data at a low IO-cost. There are numerous index structures (see chapter 4 for more information on indices), but in traditional IR systems the inverted file⁵ is the most popular one. Thanks to the index, the user’s query is later processed much more efficiently because only a small part of the documents in the database must be accessed. There are also plenty of models for deciding whether a document is relevant to a query or how relevant it is. Those models include the Boolean model [Lan1979], the vector space retrieval model [Sal1989, SaM1983, Sal1971, Sal1968], the extended Boolean model [SFW1983], and so forth. For more surveys and details of traditional IR systems, see [Rij1979, SaM1983, Ing1992].
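An inverted file and Boolean query evaluation over it can be sketched in a few lines. The documents, the whitespace tokenization, and the function names are assumptions. A production index would also handle stemming, stop words, and on-disk posting lists.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map every index term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing all of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one of the given terms."""
    return set().union(*(index.get(term, set()) for term in terms))

# Hypothetical three-document collection.
docs = {
    1: "flexible query answering",
    2: "query processing in databases",
    3: "image retrieval",
}
index = build_inverted_file(docs)
```

A query such as `boolean_and(index, "query", "flexible")` touches only the two posting lists involved and never scans the documents themselves, which is why the index keeps the number of accessed documents small.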
Over the last 20-plus years, the area of information retrieval has grown well beyond its primary goals of indexing text and searching for useful documents in a collection [BaR1999]. Nowadays, with the growth of the World Wide Web and its products, the area of information retrieval faces new challenges. Research in modern IR now includes many more topics, such as modeling, document classification and categorization, systems architecture, user interfaces and human-computer interaction, data visualization, filtering, languages, etc. Even so, the main aim of an IR system is intrinsically still to identify the information relevant to the needs expressed by a user through a formal query, which is formulated in a certain formal query language, e.g., extended SQL. To achieve this aim, modern IR systems employ the same mechanism as traditional IR systems: both documents and queries are represented in a formal way, and a matching technique is used to compare the two representations in order to estimate the relevance of documents to the given query. Nevertheless, most existing IR systems are based on simple models and commonly suffer from low effectiveness. For the Web in particular, the main obstacle is the absence of a well-defined underlying data model, which implies that information definition and structure are frequently of low quality [BaR1999]. A promising direction for improving IR systems is to make them flexible, i.e. tolerant of the imprecision, partial truth, uncertainty, and approximation that characterize various stages of the IR process, and able to learn the user’s concept of relevance [BMP2001] (here again, note that the notion of relevance is at the center of information retrieval).
5 An inverted file is a representation of a collection that is essentially an index. For each keyword or index term that appears in the collection of documents, an inverted file lists each document in which it appears. This representation is especially useful for performing Boolean queries.
Abundant formal theories and techniques can be employed to define flexible IR systems, e.g., probability theory, fuzzy set theory, neural networks, etc. Although modern information retrieval involves many state-of-the-art issues that need special attention from researchers, in the following subsections we introduce only several modern multimedia/image retrieval systems. These multimedia/image IR systems are important representatives not only of contemporary research in the field of modern information retrieval, but also of crucial issues in the commercial world and the digital age. The reason is intuitive: with the very fast development of Internet technologies and the growing demand of Internet users for exchanging multimedia information via the Internet, such efficient modern IR systems are obviously desirable.
2.3.1 QBIC
QBIC, standing for Query By Image Content, is the first commercial content-based image retrieval system and was developed by the IBM Almaden Research Center [NBE+1993]⁶. The QBIC system allows queries on large image and video databases based on example images, user-constructed sketches and drawings, selected color and texture patterns, camera and object motion, and other graphical information [FSN+1995]. To achieve this functionality, QBIC has two main components: database population and database query. In the former process, images and videos
6 The previous standard approach to searching images and video was to create text annotations describing the content of each image and to store these annotations in a standard database, e.g., a relational database. The images themselves are not really part of the database but are referenced by text strings or pointers. This approach is tedious and prohibitively expensive as well as user- and purpose-dependent [PPS1994].