
Information Retrieval: Data Structures & Algorithms

edited by William B. Frakes and Ricardo Baeza-Yates

FOREWORD

PREFACE

CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL SYSTEMS

CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND ALGORITHMS RELATED TO INFORMATION RETRIEVAL

CHAPTER 3: INVERTED FILES

CHAPTER 4: SIGNATURE FILES

CHAPTER 5: NEW INDICES FOR TEXT: PAT TREES AND PAT ARRAYS

CHAPTER 6: FILE ORGANIZATIONS FOR OPTICAL DISKS

CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS

CHAPTER 8: STEMMING ALGORITHMS

CHAPTER 9: THESAURUS CONSTRUCTION

CHAPTER 10: STRING SEARCHING ALGORITHMS

CHAPTER 11: RELEVANCE FEEDBACK AND OTHER QUERY MODIFICATION TECHNIQUES

CHAPTER 12: BOOLEAN OPERATIONS

CHAPTER 13: HASHING ALGORITHMS


CHAPTER 14: RANKING ALGORITHMS

CHAPTER 15: EXTENDED BOOLEAN MODELS

CHAPTER 16: CLUSTERING ALGORITHMS

CHAPTER 17: SPECIAL-PURPOSE HARDWARE FOR INFORMATION RETRIEVAL

CHAPTER 18: PARALLEL INFORMATION RETRIEVAL ALGORITHMS

FOREWORD

Udi Manber

Department of Computer Science, University of Arizona

In the not-so-long-ago past, information retrieval meant going to the town's library and asking the librarian for help. The librarian usually knew all the books in his possession, and could give one a definite, although often negative, answer. As the number of books grew, and with them the number of libraries and librarians, it became impossible for one person or any group of persons to possess so much information. Tools for information retrieval had to be devised. The most important of these tools is the index: a collection of terms with pointers to places where information about them can be found. The terms can be subject matters, author names, call numbers, etc., but the structure of the index is essentially the same. Indexes are usually placed at the end of a book, or, in another form, implemented as card catalogs in a library. The Sumerian literary catalogue, of c. 2000 B.C., is probably the first list of books ever written. Book indexes appeared in a primitive form in the 16th century, and by the 18th century some were similar to today's indexes. Given the incredible technology advances of the last 200 years, it is quite surprising that today, for the vast majority of people, an index, or a hierarchy of indexes, is still the only available tool for information retrieval! Furthermore, at least from my experience, many book indexes are not of high quality. Writing a good index is still more a matter of experience and art than a precise science.

Why do most people still use 18th century technology today? It is not because there are no other methods or no new technology. I believe that the main reason is simple: indexes work. They are extremely simple and effective to use for small to medium-size data. As President Reagan was fond of saying, "if it ain't broke, don't fix it." We read books in essentially the same way we did in the 18th century, we walk the same way (most people don't use small wheels, for example, for walking, although it is technologically feasible), and some people argue that we teach our students in the same way. There is a great comfort in not having to learn something new to perform an old task. However, with the information explosion just upon us, "it" is about to be broken. We not only have an immensely greater amount of information from which to retrieve, we also have much more complicated needs. Faster computers, larger capacity high-speed data storage devices, and higher bandwidth networks will all come along, but they will not be enough. We will need better techniques for storing, accessing, querying, and manipulating information.

It is doubtful that in our lifetime most people will read books, say, from a notebook computer, that people will have rockets attached to their backs, or that teaching will take a radical new form (I dare not even venture what form), but it is likely that information will be retrieved in many new ways, by many more people, and on a grander scale.

I exaggerated, of course, when I said that we are still using ancient technology for information retrieval. The basic concept of indexes, searching by keywords, may be the same, but the implementation is a world apart from the Sumerian clay tablets. And information retrieval of today, aided by computers, is not limited to search by keywords. Numerous techniques have been developed in the last 30 years, many of which are described in this book. There are efficient data structures to store indexes, sophisticated query algorithms to search quickly, data compression methods, and special hardware, to name just a few areas of extraordinary advances. Considerable progress has been made for even seemingly elementary problems, such as how to find a given pattern in a large text with or without preprocessing the text. Although most people do not yet enjoy the power of computerized search, and those who do cry for better and more powerful methods, we expect major changes in the next 10 years or even sooner. The wonderful mix of issues presented in this collection, from theory to practice, from software to hardware, is sure to be of great help to anyone with an interest in information retrieval.

An editorial in the Australian Library Journal in 1974 states that "the history of cataloging is exceptional in that it is endlessly repetitive. Each generation rethinks and reformulates the same basic problems, reframing them in new contexts and restating them in new terminology." The history of computerized cataloging is still too young to be in a cycle, and the problems it faces may be old in origin but new in scale and complexity. Information retrieval, as is evident from this book, has grown into a broad area of study. I dare to predict that it will prosper. Oliver Wendell Holmes wrote in 1872 that "It is the province of knowledge to speak and it is the privilege of wisdom to listen." Maybe, just maybe, we will also be able to say in the future that it is the province of knowledge to write and it is the privilege of wisdom to query.


PREFACE

Text is the primary way that human knowledge is stored, and after speech, the primary way it is transmitted. Techniques for storing and searching for textual documents are nearly as old as written language itself. Computing, however, has changed the ways text is stored, searched, and retrieved. In traditional library indexing, for example, documents could only be accessed by a small number of index terms such as title, author, and a few subject headings. With automated systems, the number of indexing terms that can be used for an item is virtually limitless.

The subfield of computer science that deals with the automated storage and retrieval of documents is called information retrieval (IR). Automated IR systems were originally developed to help manage the huge scientific literature that has developed since the 1940s, and this is still the most common use of IR systems. IR systems are in widespread use in university, corporate, and public libraries. IR techniques have also been found useful, however, in such disparate areas as office automation and software engineering. Indeed, any field that relies on documents to do its work could potentially benefit from IR techniques.

IR shares concerns with many other computer subdisciplines, such as artificial intelligence, multimedia systems, parallel computing, and human factors. Yet, in our observation, IR is not widely known in the computer science community. It is often confused with DBMS, a field with which it shares concerns and yet from which it is distinct. We hope that this book will make IR techniques more widely known and used.

Data structures and algorithms are fundamental to computer science. Yet, despite a large IR literature, the basic data structures and algorithms of IR have never been collected in a book. This is the need that we are attempting to fill. In discussing IR data structures and algorithms, we attempt to be evaluative as well as descriptive. We discuss relevant empirical studies that have compared the algorithms and data structures, and some of the most important algorithms are presented in detail, including implementations in C.

Our primary audience is software engineers building systems with text processing components. Students of computer science, information science, library science, and other disciplines who are interested in text retrieval technology should also find the book useful. Finally, we hope that information retrieval researchers will use the book as a basis for future research.

Bill Frakes

Ricardo Baeza-Yates

ACKNOWLEDGEMENTS

Many people improved this book with their reviews. The authors of the chapters did considerable reviewing of each others' work. Other reviewers include Jim Kirby, Jim O'Connor, Fred Hills, Gloria Hasslacher, and Ruben Prieto-Diaz. All of them have our thanks. Special thanks to Chris Fox, who tested the code on the disk that accompanies the book; to Steve Wartik for his patient unravelling of many LaTeX puzzles; and to Donna Harman for her helpful suggestions.


CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL SYSTEMS

1.1 INTRODUCTION

Automated information retrieval (IR) systems were originally developed to help manage the huge scientific literature that has developed since the 1940s. Many university, corporate, and public libraries now use IR systems to provide access to books, journals, and other documents. Commercial IR systems offer databases containing millions of documents in myriad subject areas. Dictionary and encyclopedia databases are now widely available for PCs. IR has been found useful in such disparate areas as office automation and software engineering. Indeed, any discipline that relies on documents to do its work could potentially use and benefit from IR.

This book is about the data structures and algorithms needed to build IR systems. An IR system matches user queries, that is, formal statements of information needs, to documents stored in a database. A document is a data object, usually textual, though it may also contain other types of data such as photographs, graphs, and so on. Often, the documents themselves are not stored directly in the IR system, but are represented in the system by document surrogates. This chapter, for example, is a document and could be stored in its entirety in an IR database. One might instead, however, choose to create a document surrogate for it consisting of the title, author, and abstract. This is typically done for efficiency, that is, to reduce the size of the database and searching time. Document surrogates are also called documents, and in the rest of the book we will use "document" to denote both documents and document surrogates.

An IR system must support certain basic operations. There must be a way to enter documents into a database, change the documents, and delete them. There must also be some way to search for documents, and present them to a user. As the following chapters illustrate, IR systems vary greatly in the ways they accomplish these tasks. In the next section, the similarities and differences among IR systems are discussed.

1.2 A DOMAIN ANALYSIS OF IR SYSTEMS

This book contains many data structures, algorithms, and techniques. In order to find, understand, and use them effectively, it is necessary to have a conceptual framework for them. Domain analysis, that is, systems analysis for multiple related systems, described in Prieto-Diaz and Arango (1991), is a method for developing such a framework. Via domain analysis, one attempts to discover and record the similarities and differences among related systems.

The first steps in domain analysis are to identify important concepts and vocabulary in the domain, define them, and organize them with a faceted classification. Table 1.1 is a faceted classification for IR systems, containing important IR concepts and vocabulary. The first row of the table specifies the facets, that is, the attributes that IR systems share. Facets represent the parts of IR systems that will tend to be constant from system to system. For example, all IR systems must have a database structure; they vary in the database structures they have: some have inverted file structures, some have flat file structures, and so on.

A given IR system can be classified by the facets and facet values, called terms, that it has. For example, the CATALOG system (Frakes 1984) discussed in Chapter 8 can be classified as shown in Table 1.2.

Terms within a facet are not mutually exclusive, and more than one term from a facet can be used for a given system. Some decisions constrain others. If one chooses a Boolean conceptual model, for example, then one must choose a parse method for queries.

Table 1.1: Faceted Classification of IR Systems (numbers in parentheses indicate chapters)

Conceptual          File              Query             Term              Document          Hardware
Model               Structure         Operations        Operations        Operations
---------------------------------------------------------------------------------------------------
Boolean (1)         Flat File (10)    Feedback (11)     Stem (8)          Parse (3,7)       von Neumann (1)
Extended            Inverted          Parse (3,7)       Weight (14)       Display           Parallel (18)
Boolean (15)        File (3)          Boolean (12)      Stoplist (7)      Sort              IR-Specific (17)
Probabilistic (14)  Signature (4)                       Thesaurus (9)     Field Mask        Optical Disk (6)
String Search (10)  PAT Trees (5)                       Truncation (10)   Assign IDs (3)    Mag Disk
                    Graphs                                                Rank (14)
                                                                          Cluster (16)

Table 1.2: Facets and Terms for CATALOG IR System

Facets                 Terms
-------------------    --------------------------------------------
File Structure         Inverted file
Query Operations       Parse, Boolean
Term Operations        Stem, Stoplist, Truncation
Hardware               von Neumann, Mag Disk
Document Operations    parse, display, sort, field mask, assign IDs
Conceptual Model       Boolean

Viewed another way, each facet is a design decision point in developing the architecture for an IR system. The system designer must choose, for each facet, from the alternative terms for that facet. We will now discuss the facets and their terms in greater detail.

1.2.1 Conceptual Models of IR

The most general facet in the previous classification scheme is conceptual model. An IR conceptual model is a general approach to IR systems. Several taxonomies for IR conceptual models have been proposed. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match. The exact match category contains text pattern search and Boolean search techniques. The inexact match category contains such techniques as probabilistic, vector space, and clustering, among others. The problem with these taxonomies is that the categories are not mutually exclusive, and a single system may contain aspects of many of them.

Almost all of the IR systems fielded today are either Boolean IR systems or text pattern search systems. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. The grep family of tools in the UNIX environment, described in Earhart (1986), is a well-known example of text pattern searchers. Data structures and algorithms for text pattern searching are discussed in Chapter 10.

Almost all of the IR systems for searching large document collections are Boolean systems. In a Boolean IR system, documents are represented by sets of keywords, usually stored in an inverted file. An inverted file is a list of keywords and identifiers of the documents in which they occur. Boolean list operations are discussed in Chapter 12. Boolean queries are keywords connected with Boolean logical operators (AND, OR, NOT). While Boolean systems have been criticized (see Belkin and Croft [1987] for a summary), improving their retrieval effectiveness has been difficult. Some extensions to the Boolean model that may improve IR performance are discussed in Chapter 15.

Researchers have also tried to improve IR performance by using information about the statistical distribution of terms, that is, the frequencies with which terms occur in documents, document collections, or subsets of document collections such as documents considered relevant to a query. Term distributions are exploited within the context of some statistical model such as the vector space model, the probabilistic model, or the clustering model. These are discussed in Belkin and Croft (1987). Using these probabilistic models and information about term distributions, it is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance. Ranking is useful because of the large document sets that are often retrieved. Ranking algorithms using the vector space model and the probabilistic model are discussed in Chapter 14. Ranking algorithms that use information about previous searches to modify queries are discussed in Chapter 11 on relevance feedback.

In addition to the ranking algorithms discussed in Chapter 14, it is possible to group (cluster) documents based on the terms that they contain and to retrieve from these groups using a ranking methodology. Methods for clustering documents and retrieving from these clusters are discussed in Chapter 16.

1.2.2 File Structures

A fundamental decision in the design of IR systems is which type of file structure to use for the underlying document database. As can be seen in Table 1.1, the file structures used in IR systems are flat files, inverted files, signature files, PAT trees, and graphs. Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.

Using a flat file approach, one or more documents are stored in a file, usually as ASCII or EBCDIC text. Flat file searching (Chapter 10) is usually done via pattern matching. On UNIX, for example, one can store a document collection one document per file in a UNIX directory, and search it using pattern searching tools such as grep (Earhart 1986) or awk (Aho, Kernighan, and Weinberger 1988).

An inverted file (Chapter 3) is a kind of indexed file. The structure of an inverted file entry is usually keyword, document-ID, field-ID. A keyword is an indexing term that describes the document, document-ID is a unique identifier for a document, and field-ID is a unique name that indicates from which field in the document the keyword came. Some systems also include information about the paragraph and sentence location where the term occurs. Searching is done by looking up query terms in the inverted file.
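To make the entry structure concrete, here is a minimal sketch in C of an inverted file entry and a lookup over entries kept sorted by keyword. The type and function names are ours for illustration, not code from the book or its accompanying disk.

    #include <string.h>

    /* One inverted file entry: keyword, document-ID, field-ID. */
    typedef struct {
        char keyword[32];
        int  doc_id;
        int  field_id;
    } PostingEntry;

    /* Binary search over entries sorted by keyword; returns the index
       of the first posting for the term, or -1 if the term is absent. */
    int find_first(const PostingEntry *idx, int n, const char *term)
    {
        int lo = 0, hi = n - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int cmp = strcmp(idx[mid].keyword, term);
            if (cmp < 0)      lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else { found = mid; hi = mid - 1; }  /* keep scanning left */
        }
        return found;
    }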

Signature files (Chapter 4) contain signatures, bit patterns that represent documents. There are various ways of constructing signatures. Using one common signature method, for example, documents are split into logical blocks each containing a fixed number of distinct significant, that is, non-stoplist (see below), words. Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1. The signatures of each word in a block are OR'ed together to create a block signature. The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
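The following C sketch illustrates the block signature scheme just described: each word is hashed to a bit pattern with a few bits set, and the patterns of the words in a block are OR'ed together. The hash function and the constants are illustrative placeholders, not the methods of Chapter 4.

    #include <stdint.h>

    #define SIG_BITS 64          /* signature width (illustrative)  */
    #define BITS_PER_WORD 3      /* bits set per hashed word        */

    /* Hash a word to a bit pattern with a few bits set. */
    static uint64_t word_signature(const char *w)
    {
        uint64_t sig = 0, h = 5381;
        for (int i = 0; i < BITS_PER_WORD; i++) {
            for (const char *p = w; *p; p++)
                h = h * 33 + (unsigned char)*p + i;  /* vary per bit */
            sig |= (uint64_t)1 << (h % SIG_BITS);
        }
        return sig;
    }

    /* OR the word signatures of a block into one block signature. */
    uint64_t block_signature(const char *words[], int nwords)
    {
        uint64_t sig = 0;
        for (int i = 0; i < nwords; i++)
            sig |= word_signature(words[i]);
        return sig;
    }

    /* A block may contain the query word only if all the query's bits
       are set in the block signature (false drops remain possible). */
    int may_contain(uint64_t block_sig, const char *query_word)
    {
        uint64_t q = word_signature(query_word);
        return (block_sig & q) == q;
    }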

PAT trees (Chapter 5) are Patricia trees constructed over all sistrings in a text. If a document collection is viewed as a sequentially numbered array of characters, a sistring is a subsequence of characters from the array starting at a given point and extending an arbitrary distance to the right. A Patricia tree is a digital tree where the individual bits of the keys are used to decide branching.


Graphs, or networks, are ordered collections of nodes connected by arcs. They can be used to represent documents in various ways. For example, a kind of graph called a semantic net can be used to represent the semantic relationships in text that are often lost in the indexing systems above. Although interesting, graph-based techniques for IR are impractical now because of the amount of manual effort that would be needed to represent a large document collection in this form. Since graph-based approaches are currently impractical, we have not covered them in detail in this book.

1.2.3 Query Operations

Queries are formal statements of information needs put to the IR system by users. The operations on queries are obviously a function of the type of query, and the capabilities of the IR system. One common query operation is parsing (Chapters 3 and 7), that is, breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators (Chapter 12).

In feedback (Chapter 11), information from previous searches is used to modify queries. For example, terms from relevant documents found by a query may be added to the query, and terms from nonrelevant documents deleted. There is some evidence that feedback can significantly improve IR performance.

1.2.4 Term Operations

Operations on terms in an IR system include stemming (Chapter 8), truncation (Chapter 10), weighting (Chapter 14), and stoplist (Chapter 7) and thesaurus (Chapter 9) operations. Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. Truncation is manual conflation of terms by using wildcard characters in the word, so that the truncated term will match multiple words. For example, a searcher interested in finding documents about truncation might enter the term "truncat?", which would match terms such as truncate, truncated, and truncation. Another way of conflating related terms is with a thesaurus, which lists synonymous terms and sometimes the relationships among them. A stoplist is a list of words considered to have no indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there.
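As an illustration of the stoplist check, the C fragment below looks a candidate term up in a small sorted stoplist using the standard bsearch routine; the word list is a toy sample, not the stoplists of Chapter 7.

    #include <stdlib.h>
    #include <string.h>

    /* Comparison function for bsearch over an array of strings. */
    static int cmp_words(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    /* Returns nonzero if the term appears in the (sorted) stoplist. */
    int is_stopword(const char *term)
    {
        static const char *stoplist[] =
            { "a", "an", "and", "of", "the", "to" };
        const char *key = term;
        return bsearch(&key, stoplist,
                       sizeof stoplist / sizeof stoplist[0],
                       sizeof stoplist[0], cmp_words) != NULL;
    }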

In term weighting, indexing or query terms are assigned numerical values, usually based on information about the statistical distribution of terms, that is, the frequencies with which terms occur in documents, document collections, or subsets of document collections such as documents considered relevant to a query.

1.2.5 Document Operations

Documents are the primary objects in IR systems and there are many operations for them. In many types of IR systems, documents added to a database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms. Once in the database, one sometimes wishes to mask off certain fields for searching and display. For example, the searcher may wish to search only the title and abstract fields of documents for a given query, or may wish to see only the title and author of retrieved documents. One may also wish to sort retrieved documents by some field, for example by author. There are many sorting algorithms, and because of the generality of the subject we have not covered it in this book. A good description of sorting algorithms in C can be found in Sedgewick (1990). Display operations include printing the documents, and displaying them on a CRT.

Using information about term distributions, it is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance (Chapter 14). Term distribution information can also be used to cluster similar documents in a document space (Chapter 16).

Another important document operation is display. The user interface of an IR system, as with any other type of information system, is critical to its successful usage. Since user interface algorithms and data structures are not IR specific, we have not covered them in detail here.

1.2.6 Hardware for IR

Hardware affects the design of IR systems because it determines, in part, the operating speed of an IR system, a crucial factor in interactive information systems, and the amounts and types of information that can be stored practically in an IR system. Most IR systems in use today are implemented on von Neumann machines, general purpose computers with a single processor. Most of the discussion of IR techniques in this book assumes a von Neumann machine as an implementation platform. The computing speeds of these machines have improved enormously over the years, yet there are still IR applications for which they may be too slow. In response to this problem, some researchers have examined alternative hardware for implementing IR systems. There are two approaches: parallel computers and IR-specific hardware.

Chapter 18 discusses implementation of an IR system on the Connection Machine, a massively parallel computer with 64,000 processors. Chapter 17 discusses IR-specific hardware, machines designed specifically to handle IR operations. IR-specific hardware has been developed both for text scanning and for common operations like Boolean set combination.

Along with the need for greater speed has come the need for storage media capable of compactly holding the huge document databases that have proliferated. Optical storage technology, capable of holding gigabytes of information on a single disk, has met this need. Chapter 6 discusses data structures and algorithms that allow optical disk technology to be successfully exploited for IR.

1.2.7 Functional View of Paradigm IR System

Figure 1.1 shows the activities associated with a common type of Boolean IR system, chosen because it represents the operational standard for IR systems.

Figure 1.1: Example of Boolean IR system

When building the database, documents are taken one by one, and their text is broken into words. The words from the documents are compared against a stoplist, a list of words thought to have no indexing value. Words from the document not found in the stoplist may next be stemmed. Words may then also be counted, since the frequency of words in documents and in the database as a whole are often used for ranking retrieved documents. Finally, the words and associated information such as the documents, fields within the documents, and counts are put into the database. The database then might consist of pairs of document identifiers and keywords as follows.
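For illustration (our sketch; the book's original listing is not reproduced in this extraction), such keyword/document-identifier pairs might look like:

    automata     doc1
    automata     doc3
    retrieval    doc1
    retrieval    doc2
    trees        doc2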

Such a structure is called an inverted file. In an IR system, each document must have a unique identifier, and its fields, if field operations are supported, must have unique field names.

To search the database, a user enters a query consisting of a set of keywords connected by Boolean operators (AND, OR, NOT). The query is parsed into its constituent terms and Boolean operators. These terms are then looked up in the inverted file and the lists of document identifiers corresponding to them are combined according to the specified Boolean operators. If frequency information has been kept, the retrieved set may be ranked in order of probable relevance. The result of the search is then presented to the user. In some systems, the user makes judgments about the relevance of the retrieved documents, and this information is used to modify the query automatically by adding terms from relevant documents and deleting terms from nonrelevant documents. Systems such as this give remarkably good retrieval performance given their simplicity, but their performance is far from perfect. Many techniques to improve them have been proposed.
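The core of the Boolean AND step is a merge of sorted document-identifier lists. A minimal C sketch, assuming the posting lists are kept sorted as described above (not the book's Chapter 12 code):

    /* Intersect two sorted document-ID lists; returns the number of
       IDs written to out, i.e., documents matching term1 AND term2. */
    int intersect(const int *a, int na, const int *b, int nb, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])      i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }  /* ID in both lists */
        }
        return k;
    }

OR and NOT can be implemented with the same kind of linear merge, which is why keeping posting lists sorted by document identifier pays off.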

One such technique aims to establish a connection between morphologically related terms. Stemming (Chapter 8) is a technique for conflating term variants so that the semantic closeness of words like "engineer," "engineered," and "engineering" will be recognized in searching. Another way to relate terms is via thesauri, or synonym lists, as discussed in Chapter 9.

1.3 IR AND OTHER TYPES OF INFORMATION SYSTEMS

How do IR systems relate to different types of information systems such as database management systems (DBMS) and artificial intelligence (AI) systems? Table 1.3 summarizes some of the similarities and differences.

Table 1.3: IR, DBMS, AI Comparison

        Data Object          Primary Operation    Database Size
-----------------------------------------------------------------
IR      document             retrieval            small to very large
DBMS    table                lookup               small to very large
AI      logical statements   inference            usually small

One difference between IR, DBMS, and AI systems is the amount of usable structure in their data objects. Documents, being primarily text, in general have less usable structure than the tables of data used by relational DBMS, and structures such as frames and semantic nets used by AI systems. It is possible, of course, to analyze a document manually and store information about its syntax and semantics in a DBMS or an AI system. The barriers to doing this for a large collection of documents are practical rather than theoretical. The work involved in doing knowledge engineering on a set of, say, 50,000 documents would be enormous. Researchers have devoted much effort to constructing hybrid systems using IR, DBMS, AI, and other techniques; see, for example, Tong (1989). The hope is to eventually develop practical systems that combine IR, DBMS, and AI.

Another distinguishing feature of IR systems is that retrieval is probabilistic. That is, one cannot be certain that a retrieved document will meet the information need of the user. In a typical search in an IR system, some relevant documents will be missed and some nonrelevant documents will be retrieved. This may be contrasted with retrieval from, for example, a DBMS, where retrieval is deterministic. In a DBMS, queries consist of attribute-value pairs that either match, or do not match, records in the database.

One feature of IR systems shared with many DBMS is that their databases are often very large, sometimes in the gigabyte range. Book library systems, for example, may contain several million records. Commercial on-line retrieval services such as Dialog and BRS provide databases of many gigabytes. The need to search such large collections in real time places severe demands on the systems used to search them. Selection of the best data structures and algorithms to build such systems is often critical.

Another feature that IR systems share with DBMS is database volatility. A typical large IR application, such as a book library system or commercial document retrieval service, will change constantly as documents are added, changed, and deleted. This constrains the kinds of data structures and algorithms that can be used for IR.

In summary, a typical IR system must meet the following functional and nonfunctional requirements. It must allow a user to add, delete, and change documents in the database. It must provide a way for users to search for documents by entering queries, and examine the retrieved documents. It must accommodate databases in the megabyte to gigabyte range, and retrieve relevant documents in response to queries interactively, often within 1 to 10 seconds.

1.4 IR SYSTEM EVALUATION

IR systems can be evaluated in terms of many criteria including execution efficiency, storage efficiency, retrieval effectiveness, and the features they offer a user. The relative importance of these factors must be decided by the designers of the system, and the selection of appropriate data structures and algorithms for implementation will depend on these decisions.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a computation. This can be measured in C-based systems by using profiling tools such as prof (Earhart 1986) on UNIX. Execution efficiency has always been a major concern of IR systems since most of them are interactive, and a long retrieval time will interfere with the usefulness of the system. The nonfunctional requirements of IR systems usually specify maximum acceptable times for searching, and for database maintenance operations such as adding and deleting documents.

Storage efficiency is measured by the number of bytes needed to store data. Space overhead, a common measure of storage efficiency, is the ratio of the size of the index files plus the size of the document files over the size of the document files. Space overhead ratios of from 1.5 to 3 are typical for IR systems based on inverted files.
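For example, by this definition a system holding 100 megabytes of documents and 80 megabytes of index files has a space overhead of (100 + 80)/100 = 1.8, within the typical range.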

Most IR experimentation has focused on retrieval effectiveness, usually based on document relevance judgments. This has been a problem since relevance judgments are subjective and unreliable. That is, different judges will assign different relevance values to a document retrieved in response to a given query. The seriousness of the problem is the subject of debate, with many IR researchers arguing that the relevance judgment reliability problem is not sufficient to invalidate the experiments that use relevance judgments. A detailed discussion of the issues involved in IR experimentation can be found in Salton and McGill (1983) and Sparck-Jones (1981).

Many measures of retrieval effectiveness have been proposed. The most commonly used are recall and precision. Recall is the ratio of the number of relevant documents retrieved for a given query over the number of relevant documents for that query in the database. Except for small test collections, this denominator is generally unknown and must be estimated by sampling or some other method. Precision is the ratio of the number of relevant documents retrieved over the total number of documents retrieved. Both recall and precision take on values between 0 and 1.
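For example, if a query retrieves 10 documents, 4 of which are relevant, and the database contains 20 documents relevant to the query, then precision = 4/10 = 0.4 and recall = 4/20 = 0.2.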

Since one often wishes to compare IR performance in terms of both recall and precision, methods for evaluating them simultaneously have been developed. One method involves the use of recall-precision graphs, bivariate plots where one axis is recall and the other precision. Figure 1.2 shows an example of such a plot. Recall-precision plots show that recall and precision are inversely related. That is, when precision goes up, recall typically goes down and vice-versa. Such plots can be done for individual queries, or averaged over queries as described in Salton and McGill (1983) and van Rijsbergen (1979).

Figure 1.2: Recall-precision graph

A combined measure of recall and precision, E, has been developed by van Rijsbergen (1979). The evaluation measure E is defined as:

    E = 1 - ((b^2 + 1) P R) / (b^2 P + R)

where P = precision, R = recall, and b is a measure of the relative importance, to a user, of recall and precision. Experimenters choose values of b that they hope will reflect the recall and precision interests of the typical user. For example, b levels of 0.5, indicating that a user was twice as interested in precision as recall, and 2, indicating that a user was twice as interested in recall as precision, might be used.
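With b = 1, recall and precision are weighted equally, and the expression reduces to E = 1 - 2PR/(P + R), that is, one minus the harmonic mean of precision and recall.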

IR experiments often use test collections, which consist of a document database and a set of queries for the database for which relevance judgments are available. The number of documents in test collections has tended to be small, typically a few hundred to a few thousand documents. Test collections are available on an optical disk (Fox 1990). Table 1.4 summarizes the test collections on this disk.

Table 1.4: IR Test Collections

Collection   Subject               Documents   Queries
----------   -------------------   ---------   -------
ADI          Information Science          82        35
CACM         Computer Science           3200        64
CISI         Library Science            1460        76
TIME         General Articles            423        83

IR experiments using such small collections have been criticized as not being realistic. Since real IR databases typically contain much larger collections of documents, the generalizability of experiments using small test collections has been questioned.

1.5 SUMMARY

IR systems typically need to support large databases, some in the megabyte to gigabyte range, and retrieve relevant documents in response to queries interactively, often within 1 to 10 seconds. We have summarized the various approaches, elaborated in subsequent chapters, taken by IR systems in providing these services. Evaluation techniques for IR systems were also briefly surveyed. The next chapter is an introduction to data structures and algorithms.

REFERENCES

AHO, A., B. KERNIGHAN, and P. WEINBERGER. 1988. The AWK Programming Language. Reading, Mass.: Addison-Wesley.

BELKIN, N. J., and W. B. CROFT. 1987. "Retrieval Techniques," in Annual Review of Information Science and Technology, ed. M. Williams. New York: Elsevier Science Publishers, 109-145.

EARHART, S. 1986. The UNIX Programming Language, vol. 1. New York: Holt, Rinehart, and Winston.

FALOUTSOS, C. 1985. "Access Methods for Text," Computing Surveys, 17(1), 49-74.

FOX, E., ed. 1990. Virginia Disk One. Blacksburg: Virginia Polytechnic Institute and State University.

FRAKES, W. B. 1984. "Term Conflation for Information Retrieval," in Research and Development in Information Retrieval, ed. C. J. van Rijsbergen. Cambridge: Cambridge University Press.

PRIETO-DIAZ, R., and G. ARANGO. 1991. Domain Analysis: Acquisition of Reusable Information for Software Construction. New York: IEEE Press.

SALTON, G., and M. MCGILL. 1983. An Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SEDGEWICK, R. 1990. Algorithms in C. Reading, Mass.: Addison-Wesley.

SPARCK-JONES, K. 1981. Information Retrieval Experiment. London: Butterworths.

TONG, R., ed. 1989. Special Issue on Knowledge Based Techniques for Information Retrieval, International Journal of Intelligent Systems, 4(3).

VAN RIJSBERGEN, C. J. 1979. Information Retrieval. London: Butterworths.


CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND ALGORITHMS RELATED TO INFORMATION RETRIEVAL

2.1 INTRODUCTION

Information retrieval (IR) is a multidisciplinary field. In this chapter we study data structures and algorithms used in the implementation of IR systems. In this sense, many contributions from theoretical computer science have practical and regular use in IR systems.

The first section covers some basic concepts: strings, regular expressions, and finite automata. In section 2.3 we have a look at the three classical foundations of structuring data in IR: search trees, hashing, and digital trees. We give the main performance measures of each structure and the associated trade-offs. In section 2.4 we attempt to classify IR algorithms based on their actions. We distinguish three main classes of algorithms and give examples of their use. These are retrieval, indexing, and filtering algorithms.

The presentation level is introductory, and assumes some programming knowledge as well as some theoretical computer science background. We do not include code because it is given in most standard textbooks. For good C or Pascal code we suggest the Handbook of Algorithms and Data Structures of Gonnet and Baeza-Yates (1991).

2.2 BASIC CONCEPTS

We start by reviewing basic concepts related to text: strings, regular expressions (as a general query language), and finite automata (as the basic text processing machine). Strings appear everywhere, and the simplest model of text is a single long string. Regular expressions provide a powerful query language, such that word searching or Boolean expressions are particular cases of it. Finite automata are used for string searching (either by software or hardware), and in different ways of text filtering and processing.

2.2.1 Strings

We use Σ to denote the alphabet (a set of symbols). We say that the alphabet is finite if there exists a bound on the size of the alphabet, denoted by |Σ|. Otherwise, if we do not know a priori a bound on the alphabet size, we say that the alphabet is arbitrary. A string over an alphabet Σ is a finite length sequence of symbols from Σ. The empty string (ε) is the string with no symbols. If x and y are strings, xy denotes the concatenation of x and y. If α = xyz is a string, then x is a prefix, and z a suffix, of α. The length of a string x (|x|) is the number of symbols of x. Any contiguous sequence of letters y from a string is called a substring. If the letters do not have to be contiguous, we say that y is a subsequence.

2.2.2 Similarity between Strings

When manipulating strings, we need to know how similar a pair of strings are. For this purpose, several similarity measures have been defined. Each similarity model is defined by a distance function d, such that for any strings s1, s2, and s3, it satisfies the following properties:

    d(s1, s1) = 0,   d(s1, s2) ≥ 0,   d(s1, s3) ≤ d(s1, s2) + d(s2, s3)

The two main distance functions are as follows:

The Hamming distance is defined over strings of the same length. The function d is defined as the number of symbols in the same position that are different (number of mismatches). For example, d(text, that) = 2.

The edit distance is defined as the minimal number of symbols that it is necessary to insert, delete, or substitute to transform a string s1 into s2. Clearly, d(s1, s2) ≥ |length(s1) - length(s2)|. For example, d(text, tax) = 2.
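A standard way to compute the edit distance is dynamic programming over a single row of the distance table. The C sketch below is ours (with a fixed buffer for brevity); it returns the minimal number of insertions, deletions, and substitutions, so edit_distance("text", "tax") yields 2.

    #include <string.h>

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    /* Row-by-row dynamic programming; assumes strlen(s2) <= 64. */
    int edit_distance(const char *s1, const char *s2)
    {
        int n = strlen(s1), m = strlen(s2);
        int d[64 + 1];
        for (int j = 0; j <= m; j++) d[j] = j;
        for (int i = 1; i <= n; i++) {
            int diag = d[0];              /* holds d[i-1][j-1] */
            d[0] = i;
            for (int j = 1; j <= m; j++) {
                int tmp = d[j];           /* d[i-1][j] */
                int cost = (s1[i-1] == s2[j-1]) ? 0 : 1;
                d[j] = min3(d[j] + 1,     /* deletion      */
                            d[j-1] + 1,   /* insertion     */
                            diag + cost); /* substitution  */
                diag = tmp;
            }
        }
        return d[m];
    }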

2.2.3 Regular Expressions

We use the usual definition of regular expressions (REs for short), defined by the operations of concatenation, union (+), and star or Kleene closure (*) (Hopcroft and Ullman 1979). A language over an alphabet Σ is a set of strings over Σ. Let L1 and L2 be two languages. The language {xy | x ∈ L1 and y ∈ L2} is called the concatenation of L1 and L2 and is denoted by L1L2. If L is a language, we define L^0 = {ε} and L^i = LL^(i-1) for i ≥ 1. The star or Kleene closure of L, written L*, is the language ∪_{i≥0} L^i. The plus or positive closure is defined by L^+ = LL*.

We use L(r) to represent the set of strings in the language denoted by the regular expression r. The regular expressions over Σ and the languages that they denote (regular sets or regular languages) are defined recursively as follows:

∅ is a regular expression and denotes the empty set.

ε (the empty string) is a regular expression and denotes the set {ε}.

For each symbol a in Σ, a is a regular expression and denotes the set {a}.

If p and q are regular expressions, then p + q (union), pq (concatenation), and p* (star) are regular expressions that denote L(p) ∪ L(q), L(p)L(q), and L(p)*, respectively.

To avoid unnecessary parentheses we adopt the convention that the star operator has the highest precedence, then concatenation, then union. All operators are left associative.

We also use:

Σ to denote any symbol from Σ (when the ambiguity is clearly resolvable by context)

r? to denote zero or one occurrence of r (that is, r? = ε + r)

[a1 ... am] to denote a range of symbols from Σ (for this we need an order in Σ)

r^k to denote at most k occurrences of r (finite closure)

Examples:

All the examples given here arise from the Oxford English Dictionary:

1. All citations to an author with prefix Scot, followed by at most 80 arbitrary characters, then by works beginning with the prefix Kenilw or Discov:

    <A>Scot Σ^80 <W>(Kenilw + Discov)

where < > are characters in the OED text that denote tags (A for author, W for work).

2. All "bl" tags (lemma in bold) containing a single word consisting of lowercase alphabetical characters only.

4. All references to author W. Scott:

    <A>((Sir b)? W)? b Scott b?</A>

where b denotes a literal space.

We use regular languages as our query domain, and regular languages can be represented by regular expressions. Sometimes, we restrict the query to a subset of regular languages. For example, when searching in plain text, we have the exact string matching problem, where we only allow single strings as valid queries.

2.2.4 Finite Automata

A finite automaton is a mathematical model of a system. The automaton can be in any one of a finite number of states and is driven from state to state by a sequence of discrete inputs. Figure 2.1 depicts an automaton reading its input from a tape.

Figure 2.1: A finite automaton

Formally, a finite automaton (FA) is defined by a 5-tuple (Q, Σ, δ, q0, F) (see Hopcroft and Ullman [1979]), where

Q is a finite set of states,

Σ is a finite input alphabet,

q0 ∈ Q is the initial state,

F ⊆ Q is the set of final states, and

δ is the (partial) transition function mapping Q × (Σ ∪ {ε}) to zero or more elements of Q. That is, δ(q, a) describes the next state(s) for each state q and input symbol a, or is undefined.

A finite automaton starts in state q0 reading the input symbols from a tape. In one move, the FA in state q reading symbol a enters state(s) δ(q, a), and moves the reading head one position to the right. If δ(q, a) ∈ F, we say that the FA has accepted the string written on its input tape up to the last symbol read. If δ(q, a) has a unique value for every q and a, we say that the FA is deterministic (DFA); otherwise we say that it is nondeterministic (NFA).

The languages accepted by finite automata (either DFAs or NFAs) are regular languages. In other words, there exists a FA that accepts L(r) for any regular expression r; and given a DFA or NFA, we can express the language that it recognizes as a RE. There is a simple algorithm that, given a regular expression r, constructs a NFA that accepts L(r) in O(|r|) time and space. There are also algorithms to convert a NFA to a NFA without ε transitions (O(|r|^2) states) and to a DFA (O(2^|r|) states in the worst case).

Figure 2.2 shows the DFA that searches for an occurrence of the fourth query of the previous section in a text. The double-circled state is the final state of the DFA. All the transitions are shown with the exception of:

the transition from every state (with the exception of states 2 and 3) to state 1 upon reading a <, and

the default transition from all the states to state 0 when there is no transition defined for the symbol read.

Figure 2.2: DFA example for <A>((Sir b)? W)? b Scott b?</A>.

A DFA is called minimal if it has the minimum possible number of states. There exists an O(|Σ| n log n) algorithm to minimize a DFA with n states.

A finite automaton is called partial if the function δ is not defined for all possible symbols of Σ for each state. In that case, there is an implicit error state, not belonging to F, for every undefined transition.

DFAs will be used in this book as searching machines. Usually, the searching time depends on how the transitions are implemented. If the alphabet is known and finite, using a table we have constant time per transition and thus O(n) searching time. If the alphabet is not known in advance, we can use an ordered table in each state. In this case, the searching time is O(n log m). Another possibility would be to use a hashing table in each state, achieving constant time per transition on average.
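A table-driven implementation makes the O(n) bound concrete: the scan performs exactly one table lookup per text character. The toy machine below (ours, not the automaton of Figure 2.2) accepts upon seeing the substring "ab":

    #define NSTATES 3

    /* Returns the text position where a match completed, or -1. */
    int dfa_search(const char *text)
    {
        /* delta[state][symbol]: symbol 0 = other, 1 = 'a', 2 = 'b' */
        static const int delta[NSTATES][3] = {
            /* state 0 */ { 0, 1, 0 },
            /* state 1 */ { 0, 1, 2 },
            /* state 2 */ { 2, 2, 2 }   /* final (absorbing) state */
        };
        int state = 0;
        for (int i = 0; text[i]; i++) {
            int sym = text[i] == 'a' ? 1 : text[i] == 'b' ? 2 : 0;
            state = delta[state][sym];  /* one transition per char */
            if (state == 2)
                return i;
        }
        return -1;
    }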

2.3 DATA STRUCTURES

In this section we cover the three classical structures used to organize IR data: search trees, hashing, and digital trees. These three data structures differ in how a search is performed. Trees define a lexicographical order over the data. However, in search trees, we use the complete value of a key to direct the search, while in digital trees, the digital (symbol) decomposition is used to direct the search. On the other hand, hashing "randomizes" the data order, being able to search faster on average, with the disadvantage that scanning in sequential order is not possible (for example, range searches are expensive).

Some examples of their use in subsequent chapters of this book are:

Search trees: for optical disk files (Chapter 6), prefix B-trees (Chapter 3), stoplists (Chapter 7).

Hashing: hashing itself (Chapter 13), string searching (Chapter 10), associated retrieval, Boolean operations (Chapters 12 and 15), optical disk file structures (Chapter 6), signature files (Chapter 4), stoplists (Chapter 7).

Digital trees: string searching (Chapter 10), suffix trees (Chapter 5).

We refer the reader to Gonnet and Baeza-Yates (1991) for search and update algorithms related to the data structures of this section.

2.3.1 Search Trees

The most well-known search tree is the binary search tree. Each internal node contains a key, and the left subtree stores all keys smaller than the parent key, while the right subtree stores all keys larger than the parent key. Binary search trees are adequate for main memory. However, for secondary memory, multiway search trees are better, because internal nodes are bigger. In particular, we describe a special class of balanced multiway search trees called B-trees.

A B-tree of order m is defined as follows:

The root has between 2 and 2m keys, while all other internal nodes have between m and 2m keys.

If ki is the i-th key of a given internal node, then all keys in the (i-1)-th child are smaller than ki, while all the keys in the i-th child are bigger.

All leaves are at the same depth.

Usually, a B-tree is used as an index, and all the associated data are stored in the leaves or buckets. This structure is called a B+-tree. An example of a B+-tree of order 2 is shown in Figure 2.3, using bucket size 4.

Figure 2.3: A B+-tree example (Di denotes the primary key i, plus its associated data).

B-trees are mainly used as a primary key access method for large databases in secondary memory. To search for a given key, we go down the tree choosing the appropriate branch at each step. The number of disk accesses is equal to the height of the tree.

Updates are done bottom-up. To insert a new record, we search for the insertion point. If there is not enough space in the corresponding leaf, we split it, and we promote a key to the previous level. The algorithm is applied recursively, up to the root, if necessary. In that case, the height of the tree increases by one. Splits provide a minimal storage utilization of 50 percent. Therefore, the height of the tree is at most log_{m+1}(n/b) + 2, where n is the number of keys, and b is the number of records that can be stored in a leaf. Deletions are handled in a similar fashion, by merging nodes. On average, the expected storage utilization is ln 2 ≈ .69 (Yao 1979; Baeza-Yates 1989).
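As an illustrative calculation (ours, not the book's): with order m = 50 and b = 50 records per leaf, a file of n = 1,000,000 keys gives a height of at most log_51(20,000) + 2, which is about 4.5, so a search costs no more than about five disk accesses.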

To improve storage utilization, several overflow techniques exist. Some of them are:

B*-trees: in case of overflow, we first see if neighboring nodes have space. In that case, a subset of the keys is shifted, avoiding a split. With this technique, 66 percent minimal storage utilization is provided. The main disadvantage is that updates are more expensive (Bayer and McCreight 1972; Knuth 1973).

Partial expansions: buckets of different sizes are used. If an overflow occurs, a bucket is expanded (if possible), or split. Using two bucket sizes of relative ratio 2/3, 66 percent minimal and 81 percent average storage utilization is achieved (Lomet 1987; Baeza-Yates and Larson 1989). This technique does not deteriorate update time.

Adaptive splits: two bucket sizes of relative ratio 1/2 are used. However, splits are not symmetric (balanced), and they depend on the insertion point. This technique achieves 77 percent average storage utilization and is robust against nonuniform distributions (low variance) (Baeza-Yates 1990).

A special kind of B-tree, the prefix B-tree (Bayer and Unterauer 1977), efficiently supports variable length keys, as is the case with words. This kind of B-tree is discussed in detail in Chapter 3.

2.3.2 Hashing

A hashing function h(x) maps a key x to an integer in a given range (for example, 0 to m - 1). Hashing functions are designed to produce values uniformly distributed in the given range. For a good discussion about choosing hashing functions, see Ullman (1972), Knuth (1973), and Knott (1975). The hashing value is also called a signature.

A hashing function is used to map a set of keys to slots in a hashing table. If the hashing function gives the same slot for two different keys, we say that we have a collision. Hashing techniques mainly differ in how collisions are handled. There are two classes of collision resolution schemas: open addressing and overflow addressing.

In open addressing (Peterson 1957), the collided key is "rehashed" into the table, by computing a new index value. The most used technique in this class is double hashing, which uses a second hashing function (Bell and Kaman 1970; Guibas and Szemeredi 1978). The main limitation of this technique is that when the table becomes full, some kind of reorganization must be done. Figure 2.4 shows a hashing table of size 13, and the insertion of a key using the hashing function h(x) = x mod 13 (this is only an example, and we do not recommend using this hashing function!).

Figure 2.4: Insertion of a new key using double hashing.
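A C sketch of double hashing insertion follows; h1 matches the h(x) = x mod 13 of the example, while h2 and the table layout are our illustrative choices (since 13 is prime, every probe step visits all slots).

    #define TABLE_SIZE 13
    #define EMPTY (-1)

    static int h1(int x) { return x % TABLE_SIZE; }
    static int h2(int x) { return 1 + (x % (TABLE_SIZE - 1)); }

    /* Insert key into a table whose free slots hold EMPTY.
       Returns the slot used, or -1 if the table is full. */
    int dh_insert(int table[TABLE_SIZE], int key)
    {
        int pos = h1(key), step = h2(key);
        for (int probe = 0; probe < TABLE_SIZE; probe++) {
            if (table[pos] == EMPTY) {
                table[pos] = key;
                return pos;
            }
            pos = (pos + step) % TABLE_SIZE;  /* rehash: next slot */
        }
        return -1;  /* full: some reorganization would be needed */
    }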

In overflow addressing (Williams 1959; Knuth 1973), the collided key is stored in an overflow area, such that all key values with the same hashing value are linked together. The main problem of this schema is that a search may degenerate to a linear search.

Searches follow the insertion path until the given key is found, or not (unsuccessful case). The average search time is constant for nonfull tables.

Because hashing "randomizes" the location of keys, a sequential scan in lexicographical order is not possible Thus, ordered scanning or range searches are very expensive More details on hashing can be

Trang 26

found in Chapter 13.

Hashing schemes have also been used for secondary memory. The main difference is that tables have to grow dynamically as the number of keys increases. The main techniques are extendible hashing, which uses hashing on two levels: a directory and a bucket level (Fagin et al. 1979), and linear hashing, which uses an overflow area, and grows in a predetermined way (Litwin 1980; Larson 1980; Larson and Kajla 1984). For the case of textual databases, a special technique called signature files (Faloutsos 1987) is used most frequently. This technique is covered in detail in Chapter 4 of this book.

To improve search time on B-trees, and to allow range searches in hashing schemes, several hybrid methods have been devised. Among them, we should mention the bounded disorder method (Litwin and Lomet 1987), where B+-tree buckets are organized as hashing tables.

2.3.3 Digital Trees

Efficient prefix searching can be done using indices. One of the best indices for prefix searching is a binary digital tree or binary trie constructed from a set of substrings of the text. This data structure is used in several algorithms.

Tries are recursive tree structures that use the digital decomposition of strings to represent a set of strings and to direct the searching. Tries were invented by de la Briandais (1959) and the name was suggested by Fredkin (1960), from information re-trie-val. If the alphabet is ordered, we have a lexicographically ordered tree. The root of the trie uses the first character, the children of the root use the second character, and so on. If the remaining subtrie contains only one string, that string's identity is stored in an external node.

Figure 2.5 shows a binary trie (binary alphabet) for the string "01100100010111..." after inserting all the substrings that start from positions 1 through 8. (In this case, the substring's identity is represented by its starting position in the text.)

The height of a trie is the number of nodes in the longest path from the root to an external node. The length of any path from the root to an external node is bounded by the height of the trie. On average, the height of a trie is logarithmic for any square-integrable probability distribution (Devroye 1982). For a random uniform distribution (Regnier 1981), the expected height is approximately 2 log2 n for a binary trie containing n strings.

The average number of internal nodes inspected during a (un)successful search in a binary trie with n strings is log2 n + O(1). The average number of internal nodes is approximately n/ln 2 ≈ 1.44n (Knuth 1973).

A Patricia tree (Morrison 1968) is a trie with the additional constraint that single-descendant nodes are eliminated. This name is an acronym for "Practical Algorithm To Retrieve Information Coded In Alphanumeric." A counter is kept in each node to indicate which is the next bit to inspect. Figure 2.6 shows the Patricia tree corresponding to the binary trie in Figure 2.5.

Figure 2.5: Binary trie (external node label indicates position in the text) for the first eight suffixes in "01100100010111...".

Figure 2.6: Patricia tree (internal node label indicates bit number).

For n strings, such an index has n external nodes (the n positions of the text) and n - 1 internal nodes. Each internal node consists of a pair of pointers plus some counters. Thus, the space required is O(n).

It is possible to build the index in O(nh) time, where h denotes the height of the tree. As for tries, the expected height of a Patricia tree is logarithmic (and at most the height of the binary trie). The expected height of a Patricia tree is log2 n + o(log2 n) (Pittel 1986).

A trie built using the substrings (suffixes) of a string is also called a suffix tree (McCreight [1976] or Aho et al. [1974]). A variation of these is the position tree (Weiner 1973). Similarly, a Patricia tree built over the suffixes is called a compact suffix tree.

2.4.1 Retrieval Algorithms

The main class of algorithms in IR is retrieval algorithms, that is, algorithms that extract information from a textual database. We can distinguish two types of retrieval algorithms, according to how much extra memory they need:


Sequential scanning of the text: extra memory is in the worst case a function of the query size, and not of the database size. On the other hand, the running time is at least proportional to the size of the text; for example, string searching (Chapter 10).

Indexed text: an "index" of the text is available, and can be used to speed up the search The index size is usually proportional to the database size, and the search time is sublinear on the size of the text, for example, inverted files (Chapter 3) and signature files (Chapter 4)

Formally, we can describe a generic searching problem as follows: given a string t (the text), a regular expression q (the query), and information (optionally) obtained by preprocessing the pattern and/or the text, the problem consists of finding whether t ∈ Σ*qΣ* (q for short) and obtaining some or all of the following information:

1. The location where an occurrence (or specifically the first, the longest, etc.) of q exists. Formally, if t ∈ Σ*qΣ*, find a position m ≥ 0 such that t ∈ Σ^m qΣ*. For example, the first occurrence is defined as the least m that fulfills this condition.

2. The number of occurrences of the pattern in the text. Formally, the number of all possible values of m in the previous category.

3. All the locations where the pattern occurs (the set of all possible values of m).

In general, the complexities of these problems are different.

We assume that ε (the empty string) is not a member of L(q). If it is, the answer is trivial. Note that string matching is a particular case, where q is a string. Algorithms to solve this problem are discussed in Chapter 10.
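For the particular case where q is a plain string, all three kinds of answers can be collected in one sequential scan of the text; the brute-force sketch below is our illustration, not one of the algorithms from Chapter 10:

    #include <stdio.h>
    #include <string.h>

    /* Brute-force scan: report every shift m (0-based) at which the
       pattern occurs, i.e., every m with t in Sigma^m q Sigma^*. */
    void scan(const char *t, const char *q)
    {
        size_t n = strlen(t), m = strlen(q);
        int first = -1, count = 0;

        for (size_t i = 0; m <= n && i <= n - m; i++) {
            if (memcmp(t + i, q, m) == 0) {
                if (first < 0) first = (int)i;       /* 1: first occurrence */
                count++;                              /* 2: occurrence count */
                printf("occurrence at m = %zu\n", i); /* 3: all locations */
            }
        }
        printf("first = %d, count = %d\n", first, count);
    }

    int main(void)
    {
        scan("01100100010111", "010");  /* reports m = 4 and m = 8 */
        return 0;
    }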

The efficiency of retrieval algorithms is very important, because we expect them to solve on-line queries with a short answer time. This need has triggered the implementation of retrieval algorithms in many different ways: in hardware, on parallel machines, and so on. These cases are explained in detail in Chapter 17 (hardware) and Chapter 18 (parallel algorithms).

2.4.2 Filtering Algorithms

This class of algorithms is such that the text is the input and a processed or filtered version of the text is the output. This is a typical transformation in IR, for example, to reduce the size of a text and/or standardize it to simplify searching.

The most common filtering/processing operations are (a small sketch combining several of them appears after the list):


Common words removed using a list of stopwords. This operation is discussed in Chapter 7.

Uppercase letters transformed to lowercase letters.

Special symbols removed and sequences of multiple spaces reduced to one space.

Numbers and dates transformed to a standard format (Gonnet 1987).

Spelling variants transformed using Soundex-like methods (Knuth 1973).

Word stemming (removing suffixes and/or prefixes). This is the topic of Chapter 8.

Automatic keyword extraction.

Word ranking.
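The sketch announced above (ours) combines the case-folding, special-symbol, and multiple-space operations in a single in-place pass:

    #include <ctype.h>

    /* Filter a string in place: lowercase letters, map anything that is
       not a letter, digit, or space to a space, collapse runs of spaces,
       and trim leading/trailing spaces. */
    void filter(char *s)
    {
        char *in = s, *out = s;
        int prev_space = 1;                  /* also trims leading spaces */
        for (; *in; in++) {
            unsigned char c = (unsigned char)*in;
            if (isalpha(c))        c = (unsigned char)tolower(c);
            else if (!isdigit(c))  c = ' ';  /* special symbol */
            if (c == ' ') {
                if (prev_space) continue;    /* collapse the run */
                prev_space = 1;
            } else {
                prev_space = 0;
            }
            *out++ = (char)c;
        }
        if (out > s && out[-1] == ' ')
            out--;                           /* trim a trailing space */
        *out = '\0';
    }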

Unfortunately, these filtering operations may also have some disadvantages. Any query, before consulting the database, must be filtered in the same way as the text; moreover, it is not possible to search for common words, special symbols, or uppercase letters, nor to distinguish text fragments that have been mapped to the same internal form.

2.4.3 Indexing Algorithms

The usual meaning of indexing is to build a data structure that will allow quick searching of the text, as we mentioned previously. There are many classes of indices, based on different retrieval approaches. For example, we have inverted files (Chapter 3), signature files (Chapter 4), tries (Chapter 5), and so on, as we have seen in the previous section. Almost all types of indices are based on some kind of tree or hashing. Perhaps the main exceptions are clustered data structures (this kind of indexing is called clustering), which are covered in Chapter 16, and the Directed Acyclic Word Graph (DAWG) of the text, which represents all possible subwords of the text using a linear amount of space (Blumer et al. 1985) and is based on finite automata theory.

Usually, before indexing, the text is filtered. Figure 2.7 shows the complete process for the text.

Figure 2.7: Text preprocessing

The preprocessing time needed to build the index is amortized by using it in searches. For example, if building the index requires O(n log n) time, we would expect to query the database at least O(n) times to amortize the preprocessing cost. In that case, we add O(log n) preprocessing time to the total query time (which may also be logarithmic).

Many special indices, and their building algorithms (some of them for parallel machines), are covered in this book.

REFERENCES

AHO, A., J. HOPCROFT, and J. ULLMAN. 1974. The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley.

BAEZA-YATES, R. 1989. "Expected Behaviour of B+-Trees under Random Insertions." Acta Informatica, 26(5), 439-72. Also as Research Report CS-86-67, University of Waterloo, 1986.

BAEZA-YATES, R. 1990. "An Adaptive Overflow Technique for the B-tree," in Extending Data Base Technology Conference (EDBT 90), eds. F. Bancilhon, C. Thanos, and D. Tsichritzis, pp. 16-28, Venice. Springer-Verlag Lecture Notes in Computer Science 416.

BAEZA-YATES, R., and P.-A. LARSON. 1989. "Performance of B+-trees with Partial Expansions." IEEE Trans. on Knowledge and Data Engineering, 1, 248-57. Also as Research Report CS-87-04, Dept. of Computer Science, University of Waterloo, 1987.

BAYER, R., and E. MCCREIGHT. 1972. "Organization and Maintenance of Large Ordered Indexes." Acta Informatica, 1(3), 173-89.

BAYER, R., and K. UNTERAUER. 1977. "Prefix B-trees." ACM TODS, 2(1), 11-26.

BELL, J., and C. KAMAN. 1970. "The Linear Quotient Hash Code." CACM, 13(11), 675-77.

BLUMER, A., J. BLUMER, D. HAUSSLER, A. EHRENFEUCHT, M. CHEN, and J. SEIFERAS. 1985. "The Smallest Automaton Recognizing the Subwords of a Text." Theoretical Computer Science, 40, 31-55.

DE LA BRIANDAIS, R. 1959. "File Searching Using Variable Length Keys," in AFIPS Western JCC, pp. 295-98, San Francisco, Calif.

DEVROYE, L. 1982. "A Note on the Average Depth of Tries." Computing, 28, 367-71.

FAGIN, R., J. NIEVERGELT, N. PIPPENGER, and H. STRONG. 1979. "Extendible Hashing: A Fast Access Method for Dynamic Files." ACM TODS, 4(3), 315-44.

FALOUTSOS, C. 1987. "Signature Files: An Integrated Access Method for Text and Attributes, Suitable for Optical Disk Storage." Technical Report CS-TR-1867, University of Maryland.

FREDKIN, E. 1960. "Trie Memory." CACM, 3, 490-99.

GONNET, G. 1987. "Extracting Information from a Text Database: An Example with Dates and Numerical Data," in Third Annual Conference of the UW Centre for the New Oxford English Dictionary, pp. 85-89, Waterloo, Canada.

GONNET, G., and R. BAEZA-YATES. 1991. Handbook of Algorithms and Data Structures: In Pascal and C (2nd ed.). Wokingham, U.K.: Addison-Wesley.

GUIBAS, L., and E. SZEMEREDI. 1978. "The Analysis of Double Hashing." JCSS, 16(2), 226-74.

HOPCROFT, J., and J. ULLMAN. 1979. Introduction to Automata Theory. Reading, Mass.: Addison-Wesley.

KNOTT, G. D. 1975. "Hashing Functions." Computer Journal, 18(3), 265-78.

KNUTH, D. 1973. The Art of Computer Programming: Sorting and Searching, vol. 3. Reading, Mass.: Addison-Wesley.

LARSON, P.-A. 1980. "Linear Hashing with Partial Expansions," in VLDB, vol. 6, pp. 224-32, Montreal.

LARSON, P.-A., and A. KAJLA. 1984. "File Organization: Implementation of a Method Guaranteeing Retrieval in One Access." CACM, 27(7), 670-77.

LITWIN, W. 1980. "Linear Hashing: A New Tool for File and Table Addressing," in VLDB, vol. 6, pp. 212-23, Montreal.

LITWIN, W., and D. LOMET. 1987. "A New Method for Fast Data Searches with Keys." IEEE Software, 4(2), 16-24.

LOMET, D. 1987. "Partial Expansions for File Organizations with an Index." ACM TODS, 12, 65-84. Also as tech. report TR-86-06, Wang Institute, 1986.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." JACM, 23, 262-72.

MORRISON, D. 1968. "PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric." JACM, 15, 514-34.

PETERSON, W. 1957. "Addressing for Random-Access Storage." IBM J. Res. Development, 1(4), 130-46.

PITTEL, B. 1986. "Paths in a Random Digital Tree: Limiting Distributions." Adv. Appl. Prob., 18, 139-55.

REGNIER, M. 1981. "On the Average Height of Trees in Digital Search and Dynamic Hashing." Inf. Proc. Letters, 13, 64-66.

ULLMAN, J. 1972. "A Note on the Efficiency of Hashing Functions." JACM, 19(3), 569-75.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in FOCS, vol. 14, pp. 1-11.

WILLIAMS, F. 1959. "Handling Identifiers as Internal Symbols in Language Processors." CACM, 2(6).

CHAPTER 3: INVERTED FILES

3.1 INTRODUCTION

Three of the most commonly used file structures for information retrieval can be classified as lexicographical indices (indices that are sorted), clustered file structures, and indices based on hashing. One type of lexicographical index, the inverted file, is presented in this chapter, with a second type of lexicographical index, the Patricia (PAT) tree, discussed in Chapter 5. Clustered file structures are covered in Chapter 16, and indices based on hashing are covered in Chapter 13 and Chapter 4 (signature files).

The concept of the inverted file type of index is as follows. Assume a set of documents. Each document is assigned a list of keywords or attributes, with optional relevance weights associated with each keyword (attribute). An inverted file is then the sorted list (or index) of keywords (attributes), with each keyword having links to the documents containing that keyword (see Figure 3.1). This is the kind of index found in most commercial library systems. The use of an inverted file improves search efficiency by several orders of magnitude, a necessity for very large text files. The penalty paid for this efficiency is the need to store a data structure that ranges from 10 percent to 100 percent or more of the size of the text itself, and a need to update that index as the data set changes.

Figure 3.1: An inverted file implemented using a sorted array

Usually there are some restrictions imposed on these indices and, consequently, on later searches. Examples of these restrictions are:

a controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed, and hence are not searchable.

a list of stopwords (articles, prepositions, etc.) that for reasons of volume or precision and recall will not be included in the index, and hence are not searchable.

a set of rules that decide the beginning of a word or a piece of text that is indexable. These rules deal with the treatment of spaces, punctuation marks, or some standard prefixes, and may have significant impact on what terms are indexed.

a list of character sequences to be indexed (or not indexed). In large text databases, not all character sequences are indexed; for example, character sequences consisting of all numerics are often not indexed.

It should be noted that the restrictions that determine what is to be indexed are critical to later search effectiveness, and therefore these rules should be carefully constructed and evaluated. This problem is further discussed in Chapter 7.

A search in an inverted file is the composition of two searching algorithms: a search for a keyword (attribute), which returns an index, and then a possible search on that index for a particular attribute value. The result of a search on an inverted file is a set of records (or pointers to records).

This chapter is organized as follows. The next section presents a survey of the various implementation structures for inverted files. The third section covers the complete implementation of an algorithm for building an inverted file that is stored as a sorted array, and the fourth section shows two variations on this implementation: one that uses no sorting (and hence needs little working storage) and one that increases efficiency by making extensive use of primary memory. The final section summarizes the chapter.


3.2 STRUCTURES USED IN INVERTED FILES

There are several structures that can be used in implementing inverted files: sorted arrays, B-trees, tries, and various hashing structures, or combinations of these structures. The first three of these structures are sorted (lexicographically ordered) indices and can efficiently support range queries, such as all documents having keywords that start with "comput." Only these three structures will be further discussed in this chapter. (For more on hashing methods, see Chapters 4 and 13.)

3.2.1 The Sorted Array

An inverted file implemented as a sorted array structure stores the list of keywords in a sorted array, including the number of documents associated with each keyword and a link to the documents containing that keyword. This array is commonly searched using a standard binary search, although large secondary-storage-based systems will often adapt the array (and its search) to the characteristics of their secondary storage.

The main disadvantage of this approach is that updating the index (for example, appending a new keyword) is expensive. On the other hand, sorted arrays are easy to implement and are reasonably fast. (For this reason, the details of creating a sorted array inverted file are given in section 3.3.)
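A minimal sketch of the standard lookup (the entry layout is our assumption; real systems adapt it to their storage): since the array is sorted, the C library bsearch finds a keyword in O(log n) string comparisons:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char *term;    /* the keyword */
        int         ndocs;   /* number of documents containing the term */
        long        offset;  /* location of its postings */
    } Entry;

    static int term_cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, ((const Entry *)elem)->term);
    }

    /* dict[] must be sorted by term.  Returns NULL if word is absent. */
    const Entry *lookup(const Entry *dict, size_t n, const char *word)
    {
        return bsearch(word, dict, n, sizeof(Entry), term_cmp);
    }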

3.2.2 B-trees

Another implementation structure for an inverted file is a B-tree. More details of B-trees can be found in Chapter 2, and also in a recent paper (Cutting and Pedersen 1990) on efficient inverted files for dynamic data (data that is heavily updated). A special case of the B-tree, the prefix B-tree, uses prefixes of words as primary keys in a B-tree index (Bayer and Unterauer 1977) and is particularly suitable for storage of textual indices. Each internal node has a variable number of keys. Each key is the shortest word (in length) that distinguishes the keys stored in the next level. The key does not need to be a prefix of an actual term in the index. The last level, or leaf level, stores the keywords themselves, along with their associated data (see Figure 3.2). Because the internal node keys and their lengths depend on the set of keywords, the order (size) of each node of the prefix B-tree is variable. Updates are done similarly to those for a B-tree, to maintain a balanced tree. The prefix B-tree method breaks down if there are many words with the same (long) prefix; in this case, common prefixes should be further divided to avoid wasting space.

Figure 3.2: A prefix B-tree
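The choice of internal keys can be illustrated with a small helper (ours, not from Bayer and Unterauer) that computes the shortest string separating two adjacent leaf keys; for example, given "cook" and "copper" it yields "cop":

    #include <stdlib.h>
    #include <string.h>

    /* Shortest prefix s of `right` with  left < s <= right  (assumes
       left < right lexicographically; allocation checks omitted).
       Such strings make good internal keys in a prefix B-tree. */
    char *shortest_separator(const char *left, const char *right)
    {
        size_t k = 0;
        while (left[k] != '\0' && left[k] == right[k])
            k++;                      /* length of the common prefix */
        char *s = malloc(k + 2);
        memcpy(s, right, k + 1);      /* one char past the shared prefix */
        s[k + 1] = '\0';
        return s;
    }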

Compared with sorted arrays, B-trees use more space. However, updates are much easier, and the search time is generally faster, especially if secondary storage is used for the inverted file (instead of memory). The implementation of inverted files using B-trees is more complex than using sorted arrays, and therefore readers are referred to Knuth (1973) and Cutting and Pedersen (1990) for details of the implementation of B-trees, and to Bayer and Unterauer (1977) for details of the implementation of prefix B-trees.

3.2.3 Tries

Inverted files can also be implemented using a trie structure (see Chapter 2 for more on tries). This structure uses the digital decomposition of the set of keywords to represent those keywords. A special trie structure, the Patricia (PAT) tree, is especially useful in information retrieval and is described in detail in Chapter 5. An additional source for tested and optimized code for B-trees and tries is Gonnet and Baeza-Yates (1991).

3.3 BUILDING AN INVERTED FILE USING A SORTED ARRAY

The production of sorted array inverted files can be divided into two or three sequential steps, as shown in Figure 3.3. First, the input text must be parsed into a list of words along with their locations in the text. This is usually the most time-consuming and storage-consuming operation in indexing. Second, this list must then be inverted, from a list of terms in location order to a list of terms ordered for use in searching (sorted into alphabetical order, with a list of all locations attached to each term). An optional third step is the postprocessing of these inverted files, such as adding term weights, or reorganizing or compressing the files.

Figure 3.3: Overall schematic of sorted array inverted file creation

Creating the initial word list requires several different operations. First, the individual words must be recognized in the text. Each word is then checked against a stoplist of common words and, if it can be considered a noncommon word, may be passed through a stemming algorithm. The resultant stem is then recorded in the word-within-location list. The parsing operation and the use of a stoplist are described in Chapter 7, and the stemming operation is described in Chapter 8.

The word list resulting from the parsing operation (typically stored as a disk file) is then inverted. This is usually done by sorting on the word (or stem), with duplicates retained (see Figure 3.4). Even with the use of high-speed sorting utilities, however, this sort can be time consuming for large data sets (on the order of n log n). One way to handle this problem is to break the data sets into smaller pieces, process each piece, and then correctly merge the results. Methods that do not use sorting are given in section 3.4.1. After sorting, the duplicates are merged to produce within-document frequency statistics. (A system not using within-document frequencies can just sort with duplicates removed.) Note that although only record numbers are shown as locations in Figure 3.4, typically inverted files store field locations and possibly even word locations. These additional locations are needed for field and proximity searching in Boolean operations and cause higher inverted file storage overhead than if only record location were needed. Inverted files for ranking retrieval systems (see Chapter 14) usually store only record locations and term weights or frequencies.

Figure 3.4: Inversion of word list
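A toy version of this sort-then-merge step (our sketch; real systems work with disk files rather than an in-memory array):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { char term[16]; int record; } Posting;

    static int by_term_then_record(const void *a, const void *b)
    {
        const Posting *x = a, *y = b;
        int c = strcmp(x->term, y->term);
        return c ? c : (x->record - y->record);
    }

    int main(void)
    {
        Posting list[] = {              /* word list in text order */
            {"cold", 1}, {"days", 1}, {"cold", 2}, {"days", 1}, {"hot", 2},
        };
        size_t n = sizeof list / sizeof list[0];

        qsort(list, n, sizeof(Posting), by_term_then_record);

        /* Merge duplicates into within-document frequencies. */
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && strcmp(list[j].term, list[i].term) == 0 &&
                   list[j].record == list[i].record)
                j++;
            printf("%s  doc %d  freq %zu\n",
                   list[i].term, list[i].record, j - i);
            i = j;
        }
        return 0;
    }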

Although an inverted file could be used directly by the search routine, it is usually processed into an improved final format. This format is based on the search methods and the (optional) weighting methods used. A common search technique is to use a binary search routine on the file to locate the query words. This implies that the file to be searched should be as short as possible, and for this reason the single file shown containing the terms, locations, and (possibly) frequencies is usually split into two pieces. The first piece is the dictionary, containing the term, statistics about that term such as the number of postings, and a pointer to the location of the postings file for that term. The second piece is the postings file itself, which contains the record numbers (plus other necessary location information) and the (optional) weights for all occurrences of the term. In this manner, the dictionary used in the binary search has only one "line" per unique term. Figure 3.5 illustrates the conceptual form of the necessary files; the actual form depends on the details of the search routine and on the hardware being used. Work using large data sets (Harman and Candela 1990) showed that for a file of 2,653 records, there were 5,123 unique terms, with an average of 14 postings per term and a maximum of over 2,000 postings for a term. A larger data set of 38,304 records had dictionaries on the order of 250,000 lines (250,000 unique terms, including some numbers) and an average of 88 postings per record. From these numbers it is clear that efficient storage structures for both the binary search and the reading of the postings are critical.

Figure 3.5: Dictionary and postings file from the last example
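One possible layout for the postings piece (purely illustrative; as noted above, the actual form depends on the search routine and hardware) lets a query fetch a term's postings with a single seek into the postings file:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        long  record;   /* document identifier */
        float weight;   /* optional term weight */
    } PostingRec;

    /* Read the postings list for one dictionary entry: one seek plus
       one sequential read.  Caller frees the result. */
    PostingRec *fetch_postings(FILE *pf, long offset, int npostings)
    {
        PostingRec *p = malloc((size_t)npostings * sizeof *p);
        if (p == NULL)
            return NULL;
        if (fseek(pf, offset, SEEK_SET) != 0 ||
            fread(p, sizeof *p, (size_t)npostings, pf) != (size_t)npostings) {
            free(p);
            return NULL;
        }
        return p;
    }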

3.4 MODIFICATIONS TO THE BASIC TECHNIQUE

Two different techniques are presented as improvements on the basic inverted file creation discussed in section 3.3. The first technique is for working with very large data sets using secondary storage. The second technique uses multiple memory loads for inverting files.

3.4.1 Producing an Inverted File for Large Data Sets without Sorting


Indexing large data sets using the basic inverted file method presents several problems. Most computers cannot sort the very large disk files needed to hold the initial word list within a reasonable time frame, and do not have the amount of storage necessary to hold a sorted and an unsorted version of that word list, plus the intermediate files involved in the internal sort. Whereas the data set could be broken into smaller pieces for processing, and the resulting files properly merged, the following technique may be considerably faster. For small data sets, this technique carries a significant overhead and therefore should not be used. (For another approach to sorting large amounts of data, see Chapter 5.)

The new indexing method (Harman and Candela 1990) is a two-step process that does not need the middle sorting step. The first step produces the initial inverted file, and the second step adds the term weights to that file and reorganizes the file for maximum efficiency (see Figure 3.6).

Figure 3.6: Flowchart of new indexing method

The creation of the initial inverted file avoids the use of an explicit sort by using a right-threaded binary tree (Knuth 1973). The data contained in each binary tree node is the current number of term postings and the storage location of the postings list for that term. As each term is identified by the text parsing program, it is looked up in the binary tree, and either is added to the tree, along with related data, or causes tree data to be updated. The postings are stored as multiple linked lists, one variable-length linked list for each term, with the lists stored in one large file. Each element in the linked postings file consists of a record number (the location of a given term), the term frequency in that record, and a pointer to the next element in the linked list for that given term. By storing the postings in a single file, no storage is wasted, and the files are easily accessed by following the links. As the locations of both the head and the tail of each linked list are stored in the binary tree, the entire list does not need to be read for each addition, but only once, for use in creating the final postings file (step two).
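A sketch of the two structures (ours; field names are illustrative, and the tree threading and disk handling are omitted): each tree node keeps the head and tail of its term's linked postings list, so recording a new occurrence touches at most one existing posting:

    #include <stdlib.h>

    typedef struct {        /* one element of the linked postings file */
        int  record;        /* document where the term occurred */
        int  freq;          /* within-document frequency */
        long next;          /* index of the next posting, -1 if last */
    } Posting;

    typedef struct Node {   /* one binary tree node per unique term */
        char *term;
        int   npostings;
        long  head, tail;   /* first and last posting of this term */
        struct Node *left, *right;
    } Node;

    static Posting *postings;   /* one large growable array (the "file") */
    static long     npost;

    /* Record one occurrence of node t's term in document rec: either
       bump the tail posting or link a fresh one.  The array is grown
       one element at a time for clarity only. */
    void add_occurrence(Node *t, int rec)
    {
        if (t->npostings > 0 && postings[t->tail].record == rec) {
            postings[t->tail].freq++;        /* same document again */
            return;
        }
        postings = realloc(postings, (size_t)(npost + 1) * sizeof(Posting));
        postings[npost] = (Posting){ rec, 1, -1 };
        if (t->npostings == 0)
            t->head = npost;
        else
            postings[t->tail].next = npost;  /* append to this term's list */
        t->tail = npost++;
        t->npostings++;
    }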

Note that both the binary tree and the linked postings list are capable of further growth. This is important in indexing large data sets, where data is usually processed from multiple separate files over a short period of time. The use of the binary tree and linked postings list could be considered as an updatable inverted file. Although these structures are not as efficient to search, this method could be used for creating and storing supplemental indices for use between updates to the primary index. However, see the earlier discussion of B-trees for better ways of producing updatable inverted files.

The binary tree and linked postings lists are saved for use by the term weighting routine (step two). This routine walks the binary tree and the linked postings list to create an alphabetical term list (dictionary) and a sequentially stored postings file. To do this, each term is consecutively read from the binary tree (this automatically puts the list in alphabetical order), along with its related data. A new sequentially stored postings file is allocated, with two elements per posting. The linked postings list is then traversed, with the frequencies being used to calculate the term weights (if desired). The last step writes the record numbers and corresponding term weights to the newly created sequential postings file. These sequentially stored postings files could not be created in step one because the number of postings is unknown at that point in processing, and the input order is text order, not inverted file order. The final index files therefore consist of the same dictionary and sequential postings file as for the basic inverted file described in section 3.3.

Table 3.1 gives some statistics showing the differences between an older indexing scheme and the new indexing scheme. The old indexing scheme refers to the indexing method discussed in section 3.3, in which records are parsed into a list of words within record locations, the list is inverted by sorting, and finally the term weights are added.

Table 3.1: Indexing Statistics

Text Size (megabytes)    Indexing Time (hours)    Working Storage (megabytes)    Index Storage (megabytes)

The new method takes more time for the very small (1.6 megabyte) database because of its additional processing overhead. As the size of the database increases, however, the processing time has an n log n relationship to the size of the database. The older method contains a sort (not optimal) which is n log n (best case) to n squared (worst case), making processing of the very large databases likely to have taken longer using this method.

3.4.2 A Fast Inversion Algorithm

The second technique to produce a sorted array inverted file is a fast inversion algorithm called FAST-INV (Copyright © Edward A. Fox, Whay C. Lee, Virginia Tech). This technique takes advantage of two principles: the large primary memories available on today's computers and the inherent order of the input data. The following summary of this technique is adapted from a technical report by Fox and Lee (1991).

The first principle is important since personal computers with more than 1 megabyte of primary memory are common, and mainframes may have more than 100 megabytes of memory. Even if databases are on the order of 1 gigabyte, if they can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.

The second principle is crucial since, with large files, it is very expensive to use polynomial or even n log n sorting algorithms. These costs are further compounded if memory is not used, since then the cost is for disk operations.

The FAST-INV algorithm follows these two principles, using primary memory in a close to optimal fashion, and processing the data in three passes. The overall scheme can be seen in Figure 3.7.

The input to FAST-INV is a document vector file containing the concept vectors for each document in the collection to be indexed. A sample document vector file can be seen in Figure 3.8. The document numbers appear in the left-hand column and the concept numbers of the words in each document appear in the right-hand column. This is similar to the initial word list shown in Figure 3.4 for the basic method, except that the words are represented by concept numbers, one concept number for each unique word in the collection (i.e., 250,000 unique words implies 250,000 unique concept numbers). Note, however, that the document vector file is in sorted order, so that concept numbers are sorted within document numbers, and document numbers are sorted within the collection. This is necessary for FAST-INV to work correctly.

Figure 3.7: Overall scheme of FAST-INV

Figure 3.8: Sample document vector
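As a toy illustration of why such pre-sorted input helps (our sketch, assuming a single memory load; this is not the actual FAST-INV code, whose passes are described below), an inversion can count concept occurrences, compute starting offsets, and place every posting in one linear pass, with no comparison sort, while keeping document numbers sorted within each concept:

    #include <stdlib.h>

    /* Invert an in-memory document vector file (doc[i], concept[i]),
       already sorted by document.  On return, the postings for concept
       c occupy out_docs[start[c] .. start[c+1]-1], still in document
       order; start[] must have nconcepts+1 entries. */
    void invert(const int *doc, const int *concept, int npairs,
                int nconcepts, int *out_docs, int *start)
    {
        for (int c = 0; c <= nconcepts; c++) start[c] = 0;
        for (int i = 0; i < npairs; i++) start[concept[i] + 1]++;
        for (int c = 1; c <= nconcepts; c++) start[c] += start[c - 1];

        int *fill = malloc((size_t)nconcepts * sizeof(int));
        for (int c = 0; c < nconcepts; c++) fill[c] = start[c];
        for (int i = 0; i < npairs; i++)
            out_docs[fill[concept[i]]++] = doc[i];  /* stays doc-sorted */
        free(fill);
    }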

Preparation
