Search Engines
Information Retrieval
in Practice
W. BRUCE CROFT
DONALD METZLER
TREVOR STROHMAN
Addison Wesley
Boston Columbus Indianapolis New York San Francisco Upper Saddle River
Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Editorial Assistant Sarah Milmore
Managing Editor Jeff Holcomb
Online Product Manager Bethany Tidd
Director of Marketing Margaret Waples
Marketing Manager Erin Davis
Marketing Coordinator Kathryn Ferranti
Senior Manufacturing Buyer Carol Melville
Text Design, Composition, and Illustrations W. Bruce Croft, Donald Metzler, and Trevor Strohman
Art Direction Linda Knowles
Cover Design Elena Sidorova
Cover Image © Peter Gudella / Shutterstock
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.
The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.
Library of Congress Cataloging-in-Publication Data available upon request
Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online at http://www.pearsoned.com/legal/permissions.htm
ISBN-13: 978-0-13-607224-9
ISBN-10: 0-13-607224-0
1 2 3 4 5 6 7 8 9 10-HP-13 12 11 10 09
Preface
This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,1 but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.
The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.
The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.
1 In keeping with common usage, most uses of the word "web" in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.
In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user's information need.
Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.
Information retrieval theory and the design, implementation, evaluation, and use of search engines cover too many topics to describe them all in depth in one book. We have tried to focus on the most important topics while giving some coverage to all aspects of this challenging and rewarding subject.
Supplements
A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:
• Extensive lecture slides (in PDF and PPT format)
• Solutions to selected end-of-chapter problems (instructors only)
• Test collections for exercises
• Galago search engine
The supplements are available at www.search-engines-book.com, or at www.aw.com
Acknowledgments
First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.
BRUCE CROFT
DON METZLER
TREVOR STROHMAN
Contents
1 Search Engines and Information Retrieval
1.1 What Is Information Retrieval?
1.2 The Big Issues
2.4 How Does It Really Work?
3 Crawls and Feeds
3.1 Deciding What to Search
3.2 Crawling the Web
3.2.1 Retrieving Web Pages
3.2.2 The Web Crawler
3.2.3 Freshness
3.2.4 Focused Crawling
3.2.5 Deep Web
3.6 Storing the Documents
3.6.1 Using a Database System
4.3.5 Phrases and N-grams
4.4 Document Structure and Markup
5 Ranking with Indexes
6 Queries and Interfaces
6.1 Information Needs and Queries
6.2 Query Transformation and Refinement
6.2.1 Stopping and Stemming Revisited
6.2.2 Spell Checking and Suggestions
6.2.3 Query Expansion
6.2.4 Relevance Feedback
6.2.5 Context and Personalization
6.3 Showing the Results
6.3.1 Result Pages and Snippets
6.3.2 Advertising and Search
6.3.3 Clustering the Results
7.2.1 Information Retrieval as Classification
7.2.2 The BM25 Ranking Algorithm
7.3 Ranking Based on Language Models
7.3.1 Query Likelihood Ranking
7.3.2 Relevance Models and Pseudo-Relevance Feedback
7.4 Complex Queries and Combining Evidence
7.4.1 The Inference Network Model
7.4.2 The Galago Query Language
8.4.1 Recall and Precision
8.4.2 Averaging and Interpolation
8.4.3 Focusing on the Top Documents
8.4.4 Using Preferences
8.7 The Bottom Line
9 Classification and Clustering
9.1 Classification and Categorization
9.1.1 Naive Bayes
9.1.2 Support Vector Machines
9.1.3 Evaluation
9.1.4 Classifier and Feature Selection
9.1.5 Spam, Sentiment, and Online Advertising
9.2 Clustering
9.2.1 Hierarchical and K-Means Clustering
9.2.2 K Nearest Neighbor Clustering
9.2.3 Evaluation
9.2.4 How to Choose K
9.2.5 Clustering and Search
10 Social Search
10.1 What Is Social Search?
10.2 User Tags and Manual Indexing
10.2.1 Searching Tags
10.2.2 Inferring Missing Tags
10.2.3 Browsing and Tag Clouds
10.3 Searching with Communities
10.5.2 P2P Networks
11 Beyond Bag of Words
11.1 Overview
11.2 Feature-Based Retrieval Models
11.3 Term Dependence Models
11.4 Structure Revisited
11.4.1 XML Retrieval
11.4.2 Entity Search
11.5 Longer Questions, Better Answers
11.6 Words, Pictures, and Music
11.7 One Search Fits All?
References
Index
The query process
A uniform resource locator (URL), split into three parts
Crawling the Web. The web crawler connects to web servers to find pages. Pages may link to other pages on the same server or on different servers.
An example robots.txt file
A simple crawling thread implementation
An HTTP HEAD request and server response
Age and freshness of a single page over time
Expected age of a page with mean change frequency λ = 1/7
An example link with anchor text
BigTable stores data in a single logical table, which is split into many smaller tablets
A BigTable row
Example of fingerprinting process
Example of simhash fingerprinting process
Main content block in a web page
Tag counts used to identify text blocks in a web page
Part of the DOM structure for the example web page
Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)
A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.
Vocabulary growth for the TREC AP89 collection compared to Heaps' law
Vocabulary growth for the TREC GOV2 collection compared to Heaps' law
Result size estimate for web search
Comparison of stemmer output for a TREC query. Stopwords have also been removed.
Output of a POS tagger for a TREC query
Part of a web page from Wikipedia
HTML source for example Wikipedia page
A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.
Pseudocode for the iterative PageRank algorithm
Trackback links in blog postings
Text tagged by information extraction
Sentence model for statistical entity extractor
Chinese segmentation and bigrams
The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores
A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.
An inverted index for the documents (sentences) in Table 5.1
An inverted index, with word counts, for the documents in
Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document
Pseudocode for a simple indexer
An example of index merging. The first and second indexes are merged together to produce the combined index.
MapReduce
Mapper for a credit card summing algorithm
Reducer for a credit card summing algorithm
Mapper for documents
Reducer for word postings
Document-at-a-time query evaluation. The numbers (x:y) represent a document number (x) and a word count (y).
A simple document-at-a-time retrieval algorithm
Term-at-a-time query evaluation
A simple term-at-a-time retrieval algorithm
Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.
A term-at-a-time retrieval algorithm with conjunctive processing
A document-at-a-time retrieval algorithm with conjunctive processing
MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.
Evaluation tree for the structured query #combine(#od:1(tropical fish) #od:1(aquarium fish) fish)
Top ten results for the query "tropical fish"
Geographic representation of Cape Cod using bounding rectangles
Typical document summary for a web search
An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm
Advertisements displayed by a search engine for the query "fish tanks"
Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster.
Classifying a document as relevant or non-relevant
Example inference network model
Inference network with three nodes
Galago query for the dependence model
Galago query for web data
Example of a TREC topic
Recall and precision values for two rankings of six relevant documents
Recall and precision values for rankings from two different queries
Recall-precision graphs for two queries
Interpolated recall-precision graphs for two queries
Average recall-precision graph using standard recall levels
Typical recall-precision graph for 50 queries from TREC
Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.
Example distribution of query effectiveness improvements
Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
Illustration of how documents are represented in the multinomial event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not.
Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles.
Generative process used by the Naive Bayes model. First, a class is chosen according to P(c), and then a document is chosen according to P(d|c).
Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.
Example output of SpamAssassin email spam filter
Example of web page spam, showing the main page and some of the associated term and link spam
Example product review incorporating sentiment
Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes "Aquariums", "Fish", and "Supplies" are example nodes within a semantic hierarchy. The web page is classified as "Aquariums - Fish" and the ad is classified as "Supplies - Fish". Here, "Aquariums" is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.
Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12
9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.
clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.
Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.
Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents.
Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish". The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.
Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.
Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm.
Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p → q.
Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.
Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.
Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).
A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks.
Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users.
Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged.
Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes.
Neighborhoods (N_i) of a hub node (H) in a hierarchical P2P
11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence
Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query "hubble telescope achievements"
Functions provided by a search engine interacting with a simple database system
Example of an entity search for organizations using the TREC Wall Street Journal 1987 Collection
Question answering system architecture
Examples of OCR errors
Examples of speech recognizer errors
Two images (a fish and a flower bed) with color histograms. The horizontal axis is hue value.
Three examples of content-based image retrieval. The collection for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query.
Key frames extracted from a TREC video clip
Examples of automatic text annotation of images
Three representations of Bach's "Fugue #10": audio, MIDI, and conventional music notation
Statistics for the AP89 collection
Most frequent 50 words from AP89
Low-frequency words from AP89
Example word frequency ranking
Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209.
Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179.
Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.
Examples of words with the Arabic root ktb
High-frequency noun phrases from a TREC collection and U.S. patents from 1996
Statistics for the Google n-gram sample
Four sentences from the Wikipedia entry for tropical fish
Elias-γ code examples
Elias-δ code examples
Space requirements for numbers encoded in v-byte
Sample encodings for v-byte
Skip lengths (k) and expected processing steps
Partial entry for the Medical Subject (MeSH) Heading "Neck Pain"
Term association measures
Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.
Contingency table of term occurrences for a particular query
BM25 scores for an example document
Query likelihood scores for an example document
Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)
Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)
Conditional probabilities for example network
Highest-probability terms from four topics in LDA model
Statistics for three example text collections. The average number of words per document is calculated without stemming.
Statistics for queries from example text collections
Sets of documents defined by a simple search with binary relevance
Precision values at standard recall levels calculated using interpolation
Definitions of some important efficiency metrics
Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B - A gives the difference in effectiveness.
A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given.
Example questions submitted to Yahoo! Answers
Translations automatically learned from a set of question and answer pairs. The 10 most likely translations for the terms "everest", "xp", and "search" are given.
Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given.
Contingency table for the possible outcomes of a filtering system. Here, TP (true positive) is the number of relevant documents retrieved, FN (false negative) is the number of relevant documents not retrieved, FP (false positive) is the number of non-relevant documents retrieved, and TN (true negative) is the number of non-relevant documents not retrieved.
Most likely one- and two-word concepts produced using latent concept expansion with the top 25 documents retrieved for the query "hubble telescope achievements" on the TREC
Sam Lowry, Brazil
1.1 What Is Information Retrieval?
This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval.1 So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book.
Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):
Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate. The term "information" is very general, and information retrieval includes work on a wide range of types of information and a variety of applications related to search.
1 Information retrieval is often abbreviated as IR. In this book, we mostly use the full term. This has nothing to do with the fact that many people think IR means "infrared" or something else.
The primary focus of the field since the 1950s has been on text and text documents. Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author, date, and abstract information associated with the content of papers that appear in scientific journals. The elements of this structure are called attributes, or fields, when referring to database records. The important distinction between a document and a typical database record, such as a bank account record or a flight reservation, is that most of the information in the document is in the form of text, which is relatively unstructured.
To illustrate this difference, consider the information contained in two typical attributes of an account record, the account number and current balance. Both are very well defined, both in terms of their format (for example, a six-digit integer for an account number and a real number with two decimal places for balance) and their meaning. It is very easy to compare values of these attributes, and consequently it is straightforward to implement algorithms to identify the records that satisfy queries such as "Find account number 321456" or "Find accounts with balances greater than $50,000.00".
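As a rough illustration of why such queries are straightforward, the following Python sketch answers both example queries with a single comparison per record. The `Account` class, its field names, and the sample balances are invented for this illustration and do not come from any particular database system.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Account:
    """Hypothetical account record with two well-defined attributes."""
    number: int       # six-digit account number
    balance: Decimal  # balance kept to two decimal places


# Illustrative sample data (made up for this sketch).
ACCOUNTS = [
    Account(321456, Decimal("1250.75")),
    Account(654321, Decimal("73000.00")),
    Account(111222, Decimal("50000.00")),
]


def find_by_number(accounts, number):
    """Exact-match query: 'Find account number N'."""
    return [a for a in accounts if a.number == number]


def find_with_balance_over(accounts, threshold):
    """Range query: 'Find accounts with balances greater than T'."""
    return [a for a in accounts if a.balance > threshold]


if __name__ == "__main__":
    print(find_by_number(ACCOUNTS, 321456))
    print(find_with_balance_over(ACCOUNTS, Decimal("50000.00")))
```

Because the format and meaning of each attribute are fixed, equality and ordering are all that these queries require.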
Now consider a news story about the merger of two banks. The story will have some attributes, such as the headline and source of the story, but the primary content is the story itself. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most of the queries submitted to a web search engine such as Google2 that relate to this story will be of the form "bank merger" or "bank takeover". To do this search, we must design algorithms that can compare the text of the queries with the text of the story and decide whether the story contains the information that is being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much more difficult than defining an account number, and consequently comparing text is not easy. Understanding and modeling how people compare texts, and designing computer algorithms to accurately perform this comparison, is at the core of information retrieval.
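To make the contrast with the account example concrete, here is a minimal Python sketch of about the crudest possible text comparison: the fraction of query words that also occur in a story. The tokenizer, the `overlap_score` function, and the sample story are assumptions made for illustration. This is not a ranking algorithm described in this book, only a demonstration of how exact word matching can miss a story that describes the same event in different words.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(query, document):
    """Fraction of distinct query words that occur in the document."""
    query_words = set(tokenize(query))
    doc_words = set(tokenize(document))
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


# Made-up story text, used only to exercise the functions.
STORY = (
    "Two regional banks announced a merger on Monday. "
    "The bank merger is expected to close next year."
)

if __name__ == "__main__":
    for q in ("bank merger", "bank takeover", "financial acquisition"):
        print(q, overlap_score(q, STORY))
```

In this sketch the query "bank merger" matches fully, "bank takeover" matches only on "bank", and "financial acquisition" matches nothing, even though a person might judge the story relevant to all three.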
Increasingly, applications of information retrieval involve multimedia documents with structure, significant text content, and other media. Popular information media include pictures, video, and audio, including music and speech. In some applications, such as in legal support, scanned document images are also important. These media have content that, like text, is difficult to describe and compare. The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example.
2 http://www.google.com
In addition to a range of media, information retrieval involves a range of tasks and applications. The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order. Although searching the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains. Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic. Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices. Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on. Any application that involves a collection3 of text or other unstructured information will need to organize and search that information.
Search based on a user query (sometimes called ad hoc search because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering. Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism. Classification or categorization uses a defined set of labels or classes (such as the categories listed in the Yahoo! Directory4) and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific questions, such as "What is the height of Mt. Everest?" The goal of question answering is to return a specific answer found in the text, rather than a list of documents. Table 1.1 summarizes some of these aspects or dimensions of the field of information retrieval.
3 The term database is often used to refer to a collection of either structured or unstructured data. To avoid confusion, we mostly use the term document collection (or just collection) for text. However, the terms web database and search engine database are so common that we occasionally use them in this book.
Examples of Content    Examples of Applications    Examples of Tasks
Text                   Web search                  Ad hoc search
Images                 Vertical search             Filtering
Video                  Enterprise search           Classification
Scanned documents      Desktop search              Question answering
Audio                  Peer-to-peer search
Music
Table 1.1 Some dimensions of information retrieval
1.2 The Big Issues
Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text. One of these issues is relevance. Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person's decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.
4 http://dir.yahoo.com/
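To make the contrast between exact matching and word-level comparison concrete, here is a minimal Python sketch. It is not taken from any particular system: the two documents, the query, and the function names `exact_match` and `overlap_score` are hypothetical and chosen only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def exact_match(query, document):
    """grep-style test: the query string must appear verbatim in the document."""
    return query.lower() in document.lower()

def overlap_score(query, document):
    """Count how many distinct query words also occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

# Hypothetical two-document collection (an assumption for this example).
docs = {
    "d1": "A tornado caused severe damage in Kansas after a weather warning.",
    "d2": "Twisters and hailstorms struck the plains overnight.",
}
query = "severe weather events"

for doc_id, text in docs.items():
    print(doc_id, exact_match(query, text), overlap_score(query, text))
```

Under these assumptions, exact matching rejects both documents and word overlap finds only d1. The second document shares no words with the query even though it is arguably on the same topic, which is a small instance of vocabulary mismatch.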
It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query "severe weather events". The person who asked the question (often called the user) may not consider the story relevant, however, if she has seen that story before, or if the story is five years old, or if the story is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.
To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.
An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.
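As a rough illustration of representing text by word counts rather than by linguistic structure, the Python sketch below builds a word-frequency table for a single document. The sample sentence and the function name `term_frequencies` are assumptions made for this example, and the tokenizer is deliberately crude.

```python
from collections import Counter
import re

def term_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical document text, used only for illustration.
doc = "Tropical fish need warm water. Tropical fish tanks need a heater."
tf = term_frequencies(doc)

# Show every word with its count, most frequent first.
for word, count in tf.most_common():
    print(word, count)
```

Note that nothing in this representation records whether a word is a noun or an adjective; only the counts are kept.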
Another core issue for information retrieval is evaluation. Since the quality of a document ranking depends on how well it matches a person's expectations, it was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure, and is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. A test collection5 for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC6 evaluation forum.
Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data is strongly correlated with relevance so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as "cats" could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.
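As a very simplified picture of what query expansion does, the Python sketch below adds related terms to a query from a small hand-written table. The table `RELATED_TERMS` and the function `expand_query` are hypothetical; this is only an assumption-laden toy, since practical systems would draw expansion terms from sources such as the document collection or query logs rather than a fixed dictionary.

```python
# Hypothetical table of related terms; a real system would not hard-code this.
RELATED_TERMS = {
    "cats": ["cat", "kitten", "feline"],
    "aquarium": ["fish tank"],
}

def expand_query(query, related=RELATED_TERMS):
    """Return the query words followed by any related terms, without duplicates."""
    expanded = []
    for word in query.lower().split():
        for term in [word] + related.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_query("cats"))  # prints ['cats', 'cat', 'kitten', 'feline']
```

The expanded term list could then be matched against documents in place of the original one-word query.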
These issues will come up throughout this book, and will be discussed in considerably more detail. We now have sufficient background to start talking about the main product of research in information retrieval, namely search engines.
5 Also known as an evaluation corpus (plural corpora).
6 Text REtrieval Conference: http://trec.nist.gov/
1.3 Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections. A web search engine is the obvious example, but as has been mentioned, search engines can be found in many different applications, such as desktop search or enterprise search. Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s. The term "search engine" was originally used to refer to specialized hardware for text search. From the mid-1980s onward, however, it gradually came to be used in preference to "information retrieval system" as the name for the software system that compares queries to documents and produces ranked result lists of documents. There is much more to a search engine than the ranking algorithm, of course, and we will discuss the general architecture of these systems in the next chapter.
Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!,7 must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. Enterprise search engines (for example, Autonomy8) must be able to process the large variety of information sources in a company and use company-specific knowledge as part of search and related tasks, such as data mining. Data mining refers to the automatic discovery of interesting structure in data and includes techniques such as clustering. Desktop search engines, such as the Microsoft Vista™ search feature, must be able to rapidly incorporate new documents, web pages, and email as the person creates or looks at them, as well as provide an intuitive interface for searching this very heterogeneous mix of information. There is overlap between these categories with systems such as Google, for example, which is available in configurations for enterprise and desktop search.
Open source search engines are another important class of systems that have somewhat different design goals than the commercial search engines. There are a number of these systems, and the Wikipedia page for information retrieval9 provides links to many of them. Three systems of particular interest are Lucene,10 Lemur,11 and the system provided with this book, Galago.12 Lucene is a popular Java-based search engine that has been used for a wide range of commercial applications. The information retrieval techniques that it uses are relatively simple. Lemur is an open source toolkit that includes the Indri C++-based search engine. Lemur has primarily been used by information retrieval researchers to compare advanced search techniques. Galago is a Java-based search engine that is based on the Lemur and Indri projects. The assignments in this book make extensive use of Galago. It is designed to be fast, adaptable, and easy to understand, and incorporates very effective information retrieval techniques.
7 http://www.yahoo.com
The "big issues" in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this book.
Another important performance measure is how fast new data can be incorporated into the indexes. Search applications typically deal with dynamic, constantly changing information. Coverage measures how much of the existing information in, say, a corporate information environment has been indexed and stored in the search engine, and recency or freshness measures the "age" of the stored information.
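To give a feel for how an index can speed up search, the following Python sketch builds a tiny word-to-documents mapping and answers a query by intersecting the entries for its words, so that documents are not rescanned for every query. This is a toy illustration under our own assumptions, not the index design used by Galago or any other system; the documents and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the sorted IDs of documents containing every query word."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical three-document collection.
docs = {
    1: "tropical fish aquarium",
    2: "fish market prices",
    3: "tropical storm warning",
}
index = build_index(docs)
print(search(index, "tropical fish"))  # prints [1]
```

In this sketch, indexing speed corresponds to how quickly `build_index` runs over the collection, and response time to how long a single call to `search` takes. Incorporating a new document would mean adding its words to the existing mapping rather than rebuilding it.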
Search engines can be used with small collections, such as a few hundred emails and documents on a desktop, or extremely large collections, such as the entire Web. There may be only a few users of a given application, or many thousands. Scalability is clearly an important issue for search engine design. Designs that work for a given application should continue to work as the amount of data and the number of users grow. In section 1.1, we described how search engines are used in many applications and for many tasks. To do this, they have to be customizable or adaptable. This means that many different aspects of the search engine, such as the ranking algorithm, the interface, or the indexing strategy, must be able to be tuned and adapted to the requirements of the application.
Practical issues that impact search engine design also occur for specific applications. The best example of this is spam in web search. Spam is generally thought of as unwanted email, but more generally it could be defined as misleading, inappropriate, or non-relevant information in a document that is designed for some
Fig. 1.1 Search engine design and the core information retrieval issues (surviving labels: Efficient search and indexing; Incorporating new data; Coverage and freshness)
Based on this discussion of the relationship between information retrieval and search engines, we now consider what roles computer scientists and others play in the design and use of search engines.
1.4 Search Engineers
Information retrieval research involves the development of mathematical models of text and language, large-scale experiments with test collections or users, and a lot of scholarly paper writing. For these reasons, it tends to be done by academics or people in research laboratories. These people are primarily trained in computer science, although information science, mathematics, and, occasionally, social science and computational linguistics are also represented. So who works with search engines? To a large extent, it is the same sort of people but with a more practical emphasis. The computing industry has started to use the term search engineer to describe this type of person. Search engineers are primarily people trained in computer science, mostly with a systems or database background. Surprisingly few of them have training in information retrieval, which is one of the major motivations for this book.
What is the role of a search engineer? Certainly the people who work in the major web search companies designing and implementing new search engines are search engineers, but the majority of search engineers are the people who modify, extend, maintain, or tune existing search engines for a wide range of commercial applications. People who design or "optimize" content for search engines are also search engineers, as are people who implement techniques to deal with spam. The search engines that search engineers work with cover the entire range mentioned in the last section: they primarily use open source and enterprise search engines for application development, but also get the most out of desktop and web search engines.
The importance and pervasiveness of search in modern computer applications has meant that search engineering has become a crucial profession in the computer industry. There are, however, very few courses being taught in computer science departments that give students an appreciation of the variety of issues that are involved, especially from the information retrieval perspective. This book is intended to give potential search engineers the understanding and tools they need.
References and Further Reading
In each chapter, we provide some pointers to papers and books that give more detail on the topics that have been covered. This additional reading should not be necessary to understand material that has been presented, but instead will give more background, more depth in some cases, and, for advanced topics, will describe techniques and research results that are not covered in this book.
The classic references on information retrieval, in our opinion, are the books by Salton (1968; 1983) and van Rijsbergen (1979). Van Rijsbergen's book remains popular, since it is available on the Web.13 All three books provide excellent descriptions of the research done in the early years of information retrieval, up to the late 1970s. Salton's early book was particularly important in terms of defining
13 http://www.dcs.gla.ac.uk/Keith/Preface.html
on information retrieval and search also appear in the European Conference on Information Retrieval (ECIR), the Conference on Information and Knowledge Management (CIKM), and the Web Search and Data Mining Conference (WSDM). The WSDM conference is a spin-off of the World Wide Web Conference (WWW), which has included some important papers on web search. The proceedings from the TREC workshops are available online and contain useful descriptions of new research techniques from many different academic and industry groups. An overview of the TREC experiments can be found in Voorhees and Harman (2005). An increasing number of search-related papers are beginning to appear in database conferences, such as VLDB and SIGMOD. Occasional papers also show up in language technology conferences, such as ACL and HLT (Association for Computational Linguistics and Human Language Technologies), machine learning conferences, and others.
14 http://www.acm.org/dl
1.2 Site search is another common application of search engines. In this case, search is restricted to the web pages at a given website. Compare site search to web search, vertical search, and enterprise search.
1.3 Use the Web to find as many examples as you can of open source search engines, information retrieval systems, or related technology. Give a brief description of each search engine and summarize the similarities and differences between them.
1.4 List five web services or sites that you use that appear to use search, not including web search engines. Describe the role of search for that service. Also describe whether the search is based on a database or grep style of matching, or if the search is using some type of ranking.
Architecture of a Search Engine
"While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant."
The Architect, Matrix Reloaded
2.1 What Is an Architecture?
In this chapter, we describe the basic software architecture of a search engine. Although there is no universal agreement on the definition, a software architecture generally consists of software components, the interfaces provided by those components, and the relationships between them. An architecture is used to describe a system at a particular level of abstraction. An example of an architecture used to provide a standard for integrating search and related language technology components is UIMA (Unstructured Information Management Architecture).1 UIMA defines interfaces for components in order to simplify the addition of new technologies into systems that handle text and other unstructured data.
Our search engine architecture is used to present high-level descriptions of the important components of the system and the relationships between them. It is not a code-level description, although some of the components do correspond to software modules in the Galago search engine and other systems. We use this architecture in this chapter and throughout the book to provide context to the discussion of specific techniques.
An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are:
• Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possible.
1 http://www.research.ibm.com/UIMA