Search Engines Information Retrieval in Practice


W. BRUCE CROFT, DONALD METZLER, TREVOR STROHMAN

Addison Wesley

Boston Columbus Indianapolis New York San Francisco Upper Saddle River

Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto

Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo


Editorial Assistant Sarah Milmore

Managing Editor Jeff Holcomb

Online Product Manager Bethany Tidd

Director of Marketing Margaret Waples

Marketing Manager Erin Davis

Marketing Coordinator Kathryn Ferranti

Senior Manufacturing Buyer Carol Melville

Text Design, Composition, and Illustrations W. Bruce Croft, Donald Metzler, and Trevor Strohman

Art Direction Linda Knowles

Cover Design Elena Sidorova

Cover Image © Peter Gudella / Shutterstock

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Library of Congress Cataloging-in-Publication Data available upon request

Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online at http://www.pearsoned.com/legal/permissions.htm

ISBN-13: 978-0-13-607224-9

ISBN-10: 0-13-607224-0

1 2 3 4 5 6 7 8 9 10-HP-13 12 11 10 09


This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,1 but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.

The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.

The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.

1 In keeping with common usage, most uses of the word "web" in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.


In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user's information need.

Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.

Information retrieval theory and the design, implementation, evaluation, and use of search engines cover too many topics to describe them all in depth in one book. We have tried to focus on the most important topics while giving some coverage to all aspects of this challenging and rewarding subject.

Supplements

A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:

• Extensive lecture slides (in PDF and PPT format)

• Solutions to selected end-of-chapter problems (instructors only)

• Test collections for exercises

• Galago search engine

The supplements are available at www.search-engines-book.com, or at www.aw.com

Acknowledgments

First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.

BRUCE CROFT

DON METZLER

TREVOR STROHMAN

1 Search Engines and Information Retrieval

1.1 What Is Information Retrieval?

1.2 The Big Issues

2.4 How Does It Really Work?

3 Crawls and Feeds

3.1 Deciding What to Search

3.2 Crawling the Web

3.2.1 Retrieving Web Pages

3.2.2 The Web Crawler

3.2.3 Freshness

3.2.4 Focused Crawling

3.2.5 Deep Web

3.6 Storing the Documents

3.6.1 Using a Database System

4.3.5 Phrases and N-grams

4.4 Document Structure and Markup


5 Ranking with Indexes

6 Queries and Interfaces

6.1 Information Needs and Queries

6.2 Query Transformation and Refinement

6.2.1 Stopping and Stemming Revisited

6.2.2 Spell Checking and Suggestions

6.2.3 Query Expansion

6.2.4 Relevance Feedback

6.2.5 Context and Personalization

6.3 Showing the Results

6.3.1 Result Pages and Snippets

6.3.2 Advertising and Search

6.3.3 Clustering the Results

7.2.1 Information Retrieval as Classification

7.2.2 The BM25 Ranking Algorithm

7.3 Ranking Based on Language Models

7.3.1 Query Likelihood Ranking

7.3.2 Relevance Models and Pseudo-Relevance Feedback

7.4 Complex Queries and Combining Evidence

7.4.1 The Inference Network Model

7.4.2 The Galago Query Language

8.4.1 Recall and Precision

8.4.2 Averaging and Interpolation

8.4.3 Focusing on the Top Documents

8.4.4 Using Preferences


8.7 The Bottom Line

9 Classification and Clustering

9.1 Classification and Categorization

9.1.1 Naive Bayes

9.1.2 Support Vector Machines

9.1.3 Evaluation

9.1.4 Classifier and Feature Selection

9.1.5 Spam, Sentiment, and Online Advertising

9.2 Clustering

9.2.1 Hierarchical and K-Means Clustering

9.2.2 K Nearest Neighbor Clustering

9.2.3 Evaluation

9.2.4 How to Choose K

9.2.5 Clustering and Search

10 Social Search

10.1 What Is Social Search?

10.2 User Tags and Manual Indexing

10.2.1 Searching Tags

10.2.2 Inferring Missing Tags

10.2.3 Browsing and Tag Clouds

10.3 Searching with Communities


10.5.2 P2P Networks

11 Beyond Bag of Words

11.1 Overview

11.2 Feature-Based Retrieval Models

11.3 Term Dependence Models

11.4 Structure Revisited

11.4.1 XML Retrieval

11.4.2 Entity Search

11.5 Longer Questions, Better Answers

11.6 Words, Pictures, and Music

11.7 One Search Fits All?

References

Index


The query process

A uniform resource locator (URL), split into three parts

Crawling the Web. The web crawler connects to web servers to find pages. Pages may link to other pages on the same server or on different servers.

An example robots.txt file

A simple crawling thread implementation

An HTTP HEAD request and server response

Age and freshness of a single page over time

Expected age of a page with mean change frequency λ = 1/7

An example link with anchor text

BigTable stores data in a single logical table, which is split into many smaller tablets

A BigTable row

Example of fingerprinting process

Example of simhash fingerprinting process

Main content block in a web page

Tag counts used to identify text blocks in a web page

Part of the DOM structure for the example web page

Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)

A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.

Vocabulary growth for the TREC AP89 collection compared to Heaps' law

Vocabulary growth for the TREC GOV2 collection compared to Heaps' law

Result size estimate for web search

Comparison of stemmer output for a TREC query. Stopwords have also been removed.

Output of a POS tagger for a TREC query

Part of a web page from Wikipedia

HTML source for example Wikipedia page

A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.

Pseudocode for the iterative PageRank algorithm

Trackback links in blog postings

Text tagged by information extraction

Sentence model for statistical entity extractor

Chinese segmentation and bigrams

The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores

A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.

An inverted index for the documents (sentences) in Table 5.1

An inverted index, with word counts, for the documents in

Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document

Pseudocode for a simple indexer

An example of index merging. The first and second indexes are merged together to produce the combined index.

MapReduce

Mapper for a credit card summing algorithm

Reducer for a credit card summing algorithm

Mapper for documents

Reducer for word postings

Document-at-a-time query evaluation. The numbers (x:y) represent a document number (x) and a word count (y).

A simple document-at-a-time retrieval algorithm

Term-at-a-time query evaluation

A simple term-at-a-time retrieval algorithm

Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.

A term-at-a-time retrieval algorithm with conjunctive processing

A document-at-a-time retrieval algorithm with conjunctive processing

MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.

Evaluation tree for the structured query #combine(#od:1(tropical fish) #od:1(aquarium fish) fish)

Top ten results for the query "tropical fish"

Geographic representation of Cape Cod using bounding rectangles

Typical document summary for a web search

An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm

Advertisements displayed by a search engine for the query "fish tanks"

Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster.

Classifying a document as relevant or non-relevant

Example inference network model

Inference network with three nodes

Galago query for the dependence model

Galago query for web data

Example of a TREC topic

Recall and precision values for two rankings of six relevant documents

Recall and precision values for rankings from two different queries

Recall-precision graphs for two queries

Interpolated recall-precision graphs for two queries

Average recall-precision graph using standard recall levels

Typical recall-precision graph for 50 queries from TREC

Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.

Example distribution of query effectiveness improvements

Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".

Illustration of how documents are represented in the multinomial event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".

Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not.

Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles.

Generative process used by the Naive Bayes model. First, a class is chosen according to P(c), and then a document is chosen according to P(d|c).

Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.

Example output of SpamAssassin email spam filter

Example of web page spam, showing the main page and some of the associated term and link spam

Example product review incorporating sentiment

Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes "Aquariums", "Fish", and "Supplies" are example nodes within a semantic hierarchy. The web page is classified as "Aquariums - Fish" and the ad is classified as "Supplies - Fish". Here, "Aquariums" is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.

Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.

Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.

Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12

9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.

clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.

Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.

Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents.

Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish". The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.

Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.

Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm.

Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p → q.

Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.

Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.

Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).

A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks.

Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users.

Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged.

Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes.

Neighborhoods (N_i) of a hub node (H) in a hierarchical P2P

11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence

Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query "hubble telescope achievements"

Functions provided by a search engine interacting with a simple database system

Example of an entity search for organizations using the TREC Wall Street Journal 1987 Collection

Question answering system architecture

Examples of OCR errors

Examples of speech recognizer errors

Two images (a fish and a flower bed) with color histograms. The horizontal axis is hue value.

Three examples of content-based image retrieval. The collection for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query.

Key frames extracted from a TREC video clip

Examples of automatic text annotation of images

Three representations of Bach's "Fugue #10": audio, MIDI, and conventional music notation

Statistics for the AP89 collection

Most frequent 50 words from AP89

Low-frequency words from AP89

Example word frequency ranking

Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209.

Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179.

Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.

Examples of words with the Arabic root ktb

High-frequency noun phrases from a TREC collection and U.S. patents from 1996

Statistics for the Google n-gram sample

Four sentences from the Wikipedia entry for tropical fish

Elias-γ code examples

Elias-δ code examples

Space requirements for numbers encoded in v-byte

Sample encodings for v-byte

Skip lengths (k) and expected processing steps

Partial entry for the Medical Subject (MeSH) Heading "Neck Pain"

Term association measures

Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.

Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.

Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.

Contingency table of term occurrences for a particular query

BM25 scores for an example document

Query likelihood scores for an example document

Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)

Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)

Conditional probabilities for example network

Highest-probability terms from four topics in LDA model

Statistics for three example text collections. The average number of words per document is calculated without stemming.

Statistics for queries from example text collections

Sets of documents defined by a simple search with binary relevance

Precision values at standard recall levels calculated using interpolation

Definitions of some important efficiency metrics

Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B - A gives the difference in effectiveness.

A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given.

Example questions submitted to Yahoo! Answers

Translations automatically learned from a set of question and answer pairs. The 10 most likely translations for the terms "everest", "xp", and "search" are given.

Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given.

Contingency table for the possible outcomes of a filtering system. Here, TP (true positive) is the number of relevant documents retrieved, FN (false negative) is the number of relevant documents not retrieved, FP (false positive) is the number of non-relevant documents retrieved, and TN (true negative) is the number of non-relevant documents not retrieved.

Most likely one- and two-word concepts produced using latent concept expansion with the top 25 documents retrieved for the query "hubble telescope achievements" on the TREC

Sam Lowry, Brazil

1.1 What Is Information Retrieval?

This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval.1 So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book.

Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):

Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.

Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate. The term "information" is very general, and information retrieval includes work on a wide range of types of information and a variety of applications related to search.

1 Information retrieval is often abbreviated as IR. In this book, we mostly use the full term. This has nothing to do with the fact that many people think IR means "infrared" or something else.

The primary focus of the field since the 1950s has been on text and text documents. Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author, date, and abstract information associated with the content of papers that appear in scientific journals. The elements of this structure are called attributes, or fields, when referring to database records. The important distinction between a document and a typical database record, such as a bank account record or a flight reservation, is that most of the information in the document is in the form of text, which is relatively unstructured.

To illustrate this difference, consider the information contained in two typical attributes of an account record, the account number and current balance. Both are very well defined, both in terms of their format (for example, a six-digit integer for an account number and a real number with two decimal places for balance) and their meaning. It is very easy to compare values of these attributes, and consequently it is straightforward to implement algorithms to identify the records that satisfy queries such as "Find account number 321456" or "Find accounts with balances greater than $50,000.00".
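The two example queries can be sketched in a few lines of Python. This is only an illustration, not code from the book or from Galago: the `Account` class, the sample records, and the helper names `find_by_number` and `find_balance_greater_than` are all assumptions made up for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Account:
    """Hypothetical account record with two well-defined attributes."""
    number: int        # six-digit account number
    balance: Decimal   # current balance, two decimal places


# Made-up sample records, used only for illustration.
accounts = [
    Account(321456, Decimal("1200.50")),
    Account(654321, Decimal("75000.00")),
    Account(111222, Decimal("50000.00")),
]


def find_by_number(records, number):
    """'Find account number N': an exact equality test on one attribute."""
    return [r for r in records if r.number == number]


def find_balance_greater_than(records, threshold):
    """'Find accounts with balances greater than T': a numeric comparison."""
    return [r for r in records if r.balance > threshold]


match = find_by_number(accounts, 321456)
large = find_balance_greater_than(accounts, Decimal("50000.00"))
```

Because the format and meaning of each attribute are fixed, each query reduces to a single comparison operator, and there is no ambiguity about whether a record satisfies it.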

Now consider a news story about the merger of two banks. The story will have some attributes, such as the headline and source of the story, but the primary content is the story itself. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most of the queries submitted to a web search engine such as Google2 that relate to this story will be of the form "bank merger" or "bank takeover". To do this search, we must design algorithms that can compare the text of the queries with the text of the story and decide whether the story contains the information that is being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much more difficult than defining an account number, and consequently comparing text is not easy. Understanding and modeling how people compare texts, and designing computer algorithms to accurately perform this comparison, is at the core of information retrieval.
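To see why comparing text is harder than comparing attribute values, consider a deliberately naive matcher, sketched here in Python. It is an assumption-laden toy, not a retrieval model from this book: the `tokenize` and `overlap_score` functions and the two sample stories are invented for illustration, and the score is simply the fraction of query words that appear verbatim in the story.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, document):
    """Fraction of query terms that occur verbatim in the document."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    doc_terms = set(tokenize(document))
    matched = sum(1 for term in query_terms if term in doc_terms)
    return matched / len(query_terms)


# Two invented stories about the same kind of event.
story_a = ("First National said it will acquire Riverside Savings, "
           "combining two regional banks.")
story_b = "Regulators approved the bank merger on Friday."

score_a = overlap_score("bank merger", story_a)    # no exact word in common
score_b = overlap_score("bank merger", story_b)    # both words present
score_c = overlap_score("bank takeover", story_b)  # only "bank" present
```

Under these assumptions, the first story scores zero for "bank merger" even though it describes exactly that event, because it says "acquire" and "banks" rather than "merger" and "bank". Exact word matching is well defined, but it is a poor stand-in for the meaning of the text.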

Increasingly, applications of information retrieval involve multimedia documents with structure, significant text content, and other media. Popular information media include pictures, video, and audio, including music and speech. In some applications, such as in legal support, scanned document images are also important. These media have content that, like text, is difficult to describe and compare. The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example.

2 http://www.google.com

In addition to a range of media, information retrieval involves a range of tasks and applications. The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order. Although searching the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains. Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic. Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices. Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on. Any application that involves a collection3 of text or other unstructured information will need to organize and search that information.

Search based on a user query (sometimes called ad hoc search because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering. Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism. Classification or categorization uses a defined set of labels or classes (such as the categories listed in the Yahoo! Directory4) and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific questions, such as "What is the height of Mt. Everest?" The goal of question answering is to return a specific answer found in the text, rather than a list of documents. Table 1.1 summarizes some of these aspects or dimensions of the field of information retrieval.

3 The term database is often used to refer to a collection of either structured or unstructured data. To avoid confusion, we mostly use the term document collection (or just collection) for text. However, the terms web database and search engine database are so common that we occasionally use them in this book.

Examples of Content    Examples of Applications    Examples of Tasks
Text                   Web search                  Ad hoc search
Images                 Vertical search             Filtering
Video                  Enterprise search           Classification
Scanned documents      Desktop search              Question answering
Audio                  Peer-to-peer search
Music

Table 1.1 Some dimensions of information retrieval

1.2 The Big Issues

Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text. One of these issues is relevance.

4 http://dir.yahoo.com/

Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person's decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.
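The gap between exact matching and relevance can be seen in a small sketch. The Python below is illustrative only: the two documents, the query, and the helper names `exact_match` and `shared_terms` are made up for this example and are not taken from any search engine described in this book.

```python
def exact_match(query: str, document: str) -> bool:
    """Return True only if the query string occurs verbatim in the document."""
    return query.lower() in document.lower()


def shared_terms(query: str, document: str) -> set:
    """Return the set of lowercased words that the query and document share."""
    return set(query.lower().split()) & set(document.lower().split())


# Hypothetical two-document collection.
docs = {
    "d1": "A tornado struck a small town in Kansas on Tuesday",
    "d2": "Severe weather events are becoming more frequent",
}
query = "severe weather events"

# Grep-style retrieval: keep only documents containing the exact query string.
matches = [doc_id for doc_id, text in docs.items() if exact_match(query, text)]

# d1 is arguably on the same topic, yet it has no word in common with the query.
overlap_d1 = shared_terms(query, docs["d1"])
```

In this toy collection only `d2` is returned, and `d1` shares no words at all with the query, so neither exact matching nor simple word overlap would find it. That is the vocabulary mismatch problem in miniature.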

It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query "severe weather events". The person who asked the question (often called the user) may not consider the story relevant, however, if she has seen that story before, or if the story is five years old, or if the story is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.

To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.

An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.
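As a rough illustration of ranking by word counts alone, the following Python sketch scores each document by how often the query words occur in it. This is a deliberately naive scoring function chosen for the example, not one of the retrieval models developed later in the book; the documents and the names `tokenize`, `score`, and `rank` are assumptions.

```python
from collections import Counter


def tokenize(text: str) -> list:
    """Split text into lowercased words; no stemming or stopword removal."""
    return text.lower().split()


def score(query: str, document: str) -> int:
    """Sum the number of occurrences in the document of each query word."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in tokenize(query))


def rank(query: str, docs: dict) -> list:
    """Return ids of documents with a nonzero score, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in scored if s > 0]


# Hypothetical collection.
docs = {
    "d1": "tropical fish and tropical plants",
    "d2": "fish tanks for sale",
    "d3": "garden plants",
}
ranking = rank("tropical fish", docs)
```

Nothing in this sketch knows whether "fish" is a noun or a verb; the ranking depends only on counts, which is the statistical view of text described above.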

Another core issue for information retrieval is evaluation. Since the quality of a document ranking depends on how well it matches a person's expectations, it was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure, and is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. A test collection5 for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC6 evaluation forum.
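The two measures can be written down directly from their definitions. The sketch below is a minimal Python version; the document identifiers and the relevance judgments are invented for illustration.

```python
def precision(retrieved, relevant) -> float:
    """Proportion of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant) -> float:
    """Proportion of relevant documents that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical result list and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}

p = precision(retrieved, relevant)  # 2 of the 4 retrieved documents are relevant
r = recall(retrieved, relevant)     # 2 of the 3 relevant documents were retrieved
```

Note that `recall` needs the complete set of relevant documents, which is exactly the assumption that becomes problematic at web scale.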

Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data is strongly correlated with relevance so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.

The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as "cats" could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.

These issues will come up throughout this book, and will be discussed in considerably more detail. We now have sufficient background to start talking about the main product of research in information retrieval, namely search engines.

1.3 Search Engines

A search engine is the practical application of information retrieval techniques to large-scale text collections. A web search engine is the obvious example, but as has been mentioned, search engines can be found in many different applications, such as desktop search or enterprise search. Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s. The term "search engine" was originally used to refer to specialized hardware for text search. From the mid-1980s onward, however, it gradually came to be used in preference to "information retrieval system" as the name for the software system that compares queries to documents and produces ranked result lists of documents. There is much more to a search engine than the ranking algorithm, of course, and we will discuss the general architecture of these systems in the next chapter.

5 Also known as an evaluation corpus (plural corpora).

6 Text REtrieval Conference: http://trec.nist.gov/

Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!,7 must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. Enterprise search engines (for example, Autonomy8) must be able to process the large variety of information sources in a company and use company-specific knowledge as part of search and related tasks, such as data mining. Data mining refers to the automatic discovery of interesting structure in data and includes techniques such as clustering. Desktop search engines, such as the Microsoft Vista™ search feature, must be able to rapidly incorporate new documents, web pages, and email as the person creates or looks at them, as well as provide an intuitive interface for searching this very heterogeneous mix of information. There is overlap between these categories with systems such as Google, for example, which is available in configurations for enterprise and desktop search.

Open source search engines are another important class of systems that have somewhat different design goals than the commercial search engines. There are a number of these systems, and the Wikipedia page for information retrieval9 provides links to many of them. Three systems of particular interest are Lucene,10 Lemur,11 and the system provided with this book, Galago.12 Lucene is a popular Java-based search engine that has been used for a wide range of commercial applications. The information retrieval techniques that it uses are relatively simple. Lemur is an open source toolkit that includes the Indri C++-based search engine. Lemur has primarily been used by information retrieval researchers to compare advanced search techniques. Galago is a Java-based search engine that is based on the Lemur and Indri projects. The assignments in this book make extensive use of Galago. It is designed to be fast, adaptable, and easy to understand, and incorporates very effective information retrieval techniques.

7 http://www.yahoo.com

The "big issues" in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this book.

Another important performance measure is how fast new data can be incorporated into the indexes. Search applications typically deal with dynamic, constantly changing information. Coverage measures how much of the existing information in, say, a corporate information environment has been indexed and stored in the search engine, and recency or freshness measures the "age" of the stored information.
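To make the idea of an index concrete, here is a minimal Python sketch of one common form, an inverted index that maps each word to the set of documents containing it. It is a simplified illustration under stated assumptions: the function names `build_index` and `search` and the three documents are hypothetical, and a production index would also store positions and counts and would be compressed.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing every query word (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection.
docs = {
    "d1": "tropical fish aquarium",
    "d2": "fish tanks",
    "d3": "tropical plants",
}
index = build_index(docs)
both = search(index, "tropical fish")
fish_only = search(index, "fish")
```

The point of the structure is that answering a query touches only the entries for the query words rather than scanning every document, which is what keeps response time low. Adding a new document means updating the entries for its words, which is where indexing speed and freshness come in.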

Search engines can be used with small collections, such as a few hundred emails and documents on a desktop, or extremely large collections, such as the entire Web. There may be only a few users of a given application, or many thousands. Scalability is clearly an important issue for search engine design. Designs that work for a given application should continue to work as the amount of data and the number of users grow. In section 1.1, we described how search engines are used in many applications and for many tasks. To do this, they have to be customizable or adaptable. This means that many different aspects of the search engine, such as the ranking algorithm, the interface, or the indexing strategy, must be able to be tuned and adapted to the requirements of the application.

Practical issues that impact search engine design also occur for specific applications. The best example of this is spam in web search. Spam is generally thought of as unwanted email, but more generally it could be defined as misleading, inappropriate, or non-relevant information in a document that is designed for some

[Figure 1.1: diagram not recoverable. Surviving labels: Efficient search and indexing; Incorporating new data; Coverage and freshness]

Fig. 1.1 Search engine design and the core information retrieval issues

Based on this discussion of the relationship between information retrieval and search engines, we now consider what roles computer scientists and others play in the design and use of search engines.

1.4 Search Engineers

Information retrieval research involves the development of mathematical models of text and language, large-scale experiments with test collections or users, and a lot of scholarly paper writing. For these reasons, it tends to be done by academics or people in research laboratories. These people are primarily trained in computer science, although information science, mathematics, and, occasionally, social science and computational linguistics are also represented. So who works with search engines? To a large extent, it is the same sort of people but with a more practical emphasis. The computing industry has started to use the term search engineer to describe this type of person. Search engineers are primarily people trained in computer science, mostly with a systems or database background. Surprisingly few of them have training in information retrieval, which is one of the major motivations for this book.

What is the role of a search engineer? Certainly the people who work in the major web search companies designing and implementing new search engines are search engineers, but the majority of search engineers are the people who modify, extend, maintain, or tune existing search engines for a wide range of commercial applications. People who design or "optimize" content for search engines are also search engineers, as are people who implement techniques to deal with spam. The search engines that search engineers work with cover the entire range mentioned in the last section: they primarily use open source and enterprise search engines for application development, but also get the most out of desktop and web search engines.

The importance and pervasiveness of search in modern computer applications has meant that search engineering has become a crucial profession in the computer industry. There are, however, very few courses being taught in computer science departments that give students an appreciation of the variety of issues that are involved, especially from the information retrieval perspective. This book is intended to give potential search engineers the understanding and tools they need.

References and Further Reading

In each chapter, we provide some pointers to papers and books that give more detail on the topics that have been covered. This additional reading should not be necessary to understand material that has been presented, but instead will give more background, more depth in some cases, and, for advanced topics, will describe techniques and research results that are not covered in this book.

The classic references on information retrieval, in our opinion, are the books by Salton (1968; 1983) and van Rijsbergen (1979). Van Rijsbergen's book remains popular, since it is available on the Web.13 All three books provide excellent descriptions of the research done in the early years of information retrieval, up to the late 1970s. Salton's early book was particularly important in terms of defining

13 http://www.dcs.gla.ac.uk/Keith/Preface.html

on information retrieval and search also appear in the European Conference on Information Retrieval (ECIR), the Conference on Information and Knowledge Management (CIKM), and the Web Search and Data Mining Conference (WSDM). The WSDM conference is a spin-off of the World Wide Web Conference (WWW), which has included some important papers on web search. The proceedings from the TREC workshops are available online and contain useful descriptions of new research techniques from many different academic and industry groups. An overview of the TREC experiments can be found in Voorhees and Harman (2005). An increasing number of search-related papers are beginning to appear in database conferences, such as VLDB and SIGMOD. Occasional papers also show up in language technology conferences, such as ACL and HLT (Association for Computational Linguistics and Human Language Technologies), machine learning conferences, and others.

1.2 Site search is another common application of search engines. In this case, search is restricted to the web pages at a given website. Compare site search to web search, vertical search, and enterprise search.

14 http://www.acm.org/dl

1.3 Use the Web to find as many examples as you can of open source search engines, information retrieval systems, or related technology. Give a brief description of each search engine and summarize the similarities and differences between them.

1.4 List five web services or sites that you use that appear to use search, not including web search engines. Describe the role of search for that service. Also describe whether the search is based on a database or grep style of matching, or if the search is using some type of ranking.

Architecture of a Search Engine

"While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant."

The Architect, Matrix Reloaded

2.1 What Is an Architecture?

In this chapter, we describe the basic software architecture of a search engine. Although there is no universal agreement on the definition, a software architecture generally consists of software components, the interfaces provided by those components, and the relationships between them. An architecture is used to describe a system at a particular level of abstraction. An example of an architecture used to provide a standard for integrating search and related language technology components is UIMA (Unstructured Information Management Architecture).1 UIMA defines interfaces for components in order to simplify the addition of new technologies into systems that handle text and other unstructured data.

Our search engine architecture is used to present high-level descriptions of the important components of the system and the relationships between them. It is not a code-level description, although some of the components do correspond to software modules in the Galago search engine and other systems. We use this architecture in this chapter and throughout the book to provide context to the discussion of specific techniques.

An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are:

• Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query.

• Efficiency (speed): We want to process queries from users as quickly as possible.

1 http://www.research.ibm.com/UIMA
