Chapter 9: Intelligent Web Search Through Adaptive Learning From Relevance Feedback
Introduction
This chapter presents the authors' approaches to intelligent Web search systems that are built on top of existing search engine design and implementation techniques. An intelligent search engine would use the search results of general-purpose search engines as its starting search space, from which it would adaptively learn from the user's feedback to boost and enhance the search performance and accuracy. It may use feature extraction, document clustering and filtering, and other methods to help the adaptive learning process. The goal is to design practical and efficient algorithms by exploring the nature of Web search. With these new algorithms, three intelligent Web search engines, WEBSAIL, YARROW, and FEATURES, are built that are able to achieve a significant increase in search precision with just four to five iterations of real-time learning from a user's relevance feedback. The characteristics of those three intelligent search engines are reported in this chapter.
Background
Recently, three general approaches have been taken to increase Web search accuracy and performance. One is the development of meta-search engines that forward user queries to multiple search engines at the same time in order to increase the coverage, in the hope of including what the user wants in a short list of top-ranked results. Examples of such meta-search engines include MetaCrawler (MC), Inference Find (IF), and Dogpile (DP). Another approach is the development of topic-specific search engines that are specialized in particular topics. These topics range from vacation guides (VG) to kids' health (KH). The third approach is to use some group or personal profiles to personalize the Web search. Examples of such efforts include GroupLens (Konstan et al., 1997) and PHOAKS (Terveen, Hill, Amento, McDonald, & Creter, 1997), among others. The first-generation meta-search engines address the problem of decreasing coverage by simultaneously querying multiple general-purpose engines. These meta-search engines suffer to a certain extent from the inherited problem of information overflow: it is difficult for users to pin down the specific information for which they are searching. Specialized search engines typically contain much more accurate and narrowly focused information. However, it is not easy for a novice user to know where and which specialized engine to use. Most personalized Web search projects reported so far involve collecting user behavior at a centralized server or a proxy server. While this is effective for the purposes of e-commerce, where vendors can collectively learn consumer behaviors, this approach does present a privacy problem: users of the search engines would have to submit their search habits to some type of server, though most likely the information collected is anonymous.
The clustering, user profiling, and other advanced techniques used by these search engines and other projects (Bollacker, Lawrence, & Giles, 1998, 1999) are static in the sense that they are built before the search begins. They cannot be changed dynamically during the real-time search process. Thus, they do not reflect the changing interests of the user at different times, in different locations, or on different subjects. The static nature of the existing search engines makes it very difficult, if not impossible, to support dynamic changes in the user's search interests. The augmented features of personalization (or customization) certainly help a search engine to increase its search performance; however, their ability is very limited. An intelligent search engine should be built on top of existing search engine design and implementation techniques. It should use the search results of the general-purpose search engines as its starting search space, from which it would adaptively learn in real-time from the user's relevance feedback to boost and enhance the search performance and the relevance accuracy. With the ability to perform real-time adaptive learning from relevance feedback, the search engine is able to learn the user's search interest changes or shifts, and thus provides the user with improved search results.
Relevance feedback is the most popular query reformulation method in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1975). It is essentially an adaptive learning process from the document examples judged by the user as relevant or irrelevant, and it requires a sequence of iterations of relevance feedback to search for the desired documents. As shown in Salton (1975), a single iteration of similarity-based relevance feedback usually produces improvements of 40 to 60 percent in search precision, evaluated at certain fixed levels of recall and averaged over a number of user queries. Some people might think that Web search users are not willing to try iterations of relevance feedback to search for their desired documents. However, the authors think otherwise. It is not a question of whether the Web search users are willing to try iterations of relevance feedback to perform their search. Rather, it is a question of whether an adaptive learning system can be built that supports a high increase in search precision with just a few iterations of relevance feedback. Web search users may have no patience to try more than a dozen iterations of relevance feedback. But if a system yields a search precision increase of 20% or so with just about four to five iterations of relevance feedback, are the users willing to use such a system? The authors believe that the answer is yes.
Intelligent Web search systems that dynamically learn the user's information needs in real-time must be built to advance the state of the art in Web search. Machine-learning techniques can be used to improve Web search, because machine-learning algorithms are able to adjust the search process dynamically so as to satisfy the user's information needs. Unfortunately, the existing machine-learning algorithms (e.g., Angluin, 1987; Littlestone, 1988), including the most popular similarity-based relevance feedback algorithm (Rocchio, 1971), suffer from the large number of iterations required to achieve the search goal. Average users are not willing to go through too many iterations of learning to find what they want.
Web Search and Adaptive Learning
Overview
There have been great research efforts on applications of machine learning to automatic extraction, clustering, and classification of information from the Web. Some earlier research includes WebWatcher (Armstrong, Freitag, Joachims, & Mitchell, 1995), which interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information; Syskill and Webert (Pazzani, Muramatsu, & Billsus, 1996), a system that uses a Bayesian classifier to learn about interesting Web pages for the user; and NewsWeeder (Lang, 1995), a news-filtering system that allows the users to rate each news article being read and learns a user profile based on those ratings. Some research is aimed at providing adaptive Web service through learning. For example, Ahoy! The Homepage Finder (Shakes, Langheinrich, & Etzioni, 1997) performs dynamic reference shifting; Adaptive Web Sites (Etzioni & Weld, 1995; Perkowitz & Etzioni, 2000) automatically improve their organization and presentation based on user access data; and the Adaptive Web Page Recommendation Service (Balabanović, 1997) recommends potentially interesting Web pages to the users. Since so much work has been done on intelligent Web search and on learning from the Web by many researchers, a comprehensive review is beyond the scope and the limited space of this chapter. Interested readers may find a good survey of the previous research on learning the Web in Kobayashi and Takeda (2000).
Dynamic Features and Dynamic Vector Space
In spite of the World Wide Web's size and the high dimensionality of Web document index features, the traditional vector space model in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) has been used for Web document representation and search. However, to implement real-time adaptive learning with limited computing resources, the traditional vector space model cannot be applied directly. Recall that back in 1998, the AltaVista (AV) system was running on 20 multi-processor machines, all of them having more than 130 gigabytes of RAM and over 500 gigabytes of disk space (Baeza-Yates & Ribeiro-Neto, 1999). A new model is needed that is efficient enough, both in time and space, for Web search implementations with limited computing resources. The new model may also be used to enhance the computing performance of a Web search system even if enough computing resources are available.
Let us now examine indexing in Web search. In the discussion, keywords are used as document index features. Let X denote the set of all index keywords for the whole Web (or, practically, a portion of the whole Web). Given any Web document d, let I(d) denote the set of all index keywords in X that are used to index d with non-zero values. Then, the following two properties hold:

• The size of I(d) is substantially smaller than the size of X. Practically, I(d) can be bounded by a constant. The rationale behind this is that in the simplest case only a few of the keywords in d are needed to index it.

• For any search process related to the search query q, let D(q) denote the collection of all the documents that match q. Then the set of index keywords relevant to q, denoted by F(q), is

    F(q) = ∪_{d ∈ D(q)} I(d).

Although the size of F(q) varies across different queries, it is still substantially smaller than the size of X, and might be bounded by a few hundred or a few thousand in practice.
Definition 1. Given any search query q, F(q), as given in the above paragraph, is defined as the set of dynamic features relevant to the search query q.

Definition 2. Given any search query q, the dynamic vector space V(q) relevant to q is defined as the vector space that is constructed with all the documents in D(q) such that each of those documents is indexed by the dynamic features in F(q).
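To make the two definitions concrete, the following Python sketch builds F(q) and the dynamic vectors of V(q) from a matched document set D(q). The whitespace tokenizer and the tiny stopword list are illustrative simplifications, not the systems' actual indexer.

```python
from collections import Counter

def index_keywords(doc_text, stopwords=frozenset({"the", "a", "of"})):
    """I(d): the (small) set of keywords indexing document d with
    non-zero values -- here simply its distinct non-stopword terms."""
    return {t for t in doc_text.lower().split() if t not in stopwords}

def dynamic_features(matched_docs):
    """F(q): the union of I(d) over all documents d in D(q)."""
    features = set()
    for d in matched_docs:
        features |= index_keywords(d)
    return sorted(features)

def dynamic_vectors(matched_docs):
    """V(q): each document in D(q) as a term-frequency vector over F(q)."""
    feats = dynamic_features(matched_docs)
    pos = {k: i for i, k in enumerate(feats)}
    vectors = []
    for d in matched_docs:
        counts = Counter(t for t in d.lower().split() if t in pos)
        vectors.append([counts.get(k, 0) for k in feats])
    return feats, vectors
```

Note how the vectors are only as long as F(q), not as long as X, which is the point of the dynamic vector space.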
The General Setting of Learning
Let S be a Web search system. For any query q, S first finds the set of documents D(q) that match the query q. It finds D(q) with the help of a general-purpose search strategy, through searching its internal database or through an external search engine such as AltaVista (AV) when no matches are found within its internal database. It then finds the set of dynamic features F(q), and later constructs the dynamic vector space V(q). Once D(q), F(q), and V(q) have been found, S starts its adaptive learning process with the help of the learning algorithm that is to be presented in the following subsections. More precisely, let F(q) = {K1, ..., Kn}, where each Ki denotes a dynamic feature (i.e., an index keyword). S maintains a common weight vector w = (w1, ..., wn) for the dynamic features in F(q). The components of w have non-negative real values. The learning algorithm uses w to extract and learn the most relevant features and to classify documents in D(q) as relevant or irrelevant.
Algorithm TW2
As the authors have investigated (Chen, Meng, & Fowler, 1999; Chen & Meng, 2000; Chen, Meng, Fowler, & Zhu, 2000), intelligent Web search can be modeled as an adaptive learning process, where the search engine acts as a learner and the user as a teacher. The user sends a query to the engine, and the engine uses the query to search the index database and returns a list of URLs that are ranked according to a ranking function. Then the user provides the engine with relevance feedback, and the engine uses the feedback to improve its next search and returns a refined list of URLs. The learning (or search) process ends when the engine finds the desired documents for the user. Conceptually, a query entered by the user can be understood as the logical expression of the collection of the documents wanted by the user. A list of URLs returned by the engine can be interpreted as an approximation of the collection of the desired documents.
Let us now consider how to use adaptive learning from equivalence queries to approach the problem of Web search. The vector space model (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) is used to represent documents. The vector space may consist of Boolean vectors. It may also consist of discretized vectors, for example, the frequency vectors of the index keywords. A target concept is a collection of documents, which is equivalent to the set of vectors of the documents in the collection. The learner is the search engine and the teacher is the user. The goal of the search engine is to find the target concept in real-time with a minimal number of mistakes (or equivalence queries).
The authors designed the algorithm TW2, a tailored version of Winnow2 (Littlestone, 1988), which is described in the following. As described in the general setting of learning, for each query q entered by the user, algorithm TW2 uses a common weight vector w and a real-valued threshold θ to classify documents in D(q). Initially, all weights in w have a value of 0. Let α > 1 be the promotion and demotion factor. Algorithm TW2 classifies documents whose vectors d satisfy w · d ≥ θ as relevant, and all others as irrelevant. If the user provides a document that contradicts the classification of TW2, then TW2 is said to have made a mistake. When the user responds with a document that may or may not contradict the current classification, TW2 updates the weights through promotion or demotion. It should be noticed that, in contrast to algorithm Winnow2, which sets all initial weights in w to 1, algorithm TW2 sets all initial weights in w to 0 and has a different promotion strategy accordingly. Another substantial difference between TW2 and Winnow2 is that TW2 accepts document examples that may not contradict its current classification to promote or demote its weight vector, while Winnow2 only accepts examples that contradict its current classification to perform promotion or demotion. The rationale behind setting all the initial weights to 0 in algorithm TW2 is to focus attention on the propagation of the influence of the relevant documents, and to use irrelevant documents to adjust the focused search space. Moreover, this approach is computationally feasible because existing effective document-ranking mechanisms can be coupled with the learning process.
In contrast to the linear lower bounds proved for Rocchio's similarity-based relevance feedback algorithm (Chen & Zhu, 2002), algorithm TW2 has surprisingly small mistake bounds for learning any collection of documents represented by a disjunction of a small number of relevant features. The mistake bounds are independent of the dimensionality of the index features. For example, one can show that to learn a collection of documents represented by a disjunction of at most k relevant features (or index keywords) over the n-dimensional Boolean vector space, TW2 makes a number of mistakes bounded by a function of k and A only, where A is the number of dynamic features that occurred in the learning process. The actual implementation of algorithm TW2 requires the help of document ranking and equivalence query simulation, which are to be addressed later.
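The classification and update steps described above can be sketched as follows. This is a hedged reading of TW2, not the authors' exact pseudocode; in particular, lifting a zero weight to 1 on the first promotion is an assumption consistent with the stated differences from Winnow2, whose own promotion simply multiplies by α.

```python
def tw2_classify(w, x, theta):
    """TW2 classification: a document with vector x is relevant
    iff w . x >= theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta

def tw2_update(w, x, relevant, alpha=2.0):
    """One TW2-style step on a user-judged document with vector x.
    alpha > 1 is the promotion/demotion factor; weights start at 0,
    so a promotion first lifts a zero weight to 1 (assumed) before
    multiplicative updates can take effect."""
    for i, xi in enumerate(x):
        if xi == 0:
            continue  # only features present in the judged document change
        if relevant:
            w[i] = alpha * w[i] if w[i] > 0 else 1.0  # promotion
        else:
            w[i] = w[i] / alpha                       # demotion
    return w
```

Unlike Winnow2, the caller may feed `tw2_update` any judged document, whether or not it contradicts the current classification.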
Feature Learning Algorithm FEX (Feature EXtraction)
Given any user query q, for any dynamic feature Ki ∈ F(q) with 1 ≤ i ≤ n, define the rank of Ki as h(Ki) = h0(Ki) + wi. Here, h0(Ki) is the initial rank for Ki. Recall that Ki is some index keyword. With the feature ranking function h and the common weight vector w, FEX extracts and learns the most relevant features as follows.
Document Ranking
Let g be a ranking function independent of TW2 and FEX. For any user query q, define the ranking function f for documents in D(q) as follows. For any Web document d ∈ D(q) with vector d = (x1, ..., xn) ∈ V(q), f(d) combines the static score g(d), the learned score w · d, and two per-document tuning parameters β(d) and γ(d). Here, g remains constant for each document d during the learning process of the learning algorithm. Various strategies can be used to define g, for example, PageRank (Brin & Page, 1998), the classical tf-idf scheme, vector spread, or citation-based rankings (Yuwono & Lee, 1996). The two additional tuning parameters are used to perform individual promotions or demotions of the documents that have been judged by the user. Initially, let β(d) = 0 and γ(d) = 1. β(d) and γ(d) can be updated in a similar fashion as the weight values wi are updated by algorithm TW2.
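The exact formula for f did not survive in this copy. Given that β(d) starts at 0 and γ(d) at 1, one plausible (hypothetical) form, which reduces to g(d) + w · d for documents the user has not yet judged, is:

```python
def doc_score(g_d, x, w, beta=0.0, gamma=1.0):
    """A hypothetical instance of the ranking function f: the static
    score g(d) plus the learned score w . d, scaled by gamma(d) and
    shifted by beta(d).  With the initial beta = 0 and gamma = 1 this
    is simply g(d) + w . d."""
    return gamma * (g_d + sum(wi * xi for wi, xi in zip(w, x))) + beta
```

Promoting a judged document then amounts to raising its beta or gamma, independently of the shared weight vector w.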
Equivalence Query Simulation
Our system will use the ranking function f defined above to rank the documents in D(q) for each user query q, and for each iteration of learning it returns the top 10 ranked documents to the user. These top 10 ranked documents represent an approximation to the classification made by the learning algorithm that has been used by the system. The quantity 10 can be replaced by, say, 25 or 50, but it should not be too large for two reasons: (1) the user may only be interested in a very small number of top-ranked documents, and (2) the display space for visualization is limited. The user can examine the short list of documents and can end the search process, or, if some documents are judged as misclassified, document relevance feedback can be provided. Sometimes, in addition to the top 10 ranked documents, the system may also provide the user with a short list of other documents below the top 10. Documents in the second short list may be selected randomly, or the bottom 10 ranked documents can be included. The motivation for the second list is to give the user some better view of the classification made by the learning algorithm.
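One iteration of the simulated equivalence query can then be sketched as below; the second, bottom-ranked list is the optional variant described above, shown here only when the result set is large enough for the two lists not to overlap.

```python
def display_lists(doc_ids, scores, k=10):
    """Return the top-k ranked documents (the learner's approximate
    classification shown to the user) and, as the optional second
    list, the bottom-k."""
    order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    top = [doc_ids[i] for i in order[:k]]
    bottom = [doc_ids[i] for i in order[-k:]] if len(order) > 2 * k else []
    return top, bottom
```

The user's judgments on either list become the examples fed back to the learner for the next iteration.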
The WEBSAIL System and the YARROW System
The WEBSAIL system is a real-time adaptive Web search learner designed and implemented to show that the learning algorithm TW2 not only works in theory but also works in practice. A detailed report of the system can be found in Chen et al. (2000c). WEBSAIL employs TW2 as its learning component and is able to help the user search for the desired documents with as little relevance feedback as possible. WEBSAIL has a graphical user interface to allow the user to enter his/her query and to specify the number of top matched document URLs to be returned. WEBSAIL maintains an internal index database of about 800,000 documents. Each of those documents is indexed with about 300 keywords. It also has a meta-search component to query AltaVista whenever needed. When the user enters a query and starts a search process, WEBSAIL first searches its internal index database. If no relevant documents can be found within its database, then it receives a list of top matched documents externally with the help of its meta-search component. WEBSAIL displays the search result to the user in the format shown in Figure 1.
Figure 1: The display format of WEBSAIL
As also shown in Figure 1, WEBSAIL provides at each iteration the top 10 and the bottom 10 ranked document URLs. Each document URL is preceded by two radio buttons for the user to judge whether the document is relevant to the search query or not. The document URLs are clickable for viewing the actual document contents, so that the user can judge more accurately whether a document is relevant. After the user clicks a few radio buttons, he/she can click the feedback button to submit the feedback to TW2. WEBSAIL has a function to parse out the feedback provided by the user when the feedback button is clicked. Having received the feedback from the user, TW2 updates its common weight vector w and also performs individual document promotions or demotions. At the end of the current iteration of learning, WEBSAIL re-ranks the documents and displays the top 10 and the bottom 10 document URLs to the user.
At each iteration, the dispatcher of WEBSAIL parses the query or relevance feedback information from the interface and decides which of the following components should be invoked to continue the search process: TW2, the Index Database Searcher, or the Meta-Searcher. When meta-search is needed, the Meta-Searcher is called to query AltaVista to receive a list of the top matched documents. The Meta-Searcher has a parser and an indexer that work in real-time to parse the received documents and to index each of them with at most 64 keywords. The received documents, once indexed, will also be cached in the index database.
The following relative Recall and relative Precision are used to measure the performance of WEBSAIL. For any query q, the relative Recall and the relative Precision are

    relative Recall = Rm / R,    relative Precision = Rm / m,

where R is the total number of relevant documents among the set of retrieved documents, and Rm is the number of relevant documents ranked among the top m positions in the final search result of the search engine. The authors selected 100 queries to calculate the average relative Recall of WEBSAIL. Each query is represented by a collection of at most five keywords. For each query, WEBSAIL is tested with the returned document number m set to 50, 100, 150, and 200, respectively. For each test, the number of iterations used and the number of documents judged by the user were recorded. The relative Recall and Precision were calculated based on manual examination of the relevance of the returned documents. The experiments reveal that WEBSAIL achieves an average of 0.95 relative Recall and an average of 0.46 relative Precision with an average of 3.72 iterations and an average of 13.46 documents judged as relevance feedback.
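The two measures can be computed directly from per-document relevance judgments; the sketch below follows the definitions of R and Rm given above (relative Recall as Rm/R, relative Precision as Rm/m).

```python
def relative_measures(relevant_flags, m):
    """relevant_flags[i] is True iff the i-th retrieved document, in
    final rank order, was judged relevant.  R counts relevant
    documents among all retrieved; R_m counts those within the top m
    positions."""
    R = sum(relevant_flags)
    Rm = sum(relevant_flags[:m])
    recall = Rm / R if R else 0.0
    precision = Rm / m if m else 0.0
    return recall, precision
```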
The YARROW system (Chen & Meng, 2000) is a multi-threaded program. Its architecture differs from that of WEBSAIL in two aspects: (1) it replaces the meta-searcher of WEBSAIL with a generic Query Constructor and a group of meta-searchers, and (2) it does not maintain its own internal index database. For each search process, it creates a thread and destroys the thread when the search process ends. Because of its light-weight size, it can be easily converted or ported to run in different environments or platforms. The predominant feature of YARROW, compared with existing meta-search engines, is the fact that it learns from the user's feedback in real-time on the client side. The learning algorithm TW2 used in YARROW has a surprisingly small mistake bound. YARROW may well be used as a plug-in component for Web browsers on the client side. A detailed report of the YARROW system is given in Chen and Meng (2000).
The FEATURES System
The FEATURES system (Chen, Meng, Fowler, & Zhu, 2001) is also a multi-threaded system, and its architecture is shown in Figure 2. The key difference between FEATURES and WEBSAIL is that FEATURES employs the two learning algorithms, FEX and TW2, to update the common weight vector w concurrently.
Figure 2: The architecture of FEATURES
For each query, FEATURES usually shows the top 10 ranked documents, plus the top 10 ranked features, for the user to judge document relevance and feature relevance. The format of presenting the top 10 ranked documents together with the top 10 ranked features is shown in Figure 3. In this format, document URLs and features are preceded by radio buttons for the user to indicate whether they are relevant or not.
Figure 3: The display format of FEATURES
If the current task is a learning process from the user's document and feature relevance feedback, the Dispatcher sends the feature relevance feedback information to the feature learner FEX and the document relevance feedback information to the document learner TW2. FEX uses the relevant and irrelevant features as judged by the user to promote and demote the related feature weights in the common weight vector w. TW2 uses the relevant and irrelevant documents judged by the user as positive and negative examples to promote and demote the weight vector. Once FEX and TW2 have finished promotions and demotions, the updated weight vector w is sent to the Query Searcher and to the Feature Ranker. The Feature Ranker re-ranks all the dynamic features, which are then sent to the Html Constructor. The Query Searcher searches the Index Database to find the matched documents, which are then sent to the Document Ranker. The Document Ranker re-ranks the matched documents and then sends them to the Html Constructor, which selects the documents and features to be displayed. Empirical results (Chen et al., 2001) show that FEATURES has substantially better search performance than AltaVista.
Timing Statistics
On December 13th and 14th of 2001, the authors conducted experiments to collect timing statistics for WEBSAIL, YARROW, and FEATURES. Thirty (30) query words were used to test each of these meta-search engines. Every time a query was sent, the wall-clock time needed for the meta-search engine to list the sorted result was recorded in the program. Also recorded was the wall-clock time to refine the search results based on the user's feedback. Since YARROW supports multiple external search engines, ALTAVISTA and NORTHERN LIGHT were selected as the external search engines when YARROW was tested. The external search engine used by WEBSAIL and FEATURES is ALTAVISTA. The following tables show the statistical results at the 95% confidence level. The original response time is t_orig, the refining time is t_refine, and C.I. denotes the confidence interval.
Table 1: Response time of WEBSAIL (in seconds)
Table 2: Response time of YARROW (in seconds)
Table 3: Response time of FEATURES (in seconds)
The statistics from the tables indicate that while the standard deviations and the confidence intervals are relatively high, they are in a reasonable range that users can accept. It takes WEBSAIL, YARROW, and FEATURES on the order of a few seconds to 20 seconds to respond initially, because they need to get the information from external search engines over the network. However, even the initial response time is not long and hence is acceptable to the user.
The Commercial Applications
Intelligent Web search can find many commercial applications. This section will concentrate on applications to e-commerce. E-commerce can be viewed as having three major components: the service and goods suppliers, the consumers, and the information intermediaries (or infomediaries). The service and goods suppliers are the producers, or the source, of the e-commerce flow. The consumers are the destination of the flow. Infomediaries, according to Grover and Teng (2001), are an essential part of e-commerce. An enormous amount of information has to be produced, analyzed, and managed in order for e-commerce to succeed. In this context, Web search is a major player among the infomediaries. Other components of the infomediaries include communities of interest (e.g., online purchase), industry magnet sites (e.g., www.amazon.com), e-retailers, and even individual corporate sites (Grover & Teng, 2001). The machine-learning approaches in Web search studied in this chapter are particularly important in the whole context of e-commerce. The key feature of the machine-learning approach to Web search is interactive learning and narrowing the search results to what the user wants. This feature can be used in many e-commerce applications. The following are a few examples.
Building a partnership: As pointed out in Tewari et al. (2001), building a partnership between the buyers and the seller is extremely important for the success of an e-business. Tewari et al. used the Multi-Attribute Resource Intermediaries (MARI) infrastructure to approximate buyer and seller preferences. They compare the degree of matching between buyers and sellers by computing a distance between the two preference vectors. When the interactive learning features explored in this chapter are used in this process, the buyers and the sellers can negotiate the deal in real-time, thus greatly enhancing the capability of the system. A collection of sellers may provide an initial list of items available at certain prices for buyers to choose from. The buyers may also have a list of expectations. According to the model proposed in Tewari et al. (2001), the possibility of a match is computed statically. If a machine-learning approach is taken, the buyers and the sellers may interactively find the best deal, similar to the situation where a face-to-face negotiation is taking place.
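As an illustration of the vector-distance matching described above (MARI's actual attribute encoding and distance measure are not specified here), buyer and seller preferences can be compared by cosine similarity:

```python
import math

def preference_match(buyer, seller):
    """Score a buyer/seller pair by the cosine similarity of their
    attribute-preference vectors: 1.0 means identical preferences,
    0.0 means no overlap."""
    dot = sum(b * s for b, s in zip(buyer, seller))
    nb = math.sqrt(sum(b * b for b in buyer))
    ns = math.sqrt(sum(s * s for s in seller))
    return dot / (nb * ns) if nb and ns else 0.0
```

In an interactive setting, each party's feedback would adjust its own vector between rounds, so the match score evolves as the negotiation proceeds.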
Brokering between buyers and sellers: Brokerage between the producers and the consumers is a critical e-commerce component. Given a large number of producers and a large number of consumers, how can one efficiently find a match between what is offered on the market and what a buyer is looking for? The work described in Meyyappan (2001) and Santos et al. (2001) provides a framework for e-commerce search brokers. A broker here compares price information, product features, the reputation of the producer, and other information for a potential buyer. While in the previous category the seller and the buyer may negotiate interactively, here the buyer interacts with the broker(s) only, very similar to the real-world situation. The interactive machine-learning and related Web search technology can be applied in this category as well. The machine-learning algorithm will use the collection of potential sellers as a starting space and interactively search for the optimal seller for the user, based on the information collected by the brokerage software. The machine-learning algorithms discussed in this chapter can be used by a buyer interacting with the broker to get the best that is available on the market. For example, a broker may act as a meta-search engine that collects information from a number of sellers, behaving very much like a general-purpose search engine. A buyer asks her broker to get certain information; the broker, which is a meta-search engine equipped with TW2 or other learning algorithms, may search, collect, collate, and rank the information returned from seller sources for the buyer. The buyer can interact with the broker, just as in the scenario of Web search. The broker will refine its list until the buyer finds a satisfactory product and seller.
Interactive catalog: The service providers or the sellers can allow consumers to browse the catalog interactively. While they browse, the learning algorithm can pick up the users' interests and supply better information to the customer, much like what adaptive Web sites (Perkowitz & Etzioni, 2000) do for their customers. Here the learning can take place in two forms. The seller can explicitly ask how the potential buyers (browsers of the catalog) feel about the usefulness of the catalog. This is analogous to interactive learning using algorithms such as TW2. Researchers have reported approaches of this type (though they did not use TW2 explicitly); see Herlocker and Konstan (2001) for an example and other similar projects. In the second approach, the provider of the catalog (the seller) would learn the users' interests and behaviors implicitly, as reported in Claypool et al. (2001). A learning algorithm such as TW2 can be embedded in the catalog software. The buyer's interests and intentions can be captured through modified browser software. The learning algorithm can then revise the catalog listings by taking the buyer's Web page clicks as feedback. This is very similar to the Web search situation.
Web commerce infrastructure: Chaudhury, Mallick, and Rao (2000) describe using the Web in e-commerce as various channels. The Web can be used as an advertising channel, an ordering channel, and a customer support channel. All these channels should be supported by an interactive system in which customer feedback can be quickly captured, analyzed, and used in updating the e-commerce system.
Future Work
In the future, the authors plan to improve the interface of their systems. Right now, the systems display the URLs of the documents; if the user wants to know the contents of a document, he/she needs to click the URL to view the content. The authors plan to display the URL of a document together with a good preview of its content. The authors also want to highlight the index keywords in the preview and allow them to be clickable for feature extraction and learning.
The authors also plan to apply clustering techniques to increase the performance of their system. It is easy to observe that in most cases documents that are relevant to a search query can be divided into a few different clusters or groups. The authors believe that document clustering techniques such as graph spectral partitioning can be used to reduce the number of iterations of the learning process and to increase the performance of the system.
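As an illustration of the clustering direction described above, the following is a minimal sketch of graph spectral bipartitioning: documents are split by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the similarity graph's Laplacian). The similarity matrix is toy data, and this is only one of several spectral variants.

```python
import numpy as np

# Sketch: split a document-similarity graph into two clusters using the
# sign of the Fiedler vector of the unnormalized graph Laplacian.
# The similarity matrix below is illustrative toy data.

def spectral_bipartition(S):
    """S: symmetric similarity matrix; returns a 0/1 cluster label per doc."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                 # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two obvious groups: docs 0-1 are similar to each other, docs 2-3 likewise.
S = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
labels = spectral_bipartition(S)       # separates {0, 1} from {2, 3}
```

Recursing on each side of the cut yields more than two clusters; in the envisioned use, each cluster could then be presented or judged as a unit, cutting down the number of feedback iterations.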
Acknowledgment
The authors thank the two anonymous referees and the editor, Dr. Nansi Shi, for their valuable comments on the draft of this chapter. The final presentation of the chapter has greatly benefited from their comments.
URL References
(AV) AltaVista: www.altavista.com
(IF) Inference Find: www.infind.com
(KH) Kidshealth.com: www.kidshealth.com
(VG) Vacations.Com: www.vacations.com
(DP) Dogpile: www.dogpile.com
(IS) Infoseek: www.infoseek.com
(MC) MetaCrawler: www.metacrawler.com
References
Angluin, D. (1987). Queries and concept learning. Machine Learning, 2, 319−342.
Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). WebWatcher: A learning apprentice for the World Wide Web. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 6−12. AAAI Press.
Balabanović, M. (1997). An adaptive Web page recommendation service. In Proceedings of the First International Conference on Autonomous Agents, 378−387. New York: ACM Press.
Baeza−Yates, R., & Ribeiro−Neto, B. (1999). Modern Information Retrieval. Reading, MA: Addison−Wesley.
Bollacker, K., Lawrence, S., & Giles, C.L. (1998). CiteSeer: An autonomous Web agent for automatic retrieval and identification of interesting publications. In Proceedings of the Second International Conference on Autonomous Agents, 116−123. New York: ACM Press.
Bollacker, K., Lawrence, S., & Giles, C.L. (1999). A system for automatic personalized tracking of scientific literature on the Web. In Proceedings of the Fourth ACM Conference on Digital Libraries, 105−113. New York: ACM Press.
Brin, S., & Page, L. (1998). The anatomy of a large−scale hypertextual Web search engine. In Proceedings of the Seventh World Wide Web Conference.
Chaudhury, A., Mallick, D.N., & Rao, H.R. (2001). Web channels in E−commerce. Communications of the ACM, 44(1), 99−103.
Chen, Z., & Meng, X. (2000). Yarrow: A real−time client site meta search learner. In Proceedings of the AAAI 2000 Workshop on Artificial Intelligence for Web Search, 12−17 (the full version will appear in Journal of Intelligent Information Systems).
Chen, Z., Meng, X., & Fowler, R.H. (1999). Searching the Web with queries. Knowledge and Information Systems, 1, 369−375.
Chen, Z., Meng, X., Fowler, R.H., & Zhu, B. (2001). FEATURES: Real time adaptive features and document learning for Web search. Journal of the American Society for Information Science, 52(8), 655−665.
Chen, Z., & Zhu, B. (2002). Some formal analysis of Rocchio's similarity−based relevance feedback algorithm. Information Retrieval, 5(1), 61−86.
Chen, Z., Meng, X., Zhu, B., & Fowler, R. (2000). WebSail: From on−line learning to Web search. In Proceedings of the 2000 International Conference on Web Information Systems Engineering (the full version will appear in Knowledge and Information Systems, 4, 219−227).
Claypool, M., Brown, D., Le, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing.
Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77−87.
Lang, K. (1995). NewsWeeder: Learning to filter news. In Proceedings of the Twelfth International Conference on Machine Learning, 331−339.
Lewis, D. (1991). Learning in intelligent information retrieval. In Proceedings of the Eighth International Workshop on Machine Learning, 235−239.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear−threshold algorithm. Machine Learning, 2, 285−318.
Meng, X., & Chen, Z. (1999). Personalized Web search using information on the client's side. In Advances in Computer Science and Technologies, 985−992. Denver, CO: International Academic Publishers.
Meyyappan, A. (2001). Proposing a new multi−routing agent architecture for E−marketplace. In Proceedings of the 2001 International Internet Computing Conference, 275−277.
Pazzani, M., Muramatsu, J., & Billsus, D. (1996). Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 54−61.
Perkowitz, M., & Etzioni, O. (2000). Adaptive Web sites: Concept and case study. Artificial Intelligence, 118.
Santos, S.C., Anglim, S., & Meira, S.R.L. (2001). A framework for Web−commerce search brokers. In Proceedings of the 2001 International Internet Computing Conference, 261−267.
Shakes, J., Langheinrich, M., & Etzioni, O. (1997). Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, 189−200.
Terveen, L., Hill, W., Amento, B., McDonald, D., & Creter, J. (1997). PHOAKS: A system for sharing recommendations. Communications of the ACM, 40(3), 50−62.
Tewari, G., Berkovich, A., Gabovich, V., Liang, S., Ramakrishnan, A., & Maes, P. (2001). Sustaining individual incentives while maximizing aggregate social welfare: A mediated brokering technique for trading agents in next−generation electronic markets. In Proceedings of the 2001 International Internet Computing Conference, 247−253.
Yuwono, B., & Lee, D. (1996). Search and ranking algorithms for locating resources on the World Wide Web. In Proceedings of the International Conference on Data Engineering, 164−171.
Chapter 10: World Wide Web Search Engines
Wen−Chen Hu
University of North Dakota
Jyh−Haw Yeh
Boise State University
Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Abstract
The World Wide Web now holds more than 800 million pages covering almost all issues. The Web's fast-growing size and lack of structural style present a new challenge for information retrieval. Numerous search technologies have been applied to Web search engines; however, the dominant search method has yet to be identified. This chapter provides an overview of the existing technologies for Web search engines and classifies them into six categories: 1) hyperlink exploration, 2) information retrieval, 3) metasearches, 4) SQL approaches, 5) content−based multimedia searches, and 6) others. At the end of this chapter, a comparative study of major commercial and experimental search engines is presented, and some future research directions for Web search engines are suggested.
Introduction
One of the most common tasks performed on the Web is to search Web pages, which is also one of the most frustrating and problematic. The situation is getting worse because of the Web's fast-growing size and lack of structural style, as well as the inadequacy of existing Web search engine technologies (Lawrence & Giles, 1999a). Traditional search techniques are based on users typing in search keywords, which the search services can then use to locate the desired Web pages. However, this approach normally retrieves too many documents, of which only a small fraction are relevant to the user's needs. Furthermore, the most relevant documents do not necessarily appear at the top of the query output list. A number of corporations and research organizations are taking a variety of approaches to try to solve these problems. These approaches are diverse, and none of them dominates the field. This chapter provides a survey and classification of the available World Wide Web search engine techniques, with an emphasis on nontraditional approaches. Related Web search technology reviews can also be found in (Gudivada, Raghavan, Grosky, & Kasanagottu, 1997; Lawrence & Giles, 1998b; Lawrence & Giles, 1999b; Lu & Feng, 1998).
Requirements of Web Search Engines
It is first necessary to examine what kind of features a Web search engine is expected to have in order to conduct effective and efficient Web searches, and what kind of challenges may be faced in the process of developing new Web search techniques. The requirements for a Web search engine are listed below, in order:
- up−to−date Web information;
Web Search Engine Technologies
Numerous Web search engine technologies have been proposed, and each technology employs a very different approach. This survey classifies the technologies into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content−based multimedia searches, and vi) others. The chapter is organized as follows: Section 2 introduces the general structure of a search engine, and Sections 3 to 8 introduce each of the six Web search engine technologies in turn. A comparative study of major commercial and experimental search engines is shown in Section 9, and the final section gives a summary and suggests future research directions.
Search Engine Structure
Two different approaches are applied to Web search services: genuine search engines and directories. The difference lies in how listings are compiled:
- Search engines, such as Google, create their listings automatically.
- Directories, in contrast, depend on human editors to compile their listings.
Figure 1: System structure of a Web search engine
Crawler

A crawler is a program that automatically scans various Web sites and collects Web documents from them. Crawlers follow the links on a site to find other relevant pages. Two search algorithms, breadth-first search and depth-first search, are widely used by crawlers to traverse the Web. The crawler views the Web as a graph, with the nodes being the objects located at Uniform Resource Locators (URLs). The objects could be reached via Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), mailto (e−mail), news, telnet, etc. Crawlers also return to sites periodically to look for changes. To speed up the collection of Web documents, several crawlers are usually sent out to traverse the Web at the same time. Three simple tools can be used to implement an experimental crawler:
- lynx: Lynx is a text browser for Unix systems. For example, the command lynx -source http://www.w3c.org/ downloads the Web page source code at http://www.w3c.org/.
- java.net: The java.net package of the Java language provides plenty of networking utilities. Two classes in the package, java.net.URL and java.net.URLConnection, can be used to download Web pages.
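A breadth-first crawl of the link graph described above can be sketched as follows. The fetch function is injected so the example runs on an in-memory toy "web"; a production crawler would actually download pages (e.g., with the tools mentioned above), parse links robustly, and respect politeness rules. All names here are illustrative.

```python
from collections import deque
import re

# Sketch of a breadth-first crawler: the Web is viewed as a graph of URLs,
# and pages are visited level by level. fetch() is injectable so the example
# runs on a toy in-memory "web" instead of the live network.

LINK_RE = re.compile(r'href="([^"]+)"')

def bfs_crawl(start_url, fetch, max_pages=100):
    """Return the list of URLs in the order a breadth-first crawl visits them."""
    visited, order = {start_url}, []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue                      # skip unreachable pages
        for link in LINK_RE.findall(html):
            if link not in visited:       # never enqueue a URL twice
                visited.add(link)
                queue.append(link)
    return order

# Toy web: a -> b, c ; b -> d. A depth-first crawl would visit d before c.
pages = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="d">D</a>',
    "c": "", "d": "",
}
order = bfs_crawl("a", fetch=lambda u: pages[u])   # ['a', 'b', 'c', 'd']
```

Swapping the deque's popleft for a pop from the right end turns the same code into the depth-first variant, which is why the two traversals are usually discussed together.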
Indexing Software

An indexer can use the following kinds of information, provided by HTML scripts, to locate the desired Web pages:
- Content: Page content provides the most accurate, full-text information. However, it is also the least-used type of information, since content extraction is still far less practical.
Search and Ranking Software
Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. A user enters a keyword or keywords, along with Boolean modifiers such as and, or, or not, into a search engine, which then scans indexed Web pages for the keywords. To determine in which order to display pages to the user, the engine uses an algorithm to rank pages that contain the keywords (Zhang & Dong, 2000). For example, the engine may count the number of times the keyword appears on a page. To save time and space, the engine may only look for keywords in metatags, which are HTML tags that provide information about a Web page. Unlike most HTML tags, metatags do not affect a document's appearance. Instead, they include such information as a Web page's contents and some relevant keywords. The following six sections give various methods of indexing, searching, and ranking the Web pages.
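The counting-and-ranking step just described can be sketched as follows; the pages, queries, and scoring choices are illustrative assumptions, not any particular engine's algorithm.

```python
# Sketch of the ranking step described above: score each indexed page by how
# many times the query keywords occur in it, then sort descending. Boolean
# "and"/"or" modifiers decide whether all or any keywords must appear.
# The page collection is toy data.

def rank_pages(pages, keywords, mode="and"):
    """pages: dict url -> text. Returns URLs sorted by total keyword frequency."""
    results = []
    for url, text in pages.items():
        words = text.lower().split()
        counts = [words.count(k.lower()) for k in keywords]
        matches = all if mode == "and" else any
        if matches(c > 0 for c in counts):       # apply the Boolean modifier
            results.append((sum(counts), url))
    return [url for score, url in sorted(results, reverse=True)]

pages = {
    "p1": "web search engines index the web",
    "p2": "search the web with web web crawlers",
    "p3": "databases and sql",
}
ranking = rank_pages(pages, ["web", "search"])   # ['p2', 'p1']
```

Frequency counting alone is easy to spam, which is exactly the weakness the link-based methods of the next section were designed to address.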
Hyperlink Exploration
Hypermedia documents contain cross references to other related documents by using hyperlinks, which allow the user to move easily from one to the other. Links can be tremendously important sources of information for indexers; the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page to which it points. This approach is based on identifying two important types of Web pages for a given topic:
- Authorities, which provide the best source of information on the topic; and
- Hubs, which provide collections of links to authorities (Bharat & Henzinger, 1999).
Analyzing the interconnections of a series of related pages can identify the authorities and hubs for a particular topic. A simple method to update a non−negative authority weight x_p and a non−negative hub weight y_p is given by Chakrabarti et al. (1999). If a page is pointed to by many good hubs, its authority weight is updated by using the following formula:

x_p = Σ_{q→p} y_q

where the notation q→p indicates that q links to p. Similarly, if a page points to many good authorities, its hub weight is updated via

y_p = Σ_{p→q} x_q

Unfortunately, applying the above formulas to the entire Web to find authorities and hubs is impracticable. Ideally, the formulas are applied to a small collection S_σ of pages that contain plenty of relevant documents. The concepts of a root set and a base set have been proposed by Kleinberg (1999) to find S_σ. The root set is usually constructed by collecting the t highest−ranked pages for the query σ from a search engine such as Google or Yahoo!. However, the root set may not contain most of the strongest authorities. A base set is therefore built by including any page pointed to by a page in the root set and any page that points to a page in the root set. Figure 2 shows an example of a root set and a base set. The above formulas can then be applied to a much smaller set, the base set, instead of the entire Web.
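The two update formulas can be iterated over the base set until the weights stabilize. Below is a hedged sketch on a toy link graph; the per-iteration normalization that keeps the weights bounded is a common practice but an assumption here, as is the graph itself.

```python
# Sketch of iterating the authority/hub update formulas on a small base set.
# links maps each page to the pages it points to; the graph is toy data.

def hits(links, iterations=20):
    pages = sorted(set(links) | {p for ts in links.values() for p in ts})
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(iterations):
        # authority update: x_p = sum of y_q over pages q linking to p
        x = {p: sum(y[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub update: y_p = sum of x_q over pages q that p links to
        y = {p: sum(x[q] for q in links.get(p, ())) for p in pages}
        # normalize so the weights stay bounded across iterations
        nx, ny = sum(x.values()) or 1.0, sum(y.values()) or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y

# Three hubs all point to "auth", so "auth" should get the top authority
# weight, and "h1" (which points to the most authorities) the top hub weight.
links = {"h1": ["auth", "other"], "h2": ["auth"], "h3": ["auth"]}
auth, hub = hits(links)
```

The mutual reinforcement is visible in the two comprehensions: good hubs boost the authorities they point to, and good authorities boost the hubs pointing at them.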
In addition to the methods used to find authorities and hubs, a number of search methods based on connectivity have been proposed. A comparative study of various hypertext link analysis algorithms is given in (Borodin et al., 2001). The most widely used method is the PageRank model (Brin & Page, 1998), which suggests that the reputation of a page on a topic is proportional to the sum of the reputation weights of pages pointing to it on the same topic. That is, links emanating from pages with high reputations are weighted more heavily. The concepts of authorities and hubs, together with the PageRank model, can also be used to compute the reputation rank of a page; those topics for which the page has a good reputation are then identified (Rafiei & Mendelzon, 2000). Some other ad hoc methods include a Hyperlink Vector Voting (HVV) method (Li, 1998) and a system known as WebQuery (Carriere & Kazman, 1997). The former method uses the content of hyperlinks to a document to rank its relevance to the query terms, while the latter system studies the structural relationships among the nodes returned in a content−based query and gives the highest ranking to the most highly connected nodes. An improved algorithm obtained by augmenting connectivity analysis with content analysis is introduced in Bharat and Henzinger (1998).
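A minimal power-iteration sketch of a PageRank-style computation follows; the damping factor of 0.85 and the toy link graph are illustrative assumptions, not parameters taken from the original system.

```python
# Sketch of the PageRank-style model: a page's weight is (mostly) the sum of
# the weights of the pages linking to it, each shared over its out-links.
# Damping factor and graph are illustrative choices.

def pagerank(links, damping=0.85, iterations=50):
    pages = sorted(set(links) | {p for ts in links.values() for p in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # each in-neighbor q passes on rank[q] / (number of q's out-links)
            inbound = sum(rank[q] / len(links[q])
                          for q in pages if p in links.get(q, ()))
            new[p] = (1 - damping) / n + damping * inbound
        rank = new
    return rank

# "hub" is pointed to by every other page, so it should end up ranked highest.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"]}
rank = pagerank(links)
```

The (1 − damping)/n term models a surfer who occasionally jumps to a random page; without it, rank would drain into pages with no out-links.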
Figure 2: Expanding the root set into a base set
Information Retrieval (IR)
IR techniques are widely used in Web document searches (Gudivada et al., 1997). Among them, relevance feedback and data clustering are two of the most popular techniques used by search engines. The former method has not so far been applied to any commercial products because it requires some interaction with users, who normally prefer to use a keyword−only interface. The latter method has achieved more success since it does not require any interaction with users to achieve acceptable results.
For the above example, the search results after modification should not include Result #4.
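The relevance feedback technique discussed above is classically realized by Rocchio's query update (analyzed in Chen & Zhu, 2002, in the references): the query vector moves toward the centroid of documents the user marked relevant and away from the non-relevant centroid. The sketch below uses conventional but illustrative alpha/beta/gamma values and a toy term space.

```python
import numpy as np

# Sketch of a Rocchio-style relevance feedback update: shift the query
# vector toward relevant documents and away from non-relevant ones.
# alpha/beta/gamma and the term space are illustrative assumptions.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(query, dtype=float)
    if relevant:
        q += beta * np.mean(relevant, axis=0)     # pull toward relevant centroid
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0) # push away from non-relevant
    return np.clip(q, 0.0, None)                  # drop negative term weights

# Toy term space: [web, search, sql]. Feedback pulls the query toward the
# "web search" documents and away from the SQL one.
q0 = [1.0, 1.0, 0.0]
q1 = rocchio(q0,
             relevant=[[1.0, 0.5, 0.0], [0.8, 0.9, 0.0]],
             nonrelevant=[[0.0, 0.1, 1.0]])
```

Each feedback round repeats the update with the newly judged documents, which is the iterative loop the intelligent search engines described in this chapter exploit.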
Data Clustering
Data clustering is used to improve the search results by dividing the whole data set into data clusters. Each data cluster contains objects of high similarity, and clusters are produced that group documents relevant to the user's query separately from irrelevant ones. For example, the formula below gives a similarity measure:

similarity(D_i, D_j) = Σ_k (weight_ik × weight_jk) / (sqrt(Σ_k weight_ik²) × sqrt(Σ_k weight_jk²))

where weight_ik is the weight assigned to term_k in a document D_i (Baeza−Yates, 1992). Clustering should not be based on the whole Web resource, but on smaller separate query results. In Zamir and Etzioni (1998), a Suffix Tree Clustering (STC) algorithm based on phrases shared between documents is used to create clusters. Besides clustering the search results, a proposed similarity function has been used to cluster similar queries according to their contents as well as user logs (Wen, Nie, & Zhang, 2001). The resulting clusters can provide useful information for Frequently Asked Queries (FAQ) identification. Another Web document clustering algorithm is suggested in Chang and Hsu (1999).
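A cosine form of such a similarity measure, combined with a naive single-pass grouping of query results, can be sketched as follows; the threshold, vectors, and greedy grouping rule are toy choices, and real systems use more refined algorithms such as STC.

```python
import math

# Sketch: cosine similarity between term-weight vectors, then a greedy
# single-pass grouping of documents by similarity to each cluster's first
# member. Weights and threshold are illustrative toy data.

def similarity(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    norm = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.8):
    """Assign each doc to the first cluster whose leader is similar enough,
    else start a new cluster."""
    clusters = []
    for d in docs:
        for c in clusters:
            if similarity(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # two near-duplicates + one outlier
groups = cluster(docs)                         # two clusters
```

Because the grouping runs only over a single query's results rather than the whole Web, even this quadratic-looking pass stays cheap in practice.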
of a metasearch engine, which consists of three major components:
Dispatch: Determines to which search engines a specific query is sent. The selection is usually based on network and local computational resources, as well as the long−term performance of search engines on specific query terms.