Chapter 9: Intelligent Web Search Through Adaptive Learning From Relevance Feedback
Introduction
This chapter presents the authors' approaches to intelligent Web search systems that are built on top of existing search engine design and implementation techniques. An intelligent search engine would use the search results of general-purpose search engines as its starting search space, from which it would adaptively learn from the user's feedback to boost and enhance the search performance and accuracy. It may use feature extraction, document clustering and filtering, and other methods to help the adaptive learning process. The goal is to design practical and efficient algorithms by exploring the nature of Web search. With these new algorithms, three intelligent Web search engines, WEBSAIL, YARROW, and FEATURES, are built that are able to achieve a significant increase in search precision with just four to five iterations of real-time learning from a user's relevance feedback. The characteristics of those three intelligent search engines are reported in this chapter.
Background
Recently, three general approaches have been taken to increase Web search accuracy and performance. One is the development of meta-search engines that forward user queries to multiple search engines at the same time in order to increase the coverage, in the hope of including what the user wants in a short list of top-ranked results. Examples of such meta-search engines include MetaCrawler (MC), Inference Find (IF), and Dogpile (DP). Another approach is the development of topic-specific search engines that are specialized in particular topics. These topics range from vacation guides (VG) to kids' health (KH). The third approach is to use some group or personal profiles to personalize the Web search. Examples of such efforts include GroupLens (Konstan et al., 1997) and PHOAKS (Terveen, Hill, Amento, McDonald, & Creter, 1997), among others. The first-generation meta-search engines address the problem of decreasing coverage by simultaneously querying multiple general-purpose engines. These meta-search engines suffer to a certain extent from the inherited problem of information overflow: it is difficult for users to pin down the specific information for which they are searching. Specialized search engines typically contain much more accurate and narrowly focused information. However, it is not easy for a novice user to know where and which specialized engine to use. Most personalized Web search projects reported so far involve collecting user behavior at a centralized server or a proxy server. While this is effective for the purposes of e-commerce, where vendors can collectively learn consumer behaviors, this approach does present a privacy problem: users of the search engines would have to submit their search habits to some type of server, though most likely the information collected is anonymous.
The clustering, user profiling, and other advanced techniques used by these search engines and other projects (Bollacker, Lawrence, & Giles, 1998, 1999) are static in the sense that they are built before the search begins. They cannot be changed dynamically during the real-time search process. Thus, they do not reflect the changing interests of the user at different times, in different locations, or on different subjects. The static nature of the existing search engines makes it very difficult, if not impossible, to support dynamic changes in the user's search interests. The augmented features of personalization (or customization) certainly help a search engine to increase its search performance; however, their ability is very limited. An intelligent search engine should be built on top of existing search engine design and implementation techniques. It should use the search results of the general-purpose search engines as its starting search space, from which it would adaptively learn in real-time from the user's relevance feedback to boost and enhance the search performance and the relevance accuracy. With the ability to perform real-time adaptive learning from relevance feedback, the search engine is able to learn the user's search interest changes or shifts, and thus provides the user with improved search results.
Relevance feedback is the most popular query reformulation method in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1975). It is essentially an adaptive learning process from the document examples judged by the user as relevant or irrelevant, and it requires a sequence of iterations of relevance feedback to search for the desired documents. As shown in Salton (1975), a single iteration of similarity-based relevance feedback usually produces improvements of 40 to 60 percent in search precision, evaluated at certain fixed levels of recall and averaged over a number of user queries. Some people might think that Web search users are not willing to try iterations of relevance feedback to search for their desired documents. However, the authors think otherwise. It is not a question of whether the Web search users are willing to try iterations of relevance feedback to perform their search. Rather, it is a question of whether an adaptive learning system can be built that supports a high increase in search precision with just a few iterations of relevance feedback. Web search users may have no patience to try more than a dozen iterations of relevance feedback. But if a system yields a search precision increase of 20% or so with just about four to five iterations of relevance feedback, are the users willing to use such a system? The authors believe that the answer is yes.
Intelligent Web search systems that dynamically learn the user's information needs in real-time must be built to advance the state of the art in Web search. Machine-learning techniques can be used to improve Web search, because machine-learning algorithms are able to adjust the search process dynamically so as to satisfy the user's information needs. Unfortunately, the existing machine-learning algorithms (e.g., Angluin, 1987; Littlestone, 1988), including the most popular similarity-based relevance feedback algorithm (Rocchio, 1971), suffer from the large number of iterations required to achieve the search goal. Average users are not willing to go through too many iterations of learning to find what they want.
Web Search and Adaptive Learning
Overview
There have been great research efforts on applications of machine learning to automatic extraction, clustering, and classification of information from the Web. Some earlier research includes WebWatcher (Armstrong, Freitag, Joachims, & Mitchell, 1995), which interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information; Syskill and Webert (Pazzani, Muramatsu, & Billsus, 1996), a system that uses a Bayesian classifier to learn about interesting Web pages for the user; and NewsWeeder (Lang, 1995), a news-filtering system that allows the users to rate each news article being read and learns a user profile based on those ratings. Some research is aimed at providing adaptive Web service through learning. For example, Ahoy! The Homepage Finder (Shakes, Langheinrich, & Etzioni, 1997) performs dynamic reference shifting; Adaptive Web Sites (Etzioni & Weld, 1995; Perkowitz & Etzioni, 2000) automatically improve their organization and presentation based on user access data; and the Adaptive Web Page Recommendation Service (Balabanović, 1997) recommends potentially interesting Web pages to the users. Since so much work has been done on intelligent Web search and on learning from the Web by many researchers, a comprehensive review is beyond the scope and the limited space of this chapter. Interested readers may find a good survey of the previous research on learning the Web in Kobayashi and Takeda (2000).
Dynamic Features and Dynamic Vector Space
In spite of the World Wide Web's size and the high dimensionality of Web document index features, the traditional vector space model in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) has been used for Web document representation and search. However, to implement real-time adaptive learning with limited computing resources, the traditional vector space model cannot be applied directly. Recall that back in 1998, the AltaVista (AV) system was running on 20 multi-processor machines, all of them having more than 130 gigabytes of RAM and over 500 gigabytes of disk space (Baeza-Yates & Ribeiro-Neto, 1999). A new model is needed that is efficient enough, both in time and space, for Web search implementations with limited computing resources. The new model may also be used to enhance the computing performance of a Web search system even if enough computing resources are available.
Let us now examine indexing in Web search. In the discussion, keywords are used as document index features. Let X denote the set of all index keywords for the whole Web (or, practically, a portion of the whole Web). Given any Web document d, let I(d) denote the set of all index keywords in X that are used to index d with non-zero values. Then, the following two properties hold:

• The size of I(d) is substantially smaller than the size of X. Practically, I(d) can be bounded by a constant. The rationale behind this is that in the simplest case only a few of the keywords in d are needed to index it.

• For any search process related to the search query q, let D(q) denote the collection of all the documents that match q. Then the set of index keywords relevant to q, denoted by F(q), is

    F(q) = ∪_{d ∈ D(q)} I(d).

Although the size of F(q) varies across different queries, it is still substantially smaller than the size of X, and might be bounded by a few hundred or a few thousand in practice.
Definition 1. Given any search query q, F(q), as given in the above paragraph, is defined as the set of dynamic features relevant to the search query q.

Definition 2. Given any search query q, the dynamic vector space V(q) relevant to q is defined as the vector space that is constructed with all the documents in D(q) such that each of those documents is indexed by the dynamic features in F(q).
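To make the two definitions concrete, the following Python sketch builds F(q) and the dynamic vectors of V(q) from a matched document set D(q). The whitespace tokenizer and the tiny stopword list are illustrative simplifications, not the systems' actual indexer.

```python
from collections import Counter

def index_keywords(doc_text, stopwords=frozenset({"the", "a", "of"})):
    """I(d): the (small) set of keywords indexing document d with
    non-zero values -- here simply its distinct non-stopword terms."""
    return {t for t in doc_text.lower().split() if t not in stopwords}

def dynamic_features(matched_docs):
    """F(q): the union of I(d) over all documents d in D(q)."""
    features = set()
    for d in matched_docs:
        features |= index_keywords(d)
    return sorted(features)

def dynamic_vectors(matched_docs):
    """V(q): each document in D(q) as a term-frequency vector over F(q)."""
    feats = dynamic_features(matched_docs)
    pos = {k: i for i, k in enumerate(feats)}
    vectors = []
    for d in matched_docs:
        counts = Counter(t for t in d.lower().split() if t in pos)
        vectors.append([counts.get(k, 0) for k in feats])
    return feats, vectors
```

Note how the vectors are only as long as F(q), not as long as X, which is the point of the dynamic vector space.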
The General Setting of Learning
Let S be a Web search system. For any query q, S first finds the set of documents D(q) that match the query q. It finds D(q) with the help of a general-purpose search strategy, through searching its internal database or through an external search engine such as AltaVista (AV) when no matches are found within its internal database. It then finds the set of dynamic features F(q), and later constructs the dynamic vector space V(q). Once D(q), F(q), and V(q) have been found, S starts its adaptive learning process with the help of the learning algorithm that is to be presented in the following subsections. More precisely, let F(q) = {K1, ..., Kn}, where each Ki denotes a dynamic feature (i.e., an index keyword). S maintains a common weight vector w = (w1, ..., wn) for the dynamic features in F(q). The components of w have non-negative real values. The learning algorithm uses w to extract and learn the most relevant features and to classify documents in D(q) as relevant or irrelevant.
Algorithm TW2
As the authors have investigated (Chen, Meng, & Fowler, 1999; Chen & Meng, 2000; Chen, Meng, Fowler, & Zhu, 2000), intelligent Web search can be modeled as an adaptive learning process, where the search engine acts as a learner and the user as a teacher. The user sends a query to the engine, and the engine uses the query to search the index database and returns a list of URLs that are ranked according to a ranking function. Then the user provides the engine with relevance feedback, and the engine uses the feedback to improve its next search and returns a refined list of URLs. The learning (or search) process ends when the engine finds the desired documents for the user. Conceptually, a query entered by the user can be understood as the logical expression of the collection of the documents wanted by the user. A list of URLs returned by the engine can be interpreted as an approximation of the collection of the desired documents.
Let us now consider how to use adaptive learning from equivalence queries to approach the problem of Web search. The vector space model (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) is used to represent documents. The vector space may consist of Boolean vectors. It may also consist of discretized vectors, for example, the frequency vectors of the index keywords. A target concept is a collection of documents, which is equivalent to the set of vectors of the documents in the collection. The learner is the search engine and the teacher is the user. The goal of the search engine is to find the target concept in real-time with a minimal number of mistakes (or equivalence queries).
The authors designed the algorithm TW2, a tailored version of Winnow2 (Littlestone, 1988), which is described in the following. As described in the general setting of learning, for each query q entered by the user, algorithm TW2 uses a common weight vector w and a real-valued threshold θ to classify documents in D(q). Initially, all weights in w have a value of 0. Let α > 1 be the promotion and demotion factor. Algorithm TW2 classifies documents whose vectors d satisfy w · d ≥ θ as relevant, and all others as irrelevant. If the user provides a document that contradicts the classification of TW2, then TW2 is said to have made a mistake. When the user responds with a document that may or may not contradict the current classification, TW2 updates the weights through promotion or demotion. It should be noticed that, in contrast to algorithm Winnow2, which sets all initial weights in w to 1, algorithm TW2 sets all initial weights in w to 0 and has a different promotion strategy accordingly. Another substantial difference between TW2 and Winnow2 is that TW2 accepts document examples that may not contradict its current classification to promote or demote its weight vector, while Winnow2 only accepts examples that contradict its current classification to perform promotion or demotion. The rationale behind setting all the initial weights to 0 in algorithm TW2 is to focus attention on the propagation of the influence of the relevant documents, and to use irrelevant documents to adjust the focused search space. Moreover, this approach is computationally feasible because existing effective document-ranking mechanisms can be coupled with the learning process.
In contrast to the linear lower bounds proved for Rocchio's similarity-based relevance feedback algorithm (Chen & Zhu, 2002), algorithm TW2 has surprisingly small mistake bounds for learning any collection of documents represented by a disjunction of a small number of relevant features. The mistake bounds are independent of the dimensionality of the index features. For example, one can show that to learn a collection of documents represented by a disjunction of at most k relevant features (or index keywords) over the n-dimensional Boolean vector space, TW2 makes a number of mistakes bounded by a function of k and A only, where A is the number of dynamic features that occurred in the learning process. The actual implementation of algorithm TW2 requires the help of document ranking and equivalence query simulation, which are to be addressed later.
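The classification and update steps described above can be sketched as follows. This is a hedged reading of TW2, not the authors' exact pseudocode; in particular, lifting a zero weight to 1 on the first promotion is an assumption consistent with the stated differences from Winnow2, whose own promotion simply multiplies by α.

```python
def tw2_classify(w, x, theta):
    """TW2 classification: a document with vector x is relevant
    iff w . x >= theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta

def tw2_update(w, x, relevant, alpha=2.0):
    """One TW2-style step on a user-judged document with vector x.
    alpha > 1 is the promotion/demotion factor; weights start at 0,
    so a promotion first lifts a zero weight to 1 (assumed) before
    multiplicative updates can take effect."""
    for i, xi in enumerate(x):
        if xi == 0:
            continue  # only features present in the judged document change
        if relevant:
            w[i] = alpha * w[i] if w[i] > 0 else 1.0  # promotion
        else:
            w[i] = w[i] / alpha                       # demotion
    return w
```

Unlike Winnow2, the caller may feed `tw2_update` any judged document, whether or not it contradicts the current classification.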
Feature Learning Algorithm FEX (Feature EXtraction)
Given any user query q, for any dynamic feature Ki ∈ F(q) with 1 ≤ i ≤ n, define the rank of Ki as h(Ki) = h0(Ki) + wi. Here, h0(Ki) is the initial rank for Ki. Recall that Ki is some index keyword. With the feature ranking function h and the common weight vector w, FEX extracts and learns the most relevant features as follows.
Document Ranking
Let g be a ranking function independent of TW2 and FEX. For any user query q, define the ranking function f for documents in D(q) as follows. For any Web document d ∈ D(q) with vector d = (x1, ..., xn) ∈ V(q), f(d) combines the static score g(d), the learned score w · d, and two per-document tuning parameters β(d) and γ(d). Here, g remains constant for each document d during the learning process of the learning algorithm. Various strategies can be used to define g, for example, PageRank (Brin & Page, 1998), the classical tf-idf scheme, vector spread, or citation-based rankings (Yuwono & Lee, 1996). The two additional tuning parameters are used to perform individual promotions or demotions of the documents that have been judged by the user. Initially, let β(d) = 0 and γ(d) = 1. β(d) and γ(d) can be updated in a similar fashion as the weight values wi are updated by algorithm TW2.
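The exact formula for f did not survive in this copy. Given that β(d) starts at 0 and γ(d) at 1, one plausible (hypothetical) form, which reduces to g(d) + w · d for documents the user has not yet judged, is:

```python
def doc_score(g_d, x, w, beta=0.0, gamma=1.0):
    """A hypothetical instance of the ranking function f: the static
    score g(d) plus the learned score w . d, scaled by gamma(d) and
    shifted by beta(d).  With the initial beta = 0 and gamma = 1 this
    is simply g(d) + w . d."""
    return gamma * (g_d + sum(wi * xi for wi, xi in zip(w, x))) + beta
```

Promoting a judged document then amounts to raising its beta or gamma, independently of the shared weight vector w.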
Equivalence Query Simulation
Our system will use the ranking function f defined above to rank the documents in D(q) for each user query q, and for each iteration of learning it returns the top 10 ranked documents to the user. These top 10 ranked documents represent an approximation to the classification made by the learning algorithm that has been used by the system. The quantity 10 can be replaced by, say, 25 or 50, but it should not be too large for two reasons: (1) the user may only be interested in a very small number of top-ranked documents, and (2) the display space for visualization is limited. The user can examine the short list of documents and can end the search process, or, if some documents are judged as misclassified, document relevance feedback can be provided. Sometimes, in addition to the top 10 ranked documents, the system may also provide the user with a short list of other documents below the top 10. Documents in the second short list may be selected randomly, or the bottom 10 ranked documents can be included. The motivation for the second list is to give the user some better view of the classification made by the learning algorithm.
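One iteration of the simulated equivalence query can then be sketched as below; the second, bottom-ranked list is the optional variant described above, shown here only when the result set is large enough for the two lists not to overlap.

```python
def display_lists(doc_ids, scores, k=10):
    """Return the top-k ranked documents (the learner's approximate
    classification shown to the user) and, as the optional second
    list, the bottom-k."""
    order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    top = [doc_ids[i] for i in order[:k]]
    bottom = [doc_ids[i] for i in order[-k:]] if len(order) > 2 * k else []
    return top, bottom
```

The user's judgments on either list become the examples fed back to the learner for the next iteration.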
The WEBSAIL System and the YARROW System
The WEBSAIL system is a real-time adaptive Web search learner designed and implemented to show that the learning algorithm TW2 not only works in theory but also works in practice. A detailed report of the system can be found in Chen et al. (2000c). WEBSAIL employs TW2 as its learning component and is able to help the user search for the desired documents with as little relevance feedback as possible. WEBSAIL has a graphical user interface to allow the user to enter his/her query and to specify the number of top matched document URLs to be returned. WEBSAIL maintains an internal index database of about 800,000 documents. Each of those documents is indexed with about 300 keywords. It also has a meta-search component to query AltaVista whenever needed. When the user enters a query and starts a search process, WEBSAIL first searches its internal index database. If no relevant documents can be found within its database, then it receives a list of top matched documents externally with the help of its meta-search component. WEBSAIL displays the search result to the user in the format shown in Figure 1.
Figure 1: The display format of WEBSAIL
As also shown in Figure 1, WEBSAIL provides at each iteration the top 10 and the bottom 10 ranked document URLs. Each document URL is preceded by two radio buttons for the user to judge whether the document is relevant to the search query or not. The document URLs are clickable for viewing the actual document contents, so that the user can judge more accurately whether a document is relevant. After the user clicks a few radio buttons, he/she can click the feedback button to submit the feedback to TW2. WEBSAIL has a function to parse out the feedback provided by the user when the feedback button is clicked. Having received the feedback from the user, TW2 updates its common weight vector w and also performs individual document promotions or demotions. At the end of the current iteration of learning, WEBSAIL re-ranks the documents and displays the top 10 and the bottom 10 document URLs to the user.
At each iteration, the dispatcher of WEBSAIL parses the query or relevance feedback information from the interface and decides which of the following components should be invoked to continue the search process: TW2, the Index Database Searcher, or the Meta-Searcher. When meta-search is needed, the Meta-Searcher is called to query AltaVista to receive a list of the top matched documents. The Meta-Searcher has a parser and an indexer that work in real-time to parse the received documents and to index each of them with at most 64 keywords. The received documents, once indexed, will also be cached in the index database.
The following relative Recall and relative Precision are used to measure the performance of WEBSAIL. For any query q, the relative Recall and the relative Precision are

    relative Recall = Rm / R,    relative Precision = Rm / m,

where R is the total number of relevant documents among the set of retrieved documents, and Rm is the number of relevant documents ranked among the top m positions in the final search result of the search engine. The authors selected 100 queries to calculate the average relative Recall of WEBSAIL. Each query is represented by a collection of at most five keywords. For each query, WEBSAIL is tested with the returned document number m set to 50, 100, 150, and 200, respectively. For each test, the number of iterations used and the number of documents judged by the user were recorded. The relative Recall and Precision were calculated based on manual examination of the relevance of the returned documents. The experiments reveal that WEBSAIL achieves an average of 0.95 relative Recall and an average of 0.46 relative Precision with an average of 3.72 iterations and an average of 13.46 documents judged as relevance feedback.
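The two measures can be computed directly from per-document relevance judgments; the sketch below follows the definitions of R and Rm given above (relative Recall as Rm/R, relative Precision as Rm/m).

```python
def relative_measures(relevant_flags, m):
    """relevant_flags[i] is True iff the i-th retrieved document, in
    final rank order, was judged relevant.  R counts relevant
    documents among all retrieved; R_m counts those within the top m
    positions."""
    R = sum(relevant_flags)
    Rm = sum(relevant_flags[:m])
    recall = Rm / R if R else 0.0
    precision = Rm / m if m else 0.0
    return recall, precision
```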
The YARROW system (Chen & Meng, 2000) is a multi-threaded program. Its architecture differs from that of WEBSAIL in two aspects: (1) it replaces the meta-searcher of WEBSAIL with a generic Query Constructor and a group of meta-searchers, and (2) it does not maintain its own internal index database. For each search process, it creates a thread and destroys the thread when the search process ends. Because of its light-weight size, it can be easily converted or ported to run in different environments or platforms. The predominant feature of YARROW, compared with existing meta-search engines, is the fact that it learns from the user's feedback in real-time on the client side. The learning algorithm TW2 used in YARROW has a surprisingly small mistake bound. YARROW may well be used as a plug-in component for Web browsers on the client side. A detailed report of the YARROW system is given in Chen and Meng (2000).
The FEATURES System
The FEATURES system (Chen, Meng, Fowler, & Zhu, 2001) is also a multi-threaded system, and its architecture is shown in Figure 2. The key difference between FEATURES and WEBSAIL is that FEATURES employs the two learning algorithms, FEX and TW2, to update the common weight vector w concurrently.
Figure 2: The architecture of FEATURES
For each query, FEATURES usually shows the top 10 ranked documents, plus the top 10 ranked features, for the user to judge document relevance and feature relevance. The format of presenting the top 10 ranked documents together with the top 10 ranked features is shown in Figure 3. In this format, document URLs and features are preceded by radio buttons for the user to indicate whether they are relevant or not.
Figure 3: The display format of FEATURES
If the current task is a learning process from the user's document and feature relevance feedback, the Dispatcher sends the feature relevance feedback information to the feature learner FEX and the document relevance feedback information to the document learner TW2. FEX uses the relevant and irrelevant features as judged by the user to promote and demote the related feature weights in the common weight vector w. TW2 uses the relevant and irrelevant documents judged by the user as positive and negative examples to promote and demote the weight vector. Once FEX and TW2 have finished promotions and demotions, the updated weight vector w is sent to the Query Searcher and to the Feature Ranker. The Feature Ranker re-ranks all the dynamic features, which are then sent to the Html Constructor. The Query Searcher searches the Index Database to find the matched documents, which are then sent to the Document Ranker. The Document Ranker re-ranks the matched documents and then sends them to the Html Constructor, which selects the documents and features to be displayed. Empirical results (Chen et al., 2001) show that FEATURES has substantially better search performance than AltaVista.
Timing Statistics
On December 13th and 14th of 2001, the authors conducted experiments to collect timing statistics for WEBSAIL, YARROW, and FEATURES. Thirty (30) query words were used to test each of these meta-search engines. Every time a query was sent, the wall-clock time needed for the meta-search engine to list the sorted result was recorded in the program. Also recorded was the wall-clock time to refine the search results based on the user's feedback. Since YARROW supports multiple external search engines, ALTAVISTA and NORTHERN LIGHT were selected as the external search engines when YARROW was tested. The external search engine used by WEBSAIL and FEATURES is ALTAVISTA. The following tables show the statistical results at the 95% confidence level. The original response time is t_orig, the refining time is t_refine, and C.I. denotes the confidence interval.
Table 1: Response time of WEBSAIL (in seconds)
Table 2: Response time of YARROW (in seconds)
Table 3: Response time of FEATURES (in seconds)
The statistics from the tables indicate that while the standard deviations and the confidence intervals are relatively high, they are in a reasonable range that users can accept. It takes WEBSAIL, YARROW, and FEATURES on the order of a few seconds to 20 seconds to respond initially, because they need to get the information from external search engines over the network. However, even the initial response time is not long and hence is acceptable to the user.
The Commercial Applications
Intelligent Web search can find many commercial applications. This section will concentrate on applications to e-commerce. E-commerce can be viewed as having three major components: the service and goods suppliers, the consumers, and the information intermediaries (or infomediaries). The service and goods suppliers are the producers, or the source, of the e-commerce flow. The consumers are the destination of the flow. Infomediaries, according to Grover and Teng (2001), are an essential part of e-commerce. An enormous amount of information has to be produced, analyzed, and managed in order for e-commerce to succeed. In this context, Web search is a major player among the infomediaries. Other components of the infomediaries include communities of interest (e.g., online purchase), industry magnet sites (e.g., www.amazon.com), e-retailers, and even individual corporate sites (Grover & Teng, 2001). The machine-learning approaches in Web search studied in this chapter are particularly important in the whole context of e-commerce. The key feature of the machine-learning approach to Web search is interactive learning and narrowing the search results to what the user wants. This feature can be used in many e-commerce applications. The following are a few examples.
Building a partnership: As pointed out in Tewari et al. (2001), building a partnership between the buyers and the seller is extremely important for the success of an e-business. Tewari et al. used the Multi-Attribute Resource Intermediaries (MARI) infrastructure to approximate buyer and seller preferences. They compare the degree of matching between buyers and sellers by computing a distance between the two preference vectors. When the interactive learning features explored in this chapter are used in this process, the buyers and the sellers can negotiate the deal in real-time, thus greatly enhancing the capability of the system. A collection of sellers may provide an initial list of items available at certain prices for buyers to choose from. The buyers may also have a list of expectations. According to the model proposed in Tewari et al. (2001), the possibility of a match is computed statically. If a machine-learning approach is taken, the buyers and the sellers may interactively find the best deal, similar to the situation where a face-to-face negotiation is taking place.
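As an illustration of the vector-distance matching described above (MARI's actual attribute encoding and distance measure are not specified here), buyer and seller preferences can be compared by cosine similarity:

```python
import math

def preference_match(buyer, seller):
    """Score a buyer/seller pair by the cosine similarity of their
    attribute-preference vectors: 1.0 means identical preferences,
    0.0 means no overlap."""
    dot = sum(b * s for b, s in zip(buyer, seller))
    nb = math.sqrt(sum(b * b for b in buyer))
    ns = math.sqrt(sum(s * s for s in seller))
    return dot / (nb * ns) if nb and ns else 0.0
```

In an interactive setting, each party's feedback would adjust its own vector between rounds, so the match score evolves as the negotiation proceeds.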
Brokering between buyers and sellers: Brokerage between the producers and the consumers is a critical e-commerce component. Given a large number of producers and a large number of consumers, how can one efficiently find a match between what is offered on the market and what a buyer is looking for? The work described in Meyyappan (2001) and Santos et al. (2001) provides a framework for e-commerce search brokers. A broker here compares price information, product features, the reputation of the producer, and other information for a potential buyer. While in the previous category the seller and the buyer may negotiate interactively, here the buyer interacts with the broker(s) only, very similar to the real-world situation. The interactive machine-learning and related Web search technology can be applied in this category as well. The machine-learning algorithm will use the collection of potential sellers as a starting space and interactively search for the optimal seller for the user, based on the information collected by the brokerage software. The machine-learning algorithms discussed in this chapter can be used by a buyer interacting with the broker to get the best that is available on the market. For example, a broker may act as a meta-search engine that collects information from a number of sellers, behaving very much like a general-purpose search engine. A buyer asks her broker to get certain information; the broker, which is a meta-search engine equipped with TW2 or other learning algorithms, may search, collect, collate, and rank the information returned from seller sources for the buyer. The buyer can interact with the broker, just as in the scenario of Web search. The broker will refine its list until the buyer finds a satisfactory product and seller.
Interactive catalog: The service providers or the sellers can allow consumers to browse the catalog interactively. While they browse, the learning algorithm can pick up the users' interests and supply better information to the customer, much like what adaptive Web sites (Perkowitz & Etzioni, 2000) do for their customers. Here the learning can take place in two forms. The seller can explicitly ask how the potential buyers (browsers of the catalog) feel about the usefulness of the catalog. This is analogous to interactive learning using algorithms such as TW2. Researchers have reported approaches of this type (though they did not use TW2 explicitly); see Herlocker and Konstan (2001) for an example and other similar projects. In the second approach, the provider of the catalog (the seller) would learn the users' interests and behaviors implicitly, as reported in Claypool et al. (2001). A learning algorithm such as TW2 can be embedded in the catalog software. The buyer's interests and intentions can be captured through modified browser software. The learning algorithm can then revise the catalog listings by taking the buyer's Web page clicks as feedback. This is very similar to the Web search situation.
Web commerce infrastructure: Chaudhury, Mallick, and Rao (2000) describe using the Web in e-commerce as various channels. The Web can be used as an advertising channel, an ordering channel, and a customer support channel. All these channels should be supported by an interactive system in which customer feedback can be quickly captured, analyzed, and used in updating the e-commerce system.
Future Work
In the future, the authors plan to improve the interface of their systems. Right now, the systems display the URLs of the documents; if the user wants to know the contents of a document, he/she needs to click the URL to view the content. The authors plan to display the URL of a document together with a good preview of its content. The authors also want to highlight the index keywords in the preview and allow them to be clickable for feature extraction and learning.
The authors also plan to apply clustering techniques to increase the performance of their system. It is easy to observe that in most cases documents that are relevant to a search query can be divided into a few different clusters or groups. The authors believe that document clustering techniques such as graph spectral partitioning can be used to reduce the number of iterations of the learning process and to increase the performance of the system.
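As an illustration of the clustering direction described above, the following is a minimal sketch of graph spectral bipartitioning: documents are split by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the similarity graph's Laplacian). The similarity matrix is toy data, and this is only one of several spectral variants.

```python
import numpy as np

# Sketch: split a document-similarity graph into two clusters using the
# sign of the Fiedler vector of the unnormalized graph Laplacian.
# The similarity matrix below is illustrative toy data.

def spectral_bipartition(S):
    """S: symmetric similarity matrix; returns a 0/1 cluster label per doc."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                 # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two obvious groups: docs 0-1 are similar to each other, docs 2-3 likewise.
S = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
labels = spectral_bipartition(S)       # separates {0, 1} from {2, 3}
```

Recursing on each side of the cut yields more than two clusters; in the envisioned use, each cluster could then be presented or judged as a unit, cutting down the number of feedback iterations.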
Acknowledgment
The authors thank the two anonymous referees and the editor, Dr. Nansi Shi, for their valuable comments on the draft of this chapter. The final presentation of the chapter has greatly benefited from their comments.
URL References
(AV) AltaVista: www.altavista.com
(IF) Inference Find: www.infind.com
(KH) Kidshealth.com: www.kidshealth.com
(VG) Vacations.Com: www.vacations.com
(DP) Dogpile: www.dogpile.com
(IS) Infoseek: www.infoseek.com
(MC) MetaCrawler: www.metacrawler.com
References
Angluin, D. (1987). Queries and concept learning. Machine Learning, 2, 319−342.
Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). WebWatcher: A learning apprentice for the World Wide Web. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 6−12. AAAI Press.
Balabanović, M. (1997). An adaptive Web page recommendation service. In Proceedings of the First International Conference on Autonomous Agents, 378−387. New York: ACM Press.
Baeza−Yates, R., & Ribeiro−Neto, B. (1999). Modern Information Retrieval. Reading, MA: Addison−Wesley.
Bollacker, K., Lawrence, S., & Giles, C.L. (1998). CiteSeer: An autonomous Web agent for automatic retrieval and identification of interesting publications. In Proceedings of the Second International Conference on Autonomous Agents, 116−123. New York: ACM Press.
Bollacker, K., Lawrence, S., & Giles, C.L. (1999). A system for automatic personalized tracking of scientific literature on the Web. In Proceedings of the Fourth ACM Conference on Digital Libraries, 105−113. New York: ACM Press.
Brin, S., & Page, L. (1998). The anatomy of a large−scale hypertextual Web search engine. In Proceedings of the Seventh World Wide Web Conference.
Chaudhury, A., Mallick, D.N., & Rao, H.R. (2001). Web channels in E−commerce. Communications of the ACM, 44(1), 99−103.
Chen, Z., & Meng, X. (2000). Yarrow: A real−time client site meta search learner. In Proceedings of the AAAI 2000 Workshop on Artificial Intelligence for Web Search, 12−17 (the full version will appear in Journal of Intelligent Information Systems).
Chen, Z., Meng, X., & Fowler, R.H. (1999). Searching the Web with queries. Knowledge and Information Systems, 1, 369−375.
Chen, Z., Meng, X., Fowler, R.H., & Zhu, B. (2001). FEATURES: Real time adaptive features and document learning for Web search. Journal of the American Society for Information Science, 52(8), 655−665.
Chen, Z., & Zhu, B. (2002). Some formal analysis of Rocchio's similarity−based relevance feedback algorithm. Information Retrieval, 5(1), 61−86.
Chen, Z., Meng, X., Zhu, B., & Fowler, R. (2000). WebSail: From on−line learning to Web search. In Proceedings of the 2000 International Conference on Web Information Systems Engineering (the full version will appear in Knowledge and Information Systems, 4, 219−227).
Claypool, M., Brown, D., Le, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing.
Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77−87.
Lang, K. (1995). NewsWeeder: Learning to filter news. In Proceedings of the Twelfth International Conference on Machine Learning, 331−339.
Lewis, D. (1991). Learning in intelligent information retrieval. In Proceedings of the Eighth International Workshop on Machine Learning, 235−239.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear−threshold algorithm. Machine Learning, 2, 285−318.
Meng, X., & Chen, Z. (1999). Personalized Web search using information on the client's side. In Advances in Computer Science and Technologies, 985−992. Denver, CO: International Academic Publishers.
Meyyappan, A. (2001). Proposing a new multi−routing agent architecture for E−marketplace. In Proceedings of the 2001 International Internet Computing Conference, 275−277.
Pazzani, M., Muramatsu, J., & Billsus, D. (1996). Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 54−61.
Perkowitz, M., & Etzioni, O. (2000). Adaptive Web sites: Concept and case study. Artificial Intelligence, 118.
Santos, S.C., Anglim, S., & Meira, S.R.L. (2001). A framework for Web−commerce search brokers. In Proceedings of the 2001 International Internet Computing Conference, 261−267.
Shakes, J., Langheinrich, M., & Etzioni, O. (1997). Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, 189−200.
Terveen, L., Hill, W., Amento, B., McDonald, D., & Creter, J. (1997). PHOAKS: A system for sharing recommendations. Communications of the ACM, 40(3), 50−62.
Tewari, G., Berkovich, A., Gabovich, V., Liang, S., Ramakrishnan, A., & Maes, P. (2001). Sustaining individual incentives while maximizing aggregate social welfare: A mediated brokering technique for trading agents in next−generation electronic markets. In Proceedings of the 2001 International Internet Computing Conference, 247−253.
Yuwono, B., & Lee, D. (1996). Search and ranking algorithms for locating resources on the World Wide Web. In Proceedings of the International Conference on Data Engineering, 164−171.
Chapter 10: World Wide Web Search Engines
Wen−Chen Hu
University of North Dakota
Jyh−Haw Yeh
Boise State University
Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
Abstract
The World Wide Web now holds more than 800 million pages covering almost all issues. The Web's fast-growing size and lack of structural style present a new challenge for information retrieval. Numerous search technologies have been applied to Web search engines; however, the dominant search method has yet to be identified. This chapter provides an overview of the existing technologies for Web search engines and classifies them into six categories: 1) hyperlink exploration, 2) information retrieval, 3) metasearches, 4) SQL approaches, 5) content−based multimedia searches, and 6) others. At the end of this chapter, a comparative study of major commercial and experimental search engines is presented, and some future research directions for Web search engines are suggested.
Introduction
One of the most common tasks performed on the Web is to search Web pages, which is also one of the most frustrating and problematic. The situation is getting worse because of the Web's fast-growing size and lack of structural style, as well as the inadequacy of existing Web search engine technologies (Lawrence & Giles, 1999a). Traditional search techniques are based on users typing in search keywords, which the search services can then use to locate the desired Web pages. However, this approach normally retrieves too many documents, of which only a small fraction are relevant to the user's needs. Furthermore, the most relevant documents do not necessarily appear at the top of the query output list. A number of corporations and research organizations are taking a variety of approaches to try to solve these problems. These approaches are diverse, and none of them dominates the field. This chapter provides a survey and classification of the available World Wide Web search engine techniques, with an emphasis on nontraditional approaches. Related Web search technology reviews can also be found in (Gudivada, Raghavan, Grosky, & Kasanagottu, 1997; Lawrence & Giles, 1998b; Lawrence & Giles, 1999b; Lu & Feng, 1998).
Requirements of Web Search Engines
It is first necessary to examine what kind of features a Web search engine is expected to have in order to conduct effective and efficient Web searches, and what kind of challenges may be faced in the process of developing new Web search techniques. The requirements for a Web search engine are listed below, in order:
- up−to−date Web information;
Web Search Engine Technologies
Numerous Web search engine technologies have been proposed, and each technology employs a very different approach. This survey classifies the technologies into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content−based multimedia searches, and vi) others. The chapter is organized as follows: Section 2 introduces the general structure of a search engine, and Sections 3 to 8 introduce each of the six Web search engine technologies in turn. A comparative study of major commercial and experimental search engines is shown in Section 9, and the final section gives a summary and suggests future research directions.
Search Engine Structure
Two different approaches are applied to Web search services: genuine search engines and directories. The difference lies in how listings are compiled:
- Search engines, such as Google, create their listings automatically.
- Directories, in contrast, depend on human editors to compile their listings.
Figure 1: System structure of a Web search engine
Crawler

A crawler is a program that automatically scans various Web sites and collects Web documents from them. Crawlers follow the links on a site to find other relevant pages. Two search algorithms, breadth-first search and depth-first search, are widely used by crawlers to traverse the Web. The crawler views the Web as a graph, with the nodes being the objects located at Uniform Resource Locators (URLs). The objects could be reached via Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), mailto (e−mail), news, telnet, etc. Crawlers also return to sites periodically to look for changes. To speed up the collection of Web documents, several crawlers are usually sent out to traverse the Web at the same time. Three simple tools can be used to implement an experimental crawler:
- lynx: Lynx is a text browser for Unix systems. For example, the command lynx -source http://www.w3c.org/ downloads the Web page source code at http://www.w3c.org/.
- java.net: The java.net package of the Java language provides plenty of networking utilities. Two classes in the package, java.net.URL and java.net.URLConnection, can be used to download Web pages.
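A breadth-first crawl of the link graph described above can be sketched as follows. The fetch function is injected so the example runs on an in-memory toy "web"; a production crawler would actually download pages (e.g., with the tools mentioned above), parse links robustly, and respect politeness rules. All names here are illustrative.

```python
from collections import deque
import re

# Sketch of a breadth-first crawler: the Web is viewed as a graph of URLs,
# and pages are visited level by level. fetch() is injectable so the example
# runs on a toy in-memory "web" instead of the live network.

LINK_RE = re.compile(r'href="([^"]+)"')

def bfs_crawl(start_url, fetch, max_pages=100):
    """Return the list of URLs in the order a breadth-first crawl visits them."""
    visited, order = {start_url}, []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue                      # skip unreachable pages
        for link in LINK_RE.findall(html):
            if link not in visited:       # never enqueue a URL twice
                visited.add(link)
                queue.append(link)
    return order

# Toy web: a -> b, c ; b -> d. A depth-first crawl would visit d before c.
pages = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="d">D</a>',
    "c": "", "d": "",
}
order = bfs_crawl("a", fetch=lambda u: pages[u])   # ['a', 'b', 'c', 'd']
```

Swapping the deque's popleft for a pop from the right end turns the same code into the depth-first variant, which is why the two traversals are usually discussed together.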
Indexing Software

An indexer can use the following kinds of information, provided by HTML scripts, to locate the desired Web pages:
- Content: Page content provides the most accurate, full-text information. However, it is also the least-used type of information, since content extraction is still far less practical.
Search and Ranking Software
Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. A user enters a keyword or keywords, along with Boolean modifiers such as and, or, or not, into a search engine, which then scans indexed Web pages for the keywords. To determine in which order to display pages to the user, the engine uses an algorithm to rank pages that contain the keywords (Zhang & Dong, 2000). For example, the engine may count the number of times the keyword appears on a page. To save time and space, the engine may only look for keywords in metatags, which are HTML tags that provide information about a Web page. Unlike most HTML tags, metatags do not affect a document's appearance. Instead, they include such information as a Web page's contents and some relevant keywords. The following six sections give various methods of indexing, searching, and ranking the Web pages.
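The counting-and-ranking step just described can be sketched as follows; the pages, queries, and scoring choices are illustrative assumptions, not any particular engine's algorithm.

```python
# Sketch of the ranking step described above: score each indexed page by how
# many times the query keywords occur in it, then sort descending. Boolean
# "and"/"or" modifiers decide whether all or any keywords must appear.
# The page collection is toy data.

def rank_pages(pages, keywords, mode="and"):
    """pages: dict url -> text. Returns URLs sorted by total keyword frequency."""
    results = []
    for url, text in pages.items():
        words = text.lower().split()
        counts = [words.count(k.lower()) for k in keywords]
        matches = all if mode == "and" else any
        if matches(c > 0 for c in counts):       # apply the Boolean modifier
            results.append((sum(counts), url))
    return [url for score, url in sorted(results, reverse=True)]

pages = {
    "p1": "web search engines index the web",
    "p2": "search the web with web web crawlers",
    "p3": "databases and sql",
}
ranking = rank_pages(pages, ["web", "search"])   # ['p2', 'p1']
```

Frequency counting alone is easy to spam, which is exactly the weakness the link-based methods of the next section were designed to address.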
Hyperlink Exploration
Hypermedia documents contain cross references to other related documents by using hyperlinks, which allow the user to move easily from one to the other. Links can be tremendously important sources of information for indexers; the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page to which it points. This approach is based on identifying two important types of Web pages for a given topic:
- Authorities, which provide the best source of information on the topic; and
- Hubs, which provide collections of links to authorities (Bharat & Henzinger, 1999).
Analyzing the interconnections of a series of related pages can identify the authorities and hubs for a particular topic. A simple method to update a non−negative authority weight x_p and a non−negative hub weight y_p is given by Chakrabarti et al. (1999). If a page is pointed to by many good hubs, its authority weight is updated by using the following formula:

x_p = Σ_{q→p} y_q

where the notation q→p indicates that q links to p. Similarly, if a page points to many good authorities, its hub weight is updated via

y_p = Σ_{p→q} x_q

Unfortunately, applying the above formulas to the entire Web to find authorities and hubs is impracticable. Ideally, the formulas are applied to a small collection S_σ of pages that contain plenty of relevant documents. The concepts of a root set and a base set have been proposed by Kleinberg (1999) to find S_σ. The root set is usually constructed by collecting the t highest−ranked pages for the query σ from a search engine such as Google or Yahoo!. However, the root set may not contain most of the strongest authorities. A base set is therefore built by including any page pointed to by a page in the root set and any page that points to a page in the root set. Figure 2 shows an example of a root set and a base set. The above formulas can then be applied to a much smaller set, the base set, instead of the entire Web.
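The two update formulas can be iterated over the base set until the weights stabilize. Below is a hedged sketch on a toy link graph; the per-iteration normalization that keeps the weights bounded is a common practice but an assumption here, as is the graph itself.

```python
# Sketch of iterating the authority/hub update formulas on a small base set.
# links maps each page to the pages it points to; the graph is toy data.

def hits(links, iterations=20):
    pages = sorted(set(links) | {p for ts in links.values() for p in ts})
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(iterations):
        # authority update: x_p = sum of y_q over pages q linking to p
        x = {p: sum(y[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub update: y_p = sum of x_q over pages q that p links to
        y = {p: sum(x[q] for q in links.get(p, ())) for p in pages}
        # normalize so the weights stay bounded across iterations
        nx, ny = sum(x.values()) or 1.0, sum(y.values()) or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y

# Three hubs all point to "auth", so "auth" should get the top authority
# weight, and "h1" (which points to the most authorities) the top hub weight.
links = {"h1": ["auth", "other"], "h2": ["auth"], "h3": ["auth"]}
auth, hub = hits(links)
```

The mutual reinforcement is visible in the two comprehensions: good hubs boost the authorities they point to, and good authorities boost the hubs pointing at them.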
In addition to the methods used to find authorities and hubs, a number of search methods based on connectivity have been proposed. A comparative study of various hypertext link analysis algorithms is given in (Borodin et al., 2001). The most widely used method is the PageRank model (Brin & Page, 1998), which suggests that the reputation of a page on a topic is proportional to the sum of the reputation weights of pages pointing to it on the same topic. That is, links emanating from pages with high reputations are weighted more heavily. The concepts of authorities and hubs, together with the PageRank model, can also be used to compute the reputation rank of a page; those topics for which the page has a good reputation are then identified (Rafiei & Mendelzon, 2000). Some other ad hoc methods include a Hyperlink Vector Voting (HVV) method (Li, 1998) and a system known as WebQuery (Carriere & Kazman, 1997). The former method uses the content of hyperlinks to a document to rank its relevance to the query terms, while the latter system studies the structural relationships among the nodes returned in a content−based query and gives the highest ranking to the most highly connected nodes. An improved algorithm obtained by augmenting connectivity analysis with content analysis is introduced in Bharat and Henzinger (1998).
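A minimal power-iteration sketch of a PageRank-style computation follows; the damping factor of 0.85 and the toy link graph are illustrative assumptions, not parameters taken from the original system.

```python
# Sketch of the PageRank-style model: a page's weight is (mostly) the sum of
# the weights of the pages linking to it, each shared over its out-links.
# Damping factor and graph are illustrative choices.

def pagerank(links, damping=0.85, iterations=50):
    pages = sorted(set(links) | {p for ts in links.values() for p in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # each in-neighbor q passes on rank[q] / (number of q's out-links)
            inbound = sum(rank[q] / len(links[q])
                          for q in pages if p in links.get(q, ()))
            new[p] = (1 - damping) / n + damping * inbound
        rank = new
    return rank

# "hub" is pointed to by every other page, so it should end up ranked highest.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"]}
rank = pagerank(links)
```

The (1 − damping)/n term models a surfer who occasionally jumps to a random page; without it, rank would drain into pages with no out-links.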
Figure 2: Expanding the root set into a base set
Information Retrieval (IR)
IR techniques are widely used in Web document searches (Gudivada et al., 1997). Among them, relevance feedback and data clustering are two of the most popular techniques used by search engines. The former method has not so far been applied to any commercial products because it requires some interaction with users, who normally prefer to use a keyword−only interface. The latter method has achieved more success since it does not require any interaction with users to achieve acceptable results.
For the above example, the search results after modification should not include Result #4.
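The relevance feedback technique discussed above is classically realized by Rocchio's query update (analyzed in Chen & Zhu, 2002, in the references): the query vector moves toward the centroid of documents the user marked relevant and away from the non-relevant centroid. The sketch below uses conventional but illustrative alpha/beta/gamma values and a toy term space.

```python
import numpy as np

# Sketch of a Rocchio-style relevance feedback update: shift the query
# vector toward relevant documents and away from non-relevant ones.
# alpha/beta/gamma and the term space are illustrative assumptions.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(query, dtype=float)
    if relevant:
        q += beta * np.mean(relevant, axis=0)     # pull toward relevant centroid
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0) # push away from non-relevant
    return np.clip(q, 0.0, None)                  # drop negative term weights

# Toy term space: [web, search, sql]. Feedback pulls the query toward the
# "web search" documents and away from the SQL one.
q0 = [1.0, 1.0, 0.0]
q1 = rocchio(q0,
             relevant=[[1.0, 0.5, 0.0], [0.8, 0.9, 0.0]],
             nonrelevant=[[0.0, 0.1, 1.0]])
```

Each feedback round repeats the update with the newly judged documents, which is the iterative loop the intelligent search engines described in this chapter exploit.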
Data Clustering
Data clustering is used to improve the search results by dividing the whole data set into data clusters. Each data cluster contains objects of high similarity, and clusters are produced that group documents relevant to the user's query separately from irrelevant ones. For example, the formula below gives a similarity measure:

similarity(D_i, D_j) = Σ_k (weight_ik × weight_jk) / (sqrt(Σ_k weight_ik²) × sqrt(Σ_k weight_jk²))

where weight_ik is the weight assigned to term_k in a document D_i (Baeza−Yates, 1992). Clustering should not be based on the whole Web resource, but on smaller separate query results. In Zamir and Etzioni (1998), a Suffix Tree Clustering (STC) algorithm based on phrases shared between documents is used to create clusters. Besides clustering the search results, a proposed similarity function has been used to cluster similar queries according to their contents as well as user logs (Wen, Nie, & Zhang, 2001). The resulting clusters can provide useful information for Frequently Asked Queries (FAQ) identification. Another Web document clustering algorithm is suggested in Chang and Hsu (1999).
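A cosine form of such a similarity measure, combined with a naive single-pass grouping of query results, can be sketched as follows; the threshold, vectors, and greedy grouping rule are toy choices, and real systems use more refined algorithms such as STC.

```python
import math

# Sketch: cosine similarity between term-weight vectors, then a greedy
# single-pass grouping of documents by similarity to each cluster's first
# member. Weights and threshold are illustrative toy data.

def similarity(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    norm = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.8):
    """Assign each doc to the first cluster whose leader is similar enough,
    else start a new cluster."""
    clusters = []
    for d in docs:
        for c in clusters:
            if similarity(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # two near-duplicates + one outlier
groups = cluster(docs)                         # two clusters
```

Because the grouping runs only over a single query's results rather than the whole Web, even this quadratic-looking pass stays cheap in practice.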
of a metasearch engine, which consists of three major components:
Dispatch: Determines to which search engines a specific query is sent. The selection is usually based on network and local computational resources, as well as the long−term performance of search engines on specific query terms.