An Experiment in Building Vertical Search Engine
Vietnam National University of Hanoi
University of Engineering and Technology
STUDENT RESEARCH SEMINAR, 2012
Project:
An Experiment in Building Vertical Search Engine
Students:
Faculty: Information Technology
Supervisor: Dr Lê Quang Hiếu
Hanoi, 2012
PROJECT SUMMARY
Project: An Experiment in Building Vertical Search Engine
Project members: Phạm Ngọc Quân, Phạm Lê Lợi, Bùi Hữu Điệp, Lê Đăng Đạt
Project supervisor: Dr Lê Quang Hiếu, Department of Information Technology, UET
Management: University of Engineering and Technology, VNU Hanoi
Research time: 9/2011 – 3/2012
1 Motivation
When using a general search engine, users get results covering many aspects, because websites are not classified by topic. However, for people searching for information about a specific topic, websites in that category should be prioritized. With a popular general search engine, users have to read through the search results and pick the suitable ones, which is inconvenient. The purpose of our research is an experiment in building a search engine that lets users choose the domain they are interested in, so that the returned results relate closely to the chosen domain.
2 Main content
In this project, we focus on building a search engine that works as an upgrade of popular search engines. The vertical search engine collects the search results from one or several popular search engines (several are used if their search methods differ noticeably). After that, a classification module decides which sites are domain-related. Finally, the results are filtered by removing off-topic webpages and then returned to the user. We also add keyword suggestions, such as keyword correction and keyword expansion, so that users can get better results.
3 Research result
In our experiment, we demonstrate the idea by building a vertical search engine on the topic of Football. The search engine contains a data collection module that gets search results from the Yahoo and Bing search engines, and a classification module that uses a Support Vector Machine classifier. The whole project was written in Java and deployed as a website using JSP.
A separate experiment is the keyword suggestion module, which was written in Python. Since the running time of this module is inconvenient when it is integrated with JSP, it was removed from the full search engine, but it can be tested separately.
TABLE OF CONTENTS
I INTRODUCTION
II LITERATURE REVIEW
1 In the world
2 In Vietnam
3 Our research goal
III VERTICAL SEARCH ENGINE
1 System Architecture
2 System’s Features
IV SYSTEM MODULES
1 Meta Search Engine
1.1 Introduction
1.2 Operation
1.3 Structure
2 Website Filter module
2.1 Introduction
2.2 Support Vector Machine Model introduction
2.3 LIBLINEAR introduction
2.4 Website Filter
3 Keyword Suggestion Model
3.1 Introduction
3.2 Operation
3.3 Algorithms
4 Search Interface
V EXPERIMENTAL RESULT
VI CONCLUSION
VII REFERENCES
LIST OF FIGURES
Figure 1: System architecture
Figure 2: Meta Search Structure
Figure 3: Data classified with a hyperplane by SVM
Figure 4: Keyword Suggestion workflow
Figure 5: Search Home Page
Figure 6: Search Result for the simple keyword “football”
Figure 7: Search Results with page change
Figure 8: Vertical Search Experiment
Figure 9: Keyword Suggestion Experiment 1
Figure 10: Keyword Suggestion Experiment 2
I INTRODUCTION
The Internet is becoming more and more popular in every country, and its capacity grows every second. The Internet's complex structure and huge amount of data have been major obstacles for its users. That was the reason for the introduction of a large number of search engines, among which Google has been a big success. However, a general search engine like Google, which treats data in all domains equally, becomes inconvenient when users prefer one specific domain to the others. In this situation, a vertical search engine, with the contribution of domain-specific expertise, would perform better.
A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, and travel. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.
Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine. Vertical search offers several potential benefits over general search engines:
Greater precision due to limited scope
Leverage domain knowledge including taxonomies and ontology
Support specific unique user tasks
Domain-specific search is the part of vertical search that focuses on a specific topic. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.[2]
Normally, the process of building a search engine will consist of the steps below:
Creating a crawler that collects websites from the Internet. This step covers the Internet and builds a database of websites for searching.
Indexing the websites.
Query processing: processing the user's query with natural language processing and matching it against the websites to list the appropriate results.
Determining the website ranks and returning the ranked list to the user.
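The indexing, query and ranking steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pages in `PAGES` are hypothetical stand-ins for crawler output, and ranking by summed term frequency is a deliberately simple placeholder, not the ranking method of any real search engine.

```python
from collections import Counter, defaultdict

# Hypothetical pages standing in for what a crawler would collect.
PAGES = {
    "a.example/match": "football match report football goal",
    "b.example/cars": "jaguar car engine review",
    "c.example/league": "football league table",
}

def build_index(pages):
    """Indexing step: map each term to {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][url] = count
    return index

def search(index, query):
    """Query and ranking steps: score pages by summed term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken by URL for a stable order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ordered]

index = build_index(PAGES)
results = search(index, "football league")
```

Even this toy version shows why a full engine is costly: the index grows with every crawled page, and the quality of the result list depends entirely on the scoring function.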
However, these steps require massive storage as well as a sophisticated algorithm for ranking the websites. Instead of building a whole new search engine from scratch, a metasearch engine is another way to make one.
A metasearch engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. Metasearch engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately.[3]
In our research, we combine the technology of domain-specific search engines with the idea of metasearch engines, resulting in a two-level structure that combines the advantages of each kind of search engine.
II LITERATURE REVIEW
1 In the world
Huge expenses to build the index, find the data, maintain the process
Majority of time spent on building relevancy and less on design and creating a unique experience
Search APIs reduce the complexity of building an index
Vertical search engines still spend significant resources on creating unique data
More resources are spent on designing the best relevancy and a unique experience
Future
New search engines tap into huge amounts of distributed data
More time for developing unique approaches to presenting relevant information and creating a unique experience
Vertical search engines have a distinct advantage over the general search engines. They already know what their users are interested in. A search for Jaguar in Yahoo! may return the automobile, the Mac OS, or the animal. However, vertical search engines that specialize in sports, autos, or animals would not have that problem. This assumption of user interest gives vertical search engines more flexibility in creating new models of relevancy ranking.[4]
General search engines such as Google, Yahoo or Bing are familiar to every Internet user and are considered indispensable tools. People now even go to search pages to find websites they already know, instead of typing the website's name directly into the address bar. On the other hand, vertical search engines and metasearch engines have had very little success. Some vertical search engines have been released, such as MedNar, PubMed and BizNar, and some metasearch websites, such as iBoogie and InfoGrid, have been built, but none of them has become famous worldwide. Currently, the vertical search mechanism and the metasearch method each seem not powerful enough on their own to overtake classic search engines; they should be researched further or combined.
A good representative of vertical search is Truevert, an environmental vertical search engine that goes beyond the basic assumption of a niche user's intentions. It builds a unique natural language dictionary to enhance relevancy. A search for "CFL" on a regular search engine could return "Canadian Football League", but Truevert recognizes this as the acronym for "Compact Fluorescent Lighting", a much more relevant term for environmental concerns.[5]
Historically, Yahoo once dominated search, and then fell to second place after the rise of Google. The success of vertical search engines may come in the future, when convenience is more appreciated.
2 In Vietnam
In Vietnam, general search engines are very popular: almost every website has its own search engine. However, the ideas of vertical search and metasearch have not been explored by industry.
3 Our research goal
We want to demonstrate the idea of a vertical search engine that re-filters the results of a general search engine based on a specific topic such as medicine, health, football, weather, or economy. A simple vertical search engine should have a user interface similar to that of general search engines, its speed should be acceptable, its results should be prioritized based on the topic, and keywords should be suggested to users.
Beyond that, the experiment can also serve as an approach to quickly building a vertical search engine with the least effort.
III VERTICAL SEARCH ENGINE
1 System Architecture
The architecture of the system is layer-based: each layer represents one level of filtering. The more layers we have, the more irrelevant websites are filtered out, and thus the better the results are. The layers are independent, so we can easily add or remove a layer or replace it with a new one.
We have developed 4 modules: Meta Search Engine, Webpage Filter, Keyword Suggestion, and Search Interface:
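The layer idea can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each layer is a predicate applied to one search result; the function names and the sample results are hypothetical and are not taken from the project's Java code.

```python
def run_layers(results, layers):
    """Pass the result list through each filter layer in order."""
    for layer in layers:
        results = [r for r in results if layer(r)]
    return results

# Two hypothetical layers; each is a predicate on one result dict.
def has_text(result):
    return bool(result["text"].strip())

def mentions_topic(result):
    return "football" in result["text"].lower()

candidates = [
    {"url": "a.example", "text": "Football scores"},
    {"url": "b.example", "text": ""},
    {"url": "c.example", "text": "Stock market news"},
]
kept = run_layers(candidates, [has_text, mentions_topic])
```

Because the layers share one interface, adding, removing or replacing a layer only changes the list passed to `run_layers`.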
Figure 1: System architecture (components: Search Interface, Meta-search Engine, Keyword Suggestion, Filter, Knowledge Base)
The Metasearch Engine uses the metasearch technique to query other search engines for the search keyword. It then gathers all returned results together with their scores, transforms them from HTML into plain text, and sends the result to the next module up, the Webpage Filter (text classification).
The Webpage Filter takes the results from the Metasearch Engine (Web Crawler) and refines them based on the knowledge base, using Support Vector Machine classification. All results that pass are sent to the Search Interface for display.
Keyword Suggestion, which is independent of the Web Crawler, gets suggestions from other search engines and then uses Information Gain (IG) to reorder them. The highest-scoring suggestions are sent to the Search Interface for display.
The Search Interface lets the user choose the number of pages to retrieve and enter the keyword. It then calls the Webpage Filter (text classification) and Keyword Suggestion to get the returned pages and the suggestions, respectively. Finally, the results are displayed to the user.
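To make the Information Gain step of Keyword Suggestion concrete, the sketch below scores candidate terms by how much knowing whether a term occurs in a document reduces the uncertainty about the document's topic label. The labelled corpus `DOCS` and the candidate list are hypothetical, and this is the textbook definition of IG rather than the project's exact scoring code.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(term, docs):
    """IG(term) = H(C) - P(present) H(C|present) - P(absent) H(C|absent)."""
    labels = [label for _, label in docs]
    present = [label for text, label in docs if term in text.split()]
    absent = [label for text, label in docs if term not in text.split()]
    n = len(docs)
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))

# Hypothetical labelled corpus: (text, is_on_topic).
DOCS = [
    ("football league goal", True),
    ("football transfer news", True),
    ("stock market news", False),
    ("weather forecast today", False),
]
suggestions = ["news", "football"]
ranked = sorted(suggestions, key=lambda t: information_gain(t, DOCS), reverse=True)
```

In this toy corpus "football" separates the two classes perfectly, while "news" appears equally often in both, so "football" is ranked first.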
2 System’s Features
By combining a popular and efficient search engine such as Bing or Yahoo with the ability to refine results by category, the system offers a vertical search service that helps users find useful information on the topic without having to filter the search results themselves. The keyword suggestion function helps users get keywords within the topic, which improves on the keyword suggestion function of popular search engines.
The system also offers a way to set up a personal or topic-specific search engine to serve an organization or company for a limited time when specific information needs to be searched, a task for which popular search engines are not reliable.
IV SYSTEM MODULES
1 Meta Search Engine
1.1 Introduction
The Meta Search Engine is the module that stands between the search engine and the Internet. It is a Java program that receives a keyword from the upper layer and then uses the Internet as the resource to find related pages. It returns an unordered list of pages covering all aspects to the upper layer.
1.2 Operation
On receiving a keyword from the user, the Web Crawler performs the following tasks:
Query multiple search engines and get multiple lists of pages
Merge the lists, remove duplicates, and check the pages’ availability
Transform the received pages into plain text and pass them to the upper layer
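The merge step in the task list above can be sketched as follows. The URL lists and the `is_available` callback are hypothetical: a real implementation would query the engines and issue HTTP requests, whereas here a simple stand-in predicate keeps the sketch runnable offline. The URL normalisation (trailing slash, letter case) is also a crude assumption.

```python
def merge_results(result_lists, is_available):
    """Concatenate lists, drop duplicate URLs, keep only reachable pages."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = url.rstrip("/").lower()  # crude normalisation (an assumption)
            if key in seen:
                continue
            seen.add(key)
            if is_available(url):
                merged.append(url)
    return merged

# Hypothetical result lists from two engines.
engine_a = ["http://a.example/", "http://b.example/news"]
engine_b = ["http://b.example/news", "http://A.example", "http://dead.example"]
# Stand-in for a real HTTP availability check.
merged = merge_results([engine_a, engine_b], lambda url: "dead" not in url)
```

The output keeps the order in which pages were first seen, which matches the module's description of returning an unordered (unranked) list to the upper layer.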
1.3 Structure
This module consists of 2 smaller modules: the Web downloader and the HTML-to-text converter.
Figure 2: The metasearch structure (components: Web downloader, Text Converter)
The Web downloader queries other search engines for the keyword by requesting the appropriate URLs. Then, from the returned HTML pages, it extracts the links to the result pages, their brief descriptions, their scores and their availability. It then visits each page to get its content, creates a list of HTML files and sends it to the next module.
After receiving the HTML list from the Web downloader, the text converter reads through each HTML file's content and removes JavaScript, CSS and other frames to obtain a list of natural-language texts. That list is later sent to the classification module.
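The conversion from HTML to plain text can be sketched with Python's standard `html.parser` module. The class name `TextExtractor` and the sample page are hypothetical; the project's own converter was written in Java, so this is only an illustration of the idea of dropping script and style content and keeping the visible text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

page = ("<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Match report</h1><p>Two goals.</p></body></html>")
plain = html_to_text(page)
```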
2 Webpage Filter module
2.1 Introduction
This is the main module that decides the quality of the search results. Using previously collected data and a learned model, this module decides whether a webpage belongs to the topic of the search engine (the topic we chose for demonstration is Football). The classification uses a Support Vector Machine as the classifier, with the words in the web pages as attributes.
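Using words as attributes means that each page is turned into a vector of word counts before classification. The sketch below shows one common way to do this (a bag-of-words model with a sparse index-to-count mapping); the function names and sample texts are hypothetical, and the project's actual feature extraction may differ, for example in tokenisation or weighting.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign each distinct word a feature index, in sorted order."""
    words = sorted({w for text in texts for w in text.lower().split()})
    return {word: i for i, word in enumerate(words)}

def to_features(text, vocabulary):
    """Sparse vector {feature index: count}; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

training_texts = ["football goal goal", "stock market"]
vocab = build_vocabulary(training_texts)
vector = to_features("goal football goal rugby", vocab)
```

A sparse mapping is a natural fit here because a page contains only a small fraction of the vocabulary, and linear SVM libraries typically accept input in this index-to-value form.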
2.2 Support Vector Machine Introduction
A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.[6]
Figure 3: Data classified with a hyperplane by SVM
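The hyperplane and margin described above can be made concrete with a small numeric sketch. The weight vector `w`, the bias `b` and the data points are hand-picked, hypothetical values; no training is performed here, the code only evaluates which side of a given hyperplane each point falls on and how far the nearest point is from it.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w||."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Smallest distance from any training point to the hyperplane."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical 2-D data and a hand-picked hyperplane x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0
positives = [[3.0, 2.0], [4.0, 3.0]]
negatives = [[1.0, 0.0], [0.0, 1.0]]
smallest = margin(w, b, positives + negatives)
```

An SVM trainer searches for the `w` and `b` that make this smallest distance as large as possible while keeping the two classes on opposite sides.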
2.3 The LIBLINEAR Library Introduction
LIBLINEAR is a library for classification, built by the Machine Learning Group at National Taiwan University.
LIBLINEAR is a linear classifier for data with millions of instances and features. It supports:
L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression
L1-regularized classifiers: L2-loss linear SVM and logistic regression
Main features of LIBLINEAR:
Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
Cross validation for model selection
Probability estimates (logistic regression only)
Weights for unbalanced data
MATLAB/Octave, Java, Python, Ruby interfaces[7]
The SVM built into LIBLINEAR works as follows: