An Experiment in Building Vertical Search Engine

Vietnam National University of Hanoi

University of Engineering and Technology

STUDENT RESEARCH SEMINAR, 2012

Project:

An Experiment in Building Vertical Search Engine

Students:

Faculty: Information Technology

Supervisor: Dr Lê Quang Hiếu

Hanoi, 2012

PROJECT SUMMARY

Project: An Experiment in Building Vertical Search Engine

Project members: Phạm Ngọc Quân, Phạm Lê Lợi, Bùi Hữu Điệp, Lê Đăng Đạt

Project supervisor: Dr Lê Quang Hiếu, Department of Information Technology, UET

Management: University of Engineering and Technology, VNU Hanoi

Research time: 9/2011 – 3/2012

1 Motivation

When using a general search engine, users get results covering many unrelated aspects, because the websites are not classified by topic. However, for people who are searching for information about a specific topic, websites in that category should be prioritized. With a popular search engine, users have to read through the search results and pick out the suitable ones, which is inconvenient. The purpose of our research is to experiment with building a search engine that lets users choose the domain they are interested in, so that the returned results relate closely to the chosen domain.

2 Main content

In this project, we focus on building a search engine that works as an upgrade of a popular search engine. The vertical search engine collects search results from one or several popular search engines (several if their search methods differ noticeably). A classification module then decides which sites are domain-related. Finally, the results are filtered by removing off-topic webpages and returned to the user. We also add keyword suggestions, such as keyword correction and keyword expansion, so that users can get better results.

3 Research result

In our experiment, we demonstrate the idea by building a vertical search engine for the topic Football. The search engine contains a data-collection module that gets search results from the Yahoo and Bing search engines, and a classification module that uses a Support Vector Machine classifier. The whole project was written in Java and deployed as a website using JSP.

A separate experiment is the keyword suggestion module, which was written in Python. Because the running time of this module when integrated with JSP was inconveniently long, it was removed from the complete search engine, but it can be tested separately.

TABLE OF CONTENTS

I INTRODUCTION

II LITERATURE REVIEW

1 In the world

2 In Vietnam

3 Our research goal

III VERTICAL SEARCH ENGINE

1 System Architecture

2 System’s Features

IV SYSTEM MODULES

1 Meta Search Engine

1.1 Introduction

1.2 Operation

1.3 Structure

2 Website Filter module

2.1 Introduction

2.2 Support Vector Machine Model introduction

2.3 LIBLINEAR introduction

2.4 Website Filter

3 Keyword Suggestion Model

3.1 Introduction

3.2 Operation

3.3 Algorithms

4 Search Interface

V EXPERIMENTAL RESULT

VI CONCLUSION

VII REFERENCES

LIST OF FIGURES

Figure 1: System architecture

Figure 2: Meta Search Structure

Figure 3: Data classified with a hyperplane by SVM

Figure 4: Keyword Suggestion workflow

Figure 5: Search Home Page

Figure 6: Search Result for the simple keyword “football”

Figure 7: Search Results with page change

Figure 8: Vertical Search Experiment

Figure 9: Keyword Suggestion Experiment 1

Figure 10: Keyword Suggestion Experiment 2

I INTRODUCTION

The Internet is becoming more and more popular in every country, and the amount of data it holds grows every second. The Internet's complex structure and huge amount of data have been serious obstacles for its users. This led to the introduction of a large number of search engines, of which Google was a big success. However, a general search engine like Google, which treats data in all domains equally, becomes inconvenient when users prefer one specific domain to the others. In this situation, a vertical search engine, which contributes domain-specific expertise, can perform better.

A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, and travel. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.

Some vertical search sites focus on individual verticals, while other sites include multiple vertical searches within one search engine. Vertical search offers several potential benefits over general search engines:

• Greater precision due to limited scope

• Leverage of domain knowledge, including taxonomies and ontologies

• Support for specific, unique user tasks

Domain-specific search is the part of vertical search that focuses on a specific topic. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.[2]

Normally, the process of building a search engine will consist of the steps below:

• Creating a crawler that collects websites from the Internet. This step covers the Internet and builds a database of websites for searching.

• Indexing the websites.

• Query processing: processing the user's query with natural language processing and matching it against the websites to list the appropriate results.

• Determining the website ranks and returning the ranked list to the user.
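As a rough illustration of the indexing, query matching and ranking steps listed above, the following Python sketch builds a tiny inverted index in memory. It is a minimal sketch only: the function names, the whitespace tokenizer and the term-count ranking are assumptions made for illustration, and a real search engine needs far more storage and a much better ranking algorithm.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, pages, query):
    """Return ids of pages containing every query word, best match first."""
    words = query.lower().split()
    if not words:
        return []
    # Keep only the pages that contain all of the query words.
    candidates = set.intersection(*(index.get(w, set()) for w in words))

    def score(page_id):
        # Toy ranking (an assumption): total occurrences of the query words.
        tokens = pages[page_id].lower().split()
        return sum(tokens.count(w) for w in words)

    return sorted(candidates, key=lambda p: (-score(p), p))


pages = {
    "a": "football match report",
    "b": "football football transfer news",
    "c": "weather report",
}
index = build_index(pages)
ranked = search(index, pages, "football")
```

Here the crawler step is replaced by a hand-written dictionary of pages, which is exactly the part that becomes expensive at Internet scale.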

However, these steps require a massive amount of storage as well as a sophisticated algorithm for ranking the websites. Instead of building a whole new search engine from scratch, a metasearch engine is another way to create one.

A metasearch engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. Metasearch engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately.[3]

In our research, we combine the technology of a domain-specific search engine with the idea of a metasearch engine, resulting in a two-level structure that combines the advantages of each.

II LITERATURE REVIEW

1 In the world

• Huge expenses to build the index, find the data, maintain the process

• Majority of time spent on building relevancy and less on design and creating a unique experience

• Search APIs reduce the complexity of building an index

• Vertical search engines still spend significant resources on creating unique data

• More resources are spent on designing the best relevancy and a unique experience

Future

• New search engines tap into huge amounts of distributed data

• More time for developing unique approaches to presenting relevant information and creating a unique experience

Vertical search engines have a distinct advantage over general search engines: they already know what their users are interested in. A search for Jaguar in Yahoo! may return the automobile, the Mac OS, or the animal. However, vertical search engines that specialize in sports, autos, or animals would not have that problem. This assumption of user interest gives vertical search engines more flexibility in creating new models of relevancy ranking.[4]

General search engines such as Google, Yahoo, and Bing are familiar to every Internet user and are considered indispensable tools. People now even go to search pages to find websites they already know, instead of entering the website's name directly in the address bar. On the other hand, vertical search engines and metasearch engines have had very few successes. Some vertical search engines have been released, such as MedNar, PubMed, and BizNar, and some metasearch websites, such as iBoogie and InfoGrid, have been built, but none of them has become famous worldwide. Currently, vertical search and metasearch each seem insufficient on their own to displace classic search engines; they should be researched further or combined.

A good representative of vertical search is Truevert, an environmental vertical search engine that goes beyond the basic assumption of a niche user's intentions. It builds a unique natural language dictionary to enhance relevancy. A search for "CFL" on a regular search engine could return "Canadian Football League", but Truevert recognizes it as the acronym for "Compact Fluorescent Lighting", a much more relevant term for environmental concerns.[5]

Historically, Yahoo once dominated search; it then fell to second place after the rise of Google. The success of vertical search engines may lie in the future, when convenience is valued more.

2 In Vietnam

In Vietnam, general search engines are very popular, and almost every website has its own search function. However, the ideas of vertical search and metasearch have not been explored by industry.

3 Our research goal

We want to demonstrate the idea of a vertical search engine that re-filters the results of a general search engine based on a specific topic such as medicine, health, football, weather, or the economy. A simple vertical search engine should have a user interface similar to that of general search engines, its speed should be acceptable, its results should be prioritized by topic, and it should suggest keywords to users.

Beyond that, the experiment can also serve as an approach to quickly building a vertical search engine with minimal effort.

III VERTICAL SEARCH ENGINE

1 System Architecture

The system architecture is layer-based. Each layer represents one level of filtering: the more layers we have, the more irrelevant websites are filtered out, and thus the better the results are. The layers are independent, so we can easily add or remove a layer or replace it with a new one.
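The idea of independent filter layers can be sketched in Python as a list of functions applied one after another. This is only an illustration under assumed names: `apply_layers`, `has_text` and `mentions_topic` are hypothetical, and the keyword test stands in for the real classifier used by the system.

```python
def apply_layers(results, layers):
    """Pass the results through each filter layer in order."""
    for layer in layers:
        results = [r for r in results if layer(r)]
    return results


def has_text(result):
    """Hypothetical layer: drop pages with no extracted text."""
    return bool(result["text"].strip())


def mentions_topic(result):
    """Hypothetical layer: a crude stand-in for the topic classifier."""
    return "football" in result["text"].lower()


results = [
    {"url": "a", "text": "Football scores"},
    {"url": "b", "text": "   "},
    {"url": "c", "text": "Weather today"},
]
filtered = apply_layers(results, [has_text, mentions_topic])
```

Because each layer only sees a list of results and returns a shorter list, adding, removing or replacing a layer means editing the list passed to `apply_layers` and nothing else.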

We have developed 4 modules: Meta Search Engine, Webpage Filter, Keyword Suggestion, and Search Interface:

Figure 1: System architecture (components shown: Search Interface, Meta-search Engine, Keyword Suggestion, Filter, Knowledge Base)

The Meta Search Engine uses the metasearch technique to query other search engines for the search keyword. It then gathers all the returned results together with their scores, converts them from HTML to plain text, and sends the result to the next module, the Webpage Filter, for text classification.

The Webpage Filter takes the results from the Meta Search Engine and refines them based on the knowledge base, using Support Vector Machine classification. All results that pass are sent to the Search Interface for display.

Keyword Suggestion, which is independent of the Meta Search Engine, gets suggestions from other search engines and then uses Information Gain (IG) to reorder them. The top-scoring suggestions are sent to the Search Interface for display.

The Search Interface lets the user choose the number of pages to retrieve and enter the keyword. It then calls the Webpage Filter and Keyword Suggestion to get the returned pages and suggestions, respectively. Finally, the results are displayed to the user.
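To make the Information Gain score used by Keyword Suggestion more concrete, the sketch below computes the standard IG of a term with respect to class labels: the entropy of the labels minus the expected entropy after splitting the documents on whether they contain the term. How the real module builds its documents and labels is not described here, so the data layout (a list of word-set and label pairs) and the function names are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """IG of `term` over docs given as (set_of_words, label) pairs."""
    n = len(docs)
    if n == 0:
        return 0.0
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, lab in subset if lab == label) for label in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    # Expected entropy after splitting on presence of the term.
    conditional = sum(
        len(part) / n * entropy(class_counts(part))
        for part in (with_term, without_term)
    )
    return entropy(class_counts(docs)) - conditional


docs = [
    ({"goal", "league"}, "football"),
    ({"goal", "striker"}, "football"),
    ({"rain", "forecast"}, "other"),
    ({"league", "rain"}, "other"),
]
ig_goal = information_gain(docs, "goal")
ig_league = information_gain(docs, "league")
```

In this toy data "goal" separates the two classes perfectly while "league" appears equally in both, so a ranking by IG would place "goal" first.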

2 System’s Features

By combining a popular and efficient search engine such as Bing or Yahoo with the ability to refine results by category, the system can offer a vertical search service that helps users find useful information on the topic without filtering the search results themselves. The keyword suggestion function helps users find keywords within the topic, which is an upgrade on the keyword suggestion function of popular search engines.

The system also offers a way to set up a personal or topic-specific search engine that can serve an organization or company for a limited time when specific information needs to be searched for, a task for which popular search engines are not reliable.

IV SYSTEM MODULES

1 Meta Search Engine

1.1 Introduction

The Meta Search Engine is the module that stands between the search engine and the Internet. It is a Java program that receives a keyword from the upper layer and then uses the Internet as the resource to find related pages. It returns an unordered list of pages covering all aspects to the upper layer.

1.2 Operation

After receiving a keyword from the user, the Meta Search Engine does the following tasks:

• Query multiple search engines and get multiple lists of pages

• Merge the lists, remove duplicates, and check the pages' availability

• Convert the received pages into plain text and pass them to the upper layer
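The merge and de-duplication step above can be sketched as follows. This is an illustrative Python version rather than the project's Java code; the URL normalization rule (ignore the scheme, a leading "www." and a trailing slash) is an assumption, and the availability check is left out because it needs network access.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Build a comparison key so trivially different URLs count as one page."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def merge_results(*result_lists):
    """Concatenate result lists, keeping the first occurrence of each page."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


engine_a = ["http://www.example.com/news/", "http://example.org/a"]
engine_b = ["https://example.com/news", "http://example.net/b?x=1"]
merged = merge_results(engine_a, engine_b)
```

Keeping the first occurrence means the order of the first engine's list is preserved, which matters if the engines' own scores are used later.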

1.3 Structure

This module consists of 2 smaller modules: the Web downloader and the HTML-to-text converter.

Figure 2: The metasearch structure (Web downloader, followed by Text converter)

The Web downloader queries other search engines for the keyword by requesting the appropriate URLs. Then, from the returned HTML pages, it extracts the links to the result pages, their brief descriptions, their scores, and their availability. It then visits each page, fetches the page's content, creates a list of HTML files, and sends it to the next module.


After receiving the HTML list from the Web downloader, the text converter reads through the HTML content and removes JavaScript, CSS, and other frames to obtain a list of natural-language texts. That list is then sent to the classification module.
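A minimal sketch of this HTML-to-text conversion, using only Python's standard `html.parser` module, is given here for illustration. The class and function names are hypothetical, and the project's own converter was written in Java, so this shows the idea rather than the actual implementation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Match report</h1><p>Arsenal won <b>2-0</b>.</p></body></html>"
)
plain = html_to_text(page)
```

The resulting string contains only the heading and paragraph text, which is the form the classifier expects.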

2 Webpage Filter module

2.1 Introduction

This is the main module that decides the quality of the search results. Using previously collected data and a learned model, this module decides whether a webpage belongs to the topic of the search engine (the topic we chose for demonstration is Football). The classification groups web pages by topic, using a Support Vector Machine algorithm as the classifier, with the words in the web pages as attributes.
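Using words as attributes means that each page is turned into a vector of word counts before it reaches the classifier. The sketch below shows one common way to do this (a bag-of-words model) in Python. The tokenizer, the vocabulary construction and the sparse dictionary format are assumptions for illustration, not a description of the project's code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Assign a feature index to every distinct word in the training set."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}


def to_features(text, vocabulary):
    """Sparse vector {feature_index: count}; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}


training_pages = ["Goal goal striker", "rain forecast"]
vocabulary = build_vocabulary(training_pages)
features = to_features("Goal by the striker, goal!", vocabulary)
```

Each page therefore becomes a point in a space with one dimension per vocabulary word, which is the representation a linear SVM separates.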

2.2 Support Vector Machine Introduction

A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.[6]
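For a linear classifier the hyperplane is the set of points x with w·x + b = 0, a new example is assigned to a class by the sign of w·x + b, and the margin is the smallest distance from a training point to the hyperplane. The Python sketch below illustrates these three quantities for a fixed, hand-chosen hyperplane. It does not train an SVM, and the weights and points are made-up values.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w, b, points):
    """Smallest distance from the labelled points to the hyperplane.

    `points` is a list of (x, y) with y in {+1, -1}; the value is negative
    if some point is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in points)


w, b = [1.0, 1.0], -3.0
points = [([3.0, 3.0], 1), ([0.0, 1.0], -1)]
m = margin(w, b, points)
```

Training an SVM amounts to searching for the w and b that make this margin as large as possible on the training data.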

Figure 3: Data classified with a hyperplane by SVM

2.3 The LIBLINEAR Library Introduction

LIBLINEAR is a library for classification, built by the Machine Learning Group at National Taiwan University.

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports:

• L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression

• L1-regularized classifiers: L2-loss linear SVM and logistic regression
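The names in the list above refer to the two parts of the objective that a linear classifier minimizes: a regularization term on the weight vector w and a loss summed over the training examples. For the L2-regularized case this is usually written as (1/2) w·w + C Σ loss(y_i w·x_i), where the L1 loss is max(0, 1 - m), the L2 loss is its square, and the logistic loss is log(1 + e^(-m)). The Python sketch below only evaluates this objective for a given w. It does not call LIBLINEAR, and the function names and example data are assumptions for illustration.

```python
import math


def hinge(m):
    """Hinge value max(0, 1 - m) for a margin m = y * (w . x)."""
    return max(0.0, 1.0 - m)


LOSSES = {
    "l1": lambda m: hinge(m),                            # L1-loss SVM
    "l2": lambda m: hinge(m) ** 2,                       # L2-loss SVM
    "logistic": lambda m: math.log(1.0 + math.exp(-m)),  # logistic regression
}


def l2_regularized_objective(w, data, C, loss):
    """Evaluate 0.5 * w.w + C * sum(loss(y * w.x)) over (x, y) pairs."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total = sum(
        LOSSES[loss](y * sum(wi * xi for wi, xi in zip(w, x)))
        for x, y in data
    )
    return regularizer + C * total


w = [1.0, 0.0]
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1)]
obj_l1 = l2_regularized_objective(w, data, C=1.0, loss="l1")
obj_l2 = l2_regularized_objective(w, data, C=1.0, loss="l2")
```

The parameter C sets the trade-off: a larger C penalizes training errors more heavily relative to the size of w.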

Main features of LIBLINEAR:

• Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer

• Cross validation for model selection

• Probability estimates (logistic regression only)

• Weights for unbalanced data

• MATLAB/Octave, Java, Python, Ruby interfaces[7]

The SVM built into LIBLINEAR works as follows:
