Information Retrieval (IR)

GATE, ANNIE, and the Lucene library

Course Project

Presenter: Bui Dac Thinh

For: IR Students TH2010

Search Engine Architecture

[Figure: search engine architecture diagram, divided into an Online Part and an Offline Part]

• Components: User Interface, Caching, Indexing and Ranking, Index Builder, Web Page Parser, Crawler, Web Graph Builder, Link Analysis

• Data: Inverted index, Cached Pages, Page & Site Statistics, Page Ranks, Web Graph, Pages, Anchors, Links, Link Map

7/2/14

NLP: Survey of Tools & Resources

• General frameworks

• UIMA

• GATE

• NLP components, pipelines, and tools

• Stanford Named Entity Recognizer (NER)

• Stanford CoreNLP (CoreNLP)

Apache OpenNLP

• OpenNLP tools

• Sentence detector

• POS tagger

• Tokenizer

• Shallow and full syntactic parser

ANNIE: A Nearly-New Information Extraction System

• Open-source software

• Community of text engineering

• Defined and repeatable process

• "The Eclipse of NLP"

• "The Lucene of Information Extraction"

GATE: Live Demo

• A product of the Apache Jakarta project

• Popular: used by Xerox, Apple, Wikipedia, IBM, CNN, Nutch…

• Open source, written in Java

• Widely regarded as an efficient framework for IR

• Index

• Search

• Lucene4c / CLucene

• NLucene / Lucene.NET

• PyLucene

• Ferret / RubyLucene

• Zend Framework

What uses Lucene

Lucene Sketch in 5 Minutes

http://www.ibm.com/developerworks/library/os-apache-lucenesearch/

Analysis with Lucene

Analyzer: operations done on the text data

• WhitespaceAnalyzer: splits tokens at whitespace

• SimpleAnalyzer: divides text at non-letter characters; puts text in lowercase

• StopAnalyzer: removes stop words; puts text in lowercase

• StandardAnalyzer: tokenizes text based on a sophisticated grammar that recognizes e-mail addresses; acronyms; Chinese, Japanese, and Korean characters; alphanumerics; and more. Removes stop words; puts text in lowercase
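To make the differences between the first three analyzers concrete, here is a minimal sketch in plain Java. It only imitates the behaviour described above and does not use Lucene's own classes; the class name AnalyzerSketch, its method names, and the three-word stop list are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified imitations of three analyzers; these are NOT Lucene's classes. */
class AnalyzerSketch {
    // Hypothetical, tiny stop-word list chosen for this example only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** Whitespace style: split at whitespace, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Simple style: split at non-letter characters, then lowercase. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Stop style: the simple tokens with the stop words removed. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick brown-fox jumps!";
        System.out.println(whitespace(text)); // [The, Quick, brown-fox, jumps!]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumps]
        System.out.println(stop(text));       // [quick, brown, fox, jumps]
    }
}
```

StandardAnalyzer is left out on purpose: its grammar-based tokenization is too involved to imitate faithfully in a few lines.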

Core indexing classes

Search with Lucene

Search with Lucene: Query

Query = Term(s) + Operator(s)

• Term

• Field

• Term modifiers: ?, *, ~ …

• Boolean: AND, +, NOT, -

• Grouping: ( )

• Range: [], {}

• Boost: ^

• …

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/Query.html

Search with Lucene: Term

Key: word/phrase
Value: id

Sentences:

(1) “John likes to watch movies Mary likes movies too”

(2) “John also likes to watch football games”

Term-count vectors:

(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]

(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
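The two vectors above can be reproduced by mapping each word to an id and counting occurrences per id. The sketch below assumes the vocabulary order John, likes, to, watch, movies, also, football, games, Mary, too, which is the order that yields the vectors shown; the slide does not state it. The class name BagOfWords is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Counts how often each vocabulary word occurs in a sentence. */
class BagOfWords {
    // Assumed vocabulary order; a word's position in the list is its id.
    static final List<String> VOCABULARY = List.of(
            "John", "likes", "to", "watch", "movies",
            "also", "football", "games", "Mary", "too");

    /** Returns one count per vocabulary id; unknown words are ignored. */
    static int[] vectorize(String sentence) {
        int[] counts = new int[VOCABULARY.size()];
        for (String word : sentence.trim().split("\\s+")) {
            int id = VOCABULARY.indexOf(word);
            if (id >= 0) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String first = "John likes to watch movies Mary likes movies too";
        String second = "John also likes to watch football games";
        System.out.println(Arrays.toString(vectorize(first)));
        // [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
        System.out.println(Arrays.toString(vectorize(second)));
        // [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
    }
}
```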

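Search with Lucene: Field

• Restrict a term to a field by writing field:term

title:"The Right Way" AND text:go

title:"Do it right" AND right

title:Do it right

Note: in the last query only "Do" is searched in the title field; "it" and "right" are searched in the default field.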

Search with Lucene: IndexSearcher

• The primary search class

• Searches the indices stored in a given directory

• Calculates a score for each of the documents that match a given query
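The idea behind searching an index and scoring the matches can be sketched without Lucene as a small in-memory inverted index. This is only an illustration: the class TinyIndex and its methods are hypothetical, and the score used here is the raw term frequency, a stand-in for Lucene's real scoring formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** A toy inverted index: term -> (document id -> term frequency). */
class TinyIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Indexes one document and returns its id. */
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Lowercase, then split at anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Returns document id -> score for a single term (score = term frequency). */
    Map<Integer, Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    /** Returns the stored text of a document. */
    String get(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("The quick brown fox jumps over the lazy dog");
        index.add("A fox is a fox");
        index.add("Lazy dogs sleep");
        // Print every document containing "fox" together with its score.
        index.search("fox").forEach((docId, score) ->
                System.out.println(score + "  " + index.get(docId)));
    }
}
```

A real IndexSearcher additionally reads the index from a directory on disk and ranks results by score; this sketch keeps everything in memory and leaves the hits in document-id order.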

Search with Lucene: Search Result

• Index

// state the file location of the index
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);

// create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer =
    new Lucene.Net.Analysis.Standard.StandardAnalyzer();

// create the index writer with the directory and analyzer defined
// (the final argument is true to create a new index)
Lucene.Net.Index.IndexWriter indexWriter =
    new Lucene.Net.Index.IndexWriter(dir, analyzer, true);

// create a document and add a single field to it
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
    "content",
    "The quick brown fox jumps over the lazy dog",
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldContent);

// write the document to the index, then optimize and close the writer
// so that a searcher opened afterwards can see the document
indexWriter.AddDocument(doc);
indexWriter.Optimize();
indexWriter.Close();

// create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher =
    new Lucene.Net.Search.IndexSearcher(dir);

• Search

// build a query object
Lucene.Net.Index.Term searchTerm =
    new Lucene.Net.Index.Term("content", "fox");
Lucene.Net.Search.Query query =
    new Lucene.Net.Search.TermQuery(searchTerm);

// execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);

// iterate over the results and print the stored "content" field
for (int i = 0; i < hits.Length(); i++) {
    Lucene.Net.Documents.Document hitDoc = hits.Doc(i);
    string contentValue = hitDoc.Get("content");
    Console.WriteLine(contentValue);
}

// Java (Lucene core): fetch the top 20 hits for the query
TopDocs topDocs = indexSearcher.search(query, 20);
System.out.println("Total hits " + topDocs.totalHits);

// Get an array of references to matched documents
ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocsArray) {
    // Retrieve the matched document and show relevant details
    Document doc = indexSearcher.doc(scoreDoc.doc);
}

Course Project

• Web search GUI

Title	Information Retrieval: Search Engine on Crawler Text Datasets and Open-Source System
Author	Bui Dac Thinh
School	Vietnam National University, Hanoi
Field of study	Information Retrieval
Type	Course Project
Year of publication	2010
City	Hanoi

Format
Number of pages	33
File size	652.25 KB