Slide 1: GATE Annie Lib Lucene
Course Project
Presenter: Bui Dac Thinh
For: IR Students TH2010
Slide 4: Search Engine Architecture
User Interface
Caching
Indexing and Ranking
Index Builder
Web Page Parser
Crawler
Web Graph Builder
Link Analysis
Inverted index
Cached Pages
Page & Site Statistics
Page Ranks
Web Graph
Pages
Anchors
Links
Link Map
Online Part
Offline Part
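The "Inverted index" store named in the architecture above maps each term to the documents that contain it. The following is a minimal sketch in plain Java, not Lucene's implementation; the class name `InvertedIndexSketch`, its methods, and the sample documents are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene or of the original slides.
public class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Adds one document: lowercases the text and splits it at non-letter characters. */
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the term, in ascending order. */
    public List<Integer> lookup(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox jumps over the lazy dog");
        index.add(2, "John also likes to watch football games");
        index.add(3, "The lazy dog likes football");
        System.out.println(index.lookup("football")); // [2, 3]
        System.out.println(index.lookup("fox"));      // [1]
    }
}
```

A query for a term becomes a single map lookup instead of a scan over every page, which is why the index is built in the offline part and only read in the online part.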
7/2/14
Slide 6: NLP Survey of Tools & Resources
General frameworks
UIMA
GATE
NLP components, pipelines, and tools
Stanford Named Entity Recognizer (NER)
Stanford CoreNLP (CoreNLP)
Slide 7: Apache OpenNLP
OpenNLP tools
Sentence detector
Tokenizer
POS tagger
Shallow and full syntactic parser
Slide 8: ANNIE, a Nearly-New Information Extraction System
Slide 10
Open source software
Community of Text engineering
Defined and repeatable process
The Eclipse of NLP
The Lucene of Information Extraction
Slide 12: GATE
LIVE DEMO
Slide 13
• An Apache Jakarta product
• Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…
• Open source, written in Java
• The most efficient framework for IR
• Index
• Search
Lucene4c / CLucene
NLucene / Lucene.NET
PyLucene
Ferret / RubyLucene
Zend Framework
Slide 14: What uses Lucene
Slide 15: Lucene Sketch
http://www.ibm.com/developerworks/library/os-apache-lucenesearch/
In 5 minutes
Slide 16: Analysis with Lucene
Slide 17: Analysis with Lucene
Analyzer: Operations done on the text data
WhitespaceAnalyzer: Splits tokens at whitespace
SimpleAnalyzer: Divides text at non-letter characters; puts text in lowercase
StopAnalyzer: Removes stop words; puts text in lowercase
StandardAnalyzer: Tokenizes text based on a sophisticated grammar that recognizes e-mail addresses; acronyms; Chinese, Japanese, and Korean characters; alphanumerics; and more. Removes stop words; puts text in lowercase
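To make the differences between the first three analyzers concrete, here is a rough plain-Java imitation of the behavior described in the table. It does not call Lucene; the class `AnalyzerSketch`, its method names, and the tiny stop-word list are assumptions for illustration only, and real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical imitation of analyzer behavior; not Lucene code.
public class AnalyzerSketch {
    // Assumed stop-word list, far smaller than a real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** Like WhitespaceAnalyzer: split at whitespace, keep case and punctuation. */
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split at non-letter characters, then lowercase. */
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Like StopAnalyzer: the simple pipeline plus stop-word removal. */
    public static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick-brown Fox jumps over 2 lazy dogs";
        System.out.println(whitespace(text)); // [The, quick-brown, Fox, jumps, over, 2, lazy, dogs]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumps, over, lazy, dogs]
        System.out.println(stop(text));       // [quick, brown, fox, jumps, over, lazy, dogs]
    }
}
```

The same analyzer should be used at index time and at query time; otherwise a query token such as "Fox" may never match the indexed token "fox".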
Slide 18: Core indexing classes
Slide 19: Search with Lucene
Slide 20: Search with Lucene: Query
Query = Term(s) + Operator(s)
Term
Field
Term modifier: ?, *, + …
Boolean: AND, +, NOT, -
Grouping: ( )
Range: [], {}
Boost: ^
…
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/Query.html
Slide 21: Search with Lucene: Term
Key: Word/phrase
Value: id
Sentences
“John likes to watch movies. Mary likes movies too.”
“John also likes to watch football games.”
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
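The two vectors above can be reproduced by giving every word an id and counting how often each id occurs in a sentence. The sketch below assumes the word order John, likes, to, watch, movies, also, football, games, Mary, too, which is the order that yields the vectors on the slide; the class `TermVectorSketch` and its method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the word -> id mapping and the resulting count vectors.
public class TermVectorSketch {
    // A word's id is its position in this list. The order is an assumption
    // chosen so that the output matches the vectors shown on the slide.
    static final List<String> VOCABULARY = Arrays.asList(
            "John", "likes", "to", "watch", "movies",
            "also", "football", "games", "Mary", "too");

    /** Counts how many times each vocabulary word occurs in the sentence. */
    public static int[] toVector(String sentence) {
        int[] counts = new int[VOCABULARY.size()];
        for (String token : sentence.split("[^A-Za-z]+")) {
            int id = VOCABULARY.indexOf(token);
            if (id >= 0) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                toVector("John likes to watch movies Mary likes movies too")));
        // [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
        System.out.println(Arrays.toString(
                toVector("John also likes to watch football games")));
        // [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
    }
}
```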
Slide 22: Search with Lucene: Field
Specify fielded data
title:"The Right Way" AND text:go
title:"Do it right" AND right
title:Do it right
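The last query above is easy to misread: a field prefix applies only to the term or quoted phrase that directly follows it, so "it" and "right" are searched in the default field. The toy parser below illustrates that rule in plain Java. It is not Lucene's QueryParser; the class `FieldedQuerySketch`, the default field name "text", and the decision to simply skip AND/OR/NOT are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy parser; it only demonstrates how far a field prefix reaches.
public class FieldedQuerySketch {
    // An optional "field:" prefix followed by a quoted phrase or a bare word.
    private static final Pattern CLAUSE =
            Pattern.compile("(?:(\\w+):)?(?:\"([^\"]*)\"|(\\S+))");

    /** Returns "field=term" entries; clauses without a prefix use defaultField. */
    public static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String field = m.group(1) != null ? m.group(1) : defaultField;
            String term = m.group(2) != null ? m.group(2) : m.group(3);
            // Boolean operators are skipped in this sketch rather than evaluated.
            if (term.equals("AND") || term.equals("OR") || term.equals("NOT")) {
                continue;
            }
            clauses.add(field + "=" + term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("title:\"Do it right\" AND right", "text"));
        // [title=Do it right, text=right]
        System.out.println(parse("title:Do it right", "text"));
        // [title=Do, text=it, text=right]
    }
}
```

Quoting the phrase is what keeps all three words inside the title field.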
Slide 24: Search with Lucene: IndexSearcher
Primary class for searching
Searches indices stored in a given directory
Calculates a score for each of the documents that match a given query
Slide 25: Search with Lucene: Search Result
Slide 26: Index
//state the file location of the index
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);
//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
//create the index writer with the directory and analyzer defined
//(true to create a new index)
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, true);
Slide 27
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
    "content",
    "The quick brown fox jumps over the lazy dog",
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldContent);
//write the document to the index, then close the writer so the index is saved
indexWriter.AddDocument(doc);
indexWriter.Close();
Slide 28
//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
Slide 29: Search
//build a query object
Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content", "fox");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);
//iterate over the results
for (int i = 0; i < hits.Length(); i++) {
    //named hitDoc so it does not clash with the doc variable used when indexing
    Lucene.Net.Documents.Document hitDoc = hits.Doc(i);
    string contentValue = hitDoc.Get("content");
    Console.WriteLine(contentValue);
}
Slide 30
TopDocs topDocs = indexSearcher.search(query, 20);
System.out.println("Total hits " + topDocs.totalHits);
// Get an array of references to matched documents
ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
for (ScoreDoc scoredoc : scoreDocsArray) {
    // Retrieve the matched document and show relevant details
    Document doc = indexSearcher.doc(scoredoc.doc);
}
Slide 32: Course Project
Web search GUI