A guide to installing and using Apache Nutch to crawl data from websites, and to integrating it with Apache Solr for indexing.
Apache Nutch
Professor
Kim Kyoung-Yun
I What is Apache Nutch?
II How to install Apache Nutch on Windows 10?
III How to crawl the web?
I What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, but also suited to data mining
• Customizable/extensible plugin architecture
II How to install Apache Nutch on Windows 10

1 Requirements:
• Cygwin
• Java JDK 11
• Apache Ant (1.10.12)
Installing Cygwin
• Download the Cygwin installer and run setup.exe:
https://www.cygwin.com/install.html
Installing Java Runtime/Development Environment
• Download Java SE Development Kit 11 for Windows and run the .exe file:
https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html
• Set up the environment variables JAVA_HOME and PATH
• Check the installed Java version
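In a Cygwin terminal, the two variables can be set like this (the JDK path below is an assumed install location; adjust it to wherever the installer put JDK 11 on your machine):

```shell
# Assumed JDK 11 install location -- change to match your system.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk-11"
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
```

Afterwards, `java -version` should print the JDK 11 version string.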
Installing Apache Ant
• Download and install Apache Ant (1.10.12)
https://ant.apache.org/
• Set the environment variables ANT_HOME and PATH
• Check with ant -version
A successful installation prints the Ant version string
2 Installing Nutch
• Download a binary package (apache-nutch-1.X-bin.zip) (1.19)
https://archive.apache.org/dist/nutch/
• Unzip the Nutch package; there should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X (referred to as {nutch_home}) into cygwin64/home
• Verify the Nutch installation:
+ Open a Cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch
III How to crawl the web?

1 Crawler Workflow
• initialize crawldb, inject seed URLs
• generate fetch list: select URLs from crawldb for fetching
• fetch URLs from fetch list
• parse documents: extract content, metadata and links
• update crawldb: status, score and signature; add newly discovered URLs, either inline or at the end of one crawler run
• invert links: map anchor texts to documents the links point to
• build the web graph, calculate link rank, and update the crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts
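The workflow above maps onto individual commands of the Nutch 1.x CLI. A sketch of one crawl iteration, assuming the crawl data lives under crawl/ and the seed list under urls/ (run from {nutch_home}; these steps are shown in detail on the later slides):

```shell
bin/nutch inject crawl/crawldb urls                      # initialize crawldb, inject seed URLs
bin/nutch generate crawl/crawldb crawl/segments          # generate a fetch list
s1=`ls -d crawl/segments/2* | tail -1`                   # pick the newest segment
bin/nutch fetch $s1                                      # fetch URLs from the fetch list
bin/nutch parse $s1                                      # parse documents
bin/nutch updatedb crawl/crawldb $s1                     # update crawldb with new URLs
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # invert links
bin/nutch dedup crawl/crawldb                            # deduplicate by signature
```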
2 Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr
https://archive.apache.org/dist/lucene/solr/8.11.2
• Check the Solr installation (run from {APACHE_SOLR_HOME}):
+ Start: bin\solr.cmd start
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop
+ Open the admin UI: http://localhost:8983/solr/#/
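With Solr started on its default port, the installation can also be checked from the command line via Solr's standard system-info endpoint (a sketch, assuming a local default install):

```shell
curl "http://localhost:8983/solr/admin/info/system?wt=json"
```

A running instance answers with a JSON document describing the Solr version and JVM.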
3 Crawl a site
• Customize your crawl properties in conf/nutch-site.xml of {nutch_home}
• Configure regular expression filters in conf/regex-urlfilter.txt
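A minimal sketch of conf/nutch-site.xml: Nutch refuses to crawl unless http.agent.name is set to a non-empty value (the agent name below is a hypothetical example):

```xml
<!-- conf/nutch-site.xml: http.agent.name must be non-empty before crawling -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

In conf/regex-urlfilter.txt, a line such as `+^https?://www\.youtube\.com/` restricts the crawl to one site; the default final rule `+.` accepts everything not rejected earlier.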
Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add the site that will be crawled
For example: https://www.youtube.com/
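From {nutch_home} in the Cygwin terminal, the seed list can be created like this (the YouTube URL is just the example site from above):

```shell
# Create the seed directory and seed file under {nutch_home}.
mkdir -p urls
echo "https://www.youtube.com/" > urls/seed.txt
cat urls/seed.txt
```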
+ Open a Cygwin terminal
+ One-step crawl: bin/crawl -i -s urls crawl 2
Or run the steps manually. Seed the crawldb with a list of URLs:
+ bin/nutch inject crawl/crawldb urls
After fetching and parsing a first segment $s1, update the crawldb:
+ bin/nutch updatedb crawl/crawldb $s1
Then generate, fetch and parse a second segment:
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
Read the databases
• 1- bin/nutch readdb crawl/crawldb -stats > stats.txt
• 2- bin/nutch readdb crawl/crawldb/ -dump db
• 3- bin/nutch readlinkdb crawl/linkdb/ -dump link
• 4- bin/nutch readseg -dump crawl/segments/20131216194731 crawl/segments/20131216194731_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
Installing Hadoop (winutils for Windows)
+ Download and extract a Hadoop package: https://archive.apache.org/dist/hadoop/common/
+ Set the environment variable %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it into the %HADOOP_HOME%\bin folder
4 Indexing in Solr
- Integrate Nutch with Solr
+ Go to the solr folder (solr-8.11.2)
Trang 21${APACHE_SOLR_HOME}/server/solr/configse ts/nutch/conf/
+ Edit conf/index-writers.xml from
{nutch_home} with the name of new core
${APACHE_SOLR_HOME}/bin/soalr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
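A sketch of the relevant part of {nutch_home}/conf/index-writers.xml after pointing the Solr writer at the new core (the URL below assumes Solr on localhost with a core named nutch; other parameters in the file can keep their defaults):

```xml
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="solr"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <!-- remaining parameters keep their default values -->
  </parameters>
</writer>
```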
+ Index into Apache Solr (change the segments folder to your own):
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segment_folder}/ -filter -normalize