How to install Apache Nutch on Windows 10 (slides in English)
Page 1: Apache Nutch
Team 4
Nguyen Thi Phuong Hang
Park Minsoo
Ali Usman
Professor Kim Kyoung-Yun
Page 2: Content
I. What is Apache Nutch?
II. How to install Apache Nutch on Windows 10?
III. How to crawl the web?
Page 3: I. What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index; also used for data mining
• Customizable, extensible plugin architecture
Page 4: II. How to install Apache Nutch on Windows 10
1. Requirements:
+ Windows-Cygwin environment
+ Java Runtime/Development Environment (JDK 11 / Java 11)
+ (Source build only) Apache Ant: https://ant.apache.org/
Page 5: Installing Cygwin
• Download the Cygwin installer and run setup.exe:
https://www.cygwin.com/install.html
Page 6: Installing Java Runtime/Development Environment
• Download the Java SE Development Kit 11 for Windows and run the .exe installer:
https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html
• Set up the environment variables JAVA_HOME and PATH
• Check the installed Java version
Page 7: Installing Apache Ant
• Download and install Apache Ant (1.10.12):
https://ant.apache.org/
• Set the environment variables ANT_HOME and PATH
• Check with ant -version
Successful installation
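The Java and Ant checks from the last two slides can be run together from the Cygwin terminal. This is a minimal sketch: `check_tool` is a hypothetical helper name, and it assumes JAVA_HOME, ANT_HOME and PATH were already set in the Windows system settings, which Cygwin inherits.

```shell
#!/usr/bin/env bash
# check_tool is a hypothetical helper, not part of Java, Ant, or Nutch.
check_tool() {
  # $1 = command name; the remaining arguments are its version flag.
  if command -v "$1" >/dev/null 2>&1; then
    # java and ant may print version info on stderr, so merge it into stdout.
    "$@" 2>&1 | head -n 1
  else
    echo "$1: not found on PATH"
  fi
}

# Show what Cygwin inherited from the Windows environment.
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
echo "ANT_HOME=${ANT_HOME:-<unset>}"

check_tool java -version
check_tool ant -version
```

If either tool is reported as not found, the PATH entry for its bin folder is the first thing to check.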
Page 8: 2. Installing Nutch
• Download a binary package (apache-nutch-1.X-bin.zip; 1.19 here):
https://archive.apache.org/dist/nutch/
• Unzip the Nutch package. There should be a folder apache-nutch-1.X.
• Move the folder apache-nutch-1.X ({nutch_home}) into cygwin64/home
• Verify the Nutch installation:
+ Open the cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch
Page 9: III. How to crawl the web
1. Crawler Workflow
• initialize crawldb, inject seed URLs
• generate fetch list: select URLs from crawldb for fetching
• fetch URLs from fetch list
• parse documents: extract content, metadata and links
• update crawldb status, score and signature; add new URLs inline or at the end of one crawler run
• invert links: map anchor texts to the documents the links point to
• calculate link rank on the web graph and update crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts
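For orientation, the workflow steps map roughly onto bin/nutch subcommands. The sketch below only prints that mapping and runs nothing. The crawl/... paths follow the ones used later in these slides; the dedup and index lines are assumptions based on the Nutch 1.x tutorial rather than on these slides, `<segment>` is a placeholder for a generated segment directory, and the link rank / web graph step is left out.

```shell
#!/usr/bin/env bash
# workflow_commands is a hypothetical helper that prints "step  command" pairs.
# It does not execute Nutch.
workflow_commands() {
  printf '%-12s %s\n' inject      'bin/nutch inject crawl/crawldb urls'
  printf '%-12s %s\n' generate    'bin/nutch generate crawl/crawldb crawl/segments'
  printf '%-12s %s\n' fetch       'bin/nutch fetch <segment>'
  printf '%-12s %s\n' parse       'bin/nutch parse <segment>'
  printf '%-12s %s\n' updatedb    'bin/nutch updatedb crawl/crawldb <segment>'
  printf '%-12s %s\n' invertlinks 'bin/nutch invertlinks crawl/linkdb -dir crawl/segments'
  # The two lines below are assumptions taken from the Nutch 1.x tutorial.
  printf '%-12s %s\n' dedup       'bin/nutch dedup crawl/crawldb'
  printf '%-12s %s\n' index       'bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments'
}

workflow_commands
```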
Page 10: 1. Crawler Workflow
Page 11: III. How to crawl the web?
2. Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr
https://archive.apache.org/dist/lucene/solr/8.11.2
Page 12: 2. Installing Apache Solr
• Check the Solr installation
+ Start: bin\solr.cmd start
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop -all
+ Open the admin page in a browser: http://localhost:8983/solr/#/
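As an alternative to opening the browser, reachability can also be checked from the Cygwin terminal. This is a sketch that assumes curl is installed in Cygwin; `solr_is_up` is a hypothetical helper name, and it queries Solr's system info endpoint on the default address from the slide.

```shell
#!/usr/bin/env bash
# solr_is_up is a hypothetical helper: it succeeds when Solr answers over HTTP.
solr_is_up() {
  # $1 = base URL of the Solr instance; defaults to the address from the slide.
  # -f makes curl fail on HTTP errors, --max-time avoids hanging on a dead port.
  curl -s -f -o /dev/null --max-time 5 \
    "${1:-http://localhost:8983/solr}/admin/info/system"
}

if solr_is_up; then
  echo "Solr is reachable"
else
  echo "Solr is not reachable (not started, wrong port, or curl missing)"
fi
```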
Page 13: 3. Crawl a site
Customize your crawl properties in the conf folder of {nutch_home} and configure the regular expression filters
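A minimal sketch of the two settings usually touched here, assuming the standard Nutch 1.x file names conf/nutch-site.xml and conf/regex-urlfilter.txt (the slide does not name them). The helper names and the crawler name "MyTeamCrawler" are hypothetical. The script only prints to the terminal, so nothing in conf is overwritten.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that print config text; redirect the output yourself.

print_nutch_site() {
  # $1 = crawler name for the http.agent.name property (hypothetical value).
  cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$1</value>
  </property>
</configuration>
EOF
}

print_url_filter() {
  # $1 = domain to allow; dots are escaped so they match literally in the regex.
  escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
  printf '+^https?://([a-z0-9-]+\\.)*%s/\n' "$escaped"
}

print_nutch_site "MyTeamCrawler"
print_url_filter "youtube.com"
```

To apply the first part, run `print_nutch_site "MyTeamCrawler" > conf/nutch-site.xml` from {nutch_home}. The filter line is meant to go into conf/regex-urlfilter.txt in place of the final catch-all `+.` rule, so that only the chosen domain is crawled.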
Page 14: 3. Crawl a site
Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt in the urls folder and add the site to crawl
For example: https://www.youtube.com/
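The same steps can be done from the Cygwin terminal. This sketch assumes it is run from {nutch_home} and uses the example URL above; the seed file holds one URL per line.

```shell
#!/usr/bin/env bash
# Create the seed folder and write one URL per line into urls/seed.txt.
mkdir -p urls
printf '%s\n' 'https://www.youtube.com/' > urls/seed.txt

# Show the result.
cat urls/seed.txt
```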
Page 15: 3. Crawl a site
+ Open the Cygwin terminal
Seeding the crawldb with a list of URLs
+ bin/nutch inject crawl/crawldb urls
Fetching
+ bin/nutch generate crawl/crawldb crawl/segments
+ s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
+ bin/nutch fetch $s1
+ bin/nutch parse $s1
+ bin/nutch updatedb crawl/crawldb $s1
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
+ bin/nutch fetch $s2
+ bin/nutch parse $s2
+ bin/nutch updatedb crawl/crawldb $s2
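Because every round repeats the same generate, fetch, parse, updatedb sequence, it can be wrapped in a function. This is a sketch, not a Nutch feature: `crawl_round`, `latest_segment` and the `NUTCH` variable are hypothetical names, and it assumes the script is sourced from {nutch_home} after the inject step.

```shell
#!/usr/bin/env bash
# NUTCH points at the Nutch launcher; override it to test without Nutch.
NUTCH=${NUTCH:-bin/nutch}

latest_segment() {
  # Segment folders are named by timestamp, so the last one listed is the newest.
  ls -d crawl/segments/2* 2>/dev/null | tail -n 1
}

crawl_round() {
  # Extra arguments (for example -topN 1000) are passed on to generate.
  "$NUTCH" generate crawl/crawldb crawl/segments "$@" || return 1
  segment=$(latest_segment)
  if [ -z "$segment" ]; then
    echo "no segment was generated" >&2
    return 1
  fi
  echo "segment: $segment"
  "$NUTCH" fetch "$segment" &&
    "$NUTCH" parse "$segment" &&
    "$NUTCH" updatedb crawl/crawldb "$segment"
}
```

With these definitions, the two rounds above become `crawl_round` followed by `crawl_round -topN 1000`.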
Page 16: 3. Crawl a site
Invertlinks
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
Dump to file and take a look:
bin/nutch readlinkdb crawl/linkdb -dump out2
Page 17: 3. Crawl a site: some errors and solutions
+ Set the environment variables %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it to a folder:
https://github.com/steveloughran/winutils
+ Replace the bin folder of the Hadoop folder with the bin folder from the Hadoop redistribution, which contains winutils.exe
+ Copy hadoop.dll from the bin folder of the Hadoop redistribution into C:/Windows/System32
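Whether the winutils setup is in place can be checked from the Cygwin terminal. This is a sketch: `has_winutils` is a hypothetical helper, and it assumes HADOOP_HOME was set in the Windows system settings, so Cygwin sees it as a Windows-style path that is converted with cygpath before testing for the file.

```shell
#!/usr/bin/env bash
# has_winutils is a hypothetical helper: true when bin/winutils.exe exists.
has_winutils() {
  # $1 = Hadoop home directory as a path the shell can read.
  [ -f "$1/bin/winutils.exe" ]
}

hadoop_dir=${HADOOP_HOME:-}
# cygpath only exists inside Cygwin; convert C:\... to /cygdrive/c/... when available.
if [ -n "$hadoop_dir" ] && command -v cygpath >/dev/null 2>&1; then
  hadoop_dir=$(cygpath -u "$hadoop_dir")
fi

if [ -n "$hadoop_dir" ] && has_winutils "$hadoop_dir"; then
  echo "winutils.exe found in $hadoop_dir/bin"
else
  echo "winutils.exe not found; check HADOOP_HOME and its bin folder"
fi
```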
Page 18: 3. Crawl a site
Using apache-nutch-1.18 produces the above error. Solution:
+ Upgrade Apache Nutch to version 1.19