How to install Apache Nutch on Windows 10


Page 1

Apache Nutch

Team 4

Nguyen Thi Phuong Hang

Park Minsoo

Ali Usman

Professor

Kim Kyoung-Yun

Page 2

I What is Apache Nutch?

II How to install Apache Nutch on Windows 10?

III How to crawl a website?

Page 3

I What is Apache Nutch?

• Apache Nutch is a highly extensible and scalable open-source web crawler software project

• Runs on top of Hadoop

• Mostly used to feed a search index; also used for data mining

• Customizable, extensible plugin architecture

Page 4

II How to install Apache Nutch on Windows 10

1 Requirements:

+ Windows-Cygwin environment

+ Java Runtime/Development Environment (JDK 11 / Java 11)

+ (Source build only) Apache Ant: https://ant.apache.org/

Page 5

Installing Cygwin

• Download the Cygwin installer and run setup.exe:

https://www.cygwin.com/install.html

Page 6

Installing Java Runtime/Development Environment

• Download Java SE Development Kit 11 for Windows and run the .exe installer:

https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html

• Set the environment variables JAVA_HOME and PATH

• Check the installed Java version with java -version
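
The two steps above can be sketched from a Cygwin shell as follows. This is only an illustration: the JDK folder below is an assumed install location, not one given in the slides, and the variables can equally be set in the Windows system settings. The sketch sets them for the current Cygwin session only.

```shell
# Assumed JDK location, written as a Cygwin path; adjust to your install folder.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk-11"
# Put the JDK's bin folder first so its java is the one that gets picked up.
export PATH="$JAVA_HOME/bin:$PATH"

# Report the version only if a java executable is actually reachable.
if command -v java >/dev/null 2>&1; then
    java -version
else
    echo "java not found on PATH; check JAVA_HOME" >&2
fi
```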

Page 7

Installing Apache Ant

• Download and install Apache Ant (1.10.12):

https://ant.apache.org/

• Set the environment variables ANT_HOME and PATH

• Check with ant -version: a printed version number indicates a successful installation

Page 8

2 Installing Nutch

• Download a binary package (apache-nutch-1.X-bin.zip; version 1.19 is used here):

https://archive.apache.org/dist/nutch/

• Unzip the Nutch package. There should be a folder apache-nutch-1.X

• Move the folder apache-nutch-1.X (referred to below as {nutch_home}) into cygwin64/home

• Verify the Nutch installation:

+ Open a cygwin64 terminal

+ Run bin/nutch from {nutch_home}: $ bin/nutch
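
A minimal sketch of the verification step, assuming version 1.19 was unzipped and moved into cygwin64/home (which appears as /home inside Cygwin). The variable NUTCH_HOME and the helper looks_like_nutch_home are illustrative names, not part of Nutch.

```shell
# Assumed location after moving the unzipped folder into cygwin64/home.
NUTCH_HOME="${NUTCH_HOME:-/home/apache-nutch-1.19}"

# Succeeds if the folder looks like an unpacked Nutch binary package.
looks_like_nutch_home() {
    [ -f "$1/bin/nutch" ] && [ -d "$1/conf" ]
}

if looks_like_nutch_home "$NUTCH_HOME"; then
    # With no arguments, bin/nutch prints its usage text and command list.
    (cd "$NUTCH_HOME" && bin/nutch)
else
    echo "No Nutch installation found at $NUTCH_HOME" >&2
fi
```

If the usage text appears, the launcher script and Java are both reachable from the Cygwin terminal.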

Page 9

III How to crawl a website

1 Crawler Workflow

• initialize crawldb, inject seed URLs

• generate fetch list: select URLs from crawldb for fetching

• fetch URLs from fetch list

• parse documents: extract content, metadata and links

• update crawldb: status, score and signature; add new URLs, inline or at the end of one crawler run

• invert links: map anchor texts to the documents the links point to

• calculate link rank on the web graph and update crawldb

• deduplicate documents by signature

• index document content, metadata and anchor texts

Page 10

1 Crawler Workflow

Page 11

III How to crawl a website

2 Installing Apache Solr (8.11.2)

• Download and unzip Apache Solr

https://archive.apache.org/dist/lucene/solr/8.11.2

Page 12

2 Installing Apache Solr

• Check the Solr installation

+ Start: bin\solr.cmd start

+ Status: bin\solr.cmd status

+ Stop: bin\solr.cmd stop

+ Open the admin UI in a browser: http://localhost:8983/solr/#/
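
As an alternative to opening the browser, the same check can be sketched from a shell with curl. This assumes curl is installed and that Solr runs on its default port 8983. SOLR_URL and solr_is_up are illustrative names, not Solr commands.

```shell
# Assumed default address; change it if Solr was started on another port.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Succeeds if the Solr admin UI answers with a non-error HTTP status.
solr_is_up() {
    # -s: no progress output, -f: fail on HTTP errors, -o: discard the body
    curl -sf -o /dev/null "$SOLR_URL"
}
```

After starting Solr, solr_is_up && echo "Solr is running" gives a quick yes/no answer.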

Page 13

3 Crawl a site

Customize your crawl properties in the conf folder of {nutch_home} (conf/nutch-site.xml) and configure the regular expression filters (conf/regex-urlfilter.txt)
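
To illustrate the filter step: a common way to keep a crawl inside one domain is to replace the final catch-all accept rule in conf/regex-urlfilter.txt with a rule for that domain. The sketch below only prints such a rule. The function name url_filter_rule and the domain example.com are hypothetical, and the pattern follows the style used in the Nutch tutorial.

```shell
# Print an "accept" rule for regex-urlfilter.txt that matches one domain
# and its subdomains. Dots in the domain are escaped for use in a regex.
url_filter_rule() {
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf '+^https?://([a-z0-9-]+\\.)*%s/\n' "$escaped"
}

# Example with a placeholder domain.
url_filter_rule example.com
```

The printed line can then be pasted into conf/regex-urlfilter.txt by hand, after checking that it matches the site you intend to crawl.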

Page 14

3 Crawl a site

Create a URL seed list

+ Create a urls folder

+ Create a file seed.txt under the urls folder and add the site to crawl

For example: https://www.youtube.com/
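
From a Cygwin terminal opened in {nutch_home}, these two steps might look like the sketch below. It uses the example URL from the slide, and the seed file holds one URL per line.

```shell
# Create the seed folder and a one-line seed list.
mkdir -p urls
printf '%s\n' 'https://www.youtube.com/' > urls/seed.txt

# Show what will be injected into the crawldb.
cat urls/seed.txt
```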

Page 15

3 Crawl a site

+ Open a Cygwin terminal in {nutch_home}

Seeding the crawldb with a list of URLs

+ bin/nutch inject crawl/crawldb urls

Fetching

+ bin/nutch generate crawl/crawldb crawl/segments

+ s1=`ls -d crawl/segments/2* | tail -1`

+ echo $s1

+ bin/nutch fetch $s1

+ bin/nutch parse $s1

+ bin/nutch updatedb crawl/crawldb $s1

Second round, limited to the 1000 top-scoring URLs:

+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000

+ s2=`ls -d crawl/segments/2* | tail -1`

+ echo $s2

+ bin/nutch fetch $s2

+ bin/nutch parse $s2

+ bin/nutch updatedb crawl/crawldb $s2
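
The two rounds above repeat the same four commands, so they can be wrapped in a loop. The following is a sketch under assumptions, not the presenters' script: NUTCH, CRAWL_DIR and crawl_rounds are illustrative names, and every round uses -topN 1000.

```shell
# Illustrative settings (not Nutch options): the launcher and the crawl folder.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Run the generate/fetch/parse/updatedb round the given number of times.
crawl_rounds() {
    local rounds="$1" i=1 segment
    while [ "$i" -le "$rounds" ]; do
        "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN 1000 || return 1
        # Segment folders are named by timestamp, so the last one listed is the newest.
        segment=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" parse "$segment" || return 1
        "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
        i=$((i + 1))
    done
}
```

After injecting the seed list, crawl_rounds 2 reproduces the two rounds shown above. The loop stops early if any step fails, for example when generate finds no more URLs to fetch.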

Page 16

3 Crawl a site

Invertlinks

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Dump to a file and take a look:

bin/nutch readlinkdb crawl/linkdb -dump out2

Page 17

3 Crawl a site: some errors and solutions

+ Set the environment variables %HADOOP_HOME% and PATH

+ Download the winutils.exe binary from a Hadoop redistribution and extract it to a folder:

https://github.com/steveloughran/winutils

+ Replace the bin folder of the Hadoop folder with the bin folder from the redistribution, which contains winutils.exe

+ Copy hadoop.dll from the bin folder of the redistribution into C:\Windows\System32

Page 18

3 Crawl a site

Using apache-nutch-1.18 produces the error above.

Solution:

+ Upgrade Apache Nutch to version 1.19

Title	How to install Apache Nutch on Windows 10
Authors	Nguyen Thi Phuong Hang, Park Minsoo, Ali Usman
Supervisor	Professor Kim Kyoung-Yun
Institution	University of the People
Major	Computer Science
Type	guide
Year of publication	2023
City	Hà Nội

Format
Pages	18
File size	774.96 KB