A guide to installing and using Apache Nutch to crawl data from websites, and to integrating it with Apache Solr for indexing.
Apache Nutch
Professor
Kim Kyoung-Yun
I What is Apache Nutch?
II How to install Apache Nutch on Windows 10?
III How to crawl the web?
I What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, but also suited to data mining
• Customizable/extensible plugin architecture
II How to install Apache Nutch on Windows 10

1 Requirements:
• Cygwin
• Java JDK 11
• Apache Ant (1.10.12)
Installing Cygwin
• Download the Cygwin installer and run setup.exe:
https://www.cygwin.com/install.html
Installing Java Runtime/Development Environment
• Download Java SE Development Kit 11 for Windows and run the .exe file:
https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html
• Set up the environment variables JAVA_HOME and PATH
• Check the installed Java version
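In a Cygwin terminal, the two variables can be set like this (the JDK path below is an assumed install location; adjust it to wherever the installer put JDK 11 on your machine):

```shell
# Assumed JDK 11 install location -- change to match your system.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk-11"
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
```

Afterwards, `java -version` should print the JDK 11 version string.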
Installing Apache Ant
• Download and install Apache Ant (1.10.12)
https://ant.apache.org/
• Set the environment variables ANT_HOME and PATH
• Check with ant -version
A successful installation prints the Ant version string
2 Installing Nutch
• Download a binary package (apache-nutch-1.X-bin.zip) (1.19)
https://archive.apache.org/dist/nutch/
• Unzip the Nutch package; there should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X (referred to as {nutch_home}) into cygwin64/home
• Verify the Nutch installation:
+ Open a Cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch
III How to crawl the web?

1 Crawler Workflow
• initialize crawldb, inject seed URLs
• generate fetch list: select URLs from crawldb for fetching
• fetch URLs from fetch list
• parse documents: extract content, metadata and links
• update crawldb: status, score and signature; add newly discovered URLs, either inline or at the end of one crawler run
• invert links: map anchor texts to documents the links point to
• build the web graph, calculate link rank, and update the crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts
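The workflow above maps onto individual commands of the Nutch 1.x CLI. A sketch of one crawl iteration, assuming the crawl data lives under crawl/ and the seed list under urls/ (run from {nutch_home}; these steps are shown in detail on the later slides):

```shell
bin/nutch inject crawl/crawldb urls                      # initialize crawldb, inject seed URLs
bin/nutch generate crawl/crawldb crawl/segments          # generate a fetch list
s1=`ls -d crawl/segments/2* | tail -1`                   # pick the newest segment
bin/nutch fetch $s1                                      # fetch URLs from the fetch list
bin/nutch parse $s1                                      # parse documents
bin/nutch updatedb crawl/crawldb $s1                     # update crawldb with new URLs
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # invert links
bin/nutch dedup crawl/crawldb                            # deduplicate by signature
```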
2 Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr
https://archive.apache.org/dist/lucene/solr/8.11.2
• Check the Solr installation (run from {APACHE_SOLR_HOME}):
+ Start: bin\solr.cmd start
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop
+ Open the admin UI: http://localhost:8983/solr/#/
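With Solr started on its default port, the installation can also be checked from the command line via Solr's standard system-info endpoint (a sketch, assuming a local default install):

```shell
curl "http://localhost:8983/solr/admin/info/system?wt=json"
```

A running instance answers with a JSON document describing the Solr version and JVM.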
3 Crawl a site
• Customize your crawl properties in conf/nutch-site.xml of {nutch_home}
• Configure regular expression filters in conf/regex-urlfilter.txt
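A minimal sketch of conf/nutch-site.xml: Nutch refuses to crawl unless http.agent.name is set to a non-empty value (the agent name below is a hypothetical example):

```xml
<!-- conf/nutch-site.xml: http.agent.name must be non-empty before crawling -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

In conf/regex-urlfilter.txt, a line such as `+^https?://www\.youtube\.com/` restricts the crawl to one site; the default final rule `+.` accepts everything not rejected earlier.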
Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add the site that will be crawled
For example: https://www.youtube.com/
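From {nutch_home} in the Cygwin terminal, the seed list can be created like this (the YouTube URL is just the example site from above):

```shell
# Create the seed directory and seed file under {nutch_home}.
mkdir -p urls
echo "https://www.youtube.com/" > urls/seed.txt
cat urls/seed.txt
```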
+ Open a Cygwin terminal
+ One-step crawl: bin/crawl -i -s urls crawl 2
Or run the steps manually. Seed the crawldb with a list of URLs:
+ bin/nutch inject crawl/crawldb urls
After fetching and parsing a first segment $s1, update the crawldb:
+ bin/nutch updatedb crawl/crawldb $s1
Then generate, fetch and parse a second segment:
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
Read the databases
• 1- bin/nutch readdb crawl/crawldb -stats > stats.txt
• 2- bin/nutch readdb crawl/crawldb/ -dump db
• 3- bin/nutch readlinkdb crawl/linkdb/ -dump link
• 4- bin/nutch readseg -dump crawl/segments/20131216194731 crawl/segments/20131216194731_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
Installing Hadoop (winutils for Windows)
+ Download and extract a Hadoop package: https://archive.apache.org/dist/hadoop/common/
+ Set the environment variable %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it into the %HADOOP_HOME%\bin folder
4 Indexing in Solr
- Integrate Nutch with Solr
+ Go to the solr folder (solr-8.11.2)
Trang 21${APACHE_SOLR_HOME}/server/solr/configse ts/nutch/conf/
+ Edit conf/index-writers.xml from
{nutch_home} with the name of new core
${APACHE_SOLR_HOME}/bin/soalr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
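A sketch of the relevant part of {nutch_home}/conf/index-writers.xml after pointing the Solr writer at the new core (the URL below assumes Solr on localhost with a core named nutch; other parameters in the file can keep their defaults):

```xml
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="solr"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <!-- remaining parameters keep their default values -->
  </parameters>
</writer>
```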
+ Index into Apache Solr (change the segments folder to your own):
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segment_folder}/ -filter -normalize