Applied Informatics in Chemical Technology: Lab 11, Apache Hadoop MapReduce



Lab 11: HADOOP MAPREDUCE

Prepared by: Nguyễn Quang Hùng, M.Sc. E-mail: hungnq2@cse.hcmut.edu.vn

1 Introduction:

• Hadoop Map/Reduce is an open-source software framework that helps programmers write applications following the Map/Reduce model. To implement an application in this model, students need to use the programming interfaces provided by Hadoop, such as Mapper, Reducer, JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, etc.

• Students are required to run the WordCount application on two cluster models in order to understand how the Map/Reduce model and the HDFS (Hadoop Distributed FileSystem) architecture work.
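Of the interfaces listed above, Partitioner is the one that decides which reduce task receives each intermediate key. The sketch below is plain Java, not Hadoop code. It assumes the rule used by Hadoop's default hash partitioner (clear the sign bit of the key's hash code, then take the remainder modulo the number of reduce tasks) and uses String.hashCode() as a stand-in for the hash of a Text key, so the partition numbers on a real cluster may differ. The names PartitionSketch, partitionFor, and assign are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Assumed rule: non-negative hash code modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the reduce task they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, numReduceTasks);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Hello", "Hadoop", "Goodbye", "World", "Bye");
        // Every occurrence of the same key lands in the same partition,
        // which is what lets a single reducer see all values for that key.
        System.out.println(assign(keys, 2));
    }
}
```

With a single reduce task every key maps to partition 0, which is why the WordCount run later in this lab produces only one output file.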

2 Installation guides for Apache Hadoop and the MapReduce tutorial:

- Hadoop: http://hadoop.apache.org/docs/r1.1.2/#Getting+Started

- Guide to installing Apache Hadoop on a cluster (Cluster node setup): https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html

- MapReduce Tutorial: http://hadoop.apache.org/docs/r1.1.2/mapred_tutorial.html

3 Example programs:

3.1 Installing and using MapReduce

Students can set up either Single Node Mode or Pseudo-Distributed Operation on a single machine. The steps are as follows:

- Download a Hadoop distribution from http://hadoop.apache.org

- Start the Hadoop MapReduce environment with the following commands:

$ cd $HADOOP_HOME

$ bin/hadoop namenode -format

$ bin/start-all.sh

- Open the following web pages to check whether Hadoop MapReduce is running:


• Namenode: http://localhost:50070

• JobTracker: http://localhost:50030

- Run the sample application provided with Hadoop:

$ bin/hadoop fs -put conf input

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

$ bin/hadoop fs -get output output

$ cat output/*

- Shut down the Hadoop MapReduce environment:

$ bin/stop-all.sh

The configuration files used to set up Pseudo-Distributed cluster mode for Hadoop are:

- Three (3) main files in the hadoop-version/conf directory:

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
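Each of these files holds name/value pairs, with every pair wrapped in a property element inside a single configuration element. As a rough illustration of that layout, the sketch below reads such a document with the XML parser that ships with the JDK. Hadoop loads these files through its own configuration classes; nothing here uses Hadoop, and the names ConfigSketch and parse are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConfigSketch {
    // Reads <configuration><property><name/><value/></property>...</configuration>
    // into an ordered name -> value map.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String coreSite = "<configuration><property>"
                + "<name>fs.default.name</name><value>hdfs://localhost:9000</value>"
                + "</property></configuration>";
        System.out.println(parse(coreSite)); // prints {fs.default.name=hdfs://localhost:9000}
    }
}
```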

3.2 Example program: WordCount.java

/* Filename: WordCount.java */


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token of an input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts collected for one word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Job setup: args[0] is the HDFS input directory, args[1] the HDFS output directory.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}


Compile and run

$ export HADOOP_HOME=<Hadoop installation directory>

$ mkdir wordcount_classes

$ javac -classpath $HADOOP_HOME/hadoop-core-*.jar -d wordcount_classes/ WordCount.java

$ jar -cvf wordcount.jar -C wordcount_classes/ .

$ mkdir wordcount

$ cd wordcount

$ mkdir input

$ cd input/

$ vi file01

Hello Hadoop Goodbye Hadoop

$ vi file02

Hello World Bye World

$ bin/hadoop fs -mkdir wordcount

$ bin/hadoop dfs -put $HOME/input/file0* wordcount/input/

$ bin/hadoop dfs -ls wordcount/input

Found 2 items

-rw-r--r-- 1 hung supergroup 28 2012-05-08 08:15 /user/hung/wordcount/input/file01

-rw-r--r-- 1 hung supergroup 22 2012-05-08 08:15 /user/hung/wordcount/input/file02

$ bin/hadoop dfs -cat wordcount/input/file*

Hello Hadoop Goodbye Hadoop

Hello World Bye World

$ bin/hadoop dfs -ls wordcount/input

Found 2 items

-rw-r--r-- 1 hung supergroup 28 2012-05-08 08:15 /user/hung/wordcount/input/file01

-rw-r--r-- 1 hung supergroup 22 2012-05-08 08:15 /user/hung/wordcount/input/file02

$ bin/hadoop jar ~/wordcount.jar org.myorg.WordCount wordcount/input wordcount/output

12/05/08 08:17:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

12/05/08 08:17:38 INFO mapred.FileInputFormat: Total input paths to process : 2

12/05/08 08:17:38 INFO mapred.JobClient: Running job: job_201205080748_0004

12/05/08 08:17:39 INFO mapred.JobClient: map 0% reduce 0%

12/05/08 08:18:22 INFO mapred.JobClient: Job complete: job_201205080748_0004

12/05/08 08:18:22 INFO mapred.JobClient: Counters: 30

…

12/05/08 08:18:22 INFO mapred.JobClient: Combine output records=6

12/05/08 08:18:22 INFO mapred.JobClient: Physical memory (bytes) snapshot=471007232

12/05/08 08:18:22 INFO mapred.JobClient: Reduce output records=5


12/05/08 08:18:22 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1495506944

12/05/08 08:18:22 INFO mapred.JobClient: Map output records=8

$ bin/hadoop dfs -ls wordcount/output

Found 3 items

-rw-r--r-- 1 hung supergroup 0 2012-05-08 08:18 /user/hung/wordcount/output/_SUCCESS

drwxr-xr-x - hung supergroup 0 2012-05-08 08:17 /user/hung/wordcount/output/_logs

-rw-r--r-- 1 hung supergroup 41 2012-05-08 08:18 /user/hung/wordcount/output/part-00000

[hung@ppdslab01 hadoop-0.20.205.0]$ bin/hadoop dfs -cat wordcount/output/part-00000
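The counters in the job log above (Map output records=8, Combine output records=6, Reduce output records=5) can be checked by hand. The sketch below is plain Java that imitates the three stages for the two input files. It assumes that each file is handled by its own map task and that the combiner runs exactly once per map task, which Hadoop does not guarantee in general. The names CounterSketch, mapTask, combine, and reduce are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class CounterSketch {
    // Map stage: one record per token, standing for a (word, 1) pair.
    static List<String> mapTask(String fileContent) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(fileContent);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    // Combine stage: sums the 1s per word within a single map task's output.
    static Map<String, Integer> combine(List<String> mapOutput) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapOutput) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce stage: sums the partial counts from all map tasks per word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("Hello Hadoop Goodbye Hadoop", "Hello World Bye World");
        int mapRecords = 0;
        int combineRecords = 0;
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String file : files) {
            List<String> mapOutput = mapTask(file);
            Map<String, Integer> partial = combine(mapOutput);
            mapRecords += mapOutput.size();
            combineRecords += partial.size();
            partials.add(partial);
        }
        Map<String, Integer> totals = reduce(partials);
        System.out.println("Map output records=" + mapRecords);         // 8
        System.out.println("Combine output records=" + combineRecords); // 6
        System.out.println("Reduce output records=" + totals.size());   // 5
        System.out.println(totals); // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Written one per line as word, tab, count, these five totals occupy 41 bytes, which agrees with the size listed for part-00000.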

WordCount v2.0

Here is a more complete WordCount which uses many of the features provided by the MapReduce framework we discussed so far.

This needs the HDFS to be up and running, especially for the DistributedCache-related features. Hence it only works with a pseudo-distributed or fully-distributed Hadoop installation.

Source Code

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0

• Note: some commands for working with HDFS

$ bin/hadoop dfs -put <source> <dest> : copy the local <source> to <dest> on HDFS, to provide input for the program

$ bin/hadoop dfs -get <dest> <source> : copy <dest> from HDFS back to the local <source>, to retrieve the program's output

$ bin/hadoop dfs -rmr <dir> : delete a directory

$ bin/hadoop dfs -rm <file> : delete a file

4 Exercises

Exercise 1: Run the WordCount program to count how often each word occurs in a set of text documents.

Exercise 2: Write a program that computes PI using the Map/Reduce model.
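One possible starting point for Exercise 2, offered as a sketch and not as the expected solution, is Monte Carlo sampling: each map task draws random points in the unit square and counts those inside the quarter circle of radius 1, and the reduce step sums the counts and multiplies the ratio by 4. The code below is plain Java that only imitates this split on one machine. The names PiSketch, mapTask, reduce, and estimate are hypothetical, and using the task index as the random seed is an assumption made to keep the result reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class PiSketch {
    // "Map" task: samples points in the unit square and counts the hits
    // inside the quarter circle x^2 + y^2 <= 1.
    static long mapTask(long seed, int samples) {
        Random random = new Random(seed);
        long inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return inside;
    }

    // "Reduce" step: sums the per-task hit counts and scales the hit ratio by 4.
    static double reduce(List<Long> insideCounts, int samplesPerTask) {
        long inside = 0;
        for (long count : insideCounts) {
            inside += count;
        }
        long total = (long) insideCounts.size() * samplesPerTask;
        return 4.0 * inside / total;
    }

    static double estimate(int tasks, int samplesPerTask) {
        List<Long> counts = new ArrayList<>();
        for (int task = 0; task < tasks; task++) {
            counts.add(mapTask(task, samplesPerTask)); // task index doubles as the seed
        }
        return reduce(counts, samplesPerTask);
    }

    public static void main(String[] args) {
        System.out.println(estimate(4, 100000));
    }
}
```

The map tasks are independent of each other, so on a cluster they can run in parallel and only the small per-task counts need to travel to the reducer.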

Exercise 3: Given a set of vertices (coordinates in two-dimensional space (x, y)), find the shortest path between two given vertices. Hint: implement Dijkstra's algorithm on Hadoop MapReduce.
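Exercise 3 does not say which vertices are connected, so the sketch below assumes an explicit list of undirected edges whose weights are the Euclidean distances between their endpoints. It is also not Dijkstra's priority-queue algorithm as such: it uses repeated relaxation rounds in the style of Bellman-Ford, a formulation that maps naturally onto MapReduce because each round can be one job, with the map step proposing candidate distances to neighbours and the reduce step keeping the minimum per vertex. It computes distances only, not the path itself. This is plain Java, and all names and the sample graph are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ShortestPathSketch {
    static double distance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Adds an undirected edge weighted by the Euclidean distance of its endpoints.
    static void addEdge(Map<String, Map<String, Double>> graph, Map<String, double[]> coords, String a, String b) {
        double weight = distance(coords.get(a), coords.get(b));
        graph.computeIfAbsent(a, k -> new HashMap<>()).put(b, weight);
        graph.computeIfAbsent(b, k -> new HashMap<>()).put(a, weight);
    }

    // One round: "map" proposes dist(u) + w(u, v) for each neighbour v of a
    // reached vertex u; "reduce" keeps the smallest proposal per vertex.
    static Map<String, Double> relaxOnce(Map<String, Map<String, Double>> graph, Map<String, Double> dist) {
        Map<String, Double> next = new HashMap<>(dist);
        for (Map.Entry<String, Double> reached : dist.entrySet()) {
            Map<String, Double> neighbours = graph.get(reached.getKey());
            if (neighbours == null) {
                continue;
            }
            for (Map.Entry<String, Double> edge : neighbours.entrySet()) {
                next.merge(edge.getKey(), reached.getValue() + edge.getValue(), Double::min);
            }
        }
        return next;
    }

    // Repeats rounds until no distance changes any more.
    static Map<String, Double> shortestDistances(Map<String, Map<String, Double>> graph, String source) {
        Map<String, Double> dist = new HashMap<>();
        dist.put(source, 0.0);
        while (true) {
            Map<String, Double> next = relaxOnce(graph, dist);
            if (next.equals(dist)) {
                return dist;
            }
            dist = next;
        }
    }

    // Hypothetical sample: A(0,0), B(3,0), C(3,4), D(6,4) with edges A-B, B-C, C-D, B-D.
    static Map<String, Map<String, Double>> sampleGraph() {
        Map<String, double[]> coords = new HashMap<>();
        coords.put("A", new double[] {0, 0});
        coords.put("B", new double[] {3, 0});
        coords.put("C", new double[] {3, 4});
        coords.put("D", new double[] {6, 4});
        Map<String, Map<String, Double>> graph = new HashMap<>();
        addEdge(graph, coords, "A", "B");
        addEdge(graph, coords, "B", "C");
        addEdge(graph, coords, "C", "D");
        addEdge(graph, coords, "B", "D");
        return graph;
    }

    public static void main(String[] args) {
        // A-B-D costs 3 + 5 = 8, shorter than A-B-C-D at 3 + 4 + 3 = 10.
        System.out.println(shortestDistances(sampleGraph(), "A").get("D")); // prints 8.0
    }
}
```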

Title	Hadoop MapReduce
Author	Nguyễn Quang Hùng, M.Sc.
Institution	Hochiminh City University of Technology
Major	Computer Science
Type	Exercises
Year of publication	2025
City	Ho Chi Minh City

Format
Number of pages	5
File size	102.35 KB