Hadoop: Data Processing and Modelling
Table of Contents
Hadoop: Data Processing and Modelling
Credits
Preface
What this learning path covers
Hadoop Beginner's Guide
Hadoop Real World Solutions Cookbook, 2nd Edition
Mastering Hadoop
What you need for this learning path
Who this learning path is for
1 What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2 Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
3 Understanding MapReduce
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
4 Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why to use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5 Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero - Implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6 When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
The risk of correlated failures
Task failure due to software
Failure of slow running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip
7 Keeping Things Running
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage class
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
8 A Relational View on Data with Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting, oh my!
User-Defined Function
Time for action – adding a new User Defined Function (UDF)
What just happened?
To preprocess or not to preprocess
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9 Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10 Data Collection with Flume
A note about AWS
Data data everywhere
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume
Time for action – multi-level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero - Handling sink failure
Next, the world
Have a go hero - Next, the world
The bigger picture
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere Big Insights
Choosing a distribution
Other Apache projects
A Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Installing a multi-node Hadoop cluster
Left outer join
Right outer join
Full outer join
Left semi join
Problem statement
Solution
How it works
3 Module 3
1 Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2 Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
3 Advanced Pig
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4 Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5 Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6 YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
7 Storm on YARN – Low Latency Processing in Hadoop
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
8 Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9 HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
11 Hadoop Security
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12 Analytics Using Hadoop
Data analytics workflow
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
13 Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
A Bibliography
Index