Hadoop: Data Processing and Modelling
Table of Contents
Hadoop: Data Processing and Modelling
Credits
Preface
What this learning path covers
Hadoop Beginner's Guide
Hadoop Real World Solutions Cookbook, 2nd Edition
Mastering Hadoop
What you need for this learning path
Who this learning path is for
1 What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2 Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
3 Understanding MapReduce
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
4 Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why to use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5 Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero - Implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6 When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
The risk of correlated failures
Task failure due to software
Failure of slow running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip
7 Keeping Things Running
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage class
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
8 A Relational View on Data with Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting, oh my!
User-Defined Function
Time for action – adding a new User Defined Function (UDF)
What just happened?
To preprocess or not to preprocess
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9 Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10 Data Collection with Flume
A note about AWS
Data data everywhere
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume
Time for action – multi-level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero - Handling sink failure
Next, the world
Have a go hero - Next, the world
The bigger picture
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere Big Insights
Choosing a distribution
Other Apache projects
A Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Installing a multi-node Hadoop cluster
Left outer join
Right outer join
Full outer join
Left semi join
Problem statement
Solution
How it works
3 Module 3
1 Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2 Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
3 Advanced Pig
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4 Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5 Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6 YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
7 Storm on YARN – Low Latency Processing in Hadoop
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
8 Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9 HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
11 Hadoop Security
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12 Analytics Using Hadoop
Data analytics workflow
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
13 Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
A Bibliography
Index