Hadoop MapReduce v2 Cookbook Second Edition
Credits
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
    How to do it…
    How it works…
    See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
    See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
    How to do it…
    See also
Adding support for new input data formats – implementing a custom InputFormat
    How to do it…
    How it works…
    There’s more…
    See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
    How to do it…
    How it works…
    There’s more…
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
Plotting the Hadoop MapReduce results using gnuplot
    Getting ready
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Exporting data from HDFS to a relational database using Apache Sqoop
    Getting ready
    How to do it…
    How it works…
    See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
    Getting ready
    How to do it…
    How it works…
Index
Hadoop MapReduce v2 Cookbook Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
He received his bachelor of science degree in computer science and engineering from the University of Moratuwa, Sri Lanka.
I would like to thank my wife, Bimalee, my son, Kaveen, and my daughter, Yasali, for putting up with me for all the missing family time and for providing me with love and encouragement throughout the writing period. I would also like to thank my parents and siblings. Without their love, guidance, and encouragement, I would not be where I am today.

I really appreciate the contributions from my coauthor, Dr. Srinath Perera, to the first edition of this book. Many of his contributions from the first edition have been adapted to the current book, even though he wasn’t able to coauthor this edition due to his work and family commitments.

I would like to thank the Hadoop, HBase, Mahout, Pig, Hive, Sqoop, Nutch, and Lucene communities for developing great open source products. Thanks to the Apache Software Foundation for fostering vibrant open source communities.

Big thanks to the editorial staff at Packt for providing me with the opportunity to write this book, and for their feedback and guidance throughout the process. Thanks to the reviewers of this book for the many useful suggestions and corrections.

I would like to express my deepest gratitude to all the mentors I have had over the years, including Prof. Geoffrey Fox, Dr. Chris Groer, Dr. Sanjiva Weerawarana, Prof. Dennis Gannon, Prof. Judy Qiu, Prof. Beth Plale, and all my professors at Indiana University and the University of Moratuwa, for all the knowledge and guidance they gave me. Thanks to all my past and present colleagues for the many insightful discussions we’ve had and the knowledge they shared with me.
Srinath Perera (coauthor of the first edition of this book) is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as a member of the visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a cofounder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. Srinath is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.

Srinath received his PhD and MSc in computer science from Indiana University, Bloomington, USA, and his bachelor of science in computer science and engineering from the University of Moratuwa, Sri Lanka.

Srinath has authored many technical and peer-reviewed research articles; more details can be found on his website. He is also a frequent speaker at technical venues.

Srinath has worked with large-scale distributed systems for a long time, and he works closely with big data technologies such as Hadoop and Cassandra daily. He also teaches a parallel programming graduate class at the University of Moratuwa, which is primarily based on Hadoop.
I would like to thank my wife, Miyuru, and my parents, whose never-ending support keeps me going. I would also like to thank Sanjiva from WSO2, who encouraged us to make our mark even though projects such as these are not in the job description. Finally, I would like to thank my colleagues at WSO2 for the ideas and companionship that have shaped the book in many ways.
…architecture. His 15 years of experience in IT consulting has resulted in a client list that looks like a “Who’s Who” of the Fortune 500. His recent projects include a complete network redesign for an aircraft manufacturer and an in-store video analytics pilot for a major home improvement retailer.
Jeroen van Wilgenburg is a software craftsman at JPoint (http://www.jpoint.nl), a software agency based in the center of the Netherlands. Their main focus is on developing high-quality Java and Scala software with open source frameworks.

Currently, Jeroen is developing several big data applications with Hadoop, MapReduce, Storm, Spark, Kafka, MongoDB, and Elasticsearch.
http://vanwilgenburg.wordpress.com
Shinichi Yamashita is a solutions architect in the System Platform Sector at NTT DATA Corporation, Japan. He has more than 9 years of experience in software and middleware engineering (Apache, Tomcat, PostgreSQL, the Hadoop ecosystem, and Spark). Shinichi has written a few books on Hadoop in Japanese.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books:

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
We are currently facing an avalanche of data, and this data contains many insights that hold the keys to success or failure in the data-driven world. Next generation Hadoop (v2) offers a cutting-edge platform to store and analyze these massive data sets, and improves upon the widely used and highly successful Hadoop MapReduce v1. The recipes in this book will provide you with the skills and knowledge needed to process large and complex datasets using the next generation Hadoop ecosystem.

This book presents many exciting topics, such as using MapReduce patterns with Hadoop to solve analytics, classification, and data indexing and searching problems. You will also be introduced to several Hadoop ecosystem components, including Hive, Pig, HBase, Mahout, Nutch, and Sqoop.

This book starts with simple examples and then dives deep to solve in-depth big data use cases. It presents more than 90 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples.
…basic Hadoop YARN and HDFS configurations, the HDFS Java API, and unit testing methods for MapReduce applications.
Chapter 8, Searching and Indexing, introduces several tools and techniques that you can use with Apache Hadoop to perform large-scale searching and indexing.

Chapter 9, Classifications, Recommendations, and Finding Relationships, explains how to implement complex algorithms such as classifications, recommendations, and finding relationships using Hadoop.

Chapter 10, Mass Text Data Processing, explains how to use Hadoop and Mahout to process large text datasets and how to perform data preprocessing and loading operations using Hadoop.
You need a moderate knowledge of Java, access to the Internet, and a computer that runs a Linux operating system.
If you are a big data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. This book is for Java programmers with little to moderate knowledge of Hadoop MapReduce. It is also a one-stop reference for developers and system admins who want to quickly get up to speed with using Hadoop v2. It would be helpful to have a basic knowledge of software development using Java and a basic working knowledge of Linux.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “The following are the descriptions of the properties we used in the hadoop.properties file.”
Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Getting Started with Hadoop v2
In this chapter, we will cover the following recipes:
Setting up standalone Hadoop v2 on your local machine
Writing a WordCount MapReduce application, bundling it, and running it using Hadoop local mode
Adding a combiner step to the WordCount MapReduce program
Setting up HDFS
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
HDFS command-line file operations
Running the WordCount program in a distributed cluster environment
Benchmarking HDFS using DFSIO
Benchmarking Hadoop MapReduce using TeraSort
We are living in the era of big data, where the exponential growth of phenomena such as the Web, social networking, and smartphones is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets.

Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, and network failures a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue.
Note

Google is one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing that borrows the map and reduce paradigms from the functional programming world, and named it MapReduce. At the foundation of Google MapReduce was the Google File System, a high-throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications introduced these Google MapReduce and Google File System concepts, and Apache Hadoop provides their most widely used open source implementations, optimizing computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de facto standard for storing and processing big data.
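To make the map and reduce paradigms concrete, the following is a minimal sketch of the classic WordCount computation written against the Hadoop MapReduce v2 Java API. The class names here are illustrative; the job driver, bundling, and execution are covered step by step in the WordCount recipes of this chapter.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit a <word, 1> pair for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce phase: sum all the counts emitted for each distinct word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

Between the two phases, the framework sorts and groups the emitted pairs by key, so each reduce invocation receives every count emitted for a single word; this shuffle step is what frees the programmer from writing any inter-node communication code.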