Hadoop MapReduce v2 Cookbook Second Edition
Credits
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
    How to do it…
    How it works…
    See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
    See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
    How to do it…
    See also
Adding support for new input data formats – implementing a custom InputFormat
    How to do it…
    How it works…
    There’s more…
    See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
    How to do it…
    How it works…
    There’s more…
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
Plotting the Hadoop MapReduce results using gnuplot
    Getting ready
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Exporting data from HDFS to a relational database using Apache Sqoop
    Getting ready
    How to do it…
    How it works…
    See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
    Getting ready
    How to do it…
    How it works…
Index
Hadoop MapReduce v2 Cookbook Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
He received his bachelor of science degree in computer science and engineering from the University of Moratuwa, Sri Lanka.
I would like to thank my wife, Bimalee, my son, Kaveen, and my daughter, Yasali, for putting up with me for all the missing family time and for providing me with love and encouragement throughout the writing period. I would also like to thank my parents and siblings. Without their love, guidance, and encouragement, I would not be where I am today.

I really appreciate the contributions from my coauthor, Dr. Srinath Perera, to the first edition of this book. Many of his contributions from the first edition have been adapted to the current book, even though he wasn’t able to coauthor this edition due to his work and family commitments.

I would like to thank the Hadoop, HBase, Mahout, Pig, Hive, Sqoop, Nutch, and Lucene communities for developing great open source products. Thanks to the Apache Software Foundation for fostering vibrant open source communities.

Big thanks to the editorial staff at Packt for providing me with the opportunity to write this book, and for their feedback and guidance throughout the process. Thanks to the reviewers of this book for the many useful suggestions and corrections.

I would like to express my deepest gratitude to all the mentors I have had over the years, including Prof. Geoffrey Fox, Dr. Chris Groer, Dr. Sanjiva Weerawarana, Prof. Dennis Gannon, Prof. Judy Qiu, Prof. Beth Plale, and all my professors at Indiana University and the University of Moratuwa, for all the knowledge and guidance they gave me. Thanks to all my past and present colleagues for the many insightful discussions we’ve had and the knowledge they shared with me.
Srinath Perera (coauthor of the first edition of this book) is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as a member of the visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a cofounder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. Srinath is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.

Srinath received his PhD and MSc in computer science from Indiana University, Bloomington, USA, and his bachelor of science in computer science and engineering from the University of Moratuwa, Sri Lanka.

Srinath has authored many technical and peer-reviewed research articles; more details can be found on his website. He is also a frequent speaker at technical venues.

Srinath has worked with large-scale distributed systems for a long time, and he works closely with big data technologies such as Hadoop and Cassandra daily. He also teaches a parallel programming graduate class at the University of Moratuwa, which is primarily based on Hadoop.
I would like to thank my wife, Miyuru, and my parents, whose never-ending support keeps me going. I would also like to thank Sanjiva from WSO2, who encouraged us to make our mark even though projects such as these are not in the job description. Finally, I would like to thank my colleagues at WSO2 for the ideas and companionship that have shaped the book in many ways.
…architecture. His 15 years of experience in IT consulting has resulted in a client list that looks like a “Who’s Who” of the Fortune 500. His recent projects include a complete network redesign for an aircraft manufacturer and an in-store video analytics pilot for a major home improvement retailer.
Jeroen van Wilgenburg is a software craftsman at JPoint (http://www.jpoint.nl), a software agency based in the center of the Netherlands. Their main focus is on developing high-quality Java and Scala software with open source frameworks.

Currently, Jeroen is developing several big data applications with Hadoop, MapReduce, Storm, Spark, Kafka, MongoDB, and Elasticsearch.
http://vanwilgenburg.wordpress.com
Shinichi Yamashita is a solutions architect in the System Platform Sector at NTT DATA Corporation, Japan. He has more than 9 years of experience in software and middleware engineering (Apache, Tomcat, PostgreSQL, the Hadoop ecosystem, and Spark). Shinichi has written a few books on Hadoop in Japanese.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books:

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
We are currently facing an avalanche of data, and this data contains many insights that hold the keys to success or failure in the data-driven world. Next generation Hadoop (v2) offers a cutting-edge platform to store and analyze these massive data sets, and improves upon the widely used and highly successful Hadoop MapReduce v1. The recipes in this book will provide you with the skills and knowledge needed to process large and complex datasets using the next generation Hadoop ecosystem.

This book presents many exciting topics, such as using MapReduce patterns with Hadoop to solve analytics, classification, and data indexing and searching problems. You will also be introduced to several Hadoop ecosystem components, including Hive, Pig, HBase, Mahout, Nutch, and Sqoop.

This book starts with simple examples and then dives deep to solve in-depth big data use cases. It presents more than 90 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples.
…basic Hadoop YARN and HDFS configurations, the HDFS Java API, and unit testing methods for MapReduce applications.
Chapter 8, Searching and Indexing, introduces several tools and techniques that you can use with Apache Hadoop to perform large-scale searching and indexing.

Chapter 9, Classifications, Recommendations, and Finding Relationships, explains how to implement complex algorithms such as classifications, recommendations, and finding relationships using Hadoop.

Chapter 10, Mass Text Data Processing, explains how to use Hadoop and Mahout to process large text datasets and how to perform data preprocessing and loading operations using Hadoop.
You need a moderate knowledge of Java, access to the Internet, and a computer that runs a Linux operating system.
If you are a big data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. This book is for Java programmers with little to moderate knowledge of Hadoop MapReduce. It is also a one-stop reference for developers and system admins who want to quickly get up to speed with using Hadoop v2. It would be helpful to have a basic knowledge of software development using Java and a basic working knowledge of Linux.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “The following are the descriptions of the properties we used in the hadoop.properties file.”
Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Getting Started with Hadoop v2
In this chapter, we will cover the following recipes:
Setting up standalone Hadoop v2 on your local machine
Writing a WordCount MapReduce application, bundling it, and running it using Hadoop local mode
Adding a combiner step to the WordCount MapReduce program
Setting up HDFS
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
HDFS command-line file operations
Running the WordCount program in a distributed cluster environment
Benchmarking HDFS using DFSIO
Benchmarking Hadoop MapReduce using TeraSort
We are living in the era of big data, where the exponential growth of phenomena such as the Web, social networking, and smartphones is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets.

Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, and network failures a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue.
Note

Google is one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing that borrows the map and reduce paradigms from the functional programming world, and named it MapReduce. At the foundation of Google MapReduce was the Google File System, a high-throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications introduced these Google MapReduce and Google File System concepts, and Apache Hadoop provides their most widely used open source implementations, optimizing computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de facto standard for storing and processing big data.
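To make the map and reduce paradigms concrete, the following is a minimal sketch of the classic WordCount computation written against the Hadoop MapReduce v2 Java API. The class names here are illustrative; the job driver, bundling, and execution are covered step by step in the WordCount recipes of this chapter.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit a <word, 1> pair for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce phase: sum all the counts emitted for each distinct word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

Between the two phases, the framework sorts and groups the emitted pairs by key, so each reduce invocation receives every count emitted for a single word; this shuffle step is what frees the programmer from writing any inter-node communication code.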