BIG DATA & HADOOP
Learn by Example
by
Mayank Bhushan
FIRST EDITION 2020
Copyright © BPB Publications, INDIA
ISBN: 978-93-8655-199-3
All Rights Reserved. No part of this publication can be stored in a retrieval system or reproduced in any form or by any means without the prior written permission of the publishers.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The Author and Publisher of this book have tried their best to ensure that the programmes, procedures and functions described in the book are correct. However, the author and the publishers make no warranty of any kind, expressed or implied, with regard to these programmes or the documentation contained in the book. The author and publisher shall not be liable in any event of any damages, incidental or consequential, in connection with, or arising out of the furnishing, performance or use of these programmes, procedures and functions. Product names mentioned are used for identification purposes only and may be trademarks of their respective companies.
All trademarks referred to in the book are acknowledged as properties of their respective owners.
Distributors:
Shop No 5, Mahendra Chambers,
150 DN Rd Next to Capital Cinema,
V.T (C.S.T.) Station, MUMBAI-400 001
Ph: 22078296/22078297
This book promises to be a very good starting point for beginners and an asset to advanced users too.
This book follows the syllabi of various universities and takes a “learning with example” approach. Difficult concepts of Big Data and Hadoop are presented in an easy and practical way, so that students can understand them efficiently. This book provides
screenshots of practical approaches which can be helpful for
in the subsequent editions of this book
23rd March 2018
Mayank Bhushan
I would like to express my gratitude to all those who provided support, talked things over, read, wrote, offered comments, allowed me to quote their remarks and assisted in the editing, proofreading and design.
I have relied on many people to guide me, directly and indirectly, in writing this book. I am very thankful to the Hadoop community, from whom I have learned continuously, and I also owe a debt of gratitude to ABES College for providing me with all the facilities of the Big Data-Hadoop lab.
There is always a sense of gratitude that one expresses to others for the help and services they render during difficult phases of life and in achieving one's goals.
It is impossible to thank everyone individually, but I make a humble effort here to thank some of them. At the outset, I am thankful to the Almighty, who constantly and invisibly guides everybody and has helped me to stay on the right path.
I am very thankful to Prof. (Dr.) Shailesh Tiwari, H.O.D. (CSE), ABES Engineering College, Ghaziabad (U.P.), for guiding and supporting me. He is my main source of inspiration. I would also like to thank Dr. Munesh Chandra Trivedi, Dean (REC-Azamgarh), Dr. Pratibha Singh (Prof., ABES Engineering College) and Dr. Shaswati Banerjea, Asst. Prof. (MNNIT Allahabad), who always supported me. Without their help this book would not have been possible. I am indebted to my dearest friend and colleague Mr. Omesh Kumar for the technical help and guidance he gave me on every problem.
I offer my thanks to all my gurus, friends and colleagues who helped and kept me motivated while writing this text. Special thanks to:
Dr K.K Mishra, MNNIT Allahabad
Dr Mayank Pandey, MNNIT Allahabad
Dr Shashank Srivastava, MNNIT Allahabad
Mr Nitin Shukla, MNNIT Allahabad
Mr Suraj Deb Barma, Govt Polytechnic College, Agartala
Dr A.L.N Rao, GL Bajaj, Greater Noida
Mr Ankit Yadav, Mr Desh Deepak Pathak, ABES EC Ghaziabad
Dr Sumit Yadav, IP University
Mr Aatif Jamshed, Galgotia College, Greater Noida
I also thank the Publisher and the whole staff at BPB Publications, especially Mr. Manish Jain, for bringing this text out in a nice, presentable form.
Finally, I want to thank everyone who has directly or indirectly contributed to completing this work.
Mayank Bhushan
Table of Contents
Chapter 1: Big Data-Introduction and Demand
1.1 Big Data
1.1.1 Characteristics of Big Data
1.1.2 Why Big Data
1.2 Hadoop
1.2.1 History of Hadoop
1.2.2 Name of Hadoop
1.2.3 Hadoop Ecosystem
1.3 Convergence of Key Trends
1.3.1 Convergence of Big Data into Business
1.3.2 Big data Vs other techniques
1.4 Unstructured Data
1.5 Industry examples of Big data
1.5.1 Use of Big data-Hadoop at Yahoo
1.5.2 In RackSpace for log processing
1.5.3 Hadoop at Facebook
1.6 Usages of Big Data
1.6.1 Web analytics
1.6.2 Big Data and marketing
1.6.3 Big data and fraud
1.6.4 Risk management in Big Data with Credit card
1.6.5 Big data and algorithm trading
1.6.6 Big data in Healthcare
Chapter 2: NoSQL Data Management
2.1 Introduction to NoSQL database
2.1.1 Terminology used in NoSQL and RDBMS
2.1.2 Database use in NoSQL
2.4.6 Hbase Shell Commands
2.4.7 The different usages of scan command
2.4.8 Terminologies
Chapter 3: Basics of Hadoop
3.1 Data Format
3.2 Analysing data with Hadoop
3.3 Scale-in Vs Scale-out
3.3.1 Number of reducers used
3.3.2 Driver class with no reducer
3.6.4 Low-latency data access
3.6.5 Lots of small files
3.6.6 Arbitrary file modifications
3.7 HDFS Concept
3.7.1 Blocks
3.7.2 Namenodes and Datanodes
3.7.3 HDFS group
3.7.4 All time availability
3.8 Hadoop Files System
3.9 Java Interface
3.9.1 HTTP
3.9.2 C
3.9.3 FUSE (File System in Userspace)
3.9.4 Reading data using Java interface (URL)
3.9.5 Reading data using java interface (File System API)
3.14 Avro file based data structure
3.14.1 Data type and schemas
3.14.2 Serialization and deserialization
4.3 Fully Distributed Mode
Chapter 5: MapReduce Applications
5.1 Understanding of MapReduce
5.2 Traditional Way
5.3 MapReduce Workflow
5.3.1 Map Side
5.3.2 Reduce Side
5.4 Unit Test with MRUnit
5.4.1 Testing Mapper Class
5.4.2 Testing Reducer Class
5.4.3 Testing Driver Class of Program
5.4.4 Test output of program
5.5 Test Data and Local Data Check
5.5.1 Debugging MapReduce Job
5.5.2 Job Control
5.6 Anatomy of MapReduce Job
5.6.1 Anatomy of File Write
5.6.2 Anatomy of File Read
5.6.3 Replica Management
5.7 MapReduce Job Run
5.7.1 Classic MapReduce (MapReduce 1)
6.6.5 Delete specific cell in table
6.6.6 Delete all cells in a table
6.6.7 Scanning using HBase shell
6.6.13 Scope operator for alter table
6.6.14 Deleting column family
6.6.15 Existence of table
6.6.16 Dropping a table
6.6.17 Drop all table
6.7 HBase using Java APIs
6.7.1 Creating table
6.7.2 List of the tables in HBase
6.7.3 Disable a table
6.7.4 Add column family
6.7.5 Deleting column family
6.7.6 Verifying existence of table
6.7.7 Deleting table
6.9.4 Basic CLI commands
6.10 Cassandra Data Model
6.10.1 Super Column family
7.7.4 User Defined Functions
7.8 Developing and Testing PigLatin Script
7.10 Data Type and File Format
7.11 Comparison of HiveQL with Traditional Database
7.12 HiveQL
7.12.1 Data Definition Language
7.12.2 Data Manipulation Language
7.12.3 Example for practice
Chapter 8: Practical & Research based Topics
8.1 Data Analysis with Twitter
8.1.1 Using flume
8.1.2 Data Extraction using Java
8.1.3 Data Extraction using Python
8.2 Use of Bloom Filter in MapReduce
8.2.1 Function of Bloom filter
8.2.2 Working of bloom filter
8.2.3 Application of Bloom filter
8.2.4 Implementation of Bloom filter in MapReduce
8.3 Amazon Web Service
8.3.1 AWS
8.3.2 Setting AWS
8.3.3 Setting up Hadoop on EC2
8.4 Document Archived from NY Times
8.5 Data Mining in Mobiles
Appendix: Hadoop Commands
Chapter wise Questions
Previous Year Question Paper
CHAPTER 1
Big Data-Introduction and Demand
“…Data is useless without the skill to analyse it.”
-Jeanne Harris, senior executive at Accenture Institute for High Performance
“Taking a hunch, you have about the world and pursuing it in a structural, mathematical way to understand something new about the world.”
-Hilary Mason, American data scientist and the founder of technology start-up Fast Forward Labs
1.1 Big Data
In today's scenario, we are all surrounded by huge volumes of data. We humans are ourselves a source of big data: we are surrounded by devices and generate data every minute.
“I spend most of my time assuming the world is not ready for the technology revolution that will be happening to them soon,”
-Eric Schmidt, Executive Chairman, Google
As a matter of fact, if we compare the present situation with the past, we find that we now create as much information in just two days as we did in all the time up to 2003. That means we are creating five exabytes of data every two days.
The real problem is the data that users generate continuously. When it comes to analysis, we face the challenge of both storing and analysing that data.
“The real issue is user-generated content,”
Schmidt
This mostly helps Google, which analyses the data and sells data analytics to the companies that require it. We produce data simply through our mobile phones, since we are logged in from the moment we buy the device:
Map: collects data about our travel.
App: gathers information about our mood swings and records the activities in which we are involved most of the time.
E-Commerce sites: collect information about our requirements and show us whatever we are likely to buy.
Emails: produce data about our requirements based on our conversations, as all conversations are generally filtered through the companies that own the mailing addresses.
During the past few decades, technologies like remote sensing, geographical information systems and global positioning systems have remodelled the way the distribution of the human population across the world is mapped. Even so, population data still has to be mapped to meaningful surveys, which are carried out by large organizations. As a result, spatially detailed changes across scales of days, weeks, or months, or even year to year, are difficult to assess, which limits the application of human population maps in situations in which timely data is needed, like disasters, conflicts, or epidemics. With information being collected on a daily basis by mobile network providers across the planet, the prospect of being able to map up-to-date and ever-changing human population distributions over comparatively short intervals exists, paving the way for new applications and a near real-time understanding of the patterns and processes in human geography.
Some of the facts related to exponential data production are:
Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At this point, predicted data production will be 44 times greater than that in 2009.
In 2012, 2.5 quintillion bytes of data were generated daily, and 90% of current data worldwide originated in the past two years.
Facebook alone stores, accesses, and analyses 30+ PB of user-generated data.
In 2008, Google was processing 20,000 TB of data daily.
Walmart processes over 1 million customer transactions, thus generating more than 2.5 PB of data as an estimate.
More than 5 billion people worldwide call, text, tweet, and browse
value is expected to increase at an average annual rate of 13% over the next four years to exceed 143 billion by the end of 2016.
Boston.com reported that in 2013, approximately 507 billion e-mails were sent daily. Currently, an e-mail is sent every 3.5 × 10^-7 seconds. Thus, the volume of data increases per second because of rapid data generation.
By 2020, enterprise data is expected to total 40 ZB, as per
International Data Corporation
The New York Stock Exchange generates about one terabyte of new trade data per day.
Based on this estimation, business-to-consumer (B2C) and internet-business-to-business (B2B) transactions will amount to 450 billion per day.
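The figures in this list use several different units, so it can help to put them on a common scale. The following is a minimal Python sketch, not part of the original text: the `UNITS` table and the `convert` helper are hypothetical names introduced for illustration, and decimal (SI) prefixes are assumed because the text does not say which convention it uses.

```python
# Decimal (SI) byte units. Assumption: the text does not say whether its
# figures use decimal or binary prefixes, so decimal is used here.
UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two byte units (hypothetical helper)."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# 2.5 quintillion bytes per day (the 2012 figure) expressed in exabytes.
daily_bytes_2012 = 25 * 10**17
print(daily_bytes_2012 / UNITS["EB"], "EB per day")

# Five exabytes every two days works out to the same daily rate.
print(convert(5, "EB", "EB") / 2, "EB per day")

# Google's 20,000 TB per day (the 2008 figure) expressed in petabytes.
print(convert(20_000, "TB", "PB"), "PB per day")

# Average gap between e-mails if 507 billion are sent per day.
seconds_per_day = 24 * 60 * 60
print(seconds_per_day / 507e9, "seconds between e-mails")
```

Under these assumptions, the 2012 daily figure and the "five exabytes every two days" estimate describe the same rate of 2.5 EB per day. The last line gives roughly 1.7 × 10^-7 seconds, which is the same order of magnitude as the interval quoted in the list.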
All these facts are sufficient to show that the world is generating a large amount of data that is not structured. This calls for innovation and new thinking to solve the resulting problems.
Big data is the approach used to deal with the current scenario. It is the concept of handling both unstructured and structured data in ways other than the traditional one.
Table 1.1: Introduction of data
Table 1.1 shows the flow of data from bottom to top. In today's scenario, any type of data can be stored and processed.
1.1.1 Characteristics of Big Data
Big data is data that forces us to think beyond the traditional database system. The data used in Big data may be structured or unstructured and comes in huge volumes, so it requires faster movement, storage and processing than conventional database techniques can offer. These processing requirements demand tools that can work quickly and meaningfully, which is difficult for any traditional database tool.
The properties of Big data offer a next-generation way to handle this situation and give organizations an easy and efficient way to handle data. As we can all see, there are a lot of devices continuously generating data at an exponentially increasing rate, and people are immersing themselves in social networking. These types of unstructured and structured data create challenges in storing and processing data.
Every day, the world creates 2.5 quintillion bytes of data; 90% of the data in the world today was created in the last two years alone. The sources of this data include sensors, videos, posts, Twitter, WhatsApp, Facebook and many more digital sites with many users.
Big data Vs Traditional techniques of databases
Three V's define the characteristics of Big data very clearly.
Fig 1.1: 3 V's of Big Data
Fig 1.1 shows the three initial V's on which big data depends. Volume refers to any large amount of data that needs storage for analytics. As data is increasing exponentially, processing of data up to the YB scale may become possible, and companies can now think about solutions for it. The volume of data keeps growing: consultants predict that the amount of information in the world may grow to 25 ZB in 2020, that is, at an exponential rate.
An article could be a few kilobytes, a sound file a few megabytes, and a full-length movie a few gigabytes. Additional sources of information are being added on a continuous basis. For any company today, information is generated not only by its employees but also by its machines, such as CCTV cameras, punching machines and smart sensors.
More sources of information, each with a bigger size of data, combine to increase the amount of information that needs to be analysed. If we look around, a GB of storage on commodity systems costs next to nothing. Soon GBs will be replaced by TBs of data.
Velocity refers to the speed at which data is produced, which is increasing exponentially. The rate at which data is created and integrated keeps accelerating. We have moved from batch to real-time business.
Initially the trend was to analyse data by batch processing because the amount of data was large: data had to be submitted to a server and then we waited for it to be processed, so results were obviously delayed. The latest data sources are machines producing different types of data, which Big data can handle easily. Data is now processed on the server in real time, in a continuous fashion; the delivery of output also depends on the delay of the sources emitting the data.
There is no guarantee that data arrives at a machine in bulk; it may sometimes be slow. So, when there is a need to handle variation in the pace of data flow, Big data offers an easy and accurate solution.
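The difference between batch and continuous processing described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names and the sample readings are made up, and a running average stands in for whatever analysis a real system would perform.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until all data is collected, then process it once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each reading arrives."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


sensor_readings = [10.0, 20.0, 30.0, 40.0]  # made-up sample data

# One answer, available only after every reading has been submitted.
print(batch_average(sensor_readings))

# An up-to-date answer after each reading, however slowly readings arrive.
print(list(streaming_average(sensor_readings)))
```

The streaming version never needs the full data set in memory and always has a current result, which is why it copes better when the pace of incoming data varies.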
Variety refers to the different types of input required for information extraction. It is said that 80% of the world's data is unstructured, whereas traditional data handling techniques are designed mainly for structured data. Text (SMS), photo, audio, video, web, GPS data, sensor data, relational databases, documents, PDF, flash, etc. are all data that keep flowing and need to be controlled, stored and processed. Facebook, email services, etc. have no control over the input that any user may provide. The variety of data sources continues to increase. It includes:
Internet data (e.g., click stream, social media, and social networking links)
Primary research (e.g., surveys, experiments, observations)
Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)
Location data (e.g., mobile device data, geospatial data, GPS)
Image data (e.g., video, satellite image, surveillance)
Supply chain data (e.g., EDI, vendor catalogues and pricing, quality information)
Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
Fig 1.2: Additional V's
There are two additional V's that deserve the user's attention when describing the characteristics of Big data. The first concerns the trustworthiness of data (often called veracity). We all come across messy data such as Twitter hashtags and smileys mixed with text. All such data is difficult to handle when it needs to be mined, and Big data makes it easy to store. A hashtag (#) on Twitter is used to categorize a topic, so that meaningful or required data can be fetched at extraction time and remains trustworthy for users. Nowadays, every company wants surveys and performance analysis, which is why the hashtag is growing in popularity.
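As a rough illustration of how hashtags help categorize messy text, the Python sketch below groups posts by the hashtags they contain. It is not taken from the book: the sample posts, the `HASHTAG` pattern and the `group_by_hashtag` function are hypothetical, and the pattern is a simplification of how real platforms recognise hashtags.

```python
import re
from collections import defaultdict

# Simplified pattern: '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(posts):
    """Return a dict mapping each lower-cased hashtag to the posts using it."""
    groups = defaultdict(list)
    for post in posts:
        # A set avoids listing a post twice if it repeats the same tag.
        for tag in {t.lower() for t in HASHTAG.findall(post)}:
            groups[tag].append(post)
    return dict(groups)


posts = [  # made-up sample data
    "Loving the new phone :) #Gadgets",
    "Cluster is up and running #hadoop #bigdata",
    "Great talk on HDFS today #Hadoop",
]

groups = group_by_hashtag(posts)
print(sorted(groups))
print(groups["hadoop"])
```

Even though the posts mix smileys, free text and inconsistent capitalisation, the hashtag gives a reliable key for pulling out only the posts on a required topic.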
The second is value: data is of no use if it has no value. Big data provides value through specific mining that enhances the quality of data and reduces the time needed to process it.
1.1.2 Why Big Data
⚫ Understanding and Targeting Customers
Big data is gaining popularity, and it ties the latest technology in with existing technology. This creates a better understanding of customers. Companies continuously store a variety of data that is difficult to handle, from sensors, browser logs, social media, etc., so it is preferable to store the data first without assuming much about its format. It is readily used to predict the behaviour of machines as well as humans.
U.S. retailer Target predicted a customer's pregnancy before her father knew, based on the analysis of her shopping trends.
Using big data, telecom companies can now better predict customer churn.
Wal-Mart can predict what products will sell and where.
Car insurance companies understand how well their customers drive and what offers can be targeted at them next.
Government election campaigns can be optimized by using big data analytics as we all are aware of central election based on
⚫ Ease in Business process
As discussed earlier, predictions based on data can make business easier, especially in targeting customers. Big data is also increasingly used to optimize business processes. Any analytics process in business needs historical data to build an accurate model.
Retailers are now optimizing their stock based on predictions generated from social media, web trends and weather forecasts. They also predict the areas where companies should target the sale of goods.
People's movements can easily be tracked, as everyone is connected to GPS with logged data. We often see route optimization done with the help of data analytics.
HR departments are also not untouched by the exponential growth of Big data. The Moneyball style is used to optimize talent in any field.
⚫ Personal growth and Optimization
If we look around, we find that we ourselves are the ones targeted by companies seeking to increase their sales. Nowadays, companies sell many gadgets that track all the habits of their users, which is useful for personal growth as well.
We can now take advantage of data generated by devices such as wearables and smart bracelets.
The UP band from Jawbone is an activity tracker that collects data and analyses it to track calorie consumption and sleeping patterns. The company now holds 60 years' worth of individuals' sleep data, which can be used for business purposes as well as for the personal growth of individuals.
Processing large amounts of data yields analysis for individual users: online dating sites, matrimonial sites and recommendation engines are all based on such analysis. More data gives more accurate results.
⚫ Health improvement
Big data allows us to predict and analyse patterns that are useful in curing disease. DNA data analysis is one example. Companies hold health data flowing from wearables such as watches and bands, and patterns recognized in it can help treat disease in many individuals. Many antibiotics follow the same pattern in diagnosing and curing disease. Computation on DNA allows us to understand diseases better and find better cures. Big data techniques are already being used to monitor premature babies, providing prescriptions and suggestions by recording and analysing every heartbeat and breathing pattern of the baby; by analysing these patterns, infections can be predicted. Algorithms have been developed that predict the cure of an infection based on such patterns. Big data analytics also allows the development of epidemics and disease outbreaks to be monitored and predicted.
Social media is also very useful for predicting upcoming disease, using comments posted on Twitter or Facebook. The arrival of sensitive viruses can also be predicted before they enter a place. The Zika virus is an example of predictive analysis in the medical field using social media.
⚫ Improving Sports Performance
Many sports are turning to Big data for accurate prediction. Most elite sports have now embraced big data analytics.
The IBM SlamTracker tool is used for tennis tournaments.
Video analytics track the performance of individual players in a football or baseball game.
Sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and the cloud) on our game and how to improve it.
Many sports teams also track athletes outside the sporting environment to learn their patterns and habits.
⚫ Improving Science and Research
Science and research are also not untouched by Big data analytics, which is producing new opportunities and possibilities. An example is CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the world's largest and most powerful particle accelerator. The CERN data center has 65,000 processors to analyse its 30 petabytes of data. It uses the computing power of thousands of computers distributed across 150 data centers worldwide to analyse the data. Such a powerful setup can process data for use in research and development.
⚫ Enhancing and Optimizing Device Performance
Big data analytics helps machines and devices to become smarter and more independent. Google's self-driving car is a well-known example: a Toyota Prius fitted with cameras, GPS, powerful computers and sensors so that it can drive safely on the road without human intervention. Such devices can be trained as intelligent systems only when a large amount of data is available. They are also capable of taking real-time decisions to handle situations.
⚫ Improving Security Features
Big data is applied to improve security and enable law enforcement. The NSA (National Security Agency) uses data to foil terrorist plots and to spy on them. Big data is also used against cyber-attacks. With a large amount of behavioural data to analyse, security concerns can be tracked easily. Police departments can also use fraud detection to catch criminals, especially in cases involving internet dealings.
⚫ Improving and Optimizing Cities and Countries
Big data is used to improve many aspects of cities and countries. Governments are serious about developing smart cities, and making any city smarter requires analysing bulk amounts of data, such as traffic flow, weather data and information from many sensors, in order to take appropriate decisions. Such analysis will also help to reduce man-made problems.
⚫ Financial Trading
Big data is used in high-frequency trading, where wise decisions must be taken based on intelligent algorithms. In real trading scenarios, raw data comes from social media, and it mostly helps in deciding whether to buy, sell or hold.
1.2 Hadoop
The world faces two problems:
(i) Data Storage
(ii) Data Analysis
Data that we cannot collect is wasted, so data must be stored in a way that scales out. The traditional way of collecting data on the server side requires special maintenance and has its own limitations; this is called the scale-in property, whereas the scale-out property relies on commodity hardware to store data.
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers by using a simple programming model. It is open source.
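The simple programming model that Hadoop is known for is MapReduce, the subject of Chapter 5. To give a feel for the idea, the sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine. It is an assumption-laden illustration, not Hadoop's API: the function names and input lines are made up, and a real Hadoop job would run the map and reduce steps in parallel across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


lines = ["big data needs big storage", "hadoop stores big data"]  # sample input
print(word_count(lines))
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread over many commodity machines, which is what makes the model fit the scale-out approach described above.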
1.2.1 History of Hadoop
2002: Doug Cutting, a graduate of Stanford University, and Mike Cafarella, Associate Professor at the University of Michigan, started working on Nutch.
Doug Cutting
Mike Cafarella