
Big Data & Hadoop: Learn by Example (Mayank Bhushan)


DOCUMENT INFORMATION

Basic information

Title: Big Data & Hadoop Learn by Example
Author: Mayank Bhushan
Subject: Big Data and Hadoop
Type: Book
Year of publication: 2020
City: New Delhi
Pages: 721
File size: 10.63 MB


BIG DATA & HADOOP

Learn by Example

by

Mayank Bhushan

FIRST EDITION 2020

Copyright © BPB Publications, INDIA

ISBN: 978-93-8655-199-3

All Rights Reserved. No part of this publication can be stored in a retrieval system or reproduced in any form or by any means without the prior written permission of the publishers.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY

The Author and Publisher of this book have tried their best to ensure that the programmes, procedures and functions described in the book are correct. However, the author and the publishers make no warranty of any kind, expressed or implied, with regard to these programmes or the documentation contained in the book. The author and publisher shall not be liable in any event of any damages, incidental or consequential, in connection with, or arising out of the furnishing, performance or use of these programmes, procedures and functions. Product names mentioned are used for identification purposes only and may be trademarks of their respective companies.

All trademarks referred to in the book are acknowledged as properties of their respective owners.

Distributors:

Shop No. 5, Mahendra Chambers,

150 DN Rd., Next to Capital Cinema,

V.T. (C.S.T.) Station, MUMBAI-400 001

Ph: 22078296/22078297

This book promises to be a very good starting point for beginners and an asset to advanced users too.

This book is written to follow the syllabi of various universities, and its aim is to keep to a “learning with example” approach. Difficult concepts of Big Data and Hadoop are presented in an easy and practical way, so that students can understand them efficiently. This book provides

screenshots of practical approaches which can be helpful for

in the subsequent editions of this book

23rd March 2018

Mayank Bhushan

I would like to express my gratitude to all those who provided support, talked things over, read, wrote, offered comments, allowed me to quote their remarks and assisted in the editing, proofreading and design.

I have relied on many people to guide me, directly and indirectly, in writing this book. I am very thankful to the Hadoop community, from whom I have learned through continuous effort, and I also owe a debt of gratitude to ABES College for providing me with all the facilities for the Big Data-Hadoop lab.

There is always a sense of gratitude that everyone expresses to others for the helpful and much-needed services they render during difficult phases of life and in achieving the goals already set.

It is impossible to thank everyone individually, but I am making a humble effort here to thank some of them. At the outset, I am thankful to the Almighty, who is constantly and invisibly guiding everybody and has also helped me to work on the right path.

I am very thankful to Prof. (Dr.) Shailesh Tiwari, H.O.D. (CSE), ABES Engineering College, Ghaziabad (U.P.), for guiding and supporting me. He is my main source of inspiration. I would also like to thank Dr. Munesh Chandra Trivedi, Dean (REC-Azamgarh), Dr. Pratibha Singh (Prof., ABES Engineering College) and Dr. Shaswati Banerjea, Asst. Prof. (MNNIT Allahabad), who always support me everywhere. Without their help this book would not have been possible. I am indebted for technical help to my dearest friend and colleague Mr. Omesh Kumar, who guided me through every technical problem.

I offer my thanks to all my gurus, friends and colleagues who helped and kept me motivated while writing this text. Special thanks to:

Dr. K.K. Mishra, MNNIT Allahabad

Dr. Mayank Pandey, MNNIT Allahabad

Dr. Shashank Srivastava, MNNIT Allahabad

Mr. Nitin Shukla, MNNIT Allahabad

Mr. Suraj Deb Barma, Govt. Polytechnic College, Agartala

Dr. A.L.N. Rao, GL Bajaj, Greater Noida

Mr. Ankit Yadav, Mr. Desh Deepak Pathak, ABES EC Ghaziabad

Dr. Sumit Yadav, IP University

Mr. Aatif Jamshed, Galgotia College, Greater Noida

I also thank the Publisher and the whole staff at BPB Publications, especially Mr. Manish Jain, for bringing this text into a nice, presentable form.

Finally, I want to thank everyone who has directly or indirectly contributed to completing this authentic work.

Mayank Bhushan

Table of Contents

Chapter 1: Big Data-Introduction and Demand

1.1 Big Data

1.1.1 Characteristics of Big Data

1.1.2 Why Big Data

1.2 Hadoop

1.2.1 History of Hadoop

1.2.2 Name of Hadoop

1.2.3 Hadoop Ecosystem

1.3 Convergence of Key Trends

1.3.1 Convergence of Big Data into Business

1.3.2 Big data Vs other techniques

1.4 Unstructured Data

1.5 Industry examples of Big data

1.5.1 Use of Big data-Hadoop at Yahoo

1.5.2 In RackSpace for log processing

1.5.3 Hadoop at Facebook

1.6 Usages of Big Data

1.6.1 Web analytics

1.6.2 Big Data and marketing

1.6.3 Big data and fraud

1.6.4 Risk management in Big Data with Credit card

1.6.5 Big data and algorithm trading

1.6.6 Big data in Healthcare

Chapter 2: NoSQL Data Management

2.1 Introduction to NoSQL database

2.1.1 Terminology used in NoSQL and RDBMS


2.1.2 Database use in NoSQL

2.4.6 Hbase Shell Commands

2.4.7 The different usages of scan command

2.4.8 Terminologies

Chapter 3: Basics of Hadoop

3.1 Data Format

3.2 Analysing data with Hadoop

3.3 Scale-in Vs Scale-out

3.3.1 Number of reducers used

3.3.2 Driver class with no reducer

3.6.4 Low-latency data access

3.6.5 Lots of small files

3.6.6 Arbitrary file modifications

3.7 HDFS Concept

3.7.1 Blocks

3.7.2 Namenodes and Datanodes

3.7.3 HDFS group

3.7.4 All time availability

3.8 Hadoop Files System

3.9 Java Interface

3.9.1 HTTP

3.9.2 C

3.9.3 FUSE (File System in Userspace)

3.9.4 Reading data using Java interface (URL)

3.9.5 Reading data using java interface (File System API)


3.14 Avro file based data structure

3.14.1 Data type and schemas

3.14.2 Serialization and deserialization

4.3 Fully Distributed Mode

Chapter 5: MapReduce Applications

5.1 Understanding of MapReduce

5.2 Traditional Way

5.3 MapReduce Workflow


5.3.1 Map Side

5.3.2 Reduce Side

5.4 Unit Test with MRUnit

5.4.1 Testing Mapper Class

5.4.2 Testing Reducer Class

5.4.3 Testing Driver Class of Program

5.4.4 Test output of program

5.5 Test Data and Local Data Check

5.5.1 Debugging MapReduce Job

5.5.2 Job Control

5.6 Anatomy of MapReduce Job

5.6.1 Anatomy of File Write

5.6.2 Anatomy of File Read

5.6.3 Replica Management

5.7 MapReduce Job Run

5.7.1 Classic MapReduce (MapReduce 1)


6.6.5 Delete specific cell in table

6.6.6 Delete all cells in a table

6.6.7 Scanning using HBase shell

6.6.13 Scope operator for alter table

6.6.14 Deleting column family

6.6.15 Existence of table

6.6.16 Dropping a table

6.6.17 Drop all table

6.7 HBase using Java APIs

6.7.1 Creating table

6.7.2 List of the tables in HBase

6.7.3 Disable a table

6.7.4 Add column family

6.7.5 Deleting column family

6.7.6 Verifying existence of table

6.7.7 Deleting table

6.9.4 Basic CLI commands

6.10 Cassandra Data Model

6.10.1 Super Column family


7.7.4 User Defined Functions

7.8 Developing and Testing PigLatin Script


7.10 Data Type and File Format

7.11 Comparison of HiveQL with Traditional Database

7.12 HiveQL

7.12.1 Data Definition Language

7.12.2 Data Manipulation Language

7.12.3 Example for practice

Chapter 8: Practical & Research based Topics

8.1 Data Analysis with Twitter

8.1.1 Using flume

8.1.2 Data Extraction using Java

8.1.3 Data Extraction using Python

8.2 Use of Bloom Filter in MapReduce

8.2.1 Function of Bloom filter

8.2.2 Working of bloom filter

8.2.3 Application of Bloom filter

8.2.4 Implementation of Bloom filter in MapReduce

8.3 Amazon Web Service

8.3.1 AWS

8.3.2 Setting AWS

8.3.3 Setting up Hadoop on EC2

8.4 Document Archived from NY Times

8.5 Data Mining in Mobiles


Appendix: Hadoop Commands

Chapter wise Questions

Previous Year Question Paper


CHAPTER 1

Big Data-Introduction and Demand

“…Data is useless without the skill to analyse it.”

- Jeanne Harris, senior executive at Accenture Institute for High Performance

“Taking a hunch you have about the world and pursuing it in a structural, mathematical way to understand something new about the world.”

- Hilary Mason, American data scientist and the founder of technology start-up Fast Forward Labs

1.1 Big Data

Today we are all surrounded by huge amounts of data. We humans are ourselves an example of big data, as we are surrounded by devices and generate data every minute.

“I spend most of my time assuming the world is not ready for the technology revolution that will be happening to them soon,”

Eric Schmidt, Executive Chairman, Google

As a matter of fact, if we compare the present situation with the past, we find that we are creating as much information in just two days as we did up to 2003. That means we are creating five exabytes of data every two days.

The real problem is the data that users generate continuously. When it comes to data analysis, storing and analysing that data is a challenge.

“The real issue is user-generated content,”

Schmidt

This mostly helps Google, which analyses the data and sells data analytics to companies that require it. We produce data through our mobile phones alone, since we are logged in from the moment we buy the device:

Maps: collect data about our travel.

Apps: gather information about our mood swings and record the activities in which we are involved most of the time.

E-commerce sites: collect information about our requirements and show us what we are likely to buy.

Emails: produce data about our requirements based on our conversations, since all conversations generally pass through the companies that own the mailing addresses.

During the past few decades, technologies such as remote sensing, geographical information systems and global positioning systems have transformed the way the distribution of the human population is mapped across the world. Even so, such population data still has to be turned into meaningful surveys, a task carried out by big companies. As a result, spatially detailed changes across scales of days, weeks or months, or even year to year, are difficult to assess, which limits the application of human population maps in situations where timely data is needed, such as disasters, conflicts or epidemics. With information being collected daily by mobile network providers across the planet, the prospect exists of mapping up-to-date and ever-changing human population distributions over comparatively short intervals, paving the way for new applications and a near real-time understanding of patterns and processes in human science.

Some facts related to exponential data production:

Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At that point, predicted data production will be 44 times greater than in 2009.

In 2012, 2.5 quintillion bytes of data were generated daily, and 90% of current data worldwide originated in the past two years.

Facebook alone stores, accesses, and analyses 30+ PB of user-generated data.

In 2008, Google was processing 20,000 TB of data daily.

Walmart processes over 1 million customer transactions, generating an estimated 2.5 PB of data or more.

More than 5 billion people worldwide call, text, tweet, and browse

value is expected to increase at an average annual rate of 13% over the next four years to exceed 143 billion by the end of 2016.

Boston.com reported that in 2013, approximately 507 billion e-mails were sent daily. Currently, an e-mail is sent every 3.5 × 10⁻⁷ seconds. Thus, the volume of data increases every second because of rapid data generation.

By 2020, enterprise data is expected to total 40 ZB, according to International Data Corporation.

The New York Stock Exchange generates about one terabyte of new trade data.

Based on this estimation, business-to-consumer (B2C) and internet-business-to-business (B2B) transactions will amount to 450 billion per day.

All these facts are sufficient to show that the world is generating a large amount of data that is not structured. This calls for innovation and new thinking that can provide solutions to these issues.

Big data is the approach used to deal with the current scenario: it is the concept of handling unstructured and structured data in ways other than the traditional one.

Table 1.1: Introduction of data

Table 1.1 shows the flow of data from bottom to top. Today, any type of data can be stored and processed.

1.1.1 Characteristics of Big Data

Big data is data that demands thinking beyond the traditional database system. The data handled in big data may be structured or unstructured and comes in huge volumes, so it requires faster movement, storage and processing than conventional database techniques can offer. These processing requirements call for tools that work quickly and produce meaningful results, which is difficult with any traditional database tool.

The properties of big data offer a next-generation way to handle the situation and give organizations an easy and efficient way to handle data. All around us, a great many devices are continuously generating data at an exponentially increasing rate, and people are immersing themselves in social networking. These kinds of unstructured and structured data create challenges for storing and processing data.

Every day, the world creates 2.5 quintillion bytes of data, and 90% of the data in the world today was created in the last two years alone. The sources of this data include sensors, videos, posts, Twitter, WhatsApp, Facebook and many other digital sites with many users.

Big data Vs Traditional techniques of databases


Three V's define its characteristics clearly.

Fig 1.1: 3 V's of Big Data

Fig 1.1 shows the three initial V's on which big data depends. Volume refers to any large amount of data that needs to be stored for analytics. As data is increasing exponentially, processing of up to yottabytes (YB) of data may become possible, and companies can now plan for it with a solution. The volume of data keeps growing: consultants predict that the amount of information in the world may grow to 25 ZB in 2020, at an exponential rate of increase.

An article could be a few kilobytes, a sound file a few megabytes and a full-length movie a few gigabytes. More sources of information are being added continuously. For any company today, information is generated not only by its employees but also by its machines, such as CCTV cameras, punching machines and smart sensors.

More sources of information combined with larger data sizes increase the amount of information that needs to be analysed. If we look around, a gigabyte of data costs almost nothing on commodity systems, and gigabytes will soon be replaced by terabytes.
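The storage units mentioned in this section, from kilobytes up to yottabytes, can be related with a small helper. The following is only an illustrative sketch: it assumes decimal (SI) prefixes, where each step is a factor of 1,000, and the names `to_bytes` and `humanize` are hypothetical, not taken from the book.

```python
# Decimal (SI) storage units. Assumption: each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes (hypothetical helper)."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

# 2.5 quintillion bytes, the daily figure quoted in this chapter
print(humanize(2.5e18))
# The 25 ZB forecast expressed in exabytes
print(to_bytes(25, "ZB") / to_bytes(1, "EB"))
```

Seen this way, the daily figure of 2.5 quintillion bytes is 2.5 exabytes, which makes it easier to compare with the zettabyte-scale forecasts quoted above.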

Velocity refers to the speed at which data arrives, which is increasing exponentially. Data is being created and integrated at an ever-accelerating pace. We have moved from batch to real-time business.

Initially the trend was to analyse data in batches because the amount of data was large: data had to be submitted to a server and one then waited for it to be processed, so results were inevitably delayed. The latest data sources include different types of machine-generated data, which big data can handle easily. Data is now processed on the server in real time, in a continuous fashion, although the delivery of output also depends on delays at the sources emitting the data.

There is no guarantee that data arrives at a machine in bulk; at times it may arrive slowly. Big data offers an easy and accurate solution when variations in the pace of data flow have to be handled.
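The difference between batch and real-time processing described above can be sketched in a few lines. This is a plain-Python illustration rather than Hadoop code; the sensor readings and the function names are made up for the example.

```python
from typing import Iterable, Iterator, List

def batch_average(readings: List[float]) -> float:
    """Batch style: wait until all data has been collected, then process once."""
    return sum(readings) / len(readings)

def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each reading arrives."""
    total = 0.0
    for count, value in enumerate(readings, start=1):
        total += value
        yield total / count  # a result is available after every reading

readings = [10.0, 20.0, 30.0]  # hypothetical sensor values
print(batch_average(readings))            # one answer, after everything arrived
print(list(streaming_average(readings)))  # a running answer per reading
```

The streaming version never needs the full data set in memory, which is why it copes better when data arrives at an uneven pace.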

Variety refers to the different types of input required for information extraction. It is said that 80% of the world's data is unstructured, whereas traditional data-handling techniques mainly offer options for structured data. Text (SMS), photos, audio, video, web, GPS data, sensor data, relational databases, documents, PDF, Flash, etc. are the kinds of data that are flowing in and need to be controlled so they can be stored and processed. Facebook, email services, etc. have no control over the input that any user may provide. The variety of data sources continues to increase. It includes:

Internet data (e.g., click stream, social media, and social networking links)

Primary research (e.g., surveys, experiments, observations)

Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)

Location data (e.g., mobile device data, geospatial data, GPS)

Image data (e.g., video, satellite image, surveillance)

Supply chain data (e.g., EDI, vendor catalogues and pricing, quality information)

Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)

Fig 1.2: Additional V's

Two additional V's, concerning the trustworthiness and the value of data, are also worth attention when describing the characteristics of big data. We can all see the messiness of the data around us, such as Twitter hashtags or smileys mixed with text. All such data is very difficult to handle when it needs to be mined, and big data makes it easy to store. A hashtag (#) on Twitter is used to categorize a topic, so that meaningful or required data can be fetched at extraction time and remains trustworthy for users. Nowadays every company wants surveys and performance analysis of its own, which is why the hashtag is growing in popularity.
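As a rough illustration of how hashtags help categorize messy text, the sketch below groups a few made-up posts by the hashtags they contain. It uses only Python's standard `re` module; the sample posts and the function name `group_by_hashtag` are assumptions for the example, not part of any Twitter API.

```python
import re
from collections import defaultdict

# A hashtag is taken here to be '#' followed by word characters (a simplification).
HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(posts):
    """Map each hashtag (lower-cased) to the posts that mention it."""
    groups = defaultdict(list)
    for post in posts:
        # Use a set so a post repeating the same tag is only listed once.
        for tag in {t.lower() for t in HASHTAG.findall(post)}:
            groups[tag].append(post)
    return dict(groups)

posts = [  # made-up sample posts
    "Loving the new phone :) #Gadgets",
    "Traffic is terrible today #commute",
    "Battery lasts two days! #gadgets #review",
]
for tag, tagged in sorted(group_by_hashtag(posts).items()):
    print(tag, len(tagged))
```

Once posts are grouped this way, only the group relevant to a survey or performance analysis needs to be mined, rather than the whole stream.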

Data is of no use if it has no value. Big data therefore supports targeted mining that extracts value, improving the quality of data and the time taken to process it.

1.1.2 Why Big Data

⚫ Understanding and Targeting Customers

Big data is gaining popularity today, and it aligns itself with the latest technology while staying in sync with existing technology. This makes it easier to understand customers. Companies are continuously storing a variety of data that is difficult to handle, from sensors, browser logs, social media, etc., so it is preferable to store the data first without expecting much of its format. It is readily used to predict the behaviour of machines as well as humans.

U.S. retailer Target predicted a customer's pregnancy before her father knew of it, by analysing her shopping trends.

Using big data, telecom companies can now better predict customer churn.

Wal-Mart can predict what products will sell and where.

Car insurance companies understand how well their customers drive and what offers to target them with next.

Government election campaigns can be optimized by using big data analytics as we all are aware of central election based on

⚫ Ease in Business process

As discussed earlier, prediction from data can make business easier, especially when targeting customers. Big data is also increasingly used to optimize business processes. Any analytics process in business needs historical data to build an accurate model.

Retailers are now optimizing their stock based on predictions generated from social media, web trends and weather forecasts. They also predict which areas companies should target for selling goods.

People's movements can easily be tracked, since they are all linked to logged GPS data. In many cases, routes are optimized with the help of data analytics.

HR departments are also not untouched by the exponential growth of big data. A “Moneyball” style of analytics can optimize talent in any field.

⚫ Personal growth and Optimization

If we look around, we find that we ourselves are the ones companies target to increase their sales. Nowadays companies sell many gadgets that track all the habits of their users, which is useful for personal growth as well.

We can now take advantage of data-generating devices such as wearables and smart bracelets.

The UP band from Jawbone is an activity tracker that collects data on calorie consumption and sleeping patterns and processes it. The company now holds 60 years' worth of sleep data from individuals, which can be used for business purposes and also for the personal growth of those individuals.

Processing large amounts of data makes analysis available to individual users: online dating sites, marriage sites and recommendation engines are all based on such analysis. More data gives more accurate results.

⚫ Improving Health

Big data makes it possible to predict and analyse patterns that are useful in curing disease; DNA data analysis is one example. Companies that hold health data flowing in from wearable watches, bands, etc. can recognize patterns in it to help treat disease in many individuals. Many antibiotics follow the same pattern in diagnosing and curing disease. Computation on DNA allows us to understand diseases and find better cures. Big data techniques are already being used to monitor premature babies, providing prescriptions and suggestions by recording and analysing every heartbeat and breathing pattern of the baby; by analysing these patterns, infections can be predicted. An algorithm has been developed that predicts the cure for an infection based on such patterns. Big data analytics also allows the development of epidemics and disease outbreaks to be monitored and predicted.

Social media is also very useful for predicting upcoming diseases, using the comments posted on Twitter or Facebook. Sensitive viruses can also be predicted before they reach a place. The Zika virus is an example of predictive analysis in the medical field using social media.

⚫ Improving Sports Performance

Many sports have taken an interest in big data for its accurate predictions. Most elite sports have now embraced big data analytics.

The IBM SlamTracker tool is used for tennis tournaments.

Video analytics track the performance of individual players in a football or baseball game.

Sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and the cloud) on our game and how to improve it.

Many sports teams also track athletes outside the sporting environment, following their patterns and habits.

⚫ Improving Science and Research

Science and research are also not untouched by big data analytics, which is creating new opportunities and possibilities. One example is CERN, the Swiss nuclear physics lab, with its Large Hadron Collider, the world's largest and most powerful particle accelerator. The CERN data center has 65,000 processors to analyse its 30 petabytes of data. It uses the computing power of thousands of computers distributed across 150 data centers worldwide to analyse the data. Such a powerful setup can put data processing to use in research and development.

⚫ Enhancing and Optimizing Device Performance

Big data analytics helps machines and devices become smarter and more independent. A well-known example is Google's self-driving car: the Toyota Prius is fitted with cameras, GPS, powerful computers and sensors to drive safely on the road without human intervention. Such devices can be trained as intelligent systems only when a large amount of data is available. They are also capable of taking real-time decisions to handle situations.

⚫ Improving Security Features

Big data is applied to improving security and enabling law enforcement. The NSA (National Security Agency) uses data to foil terrorist plots and to spy on them. Big data is also used in dealing with cyber-attacks. With large amounts of behaviour-analysis data, security concerns can be tracked easily. Police departments can also use fraud detection to catch criminals, especially in cases involving internet dealings.

⚫ Improving and Optimizing Cities and Countries

Big data is used to improve many aspects of cities and countries. The government is very serious about managing smart cities in the country, and making any city smarter requires analysing large amounts of data, such as traffic flow, weather data and information from many sensors, in order to take appropriate decisions. Such analysis will also help to reduce man-made problems.

⚫ Financial Trading

Big data is used in trading, particularly high-frequency trading, which needs wise decisions based on intelligent algorithms. In real trading scenarios, raw data comes from social media, and it mostly helps in deciding whether to buy, sell or hold.

1.2 Hadoop

The world faces two problems:

(i) Data Storage

(ii) Data Analysis

Data that we cannot collect goes to waste, so data needs to be stored on systems with the scale-out property. The traditional way of collecting data on the server side requires special maintenance and has its own limitations; this is called the scale-in property. The scale-out property, by contrast, relies on commodity hardware to store data.

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers by using a simple programming model. It is open source.
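The "simple programming model" referred to here is MapReduce, which the book covers in Chapter 5. As a minimal sketch of the idea, the following plain-Python word count imitates the map, shuffle and reduce steps in a single process. It does not use the Hadoop API; all function names and the sample input are illustrative assumptions, and on a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

lines = ["big data needs big storage", "hadoop stores big data"]  # sample input
print(word_count(lines))
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread across many commodity machines, which is what makes the scale-out approach practical.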


1.2.1 History of Hadoop

2002: Doug Cutting, a graduate of Stanford University, and Mike Cafarella, an associate professor at the University of Michigan, started working on Nutch.

Doug Cutting

Mike Cafarella

Date posted: 16/09/2022, 22:17