

Andy Oram

Managing the Data Lake

Moving to Big Data Analysis


Managing the Data Lake

by Andy Oram

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery

September 2015: First Edition

Revision History for the First Edition

2015-09-02: First Release

2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. Cover photo credit: “55 Flying Fish” by Michal (Flickr).


Table of Contents

Moving to Big Data Analysis
Why Companies Move to Hadoop
Acquisition and Ingestion
Metadata (Cataloguing)
Data Preparation and Cleaning
Managing Workflows
Access Control
Conclusion


Moving to Big Data Analysis

Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization’s data lake? Can you tell whether it’s healthy enough to support your business needs?

An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools—especially in the Hadoop family—to extract value quickly and inform key organizational decisions. This report looks at the common needs that modern organizations have for data management and governance.

The MapReduce model—introduced in 2004 in a paper by Jeffrey Dean and Sanjay Ghemawat (http://bit.ly/1hJyJzi, PDF)—completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.

I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.

Essentially, you’ll need to take care of challenges that never came up with traditional relational databases and data warehouses, or that were handled by the constraints that the relational model placed on data. There is wonderful value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don’t fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you’re handling.

The risk of the new tools is having many disparate sources of data—and perhaps multiple instances of Hadoop or other systems offering analytics operating inefficiently—which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer.

The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.

The main topics covered in this report are:

Acquisition and ingestion

Data nowadays comes from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can’t force everyone to provide the data in a format that’s convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into them. The problems of data acquisition and ingestion have to be solved with a degree of automation.

Metadata (cataloguing)

Questions such as who provided the data, when it came in, and how it was formatted—a slew of concerns known as lineage or provenance—are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.


Data preparation and cleaning

Just as you can’t control incoming formats, you can’t control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning takes up 90% of their time; a minimal cleaning sketch appears after this list.

Access control

Data is the organization’s crown jewels. You can’t give everybody access to all data. In fact, regulations require you to restrict access to sensitive customer data. Security and access controls are therefore critical at all stages of data handling.
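To make the cleaning task above concrete, here is a minimal sketch in Python. The field names (“timestamp”, “amount”) and the set of accepted date formats are illustrative assumptions, not taken from the report.

```python
from datetime import datetime

# Incoming feeds often express the same value in diverse formats.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def normalize_date(raw: str):
    """Try each known format; return an ISO date, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # anomaly: flag for review rather than guessing

def clean_record(record: dict):
    """Return the cleaned record plus a list of anomalies found in it."""
    anomalies = []
    cleaned = dict(record)
    date = normalize_date(str(record.get("timestamp", "")))
    if date is None:
        anomalies.append("unparseable timestamp")
    else:
        cleaned["timestamp"] = date
    if not record.get("amount"):
        anomalies.append("missing amount")
    return cleaned, anomalies
```

In practice, checks like these run inside the ingestion pipeline, and records with anomalies are routed to a quarantine area for review rather than silently dropped.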

Why Companies Move to Hadoop

To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.

Size

“Volume” is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, such data processing using conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.


Sometimes you also want to export data from Hadoop to an RDBMS or other repository.

Free-form data

Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.

Streaming data

If you don’t keep up with changes in the world around you, it will pass you by—and probably reward a competitor who does adapt to it. Streaming has evolved from a few rare cases, such as stock markets and sensor data, to everyday data such as product usage data and social media.

Fitting the task to the tool

Data maintained in relational databases—let alone cruder storage formats, such as spreadsheets—is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined into the single format that the analytics engine can efficiently process, as the sketch below illustrates.
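As a rough illustration of that preparation step, the following PySpark sketch joins two normalized tables into one wide table. The paths, table layout, and column names are hypothetical, not taken from the report.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalize").getOrCreate()

orders = spark.read.parquet("/lake/orders")          # one row per order
details = spark.read.parquet("/lake/order_details")  # many rows per order

# Combine the normalized tables into the single wide layout that an
# analytics engine can scan efficiently, then store it in columnar form.
wide = orders.join(details, on="order_id", how="inner")
wide.write.mode("overwrite").parquet("/lake/orders_denormalized")
```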

Frequent failures

Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You’ll want to get notifications when a job finishes successfully or unsuccessfully. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.

Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn’t ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.


A failure to collect or store data may be an extreme example of management problems, but other hindrances—such as storing it in a format that is hard to read, or failing to remember when it arrived—will also slow down processing to the point where you give up opportunities for learning insights from your data.

When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can’t afford to spend time and resources doing such coding in order to integrate them.

In systems with large, messy data, you have to decide what the system should do when input is bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job?

A minor error such as a missing ZIP code probably shouldn’t stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we’ll see.)

Your choice depends, of course, on your goal. If you’re counting sales of a particular item, you don’t need the customer ID. If you want to update customer records, you probably do.
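One way to encode such a skip/fix/abort policy is sketched below, assuming hypothetical field names and an arbitrary 10% threshold for aborting; the report does not prescribe this design.

```python
class AbortJob(Exception):
    """Raised when the input is too damaged for the job to continue."""

def handle_record(record: dict, goal: str):
    """Return a (possibly patched) record, or None to skip it."""
    if not record.get("zip_code"):
        record["zip_code"] = "00000"   # minor error: patch and keep the record
    if not record.get("customer_id") and goal == "update_customers":
        return None                    # fatal for this goal: skip the record
    return record                      # for, e.g., counting sales, still usable

def run_job(records: list, goal: str, max_skipped_ratio: float = 0.1):
    kept, skipped = [], 0
    for rec in records:
        out = handle_record(dict(rec), goal)
        if out is None:
            skipped += 1
        else:
            kept.append(out)
    # Too many discarded records suggests a systemic problem: abort the job.
    if records and skipped / len(records) > max_skipped_ratio:
        raise AbortJob(f"{skipped} of {len(records)} records unusable")
    return kept
```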

A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.
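For a feed that carries a header row, detecting such a reordering can be as simple as the following sketch; the expected column names are, again, assumptions for illustration.

```python
import csv

EXPECTED = {"customer_id", "zip_code", "amount", "timestamp"}

def read_with_remap(path: str):
    """Yield records as dicts, tolerating columns that arrive in a new order."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        if set(header) != EXPECTED:
            # Unknown or missing columns: adjusting would be guesswork, so abort.
            raise ValueError(f"unrecognized header: {header}")
        for row in reader:
            yield dict(zip(header, row))  # field order no longer matters
```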

At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You’ll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.


Acquisition and Ingestion

At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.

On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.
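For the Parquet case, such a conversion can be a few lines with the pyarrow library, as in this sketch (the file paths are placeholders; Avro would need a different library):

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv("incoming/feed.csv")    # infers column names and types
pq.write_table(table, "lake/feed.parquet")  # columnar format Hadoop tools read well
```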

The health care field provides a particularly complex data collection case. You may be collecting:

• Electronic health records from hospitals using different formats

• Claims data from health care providers or payers

• Profiles from health plans

• Data from individuals’ fitness devices

Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several

used in the computer field

In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You’ll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.
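A registration layer along those lines might look like the following sketch. The method names, stub bodies, and feed parameters are hypothetical.

```python
def ingest_hl7(source: str, destination: str): ...     # stub: parse HL7 messages
def ingest_claims(source: str, destination: str): ...  # stub: load claims data

METHODS = {"hl7": ingest_hl7, "claims": ingest_claims}

def register_feed(method: str, source: str, destination: str, frequency: str):
    """Validate the parameters for a new data feed and return its config."""
    if method not in METHODS:
        raise ValueError(f"unknown ingestion method: {method}")
    return {"run": METHODS[method], "source": source,
            "destination": destination, "frequency": frequency}

feed = register_feed("hl7", "sftp://hospital-a/outbox", "/lake/raw/hl7", "hourly")
feed["run"](feed["source"], feed["destination"])
```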

Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn’t contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.


As mentioned earlier, you may be able to anticipate how incoming data changes—such as reordered fields—and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.

Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together—for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that, prior to processing these small files, they be combined into a single large file to leverage the Hadoop cluster more efficiently.
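A compaction step can be as plain as concatenating the small files before they reach the cluster, as in this sketch (the paths are placeholders; at scale, tools such as hadoop fs -getmerge or a dedicated compaction job do the same work):

```python
import glob
import shutil

def compact(pattern: str, output: str) -> int:
    """Concatenate every file matching `pattern` into one file; return the count."""
    paths = sorted(glob.glob(pattern))
    with open(output, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    return len(paths)

merged = compact("incoming/events-*.log", "staging/events-merged.log")
print(f"merged {merged} small files")
```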

This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won’t be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.

However, current open source tools don’t do everything you need. You’ll have to fill in the gaps with commercial solutions or hand-crafted scripts.

For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop and supports incremental loads. However, building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni’s Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.
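The report points to a commercial product for this; purely as an illustration of the core reconcile logic such a sync needs, here is a sketch that applies an ordered insert/update/delete change feed to a snapshot keyed by primary key. The feed format is an assumption, not Bedrock’s or Sqoop’s.

```python
def apply_changes(snapshot: dict, changes: list) -> dict:
    """Apply an ordered feed of I/U/D operations to a keyed snapshot.

    Each change is a dict like {"op": "I" | "U" | "D", "key": ..., "row": ...}.
    """
    result = dict(snapshot)
    for change in changes:
        if change["op"] in ("I", "U"):
            result[change["key"]] = change["row"]   # upsert the row
        elif change["op"] == "D":
            result.pop(change["key"], None)         # tolerate already-deleted keys
    return result

base = {1: {"name": "Ada"}, 2: {"name": "Brian"}}
feed = [{"op": "U", "key": 2, "row": {"name": "Bryan"}},
        {"op": "D", "key": 1, "row": None},
        {"op": "I", "key": 3, "row": {"name": "Chandra"}}]
print(apply_changes(base, feed))  # {2: {'name': 'Bryan'}, 3: {'name': 'Chandra'}}
```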

Metadata (Cataloguing)

Why do you need to preserve metadata about your data? Reasons for doing so abound:
