Managing the Data Lake
Moving to Big Data Analysis
Andy Oram
Managing the Data Lake
by Andy Oram
Copyright © 2015 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery
September 2015: First Edition
Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Cover photo credit: "55 Flying Fish" by Michal (Flickr)
978-1-491-94168-3
[LSI]
Chapter 1. Moving to Big Data Analysis
Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization's data lake? Can you tell whether it's healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called "data lakes." They use a variety of storage and processing tools — especially in the Hadoop family — to extract value quickly and inform key organizational decisions.
This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model — introduced in 2004 in a paper¹ by Jeffrey Dean and Sanjay Ghemawat — completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you've got.
I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization's data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.
Essentially, you'll need to take care of challenges that never came up with traditional relational databases and data warehouses, or that were handled by the constraints that the relational model placed on data. There is wonderful value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don't fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you're handling.
The risk of the new tools is having many disparate sources of data — and perhaps multiple instances of Hadoop or other systems offering analytics operating inefficiently — which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer.
The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.
The main topics covered in this report are:
Acquisition and ingestion
Data comes nowadays from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can't force everyone to provide the data in a format that's convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into schemas. The problems of data acquisition and ingestion have to be solved with a degree of automation.
Metadata (cataloguing)
Questions such as who provided the data, when it came in, and how it was formatted — a slew of concerns known as lineage or provenance — are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.
Data preparation and cleaning
Just as you can't control incoming formats, you can't control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning takes up 90% of their time.
Managing workflows
The actual jobs you run on data need to be linked with the three other stages just described. Users should be able to submit jobs of their own, based on the work done by experts before them, to handle ingestion, cataloguing, and cleaning. You want staff to quickly get a new visualization or report without waiting weeks for a programmer to code it up.
Access control
Data is the organization's crown jewels. You can't give everybody access to all data. In fact, regulations require you to restrict access to sensitive customer data. Security and access controls are therefore critical at all stages of data handling.
Why Companies Move to Hadoop
To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.
Size
"Volume" is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, such data processing using conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.
Variety
Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.
Free-form data
Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.
Fitting the task to the tool
Data maintained in relational databases — let alone cruder storage formats, such as spreadsheets — is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined into a format that the analytics engine can efficiently process.
Frequent failures
Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You'll want to get notifications when a job finishes successfully or unsuccessfully, as sketched below. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.
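A rough illustration of that kind of bookkeeping, assuming a Python driver script: the per-record logic and the alerting channel below are placeholders, not part of any particular framework.

import logging

logging.basicConfig(filename="ingest_job.log", level=logging.INFO)

def process_record(record: dict) -> None:
    # Placeholder for the real per-record work; raise ValueError for rows it cannot handle.
    if "id" not in record:
        raise ValueError("missing id")

def notify(message: str) -> None:
    # Placeholder for email, chat, or your scheduler's alerting channel.
    print(message)

def run_job(records: list) -> None:
    discarded = 0
    try:
        for i, record in enumerate(records):
            try:
                process_record(record)
            except ValueError as err:
                discarded += 1
                logging.warning("row %d discarded: %s", i, err)
        notify(f"Job finished: {discarded} corrupted rows discarded")
    except Exception:
        logging.exception("job aborted")
        notify("Job aborted; see ingest_job.log for details")
        raise

run_job([{"id": 1}, {"name": "no id"}])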
Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn't ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances — such as storing it in a format that is hard to read, or failing to remember when it arrived — will also slow down processing to the point where you give up opportunities for learning insights from your data.
When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can't afford to spend time and resources doing such coding in order to integrate them.
In systems with large, messy data, you have to decide what the system should do when input is bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job?
A minor error such as a missing ZIP code probably shouldn't stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we'll see.)

Your choice depends, of course, on your goal. If you're counting sales of a particular item, you don't need the customer ID. If you want to update customer records, you probably do.
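To make that decision concrete, here is a minimal sketch of a per-record check whose strictness depends on the goal of the job; the field names and goal labels are illustrative assumptions, not a real schema.

# A minimal sketch, assuming records arrive as dicts; the field names ("zip",
# "customer_id") and goal labels are illustrative, not a real schema.
def keep_record(record: dict, goal: str) -> bool:
    """Decide whether a record is usable for a given job goal."""
    if not record.get("zip"):
        record["zip"] = None          # minor gap: note it, but don't reject the record
    if goal == "count_sales":
        return True                   # counting sales doesn't need the customer ID
    if goal == "update_customers":
        return bool(record.get("customer_id"))   # useless without a customer ID
    return True

# Example: skip unusable rows rather than aborting the whole job.
rows = [{"customer_id": "C42", "zip": "02139"}, {"zip": ""}]
usable = [r for r in rows if keep_record(r, goal="update_customers")]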
A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.
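One defensive tactic, sketched here under the assumption that the feed is CSV with a header row, is to look fields up by name rather than by position, so reordered columns are handled automatically and a genuinely missing column aborts the load early; the column names are illustrative.

# A minimal sketch, assuming a CSV feed with a header row; looking fields up by
# name tolerates reordered columns, while a missing column aborts the load early.
import csv

EXPECTED = {"order_id", "customer_id", "amount"}   # illustrative column names

def read_feed(path: str) -> list:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"aborting: feed is missing columns {sorted(missing)}")
        return list(reader)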
At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You'll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.
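A scheduled script along the following lines is one simple way to automate that policy; the directories and the retention period are assumptions you would replace with your own.

# A minimal sketch of automated archiving, assuming raw files sit in a local
# directory; the paths and the 365-day retention period are illustrative policy choices.
import shutil
import time
from pathlib import Path

DATA_DIR = Path("/data/lake/raw")         # assumed landing area
ARCHIVE_DIR = Path("/data/lake/archive")  # assumed cheaper, still-accessible storage
MAX_AGE_DAYS = 365

def archive_old_files() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for path in DATA_DIR.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(ARCHIVE_DIR / path.name))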
Acquisition and Ingestion
At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.
On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.
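For example, a feed that arrives as CSV could be rewritten as Parquet during ingestion with something like the following sketch, which assumes the pyarrow library is installed; the file paths are illustrative.

# A minimal sketch, assuming the incoming feed is CSV and the pyarrow library is
# installed; Parquet is a columnar format that Hadoop-family tools read efficiently.
import pyarrow.csv as pv
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    table = pv.read_csv(csv_path)        # column types are inferred from the data
    pq.write_table(table, parquet_path)

csv_to_parquet("incoming/orders.csv", "lake/orders.parquet")   # illustrative paths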
The health care field provides a particularly complex data collection case. You may be collecting:
Electronic health records from hospitals using different formats
Claims data from health care providers or payers
Profiles from health plans
Data from individuals’ fitness devices
Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.
In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You'll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.
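One lightweight way to expose those choices is a small feed description that a driver script dispatches on, as in the sketch below; the method names, fields, and example feed are illustrative assumptions, not any particular product's configuration format.

# A minimal sketch of a parameterized feed definition; the method names, fields,
# and example feed are illustrative assumptions, not a specific tool's configuration.
from dataclasses import dataclass

@dataclass
class Feed:
    name: str
    method: str        # e.g. "hl7" or "claims"; selects the ingestion routine
    source: str        # URL, directory, or queue to pull from
    destination: str   # file or table to land the data in
    frequency: str     # e.g. "hourly", "daily"

def ingest_hl7(feed: Feed) -> None:
    print(f"pulling HL7 messages from {feed.source} into {feed.destination}")

def ingest_claims(feed: Feed) -> None:
    print(f"loading claims files from {feed.source} into {feed.destination}")

METHODS = {"hl7": ingest_hl7, "claims": ingest_claims}

def run(feed: Feed) -> None:
    METHODS[feed.method](feed)   # a new data feed only needs a new Feed definition

run(Feed("mercy_hospital_ehr", "hl7", "sftp://feeds/mercy", "lake/ehr/mercy", "daily"))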
Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn't contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.
As mentioned earlier, you may be able to anticipate how incoming data changes — such as reordered fields — and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.
Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together — for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that these small files be combined into a single large file before processing, to make more efficient use of the Hadoop cluster.
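A pre-processing step along these lines can be as simple as the following sketch, which assumes the small inputs are newline-delimited text files in a local staging directory; the paths and file pattern are illustrative.

# A minimal sketch, assuming the small inputs are newline-delimited text files
# staged in a local directory; the paths and file pattern are illustrative.
from pathlib import Path

def combine_small_files(staging_dir: str, combined_path: str) -> int:
    """Concatenate many small files into one large file before handing it to Hadoop."""
    count = 0
    with open(combined_path, "wb") as out:
        for small in sorted(Path(staging_dir).glob("*.txt")):
            data = small.read_bytes()
            out.write(data if data.endswith(b"\n") else data + b"\n")
            count += 1
    return count

n = combine_small_files("staging/orders", "lake/orders_combined.txt")
print(f"combined {n} small files")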
This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won't be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.
However, current open source tools don't do everything you need. You'll have to fill in the gaps with commercial solutions or hand-crafted scripts. For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop, and it supports incremental loads. But building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni's Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.
Data preparation and cleaning require a firm knowledge of which data set you're working on. Different sets require different types of preparation, based on what you have learned about them historically.

Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.

When something goes wrong in any stage from ingestion through to the processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn't recur in future data sets.
In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data; a sketch of a catalog record that covers these needs appears after this list.
Access has to be restricted to sensitive data. If users deliberately or inadvertently try to start a job on data they're not supposed to see, your system should reject the job.

Regulatory requirements may require the access restrictions mentioned in the previous bullet, as well as imposing other requirements that depend on the data source.

Licenses may require access restrictions and other special treatment of some data sources.
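A catalog entry that supports all of these needs might record something like the following sketch; every field name is an illustrative assumption about what a lineage record could hold, not a prescribed schema.

# A minimal sketch of a catalog entry covering lineage and access concerns;
# every field name here is an illustrative assumption, not a standard schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    dataset: str                     # the name the data set is known by internally
    source: str                      # who provided the data, and how to contact them
    arrived_at: datetime             # when it came in
    data_format: str                 # how it was formatted on arrival
    derived_from: list = field(default_factory=list)  # upstream data sets (lineage)
    sensitivity: str = "internal"    # drives access control, e.g. "public", "restricted"
    license_terms: str = ""          # license or regulatory constraints on use

entry = CatalogEntry(
    dataset="dropped_calls_2015",
    source="network operations CDR feed",
    arrived_at=datetime(2015, 9, 2),
    data_format="csv",
    derived_from=["raw_cdr_2015"],
    sensitivity="restricted",
)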