
Managing the Data Lake
Moving to Big Data Analysis

Andy Oram


Managing the Data Lake

by Andy Oram

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt

Interior Designer: David Futato

Cover Designer: Karen Montgomery

September 2015: First Edition

Revision History for the First Edition

2015-09-02: First Release

2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Cover photo credit: “55 Flying Fish” by Michal (Flickr)

978-1-491-94168-3

[LSI]


Chapter 1. Moving to Big Data Analysis

Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization’s data lake? Can you tell whether it’s healthy enough to support your business needs?

An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools—especially in the Hadoop family—to extract value quickly and inform key organizational decisions.

This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model—introduced in 2004 in a paper by Jeffrey Dean and Sanjay Ghemawat—completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.

I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.

Essentially, you’ll need to take care of challenges that never came up with traditional relational databases and data warehouses, or that were handled by the constraints that the relational model placed on data. There is wonderful value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don’t fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you’re handling.

The risk of the new tools is having many disparate sources of data—and perhaps multiple instances of Hadoop or other systems offering analytics operating inefficiently—which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer.

The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.

The main topics covered in this report are:

Acquisition and ingestion

Data comes nowadays from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can’t force everyone to provide the data in a format that’s convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into schemas. The problems of data acquisition and ingestion have to be solved with a degree of automation.

Metadata (cataloguing)

Questions such as who provided the data, when it came in, and how it was formatted—a slew of concerns known as lineage or provenance—are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.

Data preparation and cleaning

Just as you can’t control incoming formats, you can’t control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning takes up 90% of their time.

Managing workflows

The actual jobs you run on data need to be linked with the three other stages just described. Users should be able to submit jobs of their own, based on the work done by experts before them, to handle ingestion, cataloguing, and cleaning. You want staff to quickly get a new visualization or report without waiting weeks for a programmer to code it up.

Access control

Data is the organization’s crown jewels. You can’t give everybody access to all data. In fact, regulations require you to restrict access to sensitive customer data. Security and access controls are therefore critical at all stages of data handling.
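To make that concrete, here is a minimal sketch of a pre-submission access check, assuming a hypothetical in-memory grants table and dataset names; a real deployment would delegate this to the platform’s own authorization layer (Ranger, Sentry, or similar).

# Minimal sketch: reject a job if the submitting role lacks a grant on any input.
GRANTS = {
    "analyst_team": {"sales_orders", "web_clickstream"},
    "finance_team": {"sales_orders", "customer_pii"},
}

def can_submit(role, datasets):
    """Return True only if the role is granted every dataset the job reads."""
    allowed = GRANTS.get(role, set())
    return set(datasets).issubset(allowed)

def submit_job(role, datasets):
    if not can_submit(role, datasets):
        # Reject the job outright rather than silently dropping restricted inputs.
        raise PermissionError(f"{role} lacks access to some of {sorted(datasets)}")
    print(f"Job accepted for {role} on {sorted(datasets)}")

submit_job("analyst_team", {"sales_orders"})       # accepted
# submit_job("analyst_team", {"customer_pii"})     # would raise PermissionError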

Why Companies Move to Hadoop

To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.

Size

“Volume” is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, such data processing using conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.


Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.

Free-form data

Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.

Streaming data

If you don’t keep up with changes in the world around you, it will pass you by—and probably reward a competitor who does adapt to it. Streaming has evolved from a few rare cases, such as stock markets and sensor data, to everyday data such as product usage data and social media.

Fitting the task to the tool

Data maintained in relational databases—let alone cruder storage formats, such as spreadsheets—is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined into a format that the analytics engine can process efficiently.

Frequent failures

Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You’ll want to get notifications when a job finishes successfully or unsuccessfully. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.
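As a rough illustration of that bookkeeping, the sketch below tallies rejected records and logs success or failure at the end of a run; the JSON parsing and the logging-based notification are stand-ins for whatever your pipeline actually uses.

import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def run_job(lines, max_bad_ratio=0.05):
    """Parse JSON records, counting and logging rejects instead of dying on each one."""
    good, bad = 0, 0
    for lineno, line in enumerate(lines, start=1):
        try:
            json.loads(line)
            good += 1
        except json.JSONDecodeError as exc:
            bad += 1
            log.warning("line %d rejected: %s", lineno, exc)
    total = good + bad
    failed = total > 0 and bad / total > max_bad_ratio
    # Notification hook: swap in email, a chat webhook, or your scheduler's alerting.
    log.info("job %s: %d records kept, %d discarded",
             "FAILED" if failed else "SUCCEEDED", good, bad)
    return not failed

run_job(['{"id": 1}', 'not json', '{"id": 2}'])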

Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn’t ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances—such as storing it in a format that is hard to read, or failing to remember when it arrived—will also slow down processing to the point where you give up opportunities for learning insights from your data.


When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can’t afford to spend time and resources doing such coding in order to integrate them.

In systems with large, messy data, you have to decide what the system should do when input is bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job?

A minor error such as a missing ZIP code probably shouldn’t stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we’ll see.)

Your choice depends of course on your goal. If you’re counting sales of a particular item, you don’t need the customer ID. If you want to update customer records, you probably do.

A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.
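A minimal sketch of such a policy appears below, assuming a hypothetical retail feed with customer_id and zip fields; the field names, goals, and the choice to abort on a reordered header are illustrative, not prescriptive.

EXPECTED_FIELDS = ["customer_id", "item_sku", "zip", "amount"]

class AbortJob(Exception):
    """Raised when the whole input looks wrong, e.g. fields arrive reordered."""

def check_header(header):
    # A globally reordered or renamed header suggests the feed itself changed:
    # safer to abort than to guess which column is which.
    if header != EXPECTED_FIELDS:
        raise AbortJob(f"unexpected field order: {header}")

def triage(record, goal):
    """Decide per record whether to keep or skip it, patching minor problems in place."""
    if not record.get("zip"):
        record["zip"] = None          # minor: keep the record, just null the field
    if not record.get("customer_id"):
        # Fatal only if the analysis actually needs to identify the customer.
        return "skip" if goal == "update_customers" else "keep"
    return "keep"

check_header(["customer_id", "item_sku", "zip", "amount"])
print(triage({"customer_id": "", "item_sku": "A1", "zip": "", "amount": 9.5},
             goal="count_sales"))      # -> "keep"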

At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You’ll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.

Acquisition and Ingestion

At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.

On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.
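As a minimal sketch of that conversion, assuming PySpark is available and the incoming feed is newline-delimited JSON; the paths and the order_date partition column are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-to-parquet").getOrCreate()

# Read the raw feed as newline-delimited JSON (path is illustrative).
raw = spark.read.json("hdfs:///landing/orders/2015-09-01/")

# Persist in a columnar, splittable format that downstream Hadoop and Spark jobs
# can scan efficiently; partitioning by date keeps later queries selective.
raw.write.mode("overwrite").partitionBy("order_date").parquet("hdfs:///lake/orders/")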

The health care field provides a particularly complex data collection case. You may be collecting:

Electronic health records from hospitals using different formats

Claims data from health care providers or payers

Profiles from health plans

Data from individuals’ fitness devices


Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.

In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You’ll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.
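One way to express that is sketched below: each general method becomes a registered function, and a new feed is just a small parameter set (source, destination, frequency) passed to it. The handler names and configuration fields are hypothetical.

from dataclasses import dataclass

@dataclass
class FeedConfig:
    source: str          # where the data comes from (URL, queue, drop directory)
    destination: str     # where the ingested data lands in the lake
    frequency: str       # e.g. "hourly" or "daily"

def ingest_hl7(cfg):
    print(f"Parsing HL7 messages from {cfg.source} into {cfg.destination} ({cfg.frequency})")

def ingest_claims(cfg):
    print(f"Loading claims files from {cfg.source} into {cfg.destination} ({cfg.frequency})")

# Registry of general-purpose ingestion methods a user can choose from.
METHODS = {"hl7": ingest_hl7, "claims": ingest_claims}

def register_feed(method, cfg):
    METHODS[method](cfg)   # a real system would schedule this at cfg.frequency

register_feed("hl7", FeedConfig("sftp://hospital-a/outbox", "/lake/ehr/hospital_a", "daily"))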

Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn’t contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.

As mentioned earlier, you may be able to anticipate how incoming data changes—such as reordered fields—and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.

Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together—for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that, prior to processing these small files, they be combined into a single large file to leverage the Hadoop cluster more efficiently.
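A minimal sketch of that consolidation step is below, assuming the small files are plain text sitting in a local staging directory before upload; the paths and the *.log pattern are made up, and the same compaction could be done with a Spark job instead.

import glob
import os

def combine_small_files(staging_dir, merged_path):
    """Concatenate many small staging files into one large file for Hadoop ingestion."""
    count = 0
    with open(merged_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(staging_dir, "*.log"))):
            with open(path, "rb") as src:
                out.write(src.read())
            count += 1
    return count

n = combine_small_files("/staging/orders", "/staging/merged/orders-2015-09-01.log")
print(f"merged {n} small files")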

This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won’t be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.

However, current open source tools don’t do everything you need. You’ll have to fill in the gaps with commercial solutions or hand-crafted scripts.

For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop and supports incremental loads. However, building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni’s Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.
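For context, the incremental part can also be approximated in Spark by filtering on a watermark column, as in this rough sketch (the table, column, and connection details are invented); note that this only captures new and updated rows carrying a reliable timestamp, and reconciling updates and deletes into the Hive table is exactly the part that needs CDC tooling or custom work.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

last_value = "2015-09-01 00:00:00"   # watermark recorded after the previous run

# Pull only rows changed since the last run (connection details are illustrative).
incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")
    .option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{last_value}') AS delta")
    .option("user", "etl")
    .option("password", "secret")
    .load()
)

# Append the delta to the lake copy; merging it into the synced table is the hard part.
incremental.write.mode("append").parquet("hdfs:///lake/orders_delta/")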

Metadata (Cataloguing)

Why do you need to preserve metadata about your data? Reasons for doing so abound:

For your analytics, you will want to choose data from the right place and time. For instance, you may want to go back to old data from all your stores in a particular region.

Data preparation and cleaning require a firm knowledge of which data set you’re working on. Different sets require different types of preparation, based on what you have learned about them historically.

Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.

When something goes wrong in any stage from ingestion through to the processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn’t reoccur in future data sets.

In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data:

Access has to be restricted to sensitive data. If users deliberately or inadvertently try to start a job on data they’re not supposed to see, your system should reject the job.

Regulatory requirements may require the access restrictions mentioned in the previous bullet, as well as imposing other requirements that depend on the data source.

Licenses may require access restrictions and other special treatment of some data sources.

Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs.
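As a rough sketch of what one catalog entry might hold, with field names invented for illustration (real catalogs, whether built on the Hive Metastore or a commercial product, define richer schemas):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    """Lineage and descriptive metadata kept alongside a data set in the lake."""
    dataset: str                  # logical name downstream users search for
    source: str                   # who or what provided the data
    ingested_at: datetime         # when it arrived
    format: str                   # e.g. "avro", "parquet", "csv"
    schema_fields: list = field(default_factory=list)
    business_name: str = ""       # human-friendly label for discovery
    sensitive: bool = False       # drives access control downstream

catalog = [
    CatalogEntry(
        dataset="retail.loc_outlet",
        source="pos-feed-eu",
        ingested_at=datetime(2015, 9, 1, 4, 30),
        format="parquet",
        schema_fields=["outlet_id", "region", "opened_on"],
        business_name="Retail store",
        sensitive=False,
    )
]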

Zaloni divides metadata roughly into three types:

Business metadata

This can include the business names and descriptions that you assign to data fields to make them easier to find and understand. For instance, the technical staff may have a good reason to assign the name loc_outlet to a field that represents a retail store, but you will want users to be able to find it through common English words. This kind of metadata also covers business rules, such as putting an upper limit (perhaps even a lower limit) on salaries, or determining which data must be removed from some jobs for security and privacy.

Operational metadata

This is generated automatically by the processes described in this report, and includes such things as the source and target locations of data, file size, number of records, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.
