Managing the Data Lake
Moving to Big Data Analysis
Andy Oram
Managing the Data Lake
by Andy Oram
Copyright © 2015 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery
September 2015: First Edition
Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Cover photo credit: "55 Flying Fish" by Michal (Flickr)
978-1-491-94168-3
[LSI]
Chapter 1. Moving to Big Data Analysis
Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization's data lake? Can you tell whether it's healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called "data lakes." They use a variety of storage and processing tools — especially in the Hadoop family — to extract value quickly and inform key organizational decisions.
This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model — introduced in 2004 in a paper¹ by Jeffrey Dean and Sanjay Ghemawat — completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you've got.
I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization's data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.
Essentially, you'll need to take care of challenges that never came up with traditional relational databases and data warehouses, or that were handled by the constraints that the relational model placed on data. There is wonderful value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don't fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you're handling.
The risk of the new tools is having many disparate sources of data — and perhaps multiple instances of Hadoop or other systems offering analytics operating inefficiently — which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer.
The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.
The main topics covered in this report are:
Acquisition and ingestion
Data comes nowadays from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can't force everyone to provide the data in a format that's convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into schemas. The problems of data acquisition and ingestion have to be solved with a degree of automation.
Metadata (cataloguing)
Questions such as who provided the data, when it came in, and how it was formatted — a slew of concerns known as lineage or provenance — are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.
Data preparation and cleaning
Just as you can't control incoming formats, you can't control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning takes up 90% of their time.
Managing workflows
The actual jobs you run on data need to be linked with the three other stages just described. Users should be able to submit jobs of their own, based on the work done by experts before them, to handle ingestion, cataloguing, and cleaning. You want staff to quickly get a new visualization or report without waiting weeks for a programmer to code it up.
Access control
Data is the organization's crown jewels. You can't give everybody access to all data. In fact, regulations require you to restrict access to sensitive customer data. Security and access controls are therefore critical at all stages of data handling.
Why Companies Move to Hadoop
To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.
Size
"Volume" is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, such data processing using conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.
Variety
Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.
Free-form data
Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.
Fitting the task to the tool
Data maintained in relational databases — let alone cruder storage formats, such as spreadsheets — is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined into a format that the analytics engine can efficiently process.
Frequent failures
Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You'll want to get notifications when a job finishes successfully or unsuccessfully, as sketched below. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.
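A rough illustration of that kind of bookkeeping, assuming a Python driver script: the per-record logic and the alerting channel below are placeholders, not part of any particular framework.

import logging

logging.basicConfig(filename="ingest_job.log", level=logging.INFO)

def process_record(record: dict) -> None:
    # Placeholder for the real per-record work; raise ValueError for rows it cannot handle.
    if "id" not in record:
        raise ValueError("missing id")

def notify(message: str) -> None:
    # Placeholder for email, chat, or your scheduler's alerting channel.
    print(message)

def run_job(records: list) -> None:
    discarded = 0
    try:
        for i, record in enumerate(records):
            try:
                process_record(record)
            except ValueError as err:
                discarded += 1
                logging.warning("row %d discarded: %s", i, err)
        notify(f"Job finished: {discarded} corrupted rows discarded")
    except Exception:
        logging.exception("job aborted")
        notify("Job aborted; see ingest_job.log for details")
        raise

run_job([{"id": 1}, {"name": "no id"}])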
Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn't ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances — such as storing it in a format that is hard to read, or failing to remember when it arrived — will also slow down processing to the point where you give up opportunities for learning insights from your data.
When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can't afford to spend time and resources doing such coding in order to integrate them.
In systems with large, messy data, you have to decide what the system should do when input is bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job?
A minor error such as a missing ZIP code probably shouldn't stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we'll see.)

Your choice depends, of course, on your goal. If you're counting sales of a particular item, you don't need the customer ID. If you want to update customer records, you probably do.
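To make that decision concrete, here is a minimal sketch of a per-record check whose strictness depends on the goal of the job; the field names and goal labels are illustrative assumptions, not a real schema.

# A minimal sketch, assuming records arrive as dicts; the field names ("zip",
# "customer_id") and goal labels are illustrative, not a real schema.
def keep_record(record: dict, goal: str) -> bool:
    """Decide whether a record is usable for a given job goal."""
    if not record.get("zip"):
        record["zip"] = None          # minor gap: note it, but don't reject the record
    if goal == "count_sales":
        return True                   # counting sales doesn't need the customer ID
    if goal == "update_customers":
        return bool(record.get("customer_id"))   # useless without a customer ID
    return True

# Example: skip unusable rows rather than aborting the whole job.
rows = [{"customer_id": "C42", "zip": "02139"}, {"zip": ""}]
usable = [r for r in rows if keep_record(r, goal="update_customers")]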
A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.
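One defensive tactic, sketched here under the assumption that the feed is CSV with a header row, is to look fields up by name rather than by position, so reordered columns are handled automatically and a genuinely missing column aborts the load early; the column names are illustrative.

# A minimal sketch, assuming a CSV feed with a header row; looking fields up by
# name tolerates reordered columns, while a missing column aborts the load early.
import csv

EXPECTED = {"order_id", "customer_id", "amount"}   # illustrative column names

def read_feed(path: str) -> list:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"aborting: feed is missing columns {sorted(missing)}")
        return list(reader)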
At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You'll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.
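A scheduled script along the following lines is one simple way to automate that policy; the directories and the retention period are assumptions you would replace with your own.

# A minimal sketch of automated archiving, assuming raw files sit in a local
# directory; the paths and the 365-day retention period are illustrative policy choices.
import shutil
import time
from pathlib import Path

DATA_DIR = Path("/data/lake/raw")         # assumed landing area
ARCHIVE_DIR = Path("/data/lake/archive")  # assumed cheaper, still-accessible storage
MAX_AGE_DAYS = 365

def archive_old_files() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for path in DATA_DIR.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(ARCHIVE_DIR / path.name))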
Acquisition and Ingestion
At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.
On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.
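For example, a feed that arrives as CSV could be rewritten as Parquet during ingestion with something like the following sketch, which assumes the pyarrow library is installed; the file paths are illustrative.

# A minimal sketch, assuming the incoming feed is CSV and the pyarrow library is
# installed; Parquet is a columnar format that Hadoop-family tools read efficiently.
import pyarrow.csv as pv
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    table = pv.read_csv(csv_path)        # column types are inferred from the data
    pq.write_table(table, parquet_path)

csv_to_parquet("incoming/orders.csv", "lake/orders.parquet")   # illustrative paths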
The health care field provides a particularly complex data collection case. You may be collecting:
Electronic health records from hospitals using different formats
Claims data from health care providers or payers
Profiles from health plans
Data from individuals’ fitness devices
Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.
In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You'll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.
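One lightweight way to expose those choices is a small feed description that a driver script dispatches on, as in the sketch below; the method names, fields, and example feed are illustrative assumptions, not any particular product's configuration format.

# A minimal sketch of a parameterized feed definition; the method names, fields,
# and example feed are illustrative assumptions, not a specific tool's configuration.
from dataclasses import dataclass

@dataclass
class Feed:
    name: str
    method: str        # e.g. "hl7" or "claims"; selects the ingestion routine
    source: str        # URL, directory, or queue to pull from
    destination: str   # file or table to land the data in
    frequency: str     # e.g. "hourly", "daily"

def ingest_hl7(feed: Feed) -> None:
    print(f"pulling HL7 messages from {feed.source} into {feed.destination}")

def ingest_claims(feed: Feed) -> None:
    print(f"loading claims files from {feed.source} into {feed.destination}")

METHODS = {"hl7": ingest_hl7, "claims": ingest_claims}

def run(feed: Feed) -> None:
    METHODS[feed.method](feed)   # a new data feed only needs a new Feed definition

run(Feed("mercy_hospital_ehr", "hl7", "sftp://feeds/mercy", "lake/ehr/mercy", "daily"))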
Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn't contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.
As mentioned earlier, you may be able to anticipate how incoming data changes — such as reordered fields — and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.
Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together — for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that these small files be combined into a single large file before processing, to make more efficient use of the Hadoop cluster.
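A pre-processing step along these lines can be as simple as the following sketch, which assumes the small inputs are newline-delimited text files in a local staging directory; the paths and file pattern are illustrative.

# A minimal sketch, assuming the small inputs are newline-delimited text files
# staged in a local directory; the paths and file pattern are illustrative.
from pathlib import Path

def combine_small_files(staging_dir: str, combined_path: str) -> int:
    """Concatenate many small files into one large file before handing it to Hadoop."""
    count = 0
    with open(combined_path, "wb") as out:
        for small in sorted(Path(staging_dir).glob("*.txt")):
            data = small.read_bytes()
            out.write(data if data.endswith(b"\n") else data + b"\n")
            count += 1
    return count

n = combine_small_files("staging/orders", "lake/orders_combined.txt")
print(f"combined {n} small files")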
This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won't be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.
However, current open source tools don't do everything you need. You'll have to fill in the gaps with commercial solutions or hand-crafted scripts. For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop, and it supports incremental loads. But building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni's Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.
Data preparation and cleaning require a firm knowledge of which data set you're working on. Different sets require different types of preparation, based on what you have learned about them historically.

Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.

When something goes wrong in any stage from ingestion through to the processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn't recur in future data sets.
In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data; a sketch of a catalog record that covers these needs appears after this list.
Access has to be restricted to sensitive data. If users deliberately or inadvertently try to start a job on data they're not supposed to see, your system should reject the job.

Regulatory requirements may require the access restrictions mentioned in the previous bullet, as well as imposing other requirements that depend on the data source.

Licenses may require access restrictions and other special treatment of some data sources.
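A catalog entry that supports all of these needs might record something like the following sketch; every field name is an illustrative assumption about what a lineage record could hold, not a prescribed schema.

# A minimal sketch of a catalog entry covering lineage and access concerns;
# every field name here is an illustrative assumption, not a standard schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    dataset: str                     # the name the data set is known by internally
    source: str                      # who provided the data, and how to contact them
    arrived_at: datetime             # when it came in
    data_format: str                 # how it was formatted on arrival
    derived_from: list = field(default_factory=list)  # upstream data sets (lineage)
    sensitivity: str = "internal"    # drives access control, e.g. "public", "restricted"
    license_terms: str = ""          # license or regulatory constraints on use

entry = CatalogEntry(
    dataset="dropped_calls_2015",
    source="network operations CDR feed",
    arrived_at=datetime(2015, 9, 2),
    data_format="csv",
    derived_from=["raw_cdr_2015"],
    sensitivity="restricted",
)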