The Data Warehouse ETL Toolkit
Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Ralph Kimball
Joe Caserta
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2004 by Wiley Publishing, Inc. All rights reserved.
Published simultaneously in Canada
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, e-mail: brandreview@wiley.com.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993.
eISBN 0-7645-7923-1
Text Design & Composition: TechBooks Composition Services
Acknowledgments

First of all we want to thank the many thousands of readers of the Toolkit series of data warehousing books. We appreciate your wonderful support and encouragement to write a book about data warehouse ETL. We continue to learn from you, the owners and builders of data warehouses.
Both of us are especially indebted to Jim Stagnitto for encouraging Joe to start this book and giving him the confidence to go through with the project. Jim was a virtual third author with major creative contributions to the chapters on data quality and real-time ETL.
Special thanks are also due to Jeff Coster and Kim M. Knyal for significant contributions to the discussions of pre- and post-load processing and project managing the ETL process, respectively.
We had an extraordinary team of reviewers who crawled over the first version of the manuscript and made many helpful suggestions. It is always daunting to make significant changes to a manuscript that is "done," but this kind of deep review has been a tradition with the Toolkit series of books and was successful again this time. In alphabetic order, the reviewers included: Wouleta Ayele, Bob Becker, Jan-Willem Beldman, Ivan Chong, Maurice Frank, Mark Hodson, Paul Hoffman, Qi Jin, David Lyle, Michael Martin, Joy Mundy, Rostislav Portnoy, Malathi Vellanki, Padmini Ramanujan, Margy Ross, Jack Serra-Lima, and Warren Thornthwaite.

We owe special thanks to our spouses, Robin Caserta and Julie Kimball, for their support throughout this project, and our children Tori Caserta, Brian Kimball, Sara (Kimball) Smith, and grandchild(!) Abigail Smith, who were very patient with the authors, who always seemed to be working.
Finally, the team at Wiley Computer Books has once again been a real asset in getting this book finished. Thank you Bob Elliott, Kevin Kent, and Adaobi Obi Tulton.
About the Authors
Ralph Kimball, Ph.D., founder of the Kimball Group, has been a leading visionary in the data warehouse industry since 1982 and is one of today's most well-known speakers, consultants, teachers, and writers. His books include The Data Warehouse Toolkit (Wiley, 1996), The Data Warehouse Lifecycle Toolkit (Wiley, 1998), The Data Webhouse Toolkit (Wiley, 2000), and The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He also has written for Intelligent Enterprise magazine since 1995, receiving the Readers' Choice Award since 1999.
Ralph earned his doctorate in electrical engineering at Stanford University with a specialty in man-machine systems design. He was a research scientist, systems development manager, and product marketing manager at Xerox PARC and Xerox Systems' Development Division from 1972 to 1982. For his work on the Xerox Star Workstation, the first commercial product with windows, icons, and a mouse, he received the Alexander C. Williams award from the IEEE Human Factors Society for systems design. From 1982 to 1986 Ralph was Vice President of Applications at Metaphor Computer Systems, the first data warehouse company. At Metaphor, Ralph invented the "capsule" facility, which was the first commercial implementation of the graphical data flow interface now in widespread use in all ETL tools. From 1986 to 1992 Ralph was founder and CEO of Red Brick Systems, a provider of ultra-fast relational database technology dedicated to decision support. In 1992 Ralph founded Ralph Kimball Associates, which became known as the Kimball Group in 2004. The Kimball Group is a team of highly experienced data warehouse design professionals known for their excellence in consulting, teaching, speaking, and writing.
Joe Caserta is the founder and Principal of Caserta Concepts, LLC. He is an influential data warehousing veteran whose expertise is shaped by years of industry experience and practical application of major data warehousing tools and databases. Joe is educated in Database Application Development and Design, Columbia University, New York.
Introduction

The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions. This book is organized around these four steps.
The ETL system makes or breaks the data warehouse. Although building the ETL system is a back room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse.
The ETL system adds significant value to data. It is far more than plumbing for getting data out of source systems and into the data warehouse. Specifically, the ETL system:
- Removes mistakes and corrects missing data
- Provides documented measures of confidence in data
- Captures the flow of transactional data for safekeeping
- Adjusts data from multiple sources to be used together
- Structures data to be usable by end-user tools
ETL is both a simple and a complicated subject. Almost everyone understands the basic mission of the ETL system: to get data out of the source and load it into the data warehouse. And most observers are increasingly appreciating the need to clean and transform data along the way. So much for the simple view. It is a fact of life that the next step in the design of the ETL system breaks into a thousand little subcases, depending on your own weird data sources, business rules, existing software, and unusual destination-reporting applications. The challenge for all of us is to tolerate the thousand little subcases but to keep perspective on the simple overall mission of the ETL system. Please judge this book by how well we meet this challenge!
The Data Warehouse ETL Toolkit is a practical guide for building successful ETL systems. This book is not a survey of all possible approaches! Rather, we build on a set of consistent techniques for delivery of dimensional data. Dimensional modeling has proven to be the most predictable and cost-effective approach to building data warehouses. At the same time, because the dimensional structures are the same across many data warehouses, we can count on reusing code modules and specific development logic.

This book is a roadmap for planning, designing, building, and running the back room of a data warehouse. We expand the traditional ETL steps of extract, transform, and load into the more actionable steps of extract, clean, conform, and deliver, although we resist the temptation to change ETL into ECCD!
In this book, you'll learn to:
- Plan and design your ETL system
- Choose the appropriate architecture from the many possible choices
- Manage the implementation
- Manage the day-to-day operations
- Build the development/test/production suite of ETL processes
- Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
- Analyze and extract source data
- Build a comprehensive data-cleaning subsystem
- Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools, OLAP cubes, and analytic applications
- Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
- Tune the overall ETL process for optimum performance

The preceding points are many of the big issues in an ETL system. But as much as we can, we provide lower-level technical detail for:
- Implementing the key enforcement steps of a data-cleaning system for column properties, structures, valid values, and complex business rules
- Conforming heterogeneous data from multiple sources into standardized dimension tables and fact tables
- Building replicatable ETL modules for handling the natural time variance in dimensions, for example, the three types of slowly changing dimensions (SCDs)
- Building replicatable ETL modules for multivalued dimensions and hierarchical dimensions, which both require associative bridge tables
- Processing extremely large-volume fact data loads
- Optimizing ETL processes to fit into highly constrained load windows
- Converting batch and file-oriented ETL systems into continuously streaming real-time ETL systems
For illustrative purposes, Oracle is chosen as a common denominator when specific SQL code is revealed. However, similar code that presents the same results can typically be written for DB2, Microsoft SQL Server, or any popular relational database system.
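To give a taste of that style, a simple data-quality screen of the kind developed in the cleaning chapters might look like the following in Oracle SQL. This is a minimal sketch of our own, not code from the book, and the staging table and column names are hypothetical:

    -- Hypothetical column-property screen: flag staged customer rows
    -- whose postal code is missing or has an unexpected length.
    SELECT cust_id, postal_code
    FROM   stage_customer
    WHERE  postal_code IS NULL
       OR  LENGTH(TRIM(postal_code)) NOT IN (5, 10);

Rows caught by a screen like this are flagged for correction or review rather than loaded silently.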
And perhaps as a side effect of all of these specific recommendations, we hope to share our enthusiasm for developing, deploying, and managing data warehouse ETL systems.
Overview of the Book: Two Simultaneous Threads
Building an ETL system is unusually challenging because it is so heavily constrained by unavoidable realities. The ETL team must live with the business requirements, the formats and deficiencies of the source data, the existing legacy systems, the skill sets of available staff, and the ever-changing (and legitimate) needs of end users. If these factors aren't enough, the budget is limited, the processing-time windows are too narrow, and important parts of the business come grinding to a halt if the ETL system doesn't deliver data to the data warehouse!
Two simultaneous threads must be kept in mind when building an ETL system: the Planning & Design thread and the Data Flow thread. At the highest level, they are pretty simple. Both of them progress in an orderly fashion from left to right in the diagrams. Their interaction makes life very interesting. In Figure Intro-1 we show the four steps of the Planning & Design thread, and in Figure Intro-2 we show the four steps of the Data Flow thread.

Figure Intro-1 The Planning and Design Thread.

Figure Intro-2 The Data Flow Thread.
To help you visualize where we are in these two threads, in each chapter we call out process checks. The following example would be used when we are discussing the requirements for data cleaning:

P R O C E S S C H E C K Planning & Design: Requirements/Realities ➔ Architecture ➔ Implementation ➔ Test/Release
Data Flow: Extract ➔ Clean ➔ Conform ➔ Deliver
The Planning & Design Thread
The first step in the Planning & Design thread is accounting for all the requirements and realities. These include:
- Business needs
- Data profiling and other data-source realities
- Compliance requirements
- Security requirements
- Data integration
- Data latency
- Archiving and lineage
- End user delivery interfaces
- Available development skills
- Available management skills
- Legacy licenses
We expand these individually in Chapter 1, but we have to point out at this early stage how much each of these bullets affects the nature of your ETL system. For this step, as well as all the steps in both major threads, we point out the places in this book when we are talking specifically about the given step.
The second step in this thread is the architecture step. Here is where we must make big decisions about the way we are going to build our ETL system. These decisions include:
- Hand-coded versus ETL vendor tool
- Batch versus streaming data flow
- Horizontal versus vertical task dependency
- Scheduler automation
- Exception handling
- Quality handling
- Recovery and restart
- Metadata
- Security
The third step in the Planning & Design thread is system implementation. Let's hope you have spent some quality time on the previous two steps before charging into the implementation! This step includes:
- Hardware
- Software
- Coding practices
- Documentation practices
- Specific quality checks

The final step sounds like administration, but the design of the test and release procedures is as important as the more tangible designs of the preceding two steps. Test and release includes the design of the:
- Development systems
- Test systems
- Production systems
- Handoff procedures
- Update propagation approach
- System snapshotting and rollback procedures
- Performance tuning
The Data Flow Thread
The Data Flow thread is probably more recognizable to most readers because it is a simple generalization of the old E-T-L extract-transform-load scenario. As you scan these lists, begin to imagine how the Planning & Design thread affects each of the following bullets. The extract step includes:
- Reading source-data models
- Connecting to and accessing data
- Scheduling the source system, intercepting notifications and daemons
- Capturing changed data
- Staging the extracted data to disk
The clean step involves:
- Enforcing column properties
- Enforcing structure (a small SQL sketch follows this list)
- Enforcing data and value rules
- Enforcing complex business rules
- Building a metadata foundation to describe data quality
- Staging the cleaned data to disk
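To illustrate enforcing structure, a referential screen between staged records and an existing dimension might be sketched in SQL as follows. This is a hedged sketch under assumed table names, not the book's code:

    -- Hypothetical structural screen: find staged order rows whose
    -- customer natural key has no match in the customer dimension.
    SELECT s.order_id, s.customer_nk
    FROM   stage_orders s
    WHERE  NOT EXISTS
           (SELECT 1
            FROM   customer_dim d
            WHERE  d.customer_nk = s.customer_nk);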
This step is followed closely by the conform step, which includes:
- Conforming business labels (in dimensions)
- Conforming business metrics and performance indicators (in fact tables)
- Deduplicating
- Householding
- Internationalizing
- Staging the conformed data to disk
Finally, we arrive at the payoff step where we deliver our wonderful data to the end-user application. We spend most of Chapters 5 and 6 on delivery techniques because, as we describe in Chapter 1, you still have to serve the food after you cook it! Data delivery from the ETL system includes:
- Loading flat and snowflaked dimensions
- Generating time dimensions
- Loading degenerate dimensions
- Loading subdimensions
- Loading types 1, 2, and 3 slowly changing dimensions
- Conforming dimensions and conforming facts
- Handling late-arriving dimensions and late-arriving facts
- Loading multi-valued dimensions
- Loading ragged hierarchy dimensions
- Loading text facts in dimensions
- Running the surrogate key pipeline for fact tables
- Loading three fundamental fact table grains
- Loading and updating aggregations
- Staging the delivered data to disk
In studying this last list, you may say, "But most of that list is modeling, not ETL. These issues belong in the front room." We respectfully disagree. In our interviews with more than 20 data warehouse teams, more than half said that the design of the ETL system took place at the same time as the design of the target tables. These folks agreed that there were two distinct roles: data warehouse architect and ETL system designer. But these two roles often were filled by the same person! So this explains why this book carries the data all the way from the original sources into each of the dimensional database configurations.
The basic four-step data flow is overseen by the operations step, which extends from the beginning of the extract step to the end of the deliver step. Operations includes:
- Scheduling
- Job execution
- Exception handling
- Recovery and restart
- Quality checking
- Release
- Support

Understanding how to think about these two fundamental threads (Planning & Design and Data Flow) is the real goal of this book.
How the Book Is Organized
To develop the two threads, we have divided the book into four parts:
I. Requirements, Realities, and Architecture
II. Data Flow
III. Implementation and Operations
IV. Real Time Streaming ETL Systems

This book starts with the requirements, realities, and architecture steps of the Planning & Design thread because we must establish a logical foundation for the design of any kind of ETL system. The middle part of the book then traces the entire Data Flow thread from the extract step through to the deliver step. Then in the third part we return to implementation and operations issues. In the last part, we open the curtain on the exciting new area of real-time streaming ETL systems.
Part I: Requirements, Realities, and Architecture
Part I sets the stage for the rest of the book. Even though most of us are eager to get started on moving data into the data warehouse, we have to step back to get some perspective.
Chapter 1: Surrounding the Requirements
The ETL portion of the data warehouse is a classically overconstrained design challenge. In this chapter we put some substance on the list of requirements that we want you to consider up front before you commit to an approach. We also introduce the main architectural decisions you must take a stand on (whether you realize it or not).
This chapter is the right place to define, as precisely as we can, the major vocabulary of data warehousing, at least as far as this book is concerned. These terms include:
- Data warehouse
- Data mart
- ODS (operational data store)
- EDW (enterprise data warehouse)
- Staging area
- Presentation area
We describe the mission of the data warehouse as well as the mission of the ETL team responsible for building the back room foundation of the data warehouse. We briefly introduce the basic four stages of Data Flow: extracting, cleaning, conforming, and delivering. And finally we state as clearly as possible why we think dimensional data models are the keys to success for every data warehouse.
Chapter 2: ETL Data Structures
Every ETL system must stage data in various permanent and semipermanent forms. When we say staging, we mean writing data to the disk, and for this reason the ETL system is sometimes referred to as the staging area. You might have noticed that we recommend at least some form of staging after each of the major ETL steps (extract, clean, conform, and deliver). We discuss the reasons for various forms of staging in this chapter.

We then provide a systematic description of the important data structures needed in typical ETL systems: flat files, XML data sets, independent DBMS working tables, normalized entity/relationship (E/R) schemas, and dimensional data models. For completeness, we mention some special tables including legally significant audit tracking tables used to prove the provenance of important data sets, as well as mapping tables used to keep track of surrogate keys. We conclude with a survey of metadata typically surrounding these types of tables, as well as naming standards. The metadata section in this chapter is just an introduction, as metadata is an important topic that we return to many times in this book.

Part II: Data Flow
The second part of the book presents the actual steps required to effectively extract, clean, conform, and deliver data from various source systems into an ideal dimensional data warehouse. We start with instructions on selecting the system-of-record and recommend strategies for analyzing source systems. This part includes a major chapter on building the cleaning and conforming stages of the ETL system. The last two chapters then take the cleaned and conformed data and repurpose it into the required dimensional structures for delivery to the end-user environments.

Chapter 3: Extracting
This chapter begins by explaining what is required to design a logical data mapping after data analysis is complete. We urge you to create a logical data map and show how it should be laid out to prevent ambiguity in the mission-critical specification. The logical data map provides ETL developers with the functional specifications they need to build the physical ETL process.
A major responsibility of the data warehouse is to provide data from various legacy applications throughout the enterprise in a single cohesive repository. This chapter offers specific technical guidance for integrating the heterogeneous data sources found throughout the enterprise, including mainframes, relational databases, XML sources, flat files, Web logs, and enterprise resource planning (ERP) systems. We discuss the obstacles encountered when integrating these data sources and offer suggestions on how to overcome them. We introduce the notion of conforming data across multiple potentially incompatible data sources, a topic developed fully in the next chapter.

Chapter 4: Cleaning and Conforming
After data has been extracted, we subject it to cleaning and conforming. Cleaning means identifying and fixing the errors and omissions in the data. Conforming means resolving the labeling conflicts between potentially incompatible data sources so that they can be used together in an enterprise data warehouse.
This chapter makes an unusually serious attempt to propose specific techniques and measurements that you should implement as you build the cleaning and conforming stages of your ETL system. The chapter focuses on data-cleaning objectives, techniques, metadata, and measurements. In particular, the techniques section surveys the key approaches to data profiling and data cleaning, and the measurements section gives examples of how to implement data-quality checks that trigger alerts, as well as how to provide guidance to the data-quality steward regarding the overall health of the data.
Chapter 5: Delivering Dimension Tables
This chapter and Chapter 6 are the payoff chapters in this book. We believe that the whole point of the data warehouse is to deliver data in a simple, actionable format for the benefit of end users and their analytic applications. Dimension tables are the context of a business' measurements. They are also the entry points to the data because they are the targets for almost all data warehouse constraints, and they provide the meaningful labels on every row of output.
The ETL process that loads dimensions is challenging because it must absorb the complexities of the source systems and transform the data into simple, selectable dimension attributes. This chapter explains step-by-step how to load data warehouse dimension tables, including the most advanced ETL techniques. The chapter clearly illustrates how to:
- Assign surrogate keys
- Load Type 1, 2, and 3 slowly changing dimensions
- Populate bridge tables for multivalued and complex hierarchical dimensions
- Flatten hierarchies and selectively snowflake dimensions
We discuss the advanced administration and maintenance issues required to incrementally load dimensions, track the changes in dimensions using CRC codes, and contend with late-arriving data.
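To give a flavor of the Type 2 technique, the core response to a changed dimension attribute can be sketched in two SQL statements. This is a simplified illustration under our own assumed table design (a surrogate key sequence, effective dates, and a current-row flag), not the book's exact code:

    -- Hypothetical Type 2 response: expire the current row, then
    -- insert a new row for the same natural key with a fresh
    -- surrogate key drawn from a sequence.
    UPDATE customer_dim
    SET    row_end_date = TRUNC(SYSDATE) - 1,
           current_flag = 'N'
    WHERE  customer_nk  = :in_customer_nk
      AND  current_flag = 'Y';

    INSERT INTO customer_dim
      (customer_key, customer_nk, cust_address,
       row_start_date, row_end_date, current_flag)
    VALUES
      (customer_key_seq.NEXTVAL, :in_customer_nk, :in_new_address,
       TRUNC(SYSDATE), DATE '9999-12-31', 'Y');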
Chapter 6: Delivering Fact Tables
Fact tables hold the measurements of the business. In most data warehouses, fact tables are overwhelmingly larger than dimension tables, but at the same time they are simpler. In this chapter we explain the basic structure of all fact tables, including foreign keys, degenerate dimension keys, and the numeric facts themselves. We describe the role of the fact-table provider, the information steward responsible for the delivery of the fact tables to end-user environments.
Every fact table should be loaded with a surrogate key pipeline, which maps the natural keys of the incoming fact records to the correct contemporary surrogate keys needed to join to the dimension tables.
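In SQL terms, one simple form of this pipeline substitutes the current surrogate key for each incoming natural key during the fact load, roughly as follows (our own hypothetical schema; Chapter 6 develops the full technique):

    -- Hypothetical surrogate key pipeline: swap each incoming
    -- natural key for the matching current surrogate key, then
    -- load the fact row.
    INSERT INTO sales_fact
      (date_key, customer_key, product_key, sale_amount)
    SELECT d.date_key,
           c.customer_key,
           p.product_key,
           s.sale_amount
    FROM   stage_sales s
    JOIN   date_dim     d ON d.calendar_date = s.sale_date
    JOIN   customer_dim c ON c.customer_nk  = s.customer_nk
                         AND c.current_flag = 'Y'
    JOIN   product_dim  p ON p.product_nk   = s.product_nk
                         AND p.current_flag = 'Y';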
We describe the three fundamental grains of fact tables, which are sufficient to support all data warehouse applications.

We discuss some unusual fact table varieties, including factless fact tables and fact tables whose sole purpose is to register the existence of complex events, such as automobile accidents.
Finally, we discuss the basic architecture of aggregations, which are physically stored summaries that, much like indexes, serve solely to improve performance.
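In Oracle, for example, an aggregation might be realized as a materialized view that the optimizer can silently substitute for the base fact table. The sketch below is ours, with hypothetical object names, and assumes query rewrite is enabled for the session:

    -- Hypothetical aggregate: monthly sales by product, stored
    -- physically so queries at this grain avoid scanning detail
    -- fact rows.
    CREATE MATERIALIZED VIEW sales_month_agg
    ENABLE QUERY REWRITE
    AS
    SELECT d.year_month,
           f.product_key,
           SUM(f.sale_amount) AS sale_amount
    FROM   sales_fact f
    JOIN   date_dim   d ON d.date_key = f.date_key
    GROUP BY d.year_month, f.product_key;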
Part III: Implementation and Operations
The third part of the book assumes the reader has analyzed his or her requirements, heeded the realities of his or her data and available resources, and visualized the flow of data from extraction to delivery. Keeping all this in mind, Part III describes in some detail the main approaches to system implementation and to organizing the operations of the ETL system. We discuss the role of metadata in the ETL system and finally the various responsibilities of the ETL team members.
Chapter 7: Development
Chapter 7 develops the techniques that you'll need to develop the initial data load for your data warehouse, such as recreating history for slowly changing dimensions and integrating historic offline data with current online transactions, as well as historic fact loading.
The chapter also provides estimation techniques to calculate the time it should take to complete the initial load, exposes vulnerabilities to long-running ETL processes, and suggests methods to minimize your risk.
Automating the ETL process is an obvious requirement of the data warehouse project, but how is it done? The order and dependencies between table loads are crucial to successfully loading the data warehouse. This chapter reviews the fundamental functionality of ETL scheduling and offers criteria and options for executing the ETL schedule. Once the fundamentals are covered, topics such as enforcing referential integrity with the ETL and maintaining operational metadata are examined.

Chapter 8: Operations
We begin this chapter by showing the approaches to scheduling the various ETL system jobs, responding to alerts and exceptions, and finally running the jobs to completion with all dependencies satisfied.
We walk through the steps to migrate the ETL system to the production environment. Since the production environment of the ETL system must be supported like any other mission-critical application, we describe how to set up levels of support for the ETL system that must be utilized upon failure of a scheduled process.

We identify key performance indicators for rating ETL performance and explore how to monitor and capture the statistics. Once the ETL key performance indicators are collected, you are armed with the information you need to address the components within the ETL system to look for opportunities to modify and increase the throughput as much as possible.
Chapter 9: Metadata
The ETL environment often assumes the responsibility of storing and managing the metadata for the entire data warehouse. After all, there is no better place than the ETL system for storing and managing metadata because the environment must know most aspects of the data to function properly. Chapter 9 defines the three types of metadata—business, technical, and process—and presents the elements within each type as they apply to the ETL system. The chapter offers techniques for producing, publishing, and utilizing the various types of metadata and also discusses the opportunity for improvement in this area of the data warehouse. We finish the chapter by discussing metadata standards and best practices and provide recommended naming standards for the ETL.
Chapter 10: Responsibilities
The technical aspects of the ETL process are only a portion of the ETL lifecycle. Chapter 10 is dedicated to the managerial aspects of the lifecycle required for a successful implementation. The chapter describes the duties and responsibilities of the ETL team and then goes on to outline a detailed project plan that can be implemented in any data warehouse environment. Once the basics of managing the ETL system are conveyed, the chapter dives into more-detailed project management activities such as project staffing, scope management, and team development. This somewhat nontechnical chapter provides the greatest benefit to ETL and data warehouse project managers. It describes the roles and skills that are needed for an effective team and offers a comprehensive ETL project plan that can be repeated for each phase of the data warehouse. The chapter also includes forms that managers need to lead their teams through the ETL lifecycle. Even if you are not a manager, this chapter is required reading to adequately understand how your role works with the other members of the ETL team.
Part IV: Real Time Streaming ETL Systems
Since real-time ETL is a relatively young technology, we are more likely to come up against unique requirements and solutions that have not yet been perfected. In this chapter, we share our experiences to provide insight on the latest challenges in real-time data warehousing and offer recommendations on overcoming them. The crux of real-time ETL is covered in this chapter, and the details of actual implementations are described.
Chapter 11: Real-Time ETL
In this chapter, we begin by defining the real-time requirement. Next, we review the different architecture options available today and appraise each. We end the chapter with a decision matrix to help you decide which real-time architecture is right for your specific data warehouse environment.

Chapter 12: Conclusion
The final chapter summarizes the unique contributions made in this book and provides a glimpse into the future for ETL and data warehousing as a whole.
Who Should Read this Book
Anyone who is involved or intends to be involved in a data-warehouse initiative should read this book. Developers, architects, and managers will benefit from this book because it contains detailed techniques for delivering a dimensionally oriented data warehouse and provides a project management perspective for all the back room activities.
manage-Chapters 1, 2, and 10 offer a functional view of the ETL that can easily beread by anyone on the data warehouse team but is intended for businesssponsors and project managers As you progress through these chapters,expect their technical level to increase, eventually getting to the point where
it transforms into a developers handbook This book is a definitive guidefor advice on the tasks required to load the dimensional data warehouse
Summary
The goal of this book is to make the process of building an ETL system understandable with specific checkpoints along the way. This book shows the often under-appreciated value the ETL system brings to data warehouse data. We hope you enjoy the book and find it valuable in your workplace.

We intentionally remain vendor-neutral throughout the book so you can apply the techniques within to the technology of your liking. If this book accomplishes nothing else, we hope it encourages you to get thinking and start breaking new ground to challenge the vendors to extend their product offerings to incorporate the features that the ETL team requires to bring the ETL (and the data warehouse) to full maturity.
C H A P T E R 1

Surrounding the Requirements
Ideally, you must start the design of your ETL system with one of the toughest challenges: surrounding the requirements. By this we mean gathering in one place all the known requirements, realities, and constraints affecting the ETL system. We'll refer to this list as the requirements, for brevity.

The requirements are mostly things you must live with and adapt your system to. Within the framework of your requirements, you will have many places where you can make your own decisions, exercise your judgment, and leverage your creativity, but the requirements are just what they are named. They are required. The first section of this chapter is intended to remind you of the relevant categories of requirements and give you a sense of how important the requirements will be as you develop your ETL system.
Following the requirements, we identify a number of architectural decisions you need to make at the beginning of your ETL project. These decisions are major commitments because they drive everything you do as you move forward with your implementation. The architecture affects your hardware, software, coding practices, personnel, and operations.

The last section describes the mission of the data warehouse. We also carefully define the main architectural components of the data warehouse, including the back room, the staging area, the operational data store (ODS), and the presentation area. We give a careful and precise definition of data marts and the enterprise data warehouse (EDW). Please read this chapter very carefully. The definitions and boundaries we describe here drive the whole logic of this book. If you understand our assumptions, you will see why our approach is more disciplined and more structured than any other data warehouse design methodology. We conclude the chapter with a succinct statement of the mission of the ETL team.
P R O C E S S C H E C K Planning & Design: Requirements/Realities ➔ Architecture ➔ Implementation ➔ Test/Release
Data Flow: Haven't started tracing the data flow yet.
Requirements
In this book's introduction, we list the major categories of requirements we think important. Although every one of the requirements can be a showstopper, business needs have to be more fundamental and important.
Business Needs
Business needs are the information requirements of the end users of the data warehouse. We use the term business needs somewhat narrowly here to mean the information content that end users need to make informed business decisions. Other requirements listed in a moment broaden the definition of business needs, but this requirement is meant to identify the extended set of information sources that the ETL team must introduce into the data warehouse.

Taking, for the moment, the view that business needs directly drive the choice of data sources, it is obvious that understanding and constantly examining business needs is a core activity of the ETL team.
In the Data Warehouse Lifecycle Toolkit, we describe the process for interviewing end users and gathering business requirements. The result of this process is a set of expectations that users have about what data will do for them. In many cases, the original interviews with end users and the original investigations of possible sources do not fully reveal the complexities and limitations of data. The ETL team often makes significant discoveries that affect whether the end user's business needs can be addressed as originally hoped for. And, of course, the ETL team often discovers additional capabilities in the data sources that expand end users' decision-making capabilities.

The lesson here is that even during the most technical back-room development steps of building the ETL system, a dialog amongst the ETL team, the data warehouse architects, and the end users should be maintained. In a larger sense, business needs and the content of data sources are both moving targets that constantly need to be re-examined and discussed.
Compliance Requirements
In recent years, especially with the passage of the Sarbanes-Oxley Act of 2002, organizations have been forced to seriously tighten up what they report and provide proof that the reported numbers are accurate, complete, and have not been tampered with. Of course, data warehouses in regulated businesses like telecommunications have complied with regulatory reporting requirements for many years. But certainly the whole tenor of financial reporting has become much more serious for everyone.

Several of the financial-reporting issues will be outside the scope of the data warehouse, but many others will land squarely on the data warehouse. Typical due diligence requirements for the data warehouse include:
- Archived copies of data sources and subsequent stagings of data
- Proof of the complete transaction flow that changed any data
- Fully documented algorithms for allocations and adjustments
- Proof of security of the data copies over time, both on-line and off-line
Data Profiling
As Jack Olson explains so clearly in his book Data Quality: The Accuracy Dimension, data profiling is a necessary precursor to designing any kind of system to use that data. As he puts it: "[Data profiling] employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure, and quality of the data. A good data profiling [system] can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed."

This perspective is especially relevant to the ETL team, who may be handed a data source whose content has not really been vetted. For example, Jack points out that a data source that perfectly suits the needs of the production system, such as an order-taking system, may be a disaster for the data warehouse, because the ancillary fields the data warehouse hoped to use were not central to the success of the order-taking process and were revealed to be unreliable and too incomplete for data warehouse analysis.
Data profiling is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built; a simple profiling query is sketched just after the list below. At one extreme, a very clean data source that has been well maintained before it arrives at the data warehouse requires minimal transformation and human intervention to load directly into final dimension tables and fact tables. But a dirty data source may require:
- Elimination of some input fields completely
- Flagging of missing data and generation of special surrogate keys
- Best-guess automatic replacement of corrupted values
- Human intervention at the record level
- Development of a full-blown normalized representation of the data
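A first-pass profiling measurement can be as simple as a completeness query. The sketch below (our own, with a hypothetical source table and column) measures how often a candidate field is actually populated before the ETL team commits to using it:

    -- Hypothetical profiling query: measure the completeness of a
    -- candidate source column before designing the ETL around it.
    -- COUNT(ship_date) counts only the non-null values.
    SELECT COUNT(*)                                    AS total_rows,
           COUNT(ship_date)                            AS populated_rows,
           ROUND(100 * COUNT(ship_date) / COUNT(*), 1) AS pct_populated
    FROM   src_order_lines;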