Claudia Imhoff
Nicholas Galemmo
Jonathan G. Geiger
Mastering Data Warehouse Design: Relational and Dimensional Techniques
Vice President and Executive Publisher: Robert Ipsen
Publisher: Joe Wikert
Executive Editor: Robert M. Elliott
Developmental Editor: Emilie Herman
Editorial Manager: Kathryn Malm
Managing Editor: Pamela M. Hanley
Text Design & Composition: Wiley Composition Services
This book is printed on acid-free paper. ∞
Copyright © 2003 by Claudia Imhoff, Nicholas Galemmo, and Jonathan G. Geiger. All rights reserved.
Published by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8700. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4447, E-mail: permcoordinator@wiley.com.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Trademarks: Wiley, the Wiley Publishing logo and related trade dress are trademarks or registered trademarks of Wiley Publishing, Inc., in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
ISBN: 0-471-32421-3
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Claudia: For all their patience and understanding throughout the years, this book is dedicated to David and Jessica Imhoff.
Nick: To my wife Sarah, and children Amanda and Nick Galemmo, for their understanding over the many weekends I spent working on this book. Also to my college professor, Julius Archibald at the State University of New York at Plattsburgh, for instilling in me the science and art of computing.
Jonathan: To my wife, Alma Joy, for her patience and understanding of the time spent writing this book, and to my children, Avi and Shana, who are embarking on their respective careers and of whom I am extremely proud.
Overview of Business Intelligence
Role and Purpose of the Data Warehouse
The Corporate Information Factory
The Data Warehouse Data Model
Nonredundant
Stable
Consistent
Flexible in Terms of the Ultimate Data Usage
The Codd and Date Premise
Impact on Data Mart Creation
Summary
Chapter 2 Fundamental Relational Concepts
Why Do You Need a Data Model?
Relational Data-Modeling Objects
Subject
Entity
Relationships
Subject Area Model Benefits
Business Data Model Benefits
Relational Data-Modeling Guidelines
Guidelines and Best Practices
Normalization
Normalization of the Relational Data Model
Other Normalization Levels
Summary
Considerations for Specific Industries
Retail Industry Considerations
Manufacturing Industry Considerations
Utility Industry Considerations
Property and Casualty Insurance Industry Considerations
Petroleum Industry Considerations
Health Industry Considerations
Subject Area Model Development Process
Closed Room Development
Development through Interviews
Development through Facilitated Sessions
Subject Area Model Benefits
Subject Area Model for Zenith Automobile Company
Contents
Business Data Model
Business Data Development Process
Identify Relevant Subject Areas
Identify Major Entities and Establish Identifiers
Define Relationships
Confirm Model Structure
Confirm Model Content
Summary
Methodology
Step 1: Select the Data of Interest
Inputs
Step 2: Add Time to the Key
Capturing Historical Data
Capturing Historical Relationships
Dimensional Model Considerations
Step 3: Add Derived Data
Step 4: Determine Granularity Level
Step 5: Summarize Data
Summaries for Period of Time Data
Summaries for Snapshot Data
Step 6: Merge Entities
Step 7: Create Arrays
Step 8: Segregate Data
Summary
Inconsistent Business Definition of Customer
Inconsistent System Definition of Customer
Inconsistent Customer Identifier among Systems
Inclusion of External Data
Data at a Customer Level
Data Grouped by Customer Characteristics
Customers Uniquely Identified Based on Role
Customer Hierarchy Not Depicted
Data Warehouse System Model
Inconsistent Business Definition of Customer
Inconsistent System Definition of Customer
Inconsistent Customer Identifier among Systems
Absorption of External Data
Customers Uniquely Identified Based on Role
Customer Hierarchy Not Depicted
Data Warehouse Technology Model
Key from the System of Record
Key from a Recognized Standard
Dimensional Data Mart Implications
Differences in a Dimensional Model
Maintaining Dimensional Conformance
Summary
The Fiscal Calendar
The 4-5-4 Fiscal Calendar
Thirteen-Month Fiscal Calendar
Other Fiscal Calendars
The Billing Cycle Calendar
The Factory Calendar
Time and the Data Warehouse
Case Study: A Multilingual Calendar
Analysis
Storing Multiple Languages
Handling Different Date Presentation Formats
Database Localization
Query Tool Localization
Delivery Localization
Delivering Multiple Languages
Monolingual Reporting
Creating a Multilingual Data Mart
Case Study: Multiple Fiscal Calendars
Analysis
Expanding the Calendar
Case Study: Seasonal Calendars
Analysis
Seasonal Calendar Structures
Delivering Seasonal Data
Summary
Updating the Bridge
The Customer Hierarchy
The Recursive Hierarchy Tree
Using Recursive Trees in the Data Mart
Maintaining History
Case Study: Retail Purchasing
Analysis
Implementing the Business Model
The Buyer Hierarchy
Implementing Buyer Responsibility
Delivering the Buyer Responsibility Relationship
Case Study: The Combination Pack
Analysis
Adding a Bill of Materials
Making a Recursive Tree
Flattening a Recursive Tree
Summary
Business Use of the Data Warehouse
Average Lines per Transaction
Business Rules Concerning Changes
Method 1: Using Foreign Keys
Method 2: Using Associative Entities
Technique 3: Change Snapshot with Delta Capture
Case Study: Transaction Interface
Modeling the Transactions
Processing the Transactions
Simultaneous Delivery
Summary
Optimizing the Development Process
Optimizing Design and Analysis
Optimizing Application Development
Selecting an ETL Tool
Reasons for Partitioning
Indexing Partitioned Tables
Enforcing Referential Integrity
Index-Organized Tables
Conclusion
Optimizing the System Model
Vertical Partitioning
Vertical Partitioning for Performance
Vertical Partitioning of Change History
Vertical Partitioning of Large Columns
Denormalization
Summary
The Changing Data Warehouse
Modeling for Business Change
Assuming the Worst Case
Imposing Relationship Generalization
Using Surrogate Keys
Implementing Business Change
Integrating Subject Areas
Standardizing Attributes
Inferring Roles and Integrating Entities
Adding Subject Areas
Summary
Governing Models and Their Evolution
Technology Data Model
Synchronization Implications
Subject Area and Business Data Models
Color-Coding
Including the Subject Area within the Entity Name
Business and System Data Models
System and Technology Data Models
Managing Multiple Modelers
Roles and Responsibilities
Business Data Model
System and Technology Data Model
Collision Management
Modifications
Comparison
Incorporation
Summary
Criteria for Being in-Architecture
Migrating from Data Mart Chaos
Conform the Dimensions
Create the Data Warehouse Data Model
Create the Data Warehouse
Convert by Subject Area
Convert One Data Mart at a Time
Build New Data Marts Only “In-Architecture”: Leave Old Marts Alone
Build the Architecture from One Data Mart
Choosing the Right Migration Path
Summary
Chapter 13 Comparison of Data Warehouse Methodologies
The Multidimensional Architecture
The Corporate Information Factory Architecture
Comparison of the CIF and MD Architectures
Scope
Perspective
Volatility
Flexibility
Complexity
Functionality
ACKNOWLEDGMENTS
We gratefully acknowledge the following individuals who directly or indirectly
contributed to this book:
Greg Backhus – Helzberg Diamonds
William Baker – Microsoft Corporation
John Crawford – Merrill Lynch
David Gleason – Intelligent Solutions, Inc.
William H. Inmon – Inmon Associates, Inc.
Dr. Ralph S. Kimball – Kimball Associates
Lisa Loftis – Intelligent Solutions, Inc.
Bob Lokken – ProClarity Corporation
Anthony Marino – L’Oreal Corporation
Joyce Norris-Montanari – Intelligent Solutions, Inc.
Laura Reeves – StarSoft, Inc.
Ron Powell – DM Review Magazine
Kim Stannick – Teradata Corporation
Barbara von Halle – Knowledge Partners, Inc.
John Zachman – Zachman International, Inc.
We would also like to thank our editors, Bob Elliott, Pamela Hanley, and Emilie Herman, whose tireless prodding and assistance kept us honest and on schedule.
Claudia Imhoff, Ph.D., is the president and founder of Intelligent Solutions (www.IntelSols.com), a leading consultancy on CRM (Customer Relationship Management) and business intelligence technologies and strategies. She is a popular speaker and internationally recognized expert and serves as an advisor to many corporations, universities, and leading technology companies on these topics. She has coauthored five books and over 50 articles on these topics. She can be reached at CImhoff@IntelSols.com.
Nicholas Galemmo was an information architect at Nestlé USA. Nicholas has 27 years’ experience as a practitioner and consultant involved in all aspects of application systems design and development within the manufacturing, distribution, education, military, health care, and financial industries. He has been actively involved in large-scale data warehousing and systems integration projects for the past 11 years. He has built numerous data warehouses, using both dimensional and relational architectures. He has published many articles and has presented at national conferences. This is his first book. Mr. Galemmo is now an independent consultant and can be reached at ngalemmo@yahoo.com.
Jonathan G. Geiger is executive vice president at Intelligent Solutions, Inc. Jonathan has been involved in many Corporate Information Factory and customer relationship management projects within the utility, telecommunications, manufacturing, education, chemical, financial, and retail industries. In his 30 years as a practitioner and consultant, Jonathan has managed or performed work in virtually every aspect of information management. He has authored or coauthored over 30 articles and two other books, presents frequently at national and international conferences, and teaches several public seminars. Mr. Geiger can be reached at JGeiger@IntelSols.com.
We have found that an understanding of why a particular approach is being promoted helps us recognize its value and apply it. Therefore, we start this section with an introduction to the Corporate Information Factory (CIF). This proven and stable architecture includes two formal data stores for business intelligence, each with a specific role in the BI environment.
The first data store is the data warehouse. The major role of the data warehouse is to serve as a data repository that stores data from disparate sources, making it accessible to another set of data stores: the data marts. Because the data warehouse is the collection point, the most effective design approach for it is based on an entity-relationship data model and the normalization techniques developed by Codd and Date in their seminal work on relational databases throughout the 1970s, 1980s, and 1990s.
The major role of the data mart is to provide the business users with easy access to quality, integrated information. There are several types of data marts, and these are also described in Chapter 1. The most popular data mart is built to support online analytical processing, and the most effective design approach for it is the dimensional data model.
Continuing with the conceptual theme, we explain the importance of relational modeling techniques, introduce the different types of models that are needed, and provide a process for building a relational data model in Chapter 2. We also explain the relationship between the various data models used in constructing a solid foundation for any enterprise (the business, system, and technology data models) and how they share or inherit characteristics from each other.
ONE
Chapter 1 Introduction
Welcome to the first book that thoroughly describes the data modeling techniques used in constructing a multipurpose, stable, and sustainable data warehouse used to support business intelligence (BI). This chapter introduces the data warehouse by describing the objectives of BI and the data warehouse and by explaining how these fit into the overall Corporate Information Factory (CIF) architecture. It discusses the iterative nature of the data warehouse construction and demonstrates the importance of the data warehouse data model and the justification for the type of data model format suggested in this book. We discuss why the format of the model should be based on relational design techniques, illustrating the need to maximize nonredundancy, stability, and maintainability. Another section of the chapter outlines the characteristics of a maintainable data warehouse environment. The chapter ends with a discussion of the impact of this modeling approach on the ultimate delivery of the data marts. This chapter sets up the reader to understand the rationale behind the ensuing chapters, which describe in detail how to create the data warehouse data model.
Overview of Business Intelligence
BI, in the context of the data warehouse, is the ability of an enterprise to study past behaviors and actions in order to understand where the organization has