It is beyond the scope of this chapter to contribute to the ongoing debate about the relative advantages of these and other data warehouse architectures. (Some suitable references are listed in Further Reading.) Unless otherwise noted, our discussion in this chapter assumes the simple architecture of Figure 16.1, but you should have little trouble adapting the principles to alternative structures.
Figure 16.1 Typical data warehouse and data mart architecture: source data and external data are copied into the data warehouse by load programs; further load programs populate the data marts, which are accessed through query tools.

Data warehouses are now widely used and generally need to be developed in-house, primarily because the mix of source systems (and associated operational databases) varies so much from organization to organization. Reporting requirements, of course, may also vary. This is good news for data modelers because data warehouses and data marts are databases, which, of course, must be specified by data models. There may also be some reverse engineering and general data management work to be done in order to understand the organization and meaning of the data in the source systems (as discussed in Chapter 17).
Data modeling for data warehouses and marts, however, presents a range of new challenges and has been the subject of much debate among data modelers and database designers. An early quote indicates how the battle lines were drawn:

"Forget everything you know about entity relationship data modeling ... using that model with a real-world decision support system almost guarantees failure."1
On the other side of the debate were those who argued that "a database is a database" and nothing needed to change.
Briefly, there are two reasons why data modeling for warehouses and marts is different. First, the requirements that data warehouses and marts need to satisfy are different (or at least differ in relative importance) from those for operational databases. Second, the platforms on which they are implemented may not be relational; in particular, data marts are frequently implemented on specialized multidimensional DBMSs.
Many of the principles and techniques of data modeling for operational databases are adaptable to the data warehouse environment but cannot be carried across uncritically. And there are new techniques and patterns.
We first look at how the requirements for data marts and data warehouses differ from those for operational databases. We then reexamine the rules of data modeling and find that, although the basic objectives (expressed as evaluation criteria/quality measures) remain the same, their relative importance changes. As a result, we need to modify some of the rules and add some general guidelines for data warehouse and data mart modeling. Finally, we look specifically at the issues of organizing data in data warehouses and data marts.

1. Kimball, R., and Strehlo, K., "Why Decision Support Fails and How to Fix It," Datamation, June 1, 1994.

16.2 Characteristics of Data Warehouses and Data Marts
The literature on data warehouses identifies a number of characteristics that differentiate warehouses and marts from conventional operational databases. Virtually all of these have some impact on data modeling.
16.2.1 Data Integration: Working with Existing Databases
A data warehouse is not simply a collection of copies of records from source systems. It is a database that "makes sense" in its own right. We would expect to specify one Product table even if the warehouse drew on data from many overlapping Product tables or files with inconsistent definitions and coding schemes. The data modeler can do little about these historical design decisions but needs to define target tables into which all of the old data will fit, after some translation and/or reformatting. These tables will in turn need to be further combined, reformatted, and summarized as required to serve the data marts, which may also have been developed prior to the warehouse. (Many organizations originally developed individual data marts, fed directly from source systems and often called "data warehouses," until the proliferation of ETL programs forced the development of an intermediate warehouse.) Working within such constraints adds an extra challenge to the data modeling task and means that we will often end up with less than ideal structures.
16.2.2 Loads Rather Than Updates
Data marts are intended to support queries and are typically updated through periodic batch loading of data from the warehouse or directly from operational databases. Similarly, the data warehouse is likely to be loaded from the operational databases through batch programs, which are not expected to run concurrently with other access. This strategy may be adopted not only to improve efficiency and manage contention for data resources, but also to ensure that the data warehouse and data marts are not "moving targets" for queries, which generally need to produce consistent results.
Recall our discussion of normalization. One of the strongest reasons for normalizing beyond first normal form was to prevent "update anomalies" where one occurrence of an item is updated but others are left unchanged. In the data warehouse environment, we can achieve that sort of consistency in a different way, through careful design of the load programs, knowing that no other update transactions will run against the database.
Of course, there is no point in abandoning or compromising normalization just because we can tackle the problem in another (less elegant) way. There needs to be some payoff, and this may come through improved performance or simplified queries. And if we chose to "trickle feed" the warehouse using conventional transactions, update anomalies could become an issue again.
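As a minimal illustration (the table and column names here are our own, not from the chapter), the following Python sketch shows a batch load that derives a redundant customer-name column on every fact row in a single controlled pass, so the redundancy cannot drift out of step the way it could under piecemeal updates:

```python
# Minimal sketch of a controlled batch load (hypothetical tables and columns).
# Because all rows are rewritten in one pass, the redundant customer_name
# column cannot become inconsistent the way it could under ad hoc updates.

def load_sales_fact(source_sales, customer_dimension):
    """Build warehouse fact rows, deriving redundant columns consistently."""
    customer_name_by_id = {c["customer_id"]: c["name"] for c in customer_dimension}
    fact_rows = []
    for sale in source_sales:
        fact_rows.append({
            "sale_id": sale["sale_id"],
            "customer_id": sale["customer_id"],
            # Redundant (denormalized) column, derived once per load run.
            "customer_name": customer_name_by_id[sale["customer_id"]],
            "value": sale["value"],
        })
    return fact_rows

if __name__ == "__main__":
    customers = [{"customer_id": 1, "name": "Acme Pty Ltd"}]
    sales = [{"sale_id": 100, "customer_id": 1, "value": 250.0}]
    print(load_sales_fact(sales, customers))
```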
16.2.3 Less Predictable Database “Hits”
In designing an operational database, we usually have a good idea of the type and volumes of transactions that will run against it. We can optimize the database design to process those transactions simply and efficiently, sometimes at the expense of support for lower-volume or unpredicted transactions.

Queries against a data mart are less predictable, and, indeed, the ability to support ad hoc queries is one of the major selling points of data marts. A design decision (such as use of a repeating group, as described in Chapter 2) that favors one type of query at the expense of others will need to be very carefully thought through.
16.2.4 Complex Queries, Simple Interface
One of the challenges of designing data marts and associated query tools is the need to support complex queries and analyses in a relatively simple way. It is not usually reasonable to expect users of the facility to navigate complex data structures in the manner of experienced programmers, yet typical queries against a fully normalized database may require data from a large number of tables. (We say "not usually reasonable" because some users of data marts, such as specialist operational managers, researchers, and data miners, may be willing and able to learn to navigate sophisticated structures if the payoff is sufficient.)
Perhaps the central challenge for the data mart modeler comes from the approach that tool vendors have settled on to address the problem. Data mart query tools are generally intended for use with a multidimensional database based on a central "fact" table and associated look-up tables called dimension tables, or just dimensions. (Figure 16.2 in Section 16.6.2 shows an example.) The data modeler is required to fit the data into this structure, in contrast to the perspective discussed in Chapter 1. From a user perspective, the solution is elegant, in that it is easy to understand and use and is consistent from one mart to the next. From the data modeler's perspective, some very inelegant decisions may need to be taken to meet the constraint.
16.2.5 History
The holding of historical information is one of the most important characteristics of a data warehouse. Managers are frequently interested in trends, whereas operational users of data may only require the current position. Such information may be built up in the data warehouse over a period of time and retained long after it is no longer required in the source systems. The challenge of modeling time-dependent data may be greater for the data warehouse designer than for the operational database designer.

16.2.6 Summarization
The data warehouse seldom contains complete copies of all data held (currently or historically) in operational databases. Some is excluded, and some may be held only in summary form. Whenever we summarize, we lose information, and the data modeler needs to be fully aware of the impact of summarization on all potential users.
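A tiny Python sketch, using made-up figures, of what summarization costs us: once daily sales are rolled up to monthly totals, the daily detail cannot be reconstructed from the summary.

```python
# Illustrative only: summarizing daily sales to monthly totals discards detail
# that no query against the summary can recover.
from collections import defaultdict

daily_sales = [  # (date, amount) - made-up example figures
    ("2005-01-03", 120.0),
    ("2005-01-17", 80.0),
    ("2005-02-02", 200.0),
]

monthly_totals = defaultdict(float)
for date, amount in daily_sales:
    monthly_totals[date[:7]] += amount  # keep only year-month

print(dict(monthly_totals))  # {'2005-01': 200.0, '2005-02': 200.0}
# From these totals alone we can no longer answer, say,
# "What were sales on 2005-01-17?" - the base data is gone.
```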
16.3 Quality Criteria for Warehouse and Mart Models
It is interesting to take another look at the evaluation or quality criteria for data models that we identified in Chapter 1, but this time in the context of the special requirements of data warehouses and marts. All remain relevant, but their relative importance changes. Thus, our trade-offs are likely to be different.
16.3.1 Completeness
In designing a data warehouse, we are limited by the data available in the operational databases or from external sources. We have to ask not only, "What do we want?" but also, "What do we have?" and, "What can we get?" Practically, this means acquainting ourselves with the source system data either at the outset or as we proceed. For example:
User: "I want to know what percentage of customers spend more than a specified amount on CDs when they shop here."

Modeler: "We only record sales, not customers, so what we can tell you is what percentage of sales exceed a certain value."

User: "Same thing, isn't it?"

Modeler: "Not really. What if the customer buys a few CDs in the classical section, then stops by the rock section and buys some more?"

User: "That'd actually be interesting to know. Can you tell us how often that happens? And what about if they see another CD as they're walking out and come back and buy it? They see the display by the door ..."

Modeler: "We can get information on that for those customers who use their store discount card, because we can identify them ..."
The users of data warehouses, interested in aggregated information, may not make the same demands for absolute accuracy as the user of an operational system. Accordingly, it may be possible to compromise completeness to achieve simplicity (as discussed below in Section 16.3.3). Of course, this needs to be verified at the outset. There are examples of warehouses that have lost credibility because the outputs did not balance to the last cent. What we cannot afford to compromise is good documentation, which should provide the user with information on the currency, completeness, and quality of the data, as well as the basic definitions.

Finally, we may lose data by summarizing it to save space and processing. The summarization may take place either when data is loaded from operational databases to the warehouse (a key design decision) or when it is loaded from the warehouse to the marts (a decision more easily reversed).
16.3.2 Nonredundancy
We can be a great deal less concerned about redundancy in data warehouses and data marts than we would be with operational databases. As discussed earlier, since data is loaded through special ETL programs or utilities, and not updated in the usual sense, we do not face the same risk that fields may be updated inconsistently. Redundancy does, of course, still cost us in storage space, and data warehouses can be very large indeed.

Particularly in data marts, denormalization is regularly practiced to simplify structures, and we may also carry derived data, such as commonly used totals.
16.3.3 Enforcement of Business Rules

We tend not to think of a data warehouse or mart as enforcing business rules in the usual sense, because of the absence of traditional update transactions. Nevertheless, the data structures will determine what sort of data can be loaded, and if the data warehouse or mart implements a rule that is not supported by a source system, we will have a challenge to address! Sometimes, the need to simplify data leads us to (for example) implement a one-to-many relationship even though a few real-world cases are many-to-many. Perhaps an insurance policy can occasionally be sold by more than one salesperson, but we decide to build our data mart around a Policy table with a Salesperson dimension. We have specified a tighter rule, and we are going to end up trading some "completeness" for the gain in simplicity.
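The following Python sketch illustrates how such a tighter rule might be enforced at load time; the Policy and Salesperson names follow the example above, and the "first listed salesperson wins" choice is purely illustrative:

```python
# Sketch only: enforcing a one-to-many rule during load, where the source
# occasionally holds many-to-many data. The "first listed salesperson wins"
# policy is an illustrative assumption, not a recommendation from the text.

def load_policy_rows(source_policies):
    rows, dropped = [], 0
    for policy in source_policies:
        salespersons = policy["salesperson_ids"]
        if len(salespersons) > 1:
            dropped += len(salespersons) - 1  # completeness traded for simplicity
        rows.append({
            "policy_id": policy["policy_id"],
            "salesperson_id": salespersons[0],  # single dimension key only
        })
    return rows, dropped

policies = [
    {"policy_id": "P1", "salesperson_ids": ["S1"]},
    {"policy_id": "P2", "salesperson_ids": ["S2", "S3"]},  # shared sale
]
rows, dropped = load_policy_rows(policies)
print(rows, "links dropped:", dropped)
```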
16.3.4 Data Reusability
Reusability, in the sense of reusing data captured for operational purposes to support management queries, is the raison d'être of most data warehouses and marts. More so than in operational databases, we have to expect the unexpected as far as queries are concerned. Data marts may be constructed to support a particular set of queries (we can build another mart if necessary to support a new requirement), but the data warehouse itself needs to be able to feed virtually any conceivable mart that uses the data that it holds. Here is an argument in favor of full normalization in the data warehouse, and against any measures that irrecoverably lose data, such as summarization with removal of the source data.
16.3.5 Stability and Flexibility
One of the challenges of data warehouse design is to accommodate changes in the source data. These may reflect real changes in the business or simply changes (including complete replacement) to the operational databases.
Much of the value of a data warehouse may come from the build-up of historical data over a long period. We need to build structures that not only accommodate the new data, but also allow us to retain the old.
It is a maxim of data warehouse designers that "data warehouse design is never finished." If users gain value from the initial implementation, it is almost inevitable that they will require that the warehouse and marts be extended, often very substantially. Many a warehouse project has delivered a warehouse that cannot be easily extended, requiring new warehouses to be constructed as the requirements grow. The picture in Figure 16.1 becomes much less elegant when we add multiple warehouses in the middle, possibly sharing common source databases and target data marts.
16.3.6 Simplicity and Elegance
As discussed earlier, data marts often need to be restricted to simple structures that suit a range of query tools and are relatively easy for end-users to understand.
16.3.7 Communication Effectiveness
It is challenging enough to communicate "difficult" data structures to professional programmers, let alone end-users, who may have only an occasional need to use the data marts. Data marts that use highly generalized structures and unfamiliar terminology, or that are based on a sophisticated original view of the business, are going to cause problems.
16.3.8 Performance

The data warehouse needs to be able to accept the uploading of large volumes of data, usually within a limited "batch window" when operational databases are not required for real-time processing. It also needs to support reasonably rapid extraction of data for the data marts. Data loading may use purpose-designed ETL utilities, which will dictate how data should be organized to achieve best performance.
16.4 The Basic Design Principle

The architecture shown in Figure 16.1 has evolved from earlier approaches in which the data warehouse and data marts were combined into a single database. In this architecture, the data warehouse serves as a staging point or clearinghouse between different representations of the data, while the data marts are designed to present simpler views to the end-users.
The basic rule for the data modeler is to respect this separation.
Accordingly, we design the data warehouse much as we would an operational database, but with a recognition that the relative importance of the various design objectives/quality criteria (as reviewed in the previous section) may be different. So, for example, we may be more prepared to accept a denormalized structure, or some data redundancy, provided, of course, there is a corresponding payoff. Flexibility is paramount. We can expect to have to accommodate growth in scope, new and changed operational databases, and new data marts.
oper-Data marts are a different matter Here we need to fit data into a quiterestrictive structure, and the modeling challenge is to achieve this withoutlosing the ability to support a reasonably wide range of queries We willusually end up making some serious compromises, which may be accept-able for the data mart but would not be so for an operational database ordata warehouse
Many successful data warehouses have been designed by data modelers who tackled the modeling assignment as if they were designing an operational database. We have even seen examples of data warehouses that had to be completely redesigned according to this traditional approach after ill-advised attempts to apply modeling approaches borrowed from the data mart theory. Conversely, there is a strong school of thought that argues that the data warehouse model can usefully anticipate some common data manipulation and summarization.

Both arguments have merit, and the path you take should be guided by the business and technical requirements in each case. That is why we devoted so much space at the beginning of this chapter to differences and goals; it is a proper appreciation of these, rather than the brute application of some special technique, that leads to good warehouse design.
We can, however, identify a few general techniques that are specific to data warehouse design.

16.5 Modeling for the Data Warehouse
16.5.1 An Initial Model
Data warehouse designers usually find it useful to start with an E-R model of the total business or, at least, of the part of the business that the data warehouse may ultimately cover. The starting point may be an existing enterprise data model (see Chapter 17) or a generalization of the data structures in the most important source databases. If an enterprise data model is used, the data modeler will need to check that it aligns reasonably closely with existing structures rather than representing a radical "future vision." Data warehouse designers are not granted the latitude of data modelers starting with a blank slate!
16.5.2 Understanding Existing Data
In theory, we could construct a data warehouse without ever talking to the business users, simply by consolidating data from the operational databases. Such a warehouse would (again in theory) allow any query possible within the limitations of the source data.
In practice, we need user input to help select what data will be relevant to the data mart users (the extreme alternative would be to load every data item from every source system), to contribute to the inevitable decisions on compromises, and, of course, to "buy in" and support the project.
Nevertheless, a good part of data warehouse design involves gaining an understanding of data from the source systems and defining structures to hold and consolidate it. Usually the most effective approach is to use the initial model as a starting point and to map the existing structures against it. Initially, we do this at an entity level, but as modeling proceeds in collaboration with the users, we add attributes and possibly subtypes.
to what is likely to be possible and what alternatives may be available
16.5.4 Determining Sources and Dealing with Differences
One of the great challenges of data warehouse design is in making the most of source data in legacy systems. If we are lucky, some of the source data will be well structured and documented; more often we will encounter overloaded attributes (see Section 5.3), poor documentation of definitions and coding schemes, and (almost certainly) inconsistency across databases.

Our choice of source for a data item, and, hence, its definition in the data warehouse, will depend on a number of factors:
1. The objective of minimizing the number of source systems feeding the data warehouse, in the interests of simplicity; reduced need for data integration; and reduced development, maintenance, and running costs.
2. The "quality" of the data item, a complex issue involving primarily the accuracy of the item instances (i.e., whether they accurately reflect the real world), but also timeliness (when were they last updated?) and compatibility with other items (update cycles again). Timing differences can be a major headache. The update cycles of data vary in many organizations from real-time to annually. Because of this, the "same" data item may hold different values in different source databases.
3. Whether multiple sources can be reconciled to produce a better overall quality. We may even choose to hold two or more versions of the "same" attribute in the warehouse, to enable a choice of the most appropriate version as required.
appro-4 The compatibility of the coding scheme with other data Incompatiblecoding schemes and data formats are relatively straightforward to han-dleas long as the mapping between them is simple If the underlying
definitions are different, it may be impossible to translate to a commonscheme without losing too much meaning It is easy to translate coun-try codes as long as you can agree what a country is! One police forcerecognizes three eye colors, another four.2
5. Whether overloaded attributes can be or need to be unpacked. For example, one database may hold name and address as a single field,3 while another may break each down into smaller fields: initial, family name, street number, and so on. Programmers often take serious liberties with data definitions, and many a field has been redefined well beyond its original intent. Usually, the job of unpacking it into primitive attributes is reasonably straightforward once the rules are identified (a small sketch follows this list).
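As a small illustration of the translation and unpacking work involved (the field layouts, code values, and mappings below are invented for the example), here is a Python sketch:

```python
# Illustrative sketch only: reconciling coding schemes and unpacking an
# overloaded field during extract/transform. All codes and layouts are
# invented for the example.

COUNTRY_CODE_MAP = {
    ("system_a", "AUS"): "AU",   # map each source scheme to a common scheme
    ("system_b", "61"): "AU",
    ("system_a", "NZL"): "NZ",
    ("system_b", "64"): "NZ",
}

def translate_country(source_system, source_code):
    return COUNTRY_CODE_MAP.get((source_system, source_code), "UNKNOWN")

def unpack_name(full_name):
    """Split a combined 'family-name, initial' field into primitive attributes."""
    family, _, initial = full_name.partition(",")
    return {"family_name": family.strip(), "initial": initial.strip()}

print(translate_country("system_b", "61"))   # AU
print(unpack_name("Smith, J"))               # {'family_name': 'Smith', 'initial': 'J'}
```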
In doing the above, the data warehouse designer may need to perform work that is, more properly, the responsibility of a data management or data administration team. Indeed, the problems of building data warehouses in the absence of good data management groundwork have often led to such teams being established or revived.

2. For a fascinating discussion of how different societies classify colors and a detailed example of the challenges that we face in coming up with classification schemes acceptable to all, see Chapter 2 of Language Universals and Linguistic Typology by Bernard Comrie, Blackwell, Oxford, 1981, ISBN 0-631-12971-5.

3. We use the general term "field" here rather than "column" as many legacy databases are not relational.
16.5.5 Shaping Data for Data Marts
How much should the data warehouse design anticipate the way that data will be held in the data marts? On the one hand, the data warehouse should be as flexible as possible, which means not organizing data in a way that will favor one user over another. Remember that the data warehouse may be required not only to feed data marts, but may also be the common source of data for other analysis and decision support systems. And some data marts offer broader options for organizing data.
On the other hand, if we can be reasonably sure that all users of the data will first perform some common transformations such as summarization or denormalization, there is an argument for doing them once, as data is loaded into the warehouse, rather than each time it is extracted. And denormalized data can usually be renormalized without too much trouble. (Summarization is a different matter: base data cannot be recovered from summarized data.) The data warehouse can act as a stepping-stone to greater levels of denormalization and summarization in the marts. When data volumes are very high, there is frequently a compelling argument for summarization to save space and processing.
Another advantage of shaping data at the warehouse stage is that it promotes a level of commonality across data marts. For example, a phone company might decide not to hold details of all telephone calls but rather only those occurring during a set of representative periods each week. If the decision was made at the warehouse stage, we could decide once and for all what the most appropriate periods were. All marts would then work with the same sampling periods, and results from different marts could be more readily compared.

Sometimes, the choice of approach will be straightforward. In particular,
if the data marts are implemented as views of the warehouse, we will need to implement structures that can be directly translated into the required shape for the marts.
The next section discusses data mart structures, and these can, with appropriate discretion, be incorporated into the data warehouse design. Where you are in doubt, however, our advice is to lean toward designing the data warehouse for flexibility, independent of the data marts. One of the great lessons of data modeling is that new and unexpected uses will be found for data, once it is available, and this is particularly true in the context of data warehouses. Maximum flexibility and minimum anticipation are good starting points!
16.6 Modeling for the Data Mart
16.6.1 The Basic Challenge
In organizing data in a data mart, the basic challenge is to present it in a form that can be understood by general business people. A typical operational database design is simply too complex to meet this requirement. Even our best efforts with views cannot always transform the data into something that makes immediate sense to nonspecialists. Further, the query tools themselves need to make some assumptions about how data is stored if they are going to be easy to implement and use, and if they are going to produce reports in predictable formats. Data mart users also need to be able to move from one mart to another without too much effort.
16.6.2 Multidimensional Databases, Stars and Snowflakes
Developers of data marts and vendors of data mart software have settled on a common response to the problem of providing a simple data structure: a star schema specifying a multidimensional database. Multidimensional databases can be built using conventional relational DBMSs or specialized multidimensional DBMSs optimized for such structures.
Figure 16.2 shows a star schema. The structure is very simple: a fact table surrounded by a number of dimension tables.
The format is not difficult to understand. The fact tables hold (typically) transaction data, either in its raw, atomic form or summarized. The dimensions effectively classify the data in the fact table into categories, and make it easy to formulate queries based on categories that aggregate data from the fact table: "What percentage of sales were in region 13?" or "What was the total value of sales in region 13 to customers in category B?"
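To show how directly such a question maps onto the star structure, here is a minimal Python sketch; the rows and code values are invented for illustration:

```python
# Illustrative star-schema query: total value of sales in region 13 to
# customers in category B. All rows and code values are invented.

customer_dim = {  # customer_id -> attributes
    "C1": {"customer_type_code": "B"},
    "C2": {"customer_type_code": "A"},
}
location_dim = {  # location_id -> attributes
    "L1": {"region_code": "13"},
    "L2": {"region_code": "07"},
}
sale_fact = [  # one row per sale
    {"customer_id": "C1", "location_id": "L1", "value": 100.0},
    {"customer_id": "C2", "location_id": "L1", "value": 40.0},
    {"customer_id": "C1", "location_id": "L2", "value": 75.0},
]

total = sum(
    row["value"]
    for row in sale_fact
    if location_dim[row["location_id"]]["region_code"] == "13"
    and customer_dim[row["customer_id"]]["customer_type_code"] == "B"
)
print(total)  # 100.0
```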
With our user hats on, this looks fine. Putting our data modeling hats on, we can see some major limitations, at least compared with the data structures for operational databases that we have been working with to date.
Before we start looking at these "limitations," it is interesting to observe that multidimensional DBMSs have been around long enough now that there are professional designers who have modeled only in that environment. They seem to accept the star schema structure as a "given" and do not think of it as a limiting environment to work in. It is worth taking a leaf from their book if you are a "conventional" modeler moving to data mart design. Remember that relational databases themselves are far from comprehensive in the structures that they support (many DBMSs do not directly support subtypes, for example), yet we manage to get the job done!
16.6.2.1 One Fact Table per Star
While there is usually no problem implementing multiple stars, each with its own fact table (within the same4 or separate data marts), we can have only one fact table in each star. Figure 16.3 illustrates the key problem that this causes.
It is likely that we will hold numeric data and want to formulate queries at both the loan and transaction level. Some of the options we might consider are the following:
1. Move the data in the Loan table into the Transaction table, which would then become the fact table. This would mean including all of the data about the relevant loan in each row of the Transaction table. If there is a lot of data for each loan, and many transactions per loan, the space requirement for the duplicated data could be unacceptable. Such denormalization would also have the effect of making it difficult
to hold loans that did not have any transactions against them. Our solution might require that we add "dummy" rows in the Transaction table, containing only loan data. Queries about loans and transactions would be more complicated than would be the case with a simple loan or transaction fact table.

Figure 16.2 A star schema: the fact table is Sale, carrying Accounting Month No, Product ID, Customer ID, Location ID, Quantity, and Value, surrounded by the dimension tables Period, Product, Customer, and Location.

4. Multiple stars in the same data mart can usually share dimension tables.
2. Nominate the Loan table as the fact table, and hold transaction information in a summarized form in the Loan table. This would mean holding totals rather than individual items. If the maximum number of transactions per loan was relatively small (perhaps more realistically, we might be dealing with the number of assets securing the loan), we could hold a repeating group of transaction data in the Loan table, as always with some loss of simplicity in query formulation.
3. Implement separate star schemas, one with Loan as a fact table and the other with Transaction as a fact table (sketched below). We would probably turn Loan into a dimension for the Transaction schema, and we might hold summarized transaction data in the Loan table.
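A minimal Python sketch of the third option (the table layouts are invented for illustration): two stars, with the Loan table doubling as a dimension of the Transaction star and carrying summarized transaction data.

```python
# Sketch of option 3: two stars sharing the Loan table. Layouts are invented.

loan_star_fact = [  # Loan as a fact table (one row per loan)
    {"loan_id": "LN1", "customer_id": "C1", "branch_id": "B1",
     "principal": 10000.0, "total_transaction_value": 600.0},  # summarized
]

transaction_star_fact = [  # Transaction as a fact table, with Loan as a dimension
    {"transaction_id": "T1", "loan_id": "LN1", "period": "2005-01", "value": 400.0},
    {"transaction_id": "T2", "loan_id": "LN1", "period": "2005-02", "value": 200.0},
]

loan_dimension = {row["loan_id"]: row for row in loan_star_fact}

# A transaction-level query can still reach loan attributes through the dimension:
for t in transaction_star_fact:
    loan = loan_dimension[t["loan_id"]]
    print(t["transaction_id"], t["value"], "against principal", loan["principal"])
```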
16.6.2.2 One Level of Dimension
A true star schema supports only one level of dimension. Some data marts do support multiple levels (usually simple hierarchies). These variants are generally known as snowflake schemas (Figure 16.4).
Figure 16.3 Which is the fact table: Loan or Transaction? (An E-R model in which Loan is related to Customer, Branch, Period, and Loan Type, and Transaction is related to Loan, Period, and Transaction Type.)
To compress what may be a multilevel hierarchy down to one level, we have to denormalize (specifically, from fully normalized back to first normal form). Figure 16.5 provides an example.
While we may not need to be concerned about update anomalies from denormalizing, we do need to recognize that space requirements can sometimes become surprisingly large if the tables near the top of the hierarchy contain a lot of data. We may need to be quite brutal in stripping these down to codes and (perhaps) names, so that they function only as categories. (In practice, space requirements of dimensions are seldom as much of a problem as those of fact tables.)

Another option is to summarize data from lower-level tables into higher-level tables, or completely ignore one or more levels in the hierarchy (Figure 16.6). This option will only be workable if the users are not interested in some of the (usually low-level) classifications.
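As a minimal Python sketch of the denormalization in Figure 16.5 (column names simplified for the example), the Customer, Region, and State tables are flattened into a single dimension row per customer:

```python
# Sketch of collapsing a Customer -> Region -> State hierarchy into one
# denormalized dimension table (column names simplified for the example).

states = {"S1": {"state_name": "Victoria"}}
regions = {"R1": {"region_name": "Melbourne Metro", "state_id": "S1"}}
customers = [
    {"customer_id": "C1", "customer_name": "Acme Pty Ltd", "region_id": "R1"},
]

denormalized_customer_dim = []
for c in customers:
    region = regions[c["region_id"]]
    state = states[region["state_id"]]
    denormalized_customer_dim.append({
        "customer_id": c["customer_id"],
        "customer_name": c["customer_name"],
        "region_name": region["region_name"],   # pulled down from Region
        "state_name": state["state_name"],      # pulled down from State
    })

print(denormalized_customer_dim)
```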
16.6.2.3 One-to-Many Relationships
The fact table in a star schema is in a many-to-one relationship with the dimensions. In the discussion above on collapsing hierarchies, we also assumed that there were no many-to-many relationships amongst the dimensions, in which case simple denormalization would not work.

What do we do if the real-world relationship is many-to-many, as in Figure 16.7? Here, we have a situation in which, most of the time, sales are made by only one salesperson, but, on occasion, more than one salesperson shares the sale.
One option is to ignore the less common case and tie the relationship only to the "most important" or "first" salesperson. Perhaps we can compensate to some degree by carrying the number of salespersons involved in the Sale table, and even by carrying (say) the percentage involvement of the key person. For some queries, this compromise may be quite acceptable, but it would be less than satisfactory if a key area of interest is sales involving multiple salespersons.

Figure 16.4 A snowflake schema: Sale is the fact table, with Product, Period, Customer, and Location dimensions, each linked to higher-level tables such as Product Type, Customer Type, Location Type, Region, and State.

Figure 16.5 Denormalizing to collapse a hierarchy of dimension tables: (a) normalized Customer, Region, and State tables; (b) a single denormalized Customer table carrying Region Name, State ID, and State Name.

Figure 16.6 (a) Ignoring a level in a hierarchy: Sale is classified directly by Customer Type rather than through Customer.
We could modify the Salesperson table to allow it to accommodate more than one salesperson, through use of a repeating group. It is an inelegant solution and breaks down once we want to include (as in the previous section) details from higher-level look-up tables. Which region's data do we include: that of the first, the second, or the third salesperson?

Another option is, in effect, to resolve the many-to-many relationship and treat the Sale-by-Salesperson table as the fact table (Figure 16.8). We will probably need to include the rest of the sale data in the table.
Figure 16.6 (b) Summarizing data from lower-level tables into higher-level tables: Product Variant detail (Standard Price, Total Sales Amount) is rolled up into Product (Average Price, Total Sales Amount), and Sale references Product directly.
Once again, we have a situation in which there is no single, mechanical solution. We need to talk to the users about how they want to "slice and dice" the data and work through with them the pros and cons of the different options.
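Here is a minimal Python sketch of the last option; the rows and the equal-split allocation of the sale value are invented for illustration, not a recommendation from the chapter:

```python
# Sketch only: resolving a many-to-many Sale/Salesperson relationship by
# holding the fact table at sale-by-salesperson grain. The equal split of
# the sale value is an illustrative assumption.

sales = [
    {"sale_id": "S1", "value": 300.0, "salesperson_ids": ["SP1"]},
    {"sale_id": "S2", "value": 200.0, "salesperson_ids": ["SP1", "SP2"]},  # shared
]

sale_by_salesperson_fact = []
for sale in sales:
    share = sale["value"] / len(sale["salesperson_ids"])
    for sp in sale["salesperson_ids"]:
        sale_by_salesperson_fact.append({
            "sale_id": sale["sale_id"],
            "salesperson_id": sp,
            "value": share,  # allocated portion of the sale
        })

# Sales credited to salesperson SP1:
print(sum(r["value"] for r in sale_by_salesperson_fact
          if r["salesperson_id"] == "SP1"))  # 400.0
```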
16.6.3 Modeling Time-Dependent Data
The basic issues related to the modeling of time, in particular the choice of "snapshots" or history, are covered in Chapter 15 and apply equally to data warehouses, data marts, and operational databases. This section covers a few key aspects of particular relevance to data mart design.
16.6.3.1 Time Dimension Tables
Most data marts include one or more dimension tables holding time periods to enable that dimension to be used in analysis (e.g., "What percentage of sales were made by salespeople in Region X in the last quarter?"). The key design decisions are the level of granularity (hours, days, months, years) and how to deal with overlapping time periods (financial years may overlap with calendar years, months may overlap with billing periods, and so on). The finer the granularity (i.e., the shorter the periods), the fewer problems we have with overlap and the more precise our queries can be.
Figure 16.7 Many-to-many relationship between dimension and fact tables: a Sale may be credited to more than one Salesperson.
However, query formulation may be more difficult or time-consuming in terms of specifying the particular periods to be covered.
Sometimes, we will need to specify a hierarchy of time periods (as a snowflake or collapsed into a single-level denormalized star). Alternatively, or in addition, we may specify multiple time dimension tables, possibly covering overlapping periods.
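A minimal Python sketch of building a day-grain time dimension with month, quarter, and year attributes (column names are illustrative); coarser grains are simply rollups of such a table:

```python
# Sketch of a day-grain time dimension; month, quarter, and year attributes
# let queries aggregate at coarser grains. Column names are illustrative.
from datetime import date, timedelta

def build_time_dimension(start, end):
    rows, day = [], start
    while day <= end:
        rows.append({
            "date": day.isoformat(),
            "month_no": day.month,
            "quarter_no": (day.month - 1) // 3 + 1,
            "year_no": day.year,
        })
        day += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2005, 3, 30), date(2005, 4, 2))
for row in dim:
    print(row)
```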
16.6.3.2 Slowly-Changing Dimensions
One of the key concerns of the data mart designer is how quickly the data in the dimension tables will change, and how quickly fact data may move from one dimension to another.
Figure 16.9 shows a simple example of the problem, in snowflake form for clarity. This might be part of a data mart to support analysis of customer purchasing patterns over a long period.
It should be clear that, if customers can change from one customer group to another over time and our mart only records the current group, we will not be able to ask questions such as, "What sort of vehicles did people buy while they were in group 'A'?" (We could ask, "What sort of vehicles did people currently in group 'A' buy over time?", but this may well be less useful.)
Figure 16.8 Treating the sale-by-salesperson table as the fact table.
In the operational database, such data will generally be supported by many-to-many relationships, as described in Chapter 15, and/or matching of timestamps and time periods. There are many ways of reworking the structure to fit the star schema requirement. For example:
1. Probably the neatest solution to the problem as described is to carry two foreign keys to Customer Group in the Purchase table (see the sketch following this list). One key points to the customer group to which the customer belonged at the time of the purchase; the other points to the customer group to which the customer currently belongs. In fact, the information supported by the latter foreign key may not be required by the users, in which case we can delete it, giving us a very simple solution.
Of course, setting up the mart in this form will require some translation of data held in more conventional structures in the operational databases and (probably) the data warehouse.
2. If the dimension changes sufficiently slowly in the time frames in which we are interested, then the amount of error or uncertainty that it causes may be acceptable. We may be able to influence the speed of change by deliberately selecting or creating dimensions (perhaps at the data warehouse stage) which change relatively slowly. For example, we may be able to classify customers into broad occupational groups ("professional," "manual worker," "technician") rather than more specific occupations, or even develop lifestyle profiles that have been found to be relatively stable over long periods.
(“profes-3 We can hold a history of (say) the last three values of Customer Groupinthe Customertable This approach will also give us some information onhow quickly the dimension changes
Figure 16.9 Slowly changing dimensions.

16.7 Summary

Logical data warehouse and data mart design are important subdisciplines of data modeling, with their own issues and techniques.
Data warehouse design is particularly influenced by its role as a staging point between operational databases and data marts. Existing data structures in operational databases or (possibly) existing data marts will limit the freedom of the designer, who will also need to support high volumes of data and load transactions. Within these constraints, data warehouse design has much in common with the design of operational databases.

The rules of data mart design are largely a result of the star schema structure (a limited subset of the full E-R structures used for operational database design) and lead to a number of design challenges, approaches, and patterns peculiar to data marts. The data mart designer also has to contend with the limitations of the data available from the warehouse.
Chapter 17
Enterprise Data Models and Data Management
“Always design a thing by considering it in its next larger context—a chair in a room,
a room in a house, a house in an environment, an environment in a city plan.”
– Eliel Saarinen
17.1 Introduction

So far, we have discussed data modeling in the context of database design;
we have assumed that our data models will ultimately be implemented more or less directly using some DBMS. Our interest has been in the data requirements of individual application systems.
However, data models can also play a role in data planning and management for an enterprise as a whole. An enterprise data model (sometimes called a corporate data model) is a model that covers the whole of, or a substantial part of, an organization. We can use such a model to:
■ Classify or index existing data
■ Provide a target for database and systems planners
■ Provide a context for specifying new databases
■ Support the evaluation and integration of application packages
■ Guide data modelers in the development or implementation of individual databases
■ Specify data formats and definitions to support the exchange of data between applications and with other organizations
■ Provide input to business planning
■ Specify an organization-wide database (in particular, a data warehouse)
These activities are part of the wider discipline of data management (the management of data as a shared enterprise resource), which warrants a book in itself.1 In this chapter, we look briefly at data management in general, then examine how development of an enterprise data model differs from development of a conventional project-level data model.

1. A useful starting point is Guidelines to Implementing Data Resource Management, 4th Edition, Data Management Association, 2002.

But first, a word of warning: far too many enterprise data models have ended up "on the shelf" after considerable expenditure on their development. The most common reason, in our experience, is a lack of a clear idea of how the model is to be used. It is vital that any enterprise data model be developed in the context of a data management or information systems strategy, within which its role is clearly understood, rather than as an end in itself.
17.2 Data Management

17.2.1 Problems of Data Mismanagement
The rationale for data management is that data is a valuable and expensive resource that therefore needs to be properly managed. Parallels are often drawn with physical assets, people, and money, all of which need to be managed explicitly if the enterprise is to derive the best value from them. As with the management of other assets, we can best understand the need for data management by looking at the results of not doing it.
Databases have traditionally been implemented on an application-by-application basis: one database per application system. Indeed, databases are often seen as being "owned" by their parent applications. The problem is that some data may be required by more than one application. For example, a bank may implement separate applications to handle personal loans and savings accounts, but both will need to hold data about customers. Without some form of planning and control, we will end up holding the same data in both databases. And here, the element of choice in data modeling works against us; we have no guarantee that the modelers working on different systems will have represented the common data in the same way, particularly if they are software package developers working for different vendors. Differences in data models can make data duplication difficult to identify, document, and control.

The effects of duplication and inconsistency across multiple systems are similar to those that arise from poor data modeling at the individual system level.

There are the costs of keeping multiple copies of data in step (and repercussions from data users, including customers, managers, and regulators, if we do not). Most of us have had the experience of notifying an organization of a change of address and later discovering that only some of their records have been updated.
Pulling data together to meet management information needs is far more difficult if definitions, coding, and formats vary. An airline wants to know the total cost of running each of its terminals, but the terminals are identified in different ways in different systems, sometimes only by a series of account numbers. An insurance company wants a breakdown of profitability by product, but different divisions have defined "product" in different ways. Problems of this kind constitute the major challenge in data warehouse development (Chapter 16).
Finally, poor overall data organization can make it difficult to use the data in new ways as business functions change in response to market and regulatory pressures and internal initiatives. Often, it seems easier to implement yet another single-purpose database than to attempt to use inconsistent existing databases. A lack of central documentation also makes reuse of data difficult; we may not even know that the data we require is held in an existing database. The net result, of course, is still more databases, and an exacerbation of the basic problem. Alternatively, we may decide that the new initiative is "too hard" or economically untenable.

We have seen banks with fifty or more "Branch" files, retailers with more than thirty "Stock Item" files, and organizations that are supposedly customer-focused with dozens of "Customer" files. Often, just determining the scope of the problem has been a major exercise. Not surprisingly, it is the data that is most central to an organization (and, therefore, used by the greatest number of applications) that is most frequently mismanaged.
17.2.2 Managing Data as a Shared Resource
Data management aims to address these issues by taking an organization-wide view of data. Instead of regarding databases as the sole property of their parent applications, we treat them as a shared resource. This may entail documenting existing databases; encouraging development of new, sharable databases in critical areas; building interfaces to keep data in step; establishing standards for data representation; and setting an overall target for data organization. The task of data management may be assigned to a dedicated data management (or "data administration" or "information architecture") team, or be included in the responsibilities of a broader "architectures" group.
17.2.3 The Evolution of Data Management
The history of data management as a distinct organizational function dates from the early 1970s. In an influential paper, Nolan2 identified "Data Administration" as one of the stages through which organizations pass in their use of computing (the last being "Maturity"). Many medium and large organizations established data management groups, and data management began to emerge as a discipline in its own right.3

2. Nolan, "Managing the Crisis in Data Processing," Harvard Business Review, 5(2), March-April 1979.
In the early days of data management, some organizations pursued what seemed to be the ideal solution: development of a single shared database, or an integrated set of "subject databases" covering all of the enterprise's data requirements. Even in the days when there were far fewer information systems to deal with, the task proved overwhelmingly difficult and expensive, and there were few successes. Today, most organizations have a substantial base of "legacy" systems and cannot realistically contemplate replacing them all with new applications built around a common set of data structures.
Recognizing that they could not expect to design and build the enterprise's data structures themselves, data managers began to see themselves as akin to town planners (though the term "architect" has continued to be more widely used, unfortunately, in our view, as the analogy is misleading). Their role was to define a long-term target (town plan) and to ensure that individual projects contributed to the realization of that goal.
In practice, this meant requiring developers to observe common data standards and definitions (typically specified by an enterprise-wide data model), to reuse existing data where practicable, and to contribute to a common set of data documentation. Like town planners, data managers encountered considerable resistance along the way, as builders asserted their preference for operating without outside interference and appealed to higher authorities for special dispensation for their projects.

This approach, too, has not enjoyed a strong record of success, though many organizations have persisted with it. A number of factors have worked against it, in particular the widespread use of packaged software in preference to in-house development, and greater pressure to deliver results in the short-to-medium term.
In response to such challenges, some data managers have chosen to take a more proactive and focused role, initiating projects to improve data management in specific areas, rather than attempting to solve all of an organization's data management problems. For example, they might address a particularly costly data quality problem, or establish data standards in an area in which data matching is causing serious difficulties. Customer Relationship Management (CRM) initiatives fall into this category, though in many cases they have been initiated and managed outside the data management function.

3. The International Data Managers Association (DAMA) at www.dama.org is a worldwide body that supports data management professionals.
More recently we have seen a widespread change in philosophy. Rather than seek to consolidate individual databases, organizations are looking to keep data in step through messages passed amongst applications. In effect, there is a recognition that applications (and their associated databases) will be purchased or developed one at a time, with relatively little opportunity for direct data sharing. The proposed solution is to accept the duplication of data, which inevitably results, but to put in place mechanisms to ensure that when data is updated in one place, messages (typically in XML format) are dispatched to update copies of the data held by other applications.

For some data managers, this approach amounts to a rejection of the data management philosophy. For others, it is just another mechanism for achieving similar ends. What is clear is that while the technology and architecture may have changed, the basic issues of understanding data meaning and formats within and across applications remain. To some extent at least, the problem of data specification moves from the databases to the message formats.
An enterprise data model has been central to all of the traditional approaches to data management, and, insofar as the newer approaches also require enterprise-wide data definitions, is likely to continue to remain so. In the following sections, we examine the most important roles that an enterprise data model can play.
17.3 Classification of Existing Data

Most organizations have a substantial investment in existing databases and files. Often, the documentation of these is of variable quality and held locally with the parent applications.
The lack of a central, properly-indexed register of data is one of the greatest impediments to data management. If we do not know what data we have (and where it is), how can we hope to identify opportunities for its reuse or put in place mechanisms to keep the various copies in step? The problem is particularly apparent to builders of data warehouses (Chapter 16) and reporting and analysis applications, which need to draw data from existing operational files and databases. Just finding the required data is often a major challenge. Correctly interpreting it in the absence of adequate documentation can prove an even greater one, and serious business mistakes have been made as a result of incorrect assumptions.
Commercial data dictionaries and "repositories" have been around for many years to hold the necessary metadata (data about data). Some organizations have built their own with mixed success. But data inventories are of limited value without an index of some kind; we need to be able to ask, "What