Data Modeling Essentials (2005), part 10




It is beyond the scope of this chapter to contribute to the ongoing debate about the relative advantages of these and other data warehouse architectures. (Some suitable references are listed in Further Reading.) Unless otherwise noted, our discussion in this chapter assumes the simple architecture of Figure 16.1, but you should have little trouble adapting the principles to alternative structures.

Data warehouses are now widely used and generally need to be developed in-house, primarily because the mix of source systems (and associated operational databases) varies so much from organization to organization. Reporting requirements, of course, may also vary. This is good news for data modelers because data warehouses and data marts are databases, which, of course, must be specified by data models. There may also be some reverse engineering and general data management work to be done in order to understand the organization and meaning of the data in the source systems (as discussed in Chapter 17).

Figure 16.1 Typical data warehouse and data mart architecture.
(Several source data stores and an external data source feed the data warehouse through load programs; further load programs feed the data marts from the warehouse; query tools run against the data marts and the data warehouse.)

Data modeling for data warehouses and marts, however, presents a range of new challenges and has been the subject of much debate among data modelers and database designers. An early quote indicates how the battle lines were drawn:

"Forget everything you know about entity relationship data modeling. Using that model with a real-world decision support system almost guarantees failure."1

On the other side of the debate were those who argued that "a database is a database" and nothing needed to change.

Briefly, there are two reasons why data modeling for warehouses and marts is different. First, the requirements that data warehouses and marts need to satisfy are different (or at least differ in relative importance) from those for operational databases. Second, the platforms on which they are implemented may not be relational; in particular, data marts are frequently implemented on specialized multidimensional DBMSs.

Many of the principles and techniques of data modeling for operational databases are adaptable to the data warehouse environment but cannot be carried across uncritically. And there are new techniques and patterns.

We first look at how the requirements for data marts and data warehouses differ from those for operational databases. We then reexamine the rules of data modeling and find that, although the basic objectives (expressed as evaluation criteria/quality measures) remain the same, their relative importance changes. As a result, we need to modify some of the rules and add some general guidelines for data warehouse and data mart modeling. Finally, we look specifically at the issues of organizing data marts.

1 Kimball, R., and Strehlo, K., "Why Decision Support Fails and How to Fix It," Datamation (June 1, 1994).

16.2 Characteristics of Data Warehouses and Data Marts

The literature on data warehouses identifies a number of characteristics that differentiate warehouses and marts from conventional operational databases. Virtually all of these have some impact on data modeling.

16.2.1 Data Integration: Working with Existing Databases

A data warehouse is not simply a collection of copies of records from source systems. It is a database that "makes sense" in its own right. We would expect to specify one Product table even if the warehouse drew on data from many overlapping Product tables or files with inconsistent definitions and coding schemes. The data modeler can do little about these historical design decisions but needs to define target tables into which all of the old data will fit, after some translation and/or reformatting. These tables will in turn need to be further combined, reformatted, and summarized as required to serve the data marts, which may also have been developed prior to the warehouse. (Many organizations originally developed individual data marts, fed directly from source systems and often called "data warehouses," until the proliferation of ETL programs forced the development of an intermediate warehouse.) Working within such constraints adds an extra challenge to the data modeling task and means that we will often end up with less than ideal structures.

16.2.2 Loads Rather Than Updates

Data marts are intended to support queries and are typically updated through periodic batch loading of data from the warehouse or directly from operational databases. Similarly, the data warehouse is likely to be loaded from the operational databases through batch programs, which are not expected to run concurrently with other access. This strategy may be adopted not only to improve efficiency and manage contention for data resources, but also to ensure that the data warehouse and data marts are not "moving targets" for queries, which generally need to produce consistent results.


Recall our discussion of normalization. One of the strongest reasons for normalizing beyond first normal form was to prevent "update anomalies," where one occurrence of an item is updated but others are left unchanged. In the data warehouse environment, we can achieve that sort of consistency in a different way, through careful design of the load programs, knowing that no other update transactions will run against the database.

Of course, there is no point in abandoning or compromising normalization just because we can tackle the problem in another (less elegant) way. There needs to be some payoff, and this may come through improved performance or simplified queries. And if we chose to "trickle feed" the warehouse using conventional transactions, update anomalies could become an issue again.
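The "careful design of the load programs" described above can be sketched in a few lines. This is a minimal illustration only, using SQLite with invented table and column names: each batch run rebuilds the denormalized table in full from the normalized source, so the redundant copies it carries cannot drift apart between loads.

```python
import sqlite3

# Normalized source tables plus a denormalized target table
# (all names and data here are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, value REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO sale VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 400.0);
    CREATE TABLE mart_sale (sale_id INTEGER, customer_name TEXT, value REAL);
""")

def load_mart(conn):
    """One batch load: empty and repopulate the denormalized table."""
    conn.execute("DELETE FROM mart_sale")
    conn.execute("""
        INSERT INTO mart_sale (sale_id, customer_name, value)
        SELECT s.sale_id, c.customer_name, s.value
        FROM sale s JOIN customer c ON c.customer_id = s.customer_id
    """)
    conn.commit()

load_mart(conn)
rows = conn.execute(
    "SELECT customer_name, value FROM mart_sale ORDER BY sale_id").fetchall()
print(rows)  # every row carries a consistent copy of customer_name
```

Because no other transaction updates mart_sale between loads, the redundant customer_name column can never exhibit the update anomaly that normalization would otherwise guard against.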

16.2.3 Less Predictable Database “Hits”

In designing an operational database, we usually have a good idea of the type and volumes of transactions that will run against it. We can optimize the database design to process those transactions simply and efficiently, sometimes at the expense of support for lower-volume or unpredicted transactions. Queries against a data mart are less predictable, and, indeed, the ability to support ad hoc queries is one of the major selling points of data marts. A design decision (such as use of a repeating group, as described in Chapter 2) that favors one type of query at the expense of others will need to be very carefully thought through.

16.2.4 Complex Queries, Simple Interface

One of the challenges of designing data marts and associated query tools is the need to support complex queries and analyses in a relatively simple way. It is not usually reasonable to expect users of the facility to navigate complex data structures in the manner of experienced programmers, yet typical queries against a fully normalized database may require data from a large number of tables. (We say "not usually reasonable" because some users of data marts, such as specialist operational managers, researchers, and data miners, may be willing and able to learn to navigate sophisticated structures if the payoff is sufficient.)

Perhaps the central challenge for the data mart modeler comes from the approach that tool vendors have settled on to address the problem. Data mart query tools are generally intended for use with a multidimensional database based on a central "fact" table and associated look-up tables called dimension tables or just dimensions. (Figure 16.2 in Section 16.6.2 shows an example.) The data modeler is required to fit the data into this […]tive discussed in Chapter 1. From a user perspective, the solution is elegant, in that it is easy to understand and use and is consistent from one mart to the next. From the data modeler's perspective, some very inelegant decisions may need to be taken to meet the constraint.

16.2.5 History

The holding of historical information is one of the most important characteristics of a data warehouse. Managers are frequently interested in trends, whereas operational users of data may only require the current position. Such information may be built up in the data warehouse over a period of time and retained long after it is no longer required in the source systems. The challenge of modeling time-dependent data may be greater for the data warehouse designer than for the operational database designer.

16.2.6 Summarization

The data warehouse seldom contains complete copies of all data held (currently or historically) in operational databases. Some is excluded, and some may be held only in summary form. Whenever we summarize, we lose information, and the data modeler needs to be fully aware of the impact of summarization on all potential users.

16.3 Quality Criteria for Warehouse and Mart Models

It is interesting to take another look at the evaluation or quality criteria for data models that we identified in Chapter 1, but this time in the context of the special requirements of data warehouses and marts. All remain relevant, but their relative importance changes. Thus, our trade-offs are likely to be different.

16.3.1 Completeness

In designing a data warehouse, we are limited by the data available in the operational databases or from external sources. We have to ask not only, "What do we want?" but also, "What do we have?" and, "What can we get?" Practically, this means acquainting ourselves with the source system data, either at the outset or as we proceed. For example:

User: "I want to know what percentage of customers spend more than a specified amount on CDs when they shop here."

Modeler: "We only record sales, not customers, so what we can tell you is what percentage of sales exceed a certain value."

User: "Same thing, isn't it?"

Modeler: "Not really. What if the customer buys a few CDs in the classical section, then stops by the rock section and buys some more?"

User: "That'd actually be interesting to know. Can you tell us how often that happens? And what about if they see another CD as they're walking out and come back and buy it? They see the display by the door…"

Modeler: "We can get information on that for those customers who use their store discount card, because we can identify them…"

The users of data warehouses, interested in aggregated information, may not make the same demands for absolute accuracy as the users of an operational system. Accordingly, it may be possible to compromise completeness to achieve simplicity (as discussed below in Section 16.3.3). Of course, this needs to be verified at the outset. There are examples of warehouses that have lost credibility because the outputs did not balance to the last cent. What we cannot afford to compromise is good documentation, which should provide the user with information on the currency, completeness, and quality of the data, as well as the basic definitions.

Finally, we may lose data by summarizing it to save space and processing. The summarization may take place either when data is loaded from operational databases to the warehouse (a key design decision) or when it is loaded from the warehouse to the marts (a decision more easily reversed).

16.3.2 Nonredundancy

We can be a great deal less concerned about redundancy in data warehouses and data marts than we would be with operational databases. As discussed earlier, since data is loaded through special ETL programs or utilities, and not updated in the usual sense, we do not face the same risk that fields may be updated inconsistently. Redundancy does, of course, still cost us in storage space, and data warehouses can be very large indeed.

Particularly in data marts, denormalization is regularly practiced to simplify structures, and we may also carry derived data, such as commonly used totals.

16.3.3 Enforcement of Business Rules

We tend not to think of a data warehouse or mart as enforcing business rules in the usual sense, because of the absence of traditional update transactions. Nevertheless, the data structures will determine what sort of data can be loaded, and if the data warehouse or mart implements a rule that is not supported by a source system, we will have a challenge to address! Sometimes, the need to simplify data leads us to (for example) implement a one-to-many relationship even though a few real-world cases are many-to-many. Perhaps an insurance policy can occasionally be sold by more than one salesperson, but we decide to build our data mart around a Policy table with a Salesperson dimension. We have specified a tighter rule, and we are going to end up trading some "completeness" for the gain in simplicity.

16.3.4 Data Reusability

Reusability, in the sense of reusing data captured for operational purposes to support management queries, is the raison d'être of most data warehouses and marts. More so than in operational databases, we have to expect the unexpected as far as queries are concerned. Data marts may be constructed to support a particular set of queries (we can build another mart if necessary to support a new requirement), but the data warehouse itself needs to be able to feed virtually any conceivable mart that uses the data that it holds. Here is an argument in favor of full normalization in the data warehouse, and against any measures that irrecoverably lose data, such as summarization with removal of the source data.

16.3.5 Stability and Flexibility

One of the challenges of data warehouse design is to accommodate changes in the source data. These may reflect real changes in the business or simply changes (including complete replacement) to the operational databases.

Much of the value of a data warehouse may come from the build-up of historical data over a long period. We need to build structures that not only accommodate the new data, but also allow us to retain the old.

It is a maxim of data warehouse designers that "data warehouse design is never finished." If users gain value from the initial implementation, it is almost inevitable that they will require that the warehouse and marts be extended, often very substantially. Many a warehouse project has delivered a warehouse that cannot be easily extended, requiring new warehouses to be constructed as the requirements grow. The picture in Figure 16.1 becomes much less elegant when we add multiple warehouses in the middle, possibly sharing common source databases and target data marts.

16.3.6 Simplicity and Elegance

As discussed earlier, data marts often need to be restricted to simple structures that suit a range of query tools and are relatively easy for end-users to understand.

16.3.7 Communication Effectiveness

It is challenging enough to communicate "difficult" data structures to professional programmers, let alone end-users, who may have only an occasional need to use the data marts. Data marts that use highly generalized structures and unfamiliar terminology, or that are based on a sophisticated original view of the business, are going to cause problems.

16.3.8 Performance

The data warehouse needs to be able to accept the uploading of large volumes of data, usually within a limited "batch window" when operational databases are not required for real-time processing. It also needs to support reasonably rapid extraction of data for the data marts. Data loading may use purpose-designed ETL utilities, which will dictate how data should be organized to achieve best performance.

16.4 The Basic Design Principle

The architecture shown in Figure 16.1 has evolved from earlier approaches in which the data warehouse and data marts were combined into a single database. […] or clearinghouse between different representations of the data, while the data marts are designed to present simpler views to the end-users.

The basic rule for the data modeler is to respect this separation.

Accordingly, we design the data warehouse much as we would an operational database, but with a recognition that the relative importance of the various design objectives/quality criteria (as reviewed in the previous section) may be different. So, for example, we may be more prepared to accept a denormalized structure, or some data redundancy, provided, of course, there is a corresponding payoff. Flexibility is paramount. We can expect to have to accommodate growth in scope, new and changed operational databases, and new data marts.

Data marts are a different matter. Here we need to fit data into a quite restrictive structure, and the modeling challenge is to achieve this without losing the ability to support a reasonably wide range of queries. We will usually end up making some serious compromises, which may be acceptable for the data mart but would not be so for an operational database or data warehouse.

Many successful data warehouses have been designed by data modelers who tackled the modeling assignment as if they were designing an operational database. We have even seen examples of data warehouses that had to be completely redesigned according to this traditional approach after ill-advised attempts to apply modeling approaches borrowed from the data mart theory. Conversely, there is a strong school of thought that argues that the data warehouse model can usefully anticipate some common data manipulation and summarization.

Both arguments have merit, and the path you take should be guided by the business and technical requirements in each case. That is why we devoted so much space at the beginning of this chapter to differences and goals; it is a proper appreciation of these, rather than the brute application of some special technique, that leads to good warehouse design.

We can, however, identify a few general techniques that are specific to data warehouse design.

16.5 Modeling for the Data Warehouse

16.5.1 An Initial Model

Data warehouse designers usually find it useful to start with an E-R model of the total business or, at least, of the part of the business that the data warehouse may ultimately cover. The starting point may be an existing enterprise data model (see Chapter 17) or a generalization of the data structures in the most important source databases. If an enterprise data model is used, the data modeler will need to check that it aligns reasonably closely with existing structures rather than representing a radical "future vision." Data warehouse designers are not granted the latitude of data modelers starting with a blank slate!

16.5.2 Understanding Existing Data

In theory, we could construct a data warehouse without ever talking to the business users, simply by consolidating data from the operational databases. Such a warehouse would (again in theory) allow any query possible within the limitations of the source data.

In practice, we need user input to help select what data will be relevant to the data mart users (the extreme alternative would be to load every data item from every source system), to contribute to the inevitable decisions on compromises, and, of course, to "buy in" and support the project.

Nevertheless, a good part of data warehouse design involves gaining an understanding of data from the source systems and defining structures to hold and consolidate it. Usually the most effective approach is to use the initial model as a starting point and to map the existing structures against it. Initially, we do this at an entity level, but as modeling proceeds in collaboration with the users, we add attributes and possibly subtypes.

[…] to what is likely to be possible and what alternatives may be available.

16.5.4 Determining Sources and Dealing with Differences

One of the great challenges of data warehouse design is in making the most of source data in legacy systems. If we are lucky, some of the source data […] overloaded attributes (see Section 5.3), poor documentation of definitions and coding schemes, and (almost certainly) inconsistency across databases. Our choice of source for a data item, and hence its definition in the data warehouse, will depend on a number of factors:

1. The objective of minimizing the number of source systems feeding the data warehouse, in the interests of simplicity; reduced need for data integration; and reduced development, maintenance, and running costs.

2. The "quality" of the data item, a complex issue involving primarily the accuracy of the item instances (i.e., whether they accurately reflect the real world), but also timeliness (when were they last updated?) and compatibility with other items (update cycles again). Timing differences can be a major headache. The update cycles of data vary in many organizations from real-time to annually. Because of this, the "same" data item may hold different values in different source databases.

3. Whether multiple sources can be reconciled to produce a better overall quality. We may even choose to hold two or more versions of the "same" attribute in the warehouse, to enable a choice of the most appropriate version as required.

4. The compatibility of the coding scheme with other data. Incompatible coding schemes and data formats are relatively straightforward to handle, as long as the mapping between them is simple. If the underlying definitions are different, it may be impossible to translate to a common scheme without losing too much meaning. It is easy to translate country codes as long as you can agree what a country is! One police force recognizes three eye colors, another four.2

5. Whether overloaded attributes can be or need to be unpacked. For example, one database may hold name and address as a single field,3 while another may break each down into smaller fields: initial, family name, street number, and so on. Programmers often take serious liberties with data definitions, and many a field has been redefined well beyond its original intent. Usually, the job of unpacking it into primitive attributes is reasonably straightforward once the rules are identified.
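Once the rules are identified, the unpacking itself is indeed usually routine. The sketch below assumes a purely hypothetical legacy format (family name and initial, then a comma, then street number and street) and is only an illustration of the kind of routine involved; a real field would need its actual rules discovered first.

```python
# Unpack a hypothetical overloaded "name and address" field into
# primitive attributes. The format assumed here is invented:
# "FAMILYNAME INITIAL, STREETNUMBER STREET...".
def unpack_name_address(field: str) -> dict:
    name_part, address_part = field.split(",", 1)
    name_tokens = name_part.split()
    family_name = name_tokens[0]
    initial = name_tokens[1] if len(name_tokens) > 1 else ""
    addr_tokens = address_part.split()
    street_number = addr_tokens[0]
    street = " ".join(addr_tokens[1:])
    return {
        "family_name": family_name,
        "initial": initial,
        "street_number": street_number,
        "street": street,
    }

print(unpack_name_address("SMITH J, 42 High Street"))
```

A load program would apply such a routine to every record, diverting rows that fail to parse to an exception file for manual inspection.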

In doing the above, the data warehouse designer may need to perform work that is, more properly, the responsibility of a data management or data administration team. Indeed, the problems of building data warehouses in the absence of good data management groundwork have often led to such teams being established or revived.

2 For a fascinating discussion of how different societies classify colors, and a detailed example of the challenges that we face in coming up with classification schemes acceptable to all, see Chapter 2 of Language Universals and Linguistic Typology by Bernard Comrie, Blackwell, Oxford, 1981, ISBN 0-631-12971-5.

3 We use the general term "field" here rather than "column" as many legacy databases are not relational.

16.5.5 Shaping Data for Data Marts

How much should the data warehouse design anticipate the way that data will be held in the data marts? On the one hand, the data warehouse should be as flexible as possible, which means not organizing data in a way that will favor one user over another. Remember that the data warehouse may be required not only to feed data marts, but may also be the common source of data for other analysis and decision support systems. And some data marts offer broader options for organizing data.

On the other hand, if we can be reasonably sure that all users of the data will first perform some common transformations such as summarization or denormalization, there is an argument for doing them once, as data is loaded into the warehouse, rather than each time it is extracted. And denormalized data can usually be renormalized without too much trouble. (Summarization is a different matter: base data cannot be recovered from summarized data.) The data warehouse can act as a stepping-stone to greater levels of denormalization and summarization in the marts. When data volumes are very high, there is frequently a compelling argument for summarization to save space and processing.

Another advantage of shaping data at the warehouse stage is that it promotes a level of commonality across data marts. For example, a phone company might decide not to hold details of all telephone calls but rather only those occurring during a set of representative periods each week. If the decision was made at the warehouse stage, we could decide once and for all what the most appropriate periods were. All marts would then work with the same sampling periods, and results from different marts could be more readily compared.

Sometimes the choice of approach will be straightforward. In particular, if the data marts are implemented as views of the warehouse, we will need to implement structures that can be directly translated into the required shape for the marts.

The next section discusses data mart structures, and these can, with appropriate discretion, be incorporated into the data warehouse design. Where you are in doubt, however, our advice is to lean toward designing the data warehouse for flexibility, independent of the data marts. One of the great lessons of data modeling is that new and unexpected uses will be found for data once it is available, and this is particularly true in the context of data warehouses. Maximum flexibility and minimum anticipation are good starting points!


16.6 Modeling for the Data Mart

16.6.1 The Basic Challenge

In organizing data in a data mart, the basic challenge is to present it in a form that can be understood by general business people. A typical operational database design is simply too complex to meet this requirement. Even our best efforts with views cannot always transform the data into something that makes immediate sense to nonspecialists. Further, the query tools themselves need to make some assumptions about how data is stored if they are going to be easy to implement and use, and if they are going to produce reports in predictable formats. Data mart users also need to be able to move from one mart to another without too much effort.

16.6.2 Multidimensional Databases, Stars and Snowflakes

Developers of data marts and vendors of data mart software have settled on a common response to the problem of providing a simple data structure: a star schema specifying a multidimensional database. Multidimensional databases can be built using conventional relational DBMSs or specialized multidimensional DBMSs optimized for such structures.

Figure 16.2 shows a star schema. The structure is very simple: a fact table surrounded by a number of dimension tables.

The format is not difficult to understand. The fact table holds (typically) transaction data, either in its raw, atomic form or summarized. The dimensions effectively classify the data in the fact table into categories, and make it easy to formulate queries based on categories that aggregate data from the fact table: "What percentage of sales were in region 13?" or "What was the total value of sales in region 13 to customers in category B?"

With our user hats on, this looks fine. Putting our data modeling hats on, we can see some major limitations, at least compared with the data structures for operational databases that we have been working with to date.
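To illustrate why the structure keeps queries short, here is a runnable sketch of a much-reduced version of the Figure 16.2 schema (SQLite, with invented data and simplified columns): the second of the example queries above becomes a straightforward join from the fact table to two dimensions.

```python
import sqlite3

# A cut-down star schema in the spirit of Figure 16.2: Sale is the
# fact table; Location and Customer are dimensions. Data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (location_id INTEGER PRIMARY KEY, region_code INTEGER);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_type_code TEXT);
    CREATE TABLE sale (location_id INTEGER, customer_id INTEGER, value REAL);
    INSERT INTO location VALUES (1, 13), (2, 7);
    INSERT INTO customer VALUES (100, 'B'), (101, 'A');
    INSERT INTO sale VALUES (1, 100, 500.0), (1, 101, 300.0), (2, 100, 200.0);
""")

# "What was the total value of sales in region 13 to customers in category B?"
total = conn.execute("""
    SELECT SUM(s.value)
    FROM sale s
    JOIN location l ON l.location_id = s.location_id
    JOIN customer c ON c.customer_id = s.customer_id
    WHERE l.region_code = 13 AND c.customer_type_code = 'B'
""").fetchone()[0]
print(total)  # 500.0
```

Every query against the mart follows the same pattern: join the fact table to whichever dimensions supply the categories, then aggregate.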

Before we start looking at these "limitations," it is interesting to observe that multidimensional DBMSs have been around long enough now that there are professional designers who have modeled only in that environment. They seem to accept the star schema structure as a "given" and do not think of it as a limiting environment to work in. It is worth taking a leaf from their book if you are a "conventional" modeler moving to data mart design. Remember that relational databases themselves are far from comprehensive in the structures that they support (many DBMSs do not directly support subtypes, for example), yet we manage to get the job done!


16.6.2.1 One Fact Table per Star

While there is usually no problem implementing multiple stars, each with its own fact table (within the same4 or separate data marts), we can have only one fact table in each star. Figure 16.3 illustrates the key problem that this causes.

It is likely that we will hold numeric data and want to formulate queries at both the loan and transaction level. Some of the options we might consider are the following:

1. Move the data in the Loan table into the Transaction table, which would then become the fact table. This would mean including all of the data about the relevant loan in each row of the Transaction table. If there is a lot of data for each loan, and many transactions per loan, the space requirement for the duplicated data could be unacceptable. Such denormalization would also have the effect of making it difficult to hold loans that did not have any transactions against them. Our solution might require that we add "dummy" rows in the Transaction table, containing only loan data. Queries about loans and transactions would be more complicated than would be the case with a simple loan or transaction fact table.

Figure 16.2 A star schema: the fact table is Sale.
Sale (fact): Accounting Month No*, Product ID*, Customer ID*, Location ID*, Quantity, Value. Dimensions: Period (Accounting Month No, Quarter No, Year No); Product (Product ID, Product Type Code, Product Name); Location (Location ID, Location Type Code, Region Code, State Code, Location Name); Customer (Customer ID, Customer Type Code, Region Code, State Code, Customer Name).

4 Multiple stars in the same data mart can usually share dimension tables.

2. Nominate the Loan table as the fact table, and hold transaction information in a summarized form in the Loan table. This would mean holding totals rather than individual items. If the maximum number of transactions per loan was relatively small (perhaps more realistically, we might be dealing with the number of assets securing the loan), we could hold a repeating group of transaction data in the Loan table, as always with some loss of simplicity in query formulation.

3. Implement separate star schemas, one with Loan as a fact table and the other with Transaction as a fact table. We would probably turn Loan into a dimension for the Transaction schema, and we might hold summarized transaction data in the Loan table.

16.6.2.2 One Level of Dimension

A true star schema supports only one level of dimension. Some data marts do support multiple levels (usually simple hierarchies). These variants are generally known as snowflake schemas (Figure 16.4).

Figure 16.3 Which is the fact table: Loan or Transaction?
(An E-R diagram: Branch issues Loan; Customer owns Loan; Loan is issued in a Period; Loan Type classifies Loan; Transaction is against Loan; Transaction takes place in a Period; Transaction Type classifies Transaction.)


To compress what may be a multilevel hierarchy down to one level, we have to denormalize (specifically, from fully normalized back to first normal form). Figure 16.5 provides an example.

While we may not need to be concerned about update anomalies from denormalizing, we do need to recognize that space requirements can sometimes become surprisingly large if the tables near the top of the hierarchy contain a lot of data. We may need to be quite brutal in stripping these down to codes and (perhaps) names, so that they function only as categories. (In practice, space requirements of dimensions are seldom as much of a problem as those of fact tables.)

Another option is to summarize data from lower-level tables into higher-level tables, or completely ignore one or more levels in the hierarchy (Figure 16.6). This option will only be workable if the users are not interested in some of the (usually low-level) classifications.
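The collapse shown in Figure 16.5 can be sketched as a single join over the normalized tables. This is an illustration only (SQLite, with invented data): the Customer, Region, and State levels are flattened into the first-normal-form Customer dimension of Figure 16.5(b).

```python
import sqlite3

# The normalized hierarchy of Figure 16.5(a): Customer -> Region -> State.
# Data values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (state_id INTEGER PRIMARY KEY, state_name TEXT);
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, state_id INTEGER,
                         region_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region_id INTEGER,
                           customer_name TEXT);
    INSERT INTO state VALUES (1, 'Victoria');
    INSERT INTO region VALUES (10, 1, 'Melbourne Metro');
    INSERT INTO customer VALUES (100, 10, 'Acme');
""")

# Collapse the hierarchy into one denormalized dimension table,
# as in Figure 16.5(b).
conn.execute("""
    CREATE TABLE customer_dimension AS
    SELECT c.customer_id, c.customer_name,
           r.region_id, r.region_name,
           s.state_id, s.state_name
    FROM customer c
    JOIN region r ON r.region_id = c.region_id
    JOIN state s ON s.state_id = r.state_id
""")
row = conn.execute("SELECT * FROM customer_dimension").fetchone()
print(row)  # (100, 'Acme', 10, 'Melbourne Metro', 1, 'Victoria')
```

Note that the join repeats the region and state values on every customer row, which is exactly where the space cost discussed above arises when the upper tables are wide.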

16.6.2.3 One-to-Many Relationships

The fact table in a star schema is in a many-to-one relationship with the dimensions. In the discussion above on collapsing hierarchies, we also assumed that there were no many-to-many relationships amongst the dimensions; had there been, simple denormalization would not have worked.

What do we do if the real-world relationship is many-to-many, as in Figure 16.7? Here, we have a situation in which, most of the time, sales are made by only one salesperson but, on occasion, more than one salesperson shares the sale.

One option is to ignore the less common case and tie the relationship only to the "most important" or "first" salesperson. Perhaps we can

[Figure 16.4 A snowflake schema: Sale is the fact table. Dimension hierarchies include Product (Product ID, Product Type ID, Product Name) classified by Product Type; Customer (Customer ID, Customer Type ID, Region ID, Customer Name) classified by Customer Type; Location (Location ID, Location Type ID, Region ID, Location Name) classified by Location Type; Region (Region ID, State ID, Region Name) within State (State ID, State Name); and Period (Accounting Month No, Quarter No). The Sale fact table carries Accounting Month No, Product ID, Customer ID, Location ID, Quantity, and Value.]


[Figure 16.5 Denormalizing to collapse a hierarchy of dimension tables. (a) Normalized: Customer (Customer ID, Customer Type ID, Region ID, Customer Name), Region (Region ID, State ID, Region Name), State (State ID, State Name). (b) Denormalized: Customer (Customer ID, Customer Type ID, Region ID, Customer Name, Region Name, State ID, State Name).]

[Figure 16.6 (a) Ignoring a level in the dimension hierarchy: Sale related directly to Customer Type, rather than via Customer.]


compensate to some degree by carrying the number of salespersons involved in the Sale table, and even by carrying (say) the percentage involvement of the key person. For some queries, this compromise may be quite acceptable, but it would be less than satisfactory if a key area of interest is sales involving multiple salespersons.

We could modify the Sale table to allow it to accommodate more than one salesperson, through use of a repeating group. It is an inelegant solution and breaks down once we want to include (as in the previous section) details from higher-level lookup tables. Which region's data do we include: that of the first, the second, or the third salesperson?

Another option is, in effect, to resolve the many-to-many relationship and treat the Sale-by-Salesperson table as the fact table (Figure 16.8). We will probably need to include the rest of the sale data in the table.

[Figure 16.6 (b) Summarizing data from lower-level tables into higher-level tables. Before: Product (Product Code, Product Description), Product Variant (Product Code, Product Variant Code, Standard Price, Total Sales Amount), Sale (Sale ID, Product Code, Product Variant Code, Value, ...). After: Product (Product Code, Product Description, Average Price, Total Sales Amount), Sale (Sale ID, Product Code, Product Variant Code, Value, ...).]


Once again, we have a situation in which there is no single, mechanical solution. We need to talk to the users about how they want to "slice and dice" the data and work through with them the pros and cons of the different options.

16.6.3 Modeling Time-Dependent Data

The basic issues related to the modeling of time, in particular the choice of "snapshots" or history, are covered in Chapter 15 and apply equally to data warehouses, data marts, and operational databases. This section covers a few key aspects of particular relevance to data mart design.

16.6.3.1 Time Dimension Tables

Most data marts include one or more dimension tables holding time periods to enable that dimension to be used in analysis (e.g., "What percentage of sales were made by salespeople in Region X in the last quarter?"). The key design decisions are the level of granularity (hours, days, months, years) and how to deal with overlapping time periods (financial years may overlap with calendar years, months may overlap with billing periods, and so on). The finer the granularity (i.e., the shorter the periods), the fewer problems we have with overlap and the more precise our queries can be. However,

[Figure 16.7 Many-to-many relationship between dimension and fact tables: a Sale may be credited to more than one Salesperson, and is classified by Product.]


query formulation may be more difficult or time-consuming in terms of specifying the particular periods to be covered.

Sometimes, we will need to specify a hierarchy of time periods (as a snowflake, or collapsed into a single-level denormalized star). Alternatively, or in addition, we may specify multiple time dimension tables, possibly covering overlapping periods.
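A load program would typically generate such a table rather than source it from anywhere. A minimal day-grain sketch follows; the quarter and financial-year rules are assumptions (calendar quarters, and a July-June financial year labeled by the year it ends in), not rules from the book.

```python
from datetime import date, timedelta

def build_period_rows(start: date, end: date):
    """One row per day, carrying the coarser periods the day rolls up into."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "day": d.isoformat(),
            "month": d.strftime("%Y-%m"),
            "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
            # Assumed July-June financial year, labeled by its ending year.
            "fin_year": d.year + 1 if d.month >= 7 else d.year,
        })
        d += timedelta(days=1)
    return rows

# Four days straddling both a quarter and a financial-year boundary.
rows = build_period_rows(date(2005, 6, 29), date(2005, 7, 2))
```

Because every day row carries its month, quarter, and financial year, queries at any of those granularities need no extra lookup tables, which is the single-level star trade-off described above.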

16.6.3.2 Slowly-Changing Dimensions

One of the key concerns of the data mart designer is how quickly the data in the dimension tables will change, and how quickly fact data may move from one dimension to another.

Figure 16.9 shows a simple example of the problem, in snowflake form for clarity. This might be part of a data mart to support analysis of customer purchasing patterns over a long period.

It should be clear that, if customers can change from one customer group to another over time and our mart only records the current group, we will not be able to ask questions such as, "What sort of vehicles did people buy while they were in group 'A'?" (We could ask, "What sort of vehicles did people currently in group 'A' buy over time?" but this may well be less useful.)

[Figure 16.8 Treating the Sale by Salesperson table as the fact table, credited to Salesperson and classified by the other dimensions.]


In the operational database, such data will generally be supported by many-to-many relationships, as described in Chapter 15, and/or matching of timestamps and time periods. There are many ways of reworking the structure to fit the star schema requirement. For example:

1. Probably the neatest solution to the problem as described is to carry two foreign keys to Customer Group in the Purchase table. One key points to the customer group to which the customer belonged at the time of the purchase; the other points to the customer group to which the customer currently belongs. In fact, the information supported by the latter foreign key may not be required by the users, in which case we can delete it, giving us a very simple solution.

Of course, setting up the mart in this form will require some translation of data held in more conventional structures in the operational databases and (probably) the data warehouse.

2. If the dimension changes sufficiently slowly in the time frames in which we are interested, then the amount of error or uncertainty that it causes may be acceptable. We may be able to influence the speed of change by deliberately selecting or creating dimensions (perhaps at the data warehouse stage) which change relatively slowly. For example, we may be able to classify customers into broad occupational groups ("professional," "manual worker," "technician") rather than more specific occupations, or even develop lifestyle profiles that have been found to be relatively stable over long periods.

3. We can hold a history of (say) the last three values of Customer Group in the Customer table. This approach will also give us some information on how quickly the dimension changes.
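Option 1 can be sketched as a fact table carrying both keys (illustrative names; the group columns stand in for foreign keys to Customer Group). The as-at-purchase key answers the "while they were in group 'A'" question; the current key supports the other, usually less useful, reading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE purchase (
    purchase_id       INTEGER PRIMARY KEY,
    customer_id       INTEGER,
    group_at_purchase TEXT,  -- customer group when the purchase was made
    current_group     TEXT,  -- customer group now (refreshed at each load)
    value             NUMERIC
);
-- Customer 1 bought while in group 'A', then moved to group 'B'.
INSERT INTO purchase VALUES
    (1, 1, 'A', 'B', 100),
    (2, 1, 'B', 'B', 250),
    (3, 2, 'A', 'A', 40);
""")

while_in_a = cur.execute(
    "SELECT SUM(value) FROM purchase WHERE group_at_purchase = 'A'"
).fetchone()[0]
currently_a = cur.execute(
    "SELECT SUM(value) FROM purchase WHERE current_group = 'A'"
).fetchone()[0]
```

The two questions give different answers (140 versus 40 here), which is exactly the distinction the dual keys preserve.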

Logical data warehouse and data mart design are important subdisciplines of data modeling, with their own issues and techniques.

[Figure 16.9 Slowly changing dimensions: a Customer dimension classified by a Customer Group that can change over time.]


Data warehouse design is particularly influenced by its role as a staging point between operational databases and data marts. Existing data structures in operational databases or (possibly) existing data marts will limit the freedom of the designer, who will also need to support high volumes of data and load transactions. Within these constraints, data warehouse design has much in common with the design of operational databases.

The rules of data mart design are largely a result of the star schema structure, a limited subset of the full E-R structures used for operational database design, and lead to a number of design challenges, approaches, and patterns peculiar to data marts. The data mart designer also has to contend with the limitations of the data available from the warehouse.


Chapter 17

Enterprise Data Models and Data Management

"Always design a thing by considering it in its next larger context—a chair in a room, a room in a house, a house in an environment, an environment in a city plan."

– Eliel Saarinen

So far, we have discussed data modeling in the context of database design; we have assumed that our data models will ultimately be implemented more or less directly using some DBMS. Our interest has been in the data requirements of individual application systems.

However, data models can also play a role in data planning and management for an enterprise as a whole. An enterprise data model (sometimes called a corporate data model) is a model that covers the whole of, or a substantial part of, an organization. We can use such a model to:

■ Classify or index existing data

■ Provide a target for database and systems planners

■ Provide a context for specifying new databases

■ Support the evaluation and integration of application packages

■ Guide data modelers in the development or implementation of individual databases

■ Specify data formats and definitions to support the exchange of data between applications and with other organizations

■ Provide input to business planning

■ Specify an organization-wide database (in particular, a data warehouse)

These activities are part of the wider discipline of data management—

the management of data as a shared enterprise resource—that warrants abook in itself.1 In this chapter, we look briefly at data management in


1 A useful starting point is Guidelines to Implementing Data Resource Management, 4th Edition, Data Management Association, 2002.


general, and examine how development of an enterprise data model differs from development of a conventional project-level data model.

But first, a word of warning: far too many enterprise data models have ended up "on the shelf" after considerable expenditure on their development. The most common reason, in our experience, is a lack of a clear idea of how the model is to be used. It is vital that any enterprise data model be developed in the context of a data management or information systems strategy, within which its role is clearly understood, rather than as an end in itself.

17.2.1 Problems of Data Mismanagement

The rationale for data management is that data is a valuable and expensive resource that therefore needs to be properly managed. Parallels are often drawn with physical assets, people, and money, all of which need to be managed explicitly if the enterprise is to derive the best value from them. As with the management of other assets, we can best understand the need for data management by looking at the results of not doing it.

Databases have traditionally been implemented on an application-by-application basis—one database per application system. Indeed, databases are often seen as being "owned" by their parent applications. The problem is that some data may be required by more than one application. For example, a bank may implement separate applications to handle personal loans and savings accounts, but both will need to hold data about customers. Without some form of planning and control, we will end up holding the same data in both databases. And here, the element of choice in data modeling works against us; we have no guarantee that the modelers working on different systems will have represented the common data in the same way, particularly if they are software package developers working for different vendors. Differences in data models can make data duplication difficult to identify, document, and control.

The effects of duplication and inconsistency across multiple systems are similar to those that arise from poor data modeling at the individual system level.

There are the costs of keeping multiple copies of data in step (and repercussions from data users—including customers, managers, and regulators—if we do not). Most of us have had the experience of notifying an organization of a change of address and later discovering that only some of their records have been updated.

Pulling data together to meet management information needs is far more difficult if definitions, coding, and formats vary. An airline wants to know


the total cost of running each of its terminals, but the terminals are identified in different ways in different systems—sometimes only by a series of account numbers. An insurance company wants a breakdown of profitability by product, but different divisions have defined "product" in different ways. Problems of this kind constitute the major challenge in data warehouse development (Chapter 16).

Finally, poor overall data organization can make it difficult to use the data in new ways as business functions change in response to market and regulatory pressures and internal initiatives. Often, it seems easier to implement yet another single-purpose database than to attempt to use inconsistent existing databases. A lack of central documentation also makes reuse of data difficult; we may not even know that the data we require is held in an existing database. The net result, of course, is still more databases, and an exacerbation of the basic problem. Alternatively, we may decide that the new initiative is "too hard" or economically untenable.

We have seen banks with fifty or more "Branch" files, retailers with more than thirty "Stock Item" files, and organizations that are supposedly customer-focused with dozens of "Customer" files. Often, just determining the scope of the problem has been a major exercise. Not surprisingly, it is the data that is most central to an organization (and, therefore, used by the greatest number of applications) that is most frequently mismanaged.

17.2.2 Managing Data as a Shared Resource

Data management aims to address these issues by taking an organization-wide view of data. Instead of regarding databases as the sole property of their parent applications, we treat them as a shared resource. This may entail documenting existing databases; encouraging development of new, sharable databases in critical areas; building interfaces to keep data in step; establishing standards for data representation; and setting an overall target for data organization. The task of data management may be assigned to a dedicated data management (or "data administration" or "information architecture") team, or be included in the responsibilities of a broader "architectures" group.

17.2.3 The Evolution of Data Management

The history of data management as a distinct organizational function dates from the early 1970s. In an influential paper, Nolan2 identified "Data

2 Nolan: "Managing the Crisis in Data Processing," Harvard Business Review, 57(2), March–April 1979.


Administration" as the penultimate of six stages of data processing growth (the last being "Maturity"). Many medium and large organizations established data management groups, and data management began to emerge as a discipline in its own right.3

In the early days of data management, some organizations pursued what seemed to be the ideal solution: development of a single shared database, or an integrated set of "subject databases" covering all of the enterprise's data requirements. Even in the days when there were far fewer information systems to deal with, the task proved overwhelmingly difficult and expensive, and there were few successes. Today, most organizations have a substantial base of "legacy" systems and cannot realistically contemplate replacing them all with new applications built around a common set of data structures.

Recognizing that they could not expect to design and build the enterprise's data structures themselves, data managers began to see themselves as akin to town planners (though the term "architect" has continued to be more widely used—unfortunately, in our view, as the analogy is misleading). Their role was to define a long-term target (town plan) and to ensure that individual projects contributed to the realization of that goal.

In practice, this meant requiring developers to observe common data standards and definitions (typically specified by an enterprise-wide data model), to reuse existing data where practicable, and to contribute to a common set of data documentation. Like town planners, data managers encountered considerable resistance along the way, as builders asserted their preference for operating without outside interference and appealed to higher authorities for special dispensation for their projects.

This approach, too, has not enjoyed a strong record of success, though many organizations have persisted with it. A number of factors have worked against it, in particular the widespread use of packaged software in preference to in-house development, and greater pressure to deliver results in the short-to-medium term.

In response to such challenges, some data managers have chosen to take a more proactive and focused role, initiating projects to improve data management in specific areas, rather than attempting to solve all of an organization's data management problems. For example, they might address a particularly costly data quality problem, or establish data standards in an area in which data matching is causing serious difficulties. Customer Relationship Management (CRM) initiatives fall into this category, though in many cases they have been initiated and managed outside the data management function.

3 The Data Management Association (DAMA) at www.dama.org is a worldwide body that supports data management professionals.


More recently we have seen a widespread change in philosophy. Rather than seek to consolidate individual databases, organizations are looking to keep data in step through messages passed amongst applications. In effect, there is a recognition that applications (and their associated databases) will be purchased or developed one at a time, with relatively little opportunity for direct data sharing. The proposed solution is to accept the duplication of data which inevitably results, but to put in place mechanisms to ensure that when data is updated in one place, messages (typically in XML format) are dispatched to update copies of the data held by other applications.

For some data managers, this approach amounts to a rejection of the data management philosophy. For others, it is just another mechanism for achieving similar ends. What is clear is that while the technology and architecture may have changed, the basic issues of understanding data meaning and formats within and across applications remain. To some extent at least, the problem of data specification moves from the databases to the message formats.
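A sketch of such a message in Python; the element and attribute names here are entirely hypothetical, not from any standard message format:

```python
import xml.etree.ElementTree as ET

def change_message(entity: str, key: str, **changes) -> str:
    """Serialize an update to one entity instance as a small XML message."""
    root = ET.Element("dataChange", entity=entity, key=key)
    for name, value in changes.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")

def apply_message(xml_text: str, local_copy: dict) -> None:
    """A subscribing application updates its own copy of the record."""
    root = ET.fromstring(xml_text)
    for field in root.findall("field"):
        local_copy[field.get("name")] = field.text

# The source application publishes a customer address change ...
msg = change_message("Customer", "C123", address="1 New St")
# ... and a subscriber applies it to its local (duplicated) copy.
local = {"address": "9 Old Rd", "name": "Smith"}
apply_message(msg, local)
```

The subscriber never sees the publisher's database; agreement on the message's data names and formats is what keeps the copies consistent, which is why the data definition work survives the change of architecture.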

An enterprise data model has been central to all of the traditional approaches to data management and, insofar as the newer approaches also require enterprise-wide data definitions, is likely to continue to remain so. In the following sections, we examine the most important roles that an enterprise data model can play.

Most organizations have a substantial investment in existing databases and files. Often, the documentation of these is of variable quality and held locally with the parent applications.

The lack of a central, properly indexed register of data is one of the greatest impediments to data management. If we do not know what data we have (and where it is), how can we hope to identify opportunities for its reuse or put in place mechanisms to keep the various copies in step? The problem is particularly apparent to builders of data warehouses (Chapter 16) and reporting and analysis applications, which need to draw data from existing operational files and databases. Just finding the required data is often a major challenge. Correctly interpreting it in the absence of adequate documentation can prove an even greater one, and serious business mistakes have been made as a result of incorrect assumptions.

Commercial data dictionaries and "repositories" have been around for many years to hold the necessary metadata (data about data). Some organizations have built their own, with mixed success. But data inventories are of limited value without an index of some kind; we need to be able to ask, "What
