
Data Modeling Techniques for Data Warehousing, Part 7


do have to take it into account in our modeling approach. Dimension keys in fact tables should be given names that reflect the roles they play for the fact. A dimension key called Time is therefore not a very good idea. From the examples presented above, we should provide names for the various time dimensions, such as Order Date, Shipment Date, and Delivery Date (see Figure 58).

Figure 58 Dimension Keys and Their Roles for Facts in Dimensional Models
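
As a minimal sketch of role-named dimension keys, the fact below carries three date keys named after the roles mentioned above. The class name `SalesFact`, the `quantity_sold` measure, and the helper `days_to_deliver` are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesFact:
    # Each date key is named after the role it plays for the fact,
    # instead of sharing one generic "time" key.
    order_date: date
    shipment_date: date
    delivery_date: date
    quantity_sold: int  # hypothetical measure, for illustration only


def days_to_deliver(fact: SalesFact) -> int:
    """Elapsed days between the order and the delivery of a sale."""
    return (fact.delivery_date - fact.order_date).days


sale = SalesFact(date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 9), 12)
print(days_to_deliver(sale))
```

With a single key called Time, a question such as "how long did delivery take?" could not even be phrased against the fact. The role names make it direct.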

Getting the Measures Right: Measures are elements of prime importance for a dimensional model. During the initial dimensional modeling phase, candidate measures are determined based on the end-user queries and their requirements in general. Candidate measures identified in this way may not be the best possible choices. We strongly suggest that each and every candidate measure be submitted to a detailed assessment of its representativeness and its usefulness for information analysis purposes.

It is generally recommended that the measures within the dimensional model be representative from a generic business perspective. Failing to do so will make models nonintuitive and complicated to handle. It will also make the dimensional model unstable and difficult to extend beyond a purely local interest.

When investigating the "quality" of candidate measures, you should focus on the following main issues:

• Meaning of each candidate measure: Expressed in business terms, a clear and precise statement of what the measure actually represents is a vital piece of metadata that must be made available to end users.

• Granularities of the dimensions of each measure: Although granularities of dimensions are usually considered at the level of facts, it is important that measures incorporated within a fact are evaluated against the dimension keys of that fact. Such an evaluation may reveal that a given measure would be better incorporated in another fact or that granularities should perhaps be changed. Particular attention should be paid to analyzing the meaningfulness of the candidate measures versus the time dimension of the fact.

• Relationship between each measure and the source data item or items it is derived from: Although there is no guarantee that source data items are actually correct representations of business-related items, it is clear that measures are derived from these source data items and that therefore this derivation must be identified as clearly and precisely as possible. You may have to deal with very simple derivations, such as when a measure is an import of a particular source data item. You may also have to deal with complex derivation formulas, involving several source data items, functional transformations such as sums, averages, or even complex statistical functions, and many more. This information is of similar if not more importance than the definition statement that describes the meaning of the measure in business terms. Unfortunately, this work is seldom done in a precise way. It is a complex task, especially if complex formulas are involved. The work usually is further complicated because of replication and duplication of data items in the source data systems and the lack of a source data business directory and a precise understanding of the data items in these systems. Nevertheless, we strongly advocate that this definition work be done as precisely as possible and that the information be made available to end users as part of the metadata.

• Use of each measure in the data analysis processes: Measures are used by end users in calculations that are essential for producing "meaningful" data analysis results. Calculations such as these can be simple, such as in these examples:

1. Display a list of values of a particular measure, for a selection of facts.

Other calculations can involve complicated formulas:

2. Assuming products shipped to customers are packaged in cases or packaging units, to calculate the Quantity Shipped of a given product in an analytical operation that compares these numbers, for products that can be packed in different quantities, the formula should in some way include packaging conversion rules and values.

These calculations may involve a sequence of related calculations:

3. To calculate the Net Profit of a Sale, we may first have to calculate several kinds of costs and a Net Invoice Price for the Sale before calculating the Net Profit.
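
The second and third examples can be sketched roughly as follows. The conversion table `UNITS_PER_CASE`, the product names, and the particular cost breakdown (manufacturing plus shipping) are assumptions made for illustration; the text says only that conversion rules and "several kinds of costs" are involved. Amounts are whole currency units to keep the arithmetic exact.

```python
# Hypothetical packaging conversion values: units contained in one case.
UNITS_PER_CASE = {"phone-A": 10, "phone-B": 24}


def quantity_shipped_in_units(product: str, cases_shipped: int) -> int:
    """Convert shipped cases to units so that products packed in
    different quantities can be compared on the same basis."""
    return cases_shipped * UNITS_PER_CASE[product]


def net_profit(list_price: int, discount: int,
               manufacturing_cost: int, shipping_cost: int) -> int:
    """Net Profit as a sequence of related calculations."""
    net_invoice_price = list_price - discount          # step 1
    total_cost = manufacturing_cost + shipping_cost    # step 2 (assumed cost kinds)
    return net_invoice_price - total_cost              # step 3


print(quantity_shipped_in_units("phone-A", 3))
print(quantity_shipped_in_units("phone-B", 3))
print(net_profit(list_price=100, discount=10,
                 manufacturing_cost=60, shipping_cost=5))
```

Both calculations are possible only if the model holds the conversion values and each intermediate cost item, which is the kind of gap this assessment is meant to expose.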

For a data warehouse modeler, it is essential to capture the fundamental calculations that are part of the information analysis process and assess their impact on the dimensional model. Two fundamental questions must be investigated each time: Can the calculation be performed? In other words, does the model include all of the data items required for the calculation? And, can the calculation be performed efficiently? In this case you are assessing primarily how easy it is for the analyst to formulate the calculation. If feasible, some performance aspects associated with the calculations may be assessed here too.

In practice, it is clear that analyzing each and every calculation involving a particular measure or a set of measures is impossible. What we suggest, however, is that the modeler take the time to analyze the key derivation formulas of the data analysis processes. The purpose is to find out whether the candidate measures are correctly defined and incorporated in the model. This work is obviously heavily influenced by the available end-user requirements, knowledge of the business process, and the analytical processing that is performed.

In addition to evaluating the key derivation formulas and how they impact the dimensional model, building a prototype for the dimensional model is a very welcome aid for this part of the work. As with any "learning" process, the prototype may be filled up with a sampling of source data and made available to end users as a "training set."

Measures are also heavily involved in the typical OLAP operations: slicing, roll up, and drill down. Here too, some assessment of the measures involved in these operations may help improve the model. The "quality" analysis of the measures for these cases is somewhat simpler than the above, though. In fact, a dominant question related to all these operations is whether a particular measure is additive or not, and whether this property is applicable to all of the dimension keys of the measure or only to some.

Ralph Kimball defines three types of measures (Ralph Kimball, The Data Warehouse Toolkit):

• Additive: Additive measures can be added across any of their dimensions. They are the most frequently occurring measures. Examples of additive measures in the CelDial model are Total Cost and Total Revenue.

• Semiadditive: Semiadditive measures can be added only across some of their dimensions. An example in the CelDial model is Average Quantity On Hand in the Inventory fact, which is not additive across its time dimension.

• Nonadditive: Nonadditive measures cannot be added across any of their dimensions. Frequently occurring examples of nonadditive measures in dimensional models are ratios.

Semiadditive and nonadditive measures should be modeled differently to make them (more) additive; otherwise, the end user must be made aware of the restrictions.
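
The difference between the three kinds of measures can be illustrated with a small sketch. The snapshot rows, warehouse names, and the margin ratio are made up for illustration, and storing the additive components of a ratio is only one possible reading of "modeled differently".

```python
from collections import defaultdict

# Made-up monthly inventory snapshots: (month, warehouse, quantity_on_hand).
snapshots = [
    ("2024-01", "north", 100),
    ("2024-01", "south", 50),
    ("2024-02", "north", 80),
    ("2024-02", "south", 70),
]


def on_hand_by_month(rows):
    """Adding across warehouses within one month is meaningful."""
    totals = defaultdict(int)
    for month, _warehouse, quantity in rows:
        totals[month] += quantity
    return dict(totals)


def average_on_hand(rows):
    """Across time, average the monthly totals instead of summing them."""
    monthly = on_hand_by_month(rows)
    return sum(monthly.values()) / len(monthly)


def margin_ratio(revenue, cost):
    """A ratio is nonadditive; recompute it from additive components."""
    return (revenue - cost) / revenue


# Made-up (revenue, cost) pairs for two sales.
sales = [(100, 80), (300, 150)]
total_revenue = sum(revenue for revenue, _ in sales)
total_cost = sum(cost for _, cost in sales)
overall_margin = margin_ratio(total_revenue, total_cost)

print(on_hand_by_month(snapshots))
print(average_on_hand(snapshots))
print(overall_margin)
```

Summing every snapshot quantity would give 400, a number that corresponds to nothing in the warehouse, and adding the two per-sale margins would likewise not give the overall margin. Keeping revenue and cost as separate additive measures lets the ratio be derived correctly at any level.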

Fact Attributes Other Than Dimension Keys and Measures: So far, our fact tables have only contained dimension keys and measures. In reality, fact tables can contain other attributes too. Because fact tables tend to become very large in terms of the number of facts they contain, we recommend being very selective when adding attributes. Very specifically, all kinds of descriptive attributes and labels should be avoided within facts. In practice, though, adding one or more attributes to a base fact can make querying much easier without causing too much of an impact on the size of the fact and consequently on the size of the fact table itself.

Several of the attributes the modeler will want to add to a fact will be derived attributes. For a data warehouse model, adding derived attributes should not really be a problem, particularly because the data warehouse is a read-only environment. When adding derived attributes to a fact, however, the modeler should understand and assess the impact of adding attributes on the data warehouse populating subsystem. Usually, adding derived attributes anywhere in the data warehouse model is a trade-off between making querying easier and more efficient and making the populating process more complicated.

Three types of fact attributes are particularly interesting to consider. They are illustrated with the CelDial model in Figure 59 on page 115.

Figure 59 Degenerate Keys, Status Tracking Attributes, and Supportive Attributes in the CelDial Model

Degenerate keys are equivalent to dimension keys of a fact, with the exception that there is no other dimension information associated with a degenerate key. Degenerate keys are used in data analysis processes to group facts together: for example, in the CelDial model, SalesOrder is represented through the Order dimension key in the Sales fact.

Status tracking attributes identify different states in which the fact can be found. Often, status tracking attributes are status indicators or date/time combinations. Status tracking attributes are used by the information analyst to select or classify relevant facts. Their appearance in a fact table often is related to the granularity of the dimensions associated with the fact. For example, in the CelDial model, the Sales fact may contain status tracking attributes that indicate whether the Sale is "Received," "In process," or "Shipped." This can be modeled using either a state attribute or three date/time attributes representing when the sale was received, when it entered processing, and when it was shipped to the customer.

Supportive attributes are added to a fact to make querying more effective. Supportive attributes are those a modeler has to be particularly careful with, because there is often no limit to what can be considered as being supportive. For example, in the CelDial model, Unit Cost in the Sales fact could be considered a supportive attribute. Other frequently occurring examples of supportive attributes are key references to other parts in the dimensional model. These attributes help reduce complex join operations, which end users would otherwise have to formulate.
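
The three kinds of fact attributes can be pictured together in one record. This is a sketch only: the field names, the order numbers, and the choice of three date/time attributes over a single state attribute are assumptions, not the CelDial model as drawn in Figure 59.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SalesFact:
    product_key: int                    # dimension key
    customer_key: int                   # dimension key
    order_number: str                   # degenerate key: no dimension table behind it
    received_at: datetime               # status tracking: one date/time per state
    processing_at: Optional[datetime]
    shipped_at: Optional[datetime]
    unit_cost_cents: int                # supportive attribute
    quantity_sold: int                  # measure

    @property
    def status(self) -> str:
        """Derive the state from the date/time attributes that are filled in."""
        if self.shipped_at is not None:
            return "Shipped"
        if self.processing_at is not None:
            return "In process"
        return "Received"


def group_by_order(facts):
    """Use the degenerate key to group facts that belong to one order."""
    orders = defaultdict(list)
    for fact in facts:
        orders[fact.order_number].append(fact)
    return dict(orders)


facts = [
    SalesFact(1, 7, "SO-1001", datetime(2024, 3, 1, 9), datetime(2024, 3, 2, 9),
              datetime(2024, 3, 4, 9), 4500, 2),
    SalesFact(2, 7, "SO-1001", datetime(2024, 3, 1, 9), datetime(2024, 3, 2, 9),
              None, 1200, 1),
    SalesFact(1, 8, "SO-1002", datetime(2024, 3, 3, 9), None, None, 4500, 5),
]

print([fact.status for fact in facts])
print({order: len(lines) for order, lines in group_by_order(facts).items()})
```

Keeping three date/time attributes costs more space per fact than a single state attribute, but it also preserves when each transition happened.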

8.4.3 Requirements Validation

During requirements validation, the results of requirements analysis are assessed and validated against the initially captured end-user requirements. Also as part of requirements validation, candidate data sources on which the end-user requirements will have to be mapped are identified and inventoried. Figure 60 on page 116 illustrates the kinds of activities that are part of requirements validation.

Figure 60 Requirements Validation Process

The main activities that have to be performed as part of requirements validation are:

• Checking of the coherence and completeness of the initial dimensional models and validation against the given end-user requirements. The initial models are analyzed with the end users. As a result, more investigations could be performed by the requirements analyst and the initial models may be adapted, in an attempt to fix the requirements as they are expressed in the models, before passing them to the requirements modeling phase.

• Candidate data sources are identified. An inventory of required and available data sources is established.

• The initial dimensional models, possibly completed with informal end-user requirements, are mapped to the identified data sources. This is usually a tedious task. The source data mapping must investigate the following mapping issues:

− Which source data items are available and which are not? For those that are not available, should the source applications be extended, can they perhaps be found using external data sources, or should end users be informed about their unavailability and, as a consequence, should the coverage of the dimensional model be reduced?

− Are other interesting data items available in the data sources but have not been requested? Identifying data items that are available but not requested may reveal interesting other facets of the information analysis activities and may therefore have significant impact on the content and structure of the dimensional model being constructed.

− How redundant are the available data sources? Usually, data items are replicated several times in operational databases. Basically, this is the result of an application-oriented database development approach that almost automatically leads to disparate operational data sources in which lots of data is redundantly copied. Studying redundant sources involves studying data ownership. This study must identify the prime copy of the source data items required for the dimensional model.

− Even if the source data items are available, one still has to investigate whether they can be captured or extracted from the source applications and at what cost. As part of requirements validation, a high-level assessment of the feasibility of source data capturing must be done. Feasibility of data capture is very much influenced by the temporal aspects of the dimensional model and by the base granularities of facts and measures in the model.

− To conclude the requirements validation phase, an initial sizing of the model must be performed. If at all possible, the initial sizing should also investigate volume and performance aspects related to populating the data warehouse.
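
The first three mapping questions (what is missing, what is available but not requested, and what is held redundantly) lend themselves to a simple inventory comparison. The sketch below assumes each source system can be described as a set of data item names; the system names and item names are invented for illustration.

```python
def map_sources(required, sources):
    """Compare the data items a model requires with what each source offers."""
    available = set().union(*sources.values())
    redundant = {
        item: sorted(name for name, items in sources.items() if item in items)
        for item in available
        if sum(item in items for items in sources.values()) > 1
    }
    return {
        "missing": required - available,        # not found in any source
        "unrequested": available - required,    # available but never asked for
        "redundant": redundant,                 # held by more than one source
    }


# Hypothetical inventory of required items and source systems.
required_items = {"product_id", "customer_id", "order_date", "revenue", "unit_cost"}
source_systems = {
    "order_entry": {"product_id", "customer_id", "order_date", "revenue"},
    "inventory": {"product_id", "plant", "quantity_on_hand"},
}

report = map_sources(required_items, source_systems)
print(report)
```

A report like this settles none of the questions by itself. For each redundant item someone still has to decide which system owns the prime copy, and for each missing item whether to extend a source, find external data, or reduce the model's coverage.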

The results of requirements validation must be used to assess the scope and complexity of the data warehouse development project and to (re-)assess the business justification of it. Requirements validation must be performed in collaboration with the end users. Incompleteness or incorrectness of the initial models should be revealed and corrected. Requirements validation may involve building a prototype of the dimensional model.

As a result of requirements validation, end-user requirements and end-user expectations should be confirmed or reestablished. Also as a result of requirements validation, source data reengineering recommendations may be identified and evaluated. At the end of requirements validation, a (new) "sign-off" for the data warehouse modeling project should be obtained.

8.4.4 Requirements Modeling - CelDial Case Study Example

Requirements modeling consists of several important activities that all are performed with the intent of producing a detailed conceptual model that best represents the problem domain of the information analyst. Figure 61 gives an overview of the major activities that are part of requirements modeling. Obviously, the project itself determines to what extent each of these activities should be performed.

Figure 61 Requirements Modeling Activities

Modeling the dimensions consists of a series of activities that produce detailed models for the various candidate dimensions which are part of the initial dimensional model.

A detailed dimension model should incorporate all there is to capture about the structure of the dimension as well as all of its attributes. One approach consists of producing the dimension models in the form of a flat dimension table. This approach results in models called star models or star schemas. Another approach produces dimension models in the form of structured ER models. This approach is said to produce so-called snowflake models or snowflake schemas. Figure 62 illustrates the star model approach and Figure 63 illustrates the snowflake approach for the CelDial case study.

Figure 62 Star Model for the Sales and Inventory Facts in the CelDial Case Study

Figure 63 Snowflake Model for the Sales and Inventory Facts in the CelDial Case Study
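
To make the contrast concrete, the sketch below holds a small manufacturing dimension in snowflake form (plants referring to a separate region table) and flattens it into a single star-style dimension table. The plant and region values are invented, and the two-table layout is only an assumed example of a structured dimension, not a transcription of Figure 63.

```python
# Snowflake form: the region is normalized into its own table.
regions = {10: "East", 20: "West"}
plants = {
    1: {"plant_name": "Plant A", "region_id": 10},
    2: {"plant_name": "Plant B", "region_id": 10},
    3: {"plant_name": "Plant C", "region_id": 20},
}


def flatten_to_star(plants, regions):
    """Resolve the region reference so each row is one flat dimension record."""
    return [
        {
            "manufacturing_key": key,
            "plant_name": plant["plant_name"],
            "region": regions[plant["region_id"]],
        }
        for key, plant in sorted(plants.items())
    ]


star_rows = flatten_to_star(plants, regions)
for row in star_rows:
    print(row)
```

The flat table repeats the region name on every plant row, which costs some redundancy but spares the analyst a join. The snowflake form keeps the hierarchy explicit instead.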

Dimensions play a particular role in a dimensional model. Unlike facts, whose primary use is in calculations, the dimensions are used primarily for:

1. Selecting relevant facts

2. Aggregating measures

The base structure of a dimension is the hierarchy. Dimension hierarchies are used to aggregate business measures, like Total Revenue of Sales, at a lesser level of detail than the base granularity at which the measures are present in the dimensional model. In this case, the operation is known as roll-up processing. Roll-up processing is performed against base facts or measures in a dimensional model.

To illustrate roll up: Sales Revenue at the Regional level of CelDial's Sales Organization can be derived from the base values of the Revenue measure that are recorded in the Sales facts, by calculating the total of Sales Revenue for each of the levels of the hierarchy in the Sales Organization.

If measures are rolled up to a lesser level of detail as in the above example, the end user can obviously also perform the inverse operation (drill down), which consists of looking at more detailed measures or, to put it differently, exploring the aggregated measures at lower levels of detail along the dimension hierarchies. Figure 64 illustrates roll-up and drill-down activities performed against the Inventory fact in the CelDial case.

Figure 64 Roll Up and Drill Down against the Inventory Fact
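
Roll up and drill down amount to totaling the same base facts at different levels of one hierarchy. The sketch below assumes a two-level Sales Organization hierarchy (region, then office) and made-up revenue values; neither the level names nor the figures come from the CelDial model.

```python
from collections import defaultdict

# Assumed hierarchy, from least to most detailed level.
LEVELS = ("region", "office")

# Made-up base facts at the most detailed level.
sales_facts = [
    {"region": "East", "office": "New York", "revenue": 100},
    {"region": "East", "office": "New York", "revenue": 25},
    {"region": "East", "office": "Boston", "revenue": 50},
    {"region": "West", "office": "Los Angeles", "revenue": 70},
]


def roll_up(facts, level):
    """Total revenue per member of the given hierarchy level."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[name] for name in LEVELS[:depth])
        totals[key] += fact["revenue"]
    return dict(totals)


by_region = roll_up(sales_facts, "region")   # rolled up
by_office = roll_up(sales_facts, "office")   # drilled down one level

print(by_region)
print(by_office)
```

Keying each total by the full path down to the requested level keeps offices with the same name in different regions apart, and drilling down is simply the same call at a more detailed level.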

For all of the above reasons, dimensions are also called aggregation paths or aggregation hierarchies. In real life, where pure hierarchies are not so common, a modeler very frequently has to deal with dimensions that incorporate several different parallel aggregation paths, as in the example in Figure 65 on page 120.

Figure 65 Sample CelDial Dimension with Parallel Aggregation Paths
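
A dimension with parallel aggregation paths can be thought of as one set of members that can be grouped in more than one independent way. In the sketch below, the two paths (`category` and `brand`) and all values are hypothetical and are not taken from Figure 65.

```python
from collections import defaultdict

# Hypothetical product dimension with two independent grouping attributes.
product_dim = {
    1: {"category": "Phones", "brand": "Alpha"},
    2: {"category": "Phones", "brand": "Beta"},
    3: {"category": "Accessories", "brand": "Alpha"},
}

# Made-up base facts: (product_key, revenue).
sales = [(1, 100), (2, 40), (3, 10)]


def aggregate_along(path):
    """Total the same base facts along one of the aggregation paths."""
    totals = defaultdict(int)
    for product_key, revenue in sales:
        totals[product_dim[product_key][path]] += revenue
    return dict(totals)


print(aggregate_along("category"))
print(aggregate_along("brand"))
```

Neither path is a refinement of the other, so a single strict hierarchy cannot represent both. The dimension model has to record each path separately.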

One of the essential activities of dimension modeling consists of capturing the aggregation paths along which end users perform roll up and drill down. The models of the dimensions produced as the result of these activities will further be extended and changed when other modeling activities are performed, such as modeling the variancy of slow-varying time dimensions, dealing with constraints within the dimensions, and capturing relationships and constraints across dimensions. These elaborated modeling activities can have an impact on the dimensional model as a whole.

Now, let us explore the basics of dimension modeling (notice the subtle textual difference between dimension modeling and dimensional modeling), developing models for some representative nontemporal dimensions for CelDial, as well as for the time dimension.

8.4.4.1 Modeling of Nontemporal Dimensions

Figure 66 illustrates the Sales and Inventory facts in the CelDial case study with their associated dimensions: Product, Manufacturing, Customer, Sales Organization, and Time. Let's explore the representative nontemporal dimensions in the CelDial case study.

Figure 66 Inventory and Sales Facts and Their Dimensions in the CelDial Case Study

Notice that the dimensions in CelDial's models in Figure 65 are extremely simple. The Manufacturing dimension, for instance, consists of a manufacturing key, a region, and a plant name. This provides support for selecting facts associated with given manufacturing units or plants and aggregating them at regional level. The Product dimension has some more properties, but still it is only partly representative of reality: because we have captured particular end-user requirements, we should expect to find only part of what the real model should incorporate. The simplicity of the model is a consequence of end-user focused development. Even though this approach may lead to an acceptable solution for the identified end users and the queries they expressed, it usually needs considerably more attention to produce a model that has the potential to become acceptable for a broad set of users in the organization. Failing to extend the solution model and in particular to make dimension models representative for a broad scope of interest results in stovepipe solutions where each group of end users has its own little data mart with which it is satisfied (for a while). Such solutions are costly to maintain, do not provide consistency beyond the narrow view of a particular group of users, and, as a consequence, usually lack integration capabilities. Such solutions should be avoided at all costs.

As a consequence, it is recommended that you consider modeling the dimensions in a broader context. We illustrate this next. The effects of this global approach to modeling the dimensions will become clear when we progress through our examples.

The Product Dimension: The Product dimension is one of the dominant dimensions in most dimensional models. It incorporates the complete set of items an organization makes, buys, and sells. It also incorporates all of the important properties of the items and the relationships among the items, as these are used by end users when selecting appropriate facts and measures and exploring and aggregating them against several aggregation paths that the product dimension provides.

CelDial's product context is inherently simple. Products are manufactured and models of the products are stocked in inventories in manufacturing plants, waiting for customers to buy them. In addition, end users are interested primarily in sales analysis and do not seem to attach a lot of importance to being able to analyze sales figures at different levels of aggregation. Product level and Regional level analysis seems to be what they want. In this situation, the product dimension built in the data warehouse can easily be represented by a flat structure, such as the Product dimension table in Figure 66 on page 120.

In most cases, however, the product dimension is a rather big component of the warehouse, potentially comprising several tens of thousands of items. The product dimension in a data warehouse usually is derived from the product master database, which is in most cases present in the operational inventory management system. You also have to consider that users usually show interest in far more extensive classification levels and classification types and that they handle potentially hundreds of properties of the items. It then should become clear that we should look at a broader context to bring out the real issues involved in dimension modeling for the Product dimension.

Let us therefore have a look at what could happen with the Product dimension, if CelDial were part of a large sales organization, comprising retail sales (mostly anonymous sales) as well as corporate sales. Figure 67 on page 122 provides

Posted: 14/08/2014, 06:22
