
Data Modeling Techniques for Data Warehousing, Part 6




Figure 43 Requirements Validation

Requirements Modeling: Validated initial models are further developed into detailed dimensional models, showing all elements of the model and their properties. Detailed dimensional models can be further extended and optimized. Many techniques in this area should be thought of as advanced modeling techniques. Not every project requires all of them to be applied. We cover some of the more commonly applied techniques and indicate what other issues may have to be addressed. The major activities that are part of requirements modeling are illustrated in Figure 44.

Figure 44 Requirements Modeling

When advanced dimensional modeling techniques are used, such as the ones indicated in Figure 44, the dimensional model usually tends to become complex and dense. This may cause problems for end users. To solve this, consider building two-tiered data models, in which the back-end tier comprises all of the model artifacts and the full structure of the model, whereas the front-end tier (the part of the model with which the end user is dealing directly) is a derivation of the entire model, made simple enough for end users to use in their data analysis activities. Two-tier data modeling is not required as such. If end users can fully understand the dimensional model, the additional work of constructing the two tiers of the model should not be done.

Design, Construction, Validation, and Integration: Once requirements are modeled, possibly in a two-tiered dimensional model, design and construction activities are to be performed. These will further extend and possibly even change the models produced in the previous stages of the work, to make the resulting solution implementable in the software infrastructure of the data warehouse environment. Also, a functional validation of the proposed solution must be performed, together with the end users. This usually results in end users using the constructed solution for a while, giving them the opportunity to work with the information that has been made available to them in a local solution (perhaps in a data mart). In addition, the local solution may then be integrated into a more global data warehouse architecture, including the model of the data produced.

We attach particular importance to clearly separating modeling from design. Good modeling practice focuses on the essence of the problem domain. Modeling addresses the “what” question. Design addresses the question of “how” the model representing reality has to be prepared for implementing it in a given computing environment.

The separation between modeling and design is of significant importance for data warehouse modeling. Unfortunately, all too often modeling issues are mixed with design issues, and, as a consequence, end users are confronted with the results of what typically are design techniques. Because modeling is not always clearly separated from design, many data warehouse models have a technical outlook.

Neglecting a clear separation between modeling and design also results in models that are closely linked with the computing environment in general and with tools in particular. Thus it is difficult to integrate the models with others and to adapt and expand them. Keep in mind that a data warehouse and data warehouse models are very long lasting.

Each of the requirements steps in the dimensional modeling process is now discussed in more detail. The design, construction, validation, and integration steps are discussed within the context of the dimensional modeling requirements.

8.4.1 Requirements Gathering

End-user requirements suitable for a data warehouse modeling project can be classified in two major categories (see Figure 45 on page 93): process-oriented requirements, which represent the major information processing elements that end users are performing or would like to perform against the data warehouse being developed, and information-oriented requirements, which represent the major information categories and data items that end users require for their data analysis activities.

Typically, requirements can be captured that belong to either or both of these categories. The types of requirements that will be available and the degree of precision with which the requirements will be stated (or can be stated) often depend on two factors: the type of information analysis problem being considered for the data warehouse implementation project, and the ability of end users to express their information needs and the scenarios and strategies they use in their information analysis activities.

Figure 45 Categories of (Informal) End-User Requirements

8.4.1.1 Process-Oriented Requirements

Several types of process-oriented requirements may be available:

Business objectives

Business objectives are high-level expressions of information analysis objectives, expressed in business terms. One or more business objectives can be specified for a given data warehouse implementation project.

As an example, in the CelDial case study (see Appendix A, “The CelDial Case Study” on page 163), the business objectives could be stated as:

− “The data warehouse has to support the analysis of manufacturing costs and sales revenue of products manufactured and sold by CelDial.”

The combined business objectives can be used in the data warehouse implementation project as indicators of the scope of the project. They can also be used to identify information subject areas involved in the project and as a means to identify (usually high-level) measures of the business processes the end user is analyzing. In the CelDial example, the apparent information subject areas are Products and Sales. The objectives indicate that the global measures used in the information analysis process are “manufacturing cost” and “sales revenue.” Notice that these high-level measures “hide” a substantial requirement in terms of detailed data to calculate them.

Business queries

Business queries represent the queries, hypotheses, and analytical questions that end users issue and try to resolve in the course of their information analysis activities. Just as with business objectives, business queries are expressed in business terms. You should expect that they are usually not precisely formulated. They are certainly not expressed in terms of SQL.

Examples of frequently encountered categories of business queries are:

− Existence checking queries, such as “Has a given product been sold to a particular customer?”

− Item comparison queries, such as “Compare the value of purchases of two customers over the last six months,” or “Compare the number of items sold for a given product category, per store, and per week.”

− Trend analysis queries, such as “What is the growth in item sales for a given set of products, over the last 12 months?”

− Queries to analyze ratios, rankings, and clusters, such as “Rank our best customers in terms of dollar sales over the last year.”

− Statistical analysis queries, such as “Calculate the average item sales per product category, per sales region.”
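To make these categories more concrete, the following minimal Python sketch expresses four of them as small functions over a handful of sales records. The record layout, the sample values, and all function names are assumptions made for illustration; they are not taken from the CelDial case study.

```python
from collections import defaultdict

# Hypothetical sales records, invented for illustration:
# (customer, product, category, region, month, quantity, dollars)
SALES = [
    ("C1", "P1", "phones", "East", 1, 10, 1000.0),
    ("C1", "P2", "phones", "East", 2, 5, 750.0),
    ("C2", "P1", "phones", "West", 1, 4, 400.0),
    ("C2", "P3", "accessories", "West", 2, 20, 200.0),
    ("C3", "P3", "accessories", "East", 2, 10, 100.0),
]


def has_bought(customer, product):
    """Existence check: has the product been sold to the customer?"""
    return any(c == customer and p == product for c, p, *_ in SALES)


def items_sold_by_month(products):
    """Trend analysis: item sales per month for a set of products."""
    totals = defaultdict(int)
    for _, product, _, _, month, quantity, _ in SALES:
        if product in products:
            totals[month] += quantity
    return dict(sorted(totals.items()))


def rank_customers():
    """Ranking: customers ordered by total dollar sales, best first."""
    totals = defaultdict(float)
    for customer, *_, dollars in SALES:
        totals[customer] += dollars
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def average_items_sold():
    """Statistical analysis: average item sales per (category, region)."""
    quantities = defaultdict(list)
    for _, _, category, region, _, quantity, _ in SALES:
        quantities[(category, region)].append(quantity)
    return {key: sum(values) / len(values) for key, values in quantities.items()}
```

Each function already hints at modeling elements: the summed or averaged values behave like measures, and the attributes used for grouping and filtering behave like dimensions.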

For the CelDial case study, several business queries were identified. For the sake of this chapter, we selected three of them to use for illustration:

• (Q1) What is the average quantity on hand this month, for each product model in each manufacturing plant?

• (Q2) What is the total cost and revenue for each model sold today, summarized by outlet, outlet type, region, and corporate sales levels?

• (Q3) What is the total cost and revenue for each model sold today, summarized by manufacturing plant and region?

For a complete description of the CelDial case study, see Appendix A, “The CelDial Case Study” on page 163 and the description of the modeling process in Chapter 7.

Data analysis scenarios

Data analysis scenarios are a good way of adding substance to the set of requirements being captured and analyzed. Unfortunately, they are more difficult to obtain than other processing requirements and thus are not always available for requirements analysis.

Essentially two types of data analysis scenarios are of interest for data warehouse modeling:

Query workflow scenarios: These scenarios represent sequences of business queries that end users perform as part of their information analysis activities. Query workflow scenarios can significantly help create a better understanding of the information analysis process.

Knowledge inference strategies: These end-user requirements acknowledge the fact that activities performed by end users of a data warehouse have expert system characteristics. As with query workflow scenarios, these strategies can provide more understanding of the activities performed by end users. The simplest forms of knowledge inference strategies are those that show how users roll up and drill down along dimension hierarchies.

Whether or not these end-user requirements will be available depends on the capabilities of end users to express how they get to an answer or find a solution for their problems as well as on the type of data warehouse application that is being considered for the modeling project.
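Rolling up and drilling down along a dimension hierarchy can be sketched in a few lines of Python. The hierarchy used here (outlet, then region, then corporate level), the outlet names, and the revenue figures are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical hierarchy: every outlet belongs to exactly one region,
# and all regions together form the corporate level.
OUTLET_TO_REGION = {"O1": "North", "O2": "North", "O3": "South"}

# Hypothetical revenue recorded at the lowest level of the hierarchy.
REVENUE_BY_OUTLET = {"O1": 100.0, "O2": 250.0, "O3": 400.0}


def roll_up_to_region(revenue_by_outlet):
    """Roll up: aggregate outlet revenue to the region level."""
    totals = defaultdict(float)
    for outlet, revenue in revenue_by_outlet.items():
        totals[OUTLET_TO_REGION[outlet]] += revenue
    return dict(totals)


def roll_up_to_corporate(revenue_by_region):
    """Roll up once more: aggregate all regions to one corporate total."""
    return sum(revenue_by_region.values())


def drill_down(region, revenue_by_outlet):
    """Drill down: show the outlet-level detail behind one region."""
    return {
        outlet: revenue
        for outlet, revenue in revenue_by_outlet.items()
        if OUTLET_TO_REGION[outlet] == region
    }
```

A user who sees an unexpected regional total would typically call something like `drill_down("North", REVENUE_BY_OUTLET)` to inspect the outlets behind it, which is the kind of strategy worth capturing as a requirement.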


8.4.1.2 Information-Oriented Requirements

Information-oriented requirements capture an initial perception of the kinds of information end users use in their information analysis activities. There are different categories of information-oriented requirements that may be of interest for the requirements analysis and data warehouse modeling process:

Information subject areas

Information subject areas are high-level categories of business information. Information subject areas usually are used to build the high-level enterprise data model. When available, information subject areas indicate the scope of the data warehouse project. They also contribute to the requirements analyst's ability to relate the data warehouse project with other (already developed) parts of the data warehouse or to data marts.

For the CelDial case study, the information subject areas of interest are: Products, Sales (including Sales Organization), and Manufacturing (including Inventories). Whether or not the Customers information subject area is present in the scope of the CelDial case study is debatable. Although customer sales are involved, there is no apparent substantial requirement that indicates that the Customers subject area should also be included in the project. In addition, if retail outlets within the Sales Organization also hold inventories of products they may sell, then most probably Inventories should become an information subject area in its own right rather than be incorporated in Manufacturing. Debates such as these are typical when trying to establish the information subject areas involved in a data warehouse development project.

High-level data models, ER and/or dimensional models

Several data models may be available and could be used to further specify or support end-user requirements. They can be available as high-level enterprise data models, ER models, or dimensional models. The ER models may be collected by reengineering and integrating source data models. Dimensional models may be the result of previous dimensional data warehouse modeling projects.

Figure 46 on page 96 illustrates the relationships among the various data models in the data warehouse modeling process.

In user-driven modeling approaches, source data models are used as aids in the process of fully developing the data warehouse model.

Source data models may have to be constructed by using reverse engineering techniques that develop ER models from existing source databases. Several of these models may first have to be integrated into a global model representing the sources in a logically integrated way.


Figure 46 Data Models in the Data Warehouse Modeling Process

8.4.2 Requirements Analysis

Requirements analysis techniques are used to build an initial dimensional model that represents the end-user requirements captured previously in an informal way. The requirements analysis produces a schematic representation of a model that information analysts can interpret directly. The results of requirements analysis will be the primary input for data warehouse modeling once they have passed the requirements validation phase.

The scope of work of requirements analysis can be summarized as follows:

• Determine candidate measures, facts, and dimensions, including the dimension hierarchies

• Determine granularities

• Build the initial dimensional model

• Establish the business directory for the elements in the model

Figure 47 on page 97 summarizes the context in which initial dimensional modeling is performed and the kinds of deliverables that are produced.


Figure 47 Overview of Initial Dimensional Modeling

Figure 48 illustrates a notation technique that can be used to schematically document the initial dimensional model. It shows facts (or fact tables, if you prefer) with the measures they represent and the dimension hierarchies or aggregation paths associated with the facts. Dimension hierarchies are represented as arrows showing intermediary aggregation points. The dimensions may include alternate or parallel dimension hierarchies. Dimension hierarchies are given names drawn from the problem domain of the information analyst. These initial dimensional models also formally state the lowest level of detail (the granularity) of each dimension. An initial dimensional model consists of one or more such schemas.

Figure 48 Notation Technique for Schematically Documenting Initial Dimensional Models
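The same information that the notation records (facts, their measures, named dimension hierarchies, and the granularity of each dimension) can also be held in a small data structure. The sketch below is one possible representation, not the notation of Figure 48 itself; the class names and all hierarchy levels above the lowest one are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    granularity: str  # lowest level of detail recorded for this dimension
    # Named aggregation paths, each listed from finer to coarser levels.
    # More than one entry models alternate or parallel hierarchies.
    hierarchies: dict = field(default_factory=dict)


@dataclass
class Fact:
    name: str
    measures: list
    dimensions: list

    def granularity(self):
        """Return the lowest recording level of every dimension of the fact."""
        return {dim.name: dim.granularity for dim in self.dimensions}


# Illustrative schema; the hierarchy levels are hypothetical.
product = Dimension("Product", "model", {"product line": ["product family"]})
manufacturing = Dimension("Manufacturing", "plant", {"geography": ["region"]})
time_dim = Dimension(
    "Time",
    "day",
    {"calendar": ["month", "year"], "fiscal": ["fiscal period", "fiscal year"]},
)

inventory = Fact(
    "Inventory",
    ["Average Quantity On Hand"],
    [product, manufacturing, time_dim],
)
```

Calling `inventory.granularity()` lists the lowest level of each dimension, which is the piece of the schema that the text asks to be stated formally.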


8.4.2.1 Determining Candidate Measures, Dimensions, and Facts

To build an initial dimensional model, the following base elements have to be identified and arranged in the model:

• Measures

• Dimensions and dimension hierarchies

• Facts

Several approaches can be used to determine the base elements of a dimensional model. In reality, analysts combine the use of several of the approaches to find appropriate candidate elements for the model and integrate their findings in an initial dimensional model, which then combines several different views on reality. Because the requirements analysis process is nonlinear and because inherent relationships exist between the candidate elements, it does not really matter which approach is used, as long as the process is performed with a clear perspective on the business problem domain. The approaches essentially differ in the sequence with which they identify the modeling elements. Some of the most common approaches are:

• Determine measures first, then dimensions associated with measures, then facts

This approach could be called the query-oriented approach because it is the approach that flows naturally when the requirements analyst picks up the end-user queries as the first source of inspiration. Chapter 7, “The Process of Data Warehousing” on page 49 and the case study in Appendix A, “The CelDial Case Study” on page 163 were developed by using this approach.

• Determine facts, then dimensions, then measures

This approach is a business-oriented approach. Typically, it tries to determine first the fundamental elements of the business problem domain (facts and measures) and only then are the details required by the end users developed in it. This chapter shows how this approach can be used to compensate for the strict end-user-oriented view when trying to develop more fundamental and longer lasting models for the problem domain.

• Determine dimensions, then measures, then facts

This approach frequently is used when the source data models are being used as the basis for determining candidate elements for the initial dimensional model. We refer to it as the data-source-oriented approach.

Notice that facts, dimensions, and measures determined during this stage are candidate elements only. Some of them may later disappear from the model, be replaced by or merged with others, be split in two or more, or even change their “nature.”

Candidate Measures: Candidate measures can be recognized by analyzing the business queries. Candidate measures essentially correspond to data items that the users use in their queries to measure the performance or behavior of a business process or a business object.

For the CelDial project, the following candidate measures are present in Q1, Q2, and Q3:

• Average quantity on hand

• Total Cost

• Total Revenue


For a complete list of measures, refer to Chapter 7, “The Process of Data Warehousing” on page 49 and Appendix A, “The CelDial Case Study” on page 163.

Determining candidate measures requires smart, not mechanical, analysis of the business queries. Good candidate measures are numeric and are usually involved in aggregation calculations, but not every numeric attribute is a candidate measure. Also, candidate measures identified from the available queries may have peculiar properties that do not really make them “good” measures. We investigate some properties of measures later in this chapter and indicate how they may affect the model.

Measure Granularities within a Dimensional Model: The granularity of a measure can be defined intuitively as the lowest level of detail used for recording the measure in the dimensional model. For instance, Average Quantity On Hand can be considered to be present in the model per day or per month. Average Quantity On Hand could also be considered at the level of detail of product or perhaps at product category level or packaging unit.

Measures are usually associated with several dimensions. The granularity of a measure is determined by the combination of the recording details of all of its dimensions.

Different measures can have identical granularities. Because both Total Cost and Total Revenue seem to be associated with sales transactions in the CelDial case, they have identical granularities. We show next that measures with identical granularities are candidates for being part of another element of the dimensional model: the fact.
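The idea that measures with identical granularities belong together can be illustrated with a short grouping routine. This is only a sketch under stated assumptions: the dimension names follow the CelDial example, but the recording levels (model, outlet, plant, day) are assumed values chosen for the illustration.

```python
from collections import defaultdict

# Lowest recording level per dimension for each measure.
# Dimension names follow the CelDial example; the levels are assumptions.
MEASURE_GRANULARITY = {
    "Total Cost": {"Product": "model", "Sales Organization": "outlet", "Time": "day"},
    "Total Revenue": {"Product": "model", "Sales Organization": "outlet", "Time": "day"},
    "Average Quantity On Hand": {"Product": "model", "Manufacturing": "plant", "Time": "day"},
}


def group_by_granularity(measure_granularity):
    """Group measures whose dimensions and recording levels all match."""
    groups = defaultdict(list)
    for measure, levels in measure_granularity.items():
        # A frozenset of (dimension, level) pairs is hashable and order-free,
        # so two measures share a key only if their granularity is identical.
        groups[frozenset(levels.items())].append(measure)
    return list(groups.values())


candidate_facts = group_by_granularity(MEASURE_GRANULARITY)
```

With these assumed levels, Total Cost and Total Revenue fall into one group and Average Quantity On Hand into another, so each group is a candidate fact.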

Determining the right granularities of measures in the data warehouse model is of extreme importance. It basically determines the depth at which end users will be able to perform information analysis using the data warehouse or the data mart. For data warehouses, the granularity situation is even more complex. Fine-grained recording of data in the data warehouse model supports finely detailed analysis of information in the warehouse, but it also increases the volume of data that will be recorded in the data warehouse and therefore has great impact on the size of the data warehouse and the performance and resource consumption of end-user activities. As a base guideline, however, we advocate building initial dimensional models with the finest possible granularities.
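The effect of granularity on volume can be estimated with simple arithmetic: if a fact holds at most one row per combination of dimension members, the product of the member counts gives an upper bound on the number of rows. The member counts below (200 models, 5 plants) are hypothetical and serve only to show the order of magnitude.

```python
from math import prod


def max_fact_rows(members_per_dimension):
    """Upper bound on fact rows: one row per combination of dimension members."""
    return prod(members_per_dimension.values())


# Hypothetical member counts for one year of inventory data.
daily_grain = {"model": 200, "plant": 5, "day": 365}
monthly_grain = {"model": 200, "plant": 5, "month": 12}

daily_rows = max_fact_rows(daily_grain)      # 200 * 5 * 365
monthly_rows = max_fact_rows(monthly_grain)  # 200 * 5 * 12
```

Under these assumptions the daily grain allows roughly thirty times as many rows as the monthly grain, which is the trade-off between depth of analysis and data volume described above. Real fact tables are usually sparser than this bound.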

Candidate Dimensions: Measures require dimensions for their interpretation. For example, average quantity on hand requires that we know with which product, inventory location (manufacturing plant), and period of time (which day or month) the value is associated. Average quantity on hand for CelDial therefore is to be associated with three dimensions: Product, Manufacturing, and Time. Likewise, Total Revenue analyzed in Query Q2 requires Sales (shorthand for Sales Organization), Product, and Time as dimensions, whereas for Query Q3, the dimensions are Manufacturing, Product, and Time.

Dimensions are “the coordinates” against which measures have to be interpreted. Analyzing the query context in which candidate measures are specified results in identifying candidate dimensions for each of the measures, within the given query context. Notice that this happens “per measure” and “per query.” One of the next steps involves the consolidation of candidate measures and their dimensions across all queries.


For CelDial, four candidate dimensions can thus be identified at this time: Product, Sales Organization, Manufacturing, and Time. The associations between candidate measures and dimensions, for each of the business query contexts of the CelDial case study, are documented in Chapter 7, “The Process of Data Warehousing” on page 49 and Appendix A, “The CelDial Case Study” on page 163.
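The consolidation of per-query findings can be sketched as a set union. The mapping below restates the measure and dimension associations described for Q1, Q2, and Q3; it assumes that Total Cost has the same dimensions as Total Revenue in each query, and the function names are illustrative.

```python
from collections import defaultdict

# Candidate dimensions found "per measure" and "per query".
QUERY_CONTEXTS = {
    "Q1": {
        "Average Quantity On Hand": {"Product", "Manufacturing", "Time"},
    },
    "Q2": {
        "Total Cost": {"Sales Organization", "Product", "Time"},
        "Total Revenue": {"Sales Organization", "Product", "Time"},
    },
    "Q3": {
        "Total Cost": {"Manufacturing", "Product", "Time"},
        "Total Revenue": {"Manufacturing", "Product", "Time"},
    },
}


def consolidate(query_contexts):
    """Union the dimensions seen for each measure across all queries."""
    consolidated = defaultdict(set)
    for measures in query_contexts.values():
        for measure, dimensions in measures.items():
            consolidated[measure] |= dimensions
    return dict(consolidated)


def candidate_dimensions(query_contexts):
    """Collect every dimension that appears for any measure in any query."""
    return set().union(*consolidate(query_contexts).values())
```

Consolidating the three queries yields the four candidate dimensions named above, with Total Cost and Total Revenue each associated with all four of them.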

A more generic and usually more interesting approach for identifying candidate dimensions consists of investigating the fundamental properties of candidate measures, within the context of the business processes and business rules themselves. In this way, dimensions can be identified in a much more fundamental way. Determining candidate dimensions from the context of given business queries should be used as an aid in determining the fundamental dimensions of the problem domain.

As an example, Sales revenue is inherently linked with Sales transactions, which must, within the CelDial business context, be associated with a combination of Product, Sales Organization, Manufacturing, and Time. Because a Sales transaction also involves a customer (for CelDial, this can be either a corporate customer or an anonymous customer buying “off the counter”), we may decide to add Customer as another dimension associated with the sales revenue measure.

Candidate Facts: In principle, measures together with their dimensions make up facts of a dimensional model.

Two facts can be identified in the CelDial case: Sales and Inventory. The obvious interpretation of the fact that is manipulated in Q1 is that of an inventory record, providing the Average Quantity On Hand per product model, at a given manufacturing plant (the inventory location) during a period of time (a day or a month). For this reason, we call it the Inventory fact. Given values for all three dimensions, for instance, a model, a manufacturing plant, and a time period, the existence of a corresponding Inventory fact can be established, and, if it exists, it gives us the value of the corresponding Average Quantity On Hand. The fact manipulated in Q2 and Q3 is called Sales. It incorporates two measures, Total Cost and Total Revenue. Both measures are dependent on the same dimensions.
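Seen this way, a fact behaves like a lookup keyed by one value per dimension. The sketch below models the Inventory fact as a dictionary; the model codes, plant names, periods, and quantities are invented for illustration.

```python
# Hypothetical Inventory facts keyed by (model, plant, period).
# The stored value is the Average Quantity On Hand measure.
INVENTORY = {
    ("M100", "Plant A", "1998-01"): 120.0,
    ("M100", "Plant B", "1998-01"): 80.5,
    ("M200", "Plant A", "1998-01"): 42.0,
}


def inventory_fact_exists(model, plant, period):
    """Establish whether a fact exists for the given dimension values."""
    return (model, plant, period) in INVENTORY


def average_quantity_on_hand(model, plant, period):
    """Return the measure for the given dimension values, or None if no fact exists."""
    return INVENTORY.get((model, plant, period))
```

A Sales fact would follow the same pattern, with a key built from its own dimensions and a value holding both Total Cost and Total Revenue.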

Semantic Properties of Business-Related Facts: Facts are core elements of a dimensional model. A representative choice of facts, corresponding to a given problem domain, can be an enabler for a profound analysis of the business area the end user is dealing with, even beyond what is requested and expected (and what is consequently expressed in the end-user requirements). A choice of representative, business-related facts can also support the extension of the use of the data warehouse model to other end-user problem domains. Identifying candidate facts through the process of consolidating candidate measures and dimensions is a viable approach but may lead to facts with a “technical” nature. We recommend that candidate facts be identified with a clear business perspective.

Facts can indeed represent several fundamental “things” related to the business:

• A fact can represent a business transaction or a business event (Example: a Sale, representing what was bought, where and when the sale took place, who bought the item, how much was paid for the item sold, possible discounts involved in the sale, etc.)

Date posted: 14/08/2014, 06:22
