Figure 43 Requirements Validation
• Requirements Modeling: Validated initial models are further developed into detailed dimensional models, showing all elements of the model and their properties. Detailed dimensional models can further be extended and optimized. Many techniques in this area should be thought of as advanced modeling techniques. Not every project requires all of them to be applied. We cover some of the more commonly applied techniques and indicate what other issues may have to be addressed. The major activities that are part of requirements modeling are illustrated in Figure 44.
Figure 44 Requirements Modeling
When advanced dimensional modeling techniques such as the ones indicated in Figure 44 are used, the dimensional model tends to become complex and dense. This may cause problems for end users. To solve this, consider building two-tiered data models, in which the back-end tier comprises all of the model artifacts and the full structure of the model, whereas the front-end tier (the part of the model with which the end user deals directly) is a derivation of the entire model, made simple enough for end users to use in their data analysis activities. Two-tier data modeling is not required as such. If end users can fully understand the dimensional model, the additional work of constructing the two tiers of the model should not be done.
• Design, Construction, Validation, and Integration: Once requirements are modeled, possibly in a two-tiered dimensional model, design and construction activities are performed. These further extend and possibly even change the models produced in the previous stages of the work, to make the resulting solution implementable in the software infrastructure of the data warehouse environment. Also, a functional validation of the proposed solution must be performed together with the end users. This usually results in end users using the constructed solution for a while, giving them the opportunity to work with the information that has been made available to them in a local solution (perhaps in a data mart). In addition, the local solution may then be integrated into a more global data warehouse architecture, including the model of the data produced.
We attach particular importance to clearly separating modeling from design. Good modeling practice focuses on the essence of the problem domain. Modeling addresses the ″what″ question. Design addresses the question of ″how″ the model representing reality has to be prepared for implementation in a given computing environment.
The separation between modeling and design is of significant importance for data warehouse modeling. Unfortunately, all too often modeling issues are mixed with design issues, and, as a consequence, end users are confronted with the results of what typically are design techniques. Because modeling is not always separated from design, many data warehouse models have a technical outlook.
Neglecting a clear separation between modeling and design also results in models that are closely linked with the computing environment in general and with tools in particular. Thus it is difficult to integrate the models with others and to adapt and expand them. Keep in mind that a data warehouse and data warehouse models are very long lasting.
Each of the requirements steps in the dimensional modeling process is now discussed in more detail. The design, construction, validation, and integration steps are discussed within the context of the dimensional modeling requirements.
8.4.1 Requirements Gathering
End-user requirements suitable for a data warehouse modeling project can be classified in two major categories (see Figure 45 on page 93): process-oriented requirements, which represent the major information processing elements that end users are performing or would like to perform against the data warehouse being developed, and information-oriented requirements, which represent the major information categories and data items that end users require for their data analysis activities.
Typically, requirements can be captured that belong to either or both of these categories. The types of requirements that will be available and the degree of precision with which the requirements will be stated (or can be stated) often depend on two factors: the type of information analysis problem being considered for the data warehouse implementation project, and the ability of end users to express their information needs and the scenarios and strategies they use in their information analysis activities.
Figure 45 Categories of (Informal) End-User Requirements
8.4.1.1 Process-Oriented Requirements
Several types of process-oriented requirements may be available:
• Business objectives
Business objectives are high-level expressions of information analysis objectives, expressed in business terms. One or more business objectives can be specified for a given data warehouse implementation project.
As an example, in the CelDial case study (see Appendix A, “The CelDial Case Study” on page 163), the business objectives could be stated as:
− ″The data warehouse has to support the analysis of manufacturing costs and sales revenue of products manufactured and sold by CelDial.″
The combined business objectives can be used in the data warehouse implementation project as indicators of the scope of the project. They can also be used to identify information subject areas involved in the project and as a means to identify (usually high-level) measures of the business processes the end user is analyzing. In the CelDial example, the apparent information subject areas are Products and Sales. The objectives indicate that the global measures used in the information analysis process are ″manufacturing cost″ and ″sales revenue.″ Notice that these high-level measures ″hide″ a substantial requirement in terms of detailed data to calculate them.
• Business queries
Business queries represent the queries, hypotheses, and analytical questions that end users issue and try to resolve in the course of their information analysis activities. Just as with business objectives, business queries are expressed in business terms. You should expect that they are usually not precisely formulated. They are certainly not expressed in terms of SQL.
Examples of frequently encountered categories of business queries are:
− Existence checking queries, such as ″Has a given product been sold to a particular customer?″
− Item comparison queries, such as ″Compare the value of purchases of two customers over the last six months,″ or ″Compare the number of items sold for a given product category, per store, and per week.″
− Trend analysis queries, such as ″What is the growth in item sales for a given set of products, over the last 12 months?″
− Queries to analyze ratios, rankings, and clusters, such as ″Rank our best customers in terms of dollar sales over the last year.″
− Statistical analysis queries, such as ″Calculate the average item sales per product category, per sales region.″
For the CelDial case study, several business queries were identified. For the sake of this chapter, we selected three of them to use for illustration:
• (Q1) What is the average quantity on hand this month, for each product model in each manufacturing plant?
• (Q2) What is the total cost and revenue for each model sold today, summarized by outlet, outlet type, region, and corporate sales levels?
• (Q3) What is the total cost and revenue for each model sold today, summarized by manufacturing plant and region?
For a complete description of the CelDial case study, see Appendix A, “The CelDial Case Study” on page 163 and the description of the modeling process in Chapter 7.
• Data analysis scenarios
Data analysis scenarios are a good way of adding substance to the set of requirements being captured and analyzed. Unfortunately, they are more difficult to obtain than other processing requirements and thus are not always available for requirements analysis.
Essentially two types of data analysis scenarios are of interest for data warehouse modeling:
− Query workflow scenarios: These scenarios represent sequences of business queries that end users perform as part of their information analysis activities. Query workflow scenarios can significantly help create a better understanding of the information analysis process.
− Knowledge inference strategies: These end-user requirements acknowledge the fact that activities performed by end users of a data warehouse have expert system characteristics. As with query workflow scenarios, these strategies can provide more understanding of the activities performed by end users. The simplest forms of knowledge inference strategies are those that show how users roll up and drill down along dimension hierarchies.
Whether or not these end-user requirements will be available depends on the capabilities of end users to express how they get to an answer or find a solution for their problems, as well as on the type of data warehouse application that is being considered for the modeling project.
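The simplest knowledge inference strategy mentioned above, rolling up and drilling down along a dimension hierarchy, can be illustrated with a minimal Python sketch. The rows, outlet names, and region names are hypothetical and are not taken from the CelDial case study; the two-level outlet-to-region hierarchy is an assumption made for illustration only.

```python
from collections import defaultdict

# Hypothetical sales rows at the finest grain: (outlet, region, units_sold).
SALES = [
    ("Outlet A", "North", 10),
    ("Outlet B", "North", 5),
    ("Outlet C", "South", 7),
]

# Position of each hierarchy level inside a row (assumed layout).
LEVEL_INDEX = {"outlet": 0, "region": 1}


def roll_up(rows, level):
    """Aggregate units sold at the requested hierarchy level."""
    index = LEVEL_INDEX[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[2]
    return dict(totals)


def drill_down(rows, region):
    """Show the outlet-level detail behind one region total."""
    detail_rows = [row for row in rows if row[1] == region]
    return roll_up(detail_rows, "outlet")


if __name__ == "__main__":
    print(roll_up(SALES, "region"))
    print(drill_down(SALES, "North"))
```

A user who sees an unexpected regional total would typically drill down to the outlets behind it; capturing such sequences is what a query workflow scenario records.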
8.4.1.2 Information-Oriented Requirements
Information-oriented requirements capture an initial perception of the kinds of information end users use in their information analysis activities. There are different categories of information-oriented requirements that may be of interest for the requirements analysis and data warehouse modeling process:
• Information subject areas
Information subject areas are high-level categories of business information. Information subject areas usually are used to build the high-level enterprise data model. When available, information subject areas indicate the scope of the data warehouse project. They also contribute to the requirements analyst's ability to relate the data warehouse project with other (already developed) parts of the data warehouse or to data marts.
For the CelDial case study, the information subject areas of interest are: Products, Sales (including Sales Organization), and Manufacturing (including Inventories). Whether or not the Customers information subject area is present in the scope of the CelDial case study is debatable. Although customer sales are involved, there is no apparent substantial requirement that indicates that the Customers subject area should also be included in the project. In addition, if retail outlets within the Sales Organization also hold inventories of products they may sell, then most probably Inventories should become an information subject area in its own right rather than be incorporated in Manufacturing. Debates such as these are typical when trying to establish the information subject areas involved in a data warehouse development project.
• High-level data models, ER and/or dimensional models
Several data models may be available and could be used to further specify or support end-user requirements. They can be available as high-level enterprise data models, ER models, or dimensional models. The ER models may be collected by reengineering and integrating source data models. Dimensional models may be the result of previous dimensional data warehouse modeling projects.
Figure 46 on page 96 illustrates the relationships among the various data models in the data warehouse modeling process.
In user-driven modeling approaches, source data models are used as aids in the process of fully developing the data warehouse model.
Source data models may have to be constructed by using reverse engineering techniques that develop ER models from existing source databases. Several of these models may first have to be integrated into a global model representing the sources in a logically integrated way.
Figure 46 Data Models in the Data Warehouse Modeling Process
8.4.2 Requirements Analysis
Requirements analysis techniques are used to build an initial dimensional model that represents the end-user requirements captured previously in an informal way. The requirements analysis produces a schematic representation of a model that information analysts can interpret directly. The results of requirements analysis will be the primary input for data warehouse modeling once they have passed the requirements validation phase.
The scope of work of requirements analysis can be summarized as follows:
• Determine candidate measures, facts, and dimensions, including the dimension hierarchies
• Determine granularities
• Build the initial dimensional model
• Establish the business directory for the elements in the model
Figure 47 on page 97 summarizes the context in which initial dimensional modeling is performed and the kinds of deliverables that are produced.
Figure 47 Overview of Initial Dimensional Modeling
Figure 48 illustrates a notation technique that can be used to schematically document the initial dimensional model. It shows facts (or fact tables, if you prefer) with the measures they represent and the dimension hierarchies or aggregation paths associated with the facts. Dimension hierarchies are represented as arrows showing intermediary aggregation points. The dimensions may include alternate or parallel dimension hierarchies. Dimension hierarchies are given names drawn from the problem domain of the information analyst. These initial dimensional models also formally state the lowest level of detail (the granularity) of each dimension. An initial dimensional model consists of one or more such schemas.
Figure 48 Notation Technique for Schematically Documenting Initial Dimensional Models
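The information carried by such a schema (facts, their measures, named dimension hierarchies including parallel ones, and the granularity of each dimension) can also be written down as a small data structure. The following Python sketch is only one possible encoding and is not the notation of Figure 48. The class names, hierarchy names, and the ordering of levels are assumptions; the level names for Sales Organization are borrowed from the wording of query Q2.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    # Maps a hierarchy name to its levels, ordered from finest to coarsest.
    hierarchies: dict

    def granularity(self):
        """Return the lowest level of detail shared by all hierarchies."""
        finest = {levels[0] for levels in self.hierarchies.values()}
        if len(finest) != 1:
            raise ValueError(f"{self.name}: hierarchies disagree on the finest level")
        return finest.pop()


@dataclass
class Fact:
    name: str
    measures: list
    dimensions: list

    def granularity(self):
        """Combine the finest level of every associated dimension."""
        return {d.name: d.granularity() for d in self.dimensions}


# Two parallel hierarchies that both start at the outlet level (assumed).
sales_org = Dimension(
    "Sales Organization",
    {
        "geography": ["outlet", "region", "corporate"],
        "outlet type": ["outlet", "outlet type"],
    },
)
product = Dimension("Product", {"product": ["model"]})
time_dim = Dimension("Time", {"calendar": ["day", "month"]})

sales = Fact("Sales", ["Total Cost", "Total Revenue"], [product, sales_org, time_dim])

if __name__ == "__main__":
    print(sales.granularity())
```

Requiring every hierarchy of a dimension to start at the same finest level is a design choice of this sketch; it makes the granularity of the dimension unambiguous.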
8.4.2.1 Determining Candidate Measures, Dimensions, and Facts
To build an initial dimensional model, the following base elements have to be identified and arranged in the model:
• Measures
• Dimensions and dimension hierarchies
• Facts
Several approaches can be used to determine the base elements of a dimensional model. In reality, analysts combine the use of several of the approaches to find appropriate candidate elements for the model and integrate their findings in an initial dimensional model, which then combines several different views on reality. Because the requirements analysis process is nonlinear and inherent relationships exist between the candidate elements, it does not really matter which approach is used, as long as the process is performed with a clear perspective on the business problem domain. The approaches essentially differ in the sequence in which they identify the modeling elements. Some of the most common approaches are:
• Determine measures first, then dimensions associated with measures, then facts
This approach could be called the query-oriented approach because it is the approach that flows naturally when the requirements analyst picks up the end-user queries as the first source of inspiration. Chapter 7, “The Process of Data Warehousing” on page 49 and the case study in Appendix A, “The CelDial Case Study” on page 163 were developed by using this approach.
• Determine facts, then dimensions, then measures
This approach is a business-oriented approach. Typically, it tries to determine first the fundamental elements of the business problem domain (facts and measures) and only then are the details required by the end users developed in it. This chapter shows how this approach can be used to compensate for the strict end-user-oriented view when trying to develop more fundamental and longer lasting models for the problem domain.
• Determine dimensions, then measures, then facts
This approach frequently is used when the source data models are being used as the basis for determining candidate elements for the initial dimensional model. We refer to it as the data-source-oriented approach.
Notice that facts, dimensions, and measures determined during this stage are candidate elements only. Some of them may later disappear from the model, be replaced by or merged with others, be split in two or more, or even change their ″nature.″
Candidate Measures: Candidate measures can be recognized by analyzing the business queries. Candidate measures essentially correspond to data items that the users use in their queries to measure the performance or behavior of a business process or a business object.
For the CelDial project, the following candidate measures are present in Q1, Q2, and Q3:
• Average quantity on hand
• Total Cost
• Total Revenue
For a complete list of measures, refer to Chapter 7, “The Process of Data Warehousing” on page 49 and Appendix A, “The CelDial Case Study” on page 163.
Determining candidate measures requires smart, not mechanical, analysis of the business queries. Good candidate measures are numeric and are usually involved in aggregation calculations, but not every numeric attribute is a candidate measure. Also, candidate measures identified from the available queries may have peculiar properties that keep them from being ″good″ measures. We investigate some properties of measures later in this chapter and indicate how they may affect the model.
Measure Granularities within a Dimensional Model: The granularity of a measure can be defined intuitively as the lowest level of detail used for recording the measure in the dimensional model. For instance, Average Quantity On Hand can be considered to be present in the model per day or per month. Average Quantity On Hand could also be considered at the level of detail of product or perhaps at product category level or packaging unit.
Measures are usually associated with several dimensions. The granularity of a measure is determined by the combination of the recording details of all of its dimensions.
Different measures can have identical granularities. Because both Total Cost and Total Revenue seem to be associated with sales transactions in the CelDial case, they have identical granularities. We show next that measures with identical granularities are candidates for being part of another element of the dimensional model: the fact.
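Treating the granularity of a measure as the combination of the recording levels of its dimensions makes it straightforward to find measures that share a granularity. The Python sketch below illustrates the idea; the recording levels chosen for each dimension (model, plant, outlet, day) are assumptions, not decisions taken from the CelDial case study.

```python
from collections import defaultdict

# Granularity of each candidate measure as (dimension, recording level) pairs.
# The levels are assumed for illustration.
SALES_GRAIN = (
    ("Manufacturing", "plant"),
    ("Product", "model"),
    ("Sales Organization", "outlet"),
    ("Time", "day"),
)
MEASURE_GRANULARITY = {
    "Average Quantity On Hand": (
        ("Manufacturing", "plant"),
        ("Product", "model"),
        ("Time", "day"),
    ),
    "Total Cost": SALES_GRAIN,
    "Total Revenue": SALES_GRAIN,
}


def group_by_granularity(measure_granularity):
    """Group measures whose granularities are identical."""
    groups = defaultdict(list)
    for measure, granularity in measure_granularity.items():
        groups[granularity].append(measure)
    # Sort for a stable, readable result.
    return sorted(sorted(measures) for measures in groups.values())


if __name__ == "__main__":
    for measures in group_by_granularity(MEASURE_GRANULARITY):
        print(measures)
```

Each resulting group is a candidate for a single fact, which is still subject to the business-oriented review recommended in this chapter.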
Determining the right granularities of measures in the data warehouse model is of extreme importance. It basically determines the depth at which end users will be able to perform information analysis using the data warehouse or the data mart. For data warehouses, the granularity situation is even more complex. Fine granular recording of data in the data warehouse model supports fine detailed analysis of information in the warehouse, but it also increases the volume of data that will be recorded in the data warehouse and therefore has great impact on the size of the data warehouse and the performance and resource consumption of end-user activities. As a base guideline, however, we advocate building initial dimensional models with the finest possible granularities.
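The effect of granularity on data volume can be approximated with simple arithmetic: an upper bound on the number of fact rows is the product of the number of members of each dimension at the chosen recording level. The Python sketch below uses entirely hypothetical cardinalities; none of the numbers come from the CelDial case study, and real fact tables are usually sparser than this bound.

```python
from math import prod

# Hypothetical number of members per dimension level.
CARDINALITY = {
    "Product": {"model": 200, "category": 10},
    "Manufacturing": {"plant": 5},
    "Time": {"day": 365, "month": 12},
}


def max_fact_rows(levels):
    """Upper bound on fact rows for one year at the given recording levels."""
    return prod(CARDINALITY[dimension][level] for dimension, level in levels.items())


if __name__ == "__main__":
    daily = {"Product": "model", "Manufacturing": "plant", "Time": "day"}
    monthly = {"Product": "model", "Manufacturing": "plant", "Time": "month"}
    print(max_fact_rows(daily), max_fact_rows(monthly))
```

Under these assumed numbers, recording per day rather than per month multiplies the upper bound by roughly thirty, which is the trade-off between analysis depth and volume described above.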
Candidate Dimensions: Measures require dimensions for their interpretation. For example, average quantity on hand requires that we know with which product, inventory location (manufacturing plant), and period of time (which day or month) the value is associated. Average quantity on hand for CelDial therefore is to be associated with three dimensions: Product, Manufacturing, and Time. Likewise, Total Revenue analyzed in Query Q2 requires Sales (shorthand for Sales Organization), Product, and Time as dimensions, whereas for Query Q3, the dimensions are Manufacturing, Product, and Time.
Dimensions are ″the coordinates″ against which measures have to be interpreted. Analyzing the query context in which candidate measures are specified results in identifying candidate dimensions for each of the measures, within the given query context. Notice that this happens ″per measure″ and ″per query.″ One of the next steps involves the consolidation of candidate measures and their dimensions across all queries.
For CelDial, four candidate dimensions can thus be identified at this time: Product, Sales Organization, Manufacturing, and Time. The associations between candidate measures and dimensions, for each of the business query contexts of the CelDial case study, are documented in Chapter 7, “The Process of Data Warehousing” on page 49 and Appendix A, “The CelDial Case Study” on page 163.
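The ″per measure″ and ″per query″ associations and their consolidation across queries can be sketched in a few lines of Python. The associations below restate only what this section says about Q1, Q2, and Q3; the dictionary layout and function names are assumptions made for illustration.

```python
# Candidate dimensions per measure, recorded separately for each query context.
QUERY_CONTEXTS = {
    "Q1": {"Average Quantity On Hand": {"Product", "Manufacturing", "Time"}},
    "Q2": {"Total Revenue": {"Sales Organization", "Product", "Time"}},
    "Q3": {"Total Revenue": {"Manufacturing", "Product", "Time"}},
}


def consolidate(query_contexts):
    """Merge the dimensions found for each measure across all queries."""
    consolidated = {}
    for measures in query_contexts.values():
        for measure, dimensions in measures.items():
            consolidated.setdefault(measure, set()).update(dimensions)
    return consolidated


def candidate_dimensions(query_contexts):
    """List every dimension that appears in at least one query context."""
    found = set()
    for dimensions in consolidate(query_contexts).values():
        found |= dimensions
    return sorted(found)


if __name__ == "__main__":
    print(consolidate(QUERY_CONTEXTS))
    print(candidate_dimensions(QUERY_CONTEXTS))
```

With these inputs the consolidation yields the four candidate dimensions named above, and Total Revenue ends up associated with all four of them.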
A more generic and usually more interesting approach for identifying candidate dimensions consists of investigating the fundamental properties of candidate measures, within the context of the business processes and business rules themselves. In this way, dimensions can be identified in a much more fundamental way. Determining candidate dimensions from the context of given business queries should be used as an aid in determining the fundamental dimensions of the problem domain.
As an example, Sales revenue is inherently linked with Sales transactions, which must, within the CelDial business context, be associated with a combination of Product, Sales Organization, Manufacturing, and Time. Because a Sales transaction also involves a customer (for CelDial, this can be either a corporate customer or an anonymous customer buying ″off the counter″), we may decide to add Customer as another dimension associated with the sales revenue measure.
Candidate Facts: In principle, measures together with their dimensions make up facts of a dimensional model.
Two facts can be identified in the CelDial case: Sales and Inventory. The obvious interpretation of the fact that is manipulated in Q1 is that of an inventory record, providing the Average Quantity On Hand per product model, at a given manufacturing plant (the inventory location) during a period of time (a day or a month). For this reason, we call it the Inventory fact. Given values for all three dimensions, for instance, a model, a manufacturing plant, and a time period, the existence of a corresponding Inventory fact can be established, and, if it exists, it gives us the value of the corresponding Average Quantity On Hand. The fact manipulated in Q2 and Q3 is called Sales. It incorporates two measures, Total Cost and Total Revenue. Both measures are dependent on the same dimensions.
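The idea that values for all three dimensions identify at most one Inventory fact can be sketched with an in-memory SQLite table whose primary key is the combination of the dimension values. This is an illustration under stated assumptions, not the CelDial design: the table and column names, the sample rows, and the choice to store a daily quantity and average it per month for Q1 are all hypothetical.

```python
import sqlite3


def build_inventory_fact():
    """Create a small Inventory fact table keyed by its three dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE inventory_fact (
               model TEXT, plant TEXT, day TEXT,
               quantity_on_hand REAL,
               PRIMARY KEY (model, plant, day))"""
    )
    conn.executemany(
        "INSERT INTO inventory_fact VALUES (?, ?, ?, ?)",
        [
            ("M1", "Plant X", "2024-05-01", 100),
            ("M1", "Plant X", "2024-05-02", 80),
            ("M1", "Plant Y", "2024-05-01", 40),
        ],
    )
    return conn


def find_fact(conn, model, plant, day):
    """Return the measure value if the fact exists, otherwise None."""
    row = conn.execute(
        "SELECT quantity_on_hand FROM inventory_fact "
        "WHERE model = ? AND plant = ? AND day = ?",
        (model, plant, day),
    ).fetchone()
    return None if row is None else row[0]


def q1(conn, month):
    """Q1 sketch: average quantity on hand per model and plant for a month."""
    return conn.execute(
        "SELECT model, plant, AVG(quantity_on_hand) FROM inventory_fact "
        "WHERE substr(day, 1, 7) = ? GROUP BY model, plant ORDER BY model, plant",
        (month,),
    ).fetchall()


if __name__ == "__main__":
    connection = build_inventory_fact()
    print(find_fact(connection, "M1", "Plant X", "2024-05-01"))
    print(q1(connection, "2024-05"))
```

A Sales fact could be sketched the same way, with Total Cost and Total Revenue as two measure columns sharing one key.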
Semantic Properties of Business-Related Facts: Facts are core elements of a dimensional model. A representative choice of facts, corresponding to a given problem domain, can be an enabler for a profound analysis of the business area the end user is dealing with, even beyond what is requested and expected (and what is consequently expressed in the end-user requirements). A choice of representative, business-related facts can also support the extension of the use of the data warehouse model to other end-user problem domains. Identifying candidate facts through the process of consolidating candidate measures and dimensions is a viable approach but may lead to facts with a ″technical″ nature. We recommend that candidate facts be identified with a clear business perspective.
Facts can indeed represent several fundamental ″things″ related to the business:
• A fact can represent a business transaction or a business event (Example: a Sale, representing what was bought, where and when the sale took place, who bought the item, how much was paid for the item sold, possible discounts involved in the sale, etc.)