Figure 33. Dimensional and ER Views of Product-Related Data
The reason for this difference is the different role the model plays in the data warehouse. To the user, the data must look like the data warehouse model. In the operational world, a user does not generally use the model to access the data. The operational model is only used as a tool to capture requirements, not to access data.
Data warehouse design also has a different focus from operational design. Design in an operational system is concerned with creating a database that will perform well based on a well-defined set of access paths. Data warehouse design is concerned with creating a process that will retrieve and transform operational data into useful and timely warehouse data.
This is not to imply that there is no concern for performance in a data warehouse. On the contrary, due to the amount of data typically present in a data warehouse, performance is an essential consideration. However, performance considerations cannot be handled in a data warehouse in the same way they are handled in operational systems. Access paths have already been built into the model due to the nature of dimensional modeling. The unpredictable nature of data warehouse queries limits how much further you can design for performance. After implementation, additional tuning may be possible based on monitoring usage patterns.
One area where design can impact performance is renormalizing, or snowflaking, dimensions. This decision should be made based on how the specific query tools you choose will access the dimensions. Some tools enable the user to view the contents of a dimension more efficiently if it is snowflaked, while for other tools the opposite is true. As well, the choice to snowflake will have a tool-dependent impact on the join techniques used to relate a set of dimensions to a fact. Regardless of the design decision made, the model should remain the same: from the user perspective, each dimension should have a single consolidated image.
7.5.2 Identifying the Sources
Once the validated portion of the model passes on to the design stage, the first step is to identify the sources of the data that will be used to load the model. These sources should then be mapped to the target warehouse data model. Mapping should be done for each dimension, dimension attribute, fact, and measure. For dimensions and facts, only the source entities (for example, relational tables, flat files, IMS DBDs and segments) need be documented. For dimension attributes and measures, along with the source entities, the specific source attributes (such as columns and fields) must be documented.
Conversion and derivation algorithms must also be included in the metadata. At the dimension attribute and measure level, this includes data type conversion, algorithms for merging and splitting source attributes, calculations that must be performed, domain conversions, and source selection logic.
A domain conversion is the changing of the domain in the source system to a new set of values in the target. For example, in the operational system you may use codes for gender, such as 1=female and 2=male. You may want to convert this to female and male in the target system. Such a conversion should be documented in the metadata.
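A domain conversion of this kind reduces to a lookup during the load process. The following is a minimal sketch only; the code values and the fallback behavior are assumptions based on the gender example above, not part of the case study.

    # Hypothetical domain conversion: map operational gender codes to the
    # descriptive values stored in the target dimension.
    GENDER_DOMAIN = {
        "1": "female",
        "2": "male",
    }

    def convert_gender(source_code, default="unknown"):
        """Return the target-domain value for an operational gender code.

        Unrecognized codes fall back to a default so that the error handling
        step can decide whether to reject or repair the row.
        """
        return GENDER_DOMAIN.get(str(source_code).strip(), default)

    # convert_gender("1") -> "female"; convert_gender("9") -> "unknown"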
In some cases you may choose to load your target attribute from different source attributes based on certain conditions. Suppose you have a distributed sales organization and each location has its own customer file. However, your accounts receivable system is centralized. If you try to relate customer payments to sales data, you will likely have to pull some customer data from different locations based on where the customer does business. Source selection logic such as this must be included in the metadata.
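Source selection logic of this kind can be documented, and implemented, as a simple rule keyed on the condition that drives the choice. The sketch below assumes location codes and file paths that are purely illustrative.

    # Hypothetical source selection: pick the customer file to read based on
    # where the customer does business.
    CUSTOMER_SOURCES = {
        "EAST": "/extracts/east/customer.csv",
        "WEST": "/extracts/west/customer.csv",
        "CENTRAL": "/extracts/central/customer.csv",
    }

    def select_customer_source(business_location):
        """Return the customer file that supplies data for this customer.

        The real rule, its inputs, and the list of sources belong in the
        source-to-target metadata.
        """
        try:
            return CUSTOMER_SOURCES[business_location]
        except KeyError:
            raise ValueError("No customer source defined for " + repr(business_location))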
At the fact and dimension level, conversion and derivation metadata includes the logic for merging and splitting rows of data in the source, the rules for joining multiple sources, and the logic followed to determine which of multiple sources will be used.
Identifying sources can also cause changes to your model. This will occur when you cannot find a valid source. Two possibilities exist. First, there simply is no source that comes close to meeting the user's requirements. This should be very rare, but it is possible. If only a portion of the model is affected, remove that component and continue designing the remainder. Whatever portion of the model cannot be sourced must return to the requirements stage to redefine the need in a manner that can be met.
A more likely scenario is that there will be a source that comes close but is not exactly what the user had in mind. In the case study we have a product description but no model description. The model code is available to select individual models for analysis, but it is hardly user friendly. However, rather than not meet the requirement to perform analysis by model, model code will be used. If user knowledge of source systems is high, this may occur during the modeling stage, but often it occurs during design.
All of the metadata regarding data sources must be documented in the data warehouse model (see Figure 34 on page 77).
7.5.3 Cleaning the Data
Data cleaning has three basic components: validation of data, data enhancement, and error handling. Validation of data consists of a number of checks, including:
• Valid values for an attribute (domain check)
• Attribute valid in context of the rest of the row
• Attribute valid in context of related rows in this or other tables
• Relationship between rows in this and other tables valid (foreign key check)
This is not an exhaustive list. It is only meant to highlight the basic concepts of data validation.
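As a rough illustration, the domain and foreign key checks listed above might look like the following sketch; the attribute names, the valid gender codes, and the set of known dimension keys are assumptions made for the example.

    # Hypothetical validation checks applied to one incoming source row.
    VALID_GENDER_CODES = {"1", "2"}  # assumed domain for the gender attribute

    def validate_row(row, known_customer_keys):
        """Return a list of validation errors found in one source row.

        known_customer_keys stands in for the keys already present in the
        customer dimension (the foreign key check).
        """
        errors = []

        # Domain check: the attribute value must belong to its valid set.
        if row.get("gender") not in VALID_GENDER_CODES:
            errors.append("gender outside valid domain")

        # Attribute valid in the context of the rest of the row.
        if row.get("ship_date") and row.get("order_date") \
                and row["ship_date"] < row["order_date"]:
            errors.append("ship_date earlier than order_date")

        # Foreign key check: the related dimension row must exist.
        if row.get("customer_key") not in known_customer_keys:
            errors.append("customer_key has no matching dimension row")

        return errors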
Data enhancement is the process of cleaning valid data to make it more meaningful. The most common example is name and address information. Often we store name and address information for customers in multiple locations. Over time, these tend to become unsynchronized. Merging data for the customer is often difficult because the data we use to match the different images of the customer no longer matches. Data enhancement resynchronizes this data.
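A small part of data enhancement is simply standardizing values before attempting a match. The sketch below shows only a trivial normalization of an address line; real enhancement typically relies on specialized matching software and reference data, and the abbreviation list here is an assumption.

    import re

    # Hypothetical address standardization applied before matching customer
    # records held in different source systems.
    ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

    def normalize_address(address):
        """Return an uppercased, punctuation-free, abbreviation-standardized
        form of an address line for comparing two images of a customer."""
        text = re.sub(r"[^\w\s]", "", address.upper())
        words = [ABBREVIATIONS.get(word, word) for word in text.split()]
        return " ".join(words)

    # normalize_address("123 Main Street.") == normalize_address("123 MAIN ST")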
Error handling is a process that determines what to do with less than perfect data. Data may be rejected, stored for repair in a holding area, or passed on with its imperfections to the data warehouse. From a data model perspective, we only care about the data that is passed on to the data warehouse. The metadata for imperfect data should include statements about the data quality (types of errors) to be expected and the data accuracy (frequency of errors) of the data (see Figure 34 on page 77).
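The three outcomes described above amount to a routing decision taken during the load. A minimal sketch, in which the rule for what counts as a fatal error is an assumption:

    # Hypothetical error handling: decide what happens to a row based on the
    # validation errors it carries.
    def route_row(errors, fatal=("customer_key has no matching dimension row",)):
        """Return 'load', 'hold', or 'reject' for a validated row.

        Rows with no errors are loaded, rows with a fatal error are rejected,
        and everything else goes to a holding area for repair. The metadata
        should record how often each path is taken (data accuracy).
        """
        if not errors:
            return "load"
        if any(error in fatal for error in errors):
            return "reject"
        return "hold"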
7.5.4 Transforming the Data
Data transformation is a critical step in any data warehouse development effort. Two major decisions must be made at this point: how to capture the source data, and a method for assigning keys to the target data. Along with these two decisions, you must generate a plan documenting the steps to get the data from source to target. From a modeling perspective, this is simply adding more metadata.
7.5.4.1 Capturing the Source Data
The first step in transformation is capturing the source data. Initially, a full copy of the data is required. Once this initial copy has been loaded, a means of maintaining it must be devised. There are four primary methods of capturing data:
• Full refresh
• Log capture
• Time-stamped source
• Change transaction files
A full refresh, as the name implies, is simply a full copy of the data to be moved into the target data warehouse. This copy may replace what is in the data warehouse, add a complete new copy at the new point in time, or be compared to the target data to produce a record of changes in the target.
The other three methods focus on capturing only what has changed in the source data. Log capture extracts relevant changes from the DBMS's log files. If source data has been time stamped, the extract process can select only data that has changed since the previous extract was run. Some systems will produce a file of changes that have been made in the source; an extract can use this in the same manner it would use a log file.
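For a time-stamped source, the incremental extract reduces to a filter on the change time stamp. A minimal sketch, assuming each source row carries a last_updated value and that the time of the previous extract run is kept in the extract metadata:

    def extract_changed_rows(source_rows, last_extract_time):
        """Yield only the source rows changed since the previous extract run.

        source_rows is any iterable of dicts carrying a 'last_updated'
        datetime; last_extract_time comes from the extract metadata and is
        advanced after each successful run.
        """
        for row in source_rows:
            if row["last_updated"] > last_extract_time:
                yield row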
From a modeling perspective, the method used should be documented in the metadata for the model. As well, the schedule of the extract should be documented at this point. Later, in the production environment, actual extract statistics will be added to this metadata (see Figure 34 on page 77).
7.5.4.2 Generating Keys
Key selection in the data warehouse is a difficult issue. It involves a trade-off between performance and management.
Key selection applies mainly to dimensions. The keys chosen for the dimensions must be the foreign keys of the fact.
There are two choices for dimension keys. Either an arbitrary key can be assigned, or identifiers from the operational system can be used. An arbitrary key is usually just a sequential number, where the next available number is assigned when a new key is required.
To uniquely represent a dimension using identifiers from an operational system usually requires a composite key. A composite key is a key made up of multiple columns. An arbitrary key is one column and is almost always smaller than an operationally derived key. Therefore, arbitrary keys will generally perform joins faster.
Generation of an arbitrary key is slightly more complex. If you get your key from the operational system, there is no need to determine the next available key. The exception to this is where history of a dimension is kept. In this case, when you use identifiers from an operational system, you must add an additional key because keys must be unique. One option is an arbitrary sequence number. Another is to add begin and end time stamps to the dimension key. Both of these options also work for an arbitrary key, but it is simpler just to generate a new arbitrary key when an entry in a dimension changes.
Once the history issue is considered, it certainly seems as if an arbitrary key is the way to go. However, the last factor in key selection is its impact on the fact table. When a fact is created, the key from each dimension must be assigned to it. If operationally derived keys, with time stamps for history, are used in the dimensions, there is no additional work when a fact is created; the linkage happens automatically. With arbitrary keys, or arbitrary history identifiers, a key must be assigned to a fact at the time the fact is created.
There are two ways to assign keys. One is to maintain a translation table of operational and data warehouse keys. The other is to store the operational keys and, if necessary, time stamps, as attribute data on the dimension.
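A minimal sketch of the translation table approach follows; the in-memory mapping is an assumption made for brevity, as in practice the translation table would itself be stored in the database and maintained by the dimension load.

    import itertools

    # Hypothetical surrogate key assignment using a translation table that
    # maps an operational customer number to the warehouse dimension key.
    class KeyTranslator:
        def __init__(self):
            self._next_key = itertools.count(1)  # arbitrary sequential keys
            self._translation = {}               # operational key -> warehouse key

        def warehouse_key(self, operational_key):
            """Return the warehouse key for an operational key, generating the
            next sequential key if this is a new dimension entry."""
            if operational_key not in self._translation:
                self._translation[operational_key] = next(self._next_key)
            return self._translation[operational_key]

    # When a fact row is created, the operational customer number on the
    # transaction is translated before the fact is written:
    #   fact["customer_key"] = translator.warehouse_key(txn["customer_no"])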
The above discussion also applies to degenerate keys on the fact. The only difference is that there is no need to join on a degenerate key, thus diminishing the performance impact of an arbitrary key. The issue is more likely to come down to whether a user may need to know the value of a degenerate key for analysis purposes or whether it is simply recorded to create the desired level of granularity.
The choice, then, is between the better performance of an arbitrary key and the easier maintenance of an operational key. How much better the performance is, and how much more the maintenance costs, must be evaluated in your own organization.
Regardless of the choice you make, the keys, and the process that generates them, must be documented in the metadata (see Figure 34 on page 77). This data is necessary for the technical staff who administer and maintain the data warehouse. If the tools you use do not hide join processing, the user may need to understand this also. However, it is not recommended that a user be required to have this knowledge.
7.5.4.3 Getting from Source to Target
It is often the case that getting from source to target is a multiple-step process. Rarely can it be completed in one step. Among the many reasons for creating a multiple-step process to get from source to target are these:
• Sources to be merged are in different locations
• Not all data can be merged at once as some tables require outer joins
• Sources are stored on multiple incompatible technologies
• Complex summarization and derivation must take place
The point is simply that the process must be documented. The metadata for a model must include not only the steps of the process, but the contents of each step, as well as the reasons for it. It should look something like this:
1 Step 1 - Get Product Changes
   Objective of step
      Create a table containing rows where product information has changed.
   Inputs to step
      Change transaction log for Products and Models, Product Component table, Component table, and the Product dimension table.
   Transformations performed
      For each change record, read the related product component and component rows. For each product model, the cost of each component is multiplied by the number of components used to manufacture the model. The sum of the results for all components that make up the model is the cost of that model. A key is generated for each record, consisting of a sequential number starting with the next number after the highest used in the product dimension table. Write a record to the output table containing the generated key, the product and model keys, the current date, product description, model code, unit cost, suggested wholesale price, suggested retail price, and eligible for volume discount code.
2 Step 2 - Get Component Changes
   Transformations performed
      For each change record, check that the product and model exist in the work table. If they do, the component change is already recorded, so ignore the change record. If not, read the product and model tables for related information. For each product model, the cost of each component is multiplied by the number of components used to manufacture the model. The sum of the results for all components that make up the model is the cost of that model. A key is generated for each record, consisting of a sequential number starting with the next number after the highest used in the product dimension table. Add a record to the work table containing the generated key, the product and model keys, the current date, product description, model code, unit cost, suggested wholesale price, suggested retail price, and eligible for volume discount code.
   Outputs of step
      A work table containing additional new rows for the product dimension where there has been a change in the product component table or the component table.
3 Step 3 - Update Product Dimension
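The unit cost derivation described in Steps 1 and 2 is a sum, over the components used in a model, of component cost times quantity. A minimal sketch with assumed field names:

    def model_unit_cost(product_components, component_costs):
        """Compute the unit cost of one model as described in Steps 1 and 2.

        product_components lists the components used to manufacture the model
        as (component_id, quantity) pairs; component_costs maps a component_id
        to its cost. The names are assumptions made for the sketch.
        """
        return sum(quantity * component_costs[component_id]
                   for component_id, quantity in product_components)

    # Two of component "C1" at 3.50 plus one of "C2" at 10.00:
    # model_unit_cost([("C1", 2), ("C2", 1)], {"C1": 3.50, "C2": 10.00}) == 17.00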
7.5.5 Designing Subsidiary Targets
Subsidiary targets are targets derived from the originally designed fact and dimension tables. The reason for developing such targets is performance. If, for example, a user frequently runs a query that sums across one dimension and scans the entire fact table, it is likely that a subsidiary target should be created with the dimension removed and measures summed to produce a table with fewer rows for this query.
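Building such a subsidiary fact is essentially a group-by over the remaining dimension keys with the measures summed. The sketch below uses pandas only for brevity, and the column names are assumptions, not part of the case study.

    import pandas as pd

    def build_subsidiary_fact(fact, dropped_dimension_key, measure_columns):
        """Produce a subsidiary fact table with one dimension removed.

        The measures are summed over the remaining dimension keys, giving a
        table with fewer rows for queries that never constrain on the
        dropped dimension.
        """
        remaining_keys = [column for column in fact.columns
                          if column != dropped_dimension_key
                          and column not in measure_columns]
        return fact.groupby(remaining_keys, as_index=False)[measure_columns].sum()

    # Example with assumed columns: drop the store dimension from a sales fact.
    #   summary = build_subsidiary_fact(sales, "store_key", ["quantity", "revenue"])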
Creating a subsidiary dimension should only be done if the original dimension will not join properly with a subsidiary fact. This is likely to be a tool-dependent decision.
Because this is a performance issue, rules should be defined for when a subsidiary target will be considered. Consider a maximum allowable time for a query before an aggregate is deemed necessary. You may also create a sliding scale of the time it takes to run a query versus the frequency of the query.
Metadata for subsidiary targets should be the same as for the original facts and dimensions, with only the aggregates themselves being different. However, if your suite of tools can hide the subsidiary targets from the user and select them when appropriate based on the query, the metadata should be made visible only for technical purposes. The metadata should contain the reasons for creating the subsidiary target (see Figure 34 on page 77).
Often it is not possible to predict which subsidiary targets will be necessary at the design stage. These targets should not be created unless there is a clear justification. Rather than commit significant resources to them at this time, consider creating them as a result of monitoring efforts in the post-implementation environment.
Figure 34. The Complete Metadata Diagram for the Data Warehouse
7.5.6 Validating the Design
During the design stage you will create a test version of the production environment. When it comes time to validate the design with the user, hands-on testing is the best approach. Let the user try to answer questions through manipulation of the test target. Document any areas where the test target cannot provide the data requested.
Aside from testing, review with the user any additions and changes to the model that have resulted from the design phase to ensure they are understandable. Similar to the model validation step, pass what works on to the implementation phase. What does not work should be returned to the requirements phase for clarification and reentry into modeling.
7.5.7 What About Data Mining?
Decisions in data warehouse modeling would typically not be affected by a decision to support data mining. However, the discussion on data mining, as one of the key data analysis techniques, is presented here for your information and completeness.
As stated previously, data mining is about creating hypotheses, not testing them. It is important to make this distinction. If you are really testing hypotheses, the dimensional model will meet your requirements. It cannot, however, safely create a hypothesis. The reason for this is that by defining the dimensions of the data and organizing dimensions and measures into facts, you are building the hypotheses based on known rules and relationships. Once done, you have created a paradigm. To create a hypothesis, you must be able to work outside the paradigm, searching for patterns hidden in the unknown depths of the data.

There are, in general, four steps in the process of making data available for mining: data scoping, data selection, data cleaning, and data transformation. In some cases, a fifth step, data summarization, may be necessary.
7.5.7.1 Data Scoping
Even within the scope of your data warehouse project, when mining data you want to define a data scope, or possibly multiple data scopes. Because patterns are based on various forms of statistical analysis, you must define a scope in which a statistically significant pattern is likely to emerge. For example, buying patterns that show different products being purchased together may differ greatly in different geographical locations. To simply lump all of the data together may hide all of the patterns that exist in each location. Of course, by imposing such a scope you are defining some, though not all, of the business rules. It is therefore important that data scoping be done in concert with someone knowledgeable in both the business and in statistical analysis, so that artificial patterns are not imposed and real patterns are not lost.
7.5.7.2 Data Selection
Data selection consists of identifying the source data that will be mined. Generally, the main focus will be on a transaction file. Once the transaction file is selected, related data may be added to your scope. The related data will consist of master files relevant to the transaction. In some cases, you will want to go beyond the directly related data and delve into other operational systems. For example, if you are doing sales analysis, you may want to include store staff scheduling data, to determine whether staffing levels, or even individual staff, create a pattern of sales of particular products, product combinations, or levels of sales. Clearly this data will not be part of your transaction, and it is quite likely the data is not stored in the same operational system.
7.5.7.3 Data Cleaning
Once you have scoped and selected the data to be mined, you must analyze it for quality. When cleaning data that will be mined, use extreme caution. The simple act of cleaning the data can remove or introduce patterns.
The first type of data cleaning (see 7.5.3, "Cleaning the Data" on page 72) is data validation. Validating the contents of a source field or column is very important when preparing data for mining. For example, if a gender code has valid values of M and F, all other values should be corrected. If this is not possible, you may want to document a margin of error for any patterns generated that relate to gender. You may also want to determine whether there are any patterns related to the bad data that can reveal an underlying cause.

Documenting relationships is the act of defining the relationships when adding in data such as the sales schedules in our data selection example. An algorithm must be developed to determine what part of the schedule gets recorded with a particular transaction. Although it seems clear that a sales transaction must be related to the schedule by the date and time of the sale, this may not be enough. What if some salespeople tend to start earlier than their shift and leave a little earlier? As long as it all balances out, it may be easier for staff to leave the scheduling system alone, but your patterns could be distorted by such an unknown. Of course, you may not be able to correct the problem with this example. The point is simply that you must be able to document the relationship to be able to correctly transform the data for mining purposes.
The second type of data cleaning (see 7.5.3, "Cleaning the Data" on page 72), data enhancement, is risky when preparing data for mining. It is certainly important to be able to relate all images of a customer. However, the differences that exist in your data may also expose hidden patterns. You should proceed with enhancement cautiously.
The third type of data cleaning, error handling, will generally be part of your data transformation, unless you need to find patterns to indicate the cause of the errors. Such pattern searching should only be necessary, and indeed possible, if there is a high degree of error in the source data.
7.5.7.5 Data Summarization
There may be cases where you cannot relate the transaction data to other data at the granularity of the transaction; for example, the data needed to set the scope at the right level is not contained in the original transaction data. In such cases, you may consider summarizing data to allow the relationships to be built. However, be aware that altering your data in this way may remove the detail needed to produce the very patterns for which you are searching. You may want to consider mining at two levels when this summarization appears to be necessary.
7.6 The Dynamic Warehouse Model
In an operational system, shortly after implementation the system stabilizes and the model becomes static until the next development initiative. But the data warehouse is more dynamic, and it is possible for the model to change with no additional development initiative, simply because of usage patterns.
Metadata is constantly added to the data warehouse from four sources (see Figure 35 on page 80). Monitoring of the warehouse provides usage statistics. The transform process adds metadata about what and how much data was loaded and when it was loaded. An archive process will record what data has been removed from the warehouse, when it was removed, and where it is stored. A purge process will remove data and update the metadata to reflect what remains in the data warehouse.