Figure 33. Dimensional and ER Views of Product-Related Data
The reason for this difference is the different role the model plays in the data warehouse. To the user, the data must look like the data warehouse model. In the operational world, a user does not generally use the model to access the data. The operational model is only used as a tool to capture requirements, not to access data.
Data warehouse design also has a different focus from operational design. Design in an operational system is concerned with creating a database that will perform well based on a well-defined set of access paths. Data warehouse design is concerned with creating a process that will retrieve and transform operational data into useful and timely warehouse data.
This is not to imply that there is no concern for performance in a data warehouse. On the contrary, due to the amount of data typically present in a data warehouse, performance is an essential consideration. However, performance considerations cannot be handled in a data warehouse in the same way they are handled in operational systems. Access paths have already been built into the model due to the nature of dimensional modeling. The unpredictable nature of data warehouse queries limits how much further you can design for performance. After implementation, additional tuning may be possible based on monitoring usage patterns.
One area where design can impact performance is renormalizing, or snowflaking, dimensions. This decision should be made based on how the specific query tools you choose will access the dimensions. Some tools enable the user to view the contents of a dimension more efficiently if it is snowflaked, while for other tools the opposite is true. As well, the choice to snowflake will have a tool-dependent impact on the join techniques used to relate a set of dimensions to a fact. Regardless of the design decision made, the model should remain the same: from the user perspective, each dimension should have a single consolidated image.
7.5.2 Identifying the Sources
Once the validated portion of the model passes on to the design stage, the first step is to identify the sources of the data that will be used to load the model. These sources should then be mapped to the target warehouse data model. Mapping should be done for each dimension, dimension attribute, fact, and measure. For dimensions and facts, only the source entities (for example, relational tables, flat files, IMS DBDs and segments) need be documented. For dimension attributes and measures, along with the source entities, the specific source attributes (such as columns and fields) must be documented.
Conversion and derivation algorithms must also be included in the metadata. At the dimension attribute and measure level, this includes data type conversion, algorithms for merging and splitting source attributes, calculations that must be performed, domain conversions, and source selection logic.
A domain conversion is the changing of the domain in the source system to a new set of values in the target. For example, in the operational system you may use codes for gender, such as 1=female and 2=male. You may want to convert this to female and male in the target system. Such a conversion should be documented in the metadata.
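A domain conversion of this kind reduces to a lookup during the load process. The following is a minimal sketch only; the code values and the fallback behavior are assumptions based on the gender example above, not part of the case study.

    # Hypothetical domain conversion: map operational gender codes to the
    # descriptive values stored in the target dimension.
    GENDER_DOMAIN = {
        "1": "female",
        "2": "male",
    }

    def convert_gender(source_code, default="unknown"):
        """Return the target-domain value for an operational gender code.

        Unrecognized codes fall back to a default so that the error handling
        step can decide whether to reject or repair the row.
        """
        return GENDER_DOMAIN.get(str(source_code).strip(), default)

    # convert_gender("1") -> "female"; convert_gender("9") -> "unknown"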
In some cases you may choose to load your target attribute from different source attributes based on certain conditions. Suppose you have a distributed sales organization and each location has its own customer file. However, your accounts receivable system is centralized. If you try to relate customer payments to sales data, you will likely have to pull some customer data from different locations based on where the customer does business. Source selection logic such as this must be included in the metadata.
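Source selection logic of this kind can be documented, and implemented, as a simple rule keyed on the condition that drives the choice. The sketch below assumes location codes and file paths that are purely illustrative.

    # Hypothetical source selection: pick the customer file to read based on
    # where the customer does business.
    CUSTOMER_SOURCES = {
        "EAST": "/extracts/east/customer.csv",
        "WEST": "/extracts/west/customer.csv",
        "CENTRAL": "/extracts/central/customer.csv",
    }

    def select_customer_source(business_location):
        """Return the customer file that supplies data for this customer.

        The real rule, its inputs, and the list of sources belong in the
        source-to-target metadata.
        """
        try:
            return CUSTOMER_SOURCES[business_location]
        except KeyError:
            raise ValueError("No customer source defined for " + repr(business_location))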
At the fact and dimension level, conversion and derivation metadata includes the logic for merging and splitting rows of data in the source, the rules for joining multiple sources, and the logic followed to determine which of multiple sources will be used.
Identifying sources can also cause changes to your model. This will occur when you cannot find a valid source. Two possibilities exist. First, there simply is no source that comes close to meeting the user's requirements. This should be very rare, but it is possible. If only a portion of the model is affected, remove that component and continue designing the remainder. Whatever portion of the model cannot be sourced must return to the requirements stage to redefine the need in a manner that can be met.
A more likely scenario is that there will be a source that comes close but is not exactly what the user had in mind. In the case study we have a product description but no model description. The model code is available to select individual models for analysis, but it is hardly user friendly. However, rather than not meet the requirement to perform analysis by model, model code will be used. If user knowledge of source systems is high, this may occur during the modeling stage, but often it occurs during design.
All of the metadata regarding data sources must be documented in the data warehouse model (see Figure 34 on page 77).
7.5.3 Cleaning the Data
Data cleaning has three basic components: validation of data, data enhancement, and error handling. Validation of data consists of a number of checks, including:
• Valid values for an attribute (domain check)
• Attribute valid in context of the rest of the row
• Attribute valid in context of related rows in this or other tables
• Relationship between rows in this and other tables valid (foreign key check)
This is not an exhaustive list. It is only meant to highlight the basic concepts of data validation.
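As a rough illustration, the domain and foreign key checks listed above might look like the following sketch; the attribute names, the valid gender codes, and the set of known dimension keys are assumptions made for the example.

    # Hypothetical validation checks applied to one incoming source row.
    VALID_GENDER_CODES = {"1", "2"}  # assumed domain for the gender attribute

    def validate_row(row, known_customer_keys):
        """Return a list of validation errors found in one source row.

        known_customer_keys stands in for the keys already present in the
        customer dimension (the foreign key check).
        """
        errors = []

        # Domain check: the attribute value must belong to its valid set.
        if row.get("gender") not in VALID_GENDER_CODES:
            errors.append("gender outside valid domain")

        # Attribute valid in the context of the rest of the row.
        if row.get("ship_date") and row.get("order_date") \
                and row["ship_date"] < row["order_date"]:
            errors.append("ship_date earlier than order_date")

        # Foreign key check: the related dimension row must exist.
        if row.get("customer_key") not in known_customer_keys:
            errors.append("customer_key has no matching dimension row")

        return errors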
Data enhancement is the process of cleaning valid data to make it more meaningful. The most common example is name and address information. Often we store name and address information for customers in multiple locations. Over time, these tend to become unsynchronized. Merging data for the customer is often difficult because the data we use to match the different images of the customer no longer matches. Data enhancement resynchronizes this data.
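A small part of data enhancement is simply standardizing values before attempting a match. The sketch below shows only a trivial normalization of an address line; real enhancement typically relies on specialized matching software and reference data, and the abbreviation list here is an assumption.

    import re

    # Hypothetical address standardization applied before matching customer
    # records held in different source systems.
    ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

    def normalize_address(address):
        """Return an uppercased, punctuation-free, abbreviation-standardized
        form of an address line for comparing two images of a customer."""
        text = re.sub(r"[^\w\s]", "", address.upper())
        words = [ABBREVIATIONS.get(word, word) for word in text.split()]
        return " ".join(words)

    # normalize_address("123 Main Street.") == normalize_address("123 MAIN ST")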
Error handling is a process that determines what to do with less than perfect data. Data may be rejected, stored for repair in a holding area, or passed on with its imperfections to the data warehouse. From a data model perspective, we only care about the data that is passed on to the data warehouse. The metadata for imperfect data should include statements about the data quality (types of errors) to be expected and the data accuracy (frequency of errors) of the data (see Figure 34 on page 77).
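The three outcomes described above amount to a routing decision taken during the load. A minimal sketch, in which the rule for what counts as a fatal error is an assumption:

    # Hypothetical error handling: decide what happens to a row based on the
    # validation errors it carries.
    def route_row(errors, fatal=("customer_key has no matching dimension row",)):
        """Return 'load', 'hold', or 'reject' for a validated row.

        Rows with no errors are loaded, rows with a fatal error are rejected,
        and everything else goes to a holding area for repair. The metadata
        should record how often each path is taken (data accuracy).
        """
        if not errors:
            return "load"
        if any(error in fatal for error in errors):
            return "reject"
        return "hold"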
7.5.4 Transforming the Data
Data transformation is a critical step in any data warehouse development effort. Two major decisions must be made at this point: how to capture the source data, and a method for assigning keys to the target data. Along with these two decisions, you must generate a plan documenting the steps to get the data from source to target. From a modeling perspective, this is simply adding more metadata.
7.5.4.1 Capturing the Source Data
The first step in transformation is capturing the source data. Initially, a full copy of the data is required. Once this initial copy has been loaded, a means of maintaining it must be devised. There are four primary methods of capturing data:
• Full refresh
• Log capture
• Time-stamped source
• Change transaction files
A full refresh, as the name implies, is simply a full copy of the data to be moved into the target data warehouse. This copy may replace what is in the data warehouse, add a complete new copy at the new point in time, or be compared to the target data to produce a record of changes in the target.
The other three methods focus on capturing only what has changed in the source data. Log capture extracts relevant changes from the DBMS's log files. If source data has been time stamped, the extract process can select only data that has changed since the previous extract was run. Some systems will produce a file of changes that have been made in the source; an extract can use this in the same manner it would use a log file.
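For a time-stamped source, the incremental extract reduces to a filter on the change time stamp. A minimal sketch, assuming each source row carries a last_updated value and that the time of the previous extract run is kept in the extract metadata:

    def extract_changed_rows(source_rows, last_extract_time):
        """Yield only the source rows changed since the previous extract run.

        source_rows is any iterable of dicts carrying a 'last_updated'
        datetime; last_extract_time comes from the extract metadata and is
        advanced after each successful run.
        """
        for row in source_rows:
            if row["last_updated"] > last_extract_time:
                yield row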
From a modeling perspective, the method used should be documented in the metadata for the model. As well, the schedule of the extract should be documented at this point. Later, in the production environment, actual extract statistics will be added to this metadata (see Figure 34 on page 77).
7.5.4.2 Generating Keys
Key selection in the data warehouse is a difficult issue. It involves a trade-off between performance and management.
Key selection applies mainly to dimensions. The keys chosen for the dimensions must be the foreign keys of the fact.
There are two choices for dimension keys. Either an arbitrary key can be assigned, or identifiers from the operational system can be used. An arbitrary key is usually just a sequential number, where the next available number is assigned when a new key is required.
To uniquely represent a dimension using identifiers from an operational system usually requires a composite key. A composite key is a key made up of multiple columns. An arbitrary key is one column and is almost always smaller than an operationally derived key. Therefore, arbitrary keys will generally perform joins faster.
Generation of an arbitrary key is slightly more complex. If you get your key from the operational system, there is no need to determine the next available key. The exception to this is where history of a dimension is kept. In this case, when you use identifiers from an operational system, you must add an additional key because keys must be unique. One option is an arbitrary sequence number. Another is to add begin and end time stamps to the dimension key. Both of these options also work for an arbitrary key, but it is simpler just to generate a new arbitrary key when an entry in a dimension changes.
Once the history issue is considered, it certainly seems as if an arbitrary key is the way to go. However, the last factor in key selection is its impact on the fact table. When a fact is created, the key from each dimension must be assigned to it. If operationally derived keys, with time stamps for history, are used in the dimensions, there is no additional work when a fact is created; the linkage happens automatically. With arbitrary keys, or arbitrary history identifiers, a key must be assigned to a fact at the time the fact is created.
There are two ways to assign keys. One is to maintain a translation table of operational and data warehouse keys. The other is to store the operational keys and, if necessary, time stamps, as attribute data on the dimension.
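A minimal sketch of the translation table approach follows; the in-memory mapping is an assumption made for brevity, as in practice the translation table would itself be stored in the database and maintained by the dimension load.

    import itertools

    # Hypothetical surrogate key assignment using a translation table that
    # maps an operational customer number to the warehouse dimension key.
    class KeyTranslator:
        def __init__(self):
            self._next_key = itertools.count(1)  # arbitrary sequential keys
            self._translation = {}               # operational key -> warehouse key

        def warehouse_key(self, operational_key):
            """Return the warehouse key for an operational key, generating the
            next sequential key if this is a new dimension entry."""
            if operational_key not in self._translation:
                self._translation[operational_key] = next(self._next_key)
            return self._translation[operational_key]

    # When a fact row is created, the operational customer number on the
    # transaction is translated before the fact is written:
    #   fact["customer_key"] = translator.warehouse_key(txn["customer_no"])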
The above discussion also applies to degenerate keys on the fact. The only difference is that there is no need to join on a degenerate key, thus diminishing the performance impact of an arbitrary key. The issue is more likely to come down to whether a user may need to know the value of a degenerate key for analysis purposes or whether it is simply recorded to create the desired level of granularity.
The choice, then, is between the better performance of an arbitrary key and the easier maintenance of an operational key. How much better the performance is, and how much more the maintenance costs, must be evaluated in your own organization.
Regardless of the choice you make, the keys, and the process that generates them, must be documented in the metadata (see Figure 34 on page 77). This data is necessary for the technical staff who administer and maintain the data warehouse. If the tools you use do not hide join processing, the user may need to understand this also. However, it is not recommended that a user be required to have this knowledge.
7.5.4.3 Getting from Source to Target
It is often the case that getting from source to target is a multiple-step process. Rarely can it be completed in one step. Among the many reasons for creating a multiple-step process to get from source to target are these:
• Sources to be merged are in different locations
• Not all data can be merged at once as some tables require outer joins
• Sources are stored on multiple incompatible technologies
• Complex summarization and derivation must take place
The point is simply that the process must be documented. The metadata for a model must include not only the steps of the process, but the contents of each step, as well as the reasons for it. It should look something like this:
1 Step 1 - Get Product Changes
   Objective of step
      Create a table containing rows where product information has changed.
   Inputs to step
      Change transaction log for Products and Models, Product Component table, Component table, and the Product dimension table.
   Transformations performed
      For each change record, read the related product component and component rows. For each product model, the cost of each component is multiplied by the number of components used to manufacture the model. The sum of the results for all components that make up the model is the cost of that model. A key is generated for each record, consisting of a sequential number starting with the next number after the highest used in the product dimension table. Write a record to the output table containing the generated key, the product and model keys, the current date, product description, model code, unit cost, suggested wholesale price, suggested retail price, and eligible for volume discount code.
2 Step 2 - Get Component Changes
   Transformations performed
      For each change record, check that the product and model exist in the work table. If they do, the component change is already recorded, so ignore the change record. If not, read the product and model tables for related information. For each product model, the cost of each component is multiplied by the number of components used to manufacture the model. The sum of the results for all components that make up the model is the cost of that model. A key is generated for each record, consisting of a sequential number starting with the next number after the highest used in the product dimension table. Add a record to the work table containing the generated key, the product and model keys, the current date, product description, model code, unit cost, suggested wholesale price, suggested retail price, and eligible for volume discount code.
   Outputs of step
      A work table containing additional new rows for the product dimension where there has been a change in the product component table or the component table.
3 Step 3 - Update Product Dimension
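The unit cost derivation described in Steps 1 and 2 is a sum, over the components used in a model, of component cost times quantity. A minimal sketch with assumed field names:

    def model_unit_cost(product_components, component_costs):
        """Compute the unit cost of one model as described in Steps 1 and 2.

        product_components lists the components used to manufacture the model
        as (component_id, quantity) pairs; component_costs maps a component_id
        to its cost. The names are assumptions made for the sketch.
        """
        return sum(quantity * component_costs[component_id]
                   for component_id, quantity in product_components)

    # Two of component "C1" at 3.50 plus one of "C2" at 10.00:
    # model_unit_cost([("C1", 2), ("C2", 1)], {"C1": 3.50, "C2": 10.00}) == 17.00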
7.5.5 Designing Subsidiary Targets
Subsidiary targets are targets derived from the originally designed fact and dimension tables. The reason for developing such targets is performance. If, for example, a user frequently runs a query that sums across one dimension and scans the entire fact table, it is likely that a subsidiary target should be created with the dimension removed and measures summed to produce a table with fewer rows for this query.
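Building such a subsidiary fact is essentially a group-by over the remaining dimension keys with the measures summed. The sketch below uses pandas only for brevity, and the column names are assumptions, not part of the case study.

    import pandas as pd

    def build_subsidiary_fact(fact, dropped_dimension_key, measure_columns):
        """Produce a subsidiary fact table with one dimension removed.

        The measures are summed over the remaining dimension keys, giving a
        table with fewer rows for queries that never constrain on the
        dropped dimension.
        """
        remaining_keys = [column for column in fact.columns
                          if column != dropped_dimension_key
                          and column not in measure_columns]
        return fact.groupby(remaining_keys, as_index=False)[measure_columns].sum()

    # Example with assumed columns: drop the store dimension from a sales fact.
    #   summary = build_subsidiary_fact(sales, "store_key", ["quantity", "revenue"])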
Creating a subsidiary dimension should only be done if the original dimension will not join properly with a subsidiary fact. This is likely to be a tool-dependent decision.
Because this is a performance issue, rules should be defined for when a subsidiary target will be considered. Consider a maximum allowable time for a query before an aggregate is deemed necessary. You may also create a sliding scale of the time it takes to run a query versus the frequency of the query.
Metadata for subsidiary targets should be the same as for the original facts and dimensions, with only the aggregates themselves being different. However, if your suite of tools can hide the subsidiary targets from the user and select them when appropriate based on the query, the metadata should be made visible only for technical purposes. The metadata should contain the reasons for creating the subsidiary target (see Figure 34 on page 77).
Often it is not possible to predict which subsidiary targets will be necessary at the design stage. These targets should not be created unless there is a clear justification. Rather than commit significant resources to them at this time, consider creating them as a result of monitoring efforts in the post-implementation environment.
Figure 34. The Complete Metadata Diagram for the Data Warehouse
7.5.6 Validating the Design
During the design stage you will create a test version of the production environment. When it comes time to validate the design with the user, hands-on testing is the best approach. Let the user try to answer questions through manipulation of the test target. Document any areas where the test target cannot provide the data requested.
Aside from testing, review with the user any additions and changes to the model that have resulted from the design phase to ensure they are understandable. Similar to the model validation step, pass what works on to the implementation phase. What does not work should be returned to the requirements phase for clarification and reentry into modeling.
7.5.7 What About Data Mining?
Decisions in data warehouse modeling would typically not be affected by a decision to support data mining. However, the discussion on data mining, as one of the key data analysis techniques, is presented here for your information and completeness.
As stated previously, data mining is about creating hypotheses, not testing them. It is important to make this distinction. If you are really testing hypotheses, the dimensional model will meet your requirements. It cannot, however, safely create a hypothesis. The reason for this is that by defining the dimensions of the data and organizing dimensions and measures into facts, you are building the hypotheses based on known rules and relationships. Once done, you have created a paradigm. To create a hypothesis, you must be able to work outside the paradigm, searching for patterns hidden in the unknown depths of the data.

There are, in general, four steps in the process of making data available for mining: data scoping, data selection, data cleaning, and data transformation. In some cases, a fifth step, data summarization, may be necessary.
7.5.7.1 Data Scoping
Even within the scope of your data warehouse project, when mining data you want to define a data scope, or possibly multiple data scopes. Because patterns are based on various forms of statistical analysis, you must define a scope in which a statistically significant pattern is likely to emerge. For example, buying patterns that show different products being purchased together may differ greatly in different geographical locations. To simply lump all of the data together may hide all of the patterns that exist in each location. Of course, by imposing such a scope you are defining some, though not all, of the business rules. It is therefore important that data scoping be done in concert with someone knowledgeable in both the business and in statistical analysis, so that artificial patterns are not imposed and real patterns are not lost.
7.5.7.2 Data Selection
Data selection consists of identifying the source data that will be mined. Generally, the main focus will be on a transaction file. Once the transaction file is selected, related data may be added to your scope. The related data will consist of master files relevant to the transaction. In some cases, you will want to go beyond the directly related data and delve into other operational systems. For example, if you are doing sales analysis, you may want to include store staff scheduling data, to determine whether staffing levels, or even individual staff, create a pattern of sales of particular products, product combinations, or levels of sales. Clearly this data will not be part of your transaction, and it is quite likely the data is not stored in the same operational system.
7.5.7.3 Data Cleaning
Once you have scoped and selected the data to be mined, you must analyze it for quality. When cleaning data that will be mined, use extreme caution. The simple act of cleaning the data can remove or introduce patterns.
The first type of data cleaning (see 7.5.3, "Cleaning the Data" on page 72) is data validation. Validating the contents of a source field or column is very important when preparing data for mining. For example, if a gender code has valid values of M and F, all other values should be corrected. If this is not possible, you may want to document a margin of error for any patterns generated that relate to gender. You may also want to determine whether there are any patterns related to the bad data that can reveal an underlying cause.

Documenting relationships is the act of defining the relationships when adding in data such as the sales schedules in our data selection example. An algorithm must be developed to determine what part of the schedule gets recorded with a particular transaction. Although it seems clear that a sales transaction must be related to the schedule by the date and time of the sale, this may not be enough. What if some salespeople tend to start earlier than their shift and leave a little earlier? As long as it all balances out, it may be easier for staff to leave the scheduling system alone, but your patterns could be distorted by such an unknown. Of course, you may not be able to correct the problem with this example. The point is simply that you must be able to document the relationship to be able to correctly transform the data for mining purposes.
The second type of data cleaning (see 7.5.3, "Cleaning the Data" on page 72), data enhancement, is risky when preparing data for mining. It is certainly important to be able to relate all images of a customer. However, the differences that exist in your data may also expose hidden patterns. You should proceed with enhancement cautiously.
The third type of data cleaning, error handling, will generally be part of your data transformation, unless you need to find patterns to indicate the cause of the errors. Such pattern searching should only be necessary, and indeed possible, if there is a high degree of error in the source data.
7.5.7.5 Data Summarization
There may be cases where you cannot relate the transaction data to other data at the granularity of the transaction; for example, the data needed to set the scope at the right level is not contained in the original transaction data. In such cases, you may consider summarizing data to allow the relationships to be built. However, be aware that altering your data in this way may remove the detail needed to produce the very patterns for which you are searching. You may want to consider mining at two levels when this summarization appears to be necessary.
7.6 The Dynamic Warehouse Model
In an operational system, shortly after implementation the system stabilizes and the model becomes static until the next development initiative. But the data warehouse is more dynamic, and it is possible for the model to change with no additional development initiative, simply because of usage patterns.
Metadata is constantly added to the data warehouse from four sources (see Figure 35 on page 80). Monitoring of the warehouse provides usage statistics. The transform process adds metadata about what and how much data was loaded and when it was loaded. An archive process will record what data has been removed from the warehouse, when it was removed, and where it is stored. A purge process will remove data and update the metadata to reflect what remains in the data warehouse.