These concepts are also in direct support of the Information Architecture Principle described in Chapter 2, making data "readily available in a usable format for daily operations and analysis." This chapter describes key concepts and practices behind BI, enabling the availability and usability that maximize your data's value.
Data Warehousing
Data warehousing is a key concept behind both the structure and the discipline of a BI solution. While increasingly powerful tools provided in the latest release of SQL Server ease some of the requirements around warehousing, it is still helpful to understand these concepts in considering your design.
Star schema
The data warehouse is the industry-standard approach to structuring a relational OLAP data store. It begins with the idea of dimensions and measures, whereby a dimension is a categorization or "group by" in the data, and the measure is the value being summarized. For example, in "net sales by quarter and division," the measure is "net sales," and the dimensions are "quarter" (time) and "division" (organization).
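In relational terms, such a request is simply an aggregate query that groups by the dimension attributes and sums the measure. The following sketch assumes illustrative factSales, dimDate, and dimDivision tables; the names are not from any shipped sample database:

SELECT d.CalendarQuarter, v.DivisionName,
       SUM(f.NetSales) AS NetSales       -- the measure
FROM factSales f
INNER JOIN dimDate d
ON f.DateKey = d.DateKey                 -- time dimension
INNER JOIN dimDivision v
ON f.DivisionKey = v.DivisionKey         -- organization dimension
GROUP BY d.CalendarQuarter, v.DivisionName;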
Deciding which dimensions and measures to include in the warehouse should be based on the needs of the business, bringing together an understanding of the types of questions that will be asked and the semantics of the data being warehoused. Interviews and details about existing reports and metrics can help gain a first approximation, but for most organizations a pilot project is needed to fully define requirements.
Best Practice
Organizations that are not familiar with what BI solutions can deliver have a difficult time understanding the power they represent. Kick-start your effort by developing a simple prototype that demonstrates some of what is possible using data that everyone understands. Choose a small but relevant subset of data and implement a few dimensions and one or two measures to keep implementation time to a minimum.
Business needs can then provide a basis for building the star schema that is the building block of the warehouse (see Figure 70-1).
The star schema derives its name from its structure: a central fact table and a number of dimension tables clustered around it like the points of a star. Each dimension is connected back to the fact table by a foreign key relationship.
FIGURE 70-1
Simple star schema
(The figure shows a central fact table with foreign keys to surrounding dimension tables such as dimDrug, dimTherapy, and dimStage.)
The fact table consists of two types of columns: the keys that relate to each dimension in the star and the facts (or measures) of interest.
Each dimension table consists of a primary key by which it relates back to the fact table, and one or more attributes that categorize data for that dimension. For example, a customer dimension may include attributes for name, e-mail address, and zip code. In general, the dimension represents a denormalization of the data in the OLTP system. For example, the AdventureWorksDW customer dimension is derived from the AdventureWorks tables Sales.Individual and Person.Contact, and fields parsed from the XML column describing demographics, among others.
Snowflake schema
Occasionally, it makes sense to limit denormalization by making one dimension table refer to another, thus changing the star schema into a snowflake schema. For example, Figure 70-2 shows how the product dimension has been snowflaked in AdventureWorks' Internet Sales schema. Product category and subcategory information could have been included directly in the DimProduct table, but instead separate tables have been included to describe the categorizations. Snowflakes are useful for complex dimensions for which consistency issues might otherwise arise, such as the assignment of subcategories to categories in Figure 70-2, or for large dimensions where storage size is a concern.
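As a rough sketch of how such a snowflaked chain is declared (simplified and abbreviated from the AdventureWorksDW tables shown in Figure 70-2; column types and NOT NULL choices here are illustrative assumptions):

CREATE TABLE dbo.DimProductCategory (
    ProductCategoryKey int IDENTITY PRIMARY KEY,
    EnglishProductCategoryName nvarchar(50) NOT NULL
);
CREATE TABLE dbo.DimProductSubcategory (
    ProductSubcategoryKey int IDENTITY PRIMARY KEY,
    EnglishProductSubcategoryName nvarchar(50) NOT NULL,
    ProductCategoryKey int NOT NULL
        REFERENCES dbo.DimProductCategory (ProductCategoryKey)   -- snowflake link
);
CREATE TABLE dbo.DimProduct (
    ProductKey int IDENTITY PRIMARY KEY,
    EnglishProductName nvarchar(50) NOT NULL,
    ProductSubcategoryKey int NULL
        REFERENCES dbo.DimProductSubcategory (ProductSubcategoryKey)  -- snowflake link
);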
FIGURE 70-2
Snowflake dimension
(The figure shows the FactInternetSales fact table referencing DimProduct through ProductKey, with DimProduct in turn referencing DimProductSubcategory, which references DimProductCategory.)
Traditionally, snowflake schemas have been discouraged because they add complexity and can slow SQL operations, but recent versions of SQL Server eliminate the majority of these issues. If a dimension can be made more consistent using a snowflake structure, then do so unless: (1) the procedure required to publish data into the snowflake is too complex or slow to be sustainable, or (2) the schema being designed will be used for extensive SQL queries that will be slowed and complicated by the snowflake design.
Surrogate keys
Foreign key relationships are the glue that holds the star schema together, but avoid using OLTP keys to relate fact and dimension tables, even though it is often very convenient to do so. Consider a star schema containing financial data when the accounting package is upgraded, changing all the customer IDs in the process. If the warehouse relation is based on an identity column, changes in the OLTP data can be accomplished by changes to the relatively small amount of data in the customer dimension. If the OLTP key were used, then the entire fact table would need to be converted.
Instead, create local surrogate keys, usually an identity column on dimension tables, to relate to the fact table. This helps avoid ID conflicts and adds robustness and flexibility in a number of scenarios.
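A minimal sketch of the idea, using the dimCustomer and factOrder names that appear in the loading examples later in this chapter (column types are illustrative assumptions):

-- Run in the Warehouse database: the dimension owns an identity surrogate key,
-- and the OLTP customer code is kept only as an attribute used during loads.
CREATE TABLE dbo.dimCustomer (
    CustomerID   int IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    CustomerCode varchar(20) NOT NULL,            -- natural key from the OLTP system
    CustomerName nvarchar(100) NULL
);
-- Fact rows reference dimCustomer.CustomerID rather than the OLTP code, so if the
-- accounting package renumbers its customers, only dimCustomer.CustomerCode changes.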
Consistency
Defining the star schema enables fast OLAP queries against the warehouse, but gaining needed consistency requires following some warehousing rules:
■ When loading data into the warehouse, null and invalid values should be replaced with their reportable equivalents. This enables the data's semantics to be researched carefully once and then used by a wide audience, leading to a consistent interpretation throughout the organization. Often, this involves manually adding rows to dimension tables to allow for all the cases that can arise in the data (e.g., Unknown, Internal, N/A, etc.); see the sketch after this list.
■ Rows in the fact table should never be deleted, and no other operations should be performed that will lead to inconsistent query results from one day to the next. Often this leads to delaying import of in-progress transactions. If a summary of in-progress transactions is required, keep them in a separate fact table of "provisional" data and ensure that the business user is presented with distinct summaries of data that will change. Nothing erodes confidence in a system like inconsistent results.
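As a sketch of the first rule, reportable stand-in members can be seeded into a dimension ahead of the load; the dimOrderStatus table matches the loading example later in this chapter, and the negative key values are arbitrary placeholders, not part of any source system:

-- Seed reportable stand-in rows so facts with null or invalid status codes
-- always join to a meaningful dimension member.
INSERT INTO Warehouse.dbo.dimOrderStatus (OrderStatusID, OrderStatusDesc)
VALUES (-1, 'Unknown'),
       (-2, 'N/A');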
A key design consideration is the eventual size of the fact table. Fact tables often grow such that the only practical operation is the insertion of new rows; large-scale delete and update operations become impractical. In fact, database size estimates can ignore the size of dimension tables and use just the size of the fact tables.
Best Practice
Keeping large amounts of history in the warehouse doesn't imply keeping data forever. Plan to archive from the beginning by using partitioned fact tables that complement the partitioning strategy used in Analysis Services. For example, a large fact table might be broken into monthly partitions, maintaining two years of history.
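A minimal sketch of that approach (the boundary dates, filegroup mapping, and factOrder column list are illustrative only):

-- Monthly partition boundaries (only three shown); RANGE RIGHT places each
-- boundary date in the partition to its right.
CREATE PARTITION FUNCTION pfOrderMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

-- Map all partitions to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME psOrderMonth
    AS PARTITION pfOrderMonth ALL TO ([PRIMARY]);

-- The fact table is created on the partition scheme, keyed on the order date,
-- so old months can be switched out and archived without large deletes.
CREATE TABLE dbo.factOrder (
    OrderDate   datetime NOT NULL,
    CustomerID  int      NOT NULL,
    ProductID   int      NOT NULL,
    OrderAmount money    NOT NULL
) ON psOrderMonth (OrderDate);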
Loading data
Given the architecture of the star/snowflake schema, adding new data begins at the points and moves inward, adding rows to the fact table last in order to satisfy the foreign key constraints. Usually, warehouse tables are loaded using Integration Services, but examples are shown here as SQL inserts for illustration.
Loading dimensions
The approach to loading varies with the nature of the source data. In the fortunate case where the source fact data is related by foreign key to the table containing dimension data, only the dimension data needs to be scanned. This example uses the natural primary key in the source data, product code, to identify rows in the Products staging table that have not yet been added to the dimension table:
INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, stage.Name FROM Staging.dbo.Products stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code=dim.ProductCode
WHERE dim.ProductCode is NULL;
Often, the source dimension data will not be related by foreign key to the source fact data, so loading the dimension table requires a full scan of the fact data in order to ensure consistency.
This next example scans the fact data, picks up a corresponding description from the dimension data when available, or uses "Unknown" as the description when none is found:
INSERT INTO Warehouse.dbo.dimOrderStatus (OrderStatusID, OrderStatusDesc)
SELECT DISTINCT o.status, ISNULL(mos.Description,'Unknown')
FROM Staging.dbo.Orders o
LEFT OUTER JOIN Warehouse.dbo.dimOrderStatus os
ON o.status=os.OrderStatusID
LEFT OUTER JOIN Staging.dbo.map_order_status mos
ON o.status = mos.Number
WHERE os.OrderStatusID is NULL;
Finally, a source table may contain both fact and dimension data, which opens the door to inconsistent relationships between dimension attributes. The following example adds new codes that appear in the source data, but guards against multiple product name spellings by choosing one with an aggregate function. Without using MAX here, the query may return multiple rows for the same product code:
INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, MAX(stage.Name)
FROM Staging.dbo.Orders stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code=dim.ProductCode
WHERE dim.ProductCode is NULL
GROUP BY stage.Code;
Loading fact tables
Once all the dimensions have been populated, the fact table can be loaded. Dimension primary keys generally take one of two forms: the key is either a natural key based on dimension data (e.g., ProductCode) or it is a surrogate key without any relationship to the data (e.g., the identity column). Surrogate keys are more general and adapt well to data from multiple sources, but each surrogate key requires a join while loading. For example, suppose our simple fact table is related to dimTime, dimCustomer, and dimProduct. If dimCustomer and dimProduct use surrogate keys, the load might look like the following:
INSERT INTO Warehouse.dbo.factOrder (OrderDate, CustomerID, ProductID, OrderAmount)
SELECT o.Date, c.CustomerID, p.ProductID, ISNULL(Amount,0)
FROM Staging.dbo.Orders o
INNER JOIN Warehouse.dbo.dimCustomer c
ON o.CustCode = c.CustomerCode
INNER JOIN Warehouse.dbo.dimProduct p
ON o.Code = p.ProductCode;
Because dimTime is related to the fact table on the date value itself, no join is required to determine the dimension relationship. Measures should be converted into reportable form, eliminating nulls whenever possible. In this case, a null amount, should it ever occur, is best converted to 0.
Best Practice
The extract-transform-load (ETL) process consists of a large number of relatively simple steps that evolve over time as source data changes. Centralize ETL logic in a single location as much as possible, document non-obvious aspects, and place it under source control. When some aspect of the process requires maintenance, this will simplify rediscovering all the components and their revision history. Integration Services and SourceSafe are excellent tools in this regard.
Changing data in dimensions
Proper handling of changes to dimension data can be a complex topic, but it boils down to how the organization would like to track history. If an employee changes her last name, is it important to know both the current and previous values? How about address history for a customer? Or changes in credit rating?
Following are the four common scenarios for tracking history in dimension tables:
■ Slowly Changing Dimension Type 1: History is not tracked, so any change to dimension data applies across all time. For example, when the customer's credit rating changes from excellent to poor, there will be no way to know when the change occurred or that the rating was ever anything but poor. This lack of history makes it difficult to explain why the customer's purchase order was accepted last quarter without prepayment. Conversely, this simple approach will suffice for many dimensions. When implementing an Analysis Services database on OLTP data instead of a data warehouse, this is usually the only option available, as the OLTP database rarely tracks history.
■ Slowly Changing Dimension Type 2: Every change in the source data is tracked as history by multiple rows in the dimension table. For example, the first time a customer appears in OLTP data, a row is entered into the dimension table for that customer and corresponding fact rows are related to that dimension row. Later, when that customer's information changes in the OLTP data, the existing row for that customer is expired, and a new row is entered into the dimension table with the new attribute data. Future fact rows are then associated with this new dimension table row. Because multiple surrogate keys are created for the same customer, aggregations and distinct counts must use an alternate key. Generally, this alternate key will be the same one used to match rows when loading the dimension (see "Loading dimensions" earlier in this chapter).
■ Slowly Changing Dimension Type 3: Combines both type 1 and 2 concepts, whereby history on some but not all changes is tracked based on business rules. Perhaps employee transfers within a division are treated as type 1 changes (just updated), while transfers between divisions are treated as type 2 (a new dimension row is inserted).
■ Rapidly Changing Dimension: Occasionally an attribute (or a few attributes) in a dimension will change rapidly enough to cause a type 2 approach to generate too many records in the dimension table. Such attributes are often related to status or rankings. This approach resolves the combinatorial explosion by breaking the rapidly changing attributes out into a separate dimension tied directly to the fact table. Thus, instead of tracking changes as separate rows in the dimension table, the fact table contains the current ranking or status for each fact row.
Accommodating possible changes to dimensions complicates our "loading dimensions" example from the previous section. For example, a Type 1 dimension requires that each dimension value encountered must first be checked to see if it exists. If it does not, then it can be inserted as described. If it does exist, then a check is performed to determine whether attributes have changed (e.g., whether the employee name changed), and an update is performed if required.
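Expressed as straight T-SQL rather than an Integration Services flow, a Type 1 insert-or-update pass over the product dimension might look like the following sketch, which uses the MERGE statement introduced in SQL Server 2008 and the staging tables from the earlier examples:

MERGE Warehouse.dbo.dimProduct AS dim
USING (SELECT Code, MAX(Name) AS Name
       FROM Staging.dbo.Products
       GROUP BY Code) AS stage
ON dim.ProductCode = stage.Code
WHEN MATCHED AND dim.ProductName <> stage.Name THEN
    UPDATE SET ProductName = stage.Name       -- Type 1: overwrite, no history kept
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductCode, ProductName)
    VALUES (stage.Code, stage.Name);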
This type of conditional logic is another reason to use Integration Services to load data, as it simplifies the task and performs most operations quickly in memory. A common practice for detecting changes is to create and store a checksum of the current data values in the dimension, and then calculate the checksum on each dimension row read, comparing the two checksums to determine whether a change has occurred. This practice minimizes the amount of data read from the database when performing comparisons, which can be substantial when a large number of attributes are involved. While there is a small chance that different rows will return the same checksum, the risk/reward of this approach is frequently judged acceptable.
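A sketch of the comparison, assuming a RowChecksum column has been added to the dimension to hold the stored value (BINARY_CHECKSUM is used here for brevity and shares the collision caveat just mentioned):

-- Rows returned have changed attributes and need an update (Type 1) or an
-- expire/insert (Type 2); unchanged rows are skipped without comparing every column.
SELECT stage.Code, stage.Name
FROM Staging.dbo.Products stage
INNER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code = dim.ProductCode
WHERE dim.RowChecksum <> BINARY_CHECKSUM(stage.Name);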
Type 2 dimensions extend the idea of Type 1's insert-or-update regimen. If a dimension row does not exist, it is still inserted, but if it exists two things need to happen. First, the existing row of data is expired: it continues to be used by all previously loaded fact data, but new fact data will be associated with the new dimension row. Second, the new dimension row is inserted and marked as active.
There are a number of ways to accomplish this expire/insert behavior, but a common approach adds three columns to the dimension table: effective start and end dates, plus an active flag. When a row is initially inserted, the effective start is the date when the row is first seen, the end date is set far in the future, and the active flag is on. Later, when a new version of the row appears, the effective end date is set and the active flag is cleared. It is certainly possible to use only the active flag or date range alone, but including both provides both easy identification of the current rows and change history for debugging.
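In T-SQL terms, a minimal sketch of the expire/insert step for one changed customer might look like this; the EffectiveStart, EffectiveEnd, and IsActive columns are the three columns just described, and @Code and @Name stand in for values read from staging:

DECLARE @Code varchar(20), @Name nvarchar(100);
SELECT @Code = 'C1001', @Name = N'New Name';

-- Expire the currently active row for this customer.
UPDATE Warehouse.dbo.dimCustomer
SET EffectiveEnd = GETDATE(), IsActive = 0
WHERE CustomerCode = @Code AND IsActive = 1;

-- Insert the new version and mark it active; new fact rows pick up the new
-- surrogate key, while previously loaded fact rows keep the expired one.
INSERT INTO Warehouse.dbo.dimCustomer
    (CustomerCode, CustomerName, EffectiveStart, EffectiveEnd, IsActive)
VALUES (@Code, @Name, GETDATE(), '99991231', 1);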
Summary
All organizations use BI whether they realize it or not, because every organization needs to measure what is happening in its business. The only alternative is to make every decision based on intuition, rather than data. The quest for BI solutions usually begins with simple queries run against OLTP data and evolves into an inconsistent jumble of numbers. The concepts of BI and data warehousing help organize that chaos.
Storing data in a separate warehouse database avoids the contention, security, history, and consistency pitfalls associated with directly using OLTP data. The discipline of organizing dimensions and facts into a star schema using report-ready data delivers both the quality and the performance needed to effectively manage your data.
The concepts introduced in this chapter should help prepare and motivate you to read the Analysis Services and Integration Services chapters in this section.
Building Multidimensional Cubes with Analysis Services
IN THIS CHAPTER
Concepts and terminology of Analysis Services
How to build up the components of a multidimensional database
Detailed exploration of dimensions and cubes
Dimension checklist
Configuring database storage and partitions
Designing aggregations to increase query speed
Configuring data integrity options
A naïve view of Analysis Services would be that it is the same data used in relational databases with a slightly different format. Why bother? One could get by with data in relational format only, without needing new technology. One can build a house with only a handsaw as well; it is all about the right tool for the job.
Analysis Services is fast. It serves up summaries of billions of rows in a second, a task that would take relational queries several minutes or longer. And unlike creating summary tables in a relational database, you don't need to create a different data structure for each type of summary.
Analysis Services is all about simple access to clean, consistent data. Building a database in Analysis Services eliminates the need for joins and other query-time constructs requiring intimate knowledge of the underlying data structures. The data modeling tools provide methods to handle null and inconsistent data. Complex calculations, even those involving period over period comparisons, can be easily constructed and made to look like just another item available for query to the user.
Analysis Services also provides simple ways to relate data from disparate systems. The facilities provided in the server combined with the rich design environment provide a compelling toolkit for data analysis and reporting.
Analysis Services Quick Start
One quick way to get started with both data warehousing and Analysis Services is to let the Business Intelligence Development Studio build the Analysis Services database and associated warehouse tables for you, based on templates shipped with SQL Server. Begin by identifying or creating a SQL Server warehouse
(relational) database. Then open Business Intelligence Development Studio and create a new Analysis Services project.
Right-click on the Cubes node in the Solution Explorer and choose New Cube to begin the Cube Wizard. On the Select Creation Method page of the wizard, choose "Generate tables in the data source" and choose the template from the list that corresponds to your edition of SQL Server. Work through the rest of the wizard choosing measures and dimensions that make sense in your business. Be sure to pause at the Define Time Periods page long enough to define an appropriate time range and periods to make the time dimension interesting for your application.
At the Completing the Wizard page, select the Generate Schema Now option to automatically start the Schema Generation Wizard. Work through the remaining wizard pages, specifying the warehouse location and accepting the defaults otherwise. At the end of the Schema Generation Wizard, all the Analysis Services and relational objects are created. Even if the generated system does not exactly meet a current need, it provides an interesting example. The resulting design can be modified and the schema regenerated by right-clicking the project within the Solution Explorer and choosing Generate Relational Schema at any time.
Analysis Services Architecture
Analysis Services builds on the concepts of the data warehouse to present data in a multidimensional format instead of the two-dimensional paradigm of the relational database. How is Analysis Services multidimensional? When selecting a set of relational data, the query identifies a value via row and column coordinates, while the multidimensional store relies on selecting one or more items from each dimension to identify the value to be returned. Likewise, a result set returned from a relational database is a series of rows and columns, whereas a result set returned by the multidimensional database can be organized along many axes depending on what the query specifies.
Background on Business Intelligence and Data Warehousing is presented in Chapter 70, "BI Design." Readers unfamiliar with these areas will find this background helpful for understanding Analysis Services.
Instead of the two-dimensional table, Analysis Services uses the multidimensional cube to hold data
in the database. The cube thus presents an entity that can be queried via multidimensional expressions
(MDX), the Analysis Services equivalent of SQL.
Analysis Services also provides a convenient facility for defining calculations in MDX, which in turn provides another level of consistency to the Business Intelligence information stream.
See Chapter 72, "Programming MDX Queries," for details on creating queries and calculations in MDX.
Analysis Services uses a combination of caching and pre-calculation strategies to deliver query performance that is dramatically better than queries against a data warehouse. For example, an existing query to summarize the last six months of transaction history over some 130 million rows per month takes a few seconds in Analysis Services, whereas the equivalent data warehouse query requires slightly more than seven minutes.
Unified Dimensional Model
The Unified Dimensional Model (UDM) defines the structure of the multidimensional database, including attributes presented to the client for query and how data is related, stored, partitioned, calculated, and extracted from the source databases.
At the foundation of the UDM is a data source view that identifies which relational tables provide data to Analysis Services and the relations between those tables. In addition, the data source view supports giving friendly names to included tables and columns. Based on the data source view, measure groups and dimensions are defined according to data warehouse facts and dimensions. Cubes then define the relations between dimensions and measure groups, forming the basis for multidimensional queries.
Server
The UDM, or database definition, is hosted by the Analysis Services server, as shown in Figure 71-1.
FIGURE 71-1
Analysis Services server
(The figure shows the Analysis Services server containing a processing and caching engine and MOLAP storage, with data arriving from relational databases and the SSIS pipeline.)
Data can be kept in a Multidimensional OLAP (MOLAP) store, which generally results in the fastest query times, but it requires pre-processing of source data. Processing normally takes the form of SQL queries derived from the UDM and sent to the relational database to retrieve underlying data. Alternately, data can be sent directly from the Integration Services pipeline to the MOLAP store.
In addition to storing measures at the detail level, Analysis Services can store pre-calculated summary data called aggregations. For example, if aggregations by month and product line are created as part of the processing cycle, queries that require that combination of values do not have to read and summarize the detailed data, but can use the aggregations instead.