These concepts are also in direct support of the Information Architecture Principle described in Chapter 2, making data "readily available in a usable format for daily operations and analysis." This chapter describes key concepts and practices behind BI, enabling the availability and usability that maximize your data's value.
Data Warehousing
Data warehousing is a key concept behind both the structure and the discipline of a BI solution. While increasingly powerful tools provided in the latest release of SQL Server ease some of the requirements around warehousing, it is still helpful to understand these concepts in considering your design.
Star schema
The data warehouse is the industry-standard approach to structuring a relational OLAP data store. It begins with the idea of dimensions and measures, whereby a dimension is a categorization or "group by" in the data, and the measure is the value being summarized. For example, in "net sales by quarter and division," the measure is "net sales," and the dimensions are "quarter" (time) and "division" (organization).
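In relational terms, such a request is simply an aggregate query that groups by the dimension attributes and sums the measure. The following sketch assumes illustrative factSales, dimDate, and dimDivision tables; the names are not from any shipped sample database:

SELECT d.CalendarQuarter, v.DivisionName,
       SUM(f.NetSales) AS NetSales       -- the measure
FROM factSales f
INNER JOIN dimDate d
ON f.DateKey = d.DateKey                 -- time dimension
INNER JOIN dimDivision v
ON f.DivisionKey = v.DivisionKey         -- organization dimension
GROUP BY d.CalendarQuarter, v.DivisionName;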
Deciding which dimensions and measures to include in the warehouse should be based on the needs of the business, bringing together an understanding of the types of questions that will be asked and the semantics of the data being warehoused. Interviews and details about existing reports and metrics can help gain a first approximation, but for most organizations a pilot project is needed to fully define requirements.
Best Practice
Organizations that are not familiar with what BI solutions can deliver have a difficult time understanding the power they represent. Kick-start your effort by developing a simple prototype that demonstrates some of what is possible using data that everyone understands. Choose a small but relevant subset of data and implement a few dimensions and one or two measures to keep implementation time to a minimum.
Business needs can then provide a basis for building the star schema that is the building block of the warehouse (see Figure 70-1).
The star schema derives its name from its structure: a central fact table and a number of dimension tables clustered around it like the points of a star. Each dimension is connected back to the fact table by a foreign key relationship.
FIGURE 70-1
Simple star schema
(The figure shows a central fact table with foreign keys to surrounding dimension tables such as dimDrug, dimTherapy, and dimStage.)
The fact table consists of two types of columns: the keys that relate to each dimension in the star and the facts (or measures) of interest.
Each dimension table consists of a primary key by which it relates back to the fact table, and one or more attributes that categorize data for that dimension. For example, a customer dimension may include attributes for name, e-mail address, and zip code. In general, the dimension represents a denormalization of the data in the OLTP system. For example, the AdventureWorksDW customer dimension is derived from the AdventureWorks tables Sales.Individual and Person.Contact, and fields parsed from the XML column describing demographics, among others.
Snowflake schema
Occasionally, it makes sense to limit denormalization by making one dimension table refer to another, thus changing the star schema into a snowflake schema. For example, Figure 70-2 shows how the product dimension has been snowflaked in AdventureWorks' Internet Sales schema. Product category and subcategory information could have been included directly in the DimProduct table, but instead separate tables have been included to describe the categorizations. Snowflakes are useful for complex dimensions for which consistency issues might otherwise arise, such as the assignment of subcategories to categories in Figure 70-2, or for large dimensions where storage size is a concern.
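As a rough sketch of how such a snowflaked chain is declared (simplified and abbreviated from the AdventureWorksDW tables shown in Figure 70-2; column types and NOT NULL choices here are illustrative assumptions):

CREATE TABLE dbo.DimProductCategory (
    ProductCategoryKey int IDENTITY PRIMARY KEY,
    EnglishProductCategoryName nvarchar(50) NOT NULL
);
CREATE TABLE dbo.DimProductSubcategory (
    ProductSubcategoryKey int IDENTITY PRIMARY KEY,
    EnglishProductSubcategoryName nvarchar(50) NOT NULL,
    ProductCategoryKey int NOT NULL
        REFERENCES dbo.DimProductCategory (ProductCategoryKey)   -- snowflake link
);
CREATE TABLE dbo.DimProduct (
    ProductKey int IDENTITY PRIMARY KEY,
    EnglishProductName nvarchar(50) NOT NULL,
    ProductSubcategoryKey int NULL
        REFERENCES dbo.DimProductSubcategory (ProductSubcategoryKey)  -- snowflake link
);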
FIGURE 70-2
Snowflake dimension
(The figure shows the FactInternetSales fact table referencing DimProduct through ProductKey, with DimProduct in turn referencing DimProductSubcategory, which references DimProductCategory.)
Traditionally, snowflake schemas have been discouraged because they add complexity and can slow SQL operations, but recent versions of SQL Server eliminate the majority of these issues. If a dimension can be made more consistent using a snowflake structure, then do so unless: (1) the procedure required to publish data into the snowflake is too complex or slow to be sustainable, or (2) the schema being designed will be used for extensive SQL queries that will be slowed and complicated by the snowflake design.
Surrogate keys
Foreign key relationships are the glue that holds the star schema together, but avoid using OLTP keys to relate fact and dimension tables, even though it is often very convenient to do so. Consider a star schema containing financial data when the accounting package is upgraded, changing all the customer IDs in the process. If the warehouse relation is based on an identity column, changes in the OLTP data can be accomplished by changes to the relatively small amount of data in the customer dimension. If the OLTP key were used, then the entire fact table would need to be converted.
Instead, create local surrogate keys, usually an identity column on dimension tables, to relate to the fact table. This helps avoid ID conflicts and adds robustness and flexibility in a number of scenarios.
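A minimal sketch of the idea, using the dimCustomer and factOrder names that appear in the loading examples later in this chapter (column types are illustrative assumptions):

-- Run in the Warehouse database: the dimension owns an identity surrogate key,
-- and the OLTP customer code is kept only as an attribute used during loads.
CREATE TABLE dbo.dimCustomer (
    CustomerID   int IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    CustomerCode varchar(20) NOT NULL,            -- natural key from the OLTP system
    CustomerName nvarchar(100) NULL
);
-- Fact rows reference dimCustomer.CustomerID rather than the OLTP code, so if the
-- accounting package renumbers its customers, only dimCustomer.CustomerCode changes.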
Consistency
Defining the star schema enables fast OLAP queries against the warehouse, but gaining needed consistency requires following some warehousing rules:
■ When loading data into the warehouse, null and invalid values should be replaced with their reportable equivalents. This enables the data's semantics to be researched carefully once and then used by a wide audience, leading to a consistent interpretation throughout the organization. Often, this involves manually adding rows to dimension tables to allow for all the cases that can arise in the data (e.g., Unknown, Internal, N/A, etc.); see the sketch after this list.
■ Rows in the fact table should never be deleted, and no other operations should be performed that will lead to inconsistent query results from one day to the next. Often this leads to delaying import of in-progress transactions. If a summary of in-progress transactions is required, keep them in a separate fact table of "provisional" data and ensure that the business user is presented with distinct summaries of data that will change. Nothing erodes confidence in a system like inconsistent results.
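As a sketch of the first rule, reportable stand-in members can be seeded into a dimension ahead of the load; the dimOrderStatus table matches the loading example later in this chapter, and the negative key values are arbitrary placeholders, not part of any source system:

-- Seed reportable stand-in rows so facts with null or invalid status codes
-- always join to a meaningful dimension member.
INSERT INTO Warehouse.dbo.dimOrderStatus (OrderStatusID, OrderStatusDesc)
VALUES (-1, 'Unknown'),
       (-2, 'N/A');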
A key design consideration is the eventual size of the fact table. Fact tables often grow such that the only practical operation is the insertion of new rows; large-scale delete and update operations become impractical. In fact, database size estimates can ignore the size of dimension tables and use just the size of the fact tables.
Best Practice
Keeping large amounts of history in the warehouse doesn't imply keeping data forever. Plan to archive from the beginning by using partitioned fact tables that complement the partitioning strategy used in Analysis Services. For example, a large fact table might be broken into monthly partitions, maintaining two years of history.
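A minimal sketch of that approach (the boundary dates, filegroup mapping, and factOrder column list are illustrative only):

-- Monthly partition boundaries (only three shown); RANGE RIGHT places each
-- boundary date in the partition to its right.
CREATE PARTITION FUNCTION pfOrderMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

-- Map all partitions to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME psOrderMonth
    AS PARTITION pfOrderMonth ALL TO ([PRIMARY]);

-- The fact table is created on the partition scheme, keyed on the order date,
-- so old months can be switched out and archived without large deletes.
CREATE TABLE dbo.factOrder (
    OrderDate   datetime NOT NULL,
    CustomerID  int      NOT NULL,
    ProductID   int      NOT NULL,
    OrderAmount money    NOT NULL
) ON psOrderMonth (OrderDate);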
Loading data
Given the architecture of the star/snowflake schema, adding new data begins at the points and moves inward, adding rows to the fact table last in order to satisfy the foreign key constraints. Usually, warehouse tables are loaded using Integration Services, but examples are shown here as SQL inserts for illustration.
Loading dimensions
The approach to loading varies with the nature of the source data. In the fortunate case where the source fact data is related by foreign key to the table containing dimension data, only the dimension data needs to be scanned. This example uses the natural primary key in the source data, product code, to identify rows in the Products staging table that have not yet been added to the dimension table:
INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, stage.Name FROM Staging.dbo.Products stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code=dim.ProductCode
WHERE dim.ProductCode is NULL;
Often, the source dimension data will not be related by foreign key to the source fact data, so loading the dimension table requires a full scan of the fact data in order to ensure consistency.
This next example scans the fact data, picks up a corresponding description from the dimension data when available, or uses "Unknown" as the description when none is found:
INSERT INTO Warehouse.dbo.dimOrderStatus (OrderStatusID, OrderStatusDesc)
SELECT DISTINCT o.status, ISNULL(mos.Description,'Unknown')
FROM Staging.dbo.Orders o
LEFT OUTER JOIN Warehouse.dbo.dimOrderStatus os
ON o.status=os.OrderStatusID
LEFT OUTER JOIN Staging.dbo.map_order_status mos
ON o.status = mos.Number
WHERE os.OrderStatusID is NULL;
Finally, a source table may contain both fact and dimension data, which opens the door to inconsistent relationships between dimension attributes. The following example adds new codes that appear in the source data, but guards against multiple product name spellings by choosing one with an aggregate function. Without using MAX here, the query may return multiple rows for the same product code:
INSERT INTO Warehouse.dbo.dimProduct (ProductCode, ProductName)
SELECT stage.Code, MAX(stage.Name)
FROM Staging.dbo.Orders stage
LEFT OUTER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code=dim.ProductCode
WHERE dim.ProductCode is NULL
GROUP BY stage.Code;
Loading fact tables
Once all the dimensions have been populated, the fact table can be loaded. Dimension primary keys generally take one of two forms: the key is either a natural key based on dimension data (e.g., ProductCode) or it is a surrogate key without any relationship to the data (e.g., the identity column). Surrogate keys are more general and adapt well to data from multiple sources, but each surrogate key requires a join while loading. For example, suppose our simple fact table is related to dimTime, dimCustomer, and dimProduct. If dimCustomer and dimProduct use surrogate keys, the load might look like the following:
INSERT INTO Warehouse.dbo.factOrder (OrderDate, CustomerID, ProductID, OrderAmount)
SELECT o.Date, c.CustomerID, p.ProductID, ISNULL(Amount,0)
FROM Staging.dbo.Orders o
INNER JOIN Warehouse.dbo.dimCustomer c
ON o.CustCode = c.CustomerCode
INNER JOIN Warehouse.dbo.dimProduct p
ON o.Code = p.ProductCode;
Because dimTime is related to the fact table on the date value itself, no join is required to determine the dimension relationship. Measures should be converted into reportable form, eliminating nulls whenever possible. In this case, a null amount, should it ever occur, is best converted to 0.
Best Practice
The extract-transform-load (ETL) process consists of a large number of relatively simple steps that evolve over time as source data changes. Centralize ETL logic in a single location as much as possible, document non-obvious aspects, and place it under source control. When some aspect of the process requires maintenance, this will simplify rediscovering all the components and their revision history. Integration Services and SourceSafe are excellent tools in this regard.
Changing data in dimensions
Proper handling of changes to dimension data can be a complex topic, but it boils down to how the organization would like to track history. If an employee changes her last name, is it important to know both the current and previous values? How about address history for a customer? Or changes in credit rating?
Following are the four common scenarios for tracking history in dimension tables:
■ Slowly Changing Dimension Type 1: History is not tracked, so any change to dimension data applies across all time. For example, when the customer's credit rating changes from excellent to poor, there will be no way to know when the change occurred or that the rating was ever anything but poor. This lack of history makes it difficult to explain why the customer's purchase order was accepted last quarter without prepayment. Conversely, this simple approach will suffice for many dimensions. When implementing an Analysis Services database on OLTP data instead of a data warehouse, this is usually the only option available, as the OLTP database rarely tracks history.
■ Slowly Changing Dimension Type 2: Every change in the source data is tracked as history by multiple rows in the dimension table. For example, the first time a customer appears in OLTP data, a row is entered into the dimension table for that customer and corresponding fact rows are related to that dimension row. Later, when that customer's information changes in the OLTP data, the existing row for that customer is expired, and a new row is entered into the dimension table with the new attribute data. Future fact rows are then associated with this new dimension table row. Because multiple surrogate keys are created for the same customer, aggregations and distinct counts must use an alternate key. Generally, this alternate key will be the same one used to match rows when loading the dimension (see "Loading dimensions" earlier in this chapter).
■ Slowly Changing Dimension Type 3: Combines both type 1 and 2 concepts, whereby history on some but not all changes is tracked based on business rules. Perhaps employee transfers within a division are treated as type 1 changes (just updated), while transfers between divisions are treated as type 2 (a new dimension row is inserted).
■ Rapidly Changing Dimension: Occasionally an attribute (or a few attributes) in a dimension will change rapidly enough to cause a type 2 approach to generate too many records in the dimension table. Such attributes are often related to status or rankings. This approach resolves the combinatorial explosion by breaking the rapidly changing attributes out into a separate dimension tied directly to the fact table. Thus, instead of tracking changes as separate rows in the dimension table, the fact table contains the current ranking or status for each fact row.
Accommodating possible changes to dimensions complicates our "loading dimensions" example from the previous section. For example, a Type 1 dimension requires that each dimension value encountered must first be checked to see if it exists. If it does not, then it can be inserted as described. If it does exist, then a check is performed to determine whether attributes have changed (e.g., whether the employee name changed), and an update is performed if required.
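Expressed as straight T-SQL rather than an Integration Services flow, a Type 1 insert-or-update pass over the product dimension might look like the following sketch, which uses the MERGE statement introduced in SQL Server 2008 and the staging tables from the earlier examples:

MERGE Warehouse.dbo.dimProduct AS dim
USING (SELECT Code, MAX(Name) AS Name
       FROM Staging.dbo.Products
       GROUP BY Code) AS stage
ON dim.ProductCode = stage.Code
WHEN MATCHED AND dim.ProductName <> stage.Name THEN
    UPDATE SET ProductName = stage.Name       -- Type 1: overwrite, no history kept
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductCode, ProductName)
    VALUES (stage.Code, stage.Name);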
This type of conditional logic is another reason to use Integration Services to load data, as it simplifies the task and performs most operations quickly in memory. A common practice for detecting changes is to create and store a checksum of the current data values in the dimension, and then calculate the checksum on each dimension row read, comparing the two checksums to determine whether a change has occurred. This practice minimizes the amount of data read from the database when performing comparisons, which can be substantial when a large number of attributes are involved. While there is a small chance that different rows will return the same checksum, the risk/reward of this approach is frequently judged acceptable.
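A sketch of the comparison, assuming a RowChecksum column has been added to the dimension to hold the stored value (BINARY_CHECKSUM is used here for brevity and shares the collision caveat just mentioned):

-- Rows returned have changed attributes and need an update (Type 1) or an
-- expire/insert (Type 2); unchanged rows are skipped without comparing every column.
SELECT stage.Code, stage.Name
FROM Staging.dbo.Products stage
INNER JOIN Warehouse.dbo.dimProduct dim
ON stage.Code = dim.ProductCode
WHERE dim.RowChecksum <> BINARY_CHECKSUM(stage.Name);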
Type 2 dimensions extend the idea of Type 1's insert-or-update regimen. If a dimension row does not exist, it is still inserted, but if it exists two things need to happen. First, the existing row of data is expired: it continues to be used by all previously loaded fact data, but new fact data will be associated with the new dimension row. Second, the new dimension row is inserted and marked as active.
There are a number of ways to accomplish this expire/insert behavior, but a common approach adds three columns to the dimension table: effective start and end dates, plus an active flag. When a row is initially inserted, the effective start is the date when the row is first seen, the end date is set far in the future, and the active flag is on. Later, when a new version of the row appears, the effective end date is set and the active flag is cleared. It is certainly possible to use only the active flag or date range alone, but including both provides both easy identification of the current rows and change history for debugging.
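In T-SQL terms, a minimal sketch of the expire/insert step for one changed customer might look like this; the EffectiveStart, EffectiveEnd, and IsActive columns are the three columns just described, and @Code and @Name stand in for values read from staging:

DECLARE @Code varchar(20), @Name nvarchar(100);
SELECT @Code = 'C1001', @Name = N'New Name';

-- Expire the currently active row for this customer.
UPDATE Warehouse.dbo.dimCustomer
SET EffectiveEnd = GETDATE(), IsActive = 0
WHERE CustomerCode = @Code AND IsActive = 1;

-- Insert the new version and mark it active; new fact rows pick up the new
-- surrogate key, while previously loaded fact rows keep the expired one.
INSERT INTO Warehouse.dbo.dimCustomer
    (CustomerCode, CustomerName, EffectiveStart, EffectiveEnd, IsActive)
VALUES (@Code, @Name, GETDATE(), '99991231', 1);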
Summary
All organizations use BI whether they realize it or not, because every organization needs to measure what is happening in its business. The only alternative is to make every decision based on intuition, rather than data. The quest for BI solutions usually begins with simple queries run against OLTP data and evolves into an inconsistent jumble of numbers. The concepts of BI and data warehousing help organize that chaos.
Storing data in a separate warehouse database avoids the contention, security, history, and consistency pitfalls associated with directly using OLTP data. The discipline of organizing dimensions and facts into a star schema using report-ready data delivers both the quality and the performance needed to effectively manage your data.
The concepts introduced in this chapter should help prepare and motivate you to read the Analysis Services and Integration Services chapters in this section.
Building Multidimensional Cubes with Analysis Services
IN THIS CHAPTER
Concepts and terminology of Analysis Services
How to build up the components of a multidimensional database
Detailed exploration of dimensions and cubes
Dimension checklist
Configuring database storage and partitions
Designing aggregations to increase query speed
Configuring data integrity options
A naïve view of Analysis Services would be that it is the same data used in relational databases with a slightly different format. Why bother? One could get by with data in relational format only, without needing new technology. One can build a house with only a handsaw as well; it is all about the right tool for the job.
Analysis Services is fast. It serves up summaries of billions of rows in a second, a task that would take relational queries several minutes or longer. And unlike creating summary tables in a relational database, you don't need to create a different data structure for each type of summary.
Analysis Services is all about simple access to clean, consistent data. Building a database in Analysis Services eliminates the need for joins and other query-time constructs requiring intimate knowledge of the underlying data structures. The data modeling tools provide methods to handle null and inconsistent data. Complex calculations, even those involving period over period comparisons, can be easily constructed and made to look like just another item available for query to the user.
Analysis Services also provides simple ways to relate data from disparate systems. The facilities provided in the server combined with the rich design environment provide a compelling toolkit for data analysis and reporting.
Analysis Services Quick Start
One quick way to get started with both data warehousing and Analysis Services is to let the Business Intelligence Development Studio build the Analysis Services database and associated warehouse tables for you, based on templates shipped with SQL Server. Begin by identifying or creating a SQL Server warehouse
(relational) database. Then open Business Intelligence Development Studio and create a new Analysis Services project.
Right-click on the Cubes node in the Solution Explorer and choose New Cube to begin the Cube Wizard. On the Select Creation Method page of the wizard, choose "Generate tables in the data source" and choose the template from the list that corresponds to your edition of SQL Server. Work through the rest of the wizard choosing measures and dimensions that make sense in your business. Be sure to pause at the Define Time Periods page long enough to define an appropriate time range and periods to make the time dimension interesting for your application.
At the Completing the Wizard page, select the Generate Schema Now option to automatically start the Schema Generation Wizard. Work through the remaining wizard pages, specifying the warehouse location and accepting the defaults otherwise. At the end of the Schema Generation Wizard, all the Analysis Services and relational objects are created. Even if the generated system does not exactly meet a current need, it provides an interesting example. The resulting design can be modified and the schema regenerated by right-clicking the project within the Solution Explorer and choosing Generate Relational Schema at any time.
Analysis Services Architecture
Analysis Services builds on the concepts of the data warehouse to present data in a multidimensional format instead of the two-dimensional paradigm of the relational database. How is Analysis Services multidimensional? When selecting a set of relational data, the query identifies a value via row and column coordinates, while the multidimensional store relies on selecting one or more items from each dimension to identify the value to be returned. Likewise, a result set returned from a relational database is a series of rows and columns, whereas a result set returned by the multidimensional database can be organized along many axes depending on what the query specifies.
Background on Business Intelligence and Data Warehousing is presented in Chapter 70, "BI Design." Readers unfamiliar with these areas will find this background helpful for understanding Analysis Services.
Instead of the two-dimensional table, Analysis Services uses the multidimensional cube to hold data
in the database. The cube thus presents an entity that can be queried via multidimensional expressions
(MDX), the Analysis Services equivalent of SQL.
Analysis Services also provides a convenient facility for defining calculations in MDX, which in turn provides another level of consistency to the Business Intelligence information stream.
See Chapter 72, "Programming MDX Queries," for details on creating queries and calculations in MDX.
Analysis Services uses a combination of caching and pre-calculation strategies to deliver query performance that is dramatically better than queries against a data warehouse. For example, an existing query to summarize the last six months of transaction history over some 130 million rows per month takes a few seconds in Analysis Services, whereas the equivalent data warehouse query requires slightly more than seven minutes.
Unified Dimensional Model
The Unified Dimensional Model (UDM) defines the structure of the multidimensional database, including attributes presented to the client for query and how data is related, stored, partitioned, calculated, and extracted from the source databases.
At the foundation of the UDM is a data source view that identifies which relational tables provide data to Analysis Services and the relations between those tables. In addition, the data source view supports giving friendly names to included tables and columns. Based on the data source view, measure groups and dimensions are defined according to data warehouse facts and dimensions. Cubes then define the relations between dimensions and measure groups, forming the basis for multidimensional queries.
Server
The UDM, or database definition, is hosted by the Analysis Services server, as shown in Figure 71-1.
FIGURE 71-1
Analysis Services server
(The figure shows the Analysis Services server containing a processing and caching engine and MOLAP storage, with data arriving from relational databases and the SSIS pipeline.)
Data can be kept in a Multidimensional OLAP (MOLAP) store, which generally results in the fastest query times, but it requires pre-processing of source data. Processing normally takes the form of SQL queries derived from the UDM and sent to the relational database to retrieve underlying data. Alternately, data can be sent directly from the Integration Services pipeline to the MOLAP store.
In addition to storing measures at the detail level, Analysis Services can store pre-calculated summary data called aggregations. For example, if aggregations by month and product line are created as part of the processing cycle, queries that require that combination of values do not have to read and summarize the detailed data, but can use the aggregations instead.