
Hands-On Microsoft SQL Server 2008 Integration Services, Part 58



548  Hands-On Microsoft SQL Server 2008 Integration Services

In this model, the dimensions are denormalized, and a business view of a dimension is represented as a single table in the model. Also, the dimensions have simple primary keys, and the fact table has a set of foreign keys pointing to the dimensions; the combination of these keys forms a compound primary key for the fact table. This structure provides some key benefits by making the model simple for users to understand. Writing SELECT queries for this model is easier due to the simple joins between the dimensions and the fact table. Also, query performance is generally better due to the reduced number of joins compared to other models. Finally, a star schema can easily be extended by simply adding new dimensions, as long as the fact entity holds the related data.
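The simple-join property is easy to see in a short query. The following sketch assumes the AdventureWorksDW2008 sample tables FactInternetSales, DimDate, and DimProduct; each dimension is reached with a single join from the fact table:

```sql
-- Star schema query: one simple join per dimension
SELECT d.CalendarYear,
       p.EnglishProductName,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales AS f
JOIN dbo.DimDate    AS d ON f.OrderDateKey = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey   = p.ProductKey
GROUP BY d.CalendarYear, p.EnglishProductName;
```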

Snowflake Model

Sometimes it is difficult to denormalize a dimension, or in other words, it makes more sense to keep a dimension in a normalized form, especially when multiple levels of relationships exist in the dimension data or a child member has multiple parents in a dimension. In this model, the dimension suitable for snowflaking is split along its hierarchies, resulting in multiple tables linked to each other via relationships, generally one-to-many. A many-to-many relationship is also handled using a bridge table between the dimensions, sometimes called a factless fact table. For example, the AdventureWorksDW2008 product dimension DimProduct is a snowflaked dimension that is linked to DimProductSubcategory, which is further linked to the DimProductCategory table (refer to Figure 12-2). This structure makes much sense to database developers and data modelers and helps users write useful queries, especially when an OLAP tool such as SSAS supports such a structure and optimizes the running of snowflaked queries. However, business users might find it a bit difficult to work with and would prefer a star schema, so you have to find a balance in choosing when to go for a snowflake schema. Though there is some space saving in breaking a dimension into a snowflake schema, that is not high on the preference list because, first, disk space is not very costly, and second, the dimension tables are not huge, so the space savings are not expected to be considerable in many cases. Snowflaking is done for functional reasons rather than for savings in disk space. Finally, queries written against a snowflake schema tend to use more joins (because more dimension tables are involved) compared to a star schema, and this will affect query performance. You need to test for user acceptance of the speed at which results are returned.
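The extra joins that snowflaking introduces show up directly in the queries. This sketch assumes the AdventureWorksDW2008 product hierarchy described above; note that reaching the category level now requires two additional joins compared to the star-schema case:

```sql
-- Snowflake query: one extra join per level of the product hierarchy
SELECT pc.EnglishProductCategoryName,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales     AS f
JOIN dbo.DimProduct            AS p  ON f.ProductKey = p.ProductKey
JOIN dbo.DimProductSubcategory AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
JOIN dbo.DimProductCategory    AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
GROUP BY pc.EnglishProductCategoryName;
```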


Building a Star Schema

The very first requirement in designing a data warehouse is your focus on the subject area for which a business has engaged you to build a star schema. It's easy to sway in different directions while building a data warehouse, but whatever you do later in the process, you must always keep a focus on the business value you are delivering with the model. As a first step, capture all the business requirements and the purposes for which the business wants to undertake this activity. You might end up meeting several business managers and process owners to understand the requirements and match them against the data available in the source systems. At this stage, you will be creating a high-level data source mappings document to meet the requirements. Once you have identified at a high level that the requirements can be met with the available data, the next step is to define the dimensions and the measures in the process.

Figure 12-2 AdventureWorksDW2008 simplified snowflake schema

While defining measures or facts, it is important that you discuss with the business and clearly understand the level of detail they will be interested in. This will decide the grain of your fact data. Typically, you would want to keep the lowest grain of data so that you have maximum flexibility for future changes. In fact, defining the grain is one of the first steps in building a data warehouse, as it is the cornerstone for collating the required information, for instance, defining the roll-up measures.

At this stage, you are ready to create a low-level star schema data model and will be defining attributes and fields for the dimension and fact tables. As part of the process, you will also define primary keys for the dimension tables, which will also exist in the fact tables as foreign keys. These primary keys need to be a new set of keys known as surrogate keys. The use of surrogate keys instead of source system keys or business keys provides many benefits in the design; for instance, they provide protection against changes in source system keys, maintain history (by using the SCD transformation), integrate data from multiple sources, and handle late-arriving members, including facts for which dimension members are missing. Generally, a surrogate key will be an auto-incrementing, non-null integer value such as an identity column and will form a clustered index on the dimension. However, the Date dimension, commonly used in data warehouses, is a special case, with a primary key based on the date instead of a continuous number; for instance, 20091231 and 20100101 are date-based consecutive values, but not in serial number order.
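A surrogate-keyed dimension might be declared as follows. The table and column names here are illustrative, not taken from a specific source system; the point is the identity-based clustered key on the ordinary dimension versus the date-based integer key on the Date dimension:

```sql
-- Ordinary dimension: auto-incrementing surrogate key as the clustered primary key
CREATE TABLE dbo.DimCustomer (
    CustomerKey    INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    CustomerAltKey NVARCHAR(20)      NOT NULL,   -- business key from the source system
    CustomerName   NVARCHAR(100)     NOT NULL,
    CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
);

-- Date dimension: key derived from the date itself, e.g., 20091231, 20100101
CREATE TABLE dbo.DimDate (
    DateKey      INT      NOT NULL,              -- yyyymmdd, consecutive but not serial
    FullDate     DATE     NOT NULL,
    CalendarYear SMALLINT NOT NULL,
    CONSTRAINT PK_DimDate PRIMARY KEY CLUSTERED (DateKey)
);
```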

While working on dimensions, you will need to identify some special dimensions. First, look for role-playing dimensions. Refer to Figure 12-2 and note that the DateKey of the DimDate dimension is connected multiple times to the fact table, once each for the OrderDateKey, DueDateKey, and ShipDateKey columns. In this case, DimDate is acting as a role-playing dimension. Another case is to figure out degenerate dimensions in your data and place them alongside facts in the fact table. Next, you need to identify any indicators or flags used in the facts that are of low cardinality and can be grouped together in one junk dimension table. These miscellaneous flags, held together in one place, make the user's life much easier when they need to classify the analysis by some indicators or flags. Finally, you will complete the exercise by listing the attributes' change types, that is, whether they are Type 1 or Type 2 candidates. These attributes, especially Type 2, help in maintaining history in the data warehouse. Keeping history using SCD transformations has been discussed earlier in this chapter as well as in Chapter 10. Though your journey to implement a data warehouse still has a long way to go, after implementing the preceding measures, you will have implemented a basic star schema structure. From here it will be easy to work with and extend this model further to meet specific business process requirements, such as real-time data delivery or snowflaking.
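Role-playing can be expressed in T-SQL by joining the same date dimension once per role under different aliases. This sketch assumes the AdventureWorksDW2008 tables referenced in Figure 12-2:

```sql
-- One physical DimDate table playing three roles via aliases
SELECT ord.FullDateAlternateKey AS OrderDate,
       due.FullDateAlternateKey AS DueDate,
       shp.FullDateAlternateKey AS ShipDate,
       f.SalesAmount
FROM dbo.FactInternetSales AS f
JOIN dbo.DimDate AS ord ON f.OrderDateKey = ord.DateKey
JOIN dbo.DimDate AS due ON f.DueDateKey   = due.DateKey
JOIN dbo.DimDate AS shp ON f.ShipDateKey  = shp.DateKey;
```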


SQL Server 2008 R2 Features and Enhancements

Several features have been provided in SQL Server 2008 R2 that can help you in your data warehousing project, by either providing new functionality or improving the performance of commonly used features. It is not possible to cover them all here without stretching beyond the scope of this book, especially when those features do not reside in the realm of Integration Services. But I will still try to cover some features that most data warehousing projects will use, while other features such as data compression, sparse columns, new data types, large UDTs, and minimal logging are not covered.

SQL Server 2008 R2 Data Warehouse Editions

Microsoft has just launched SQL Server 2008 R2, which is built upon the strong foundations and successes of SQL Server 2008 and is targeted at very large-scale data warehouses and at higher mission-critical-scale and self-service business intelligence. SQL Server 2008 R2 has introduced two new premium editions to meet the demands of large-scale data warehouses.

SQL Server 2008 R2 Datacenter

This edition is built on the Enterprise Edition code base but provides the highest levels of scalability and manageability. SQL Server 2008 R2 Datacenter is designed for the highest levels of scalability that the SQL Server platform can provide, virtualization, and consolidation, and it delivers a high-performing data platform. Typical implementations include a large-scale data warehouse server that can scale up to support tens of terabytes of data, provide Master Data Services, and implement very large-scale BI applications such as self-service BI or PowerPivot for SharePoint. Following are the key features:

- As the Enterprise Edition is restricted to up to 25 instances and 4 virtual machines (VMs), the Datacenter Edition is the next level if you need more than 25 instances or more VMs. It also provides application and Multi-Server Management for enrolling instances and gaining insights.

- The Datacenter Edition has no limit on maximum server memory; rather, it is restricted only by the limits of the operating system. For example, it can support up to 2TB of RAM if running on the Windows Server 2008 R2 Datacenter Edition.


- It supports more than 8 processors and up to 256 logical processors for the highest levels of scale.

- It has the highest virtualization support for maximum ROI on consolidation and virtualization.

- It provides high-scale complex event processing with SQL Server StreamInsight.

- Advanced features such as the Resource Governor, data compression, and backup compression are included.

SQL Server 2008 R2 Parallel Data Warehouse

Since acquiring DATAllegro, a provider of large-volume, high-performance data warehouse appliances, Microsoft has been working on consolidating hardware and software solutions for high-end data warehousing under a project named Madison. SQL Server 2008 R2 Parallel Data Warehouse is the result of Project Madison. Parallel Data Warehouse is an appliance-based, highly scalable, highly reliable, and high-performance data warehouse solution. Using SQL Server 2008 on Windows Server 2008 in a massively parallel processing (MPP) configuration, Parallel Data Warehouse can scale from tens to hundreds of terabytes, providing better and more predictable performance, increased reliability, and a lower cost per terabyte. It comes with preconfigured hardware and software that are carefully balanced in one appliance, making deployment quick and easy. Massively parallel processing enables it to perform ultra-fast loading and high-speed backups, thus addressing two of the major challenges facing modern data warehouses: data load and backup times. You can integrate existing SQL Server 2008–based data marts or mini–data warehouses with Parallel Data Warehouse via a hub-and-spoke architecture. This product was targeted to ship alongside the SQL Server 2008 R2 release; however, its release has been slightly delayed pending customer feedback from the Technology Adoption Program (TAP). The Parallel Data Warehouse principles and architecture are detailed later in this chapter.

SQL Server 2008 R2 Data Warehouse Solutions

Microsoft has recognized the need to develop data warehouse solutions that build upon the successes of SQL Server 2008. This has resulted in Microsoft partnering with several industry-leading hardware vendors to create best-of-breed balanced configurations combining hardware and software to achieve the highest levels of performance. Two such solutions are now available under the names Fast Track Data Warehouse and Parallel Data Warehouse.


Fast Track Data Warehouse

The Fast Track Data Warehouse solution implements a CPU core-balanced approach on a symmetric multiprocessor (SMP)–based SQL Server data warehouse, using a core set of configurations and database best-practice guidelines. The Fast Track reference architecture is a combination of hardware that is balanced for performance across all components and software configurations, such as Windows OS settings, SQL Server database layout, and indexing, along with a whole raft of other settings, best practices, and documents to implement all of these objectives. Fast Track Data Warehouse servers can have two-, four-, or eight-processor configurations and can scale from 4 terabytes to 30-plus terabytes, and even more if compression capabilities are used. Earlier available reference architectures were found to suffer from various performance issues for the simple reason that they were not specifically designed to suit the needs of one particular problem and hence suffered from an unbalanced architecture. For example, you may have seen a server that is busy processing too much I/O, yet whose CPU utilization is not high enough to indicate the workload. This is a simple example of the mismatch, or imbalance, existing between various components in currently available servers. Fast Track Data Warehouse servers are built on the use cases of a particular scenario; that is, they are built to capacity to match the required workload on the server rather than with a one-size-fits-all approach. This approach of designing a balanced server provides predictable performance and minimizes the risk of going over spec on components, such as by providing CPU or storage that will never be utilized. The predictable performance and scalability are achieved by adopting core principles, best practices, and methodologies, some of which are listed next.

It is built for data warehouse workloads. The data warehouse workload is quite different from that on OLTP servers. While OLTP transactions are made up of small read and write operations, data warehouse queries usually perform large read and write operations. Data warehouse queries are generally fewer in number, but they are more complex, require heavy aggregations, and generally have date-range restrictions. Furthermore, OLTP transactions generate more random I/O, which causes slow response. To overcome this, a large number of disks have traditionally been used, along with other optimization techniques such as building heavy indexes. These techniques have their own maintenance overheads and cause the data warehouse loading time to increase. The Fast Track Data Warehouse uses a new way of optimizing data warehouse workloads by laying data out in a sequential architecture. Considering that a data warehouse workload requests ranges of data, reading sequential data off the disk drives is much more efficient than random I/O. All efforts are targeted at preserving sequential storage. The data is preferred to be served from disk rather than from memory, as the performance achieved with sequential storage is much higher. This results in fewer indexes, yielding savings in maintenance, decreased loading time, less fragmentation of data, and reduced storage requirements.

It offers a holistic approach to component architecture. The balance is maintained across all components, starting with the disks, disk controllers, Fibre Channel HBAs, and the Windows operating system, and ranging up to SQL Server and then to the CPU cores. For example, on the basis of how much data can be consumed per CPU core (200 MBps), the number of CPU cores is calculated for a given workload, and then backward calculations are applied to all the components to support the same bandwidth or capacity. This balance, or synchronization in response, across individual components provides the required throughput to match the capabilities of the data warehouse application.

It is optimized for workload type. The server is designed and built considering the very nature of the database application. To capture these details, templates and tools are provided to design and build the Fast Track server. Several industry-leading vendors are participating in this program to provide out-of-the-box performance reference architecture servers. Businesses can benefit from reduced hardware testing and tuning and from rapid deployment.

Parallel Data Warehouse

When your data growth needs can no longer be satisfied with the scale-up approach of the Fast Track Data Warehouse, you can choose the scale-out approach of Parallel Data Warehouse, which has been built for very large data warehouse applications using Microsoft SQL Server 2008. The symmetric multiprocessing (SMP) architecture used in the Fast Track Data Warehouse is limited by the capacity of the components, such as the CPU, memory, and hard disk drives, that form part of a single computer. This limitation is addressed by scaling out to a configuration consisting of multiple computing nodes, each with dedicated resources (CPUs, memory, and hard disk space, along with an instance of SQL Server), connected in an MPP configuration.

Architecture and Hardware

The appliance hardware is built on industry-standard technology and is not proprietary to one manufacturer, so you can choose from well-known hardware vendors such as HP, Dell, IBM, EMC2, and Bull. This way, you can keep your hardware maintenance costs low, as the appliance will integrate nicely with already-existing infrastructure.


As mentioned earlier, the Parallel Data Warehouse is built on Microsoft Windows Server 2008 Enterprise nodes with their own dedicated hardware connected via a high-speed Fibre Channel link, with each node running an instance of the SQL Server 2008 database server. The MPP appliance basically has one control node and several compute nodes, depending on the data requirements. This configuration is extendable from single-rack to multirack configurations; in the latter case, one rack could act as the control node. The nodes are connected in a configuration called Ultra Shared Nothing (refer to Figure 12-3), in which the large database tables are partitioned across multiple nodes to improve query performance. This architecture has no single point of failure, and redundancy has been built in at all component levels.

Applications or users send requests to the control node, which balances the requests intelligently across all the compute nodes. Each compute node processes the request it gets from the control node using its local resources and passes the results back to the control node, which then collates the results before returning them to the requesting application or user. As the data is evenly distributed across multiple nodes and the nodes process requests in parallel, queries run many times faster on an MPP appliance than on an SMP database server.

Like a Fast Track Data Warehouse server, an MPP appliance is also built under tight specifications and carefully balanced configurations to eliminate performance bottlenecks. Reference configurations have been designed for different use case scenarios, taking into account different types of workloads such as data loading, reporting, and ad hoc queries. A control node that automatically distributes the workload evenly, compute nodes that can work on queries autonomously, system resources that are balanced against each other, and reference configurations designed on use case scenarios enable an MPP appliance to achieve predictable performance. Scalability follows from here, with the simple addition of capacity as the data volumes grow.

Figure 12-3 Parallel Data Warehouse architecture (a control node, Node-0, feeding compute nodes Node-1 through Node-N)

Hub-and-Spoke Architecture

Another important advantage of an MPP appliance is that it can be deployed in a hub-and-spoke architecture. In this way, you can use an MPP appliance as a hub, while the spokes can be either MPP appliances or standard SQL Server 2008–based symmetric multiprocessing (SMP) servers (see Figure 12-4). Typically, department users will connect to the spokes to access data in their required formats. In this configuration, the MPP appliance at the hub hosts the enterprise data at the lowest granularity, and the spokes contain data for their relevant departments in the schema and aggregations they require. This is possible because the spokes can host any database application, such as a SQL Server 2008 data mart or a SQL Server Analysis Services data mart, as best fits the user requirements. So, this architecture, with an MPP appliance at the hub and SMP database servers or MPP appliances as spokes, is a specialized configuration in which a grid of computers forms a very large-scale data warehouse in a federated model. This grid of computers can be connected via a high-speed network. Also, the nodes of an MPP appliance are connected via a high-speed link, and the hub processes data differently in different nodes, enabling parallel high-speed data transfer from node to node between hub and spoke units. Data transfer speeds approaching 500GB per minute can be achieved, thus minimizing the overhead associated with export and load operations.

Figure 12-4 A parallel data warehouse in hub-and-spoke architecture (a SQL Server 2008 R2 Parallel Data Warehouse hub with SQL Server 2008 Fast Track Data Warehouse, Analysis Services, and Reporting Services spokes)

The SQL Server 2008 R2 Parallel Data Warehouse MPP appliance integrates very well with BI applications such as Integration Services, Reporting Services, and Analysis Services. So, if you have an existing SQL Server 2008 data mart, it can easily be added as a node in the grid. As the spokes can be any SQL Server 2008 database application, this architecture provides a best-fit approach to the problem. The enterprise data is managed at the center in the MPP appliance under the enforcement of IT policies and standards, while a business unit can still have a spoke that it can manage autonomously. This flexible model is a huge business benefit and provides quick deployments of data marts, bypassing sensitive political issues. This way, you can easily expand an enterprise data warehouse by adding an additional node that can be configured according to the business unit requirements.

The Parallel Data Warehouse hub-and-spoke architecture utilizes the available processing power in the best possible way by distributing work across multiple locations in the grid. While basic data management such as cleansing, standardization, and metadata management is done in the hub according to enterprise policies, the application of business rules relevant to the business units and the analytical processing are handled in the relevant spokes. The hub-and-spoke model offers benefits such as parallel high-speed data movement among different nodes in the grid, the distribution of workload, and the massively parallel architecture of the hub, where all nodes work in parallel autonomously, thus providing outstanding performance.

With all these listed benefits, and many more that can be realized in individual deployment scenarios, the hub-and-spoke reference architecture provides the best of both worlds: the ease of data management of centralized data warehouses and the flexibility to build data marts on use-case scenarios, as with federated data marts.

SQL Server 2008 R2 Data Warehouse Enhancements

In this section, some of the SQL Server 2008 enhancements are covered, such as backup compression, the MERGE T-SQL statement, change data capture, and partition-aligned indexed views.
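As a taste of the MERGE statement mentioned above, the following sketch upserts a dimension from a staging table; the table and column names are hypothetical. A Type 1 change simply overwrites the attribute in place:

```sql
-- MERGE: insert new members, overwrite changed attributes (Type 1)
MERGE dbo.DimCustomer AS tgt
USING staging.Customer AS src
    ON tgt.CustomerAltKey = src.CustomerAltKey
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerAltKey, CustomerName)
    VALUES (src.CustomerAltKey, src.CustomerName);
```

A Type 2 change, which preserves history, would instead expire the current row and insert a new one, typically with an SCD transformation or additional MERGE logic.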

Backup Compression

Backup compression is a new feature provided in SQL Server 2008 Enterprise Edition

and above and due to its popularity, has since been included in the Standard Edition

and in the SQL Server 2008 R2 release as well Backup compression helps to speed up
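Enabling backup compression is a matter of adding one option to the backup command; the database name and backup path below are placeholders:

```sql
-- Back up a database with compression enabled
BACKUP DATABASE AdventureWorksDW2008
TO DISK = N'D:\Backups\AdventureWorksDW2008.bak'
WITH COMPRESSION, STATS = 10;   -- report progress every 10 percent
```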
