Planning Data Stores
The enterprise data architect helps an organization plan the most effective use of information throughout
the organization. An organization’s data store configuration includes multiple types of data stores, as
illustrated in the following figure, each with a specific purpose:
■ Operational databases, or OLTP (online transaction processing) databases, collect
first-generation transactional data that is essential to the day-to-day operation of
the organization and unique to the organization. An organization might have an
operational data store to serve each unit or function within it. Regardless of the
organization’s size, an organization with a singly focused purpose may very well have only
one operational database.
■ For performance, operational stores are tuned for a balance of data retrieval and
updates, so indexes and locking are key concerns. Because these databases receive
first-generation data, they are subject to data update anomalies, and benefit from
normalization. A typical organizational data store configuration includes several
operational data stores feeding multiple data marts and a single master data store (see
graphic).
[Figure: a typical data store configuration, including Manufacturing OLTP, Sales OLTP, and a Mobile Sales OLTP at an alternate location, a central Data Warehouse, a ReferenceDB, and the Sales and Manufacturing data marts.]
■ Caching data stores, sometimes called reporting databases, are optional read-only
copies of all or part of an operational database. An organization might have multiple
caching data stores to deliver data throughout the organization. Caching data stores
might use SQL Server replication or log shipping to populate the database and are
tuned for high-performance data retrieval (see the sketch following this list).
■ Reference data stores are primarily read-only, and store generic data required by the
organization but which seldom changes — similar to the reference section of the
library. Examples of reference data might be unit of measure conversion factors or ISO
country codes. A reference data store is tuned for high-performance data retrieval.
■ Data warehouses collect large amounts of data from multiple data stores across the
entire enterprise using an extract-transform-load (ETL) process to convert the data
from the various formats and schemas into a common format designed for ease of
data retrieval. Data warehouses also serve as the archival location, storing historical
data and releasing some of the data load from the operational data stores. The data
is also pre-aggregated, making research and reporting easier, thereby improving the
accessibility of information and reducing errors.
■ Because the primary task of a data warehouse is data retrieval and analysis, the
data integrity concerns present with an operational data store don’t apply. Data warehouses
are designed for fast retrieval and are not normalized like master data stores. They
are generally designed using a basic star schema or snowflake design. Locks generally
aren’t an issue, and the indexing is applied without adversely affecting inserts or
updates.
Chapter 70, ‘‘BI Design,’’ discusses star schemas and snowflake designs used in data warehousing.
■ The analysis process usually involves more than just SQL queries, and uses data cubes that
consolidate gigabytes of data into dynamic pivot tables. Business intelligence (BI) is the combination of the
ETL process, the data warehouse data store, and the acts of creating and browsing cubes.
■ A common data warehouse is essential for ensuring that the entire organization researches the same
data set and achieves the same result for the same query — a critical aspect of the Sarbanes-Oxley
Act and other regulatory requirements.
■ Data marts are subsets of the data warehouse with pre-aggregated data organized specifically to
serve the needs of one organizational group or one data domain.
■ Master data store, or master data management (MDM), refers to the data warehouse that combines
the data from throughout the organization. The primary purpose of the master data store is to
provide a single version of the truth for organizations with a complex set of data stores and multiple
data warehouses.
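For the caching (reporting) data stores described above, a minimal sketch of the log-shipping style of population might look like the following. The database, file, and share names are hypothetical; a scheduled job applies each new transaction log backup, and the STANDBY option keeps the copy readable between restores.

-- One-time: seed the reporting copy from a full backup of the operational store,
-- leaving it able to accept further log restores but readable between them.
RESTORE DATABASE SalesReporting
    FROM DISK = N'\\backupshare\Sales_full.bak'
    WITH MOVE N'Sales_Data' TO N'D:\Data\SalesReporting.mdf',
         MOVE N'Sales_Log'  TO N'L:\Log\SalesReporting.ldf',
         STANDBY = N'D:\Data\SalesReporting_undo.dat';

-- Repeated on a schedule: apply the latest transaction log backup.
RESTORE LOG SalesReporting
    FROM DISK = N'\\backupshare\Sales_log_202401010600.trn'
    WITH STANDBY = N'D:\Data\SalesReporting_undo.dat';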
Smart Database Design
My career has focused on turning around database projects that were previously considered failures and
recommending solutions for ISV databases that are performing poorly. In nearly every case, the root
cause of the failure was the database design. It was too complex, too clumsy, or just plain inadequate.
Without exception, where I found poor database performance, I also found data modelers who insisted
on modeling alone or who couldn’t write SQL queries to save their lives.
Throughout my career, what began as an observation was reinforced into a firm conviction: the
database schema is the foundation of the database project, and an elegant, simple database design
outperforms a complex database both in terms of the development process and the final performance of
the database application. This is the basic idea behind the Smart Database Design.
While I believe in a balanced set of goals for any database, including performance, usability, data
integrity, availability, extensibility, and security, all things being equal, the crown goes to the database
that always provides the right answer with lightning speed.
Database system
A database system is a complex system. By complex, I mean that the system consists of multiple
components that interact with one another, as shown in Figure 2-1. The performance of one component affects
the performance of other components and thus the entire system. Stated another way, the design of one
component will set up other components, and the whole system, to either work well together or to
frustrate those trying to make the system work.
FIGURE 2-1
The database system is the collective effort of the server environment, maintenance jobs, the client
application, and the database.
Instead of randomly trying performance tips (and the Internet has an overwhelming number of SQL
Server performance and optimization tips), it makes more sense to think about the database as a system
and then figure out how the components of the database system affect one another. You can then use
this knowledge to apply the performance techniques in a way that provides the most benefit.
Every database system contains four broad technologies or components: the database itself, the server
platform, the maintenance jobs, and the client’s data access code, as illustrated in Figure 2-2. Each
component affects the overall performance of the database system:
■ The server environment is the physical hardware configuration (CPUs, memory, disk spindles,
I/O bus), the operating system, and the SQL Server instance configuration, which together
provide the working environment for the database. The server environment is typically optimized
by balancing the CPUs, memory, and I/O, and identifying and eliminating bottlenecks.
■ The database maintenance jobs are the steps that keep the database running optimally (index
defragmentation, DBCC integrity checks, and maintaining index statistics).
■ The client application is the collection of data access layers, middle tiers, front-end applications,
ETL (extract, transform, and load) scripts, report queries, or SSIS (SQL Server Integration
Services) packages that access the database. These can not only affect the user’s perception of
database performance, but can also reduce the overall performance of the database system.
■ Finally, the database component includes everything within the data file: the physical schema,
T-SQL code (queries, stored procedures, user-defined functions (UDFs), and views), indexes,
and data.
FIGURE 2-2
Smart Database Design is the premise that an elegant physical schema makes the data intuitively
obvious and enables writing great set-based queries that respond well to indexing. This in turn creates
short, tight transactions, which improves concurrency and scalability, while reducing the aggregate
workload of the database. This flow from layer to layer becomes a methodology for designing and
optimizing databases.
[Figure: the Smart Database Design layers. The Schema enables Set-Based queries, which enable Indexing, which enables Concurrency, which enables Advanced Scalability.]
All four database components must function well together to produce a high-performance database
system; if one of the components is weak, then the database system will fail or perform poorly.
However, of these four components, the database itself is the most difficult component to design and
the one that drives the design of the other three components. For example, the database workload
determines the hardware requirements. Maintenance jobs and data access code are both designed around
the database; and an overly complex database will complicate both the maintenance jobs and the data
access code.
Physical schema
The base layer of Smart Database Design is the database’s physical schema. The physical schema includes
the database’s tables, columns, primary and foreign keys, and constraints. Basically, the ‘‘physical’’
schema is what the server creates when you run data definition language (DDL) commands. Designing
an elegant, high-performance physical schema typically involves a team effort and requires numerous
design iterations and reviews.
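To make this concrete, here is a minimal sketch of such DDL, using hypothetical table and column names; the physical schema is nothing more than what statements like these create.

-- Hypothetical tables illustrating the elements of a physical schema:
-- tables, columns, a primary key, a foreign key, and constraints.
CREATE TABLE dbo.Customer (
    CustomerID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    CreditLimit  MONEY NOT NULL
        CONSTRAINT CK_Customer_CreditLimit CHECK (CreditLimit >= 0)
);

CREATE TABLE dbo.[Order] (
    OrderID    INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Order PRIMARY KEY,
    CustomerID INT NOT NULL
        CONSTRAINT FK_Order_Customer REFERENCES dbo.Customer (CustomerID),
    OrderDate  DATETIME NOT NULL
        CONSTRAINT DF_Order_OrderDate DEFAULT (GETDATE()),
    OrderTotal MONEY NOT NULL,
    ShipDate   DATETIME NULL
);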
Well-designed physical schemas avoid overcomplexity by generalizing similar types of objects, thereby
creating a schema with fewer entities. While designing the physical schema, make the data obvious to
the developer and easy to query. The prime consideration when converting the logical database design
into a physical schema is how much work is required in order for a query to navigate the data
structures while maintaining a correctly normalized design. Not only is the schema then a joy to use, but it
also makes it easier to code correct queries, reducing the chance of data integrity errors caused by faulty
queries.
Other hallmarks of a well-designed schema include the following:
■ The primary and foreign keys are designed for raw physical performance.
■ Optional data (e.g., second address lines, name suffixes) is designed using patterns (nullable
columns, surrogate nulls, or missing rows) that protect the integrity of the data both within the database and through the query (see the sketch following this list).
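As an illustration of two of those patterns, and reusing the hypothetical dbo.Customer table from earlier, optional data might be modeled as follows.

-- Pattern 1: a nullable column; the optional attribute lives on the main table.
ALTER TABLE dbo.Customer
    ADD AddressLine2 NVARCHAR(100) NULL;

-- Pattern 2: a missing row; the optional attribute lives in a one-to-zero-or-one
-- table, and customers without a name suffix simply have no row here.
CREATE TABLE dbo.CustomerNameSuffix (
    CustomerID INT NOT NULL
        CONSTRAINT PK_CustomerNameSuffix PRIMARY KEY
        CONSTRAINT FK_CustomerNameSuffix_Customer REFERENCES dbo.Customer (CustomerID),
    NameSuffix NVARCHAR(10) NOT NULL
);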
Conversely, a poorly designed (either non-normalized or overly complex) physical schema encourages
developers to write iterative code, code that uses temporary buckets to manipulate data, or code that
will be difficult to debug or maintain.
Agile Modeling
Agile development is popular for good reasons. It gets the job done more quickly and often produces
a better result than traditional methods. Agile development also fits well with database design and
development.
The traditional waterfall process steps through four project phases: requirements gathering, design,
development, and implementation. While this method may work well for some endeavors, when creating software,
the users often don’t know what they want until they see it, which pushes discovery beyond the requirements
gathering phase and into the development phase.
Agile development addresses this problem by replacing the single long waterfall with numerous short cycles
or iterations. Each iteration builds out a working model that can be tested, and enables users to play with the
software and further discover their needs. When users see rapid progress and trust that new features can be
added, they become more willing to allow features to be planned into the life cycle of the software, instead
of insisting that every feature be implemented in the next version.
When I’m developing a database, each iteration is usually 2–5 days long and is a mini cycle of discovery,
coding, unit testing, and more discoveries with the client. A project might consist of a dozen of these tight
iterations; and with each iteration, more features are fleshed out in the database and code.
Set-based queries
SQL Server is designed to handle data in sets. SQL is a declarative language, meaning that the SQL
query describes the problem, and the Query Optimizer generates an execution plan to resolve the
problem as a set.
Application programmers typically develop while-loops that handle data one row at a time. Iterative
code is fine for application tasks such as populating a grid or combo box, but it is inappropriate for
server-side code. Iterative T-SQL code, typically implemented via cursors, forces the database engine
to perform thousands of wasteful single-row operations, instead of handling the problem in one larger,
more efficient set. The performance cost of these single-row operations is huge. Depending on the task,
SQL cursors perform about half as well as set-based code, and the performance differential grows with
the size of the data. This is why set-based queries, based on an obvious physical schema, are so critical
to database performance.
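The contrast is easy to see in a small, hypothetical example (dbo.Product, CategoryID, and Price are assumed names): applying a 10 percent price increase to every product in one category, first with a cursor and then as a single set-based statement.

-- Iterative (cursor) version: the engine performs one UPDATE per row.
DECLARE @ProductID INT;
DECLARE ProductCursor CURSOR FOR
    SELECT ProductID FROM dbo.Product WHERE CategoryID = 7;
OPEN ProductCursor;
FETCH NEXT FROM ProductCursor INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Product
        SET Price = Price * 1.10
        WHERE ProductID = @ProductID;
    FETCH NEXT FROM ProductCursor INTO @ProductID;
END;
CLOSE ProductCursor;
DEALLOCATE ProductCursor;

-- Set-based version: one statement, one plan, one pass through the data.
UPDATE dbo.Product
    SET Price = Price * 1.10
    WHERE CategoryID = 7;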
A good physical schema and set-based queries set up the database for excellent indexing, further
improving the performance of the query (see Figure 2-2).
However, queries cannot overcome the errors of a poor physical schema and won’t solve the
performance issues of poorly written code. It’s simply impossible to fix a clumsy database design by throwing
code at it. Poor database designs tend to require extra code, which performs poorly and is difficult to
maintain. Unfortunately, poorly designed databases also tend to have code that is tightly coupled (refers
directly to tables), instead of code that accesses the database’s abstraction layer (stored procedures and
views). This makes it all that much harder to refactor the database.
Indexing
An index is an organized pointer used to locate information in a larger collection. An index is only
useful when it matches the needs of a question. In this case, it becomes the shortcut between a
question and the right answer. The key is to design the fewest number of shortcuts between the right
questions and the right answers.
A sound indexing strategy identifies a handful of queries that represent 90% of the workload and, with
judicious use of clustered indexes and covering indexes, solves the queries without expensive bookmark
lookup operations.
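For example, a covering index might be sketched as follows, reusing the hypothetical order table from earlier; the key columns support the search predicate, and the INCLUDE columns carry everything the query selects, so no bookmark lookup is needed.

-- A hypothetical covering index for queries that fetch a customer's orders.
CREATE NONCLUSTERED INDEX IX_Order_CustomerID_OrderDate
    ON dbo.[Order] (CustomerID, OrderDate)
    INCLUDE (OrderTotal, ShipDate);

-- This query can be answered entirely from the index above.
SELECT OrderDate, OrderTotal, ShipDate
FROM dbo.[Order]
WHERE CustomerID = 42
  AND OrderDate >= '20240101';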
An elegant physical schema, well-written set-based queries, and excellent indexing reduce transaction
duration, which implicitly improves concurrency and sets up the database for scalability.
Nevertheless, indexes cannot overcome the performance difficulties of iterative code. Poorly written SQL
code that returns unnecessary columns is much more difficult to index and will likely not take
advantage of covering indexes. Moreover, it’s extremely difficult to properly index an overly complex or
non-normalized physical schema.
Concurrency
SQL Server, as an ACID-compliant database engine, supports transactions that are atomic, consistent,
isolated, and durable. Whether the transaction is a single statement or an explicit transaction within
BEGIN TRAN...COMMIT TRAN statements, locks are typically used to prevent one transaction from
seeing another transaction’s uncommitted data. Transaction isolation is great for data integrity, but
locking and blocking hurt performance.
Multi-user concurrency can be tuned by limiting the extraneous code within logical transactions, setting
the transaction isolation level no higher than required, keeping trigger code to a minimum, and perhaps
using snapshot isolation.
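As a rough sketch of those options, using a hypothetical Sales database and the hypothetical dbo.Product table, the isolation settings might be applied like this.

-- Allow snapshot isolation, and optionally make READ COMMITTED use row
-- versioning so readers and writers no longer block each other.
-- (Switching on READ_COMMITTED_SNAPSHOT requires no other active connections.)
ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE Sales SET READ_COMMITTED_SNAPSHOT ON;

-- Within a session, keep the isolation level no higher than required and
-- keep the logical transaction short and tight.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    UPDATE dbo.Product
        SET Price = Price * 1.10
        WHERE ProductID = 42;
COMMIT TRAN;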
A database with an excellent physical schema, well-written set-based queries, and the right set of indexes
will have tight transactions and perform well with multiple users.
When a poorly designed database displays symptoms of locking and blocking issues, no amount of
transaction isolation level tuning will solve the problem. The sources of the concurrency issue are
the long transactions and additional workload caused by the poor database schema, lack of set-based
queries, or missing indexes. Concurrency tuning cannot overcome the deficiencies of a poor database
design.
Advanced scalability
With each release, Microsoft has consistently enhanced SQL Server for the enterprise. These technologies
can enhance the scalability of heavy-transaction databases.
The Resource Governor, new in SQL Server 2008, can restrict the resources available for different sets of
queries, enabling the server to maintain the service-level agreement (SLA) for some queries at the expense
of other less critical queries.
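A minimal Resource Governor sketch might look like the following, using hypothetical pool, group, and login names: reporting sessions are capped so critical OLTP queries keep their CPU and memory.

USE master;
GO
CREATE RESOURCE POOL ReportingPool
    WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);

CREATE WORKLOAD GROUP ReportingGroup
    USING ReportingPool;
GO

-- The classifier function (which must live in master) routes sessions from
-- the hypothetical ReportUser login into the reporting workload group.
CREATE FUNCTION dbo.fnClassifyWorkload() RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group SYSNAME = N'default';
    IF SUSER_SNAME() = N'ReportUser'
        SET @group = N'ReportingGroup';
    RETURN @group;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifyWorkload);
ALTER RESOURCE GOVERNOR RECONFIGURE;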
Indexed views were introduced in SQL Server 2000. They actually materialize the view as a clustered
index and can enable queries to select from joined data without hitting the joined tables, or to
pre-aggregate data. In effect, an indexed view is a custom covering index that can cover across multiple
tables.
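A minimal sketch, reusing the hypothetical customer and order tables from earlier, might pre-aggregate sales by customer and day; the unique clustered index is what materializes the view.

-- The view must be schema-bound, use two-part names, and include COUNT_BIG(*)
-- because it aggregates; the summed column is assumed to be NOT NULL.
CREATE VIEW dbo.vSalesByCustomerDay
WITH SCHEMABINDING
AS
SELECT  c.CustomerName,
        o.OrderDate,
        COUNT_BIG(*)      AS OrderCount,
        SUM(o.OrderTotal) AS TotalSales
FROM dbo.[Order] AS o
JOIN dbo.Customer AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerName, o.OrderDate;
GO

-- The unique clustered index materializes the aggregated result.
CREATE UNIQUE CLUSTERED INDEX IVX_vSalesByCustomerDay
    ON dbo.vSalesByCustomerDay (CustomerName, OrderDate);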
Partitioned tables can automatically segment data across multiple filegroups, which can serve as an
auto-archive device. By reducing the size of the active data partition, the requirements for maintaining the
data, such as defragging the indexes, are also reduced.
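A rough sketch of partitioning an order history table by year might look like this, with hypothetical filegroup names; older partitions can then live on cheaper storage or be switched out for archiving.

-- Partition function: boundary values split the data by calendar year.
CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20230101', '20240101', '20250101');

-- Partition scheme: maps the four resulting partitions to filegroups
-- (ArchiveFG and CurrentFG are hypothetical and must already exist).
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear
    TO (ArchiveFG, ArchiveFG, CurrentFG, CurrentFG);

-- The table is created on the partition scheme, keyed by the partition column.
CREATE TABLE dbo.OrderHistory (
    OrderID    INT      NOT NULL,
    OrderDate  DATETIME NOT NULL,
    OrderTotal MONEY    NOT NULL,
    CONSTRAINT PK_OrderHistory PRIMARY KEY (OrderID, OrderDate)
) ON psOrderYear (OrderDate);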
Service Broker can collect transactional data and process it after the fact, thereby providing an
‘‘over-time’’ load leveling as it spreads a five-second peak load over a one-minute execution without delaying
the calling transaction.
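At its simplest, the pattern might be sketched like this, with hypothetical object names and assuming Service Broker is enabled for the database: the calling transaction only sends a message, and a separate process drains the queue at its own pace.

-- One-time setup: a message type, contract, queue, and service.
CREATE MESSAGE TYPE OrderMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT OrderContract (OrderMessage SENT BY INITIATOR);
CREATE QUEUE dbo.OrderQueue;
CREATE SERVICE OrderService ON QUEUE dbo.OrderQueue (OrderContract);
GO

-- Inside the calling transaction: enqueue the work and return immediately.
DECLARE @dialog UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @dialog
    FROM SERVICE OrderService
    TO SERVICE 'OrderService'
    ON CONTRACT OrderContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @dialog
    MESSAGE TYPE OrderMessage (N'<Order ID="42" />');
GO

-- Later, a background task processes the backlog without holding up callers.
RECEIVE TOP (10) conversation_handle, message_type_name, message_body
FROM dbo.OrderQueue;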
While these high-scalability features can extend the scalability of a well-designed database, they are
limited in their ability to add performance to a poorly designed database, and they cannot overcome long
transactions caused by a lack of indexes, iterative code, or the many other problems caused by an
overly complex database design.
The database component is the principal factor determining the overall monetary cost of the database. A
well-designed database minimizes hardware costs, simplifies data access code and maintenance jobs, and
significantly lowers both the initial and the total cost of the database system.
A performance framework
By describing the dependencies between the schema, queries, indexing, transactions, and scalability,
Smart Database Design provides a framework for performance.
The key to mastering Smart Database Design is understanding the interaction, or cause-and-effect
relationship, between these hierarchical layers (schema, queries, indexing, concurrency). Each layer
enables the next layer; conversely, no layer can overcome deficiencies in lower layers. The practical
application of Smart Database Design takes advantage of these dependencies when developing or
optimizing a database by employing the right best practices within each layer to support the next layer.
Reducing the aggregate workload of the database component has a positive effect on the rest of the
database system. An efficient database component reduces the performance requirements of the server
platform, increasing capacity. Maintenance jobs are easier to plan and also execute faster when the
database component is designed well. There is less client access code to write, and the code that needs
to be written is easier to write and maintain. The result is an overall database system that’s simpler to
maintain, cheaper to run, easier to connect to from the data access layer, and that scales beautifully.
Although it’s not a perfect analogy, picturing a water fountain on a hot summer day can help
demonstrate how shorter transactions improve overall database performance. If everyone takes a small, quick
sip from the fountain, then no queue forms; but as soon as someone fills up a liter-sized Big Gulp cup,
others begin to wait. Regardless of the amount of hardware resources available to a database, time is
finite, and the greatest performance gain is obtained by eliminating the excess work of wastefully long
transactions, or throwing away the Big Gulp cup.
The quick sips of a well-designed query hitting an elegant, properly indexed database will outperform
and be significantly easier on the budget than the Big Gulp cup, with its poorly written query or cursor,
on a poorly designed database missing an index.
Striving for database design excellence is a smart business move with an excellent estimated return on
investment. From my experience, every day spent on database design saves two to three months of
development and maintenance time. In the long term, it’s far cheaper to design the database correctly
than to throw money or labor at project overruns or hardware upgrades.
The cause-and-effect relationship between the layers helps diagnose performance problems as well.
When a system is experiencing locking and blocking problems, the cause is likely found in the indexing
or query layers. I’ve seen databases that were drowning under the weight of poorly written code.
However, the root cause wasn’t the code; it was the overly complex, anti-normalized database design
that was driving the developers to write horrid code.
The bottom line? Designing an elegant database schema is the first step in maximizing the performance
of the overall database system, while reducing costs.
Issues and objections
I’ve heard objections to the Smart Database Design framework, and I’d like to address them here. Some
say that buying more hardware is the best way to improve performance. I disagree. More hardware
only masks the problem until it explodes later. Performance problems tend to grow exponentially as
database size grows, whereas hardware performance grows more or less linearly over time. One can almost
predict when even the ‘‘best’’ hardware available no longer suffices to get acceptable performance. In
several cases, I’ve seen companies spend incredible amounts to upgrade their hardware, and they saw
little or no improvement because the bottleneck was the transaction locking and blocking and poor
code. Sometimes, a faster CPU only waits faster. Strategically, reducing the workload is cheaper than
increasing the capacity of the hardware.
Some claim that fixing one layer can overcome deficiencies in lower layers. It’s true that a poor schema
will perform better when properly indexed than without indexes. However, adding the indexes doesn’t
really solve the deficiencies; it only masks them. The code is still doing extra work to
compensate for the poor schema. The cost of developing code and designing correct indexes is still higher for
the poor schema. Any data integrity or extensibility risks are still there.
Some argue that they would like to apply Smart Database Design but they can’t because the database is
a third-party database and they can’t modify the schema or the code. True, for most third-party
products, the database schema and queries are not open for optimization, and this can be very frustrating if
the database needs optimization. However, most vendors are interested in improving their product and
keeping their clients happy. Both clients and vendors have contracted with me to help identify areas of
opportunity and suggest solutions for the next revision.
Some say they’d like to apply Smart Database Design but they can’t because any change to the schema
would break hundreds of other objects. It’s true — databases without abstraction layers are expensive
to alter. An abstraction layer decouples the database from the client applications, making it possible
to change the database component without affecting the client applications. In the absence of a
well-designed abstraction layer, the first step toward gaining system performance is to create one. As
expensive as it may seem to refactor the database and every application so that all communications
go through an abstraction layer, the cost of not doing so could very well be that IT can’t respond to
the organization’s needs, forcing the company to outsource or develop wasteful extra databases. At the
worst, the failure of the database to be extensible could force the end of the organization.
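In practice, an abstraction layer is usually little more than views and stored procedures standing between the client code and the tables. Here is a minimal sketch with hypothetical names; client applications call only the view and the procedure, so the underlying table can be refactored freely.

-- Clients read through the view rather than from the table.
CREATE VIEW dbo.vCustomer
AS
SELECT CustomerID, CustomerName, CreditLimit
FROM dbo.Customer;
GO

-- Clients write through a stored procedure rather than with ad hoc UPDATEs.
CREATE PROCEDURE dbo.pCustomer_SetCreditLimit
    @CustomerID  INT,
    @CreditLimit MONEY
AS
SET NOCOUNT ON;
UPDATE dbo.Customer
    SET CreditLimit = @CreditLimit
    WHERE CustomerID = @CustomerID;
GO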
In both cases, the third-party database and the missing abstraction layer, it’s still a good idea to optimize
at the lowest level possible, and then move up the layers; but the best performance gains are made when
you can start optimizing at the lowest level of the database component, the physical schema.
Some say that a poorly designed database can be solved by adding more layers of code and converting the
database to an SOA-style application. I disagree. The database should be refactored with a clean, normalized
design and a proper abstraction layer. This will reduce the overall workload and solve a host of usability
and performance issues much better than simply wrapping a poorly designed database with more code.
Summary
When introducing the optimization chapter in her book Inside SQL Server 2000, Kalen Delaney correctly
writes that optimization can’t be added to a database after it has been developed; it has to be designed
into the database from the beginning.
This chapter presented the concept of the Information Architecture Principle, unpacked the six database
objectives, and then discussed the Smart Database Design, showing the dependencies between the layers
and how each layer enables the next layer.
In a chapter packed with ideas, I’d like to highlight the following:
■ The database architect position should be equally involved in the enterprise-level design and
the project-level designs.
■ Any database design or implementation can be measured by six database objectives: usability,
extensibility, data integrity, performance, availability, and security. These objectives don’t have
to compete — it’s possible to design an elegant database that meets all six objectives.
■ Each day spent on the database design will save three months later.
■ Extensibility is the most expensive database objective to correct after the fact. A brittle
database — one that has ad hoc SQL directly accessing the table from the client — is the
worst design possible. It’s simply impossible to fix a clumsy database design by throwing code
at it.
■ Smart Database Design is the premise that an elegant physical schema makes the data intuitively
obvious and enables writing great set-based queries that respond well to indexing. This in turn
creates short, tight transactions, which improves concurrency and scalability while reducing the
aggregate workload of the database. This flow from layer to layer becomes a methodology for
designing and optimizing databases.
■ Reducing the aggregate workload of the database has a greater positive effect than buying more
hardware.
From this overview of data architecture, the next chapter digs deeper into the concepts and patterns of
relational database design, which are critical for usability, extensibility, data integrity, and performance.