Planning Data Stores
The enterprise data architect helps an organization plan the most effective use of information throughout
the organization. An organization’s data store configuration includes multiple types of data stores, as
illustrated in the following figure, each with a specific purpose:
■ Operational databases, or OLTP (online transaction processing) databases, collect
first-generation transactional data that is essential to the day-to-day operation of
the organization and unique to the organization. An organization might have an
operational data store to serve each unit or function within it. Regardless of the
organization’s size, an organization with a singly focused purpose may very well have only
one operational database.
■ For performance, operational stores are tuned for a balance of data retrieval and
updates, so indexes and locking are key concerns. Because these databases receive
first-generation data, they are subject to data update anomalies, and benefit from
normalization. A typical organizational data store configuration includes several
operational data stores feeding multiple data marts and a single master data store (see
graphic).
[Figure: a typical data store configuration, including Manufacturing OLTP, Sales OLTP, and a Mobile Sales OLTP at an alternate location, a central Data Warehouse, a ReferenceDB, and the Sales and Manufacturing data marts.]
■ Caching data stores, sometimes called reporting databases, are optional read-only
copies of all or part of an operational database. An organization might have multiple
caching data stores to deliver data throughout the organization. Caching data stores
might use SQL Server replication or log shipping to populate the database and are
tuned for high-performance data retrieval (see the sketch following this list).
■ Reference data stores are primarily read-only, and store generic data required by the
organization but which seldom changes — similar to the reference section of the
library. Examples of reference data might be unit of measure conversion factors or ISO
country codes. A reference data store is tuned for high-performance data retrieval.
■ Data warehouses collect large amounts of data from multiple data stores across the
entire enterprise using an extract-transform-load (ETL) process to convert the data
from the various formats and schemas into a common format designed for ease of
data retrieval. Data warehouses also serve as the archival location, storing historical
data and releasing some of the data load from the operational data stores. The data
is also pre-aggregated, making research and reporting easier, thereby improving the
accessibility of information and reducing errors.
■ Because the primary task of a data warehouse is data retrieval and analysis, the
data integrity concerns present with an operational data store don’t apply. Data warehouses
are designed for fast retrieval and are not normalized like master data stores. They
are generally designed using a basic star schema or snowflake design. Locks generally
aren’t an issue, and the indexing is applied without adversely affecting inserts or
updates.
Chapter 70, ‘‘BI Design,’’ discusses star schemas and snowflake designs used in data warehousing.
■ The analysis process usually involves more than just SQL queries, and uses data cubes that
consolidate gigabytes of data into dynamic pivot tables. Business intelligence (BI) is the combination of the
ETL process, the data warehouse data store, and the acts of creating and browsing cubes.
■ A common data warehouse is essential for ensuring that the entire organization researches the same
data set and achieves the same result for the same query — a critical aspect of the Sarbanes-Oxley
Act and other regulatory requirements.
■ Data marts are subsets of the data warehouse with pre-aggregated data organized specifically to
serve the needs of one organizational group or one data domain.
■ Master data store, or master data management (MDM), refers to the data warehouse that combines
the data from throughout the organization. The primary purpose of the master data store is to
provide a single version of the truth for organizations with a complex set of data stores and multiple
data warehouses.
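For the caching (reporting) data stores described above, a minimal sketch of the log-shipping style of population might look like the following. The database, file, and share names are hypothetical; a scheduled job applies each new transaction log backup, and the STANDBY option keeps the copy readable between restores.

-- One-time: seed the reporting copy from a full backup of the operational store,
-- leaving it able to accept further log restores but readable between them.
RESTORE DATABASE SalesReporting
    FROM DISK = N'\\backupshare\Sales_full.bak'
    WITH MOVE N'Sales_Data' TO N'D:\Data\SalesReporting.mdf',
         MOVE N'Sales_Log'  TO N'L:\Log\SalesReporting.ldf',
         STANDBY = N'D:\Data\SalesReporting_undo.dat';

-- Repeated on a schedule: apply the latest transaction log backup.
RESTORE LOG SalesReporting
    FROM DISK = N'\\backupshare\Sales_log_202401010600.trn'
    WITH STANDBY = N'D:\Data\SalesReporting_undo.dat';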
Smart Database Design
My career has focused on turning around database projects that were previously considered failures and
recommending solutions for ISV databases that are performing poorly. In nearly every case, the root
cause of the failure was the database design. It was too complex, too clumsy, or just plain inadequate.
Without exception, where I found poor database performance, I also found data modelers who insisted
on modeling alone or who couldn’t write SQL queries to save their lives.
Throughout my career, what began as an observation was reinforced into a firm conviction: the
database schema is the foundation of the database project, and an elegant, simple database design
outperforms a complex database both in terms of the development process and the final performance of
the database application. This is the basic idea behind the Smart Database Design.
While I believe in a balanced set of goals for any database, including performance, usability, data
integrity, availability, extensibility, and security, all things being equal, the crown goes to the database
that always provides the right answer with lightning speed.
Database system
A database system is a complex system. By complex, I mean that the system consists of multiple
components that interact with one another, as shown in Figure 2-1. The performance of one component affects
the performance of other components and thus the entire system. Stated another way, the design of one
component will set up other components, and the whole system, to either work well together or to
frustrate those trying to make the system work.
FIGURE 2-1
The database system is the collective effort of the server environment, maintenance jobs, the client
application, and the database.
Instead of randomly trying performance tips (and the Internet has an overwhelming number of SQL
Server performance and optimization tips), it makes more sense to think about the database as a system
and then figure out how the components of the database system affect one another. You can then use
this knowledge to apply the performance techniques in a way that provides the most benefit.
Every database system contains four broad technologies or components: the database itself, the server
platform, the maintenance jobs, and the client’s data access code, as illustrated in Figure 2-2. Each
component affects the overall performance of the database system:
■ The server environment is the physical hardware configuration (CPUs, memory, disk spindles,
I/O bus), the operating system, and the SQL Server instance configuration, which together
provide the working environment for the database. The server environment is typically optimized
by balancing the CPUs, memory, and I/O, and identifying and eliminating bottlenecks.
■ The database maintenance jobs are the steps that keep the database running optimally (index
defragmentation, DBCC integrity checks, and maintaining index statistics).
■ The client application is the collection of data access layers, middle tiers, front-end applications,
ETL (extract, transform, and load) scripts, report queries, or SSIS (SQL Server Integration
Services) packages that access the database. These can not only affect the user’s perception of
database performance, but can also reduce the overall performance of the database system.
■ Finally, the database component includes everything within the data file: the physical schema,
T-SQL code (queries, stored procedures, user-defined functions (UDFs), and views), indexes,
and data.
FIGURE 2-2
Smart Database Design is the premise that an elegant physical schema makes the data intuitively
obvious and enables writing great set-based queries that respond well to indexing. This in turn creates
short, tight transactions, which improves concurrency and scalability, while reducing the aggregate
workload of the database. This flow from layer to layer becomes a methodology for designing and
optimizing databases.
[Figure: the Smart Database Design layers. The Schema enables Set-Based queries, which enable Indexing, which enables Concurrency, which enables Advanced Scalability.]
All four database components must function well together to produce a high-performance database
system; if one of the components is weak, then the database system will fail or perform poorly.
However, of these four components, the database itself is the most difficult component to design and
the one that drives the design of the other three components. For example, the database workload
determines the hardware requirements. Maintenance jobs and data access code are both designed around
the database; and an overly complex database will complicate both the maintenance jobs and the data
access code.
Physical schema
The base layer of Smart Database Design is the database’s physical schema. The physical schema includes
the database’s tables, columns, primary and foreign keys, and constraints. Basically, the ‘‘physical’’
schema is what the server creates when you run data definition language (DDL) commands. Designing
an elegant, high-performance physical schema typically involves a team effort and requires numerous
design iterations and reviews.
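To make this concrete, here is a minimal sketch of such DDL, using hypothetical table and column names; the physical schema is nothing more than what statements like these create.

-- Hypothetical tables illustrating the elements of a physical schema:
-- tables, columns, a primary key, a foreign key, and constraints.
CREATE TABLE dbo.Customer (
    CustomerID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    CreditLimit  MONEY NOT NULL
        CONSTRAINT CK_Customer_CreditLimit CHECK (CreditLimit >= 0)
);

CREATE TABLE dbo.[Order] (
    OrderID    INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Order PRIMARY KEY,
    CustomerID INT NOT NULL
        CONSTRAINT FK_Order_Customer REFERENCES dbo.Customer (CustomerID),
    OrderDate  DATETIME NOT NULL
        CONSTRAINT DF_Order_OrderDate DEFAULT (GETDATE()),
    OrderTotal MONEY NOT NULL,
    ShipDate   DATETIME NULL
);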
Well-designed physical schemas avoid overcomplexity by generalizing similar types of objects, thereby
creating a schema with fewer entities. While designing the physical schema, make the data obvious to
the developer and easy to query. The prime consideration when converting the logical database design
into a physical schema is how much work is required in order for a query to navigate the data
structures while maintaining a correctly normalized design. Not only is the schema then a joy to use, but it
also makes it easier to code correct queries, reducing the chance of data integrity errors caused by faulty
queries.
Other hallmarks of a well-designed schema include the following:
■ The primary and foreign keys are designed for raw physical performance.
■ Optional data (e.g., second address lines, name suffixes) is designed using patterns (nullable
columns, surrogate nulls, or missing rows) that protect the integrity of the data both within the database and through the query (see the sketch following this list).
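As an illustration of two of those patterns, and reusing the hypothetical dbo.Customer table from earlier, optional data might be modeled as follows.

-- Pattern 1: a nullable column; the optional attribute lives on the main table.
ALTER TABLE dbo.Customer
    ADD AddressLine2 NVARCHAR(100) NULL;

-- Pattern 2: a missing row; the optional attribute lives in a one-to-zero-or-one
-- table, and customers without a name suffix simply have no row here.
CREATE TABLE dbo.CustomerNameSuffix (
    CustomerID INT NOT NULL
        CONSTRAINT PK_CustomerNameSuffix PRIMARY KEY
        CONSTRAINT FK_CustomerNameSuffix_Customer REFERENCES dbo.Customer (CustomerID),
    NameSuffix NVARCHAR(10) NOT NULL
);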
Conversely, a poorly designed (either non-normalized or overly complex) physical schema encourages
developers to write iterative code, code that uses temporary buckets to manipulate data, or code that
will be difficult to debug or maintain.
Agile Modeling
Agile development is popular for good reasons. It gets the job done more quickly and often produces
a better result than traditional methods. Agile development also fits well with database design and
development.
The traditional waterfall process steps through four project phases: requirements gathering, design,
development, and implementation. While this method may work well for some endeavors, when creating software,
the users often don’t know what they want until they see it, which pushes discovery beyond the requirements
gathering phase and into the development phase.
Agile development addresses this problem by replacing the single long waterfall with numerous short cycles
or iterations. Each iteration builds out a working model that can be tested, and enables users to play with the
software and further discover their needs. When users see rapid progress and trust that new features can be
added, they become more willing to allow features to be planned into the life cycle of the software, instead
of insisting that every feature be implemented in the next version.
When I’m developing a database, each iteration is usually 2–5 days long and is a mini cycle of discovery,
coding, unit testing, and more discoveries with the client. A project might consist of a dozen of these tight
iterations; and with each iteration, more features are fleshed out in the database and code.
Set-based queries
SQL Server is designed to handle data in sets. SQL is a declarative language, meaning that the SQL
query describes the problem, and the Query Optimizer generates an execution plan to resolve the
problem as a set.
Application programmers typically develop while-loops that handle data one row at a time. Iterative
code is fine for application tasks such as populating a grid or combo box, but it is inappropriate for
server-side code. Iterative T-SQL code, typically implemented via cursors, forces the database engine
to perform thousands of wasteful single-row operations, instead of handling the problem in one larger,
more efficient set. The performance cost of these single-row operations is huge. Depending on the task,
SQL cursors perform about half as well as set-based code, and the performance differential grows with
the size of the data. This is why set-based queries, based on an obvious physical schema, are so critical
to database performance.
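The contrast is easy to see in a small, hypothetical example (dbo.Product, CategoryID, and Price are assumed names): applying a 10 percent price increase to every product in one category, first with a cursor and then as a single set-based statement.

-- Iterative (cursor) version: the engine performs one UPDATE per row.
DECLARE @ProductID INT;
DECLARE ProductCursor CURSOR FOR
    SELECT ProductID FROM dbo.Product WHERE CategoryID = 7;
OPEN ProductCursor;
FETCH NEXT FROM ProductCursor INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Product
        SET Price = Price * 1.10
        WHERE ProductID = @ProductID;
    FETCH NEXT FROM ProductCursor INTO @ProductID;
END;
CLOSE ProductCursor;
DEALLOCATE ProductCursor;

-- Set-based version: one statement, one plan, one pass through the data.
UPDATE dbo.Product
    SET Price = Price * 1.10
    WHERE CategoryID = 7;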
A good physical schema and set-based queries set up the database for excellent indexing, further
improving the performance of the query (see Figure 2-2).
However, queries cannot overcome the errors of a poor physical schema and won’t solve the
performance issues of poorly written code. It’s simply impossible to fix a clumsy database design by throwing
code at it. Poor database designs tend to require extra code, which performs poorly and is difficult to
maintain. Unfortunately, poorly designed databases also tend to have code that is tightly coupled (refers
directly to tables), instead of code that accesses the database’s abstraction layer (stored procedures and
views). This makes it all that much harder to refactor the database.
Indexing
An index is an organized pointer used to locate information in a larger collection. An index is only
useful when it matches the needs of a question. In this case, it becomes the shortcut between a
question and the right answer. The key is to design the fewest number of shortcuts between the right
questions and the right answers.
A sound indexing strategy identifies a handful of queries that represent 90% of the workload and, with
judicious use of clustered indexes and covering indexes, solves the queries without expensive bookmark
lookup operations.
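For example, a covering index might be sketched as follows, reusing the hypothetical order table from earlier; the key columns support the search predicate, and the INCLUDE columns carry everything the query selects, so no bookmark lookup is needed.

-- A hypothetical covering index for queries that fetch a customer's orders.
CREATE NONCLUSTERED INDEX IX_Order_CustomerID_OrderDate
    ON dbo.[Order] (CustomerID, OrderDate)
    INCLUDE (OrderTotal, ShipDate);

-- This query can be answered entirely from the index above.
SELECT OrderDate, OrderTotal, ShipDate
FROM dbo.[Order]
WHERE CustomerID = 42
  AND OrderDate >= '20240101';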
An elegant physical schema, well-written set-based queries, and excellent indexing reduce transaction
duration, which implicitly improves concurrency and sets up the database for scalability.
Nevertheless, indexes cannot overcome the performance difficulties of iterative code. Poorly written SQL
code that returns unnecessary columns is much more difficult to index and will likely not take
advantage of covering indexes. Moreover, it’s extremely difficult to properly index an overly complex or
non-normalized physical schema.
Concurrency
SQL Server, as an ACID-compliant database engine, supports transactions that are atomic, consistent,
isolated, and durable. Whether the transaction is a single statement or an explicit transaction within
BEGIN TRAN...COMMIT TRAN statements, locks are typically used to prevent one transaction from
seeing another transaction’s uncommitted data. Transaction isolation is great for data integrity, but
locking and blocking hurt performance.
Multi-user concurrency can be tuned by limiting the extraneous code within logical transactions, setting
the transaction isolation level no higher than required, keeping trigger code to a minimum, and perhaps
using snapshot isolation.
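As a rough sketch of those options, using a hypothetical Sales database and the hypothetical dbo.Product table, the isolation settings might be applied like this.

-- Allow snapshot isolation, and optionally make READ COMMITTED use row
-- versioning so readers and writers no longer block each other.
-- (Switching on READ_COMMITTED_SNAPSHOT requires no other active connections.)
ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE Sales SET READ_COMMITTED_SNAPSHOT ON;

-- Within a session, keep the isolation level no higher than required and
-- keep the logical transaction short and tight.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    UPDATE dbo.Product
        SET Price = Price * 1.10
        WHERE ProductID = 42;
COMMIT TRAN;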
A database with an excellent physical schema, well-written set-based queries, and the right set of indexes
will have tight transactions and perform well with multiple users.
When a poorly designed database displays symptoms of locking and blocking issues, no amount of
transaction isolation level tuning will solve the problem. The sources of the concurrency issue are
the long transactions and additional workload caused by the poor database schema, lack of set-based
queries, or missing indexes. Concurrency tuning cannot overcome the deficiencies of a poor database
design.
Advanced scalability
With each release, Microsoft has consistently enhanced SQL Server for the enterprise. These technologies
can enhance the scalability of heavy-transaction databases.
The Resource Governor, new in SQL Server 2008, can restrict the resources available for different sets of
queries, enabling the server to maintain the service-level agreement (SLA) for some queries at the expense
of other less critical queries.
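A minimal Resource Governor sketch might look like the following, using hypothetical pool, group, and login names: reporting sessions are capped so critical OLTP queries keep their CPU and memory.

USE master;
GO
CREATE RESOURCE POOL ReportingPool
    WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);

CREATE WORKLOAD GROUP ReportingGroup
    USING ReportingPool;
GO

-- The classifier function (which must live in master) routes sessions from
-- the hypothetical ReportUser login into the reporting workload group.
CREATE FUNCTION dbo.fnClassifyWorkload() RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group SYSNAME = N'default';
    IF SUSER_SNAME() = N'ReportUser'
        SET @group = N'ReportingGroup';
    RETURN @group;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifyWorkload);
ALTER RESOURCE GOVERNOR RECONFIGURE;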
Indexed views were introduced in SQL Server 2000. They actually materialize the view as a clustered
index and can enable queries to select from joined data without hitting the joined tables, or to
pre-aggregate data. In effect, an indexed view is a custom covering index that can cover across multiple
tables.
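A minimal sketch, reusing the hypothetical customer and order tables from earlier, might pre-aggregate sales by customer and day; the unique clustered index is what materializes the view.

-- The view must be schema-bound, use two-part names, and include COUNT_BIG(*)
-- because it aggregates; the summed column is assumed to be NOT NULL.
CREATE VIEW dbo.vSalesByCustomerDay
WITH SCHEMABINDING
AS
SELECT  c.CustomerName,
        o.OrderDate,
        COUNT_BIG(*)      AS OrderCount,
        SUM(o.OrderTotal) AS TotalSales
FROM dbo.[Order] AS o
JOIN dbo.Customer AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerName, o.OrderDate;
GO

-- The unique clustered index materializes the aggregated result.
CREATE UNIQUE CLUSTERED INDEX IVX_vSalesByCustomerDay
    ON dbo.vSalesByCustomerDay (CustomerName, OrderDate);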
Partitioned tables can automatically segment data across multiple filegroups, which can serve as an
auto-archive device. By reducing the size of the active data partition, the requirements for maintaining the
data, such as defragging the indexes, are also reduced.
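A rough sketch of partitioning an order history table by year might look like this, with hypothetical filegroup names; older partitions can then live on cheaper storage or be switched out for archiving.

-- Partition function: boundary values split the data by calendar year.
CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20230101', '20240101', '20250101');

-- Partition scheme: maps the four resulting partitions to filegroups
-- (ArchiveFG and CurrentFG are hypothetical and must already exist).
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear
    TO (ArchiveFG, ArchiveFG, CurrentFG, CurrentFG);

-- The table is created on the partition scheme, keyed by the partition column.
CREATE TABLE dbo.OrderHistory (
    OrderID    INT      NOT NULL,
    OrderDate  DATETIME NOT NULL,
    OrderTotal MONEY    NOT NULL,
    CONSTRAINT PK_OrderHistory PRIMARY KEY (OrderID, OrderDate)
) ON psOrderYear (OrderDate);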
Service Broker can collect transactional data and process it after the fact, thereby providing an
‘‘over-time’’ load leveling as it spreads a five-second peak load over a one-minute execution without delaying
the calling transaction.
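At its simplest, the pattern might be sketched like this, with hypothetical object names and assuming Service Broker is enabled for the database: the calling transaction only sends a message, and a separate process drains the queue at its own pace.

-- One-time setup: a message type, contract, queue, and service.
CREATE MESSAGE TYPE OrderMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT OrderContract (OrderMessage SENT BY INITIATOR);
CREATE QUEUE dbo.OrderQueue;
CREATE SERVICE OrderService ON QUEUE dbo.OrderQueue (OrderContract);
GO

-- Inside the calling transaction: enqueue the work and return immediately.
DECLARE @dialog UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @dialog
    FROM SERVICE OrderService
    TO SERVICE 'OrderService'
    ON CONTRACT OrderContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @dialog
    MESSAGE TYPE OrderMessage (N'<Order ID="42" />');
GO

-- Later, a background task processes the backlog without holding up callers.
RECEIVE TOP (10) conversation_handle, message_type_name, message_body
FROM dbo.OrderQueue;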
While these high-scalability features can extend the scalability of a well-designed database, they are
limited in their ability to add performance to a poorly designed database, and they cannot overcome long
transactions caused by a lack of indexes, iterative code, or the many other problems caused by an
overly complex database design.
The database component is the principal factor determining the overall monetary cost of the database. A
well-designed database minimizes hardware costs, simplifies data access code and maintenance jobs, and
significantly lowers both the initial and the total cost of the database system.
A performance framework
By describing the dependencies between the schema, queries, indexing, transactions, and scalability,
Smart Database Design provides a framework for performance.
The key to mastering Smart Database Design is understanding the interaction, or cause-and-effect
relationship, between these hierarchical layers (schema, queries, indexing, concurrency). Each layer
enables the next layer; conversely, no layer can overcome deficiencies in lower layers. The practical
application of Smart Database Design takes advantage of these dependencies when developing or
optimizing a database by employing the right best practices within each layer to support the next layer.
Reducing the aggregate workload of the database component has a positive effect on the rest of the
database system. An efficient database component reduces the performance requirements of the server
platform, increasing capacity. Maintenance jobs are easier to plan and also execute faster when the
database component is designed well. There is less client access code to write, and the code that needs
to be written is easier to write and maintain. The result is an overall database system that’s simpler to
maintain, cheaper to run, easier to connect to from the data access layer, and that scales beautifully.
Although it’s not a perfect analogy, picturing a water fountain on a hot summer day can help
demonstrate how shorter transactions improve overall database performance. If everyone takes a small, quick
sip from the fountain, then no queue forms; but as soon as someone fills up a liter-sized Big Gulp cup,
others begin to wait. Regardless of the amount of hardware resources available to a database, time is
finite, and the greatest performance gain is obtained by eliminating the excess work of wastefully long
transactions, or throwing away the Big Gulp cup.
The quick sips of a well-designed query hitting an elegant, properly indexed database will outperform
and be significantly easier on the budget than the Big Gulp cup, with its poorly written query or cursor,
on a poorly designed database missing an index.
Striving for database design excellence is a smart business move with an excellent estimated return on
investment. From my experience, every day spent on database design saves two to three months of
development and maintenance time. In the long term, it’s far cheaper to design the database correctly
than to throw money or labor at project overruns or hardware upgrades.
The cause-and-effect relationship between the layers helps diagnose performance problems as well.
When a system is experiencing locking and blocking problems, the cause is likely found in the indexing
or query layers. I’ve seen databases that were drowning under the weight of poorly written code.
However, the root cause wasn’t the code; it was the overly complex, anti-normalized database design
that was driving the developers to write horrid code.
The bottom line? Designing an elegant database schema is the first step in maximizing the performance
of the overall database system, while reducing costs.
Issues and objections
I’ve heard objections to the Smart Database Design framework, and I’d like to address them here. Some
say that buying more hardware is the best way to improve performance. I disagree. More hardware
only masks the problem until it explodes later. Performance problems tend to grow exponentially as
database size grows, whereas hardware performance grows more or less linearly over time. One can almost
predict when even the ‘‘best’’ hardware available no longer suffices to get acceptable performance. In
several cases, I’ve seen companies spend incredible amounts to upgrade their hardware, and they saw
little or no improvement because the bottleneck was the transaction locking and blocking and poor
code. Sometimes, a faster CPU only waits faster. Strategically, reducing the workload is cheaper than
increasing the capacity of the hardware.
Some claim that fixing one layer can overcome deficiencies in lower layers. It’s true that a poor schema
will perform better when properly indexed than without indexes. However, adding the indexes doesn’t
really solve the deficiencies; it only masks them. The code is still doing extra work to
compensate for the poor schema. The cost of developing code and designing correct indexes is still higher for
the poor schema. Any data integrity or extensibility risks are still there.
Some argue that they would like to apply Smart Database Design but they can’t because the database is
a third-party database and they can’t modify the schema or the code. True, for most third-party
products, the database schema and queries are not open for optimization, and this can be very frustrating if
the database needs optimization. However, most vendors are interested in improving their product and
keeping their clients happy. Both clients and vendors have contracted with me to help identify areas of
opportunity and suggest solutions for the next revision.
Some say they’d like to apply Smart Database Design but they can’t because any change to the schema
would break hundreds of other objects. It’s true — databases without abstraction layers are expensive
to alter. An abstraction layer decouples the database from the client applications, making it possible
to change the database component without affecting the client applications. In the absence of a
well-designed abstraction layer, the first step toward gaining system performance is to create one. As
expensive as it may seem to refactor the database and every application so that all communications
go through an abstraction layer, the cost of not doing so could very well be that IT can’t respond to
the organization’s needs, forcing the company to outsource or develop wasteful extra databases. At the
worst, the failure of the database to be extensible could force the end of the organization.
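In practice, an abstraction layer is usually little more than views and stored procedures standing between the client code and the tables. Here is a minimal sketch with hypothetical names; client applications call only the view and the procedure, so the underlying table can be refactored freely.

-- Clients read through the view rather than from the table.
CREATE VIEW dbo.vCustomer
AS
SELECT CustomerID, CustomerName, CreditLimit
FROM dbo.Customer;
GO

-- Clients write through a stored procedure rather than with ad hoc UPDATEs.
CREATE PROCEDURE dbo.pCustomer_SetCreditLimit
    @CustomerID  INT,
    @CreditLimit MONEY
AS
SET NOCOUNT ON;
UPDATE dbo.Customer
    SET CreditLimit = @CreditLimit
    WHERE CustomerID = @CustomerID;
GO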
In both cases, the third-party database and the missing abstraction layer, it’s still a good idea to optimize
at the lowest level possible, and then move up the layers; but the best performance gains are made when
you can start optimizing at the lowest level of the database component, the physical schema.
Some say that a poorly designed database can be solved by adding more layers of code and converting the
database to an SOA-style application. I disagree. The database should be refactored with a clean, normalized
design and a proper abstraction layer. This will reduce the overall workload and solve a host of usability
and performance issues much better than simply wrapping a poorly designed database with more code.
Summary
When introducing the optimization chapter in her book Inside SQL Server 2000, Kalen Delaney correctly
writes that optimization can’t be added to a database after it has been developed; it has to be designed
into the database from the beginning.
This chapter presented the concept of the Information Architecture Principle, unpacked the six database
objectives, and then discussed the Smart Database Design, showing the dependencies between the layers
and how each layer enables the next layer.
In a chapter packed with ideas, I’d like to highlight the following:
■ The database architect position should be equally involved in the enterprise-level design and
the project-level designs.
■ Any database design or implementation can be measured by six database objectives: usability,
extensibility, data integrity, performance, availability, and security. These objectives don’t have
to compete — it’s possible to design an elegant database that meets all six objectives.
■ Each day spent on the database design will save three months later.
■ Extensibility is the most expensive database objective to correct after the fact. A brittle
database — one that has ad hoc SQL directly accessing the table from the client — is the
worst design possible. It’s simply impossible to fix a clumsy database design by throwing code
at it.
■ Smart Database Design is the premise that an elegant physical schema makes the data intuitively
obvious and enables writing great set-based queries that respond well to indexing. This in turn
creates short, tight transactions, which improves concurrency and scalability while reducing the
aggregate workload of the database. This flow from layer to layer becomes a methodology for
designing and optimizing databases.
■ Reducing the aggregate workload of the database has a greater positive effect than buying more
hardware.
From this overview of data architecture, the next chapter digs deeper into the concepts and patterns of
relational database design, which are critical for usability, extensibility, data integrity, and performance.