Abbreviated result when executed in the AdventureWorks database:

1509580416   Person   Person   2   IX_Person_LastName   NONCLUSTERED   1   NONE
Estimating data compression
Because every object can yield a different compression ratio, it's useful to have some idea of how much compression is possible before actually performing the compression. Toward this end, SQL Server 2008 includes the ability to pre-estimate the potential data reduction of data compression using the sp_estimate_data_compression_savings system stored procedure.
Specifically, this system stored procedure will copy 5% of the data to be compressed into tempdb and compress it. The 5% is not a random sample but every twentieth page, so it should give consistent results:
EXEC sp_estimate_data_compression_savings
    @schema_name = 'Production',
    @object_name = 'BillOfMaterials',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'page';
The result displays the following columns for each object (base table and index):

■ object_name
■ schema_name
■ index_id
■ partition_number
■ size_with_current_compression_setting(KB)
■ size_with_requested_compression_setting(KB)
■ sample_size_with_current_compression_setting(KB)
■ sample_size_with_requested_compression_setting(KB)
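To compare the current and requested sizes programmatically rather than eyeball them, the procedure's result set can be captured with INSERT...EXEC. The following is a minimal sketch, assuming a temp table whose column count and types mirror the output above (the temp table and its column names are illustrative):

CREATE TABLE #CompressionEstimate (
    object_name SYSNAME,
    schema_name SYSNAME,
    index_id INT,
    partition_number INT,
    current_size_KB BIGINT,
    requested_size_KB BIGINT,
    sample_current_KB BIGINT,
    sample_requested_KB BIGINT
);

INSERT #CompressionEstimate
    EXEC sp_estimate_data_compression_savings
        @schema_name = 'Production',
        @object_name = 'BillOfMaterials',
        @index_id = NULL,
        @partition_number = NULL,
        @data_compression = 'page';

-- Projected savings as a percentage of the current size
SELECT object_name, index_id,
    100.0 - (requested_size_KB * 100.0
             / NULLIF(current_size_KB, 0)) AS SavingsPct
FROM #CompressionEstimate;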
The Data Compression Wizard, shown in Figure 67-4, uses this same system stored procedure to estimate the compression. Select the type of compression to estimate and press the Calculate button.
Enabling data compression
Data compression alters the structure of the data on the disk, so it makes sense that data compression is enabled using a CREATE or ALTER statement.
Using the UI, the only way to adjust an object's data compression is by using the same Data Compression Wizard used previously to estimate the compression gain.
FIGURE 67-4
The Data Compression Wizard will estimate the compression ratio and apply the selected type of data compression.
With T-SQL, compression may be initially set when the object is created by adding the data compression setting to the CREATE statement with the following option:
WITH (DATA_COMPRESSION = [none, row, or page])
Use the following to create a new table with row compression:
CREATE TABLE CTest (col1 INT, Col2 CHAR(100))
WITH (Data_Compression = Row);
To change the compression setting for an existing object, use the ALTER statement:
ALTER object REBUILD
    WITH (DATA_COMPRESSION = [none, row, or page])
For instance, the following code changes the BillOfMaterials table to page compression:
ALTER TABLE Production.BillOfMaterials
    REBUILD WITH (DATA_COMPRESSION = PAGE);
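Note that each index carries its own compression setting; rebuilding the table does not change its nonclustered indexes. A hedged example using ALTER INDEX (the index name here is hypothetical):

ALTER INDEX IX_BillOfMaterials_Example -- hypothetical index name
    ON Production.BillOfMaterials
    REBUILD WITH (DATA_COMPRESSION = PAGE);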
Whole Database Compression
I'm a big fan of data compression, so I've expended some effort in trying to make compression more accessible to the busy DBA by creating two stored procedures that automate estimating and applying data compression for the whole database.
The first stored procedure, db_compression_estimate, estimates the row and page compression gain for every object and index in the database. For AdventureWorks2008 on my VPC it runs in about 2:35, producing row and page compression estimates for every object and index.
The db_compression (@minCompression) stored procedure automatically compresses using a few intelligent choices: It checks the size of the object and the current compression setting, and compares it to potential row and page compression gains. If the object is eight pages or less, no compression is applied. For larger objects, the stored procedure calls sp_estimate_data_compression_savings to estimate the savings with row and page compression. If the estimated gain is equal to or greater than the @minCompression parameter (default 25%), it enables row or page compression, whichever offers the greater gain. If row and page have the same gain, then it enables row compression.
If the estimated gain is less than the @minCompression parameter, then it alters the object to set compression to none.
If the stored procedure is rerun and the gains have changed, it will change the object to the compression method (or no compression) that is now the recommended option.
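The core of that decision reduces to a simple comparison. The following sketch illustrates the logic only; it is not the actual procedure code, and the variable names and sample values are hypothetical:

DECLARE @minCompression FLOAT = 0.25,  -- default minimum gain: 25%
        @rowGain        FLOAT = 0.30,  -- estimated row-compression gain
        @pageGain       FLOAT = 0.42;  -- estimated page-compression gain

IF @rowGain < @minCompression AND @pageGain < @minCompression
    PRINT 'REBUILD WITH (DATA_COMPRESSION = NONE)';
ELSE IF @rowGain >= @pageGain  -- a tie goes to the cheaper row compression
    PRINT 'REBUILD WITH (DATA_COMPRESSION = ROW)';
ELSE
    PRINT 'REBUILD WITH (DATA_COMPRESSION = PAGE)';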
The db_compression_estimate and db_compression stored procedures may be downloaded from www.sqlserverbible.com or codeplex.com. This is the first version of these stored procedures; check back or watch my blog for any updates.
Data compression strategies
Data compression is new to SQL Server and at this early stage, applying compression is more an art than a science. With this in mind, here are my recommendations on how to best use data compression:
1. Establish a performance baseline.
2. Run the db_compression stored procedure.
3. If specific procedures or queries run noticeably slower, decide, on a case-by-case basis, if the space savings and I/O reduction are worth the performance hit, and adjust compression as needed.
4. Carefully monitor the use of data compression on high-transaction tables, in case the CPU overhead exceeds the I/O performance gains.
In practice I've seen row compression alone offer disk space gains of up to 50%, but sometimes it actually increases the size of the data. Seldom does row compression alone beat page compression, but they often provide the same result. When row compression and page compression offer the same compression ratio, it's better to apply only row compression and save the CPU from having to perform the additional page compression.
For small lookup tables that are frequently accessed by queries, use row compression but avoid page compression; the CPU overhead versus the compression benefit isn't worth it in this case.
If the object is partitioned using partition tables (covered in the next chapter), carefully consider data compression on a per-partition basis, especially for sliding window-style partitioning.
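For example, a single partition can be rebuilt with its own compression setting, which suits sliding window scenarios where only the older, colder partitions are page compressed. A sketch assuming a hypothetical partitioned table:

ALTER TABLE dbo.SalesHistory   -- hypothetical partitioned table
    REBUILD PARTITION = 3      -- rebuild only the third partition
    WITH (DATA_COMPRESSION = PAGE);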
Summary
Data compression is the sleeper feature of SQL Server 2008. With both row compression and page compression, including both prefix and dictionary compression, SQL Server offers the granularity to tune data compression. Using data compression carefully, you'll be able to push the envelope for an I/O-bound, high-transaction database.
The next chapter continues the thread of technologies used for highly scalable database design with a look at several types of partitioning.
IN THIS CHAPTER
■ Scaling out with multiple tables and multiple servers
■ Distributed partition views
■ Table partitioning
■ Custom partitioning design
Divide and conquer. Dividing a terabyte table can be as effective as dividing an enemy tank division or dividing the opposing political party.
Dividing data brings several benefits:
■ It's significantly easier to maintain, back up, and defragment a divided data set.
■ The divided data sets mean smaller indexes, fewer intermediate pages, and faster performance.
■ The divided data sets can reside on separate physical servers, thus scaling out, lowering costs, and improving performance.
However, dividing, or partitioning, data has its own set of problems to conquer. E. F. Codd recognized the potential issues with physical partitioning of data in October 1985 in his famous "Is Your DBMS Really Relational?" article, which outlined 12 rules, or criteria, for a relational database. Rule 11 specifically deals with partitioned data:
Rule 11: Distribution independence
The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:
1. when a distributed version of the DBMS is first introduced; and
2. when existing distributed data are redistributed around the system.
In layperson's terms, rule 11 says that if the complete set of data is spread over multiple tables or multiple servers, then the software must be able to search for any piece of that data regardless of its physical location.
There are several ways to try to solve this problem. SQL Server offers a couple of technologies that handle partitioning: partitioned views and partitioned tables. And later in this chapter, I offer a design pattern that I've had some success with.
Partitioning Strategies
The partitions are most effective when the partition key is a column often used to select a range of data, so that a query has a good chance of addressing only one of the segments. For example:
■ A company manages sales from five distinct sales offices; splitting the order table by sales region will likely enable each sales region's queries to access only that region's partition.
■ A manufacturing company partitions a large activity-tracking table into several smaller tables, one for each department, knowing that each of the production applications tends to query a single department's data.
■ A financial company has several terabytes of historical data and must be able to easily query across current and old data. However, the majority of current activity deals with only the current data. Segmenting the data by era enables the current-activity queries to access a much smaller table.
Best Practice
Very large, frequently accessed tables, with data that can logically be divided horizontally for the most common queries, are the best candidates for partitioning. If the table doesn't meet these criteria, don't partition the table.
In the access of data, the greatest bottleneck is reading the data from the drive. The primary benefit of partitioning tables is that a smaller partitioned table will have a greater percentage of the table cached in memory.
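To check how much of each object is currently cached, the buffer pool can be tallied per object. This is a rough sketch that counts in-row and row-overflow pages for the current database:

SELECT OBJECT_NAME(p.object_id) AS ObjectName,
       COUNT(*) AS CachedPages   -- 8KB pages in the buffer pool
    FROM sys.dm_os_buffer_descriptors AS bd
        JOIN sys.allocation_units AS au
            ON bd.allocation_unit_id = au.allocation_unit_id
        JOIN sys.partitions AS p
            ON au.container_id = p.hobt_id
    WHERE bd.database_id = DB_ID()
        AND au.type IN (1, 3)  -- IN_ROW_DATA and ROW_OVERFLOW_DATA
    GROUP BY p.object_id
    ORDER BY CachedPages DESC;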
Partitioning can be considered from two perspectives:
■ Horizontal partitioning means splitting the table by rows. For example, if you have a large 5,000-row spreadsheet and split it so that rows 1 through 2,500 remain in the original spreadsheet and rows 2,501 through 5,000 move to a new additional spreadsheet, that move would illustrate horizontal partitioning.
■ Vertical partitioning splits the table by columns, segmenting some columns into a different table. Sometimes this makes sense from a logical modeling point of view, if the vertical partitioning segments columns that belong only to certain subtypes. But strictly speaking, vertical partitioning is less common and not considered a best practice.
All the partitioning methods discussed in this chapter involve horizontal partitioning.
A Brief History of SQL Server Partitioning
Microsoft introduced partitioned views and distributed partitioned views with SQL Server 2000 and improved their performance with SQL Server 2005, but the big news regarding partitioning in SQL Server 2005 was the new partitioned tables.
SQL Server 2008 doesn't change the feature set or syntax for partitioned views or partitioned tables, but the new version significantly improves how the Query Processor uses parallelism with partitioned tables.
Considerable research is still ongoing regarding SQL Server scale-out and partitioning. Microsoft has already publicly demonstrated Synchronicity, an incredible scale-out middle-layer technology for SQL Server.
Partitioned Views
Of the possible ways to partition data using SQL Server, the most straightforward solution is partitioned views.
To partition a view is to split the table into two or more smaller separate tables based on a partition key and then make the data accessible, meeting Codd's eleventh rule, using a view. The individual tables can all exist on the same server, making them local partitioned views.
With the data split into several partition tables, of course, each individual table may be directly queried. A more sophisticated and flexible approach is to access the whole set of data by querying a view that unites all the partition tables; this type of view is called a partitioned view.
The SQL Server query processor is designed specifically to handle such a partitioned view. If a query accesses the union of all the partition tables, the query processor will retrieve data only from the required partition tables.
A partitioned view not only handles selects; data can be inserted, updated, and deleted through the partitioned view. The query processor will engage only the individual table(s) necessary.
SQL Server supports two types of partition views: local and distributed.
■ A local-partition view unites data from multiple local partition tables on a single server.
■ A distributed-partition view, also known as a federated database, spreads the partition tables across multiple servers and connects them using linked servers and views that include distributed queries.
The individual tables underneath the partitioned view are called partition tables, not to be confused with partitioned tables, a completely different technology covered in the next major section of this chapter.
Local-partition views
Local-partition views access only local tables. For a local-partition view to be configured, the following elements must be in place (see the sketch after this list):
■ The data must be segmented into multiple tables according to a single column, known as the partition key.
■ Each partition table must have a check constraint restricting the partition-key data to a single value. SQL Server uses the check constraint to determine which tables are required by a query.
■ The partition key must be part of the primary key.
■ The partition view must include a union statement that pulls together data from all the partition tables.
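Here is a minimal sketch showing all four elements together; the SalesNorth and SalesSouth tables are illustrative, not from the sample databases:

-- Two partition tables; the check constraint on the partition key
-- tells the query processor which table holds which rows
CREATE TABLE dbo.SalesNorth (
    Region CHAR(5) NOT NULL CHECK (Region = 'North'),
    SaleID INT NOT NULL,
    Amount MONEY NOT NULL,
    PRIMARY KEY (Region, SaleID)  -- partition key is part of the PK
);
CREATE TABLE dbo.SalesSouth (
    Region CHAR(5) NOT NULL CHECK (Region = 'South'),
    SaleID INT NOT NULL,
    Amount MONEY NOT NULL,
    PRIMARY KEY (Region, SaleID)
);
go
-- The partitioned view unites the partition tables
CREATE VIEW dbo.SalesAll
AS
SELECT Region, SaleID, Amount FROM dbo.SalesNorth
UNION ALL
SELECT Region, SaleID, Amount FROM dbo.SalesSouth;
go
-- Only dbo.SalesSouth is read for this query
SELECT SaleID, Amount FROM dbo.SalesAll WHERE Region = 'South';
-- Inserts route to the correct partition table via the check constraints
INSERT dbo.SalesAll (Region, SaleID, Amount) VALUES ('North', 1, 100);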
Segmenting the data
To implement a partitioned-view design for a database and segment the data in a logical fashion, the first step is to move the data into the partitioned tables.
As an example, the Order and OrderDetail tables in the OBXKites sample database can be partitioned by sales location. In the sample database, the data breaks down as follows:
SELECT LocationCode, Count(OrderNumber) AS Count
    FROM Location
        JOIN [Order]
            ON [Order].LocationID = Location.LocationID
    GROUP BY LocationCode;
Result:
To partition the sales data, the Order and OrderDetail tables will be split into a table for each location. The first portion of the script creates the partition tables. They differ from the original tables only in the primary-key definition, which becomes a composite primary key consisting of the original primary key and the LocationCode. In the OrderDetail table the LocationCode column is added so it can serve as the partition key, and the OrderID column foreign-key constraint points to the partition table.
The script then progresses to populating the tables from the non-partitioned tables. To select the correct OrderDetail rows, the table needs to be joined with the OrderCH table; a sketch of this population step appears after the table definitions below.
For brevity's sake, only the Cape Hatteras (CH) location is shown here. The chapter's sample code script includes similar code for the Jockey Ridge and Kill Devil Hills locations. The differences between the partition table and the original tables, and the code that differs among the various partitions, are shown in bold:
-- Order Table
CREATE TABLE dbo.OrderCH (
LocationCode CHAR(5) NOT NULL,
OrderID UNIQUEIDENTIFIER NOT NULL -- Not PK
ROWGUIDCOL DEFAULT (NEWID()),
OrderNumber INT NOT NULL,
ContactID UNIQUEIDENTIFIER NULL
FOREIGN KEY REFERENCES dbo.Contact,
OrderPriorityID UNIQUEIDENTIFIER NULL
FOREIGN KEY REFERENCES dbo.OrderPriority,
EmployeeID UNIQUEIDENTIFIER NULL
FOREIGN KEY REFERENCES dbo.Contact,
LocationID UNIQUEIDENTIFIER NOT NULL
FOREIGN KEY REFERENCES dbo.Location,
OrderDate DATETIME NOT NULL DEFAULT (GETDATE()),
Closed BIT NOT NULL DEFAULT (0) -- set to true when Closed
)
ON [Primary]
go
-- PK
ALTER TABLE dbo.OrderCH
ADD CONSTRAINT
PK_OrderCH PRIMARY KEY NONCLUSTERED
(LocationCode, OrderID)
-- Check Constraint
ALTER TABLE dbo.OrderCH
ADD CONSTRAINT
OrderCH_PartitionCheck CHECK (LocationCode = 'CH')
go
-- Order Detail Table
CREATE TABLE dbo.OrderDetailCH (
LocationCode CHAR(5) NOT NULL,
OrderDetailID UNIQUEIDENTIFIER NOT NULL -- Not PK
ROWGUIDCOL DEFAULT (NEWID()),
OrderID UNIQUEIDENTIFIER NOT NULL, -- Not FK
ProductID UNIQUEIDENTIFIER NULL
FOREIGN KEY REFERENCES dbo.Product,
NonStockProduct NVARCHAR(256),
Quantity NUMERIC(7,2) NOT NULL,
UnitPrice MONEY NOT NULL,
ExtendedPrice AS Quantity * UnitPrice,
ShipRequestDate DATETIME,
ShipDate DATETIME,
ShipComment NVARCHAR(256)
)
ON [Primary]
go
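The population step described earlier might look like the following sketch. It assumes the original OBXKites Order and OrderDetail columns match the partition tables (ExtendedPrice is computed, so it is not inserted):

-- Populate the CH partition tables from the original tables
INSERT dbo.OrderCH (LocationCode, OrderID, OrderNumber, ContactID,
        OrderPriorityID, EmployeeID, LocationID, OrderDate, Closed)
    SELECT 'CH', O.OrderID, O.OrderNumber, O.ContactID,
            O.OrderPriorityID, O.EmployeeID, O.LocationID, O.OrderDate, O.Closed
        FROM dbo.[Order] AS O
            JOIN dbo.Location AS L
                ON O.LocationID = L.LocationID
        WHERE L.LocationCode = 'CH';

-- OrderDetail rows are selected by joining through OrderCH
INSERT dbo.OrderDetailCH (LocationCode, OrderDetailID, OrderID, ProductID,
        NonStockProduct, Quantity, UnitPrice, ShipRequestDate, ShipDate, ShipComment)
    SELECT 'CH', OD.OrderDetailID, OD.OrderID, OD.ProductID,
            OD.NonStockProduct, OD.Quantity, OD.UnitPrice,
            OD.ShipRequestDate, OD.ShipDate, OD.ShipComment
        FROM dbo.OrderDetail AS OD
            JOIN dbo.OrderCH AS OCH
                ON OD.OrderID = OCH.OrderID;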