Indexes only become useful as they serve the needs of a query, so designing indexes means thinking
about how the query will navigate the indexes to reach the data. “Zen and the Art of Indexing”
means that you see the query path in your mind’s eye and design the shortest path from the query to
the data.
What’s New With Indexes?
Indexing is critical to SQL Server performance, and Microsoft has steadily invested in SQL Server’s indexing
capabilities. Back in SQL Server 2005, my favorite new feature was included columns for non-clustered
indexes, which made non-clustered indexes more efficient as covering indexes.
With SQL Server 2008, Microsoft has again added several significant new indexing features.
A filtered index is a non-clustered index that indexes only a subset of the rows. This
is perfect for situations such as a manufacturing orders table in which only 2% of the orders are active.
The new star-join optimization uses bitmap filters to deliver performance gains of up to seven times when
joining a single fact table with several lookup (dimension) tables.
The new FORCESEEK table hint, as the name implies, forces the Query Optimizer to choose a seek operation
instead of a scan.
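As a rough sketch of two of these features, the following script creates a filtered index and then runs a query with the FORCESEEK hint. It assumes the AdventureWorks2008 sample database and its Production.WorkOrder table (described later in this chapter); the index name, the choice of DueDate as the key, and the ProductID value are illustrative assumptions, not part of the sample database.

```sql
USE AdventureWorks2008;
GO
-- Hypothetical filtered index: it contains only work orders that are still
-- open (EndDate IS NULL), so it stays small as completed orders accumulate.
CREATE NONCLUSTERED INDEX IX_WorkOrder_Open_DueDate   -- illustrative name
    ON Production.WorkOrder (DueDate)
    WHERE EndDate IS NULL;
GO
-- FORCESEEK tells the Query Optimizer to reach the table through an index seek.
SELECT WorkOrderID, ProductID, OrderQty
FROM Production.WorkOrder WITH (FORCESEEK)
WHERE ProductID = 757;                                -- arbitrary sample value
GO
-- Remove the illustrative index so the sample database is left unchanged.
DROP INDEX IX_WorkOrder_Open_DueDate ON Production.WorkOrder;
```

Table hints override the optimizer's cost-based choice, so a hint like this is usually best kept for cases where the plan has been examined and the seek is known to be cheaper.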
Indexing Basics
You can’t master indexing without a solid understanding of how indexes work. Please don’t skip this
section. To apply the strategies described later in this chapter, you must grok the b-tree.
The b-tree index
Conventional wisdom says that SQL Server has two types of indexes: clustered and non-clustered; but a
closer look reveals that SQL Server has in fact only one type of index: the b-tree, or balanced tree, index,
because internally both clustered and non-clustered indexes are b-tree indexes.
B-tree indexes exist on index pages and have a root level, one or more intermediate levels, and a leaf or
node level. The columns actually sorted by the b-tree index are called the index’s key columns, as shown
in Figure 64-1. The difference between clustered and non-clustered indexes is the amount and type of
data stored at the leaf level.
While this chapter discusses the strategies of designing and optimizing indexes and does include some
code examples that demonstrate creating indexes, the sister Chapter 20,
“Creating the Physical Database Schema,” details the actual syntax and Management Studio methods of
creating indexes.
Over time, indexes typically become fragmented, which significantly hurts performance. For more
information on index maintenance, turn to Chapter 42, “Maintaining the Database.”
After you’ve read this chapter, I highly recommend digging deeper into the internals of SQL Server’s
indexes with my favorite SQL Server book, Kalen Delaney’s SQL Server 2008 Internals (Microsoft
Press, 2009).
FIGURE 64-1
The b-tree index is the most basic element of SQL Server. This figure illustrates a simplified view of a
clustered index with an identity column as the clustered index key. The first name is the data column.
[Figure: Balanced Tree Index. Root level: key ranges 1-6 and 7-12. Intermediate level: key ranges 1-3,
4-6, 7-9, and 10-12. Leaf level (key column, data column): 1 Matt, 2 Paul, 3 Beth, 4 Nick, 5 Steve,
6 Zack, 7 Tom, 8 Hank, 9 Greg, 10 Susan, 11 Albert, 12 Ingrid.]
Clustered indexes
In SQL Server, when all the data columns are attached to the b-tree index’s leaf level, it’s called a
clustered index, and some might call it a table or base table (refer to Figure 64-1). A clustered index is
often called the physical sort order of the table, which is mostly, or at least logically, true.
Logically, the clustered index pages will have the data in the clustered index sort order; but physically,
on the disk, those pages are a linked list in which each page links to the next page and the previous
page. In a perfect world the pages would be in the same order as the list, but in reality they are often
moved around due to page splits and fragmentation (more on page splits later in this chapter). In this
case, the links probably jump around a bit.
A table may have only one physical sort order and, therefore, only one clustered index. The
quintessential example of a clustered index is a telephone book (the old-fashioned printed kind, not the
Internet search type). The telephone book itself is a clustered index. The last name and first name
columns are the index keys, and the rest of the data (address, phone number) is attached to the index.
A telephone book even simulates a b-tree index. Open a telephone book to the middle. Choose the side
with the name you want to find, and then split that side in half. In a few halves and splits, you’ll be at
the page with the name you’re looking for. Your eye can now quickly scan that page and find the last
name and first name you want. Because the address and phone number are printed right next to the
names, no more searching is needed.
Non-clustered indexes
SQL Server can also create non-clustered indexes, which are similar to the indexes in the back of a book.
This type of index is keyed, or sorted, by the keywords, and the page numbers are pointers to the
book’s content.
Internally, SQL Server non-clustered indexes are b-tree indexes and point to the base table, which is
either a clustered index or a heap. If the base table is a clustered index, then the clustered index keys
(every sort-by column) are included at every level of the non-clustered index b-tree and leaf level. If the
base table is a heap, then the heap RID (row ID) is used.
For example, the non-clustered index illustrated in Figure 64-2 uses the first name column as its key
column, so that’s the data sorted by the b-tree. The non-clustered index points to the base table by
including the clustered index key column. In Figure 64-2, the clustered index key column is the identity
column used in Figure 64-1.
Since SQL Server 2005, additional unsorted columns can be included in the leaf level. The employee’s
title and department columns could be added to the previous index, which is extremely useful in
designing covering indexes (described in the next section).
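A minimal sketch of such an index follows. The dbo.Employee table and every name in it are hypothetical, chosen only to mirror the identity key and first-name columns of Figures 64-1 and 64-2.

```sql
-- Hypothetical table mirroring Figures 64-1 and 64-2 (all names are illustrative).
CREATE TABLE dbo.Employee (
    EmployeeID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED,   -- clustered index key
    FirstName  NVARCHAR(50) NOT NULL,
    Title      NVARCHAR(50) NOT NULL,
    Department NVARCHAR(50) NOT NULL
);
GO
-- FirstName is the sorted key column; Title and Department are stored
-- unsorted, at the leaf level only.
CREATE NONCLUSTERED INDEX IX_Employee_FirstName
    ON dbo.Employee (FirstName)
    INCLUDE (Title, Department);
GO
-- Every column this query needs, including the clustered key EmployeeID,
-- is present in the non-clustered index, so the index can cover the query.
SELECT EmployeeID, FirstName, Title, Department
FROM dbo.Employee
WHERE FirstName = N'Beth';
```

Included columns widen only the leaf level, so the upper levels of the b-tree stay as narrow as the key itself.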
A SQL Server table may have up to 999 non-clustered indexes, but I’ve never seen a well-normalized
table that required more than a dozen well-designed indexes.
FIGURE 64-2
This simplified illustration of a non-clustered index has a b-tree index with first name as the key
column. The non-clustered index includes pointers to the clustered index key column.
[Figure: Balanced Tree Index. Callouts: Key Columns; Clustered Keys or Heap RowID; (2005) Included
Columns; (2008) Filtered. Root level: key ranges A-M and N-Z. Intermediate level: key ranges A-G, H-M,
N-St, and Su-Z. Leaf level (key column, clustered key): Albert 11, Beth 3, Greg 9, Hank 8, Ingrid 12,
Matt 1, Nick 4, Paul 2, Steve 5, Susan 10, Tom 7, Zack 6.]
Composite indexes
A composite index is a clustered or non-clustered index that is keyed, or sorted, on multiple columns.
Composite indexes are common in production.
The order of the columns in a composite index is important. In order for a search to take advantage
of a composite index, it must include the index columns from left to right. If the composite index is
lastname, firstname, a search for firstname can’t seek quickly through the b-tree index, but a
search for lastname, or lastname and firstname, will use the b-tree.
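The left-to-right rule can be sketched with a small script. The dbo.Contact table, its columns, and the index name are hypothetical; the comments describe the access method each predicate permits, and the plan actually chosen on a given server should be confirmed in the query execution plan.

```sql
-- Hypothetical table and index names, for illustration only.
CREATE TABLE dbo.Contact (
    ContactID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Contact PRIMARY KEY CLUSTERED,
    LastName  NVARCHAR(50) NOT NULL,
    FirstName NVARCHAR(50) NOT NULL
);
GO
CREATE NONCLUSTERED INDEX IX_Contact_LastName_FirstName
    ON dbo.Contact (LastName, FirstName);
GO
-- Leading key column supplied: the b-tree can be navigated with a seek.
SELECT ContactID FROM dbo.Contact
WHERE LastName = N'Nielsen';

-- Both key columns supplied, left to right: still eligible for a seek.
SELECT ContactID FROM dbo.Contact
WHERE LastName = N'Nielsen' AND FirstName = N'Paul';

-- Only the second key column supplied: the index can at best be scanned.
SELECT ContactID FROM dbo.Contact
WHERE FirstName = N'Paul';
```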
Various methods of indexing for multiple columns are examined in Query Paths 9 through
11 later in this chapter.
A similar problem is searching for words within a column but not at the beginning of the text string
stored in the column. For these word searches, SQL Server can use Integrated Full-Text Search (iFTS),
covered in Chapter 19, “Using Integrated Full-Text Search.”
Unique indexes and constraints
Because primary keys are the unique method of identifying any row, indexes and primary keys are
intertwined; in fact, a primary key must be indexed. By default, creating a primary key automatically
creates a unique clustered index, but it can optionally create a unique non-clustered index instead.
A unique index limits data to being unique so it’s like a constraint; and a unique constraint builds a
unique index to quickly check the data. In fact, a unique constraint and a unique index are the exact
same thing: creating either one builds a unique constraint/index.
The only difference between a unique constraint/index and a primary key is that a primary key cannot
allow nulls, whereas a unique constraint/index can permit a single null value.
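The single-null behavior is easy to observe with a throwaway table. This is a sketch under assumed names: dbo.Badge, its columns, and its constraint names are all hypothetical.

```sql
-- Hypothetical table; the constraint and column names are illustrative.
CREATE TABLE dbo.Badge (
    BadgeID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Badge PRIMARY KEY CLUSTERED,      -- primary key: no NULLs
    BadgeCode VARCHAR(10) NULL
        CONSTRAINT UQ_Badge_BadgeCode UNIQUE            -- builds a unique index
);
GO
INSERT INTO dbo.Badge (BadgeCode) VALUES ('A100');
INSERT INTO dbo.Badge (BadgeCode) VALUES (NULL);        -- the single NULL is accepted

BEGIN TRY
    INSERT INTO dbo.Badge (BadgeCode) VALUES (NULL);    -- a second NULL is a duplicate
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
GO
-- The unique constraint is listed in sys.indexes as a unique index.
SELECT name, type_desc, is_unique, is_unique_constraint
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.Badge');
```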
The page split problem
Every b-tree index must maintain the key column data in the correct sort order. Inserts, updates, and
deletes will affect that data. As the data is inserted or modified, if the index page to which a value needs
to be added is full, then SQL Server must split the page into two less-than-full pages so it can insert
the value in the correct position. Turning again to the telephone book example, if several new Nielsens
moved into the area and the Nie page 515 had to now accommodate 20 additions, a simulated page
split would take several steps:
1. Cut page 515 in half, making two pages; call them 515a and 515b.
2. Print out and tape the new Nielsens to page 515a.
3. Tape page 515b inside the back cover of the telephone book.
4. Make a note on page 515a that the Nie listing continues on page 515b located at the end of
the book, and a note on page 515b indicating that the listing continues on page 515a.
Page splits cause several performance-related problems:
■ The page split operation is expensive because it involves several steps and moving data. I’ve
personally seen page splits reduce an intensive insert process’ performance by 90 percent.
■ If, after the page split, there still isn’t enough room, then the page will be split again. This can
occur repeatedly depending on certain circumstances.
■ The data structure is left fragmented and can no longer be read in a single contiguous pass.
■ The data structure has more empty space, which means less data is read with every page read and less
data is stored in the buffer per page.
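The last two effects, fragmentation and half-empty pages, can be measured. The query below is a sketch that assumes the AdventureWorks2008 sample database and its Production.WorkOrder table; any table can be substituted.

```sql
USE AdventureWorks2008;
GO
-- SAMPLED mode is used because avg_page_space_used_in_percent is NULL in the
-- default LIMITED mode.
SELECT  i.name                              AS IndexName,
        ps.index_type_desc,
        ps.avg_fragmentation_in_percent,    -- pages out of logical order
        ps.avg_page_space_used_in_percent,  -- page fullness; drops after page splits
        ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'Production.WorkOrder'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

A high fragmentation percentage combined with low page fullness is the typical footprint of repeated page splits.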
Index selectivity
Another aspect of index tuning is the selectivity of the index. An index that is very selective has more
distinct index values and selects fewer data rows per index value. A primary key or unique index has
the highest possible selectivity; each index key only relates to one row.
An index with only a few distinct values spread across a large table is less selective. Indexes that are less
selective may not even be useful as indexes. A column with three values spread throughout the table is a
poor candidate for an index. A bit column has low selectivity and is rarely worth indexing on its own.
SQL Server uses its internal index statistics to track the selectivity of an index. DBCC Show_Statistics
reports the last date on which the statistics were updated, and basic information about the index
statistics, including the usefulness of the index. A low density indicates that the index is very selective.
A high density indicates that a given index node points to several table rows and that the index may be
less useful, as shown in this code sample:
Use CHA2;
DBCC Show_Statistics (Customer, IxCustomerName);
Result (formatted and abridged; the full listing includes details for every value in the index):
Statistics for INDEX ‘IxCustomerName’
Updated   Rows Sampled   Steps   Density   key length
-------   ------------   -----   -------   ----------

All density    Average Length   Columns
------------   --------------   -------------------
3.0303031E-2   6.6904764        LastName
2.3809524E-2   11.547619        LastName, FirstName

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
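The All density figure is the reciprocal of the number of distinct values, so it can be cross-checked by hand: 3.0303031E-2 is 1/33 and 2.3809524E-2 is 1/42, which corresponds to 33 distinct last names and 42 distinct last name and first name pairs. The following sketch recomputes both values directly; it assumes the Customer table in the CHA2 database has LastName and FirstName columns (as the output suggests) and contains at least one row.

```sql
USE CHA2;
GO
-- Density of the leading key column: 1 / (number of distinct LastName values).
SELECT 1.0 / COUNT(DISTINCT LastName) AS LastNameDensity
FROM Customer;

-- Density of the two-column prefix: 1 / (number of distinct name pairs).
SELECT 1.0 / COUNT(*) AS LastNameFirstNameDensity
FROM (SELECT DISTINCT LastName, FirstName FROM Customer) AS d;
```

COUNT(DISTINCT ...) ignores NULLs, so the hand-computed value may differ slightly from the stored statistics when the key columns contain NULLs or the statistics are out of date.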
Sometimes changing the order of the key columns can improve the selectivity of an index and its
performance. Be careful, however, because other queries may depend on the order for their performance.
Unordered heaps
It’s also possible to create a table without a clustered index, in which case the data is stored in an
unordered heap. Instead of being identified by the clustered index key columns, the rows are identified
internally using the heap’s RowID. The RowID is an actual physical location composed of three values,
FileID:PageNum:SlotNum, and cannot be directly queried. Any non-clustered indexes store the
heap’s RowID in all levels of the index to point to the heap instead of using the clustered index key
columns to point to the clustered index.
Because a heap does not include a clustered index, a heap’s primary key must be a non-clustered index.
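As a brief sketch, a heap results whenever a table is created without a clustered index, for example by declaring the primary key as non-clustered. The dbo.ImportLog table and its names are hypothetical.

```sql
-- Hypothetical table; names are illustrative.
CREATE TABLE dbo.ImportLog (
    ImportLogID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_ImportLog PRIMARY KEY NONCLUSTERED,  -- no clustered index is created
    LogMessage  NVARCHAR(200) NOT NULL
);
GO
-- The row with index_id 0 and type_desc 'HEAP' is the heap itself; the
-- primary key appears as a separate NONCLUSTERED index.
SELECT index_id, name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.ImportLog');
```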
Why Use Heaps?
I believe heaps add no value and nearly always require a bookmark lookup (explained in Query Path 5), so
I avoid creating heaps.
Developers who like heaps tend to be the same developers who prefer natural primary keys (as opposed to
surrogate primary keys). Natural primary keys are nearly always unordered. When natural primary keys are
used for clustered indexes they generate a lot of page splits, which kills performance. Heaps simply add new
rows at the end of the heap and they avoid the natural primary key page split problem.
Some developers claim that heaps are faster than clustered indexes for inserts. This is true only when the
clustered index is designed in a way that generates page splits. Comparing insert performance between heaps
and clustered surrogate primary keys, there is little measurable difference, or the clustered index is slightly
faster.
Heaps are organized by RIDs, or row IDs (file, page, and row). Any seek operation (detailed soon)
into a heap must use a non-clustered index and a bookmark lookup (detailed in Query Path 5 later in this
chapter).
Query operations
Although there are dozens of logical and physical query execution operations, SQL Server uses three
primary operations to actually fetch the data:
■ Table scan: Reads the entire heap and, most likely, passes all the data to a secondary filter
operation.
■ Index scan: Reads the entire leaf level (every row) of the clustered index or non-clustered
index. The index scan operation might filter the rows and return only those rows that meet the
criteria, or it might pass all the rows to another filter operation depending on the complexity
of the criteria. The data may or may not be ordered.
■ Index seek: Locates specific row(s) data using the b-tree and returns only the selected rows in
an ordered list, as illustrated in Figure 64-3.
The Query Optimizer chooses the fetch operation with the least cost. Sequentially reading the data is
a very efficient task, so an index scan and filter operation may actually be cheaper than an index seek
with a bookmark lookup (see Query Path 5 below) involving hundreds of random I/O index seeks. It’s
all about correctly guessing the number of rows touched and returned by each operation in the query
execution plan.
Path of the Query
Indexes exist to serve queries; an index by itself serves no purpose. The best way to understand how
to design efficient indexes is to observe and learn from the various possible paths queries take through
the indexes to locate data.
FIGURE 64-3
An index-seek operation navigates the b-tree index, selects a beginning row, and then scans all the
required rows.
[Figure: Clustered Index, showing a Seek down the b-tree followed by a Scan along the leaf level.]
There are twelve kata (a Japanese word for choreographed martial arts patterns or movements), or query
paths, with different combinations of indexes combined with index seeks and scans. These kata begin
with a simple index scan and progress toward more complex query paths.
Not every query path is an efficient query path. There are nine good paths, and three paths that should
be avoided.
A good test table for observing the twelve query paths in the AdventureWorks2008 database is the
Production.WorkOrder table. It has 72,591 rows, only 10 columns, and a single-column clustered
primary key. Here’s the table definition:
CREATE TABLE [Production].[WorkOrder](
    [WorkOrderID] [int] IDENTITY(1,1) NOT NULL,
    [ProductID] [int] NOT NULL,
    [OrderQty] [int] NOT NULL,
    [StockedQty] AS (isnull([OrderQty]-[ScrappedQty],(0))),
    [ScrappedQty] [smallint] NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [EndDate] [datetime] NULL,
    [DueDate] [datetime] NOT NULL,
    [ScrapReasonID] [smallint] NULL,
    [ModifiedDate] [datetime] NOT NULL,
    CONSTRAINT [PK_WorkOrder_WorkOrderID] PRIMARY KEY CLUSTERED ([WorkOrderID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];
As installed, the WorkOrder table has three indexes, each with one column as identified in the
index name:
■ PK_WorkOrder_WorkOrderID (clustered)
■ IX_WorkOrder_ProductID (non-unique, non-clustered)
■ IX_WorkOrder_ScrapReasonID (non-unique, non-clustered)
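The indexes present on a given installation can be confirmed from the catalog views. This sketch assumes the AdventureWorks2008 sample database; it lists each index on Production.WorkOrder with its type and key columns.

```sql
USE AdventureWorks2008;
GO
SELECT  i.name        AS IndexName,
        i.type_desc   AS IndexType,
        i.is_unique,
        c.name        AS KeyColumn,
        ic.key_ordinal
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id
 AND ic.index_id  = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'Production.WorkOrder')
  AND ic.is_included_column = 0          -- key columns only
ORDER BY i.index_id, ic.key_ordinal;
```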
Performance data for each kata, listed in Table 64-1, was captured by watching the TSQL ➪
SQL:StmtCompleted and Performance ➪ Showplan XML Statistics Profile events in Profiler, and
examining the query execution plan.
The key performance indicators are the query execution plan optimizer costs (Cost), and the number of
logical reads (Reads).
For the duration column, I ran each query multiple times and averaged the results. Of course, your SQL
Server machine is probably beefier than my notebook. I urge you to run the script on your own
SQL Server instance, take your own performance measurements, and study the query execution plans.
The Rows per ms column is calculated from the number of rows returned and the average duration.
Before executing each query path, the following code clears the plan cache and the data buffers:
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
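For readers without Profiler at hand, a comparable measurement can be sketched with session settings. This is an alternative to the Profiler capture described above, not the method used for Table 64-1; it assumes the AdventureWorks2008 sample database, and the query and key value are only a sample. The CHECKPOINT is an addition: DBCC DROPCLEANBUFFERS removes only clean pages, so dirty pages are written to disk first.

```sql
USE AdventureWorks2008;
GO
-- Development or test servers only: these commands flush caches for the whole instance.
CHECKPOINT;               -- write dirty pages so DROPCLEANBUFFERS can evict them
DBCC FREEPROCCACHE;       -- discard cached execution plans
DBCC DROPCLEANBUFFERS;    -- empty the buffer pool
GO
SET STATISTICS IO ON;     -- reports logical and physical reads per table
SET STATISTICS TIME ON;   -- reports CPU and elapsed time
GO
SELECT *
FROM Production.WorkOrder
WHERE WorkOrderID = 1234; -- sample query under test
GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The read counts and timings appear on the Messages tab in Management Studio.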
Query Path 1: Fetch All
The first query path sets a baseline for performance by simply requesting all the data from the base
table:
SELECT *
FROM Production.WorkOrder;
Without a WHERE clause and with every column selected, the query must read every row from the clustered
index. A clustered index scan (illustrated in Figure 64-4) sequentially reads every row.
This query is the longest-running of all the query paths, so it might seem to be a slow query, but when
comparing the number of rows returned per millisecond, the index scan returns the highest number of
rows per millisecond of any query path.
Query Path 2: Clustered Index Seek
The second query path adds a WHERE clause to the first query and filters the result to a single row using
a clustered key value:
SELECT *
FROM Production.WorkOrder
WHERE WorkOrderID = 1234;
TABLE 64-1
Query Path Performance
Path | Kata | Plan | Rows | Cost | Reads | Index | Duration (ms) | Rows per ms
-----|------|------|------|------|-------|-------|---------------|------------
1 | Fetch All | C Ix Scan | 72,591 | 0.485 | 526 | | 1,196 | 60.71
2 | Clustered Index Seek | C Ix Seek | 1 | 0.003 | 2 | | 7 | 0.14
3 | Range Seek Query (narrow) | C Ix Seek (Seek keys start-end) | | | | | |
 | Range Seek Query (wide) | C Ix Seek (Seek keys start-end) | 72,591 | 0.485 | 526 | | 1,257 | 57.73
4 | Filter by non-Key Column | C Ix Scan ➪ filter (predicate) | 55 | 0.519 | 526 | NC (include all columns) | 170 | 0.32
5 | Bookmark Lookup (Select *) | NC Ix Seek ➪ BML | 9 | 0.037 | 29 | | 226 | 0.04
 | Bookmark Lookup (Select clustered key, non-key col) | NC Ix Seek ➪ BML | 9 | 0.037 | 29 | | 128 | 0.07
6 | Covering Index (narrow) | NC Ix Seek (Seek Predicate) | | | | | |
 | Covering Index (wide) | NC Ix Seek (Seek Predicate) | 1,105 | 0.005 | 6 | | 106 | 10.46
 | NC Seek Selecting Clustered Key (narrow) | NC Ix Seek (Seek Predicate) | | | | | |
 | NC Seek Selecting Clustered Key (wide) | NC Ix Seek (Seek Predicate) | 1,105 | 0.004 | 4 | | 46 | 24.02
 | Filter by Include Column | NC Ix Seek (Seek Predicate + Predicate) | | | | | |
7 | Filter by 2 x NC Indexes | 2 x NC Ix Seek (Predicate) ➪ Merge Join | | | | | |
8 | Filter by Ordered NC Composite Index | NC Ix Seek (Seek Predicate w/ 2 prefixes) | | | | | |
9 | Filter by Unordered NC Composite Index | NC Ix Scan | 118 | 0.209 | 173 | NC by missing key, include C Key | 72 | 1.64
10 | Filter by Expression | NC Ix Scan | 9 | 0.209 | 173 | | 111 | 0.08
FIGURE 64-4
The clustered index scan sequentially reads all the rows from the clustered index.
[Figure: Clustered Index PK_WorkOrder_WorkOrderID]
The Query Optimizer has two clues that only one row meets the WHERE clause
criteria: statistics and the fact that WorkOrderID is the primary key constraint, so it must be unique.
WorkOrderID is also the clustered index key, so the Query Optimizer knows there’s a great index
available to locate a single row. The clustered index seek operation navigates the clustered index b-tree
and quickly locates the desired row, as illustrated in Figure 64-5.
FIGURE 64-5
A clustered index seek navigates the b-tree index and locates the row in a snap.
[Figure: Clustered Index PK_WorkOrder_WorkOrderID]