Because SQL is now returning information from a set, rather than building a record set of rows, as soon
as a query includes an aggregate function, every column (in the column list, in the expression, or in the
ORDER BY) must participate in an aggregate function. This makes sense because if a query returned
the total number of order sales, then it could not return a single order number on the summary row.
Because aggregate functions are expressions, the result will have a null column name. Therefore, use an
alias to name the column in the results.
To demonstrate the mathematical aggregate functions, the following query produces a SUM(), AVG(),
MIN(), and MAX() of the Amount column. SQL Server warns in the result that null values are ignored
by aggregate functions, which are examined in more detail soon:
SELECT SUM(Amount) AS [Sum], AVG(Amount) AS [Avg], MIN(Amount) AS [Min], MAX(Amount) AS [Max]
FROM RawData;
Result:
Warning: Null value is eliminated by an aggregate
or other SET operation
There’s actually more to the COUNT() function than appears at first glance. The next query exercises
four variations of the COUNT() aggregate function:
SELECT COUNT(*) AS CountStar, COUNT(RawDataID) AS CountPK, COUNT(Amount) AS CountAmount, COUNT(DISTINCT Region) AS Regions FROM RawData;
Result:
CountStar   CountPK     CountAmount Regions
----------- ----------- ----------- -----------
24          24          20          4
Warning: Null value is eliminated by an aggregate
or other SET operation
To examine this query in detail, the first column, COUNT(*), counts every row, regardless of any values
in the row. COUNT(RawDataID) counts all the rows with a non-null value in the primary key. Because
primary keys, by definition, can’t have any nulls, this column also counts every row. These two methods
of counting rows have the same query execution plan, same performance, and same result.
The third column, COUNT(Amount), demonstrates why these aggregate queries include a warning.
It counts the number of rows with an actual value in the Amount column, and it ignores any rows
with a null value in the Amount column. Because there are four rows with null amounts, this
COUNT(Amount) finds only 20 rows.
COUNT(DISTINCT Region) is the oddball of this query. Instead of counting rows, it counts the
unique values in the Region column. The RawData table data has four regions: MidWest, NorthEast,
South, and West. Therefore, COUNT(DISTINCT Region) returns 4. Note that COUNT(DISTINCT *)
is invalid; it requires a specific column.
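The four COUNT() variations can also be tried on a tiny data set where the answers are easy to check by hand. The following is only a sketch: the @CountDemo table variable, its columns, and its four rows are invented for illustration and are not part of the RawData sample. It assumes SQL Server 2008 or later for the multi-row VALUES list.

```sql
-- Hypothetical sample data; @CountDemo is illustrative only, not the RawData table.
DECLARE @CountDemo TABLE (
    DemoID INT         NOT NULL PRIMARY KEY,
    Region VARCHAR(10) NULL,
    Amount INT         NULL
);

INSERT @CountDemo (DemoID, Region, Amount)
VALUES (1, 'South', 10),
       (2, 'South', NULL),  -- the null amount is skipped by COUNT(Amount)
       (3, 'West',  30),
       (4, 'West',  40);

SELECT COUNT(*)               AS CountStar,   -- 4: every row
       COUNT(DemoID)          AS CountPK,     -- 4: a primary key is never null
       COUNT(Amount)          AS CountAmount, -- 3: one null amount is ignored
       COUNT(DISTINCT Region) AS Regions      -- 2: South and West
FROM @CountDemo;
```

Because one Amount is null, this batch should also raise the same null-elimination warning shown above.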
Aggregates, averages, and nulls
Aggregate functions ignore nulls, which creates a special situation when calculating averages. A SUM()
or AVG() aggregate function will not error out on a null, but simply skip the row with a null. For this
reason, a SUM()/COUNT(*) calculation may provide a different result from an AVG() function. The
COUNT(*) function includes every row, whereas the AVG() function might divide using a smaller count
of rows.
To test this behavior, the next query uses three methods of calculating the average amount, and each
method generates a different result:
SELECT AVG(Amount) AS [Integer Avg],
SUM(Amount) / COUNT(*) AS [Manual Avg],
AVG(CAST((Amount) AS NUMERIC(9, 5))) AS [Numeric Avg]
FROM RawData;
Result:
Integer Avg Manual Avg  Numeric Avg
----------- ----------- -----------
47          39          47.300000
The first column performs the standard AVG() aggregate function and divides the sum of the amount
(946) by the number of rows with a non-null value for the amount (20).
The SUM(Amount)/COUNT(*) calculation in column two actually divides 946 by the total number of
rows in the table (24), yielding a different answer.
The last column provides the best answer. It uses the AVG() function so it ignores null values, but it
also improves the precision of the answer. The trick is that the precision of the aggregate function is
determined by the data type precision of the source values. SQL Server first converts
the Amount values to a numeric(9,5) data type, as the CAST() requests, and then passes the values to the AVG() function.
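To see the three calculations diverge on numbers small enough to work out by hand, here is a minimal sketch. The @AvgDemo table variable and its four rows are assumptions made up for this illustration; only the three expressions come from the query above.

```sql
-- Hypothetical sample data; @AvgDemo is illustrative only.
DECLARE @AvgDemo TABLE (Amount INT NULL);

INSERT @AvgDemo (Amount)
VALUES (10), (20), (25), (NULL);  -- sum of the non-null values is 55

SELECT AVG(Amount)                        AS [Integer Avg], -- 55 / 3 non-null rows, truncated to 18
       SUM(Amount) / COUNT(*)             AS [Manual Avg],  -- 55 / 4 rows, integer division gives 13
       AVG(CAST(Amount AS NUMERIC(9, 5))) AS [Numeric Avg]  -- 55 / 3, about 18.33
FROM @AvgDemo;
```

The first two columns stay integers because both the source column and COUNT(*) are integers; only the cast version keeps the fractional part.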
Using aggregate functions within the Query Designer
When using Management Studio’s Query Designer (select a table in the Object Explorer ➪ Context
Menu ➪ Edit Top 200 Rows), a query can be converted into an aggregate query using the Group By
toolbar button, as illustrated in Figure 12-2.
FIGURE 12-2
Performing an aggregate query within Management Studio’s Query Designer. The aggregate function
for the column is selected using the drop-down box in the Group By column.
For more information on using the Query Designer to build and execute queries, turn to Chapter 6, “Using Management Studio.”
Beginning statistics
Statistics is a large and complex field of study, and while SQL Server does not pretend to replace a full
statistical analysis software package, it does calculate standard deviation and variance, both of which are
important for understanding the bell-curve spread of numbers.
An average alone is not sufficient to summarize a set of values (in the lexicon of statistics, a “set”
is referred to as a population). The value in the exact middle of a population is the
median (which is different from the average, or arithmetic mean). How widely
dispersed the values are from the mean is called the population’s variance. For example, the populations
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) and (4, 4, 5, 5, 5, 5, 6, 6) average to 5.5 and 5, respectively, but the values in the first set
vary widely from the mean, whereas the second set’s values are all close to the mean. The standard
deviation is the square root of the variance and describes the shape of the bell curve formed by the
population.
The following query uses the StDevP() and VarP() functions to return the statistical variance and the
standard deviation of the entire population of the RawData table:
SELECT
StDevP(Amount) as [StDev],
VarP(Amount) as [Var]
FROM RawData;
Result:
StDev                  Var
---------------------- ----------------------
24.2715883287435       589.11
To perform extensive statistical data analysis, I recommend exporting the query result set
to Excel and tapping Excel’s broad range of statistical functions.
The statistical formulas differ slightly when calculating variance and standard deviation from the entire
population versus a sampling of the population. If the aggregate query includes the entire population,
then use the StDevP() and VarP() aggregate functions, which use the biased, or n, method of calculating
the deviation.
However, if the query is using a sampling or subset of the population, then use the StDev() and
Var() aggregate functions so that SQL Server will use the unbiased, or n-1, statistical method. Because
GROUP BY queries slice the population into subsets, these queries should always use StDevP() and
VarP() functions.
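The difference between the n and n-1 methods is easiest to see on a small set. As a rough sketch, the following loads the eight values of the second population mentioned earlier, (4, 4, 5, 5, 5, 5, 6, 6), into a table variable; the name @StatDemo is hypothetical. The mean is 5 and the squared differences from the mean add up to 4.

```sql
-- Hypothetical table variable holding the eight-value population from the text.
DECLARE @StatDemo TABLE (Amount INT NOT NULL);

INSERT @StatDemo (Amount)
VALUES (4), (4), (5), (5), (5), (5), (6), (6);

SELECT VarP(Amount)   AS [VarP],   -- n method:   4 / 8 = 0.5
       Var(Amount)    AS [Var],    -- n-1 method: 4 / 7, about 0.571
       StDevP(Amount) AS [StDevP], -- square root of 0.5, about 0.707
       StDev(Amount)  AS [StDev]   -- square root of 4 / 7, about 0.756
FROM @StatDemo;
```

With only eight values the two methods differ noticeably; as the row count grows, dividing by n or by n-1 makes less and less difference.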
All of these aggregate functions also work with the OVER() clause; see Chapter 13,
“Windowing and Ranking.”
Grouping within a Result Set
Aggregate functions are all well and good, but how often do you need a total for an entire table? Most
aggregate requirements will include a date range, department, type of sale, region, or the like. That
presents a problem. If the only tool to restrict the aggregate function were the WHERE clause, then
database developers would waste hours replicating the same query, or writing a lot of dynamic SQL
queries and the code to execute the aggregate queries in sequence.
Fortunately, aggregate functions are complemented by the GROUP BY clause, which automatically
partitions the data set into subsets based on the values in certain columns. Once the data set is divided into
subgroups, the aggregate functions are performed on each subgroup. The final result is one summation
row for each group, as shown in Figure 12-3.
A common example is grouping the sales result by salesperson. A SUM() function without the grouping
would produce the SUM() of all sales. Writing a query for each salesperson would provide a SUM() for
each person, but maintaining that over time would be cumbersome. The GROUP BY clause automatically
creates a subset of data grouped for each unique salesperson, and then the SUM() function is calculated
for each salesperson’s sales. Voilà.
FIGURE 12-3
The GROUP BY clause slices the data set into multiple subgroups.
[Diagram labels: Data Source(s), From, Where, Col(s), Expr(s), Data Set, group, row, Having, Predicate, Order By]
Simple groupings
Some queries use descriptive columns for the grouping, so the data used by the GROUP BY clause is
the same data you need to see to understand the groupings. For example, the next query groups by
category:
SELECT Category, Count(*) as Count, Sum(Amount) as [Sum], Avg(Amount) as [Avg], Min(Amount) as [Min], Max(Amount) as [Max]
FROM RawData
GROUP BY Category;
Result:
Category Count Sum Avg Min Max
The first column of this query returns the Category column. While this column does not have an
aggregate function, it still participates within the aggregate because that’s the column by which the query
is being grouped. It may therefore be included in the result set because, by definition, there can be
only a single category value in each group. Each row in the result set summarizes one category, and the
aggregate functions now calculate the row count, sum, average, minimum value, and maximum value for
each category.
SQL is not limited to grouping by a column. It’s possible to group by an expression, but note that the
exact same expression must be used in the SELECT list, not the individual columns used to generate
the expression.
Nor is SQL limited to grouping by a single column or expression. Grouping by multiple columns and
expressions is quite common. The following query is an example of grouping by two expressions that
calculate year number and quarter from SalesDate:
SELECT Year(SalesDate) as [Year], DatePart(q,SalesDate) as [Quarter],
Count(*) as Count,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg],
Min(Amount) as [Min],
Max(Amount) as [Max]
FROM RawData
GROUP BY Year(SalesDate), DatePart(q,SalesDate);
Result:
Year Quarter Count Sum Avg Min Max
For the purposes of a GROUP BY, null values are considered equal to other nulls and are
grouped together into a single result row.
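A quick way to confirm how nulls behave in a GROUP BY is to group a few rows where the grouping column is sometimes null. This is a minimal sketch; the @GroupDemo table variable and its rows are invented for the purpose.

```sql
-- Hypothetical sample data; two of the four rows have no region.
DECLARE @GroupDemo TABLE (Region VARCHAR(10) NULL, Amount INT NOT NULL);

INSERT @GroupDemo (Region, Amount)
VALUES ('South', 10),
       (NULL,    20),
       ('South', 30),
       (NULL,    40);

-- Returns two summary rows, not three:
--   NULL  group: Count 2, Sum 60 (both null regions fall into one group)
--   South group: Count 2, Sum 40
SELECT Region,
       COUNT(*)    AS [Count],
       SUM(Amount) AS [Sum]
FROM @GroupDemo
GROUP BY Region;
```

Without an ORDER BY, the order of the two summary rows is not guaranteed.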
Grouping sets
Normally, SQL Server groups by every unique combination of values in every column listed in the
GROUP BY clause. Grouping sets is a variation of that theme that’s new for SQL Server 2008. With
grouping sets, a summation row is generated for each unique value in each set. You can think of
grouping sets as executing several GROUP BY queries (one for each grouping set) and then combining, or
unioning, the results.
For example, the following two queries produce the same result. The first query uses two GROUP BY
queries unioned together; the second query uses the new grouping set feature:
SELECT NULL AS Category,
Region,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY Region
UNION
SELECT Category, Null,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY Category;

SELECT Category, Region,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY GROUPING SETS (Category, Region);
Result (same for both queries):
Category Region    Count Sum Avg Min Max
-------- --------- ----- --- --- --- ---
NULL     MidWest   3     145 48  24  83
NULL     NorthEast 6     236 59  28  91
NULL     South     12    485 44  11  86
There’s more to grouping sets than merging multiple GROUP BY queries; they’re also used with ROLLUP
and CUBE, covered later in this chapter.
Filtering grouped results
When combined with grouping, filtering can be a problem. Are the row restrictions applied before the
GROUP BY or after the GROUP BY? Some databases use nested queries to properly filter before or after
the GROUP BY. SQL, however, uses the HAVING clause to filter the groups. At the beginning of this
chapter, you saw the simplified order of the SQL SELECT statement’s execution. A more complete order
is as follows:
1. The FROM clause assembles the data from the data sources.
2. The WHERE clause restricts the rows based on the conditions.
3. The GROUP BY clause assembles subsets of data.
4. Aggregate functions are calculated.
5. The HAVING clause filters the subsets of data.
6. Any remaining expressions are calculated.
7. The ORDER BY sorts the results.
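The logical order listed above can be traced on a handful of rows. The sketch below is illustrative only: the @FilterDemo table variable and its data are made up, and the comments map each clause to the step number in the list. The WHERE clause removes individual rows before any grouping takes place, whereas the HAVING clause removes whole groups after the aggregates are known.

```sql
-- Hypothetical sample data; @FilterDemo is illustrative only.
DECLARE @FilterDemo TABLE (Category CHAR(1) NOT NULL, Amount INT NOT NULL);

INSERT @FilterDemo (Category, Amount)
VALUES ('X', 10), ('X', 50),
       ('Y', 20), ('Y', 30),
       ('Z', 5);

SELECT Category,
       COUNT(*)    AS [Count],
       AVG(Amount) AS [Avg]
FROM @FilterDemo           -- step 1: five rows
WHERE Amount >= 10         -- step 2: the single Z row is dropped before grouping
GROUP BY Category          -- step 3: groups X and Y remain
HAVING AVG(Amount) > 25    -- step 5: X (average 30) passes; Y (average 25) is filtered out
ORDER BY Category;         -- step 7: one summary row is returned, for X
```

Moving the Amount test from the WHERE clause into the HAVING clause would change the question being asked, because HAVING can only see one summary row per group.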
Continuing with the RawData sample table, the following query removes from the analysis any
grouping “having” an average of less than or equal to 25 by accepting only those summary rows with an
average greater than 25:
SELECT Year(SalesDate) as [Year],
DatePart(q,SalesDate) as [Quarter],
Count(*) as Count,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg]
FROM RawData
GROUP BY Year(SalesDate), DatePart(q,SalesDate)
HAVING Avg(Amount) > 25
ORDER BY [Year], [Quarter];
Result:
Year Quarter Count Sum Avg
Without the HAVING clause, the fourth quarter of 2005, with an average of 19, would have been
included in the result set.
Aggravating Queries
A few aspects of GROUP BY queries can be aggravating when developing applications. Some developers
simply avoid aggregate queries and make the reporting tool do the work, but the Database Engine will
be more efficient than any client tool. Here are four typical aggravating problems and my recommended
solutions.
Including group by descriptions
The previous aggregate queries all executed without error because every column participated in the
aggregate purpose of the query. To test the rule, the following script adds a category table and then
attempts to return a column that isn’t included as an aggregate function or GROUP BY column:
CREATE TABLE RawCategory (
RawCategoryID CHAR(1) NOT NULL PRIMARY KEY,
CategoryName VARCHAR(25) NOT NULL
);
INSERT RawCategory (RawCategoryID, CategoryName)
VALUES ('X', 'Sci-Fi'),
('Y', 'Philosophy'),
('Z', 'Zoology');

ALTER TABLE RawData
ADD CONSTRAINT FT_Category
FOREIGN KEY (Category)
REFERENCES RawCategory(RawCategoryID);

-- including data outside the aggregate function or group by
SELECT R.Category, C.CategoryName,
Sum(R.Amount) as [Sum],
Avg(R.Amount) as [Avg],
Min(R.Amount) as [Min],
Max(R.Amount) as [Max]
FROM RawData AS R
INNER JOIN RawCategory AS C
ON R.Category = C.RawCategoryID
GROUP BY R.Category;
As expected, including CategoryName in the column list causes the query to return an error message:
Msg 8120, Level 16, State 1, Line 1
Column 'RawCategory.CategoryName' is invalid in the select list
because it is not contained in either an aggregate function or the GROUP BY clause.
Here are three solutions for including non-aggregate descriptive columns. Which solution performs best
depends on the size and mix of the data and indexes.
The first solution is to simply include the additional columns in the GROUP BY clause:
SELECT R.Category, C.CategoryName,
Sum(R.Amount) as [Sum], Avg(R.Amount) as [Avg], Min(R.Amount) as [Min], Max(R.Amount) as [Max]
FROM RawData AS R INNER JOIN RawCategory AS C
ON R.Category = C.RawCategoryID
GROUP BY R.Category, C.CategoryName
ORDER BY R.Category, C.CategoryName;
Result:
Category CategoryName Sum Avg Min Max
Another simple solution might be to include the descriptive column in an aggregate function that
accepts text, such as MIN() or MAX(). This solution returns the descriptor while avoiding grouping by
an additional column:
SELECT Category,
MAX(CategoryName) AS CategoryName,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData R
JOIN RawCategory C
ON R.Category = C.RawCategoryID
GROUP BY Category
ORDER BY Category,
CategoryName;
Another possible solution, although more complex, is to embed the aggregate function in a subquery
and then include the additional columns in the outer query. In this solution, the subquery does the
grunt work of the aggregate function and GROUP BY, leaving the outer query to handle the JOIN and
bring in the descriptive column(s). For larger data sets, this may be the best-performing solution:
SELECT sq.Category, C.CategoryName,
sq.[Sum], sq.[Avg], sq.[Min], sq.[Max]
FROM (SELECT Category,
Sum(Amount) as [Sum], Avg(Amount) as [Avg], Min(Amount) as [Min], Max(Amount) as [Max]
FROM RawData GROUP BY Category ) AS sq
INNER JOIN RawCategory AS C
ON sq.Category = C.RawCategoryID
ORDER BY sq.Category, C.CategoryName;
Which solution performs best depends on the data mix. If it’s an ad hoc query, then the simplest query
to write is probably the first solution. If the query is going into production as part of a stored
procedure, then I recommend testing all three solutions against a full data load to determine which solution
actually performs best. Never underestimate the optimizer.
Including all group by values
The GROUP BY clause is processed after the WHERE clause in the logical order of the query. This can
present a problem if the query needs to report all of the GROUP BY column values even though the data
needs to be filtered. For example, a report might need to include all the months even though there’s no
data for a given month. A GROUP BY query won’t return a summary row for a group that has no data.
The simple solution is to use the GROUP BY ALL option, which includes all GROUP BY values regardless
of the WHERE clause. However, it has a limitation: It only works well when grouping by a single
expression. A more severe limitation is that Microsoft lists it as deprecated, meaning it will be removed from a
future version of SQL Server. Nulltheless, here’s an example.