Microsoft SQL Server 2008 Study Guide, Part 34


Because SQL is now returning information from a set, rather than building a record set of rows, as soon as a query includes an aggregate function, every column (in the column list, in the expression, or in the ORDER BY) must participate in an aggregate function. This makes sense because if a query returned the total number of order sales, then it could not return a single order number on the summary row.

Because aggregate functions are expressions, the result will have a null column name. Therefore, use an alias to name the column in the results.

To demonstrate the mathematical aggregate functions, the following query produces a SUM(), AVG(), MIN(), and MAX() of the Amount column. SQL Server warns in the result that null values are ignored by aggregate functions, which are examined in more detail soon:

SELECT SUM(Amount) AS [Sum],
  AVG(Amount) AS [Avg],
  MIN(Amount) AS [Min],
  MAX(Amount) AS [Max]
FROM RawData;

Result:

Warning: Null value is eliminated by an aggregate or other SET operation.

There’s actually more to the COUNT() function than appears at first glance. The next query exercises four variations of the COUNT() aggregate function:

SELECT COUNT(*) AS CountStar,
  COUNT(RawDataID) AS CountPK,
  COUNT(Amount) AS CountAmount,
  COUNT(DISTINCT Region) AS Regions
FROM RawData;

Result:

CountStar  CountPK  CountAmount  Regions
24         24       20           4

Warning: Null value is eliminated by an aggregate or other SET operation.

To examine this query in detail, the first column, COUNT(*), counts every row, regardless of any values in the row. COUNT(RawDataID) counts all the rows with a non-null value in the primary key. Because primary keys, by definition, can’t have any nulls, this column also counts every row. These two methods of counting rows have the same query execution plan, same performance, and same result.

The third column, COUNT(Amount), demonstrates why every aggregate query includes a warning. It counts the number of rows with an actual value in the Amount column, and it ignores any rows with a null value in the Amount column. Because there are four rows with null amounts, this COUNT(Amount) finds only 20 rows.

COUNT(DISTINCT Region) is the oddball of this query. Instead of counting rows, it counts the unique values in the Region column. The RawData table data has four regions: MidWest, NorthEast, South, and West. Therefore, COUNT(DISTINCT Region) returns 4. Note that COUNT(DISTINCT *) is invalid; it requires a specific column.
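The four COUNT() variations can be tried without a SQL Server instance. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine; the five rows are hypothetical and are not the book's RawData sample, and SQLite does not print SQL Server's null warning.

```python
import sqlite3

# Hypothetical miniature stand-in for the RawData table (not the book's data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RawData ("
    "RawDataID INTEGER PRIMARY KEY, Region TEXT, Amount INTEGER)"
)
conn.executemany(
    "INSERT INTO RawData (Region, Amount) VALUES (?, ?)",
    [
        ("South", 10),
        ("South", None),    # null amount
        ("West", 30),
        ("MidWest", None),  # null amount
        ("West", 50),
    ],
)

count_star, count_pk, count_amount, regions = conn.execute(
    """
    SELECT COUNT(*),               -- every row
           COUNT(RawDataID),       -- rows with a non-null primary key
           COUNT(Amount),          -- rows with a non-null amount
           COUNT(DISTINCT Region)  -- unique region values
    FROM RawData
    """
).fetchone()

print(count_star, count_pk, count_amount, regions)
```

With two null amounts among five rows, the first two counts agree with each other while COUNT(Amount) comes out lower, which is the same pattern the text describes for the 24-row sample table.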

Aggregates, averages, and nulls

Aggregate functions ignore nulls, which creates a special situation when calculating averages. A SUM() or AVG() aggregate function will not error out on a null, but simply skip the row with a null. For this reason, a SUM()/COUNT(*) calculation may provide a different result from an AVG() function. The COUNT(*) function includes every row, whereas the AVG() function might divide using a smaller count of rows.

To test this behavior, the next query uses three methods of calculating the average amount, and each

method generates a different result:

SELECT AVG(Amount) AS [Integer Avg],

SUM(Amount) / COUNT(*) AS [Manual Avg],

AVG(CAST((Amount) AS NUMERIC(9, 5))) AS [Numeric Avg]

FROM RawData;

Result:

Integer Avg  Manual Avg  Numeric Avg
47           39          47.300000

The first column performs the standard AVG() aggregate function and divides the sum of the amount (946) by the number of rows with a non-null value for the amount (20).

The SUM(Amount)/COUNT(*) calculation in column two actually divides 946 by the total number of rows in the table (24), yielding a different answer.

The last column provides the best answer. It uses the AVG() function so it ignores null values, but it also improves the precision of the answer. The trick is that the precision of the aggregate function is determined by the data type precision of the source values. SQL Server’s Query Optimizer first converts the Amount values to a numeric(9,5) data type and then passes the values to the AVG() function.
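The arithmetic behind the three averages can be checked by hand. The sketch below first reuses the figures quoted in the text (a sum of 946, 20 non-null amounts, 24 rows) and then repeats the null-skipping comparison on a hypothetical four-row table, assuming Python's sqlite3 module as a stand-in engine. One difference to keep in mind: SQLite's AVG() always returns a floating-point value, so it does not reproduce the integer truncation of the first column.

```python
import sqlite3

# Figures quoted in the text: SUM(Amount) = 946, 20 non-null amounts, 24 rows.
total, non_null_rows, all_rows = 946, 20, 24
integer_avg = total // non_null_rows  # integer division, as with an integer column
manual_avg = total // all_rows        # SUM(Amount) / COUNT(*) divides by every row
precise_avg = total / non_null_rows   # what the NUMERIC cast preserves
print(integer_avg, manual_avg, precise_avg)

# Hypothetical four-row table with one null amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RawData (Amount INTEGER)")
conn.executemany(
    "INSERT INTO RawData VALUES (?)", [(10,), (20,), (None,), (30,)]
)

avg_skipping_nulls, sum_over_all_rows = conn.execute(
    "SELECT AVG(Amount), SUM(Amount) / COUNT(*) FROM RawData"
).fetchone()

# AVG() divides 60 by the 3 non-null rows; SUM()/COUNT(*) divides 60 by all 4.
print(avg_skipping_nulls, sum_over_all_rows)
```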

Using aggregate functions within the Query Designer

When using Management Studio’s Query Designer (select a table in the Object Explorer ➪ Context Menu ➪ Edit Top 200 Rows), a query can be converted into an aggregate query using the Group By toolbar button, as illustrated in Figure 12-2.

FIGURE 12-2

Performing an aggregate query within Management Studio’s Query Designer. The aggregate function for the column is selected using the drop-down box in the Group By column.

For more information on using the Query Designer to build and execute queries, turn to Chapter 6, "Using Management Studio."

Beginning statistics

Statistics is a large and complex field of study, and while SQL Server does not pretend to replace a full statistical analysis software package, it does calculate standard deviation and variance, both of which are important for understanding the bell-curve spread of numbers.

An average alone is not sufficient to summarize a set of values (in the lexicon of statistics, a "set" is referred to as a population). The value in the exact middle of a population is the median (which is different from the average, or arithmetic mean). How widely the values are dispersed from the mean is called the population’s variance. For example, the populations (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) and (4, 4, 5, 5, 5, 5, 6, 6) both average to about 5 (5.5 and 5, respectively), but the values in the first set vary widely from the mean, whereas the second set’s values are all close to the mean. The standard deviation is the square root of the variance and describes the shape of the bell curve formed by the population.

The following query uses the StDevP() and VarP() functions to return the standard deviation and the statistical variance of the entire population of the RawData table:

SELECT

StDevP(Amount) as [StDev],

VarP(Amount) as [Var]

FROM RawData;

Result:

StDev             Var
24.2715883287435  589.11

To perform extensive statistical data analysis, I recommend exporting the query result set

to Excel and tapping Excel’s broad range of statistical functions.

The statistical formulas differ slightly when calculating variance and standard deviation from the entire population versus a sampling of the population. If the aggregate query includes the entire population, then use the StDevP() and VarP() aggregate functions, which use the biased, or n, method of calculating the deviation.

However, if the query is using a sampling or subset of the population, then use the StDev() and Var() aggregate functions so that SQL Server will use the unbiased, or n-1, statistical method. Because GROUP BY queries slice the population into subsets, these queries should always use StDevP() and VarP() functions.
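The difference between the n and n-1 methods is easy to see on the two small populations quoted earlier in this section. The following sketch assumes Python's standard statistics module as a stand-in: pvariance() and pstdev() divide by n, in the manner of VarP() and StDevP(), while variance() and stdev() divide by n-1, in the manner of Var() and StDev(). It illustrates the formulas only and is not SQL Server's implementation.

```python
import statistics

# The two example populations from the text.
wide = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tight = [4, 4, 5, 5, 5, 5, 6, 6]

summary = {}
for name, values in (("wide", wide), ("tight", tight)):
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "var_p": statistics.pvariance(values),  # divide by n (population)
        "stdev_p": statistics.pstdev(values),   # square root of var_p
        "var": statistics.variance(values),     # divide by n - 1 (sample)
        "stdev": statistics.stdev(values),      # square root of var
    }
    print(name, summary[name])
```

The sample figures are always a little larger than the population figures, and the gap shrinks as the number of values grows.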

All of these aggregate functions also work with the OVER() clause; see Chapter 13, "Windowing and Ranking."

Grouping within a Result Set

Aggregate functions are all well and good, but how often do you need a total for an entire table? Most aggregate requirements will include a date range, department, type of sale, region, or the like. That presents a problem. If the only tool to restrict the aggregate function were the WHERE clause, then database developers would waste hours replicating the same query, or writing a lot of dynamic SQL queries and the code to execute the aggregate queries in sequence.

Fortunately, aggregate functions are complemented by the GROUP BY clause, which automatically partitions the data set into subsets based on the values in certain columns. Once the data set is divided into subgroups, the aggregate functions are performed on each subgroup. The final result is one summation row for each group, as shown in Figure 12-3.

A common example is grouping the sales result by salesperson. A SUM() function without the grouping would produce the SUM() of all sales. Writing a query for each salesperson would provide a SUM() for each person, but maintaining that over time would be cumbersome. The grouping function automatically creates a subset of data grouped for each unique salesperson, and then the SUM() function is calculated for each salesperson’s sales. Voilà.

FIGURE 12-3

The GROUP BY clause slices the data set into multiple subgroups.

[Figure: query flow diagram with the labels Data Source(s), From, Where, Col(s)/Expr(s), Data Set, Having, Order By, and Predicate; the rows of the data set are divided into groups.]

Simple groupings

Some queries use descriptive columns for the grouping, so the data used by the GROUP BY clause is the same data you need to see to understand the groupings. For example, the next query groups by category:

SELECT Category,
  Count(*) as Count,
  Sum(Amount) as [Sum],
  Avg(Amount) as [Avg],
  Min(Amount) as [Min],
  Max(Amount) as [Max]
FROM RawData
GROUP BY Category;

Result:

Category Count Sum Avg Min Max

The first column of this query returns the Category column. While this column does not have an aggregate function, it still participates within the aggregate because that’s the column by which the query is being grouped. It may therefore be included in the result set because, by definition, there can be only a single category value in each group. Each row in the result set summarizes one category, and the aggregate functions now calculate the row count, sum, average, minimum value, and maximum value for each category.

SQL is not limited to grouping by a column. It’s possible to group by an expression, but note that the exact same expression must be used in the SELECT list, not the individual columns used to generate the expression.

Nor is SQL limited to grouping by a single column or expression. Grouping by multiple columns and expressions is quite common. The following query is an example of grouping by two expressions that calculate year number and quarter from SalesDate:

SELECT Year(SalesDate) as [Year], DatePart(q,SalesDate) as [Quarter],

Count(*) as Count,

Sum(Amount) as [Sum],

Avg(Amount) as [Avg],

Min(Amount) as [Min],

Max(Amount) as [Max]

FROM RawData

GROUP BY Year(SalesDate), DatePart(q,SalesDate);

Result:

Year Quarter Count Sum Avg Min Max


For the purposes of a GROUP BY, null values are considered equal to other nulls and are grouped together into a single result row.
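The null-grouping rule can be confirmed with a few rows. This is a minimal sketch, assuming Python's sqlite3 module as a stand-in engine and a hypothetical four-row table; the grouping rule shown is standard SQL behavior, not something specific to this table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RawData (Region TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO RawData VALUES (?, ?)",
    [("South", 10), (None, 20), ("South", 30), (None, 40)],
)

# Both rows with a null Region fall into one group.
rows = conn.execute(
    """
    SELECT Region, COUNT(*) AS RowCnt, SUM(Amount) AS Total
    FROM RawData
    GROUP BY Region
    ORDER BY Region
    """
).fetchall()

for row in rows:
    print(row)
```

The result has two summary rows, one for South and one for the null region, rather than a separate row for each null.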

Grouping sets

Normally, SQL Server groups by every unique combination of values in every column listed in the GROUP BY clause. Grouping sets is a variation of that theme that’s new for SQL Server 2008. With grouping sets, a summation row is generated for each unique value in each set. You can think of grouping sets as executing several GROUP BY queries (one for each grouping set) and then combining, or unioning, the results.

For example, the following two queries produce the same result. The first query uses two GROUP BY queries unioned together; the second query uses the new grouping set feature:

SELECT NULL AS Category,
  Region,
  COUNT(*) AS Count,
  SUM(Amount) AS [Sum],
  AVG(Amount) AS [Avg],
  MIN(Amount) AS [Min],
  MAX(Amount) AS [Max]
FROM RawData
GROUP BY Region
UNION
SELECT Category,
  NULL,
  COUNT(*) AS Count,
  SUM(Amount) AS [Sum],
  AVG(Amount) AS [Avg],
  MIN(Amount) AS [Min],
  MAX(Amount) AS [Max]
FROM RawData
GROUP BY Category;

SELECT Category,
  Region,
  COUNT(*) AS Count,
  SUM(Amount) AS [Sum],
  AVG(Amount) AS [Avg],
  MIN(Amount) AS [Min],
  MAX(Amount) AS [Max]
FROM RawData
GROUP BY GROUPING SETS (Category, Region);

Result (same for both queries):

Category  Region     Count  Sum  Avg  Min  Max
NULL      MidWest    3      145  48   24   83
NULL      NorthEast  6      236  59   28   91
NULL      South      12     485  44   11   86

There’s more to grouping sets than merging multiple GROUP BY queries; they’re also used with ROLLUP and CUBE, covered later in this chapter.

Filtering grouped results

When combined with grouping, filtering can be a problem. Are the row restrictions applied before the GROUP BY or after the GROUP BY? Some databases use nested queries to properly filter before or after the GROUP BY. SQL, however, uses the HAVING clause to filter the groups. At the beginning of this chapter, you saw the simplified order of the SQL SELECT statement’s execution. A more complete order is as follows:

1. The FROM clause assembles the data from the data sources.

2. The WHERE clause restricts the rows based on the conditions.

3. The GROUP BY clause assembles subsets of data.

4. Aggregate functions are calculated.

5. The HAVING clause filters the subsets of data.

6. Any remaining expressions are calculated.

7. The ORDER BY sorts the results.
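The practical consequence of this order is that WHERE filters individual rows before the groups exist, while HAVING filters whole groups after the aggregates are calculated. The sketch below contrasts the two on a hypothetical five-row table, assuming Python's sqlite3 module as a stand-in engine (its AVG() returns floating-point values).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RawData (Category TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO RawData VALUES (?, ?)",
    [("X", 10), ("X", 50), ("Y", 20), ("Y", 22), ("Z", 90)],
)

# WHERE removes single rows first, so X is averaged over its one surviving row.
where_rows = conn.execute(
    """
    SELECT Category, AVG(Amount)
    FROM RawData
    WHERE Amount > 25
    GROUP BY Category
    ORDER BY Category
    """
).fetchall()

# HAVING sees the finished averages, so X is averaged over both of its rows.
having_rows = conn.execute(
    """
    SELECT Category, AVG(Amount)
    FROM RawData
    GROUP BY Category
    HAVING AVG(Amount) > 25
    ORDER BY Category
    """
).fetchall()

print(where_rows)
print(having_rows)
```

Both queries drop category Y, but for different reasons: the first because none of its rows pass the row filter, the second because its group average falls below the threshold. Category X survives both, with a different average in each.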

Continuing with the RawData sample table, the following query removes from the analysis any grouping "having" an average of less than or equal to 25 by accepting only those summary rows with an average greater than 25:

SELECT Year(SalesDate) as [Year],

DatePart(q,SalesDate) as [Quarter],

Count(*) as Count,

Sum(Amount) as [Sum],

Avg(Amount) as [Avg]

FROM RawData

GROUP BY Year(SalesDate), DatePart(q,SalesDate)

HAVING Avg(Amount) > 25

ORDER BY [Year], [Quarter];

Result:

Year Quarter Count Sum Avg


Without the HAVING clause, the fourth quarter of 2005, with an average of 19, would have been included in the result set.

Aggravating Queries

A few aspects of GROUP BY queries can be aggravating when developing applications. Some developers simply avoid aggregate queries and make the reporting tool do the work, but the Database Engine will be more efficient than any client tool. Here are four typical aggravating problems and my recommended solutions.

Including group by descriptions

The previous aggregate queries all executed without error because every column participated in the aggregate purpose of the query. To test the rule, the following script adds a category table and then attempts to return a column that isn’t included as an aggregate function or GROUP BY column:

CREATE TABLE RawCategory (

RawCategoryID CHAR(1) NOT NULL PRIMARY KEY,

CategoryName VARCHAR(25) NOT NULL

);

INSERT RawCategory (RawCategoryID, CategoryName)
VALUES ('X', 'Sci-Fi'),
  ('Y', 'Philosophy'),
  ('Z', 'Zoology');

ALTER TABLE RawData
ADD CONSTRAINT FT_Category
FOREIGN KEY (Category) REFERENCES RawCategory(RawCategoryID);

-- including data outside the aggregate function or GROUP BY

SELECT R.Category, C.CategoryName,
  Sum(R.Amount) as [Sum],
  Avg(R.Amount) as [Avg],
  Min(R.Amount) as [Min],
  Max(R.Amount) as [Max]
FROM RawData AS R
INNER JOIN RawCategory AS C
  ON R.Category = C.RawCategoryID
GROUP BY R.Category;

As expected, including CategoryName in the column list causes the query to return an error message:

Msg 8120, Level 16, State 1, Line 1
Column 'RawCategory.CategoryName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Here are three solutions for including non-aggregate descriptive columns. Which solution performs best depends on the size and mix of the data and indexes.

The first solution is to simply include the additional columns in the GROUP BY clause:

SELECT R.Category, C.CategoryName,

Sum(R.Amount) as [Sum], Avg(R.Amount) as [Avg], Min(R.Amount) as [Min], Max(R.Amount) as [Max]

FROM RawData AS R INNER JOIN RawCategory AS C

ON R.Category = C.RawCategoryID

GROUP BY R.Category, C.CategoryName

ORDER BY R.Category, C.CategoryName;

Result:

Category CategoryName Sum Avg Min Max

Another simple solution might be to include the descriptive column in an aggregate function that accepts text, such as MIN() or MAX(). This solution returns the descriptor while avoiding grouping by an additional column:

SELECT Category,

MAX(CategoryName) AS CategoryName,

SUM(Amount) AS [Sum],

AVG(Amount) AS [Avg],

MIN(Amount) AS [Min],

MAX(Amount) AS [Max]

FROM RawData R

JOIN RawCategory C

ON R.Category = C.RawCategoryID

GROUP BY Category

ORDER BY Category,

CategoryName;

Another possible solution, although more complex, is to embed the aggregate function in a subquery and then include the additional columns in the outer query. In this solution, the subquery does the grunt work of the aggregate function and GROUP BY, leaving the outer query to handle the JOIN and bring in the descriptive column(s). For larger data sets, this may be the best-performing solution:

SELECT sq.Category, C.CategoryName,

sq.[Sum], sq.[Avg], sq.[Min], sq.[Max]

FROM (SELECT Category,

Sum(Amount) as [Sum], Avg(Amount) as [Avg], Min(Amount) as [Min], Max(Amount) as [Max]

FROM RawData GROUP BY Category ) AS sq

INNER JOIN RawCategory AS C

ON sq.Category = C.RawCategoryID

ORDER BY sq.Category, C.CategoryName;

Which solution performs best depends on the data mix. If it’s an ad hoc query, then the simplest query to write is probably the first solution. If the query is going into production as part of a stored procedure, then I recommend testing all three solutions against a full data load to determine which solution actually performs best. Never underestimate the optimizer.
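Before comparing performance, it helps to confirm that the three approaches really return the same rows. The following sketch does that on a hypothetical handful of rows, assuming Python's sqlite3 module as a stand-in engine. Only SUM() is shown to keep the queries short, and the alias Total is introduced here for illustration. It checks equivalence of results only and says nothing about SQL Server's execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE RawCategory (
        RawCategoryID TEXT PRIMARY KEY,
        CategoryName  TEXT NOT NULL
    );
    CREATE TABLE RawData (
        Category TEXT REFERENCES RawCategory(RawCategoryID),
        Amount   INTEGER
    );
    INSERT INTO RawCategory VALUES
        ('X', 'Sci-Fi'), ('Y', 'Philosophy'), ('Z', 'Zoology');
    INSERT INTO RawData VALUES
        ('X', 10), ('X', 30), ('Y', 20), ('Z', 5), ('Z', NULL);
    """
)

# Solution 1: add the descriptive column to the GROUP BY.
group_by_both = """
    SELECT R.Category, C.CategoryName, SUM(R.Amount) AS Total
    FROM RawData AS R
    INNER JOIN RawCategory AS C ON R.Category = C.RawCategoryID
    GROUP BY R.Category, C.CategoryName
    ORDER BY R.Category
"""

# Solution 2: wrap the descriptive column in MAX().
max_descriptor = """
    SELECT R.Category, MAX(C.CategoryName) AS CategoryName, SUM(R.Amount) AS Total
    FROM RawData AS R
    INNER JOIN RawCategory AS C ON R.Category = C.RawCategoryID
    GROUP BY R.Category
    ORDER BY R.Category
"""

# Solution 3: aggregate in a subquery, then join to the description.
subquery = """
    SELECT sq.Category, C.CategoryName, sq.Total
    FROM (SELECT Category, SUM(Amount) AS Total
          FROM RawData
          GROUP BY Category) AS sq
    INNER JOIN RawCategory AS C ON sq.Category = C.RawCategoryID
    ORDER BY sq.Category
"""

results = [
    conn.execute(query).fetchall()
    for query in (group_by_both, max_descriptor, subquery)
]
print(results[0])
print(results[0] == results[1] == results[2])
```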

Including all group by values

The GROUP BY operation occurs after the WHERE clause in the logical order of the query. This can present a problem if the query needs to report all of the GROUP BY column values even though the data needs to be filtered. For example, a report might need to include all the months even though there’s no data for a given month. A GROUP BY query won’t return a summary row for a group that has no data.
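The missing-group problem is easy to reproduce. The sketch below, assuming Python's sqlite3 module as a stand-in engine and a hypothetical four-row table, first shows a group vanishing once the WHERE clause filters out all of its rows. It then shows one common workaround, which is an assumption of this example rather than the GROUP BY ALL option this section discusses: start from the full list of group values and outer-join the filtered data to it, moving the filter into the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE RawCategory (RawCategoryID TEXT PRIMARY KEY);
    CREATE TABLE RawData (Category TEXT, Amount INTEGER);
    INSERT INTO RawCategory VALUES ('X'), ('Y'), ('Z');
    INSERT INTO RawData VALUES ('X', 10), ('X', 60), ('Y', 20), ('Z', 70);
    """
)

# The WHERE clause removes every Y row, so no Y group is ever formed.
filtered = conn.execute(
    """
    SELECT Category, COUNT(*)
    FROM RawData
    WHERE Amount > 50
    GROUP BY Category
    ORDER BY Category
    """
).fetchall()

# Outer join from the complete category list; the filter lives in the ON
# clause so that categories without matching rows are still kept.
all_groups = conn.execute(
    """
    SELECT C.RawCategoryID, COUNT(R.Amount)
    FROM RawCategory AS C
    LEFT JOIN RawData AS R
      ON R.Category = C.RawCategoryID
     AND R.Amount > 50
    GROUP BY C.RawCategoryID
    ORDER BY C.RawCategoryID
    """
).fetchall()

print(filtered)
print(all_groups)
```

Note that the second query counts R.Amount rather than using COUNT(*), so a category with no matching rows reports zero instead of one.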

The simple solution is to use the GROUP BY ALL option, which includes all GROUP BY values regardless of the WHERE clause. However, it has a limitation: It only works well when grouping by a single expression. A more severe limitation is that Microsoft lists it as deprecated, meaning it will be removed from a future version of SQL Server. Nevertheless, here’s an example:
