Joe Celko's SQL for Smarties - Advanced SQL Programming P48


442 CHAPTER 21: AGGREGATE FUNCTIONS

'Chester'   'A.'    'Arthur'      'R'  1881  1885
'Grover'    ' '     'Cleveland'   'D'  1885  1889
'Benjamin'  ' '     'Harrison'    'R'  1889  1893
'Grover'    ' '     'Cleveland'   'D'  1893  1897
'William'   ' '     'McKinley'    'R'  1897  1901
'Theodore'  ' '     'Roosevelt'   'R'  1901  1909
'William'   'H.'    'Taft'        'R'  1909  1913
'Woodrow'   ' '     'Wilson'      'D'  1913  1921
'Warren'    'G.'    'Harding'     'R'  1921  1923
'Calvin'    ' '     'Coolidge'    'R'  1923  1929
'Herbert'   'C.'    'Hoover'      'R'  1929  1933
'Franklin'  'D.'    'Roosevelt'   'D'  1933  1945
'Harry'     'S.'    'Truman'      'D'  1945  1953
'Dwight'    'D.'    'Eisenhower'  'R'  1953  1961
'John'      'F.'    'Kennedy'     'D'  1961  1963
'Lyndon'    'B.'    'Johnson'     'D'  1963  1969
'Richard'   'M.'    'Nixon'       'R'  1969  1974
'Gerald'    'R.'    'Ford'        'R'  1974  1977
'James'     'E.'    'Carter'      'D'  1977  1981
'Ronald'    'W.'    'Reagan'      'R'  1981  1989
'George'    'H.W.'  'Bush'        'R'  1989  1993
'William'   'J.'    'Clinton'     'D'  1993  2001
'George'    'W.'    'Bush'        'R'  2001  NULL

Your civics teacher has just asked you to tell her how many people have been President of the United States. So you write the query as

SELECT COUNT(*) FROM Presidents;

and get the wrong answer. For those of you who have been out of high school too long, more than one Adams, more than one John, and more than one Roosevelt have served as president. Many people have had more than one term in office, and Grover Cleveland served two discontinuous terms. In short, this database is not a simple one-row, one-person system. What you really want is not COUNT(*), but something that is able to look at unique combinations of multiple columns. You cannot do this in one column, so you need to construct an expression that is unique. The point is that you need to be very sure that the expression you are using as a parameter is really what you wanted to count.

The COUNT([ALL] <value expression>) returns the number of members in the <value expression> set. The NULLs are thrown away before the counting takes place, and an empty set returns zero. The best way to read this is: “Count the number of known values in this expression,” with stress on the word known. In this example you might use COUNT(first_name || ' ' || initial || ' ' || last_name).

The COUNT(DISTINCT <value expression>) returns the number of unique members in the <value expression> set. The NULLs are thrown away before the counting takes place, and then all redundant duplicates are removed (i.e., we keep one copy). Again, an empty set returns a zero, just as with the other counting functions. Applying this function to a key or a unique column is the same as using the COUNT(*) function, but the optimizer may not be smart enough to spot it.

Notice that the use of the keywords ALL and DISTINCT follows the same pattern here as it did in the [ALL | DISTINCT] options in the SELECT clause of the query expressions.
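The three counting forms can be compared side by side. The following is only a sketch, assuming Python's standard sqlite3 module as a stand-in engine and a handful of rows from the table; the column names party, start_term, and end_term are assumptions, since the table definition is not shown in this excerpt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names other than first_name, initial, last_name are assumed.
conn.execute(
    "CREATE TABLE Presidents ("
    " first_name TEXT NOT NULL, initial TEXT NOT NULL, last_name TEXT NOT NULL,"
    " party TEXT NOT NULL, start_term INTEGER NOT NULL, end_term INTEGER)"
)
rows = [
    ("Chester", "A.", "Arthur", "R", 1881, 1885),
    ("Grover", " ", "Cleveland", "D", 1885, 1889),
    ("Benjamin", " ", "Harrison", "R", 1889, 1893),
    ("Grover", " ", "Cleveland", "D", 1893, 1897),
    ("George", "W.", "Bush", "R", 2001, None),  # end_term is NULL
]
conn.executemany("INSERT INTO Presidents VALUES (?, ?, ?, ?, ?, ?)", rows)

row_count, known_end_terms, person_count = conn.execute(
    "SELECT COUNT(*),"                  # every row, NULLs included
    "       COUNT(end_term),"           # only the known end_term values
    "       COUNT(DISTINCT first_name || ' ' || initial || ' ' || last_name)"
    "  FROM Presidents"
).fetchone()

print(row_count, known_end_terms, person_count)
```

With these five rows, COUNT(*) sees five terms of office, COUNT(end_term) skips the one NULL, and the DISTINCT form counts Grover Cleveland once.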

21.2 SUM() Functions

This function works only with numeric values. You should also consult your particular product’s manuals to find out the precision of the results for exact and approximate numeric data types.

SUM([ALL] <value expression>) returns the numeric total of all known values. The NULLs are removed before the summation takes place. An empty set returns an empty result set, not a zero. If there are other columns in the SELECT list, then that empty set will be converted into a NULL.

SUM(DISTINCT <value expression>) returns the numeric total of all known, unique values. The NULLs and all redundant duplicates are removed before the summation takes place. Again, an empty set returns an empty result set, not a zero.

That last rule is hard for people to understand. If there are other columns in the SELECT list, then that empty result set will be converted into a NULL. This is true for the rest of the Standard aggregate functions:

no rows:

SELECT SUM(x)
  FROM EmptyTable;

one row with (0, NULL) in it:

SELECT COUNT(*), SUM(x)
  FROM EmptyTable;
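Products do not all behave the way this rule reads, so it is worth running both queries on your own engine. A minimal check, assuming Python's standard sqlite3 module (SQLite is only a stand-in here, not a product the text discusses): in SQLite the SUM(x) query by itself comes back as a single row holding NULL rather than as no rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmptyTable (x INTEGER)")  # deliberately left empty

# fetchall() returns a list of row tuples; SQL NULL arrives as None.
sum_only = conn.execute("SELECT SUM(x) FROM EmptyTable").fetchall()
count_and_sum = conn.execute("SELECT COUNT(*), SUM(x) FROM EmptyTable").fetchall()

print(sum_only)        # rows returned for SUM(x) alone
print(count_and_sum)   # rows returned for COUNT(*), SUM(x)
```

Either way, the practical point stands: code that expects a zero from SUM() over an empty set has to supply it, for example with COALESCE(SUM(x), 0).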


The summation of a set of numbers looks as though it should be easy, but it is not. Make two tables with the same set of positive and negative approximate numeric values, but put one in random order and have the other sorted by absolute value. The sorted table will give more accurate results. The reason is simple: positive and negative values of the same magnitude will be added together and will get a chance to cancel each other out. There is also less chance of an overflow or underflow error during calculations. Most PC SQL implementations and a lot of mainframe implementations do not bother with this trick, because it would require a sort for every SUM() statement, which would take a long time.
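The effect of summation order is easy to reproduce outside a database. This sketch assumes IEEE 754 double-precision floats as the approximate numeric type and uses Python only as a convenient calculator; the four values are made up for the illustration, and the loop is written out explicitly so that no library-level compensation hides the effect.

```python
def naive_sum(values):
    """Add the values left to right, the way a simple SUM() loop would."""
    total = 0.0
    for value in values:
        total += value
    return total

# Hypothetical data: two huge values that cancel, plus two small ones.
arrival_order = [1e16, 1.0, -1e16, 1.0]
sorted_by_magnitude = sorted(arrival_order, key=abs)  # [1.0, 1.0, 1e16, -1e16]

unsorted_total = naive_sum(arrival_order)
sorted_total = naive_sum(sorted_by_magnitude)

print(unsorted_total)  # the first 1.0 is lost when it is added to 1e16
print(sorted_total)    # small values accumulate first, then the big ones cancel
```

The exact total is 2. In arrival order one of the small values is absorbed by the large one, while the magnitude-sorted order keeps both.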

Whenever an exact or approximate numeric value is assigned to an exact numeric column, it may not fit into the storage allowed for it. SQL says that the database engine will use an approximation that preserves leading significant digits of the original number after rounding or truncating.

The choice of whether to truncate or round is implementation-defined, however. This can lead to some surprises when you have to shift data among SQL implementations, or move storage values from a host language program into an SQL table. It is probably a good idea to create the columns with one more decimal place than you think you need.

Truncation is defined as truncation toward zero; this means that 1.5 would truncate to 1, and −1.5 would truncate to −1. This is not true for all programming languages; everyone agrees on truncation toward zero for the positive numbers, but you will find that negative numbers may truncate away from zero (e.g., −1.5 would truncate to −2). SQL is also wishy-washy on rounding, leaving the implementation free to determine its method. There are two major types of rounding, the scientific method and the commercial method, which are discussed in Section 3.2.1 on rounding and truncation math in SQL.
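The two truncation behaviors can be seen in a host language. This is a small sketch in Python, chosen only for illustration; math.trunc and ROUND_DOWN truncate toward zero as SQL defines it, while math.floor and ROUND_FLOOR show the "away from zero for negatives" behavior mentioned above.

```python
import math
from decimal import Decimal, ROUND_DOWN, ROUND_FLOOR

# Binary floating point
toward_zero = [math.trunc(1.5), math.trunc(-1.5)]   # SQL-style truncation
floored = [math.floor(1.5), math.floor(-1.5)]       # negatives move away from zero

# Exact decimal arithmetic, closer to an exact numeric column
one_place = Decimal("1")
dec_toward_zero = Decimal("-1.5").quantize(one_place, rounding=ROUND_DOWN)
dec_floored = Decimal("-1.5").quantize(one_place, rounding=ROUND_FLOOR)

print(toward_zero, floored, dec_toward_zero, dec_floored)
```

When moving values from a host program into an SQL table, it helps to pick the truncation rule explicitly in the host code instead of relying on a default.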

21.3 AVG() Functions

AVG([ALL] <value expression>) returns the average of the values in the value expression set. An empty set returns an empty result set. A set of all NULLs will become an empty set. Remember that in general, AVG(x) is not the same as (SUM(x)/COUNT(*)); the SUM(x) function has thrown away the NULLs, but the COUNT(*) has not.

Likewise, AVG(DISTINCT <value expression>) returns the average of the distinct known values in the <value expression> set. Applying this function to a key or a unique column is the same as using the AVG(<value expression>) function.


Remember that in general AVG(DISTINCT x) is not the same as AVG(x) or (SUM(DISTINCT x)/COUNT(*)). The SUM(DISTINCT x) function has thrown away the duplicate values and NULLs, but the COUNT(*) has not. An empty set returns an empty result set.
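A small table with one duplicate and one NULL makes these differences concrete. The sketch below assumes Python's standard sqlite3 module as a stand-in engine; the table name Readings and its values are hypothetical, and the 1.0 * factor is added only to avoid integer division in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Readings (x INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO Readings VALUES (?)",
                 [(10,), (10,), (40,), (None,)])   # one duplicate, one NULL

avg_all, sum_over_count, avg_distinct, distinct_sum_over_count = conn.execute(
    "SELECT AVG(x),"                            # (10 + 10 + 40) / 3 known values
    "       1.0 * SUM(x) / COUNT(*),"           # 60 / 4 rows, NULL row counted
    "       AVG(DISTINCT x),"                   # (10 + 40) / 2 distinct values
    "       1.0 * SUM(DISTINCT x) / COUNT(*)"   # 50 / 4 rows
    "  FROM Readings"
).fetchone()

print(avg_all, sum_over_count, avg_distinct, distinct_sum_over_count)
```

All four expressions give different answers on the same four rows, which is why the choice of denominator has to be deliberate.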

The SQL engine is probably using the same code for the totaling in the AVG() that it used in the SUM() function. This leads to the same problems with rounding and truncation, so you should experiment a little with your particular product to find out what happens.

But even more troublesome than those problems is the problem with the average itself, because it does not really measure central tendency and can be very misleading. Consider the chart below, from Darrell Huff’s superlative little book, How to Lie with Statistics (Huff 1954). The Sample Company has 25 employees, earning the following salaries:

Number of
Employees    Salary    Statistic
===================================
   12        $2,000    Mode, Minimum
    1        $3,000    Median
    4        $3,700
    3        $5,000
    1        $5,700    Average
    2       $10,000
    1       $15,000
    1       $45,000    Maximum

The average salary (or, more properly, the arithmetic mean) is $5,700. When the boss is trying to look good to the unions, he uses this figure. When the unions are trying to look impoverished, they use the mode, which is the most frequently occurring value, to show that the exploited workers are making $2,000 (which is also the minimum salary in this case).

A better measure in this case is the median, which will be discussed later; that is, the salary of the employee with just as many cases above him as below him. That gives us $3,000. The rule for calculating the median is that if there is no actual entity with that value, you fake it. Most people take an average of the two values on either side of where the median would be; others jump to the higher or lower value.

The mode also has a problem, because not every distribution of values has one mode. Imagine a country in which there are as many very poor people as there are very rich people, and there is nobody in between. This would be a bimodal distribution. If there were sharp classes of incomes, that would be a multimodal distribution.
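The three statistics in the salary chart can be checked directly. This is a sketch using Python's standard statistics module, not SQL; the payroll list copies the chart, while the two-value income list at the end is a made-up example of a bimodal distribution.

```python
from statistics import mean, median, mode, multimode

# (number of employees, salary) pairs copied from the chart
payroll = [(12, 2000), (1, 3000), (4, 3700), (3, 5000),
           (1, 5700), (2, 10000), (1, 15000), (1, 45000)]
salaries = [salary for headcount, salary in payroll for _ in range(headcount)]

average_salary = mean(salaries)    # arithmetic mean
median_salary = median(salaries)   # 13th of the 25 sorted values
mode_salary = mode(salaries)       # most frequent value

# Hypothetical two-class income distribution: two modes, not one
income_modes = multimode([1, 1, 9, 9])

print(len(salaries), average_salary, median_salary, mode_salary, income_modes)
```

The single $45,000 salary pulls the mean well above what 17 of the 25 employees earn, which is the distortion the chart is meant to show.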

Some SQL products have median and mode aggregate functions as extensions, but they are not part of the standard. We will discuss in detail how to write them in pure SQL in Chapter 23.

21.3.1 Averages with Empty Groups

The query used here is a bit tricky, so this section can be skipped on your first reading. Sometimes you need to count an empty set as part of the population when computing an average.

This is easier to explain with an example that was posted on CompuServe. A fish and game warden is sampling different bodies of water for fish populations. Each sample falls into one or more groups (muddy bottoms, clear water, still water, and so on) and she is trying to find the average of something that is not there. This is neither quite as strange as it first sounds, nor quite as simple, either. She is collecting sample data on fish in a table like this:

CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 fish CHAR(20) NOT NULL,
 found_cnt INTEGER NOT NULL,
 PRIMARY KEY (sample_id, fish));

CREATE TABLE SampleGroups
(group_id INTEGER NOT NULL,
 sample_id INTEGER NOT NULL,
 PRIMARY KEY (group_id, sample_id));

Assume some of the data looks like this:

Samples
sample_id   fish        found_cnt
=================================
    1       'Seabass'      14
    1       'Minnow'       18
    2       'Seabass'      19

SampleGroups
group_id   sample_id
=====================
    1          1
    1          2
    2          2

She needs to get the average number of each species of fish in the sample groups. For example, using sample group 1 as shown, which has samples 1 and 2, we could use the parameters :my_fish = 'Minnow' and :my_group = 1 to find the average number of minnows in sample group 1, thus:

SELECT fish, AVG(found_cnt)
  FROM Samples
 WHERE sample_id
       IN (SELECT sample_id
             FROM SampleGroups
            WHERE group_id = :my_group)
   AND fish = :my_fish
 GROUP BY fish;

But this query will give us an average of 18 minnows, which is wrong. There were no minnows for sample_id = 2, so the average is ((18 + 0)/2) = 9. The other way is to do several steps to get the correct answer: first use a SELECT statement to get the number of samples involved, then another SELECT to get the sum, and then manually calculate the average.

The obvious answer is to enter a count of zero for each animal under each sample_id, instead of letting it be missing, so you can use the original query. You can create the missing rows with:

INSERT INTO Samples
SELECT DISTINCT M1.sample_id, M2.fish, 0
  FROM Samples AS M1, Samples AS M2
 WHERE NOT EXISTS
       (SELECT *
          FROM Samples AS M3
         WHERE M1.sample_id = M3.sample_id
           AND M2.fish = M3.fish);

Unfortunately, it turns out that we have over 100,000 different species of fish and thousands of samples. This trick will fill up more disk space than we have on the machine. The best trick is to use this statement:

SELECT fish, SUM(found_cnt) /
       (SELECT COUNT(sample_id)
          FROM SampleGroups
         WHERE group_id = :my_group)
  FROM Samples
 WHERE fish = :my_fish
   AND sample_id
       IN (SELECT sample_id
             FROM SampleGroups
            WHERE group_id = :my_group)
 GROUP BY fish;

This query is using the rule that the average is the sum of values divided by the count of the set. Another way to do this would be to use an OUTER JOIN and preserve all the group IDs, but that would create NULLs for the fish that are not in some of the sample groups, and you would have to handle them.
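The difference between the two approaches shows up on the sample data. The following is a sketch only, assuming Python's standard sqlite3 module as a stand-in engine. Two details are additions made for this illustration and are labeled in the comments: the sum is restricted to the samples that belong to the group, and it is multiplied by 1.0 to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Samples
(sample_id INTEGER NOT NULL, fish CHAR(20) NOT NULL,
 found_cnt INTEGER NOT NULL, PRIMARY KEY (sample_id, fish));
CREATE TABLE SampleGroups
(group_id INTEGER NOT NULL, sample_id INTEGER NOT NULL,
 PRIMARY KEY (group_id, sample_id));
INSERT INTO Samples VALUES (1, 'Seabass', 14), (1, 'Minnow', 18), (2, 'Seabass', 19);
INSERT INTO SampleGroups VALUES (1, 1), (1, 2), (2, 2);
""")

IN_GROUP = "(SELECT sample_id FROM SampleGroups WHERE group_id = :my_group)"

# AVG() only sees the rows that exist, so missing fish are not counted as zero.
plain_avg = f"""
SELECT fish, AVG(found_cnt) FROM Samples
 WHERE sample_id IN {IN_GROUP} AND fish = :my_fish
 GROUP BY fish"""

# Sum over the group's samples (assumed filter), divided by the group size.
# The 1.0 * factor is an assumption to force non-integer division.
sum_over_group_size = f"""
SELECT fish, 1.0 * SUM(found_cnt) /
       (SELECT COUNT(sample_id) FROM SampleGroups WHERE group_id = :my_group)
  FROM Samples
 WHERE sample_id IN {IN_GROUP} AND fish = :my_fish
 GROUP BY fish"""

minnow_g1 = {"my_fish": "Minnow", "my_group": 1}
seabass_g2 = {"my_fish": "Seabass", "my_group": 2}

minnow_plain = conn.execute(plain_avg, minnow_g1).fetchall()
minnow_fixed = conn.execute(sum_over_group_size, minnow_g1).fetchall()
seabass_fixed = conn.execute(sum_over_group_size, seabass_g2).fetchall()
print(minnow_plain, minnow_fixed, seabass_fixed)
```

Group 2 contains only sample 2, so the seabass average for that group should come from the 19 found there and not from the 14 found in sample 1; the group filter on the sum is what guarantees this.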

21.3.2 Averages across Columns

The sum of several columns can be done with the COALESCE() function to effectively remove the NULLs by replacing them with zeros:

SELECT (COALESCE(c1, 0.0) + COALESCE(c2, 0.0) + COALESCE(c3, 0.0)) AS c_total FROM Foobar;

Likewise, the minimum and maximum values of several columns can be done with a CASE expression, or the GREATEST() and LEAST() functions.

Taking an average across several columns is easy if none of the columns are NULL. You simply add the values and divide by the number of columns. However, getting rid of NULLs is a bit harder. The first trick is to count the NULLs; the expression (c1 - c1) is zero for a known value and NULL otherwise, so each COALESCE() contributes 1 only for a NULL column:

SELECT (COALESCE(c1-c1, 1) + COALESCE(c2-c2, 1) + COALESCE(c3-c3, 1)) AS null_cnt FROM Foobar;

The next thing to watch out for is a row with all NULLs in it. This could lead to a division by zero error.

SELECT CASE WHEN COALESCE(c1, c2, c3) IS NULL
            THEN NULL
            ELSE (COALESCE(c1, 0.0)
                  + COALESCE(c2, 0.0)
                  + COALESCE(c3, 0.0))
                 / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
       END AS horizontal_avg
  FROM Foobar;
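The across-columns average can be exercised on a few rows, including the all-NULL case. This is a sketch assuming Python's standard sqlite3 module as a stand-in engine; the row_id column and the data values are hypothetical and exist only to give the output a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# row_id is a hypothetical column added so the rows can be ordered.
conn.execute(
    "CREATE TABLE Foobar (row_id INTEGER PRIMARY KEY, c1 REAL, c2 REAL, c3 REAL)"
)
conn.executemany("INSERT INTO Foobar VALUES (?, ?, ?, ?)", [
    (1, 1.0, 2.0, 3.0),     # no NULLs: 6.0 / 3
    (2, 1.0, None, 3.0),    # one NULL: 4.0 / 2
    (3, None, None, None),  # all NULLs: must not divide by zero
    (4, None, None, 7.5),   # two NULLs: 7.5 / 1
])

query = """
SELECT CASE WHEN COALESCE(c1, c2, c3) IS NULL
            THEN NULL
            ELSE (COALESCE(c1, 0.0) + COALESCE(c2, 0.0) + COALESCE(c3, 0.0))
                 / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
       END AS horizontal_avg
  FROM Foobar
 ORDER BY row_id
"""
horizontal_avgs = [row[0] for row in conn.execute(query)]
print(horizontal_avgs)
```

The CASE guard makes the all-NULL row come back as NULL (None in Python) instead of reaching the division.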

21.4 Extrema Functions

The MIN() and MAX() functions are known as extrema functions in mathematics. They assume that the elements of the set have an ordering, so it makes sense to select a first or last element based on its value. SQL provides two simple extrema functions, and you can write queries to generalize these to (n) elements.

21.4.1 Simple Extrema Functions

MAX([ALL | DISTINCT] <value expression>) returns the greatest known value in the <value expression> set. This function will work on character and temporal values, as well as numeric values. An empty set returns an empty result set. Technically, you can write MAX(DISTINCT <value expression>), but it is the same as MAX(<value expression>); this form exists only for completeness, and nobody ever uses it.

MIN([ALL | DISTINCT] <value expression>) returns the smallest known value in the <value expression> set. This function will also work on character and temporal values, as well as numeric values. An empty set returns a NULL. Likewise, MIN(DISTINCT <value expression>) exists, but it is defined only for completeness and nobody ever uses it.

The MAX() for a set of numeric values is the largest. The MAX() for a set of temporal data types is the one closest to 9999-12-31, which is the final date in the ISO-8601 Standard. The MAX() for a set of character strings is the last one in the ascending sort order. Likewise, the MIN() for a set of numeric values is the smallest. The MIN() for a set of temporal data types is the one furthest from 9999-12-31. The MIN() for a set of character strings is the first one in the ascending sort order, but you have to know the collation used.

People have a hard time understanding the MAX() and MIN() aggregate functions when they are applied to temporal data types. They seem to expect the MAX() to return the date closest to the current date. Likewise, if the set has no dates before the current date, they seem to expect the MIN() function to return the date closest to the current date. Human psychology wants to use the current time as an origin point for temporal reasoning.

Consider the predicate “billing_date < (CURRENT_DATE - INTERVAL '90' DAY)” as an example. Most people have to stop and figure out that this is looking for billings that are over 90 days past due. This same thing happens with MIN() and MAX() functions.

SQL also has funny rules about comparing VARCHAR strings, which can cause problems. When two strings are compared for equality, the shorter one is right-padded with blanks; then they are compared position for position. Thus, the strings 'John ' and 'John   ', which differ only in their trailing blanks, are equal. You will have to check your implementation of SQL to see which string is returned as the MAX() and which as the MIN(), or whether there is any pattern to it at all.

There are some tricks with extrema functions in subqueries that differ from product to product. For example, to find the current employee status in a table of Salary Histories, the obvious query is:

SELECT *
  FROM SalaryHistory AS S0
 WHERE S0.change_date
       = (SELECT MAX(S1.change_date)
            FROM SalaryHistory AS S1
           WHERE S0.emp_id = S1.emp_id);

But you can also write the query as:

SELECT *
  FROM SalaryHistory AS S0
 WHERE NOT EXISTS
       (SELECT *
          FROM SalaryHistory AS S1
         WHERE S0.emp_id = S1.emp_id
           AND S0.change_date < S1.change_date);

The correlated subquery with a MAX() will be implemented by going to the subquery and building a working table, which is grouped by emp_id. Then for each group you will keep track of the maximum and save it for the final result.

However, the NOT EXISTS version can stop probing as soon as it finds the first row that meets the criteria, because at that point the EXISTS() test is known to be TRUE. Therefore, the NOT EXISTS() predicate might run faster.
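Whatever the relative speed on a given product, the two formulations should return the same rows. This is a minimal sketch, assuming Python's standard sqlite3 module as a stand-in engine; the salary column, the data, and the added ORDER BY are hypothetical, and dates are stored as ISO-8601 text so that string order matches date order. It says nothing about which form is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SalaryHistory
(emp_id INTEGER NOT NULL,
 change_date TEXT NOT NULL,   -- ISO-8601 text (assumption for SQLite)
 salary INTEGER NOT NULL,     -- hypothetical column
 PRIMARY KEY (emp_id, change_date));
INSERT INTO SalaryHistory VALUES
 (1, '2004-01-01', 1000), (1, '2005-06-01', 1200),
 (2, '2003-03-15', 900), (2, '2004-09-01', 950), (2, '2005-02-01', 975);
""")

with_max = conn.execute("""
SELECT * FROM SalaryHistory AS S0
 WHERE S0.change_date = (SELECT MAX(S1.change_date)
                           FROM SalaryHistory AS S1
                          WHERE S0.emp_id = S1.emp_id)
 ORDER BY emp_id
""").fetchall()

with_not_exists = conn.execute("""
SELECT * FROM SalaryHistory AS S0
 WHERE NOT EXISTS (SELECT * FROM SalaryHistory AS S1
                    WHERE S0.emp_id = S1.emp_id
                      AND S0.change_date < S1.change_date)
 ORDER BY emp_id
""").fetchall()

print(with_max)
print(with_not_exists)
```

To compare performance on a real product, look at that product's query plan for each form, since the text notes that the behavior differs from product to product.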

21.4.2 Generalized Extrema Functions

This is known as the Top (or Bottom) (n) values problem, and it originally appeared in Explain magazine; it was submitted by Jim Wankowski of Hawthorne, CA (Wankowski n.d.). You are given a table of Personnel and their salaries. Write a single SQL query that will display the three highest salaries from that table. It is easy to find the maximum salary with the simple query SELECT MAX(salary) FROM Personnel; but SQL does not have a maximum function that will return a group of high values from a column. The trouble with this query is that the specification is bad, for several reasons.

1. How do we define “best salary” in terms of an ordering? Is it base pay or does it include commissions? For the rest of this section, assume that we are using a simple table with a column that has the salary for each employee.

2. What if we have three or fewer personnel in the company? Do we report all the personnel we do have? Or do we return a NULL, empty result set, or error message? This is the equivalent of calling off the contest for lack of entries.

3. How do we handle two personnel who tied? Include them all and allow the result set to be bigger than three? Pick an arbitrary subset and exclude someone? Or do we return a NULL, empty result set, or error message?

To make these problems more explicit, consider this table:

Personnel
emp_name    salary
==================
'Able'     1000.00
'Baker'     900.00
