CHAPTER 21: AGGREGATE FUNCTIONS
'Chester' 'A.' 'Arthur' 'R' 1881 1885
'Grover' ' ' 'Cleveland' 'D' 1885 1889
'Benjamin' ' ' 'Harrison' 'R' 1889 1893
'Grover' ' ' 'Cleveland' 'D' 1893 1897
'William' ' ' 'McKinley' 'R' 1897 1901
'Theodore' ' ' 'Roosevelt' 'R' 1901 1909
'William' 'H.' 'Taft' 'R' 1909 1913
'Woodrow' ' ' 'Wilson' 'D' 1913 1921
'Warren' 'G.' 'Harding' 'R' 1921 1923
'Calvin' ' ' 'Coolidge' 'R' 1923 1929
'Herbert' 'C.' 'Hoover' 'R' 1929 1933
'Franklin' 'D.' 'Roosevelt' 'D' 1933 1945
'Harry' 'S.' 'Truman' 'D' 1945 1953
'Dwight' 'D.' 'Eisenhower' 'R' 1953 1961
'John' 'F.' 'Kennedy' 'D' 1961 1963
'Lyndon' 'B.' 'Johnson' 'D' 1963 1969
'Richard' 'M.' 'Nixon' 'R' 1969 1974
'Gerald' 'R.' 'Ford' 'R' 1974 1977
'James' 'E.' 'Carter' 'D' 1977 1981
'Ronald' 'W.' 'Reagan' 'R' 1981 1989
'George' 'H.W.' 'Bush' 'R' 1989 1993
'William' 'J.' 'Clinton' 'D' 1993 2001
'George' 'W.' 'Bush' 'R' 2001 NULL
Your civics teacher has just asked you to tell her how many people have been President of the United States. So you write the query as
SELECT COUNT(*) FROM Presidents;
and get the wrong answer. For those of you who have been out of high school too long, more than one Adams, more than one John, and more than one Roosevelt have served as president. Many people have had more than one term in office, and Grover Cleveland served two discontinuous terms. In short, this database is not a simple one-row, one-person system. What you really want is not COUNT(*), but something that is able to look at unique combinations of multiple columns. You cannot do this in one column, so you need to construct an expression that is unique. The point is that you need to be very sure that the expression you are using as a parameter is really what you wanted to count.
The COUNT([ALL] <value expression>) returns the number of members in the <value expression> set. The NULLs are thrown away before the counting takes place, and an empty set returns zero. The best way to read this is: “Count the number of known values in this expression,” with stress on the word known. In this example you might use COUNT(first_name || ' ' || initial || ' ' || last_name). Because this form keeps duplicates, it still counts Grover Cleveland twice.
The COUNT(DISTINCT <value expression>) returns the number of unique members in the <value expression> set. The NULLs are thrown away before the counting takes place, and then all redundant duplicates are removed (i.e., we keep one copy). Again, an empty set returns a zero, just as with the other counting functions. Applying this function to a key or a unique column is the same as using the COUNT(*) function, but the optimizer may not be smart enough to spot it.
Notice that the use of the keywords ALL and DISTINCT follows the same pattern here as it does in the [ALL | DISTINCT] options in the SELECT clause of the query expressions.
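As a rough sketch of the difference, the three forms can be run side by side against the Presidents table. The column names first_name, initial, and last_name are taken from the expression above; the alias names are made up for illustration, and the comments assume that the initial column holds a blank rather than a NULL when there is no middle initial, as in the sample rows.

```sql
-- Sketch only: assumes Presidents has the columns first_name, initial,
-- and last_name used in the expression above, with no NULLs in them.
SELECT COUNT(*) AS row_cnt,              -- one per term-of-office row
       COUNT(first_name || ' ' || initial || ' ' || last_name)
         AS known_name_cnt,              -- same as row_cnt when no part is NULL
       COUNT(DISTINCT first_name || ' ' || initial || ' ' || last_name)
         AS person_cnt                   -- Grover Cleveland is counted once
  FROM Presidents;
```

In Standard SQL, a NULL in any part of the name makes the whole concatenation NULL, so that row would drop out of both COUNT(<value expression>) forms but not out of COUNT(*).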
21.2 SUM() Functions
This function works only with numeric values. You should also consult your particular product’s manuals to find out the precision of the results for exact and approximate numeric data types.
SUM([ALL] <value expression>) returns the numeric total of all known values. The NULLs are removed before the summation takes place. An empty set returns an empty result set, not a zero. If there are other columns in the SELECT list, then that empty set will be converted into a NULL.
SUM(DISTINCT <value expression>) returns the numeric total of all known, unique values. The NULLs and all redundant duplicates are removed before the summation takes place. Again, an empty set returns an empty result set, not a zero.
That last rule is hard for people to understand. If there are other columns in the SELECT list, then that empty result set will be converted into a NULL. This is true for the rest of the Standard aggregate functions:
no rows
SELECT SUM(x)
FROM EmptyTable;
one row with (0, NULL) in it
SELECT COUNT(*), SUM(x)
FROM EmptyTable;
The summation of a set of numbers looks as though it should be easy, but it is not. Make two tables with the same set of positive and negative approximate numeric values, but put one in random order and have the other sorted by absolute value. The sorted table will give more accurate results. The reason is simple: positive and negative values of the same magnitude will be added together and will get a chance to cancel each other out. There is also less chance of an overflow or underflow error during calculations. Most PC SQL implementations and a lot of mainframe implementations do not bother with this trick, because it would require a sort for every SUM() statement, which would take a long time.
Whenever an exact or approximate numeric value is assigned to an exact numeric, it may not fit into the storage allowed for it. SQL says that the database engine will use an approximation that preserves leading significant digits of the original number after rounding or truncating. The choice of whether to truncate or round is implementation-defined, however. This can lead to some surprises when you have to shift data among SQL implementations, or move storage values from a host language program into an SQL table. It is probably a good idea to create the columns with one more decimal place than you think you need.
Truncation is defined as truncation toward zero; this means that 1.5 would truncate to 1, and −1.5 would truncate to −1. This is not true for all programming languages; everyone agrees on truncation toward zero for the positive numbers, but you will find that negative numbers may truncate away from zero (e.g., −1.5 would truncate to −2). SQL is also wishy-washy on rounding, leaving the implementation free to determine its method. There are two major types of rounding, the scientific method and the commercial method, which are discussed in Section 3.2.1 on rounding and truncation math in SQL.
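One way to find out what your own product does is a small experiment along the following lines. This is only a sketch; RoundingTest is a hypothetical table invented for the test, and the result is deliberately not shown because it is implementation-defined.

```sql
-- Sketch only: RoundingTest is a hypothetical table, not part of the text.
CREATE TABLE RoundingTest
(label CHAR(3) NOT NULL PRIMARY KEY,
 x DECIMAL(5,0) NOT NULL);   -- no room for a fractional part

INSERT INTO RoundingTest VALUES ('pos', 1.5);
INSERT INTO RoundingTest VALUES ('neg', -1.5);

-- Truncation toward zero stores 1 and -1; anything else means the
-- product rounded the values on assignment.
SELECT label, x
  FROM RoundingTest;
```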
21.3 AVG() Functions
AVG([ALL] <value expression>) returns the average of the values in the <value expression> set. An empty set returns an empty result set. A set of all NULLs will become an empty set. Remember that in general, AVG(x) is not the same as (SUM(x)/COUNT(*)); the SUM(x) function has thrown away the NULLs, but the COUNT(*) has not.
Likewise, AVG(DISTINCT <value expression>) returns the average of the distinct known values in the <value expression> set. Applying this function to a key or a unique column is the same as using the AVG(<value expression>) function.
Remember that in general AVG(DISTINCT x) is not the same as AVG(x) or (SUM(DISTINCT x)/COUNT(*)). The SUM(DISTINCT x) function has thrown away the duplicate values and NULLs, but the COUNT(*) has not. An empty set returns an empty result set.
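A small sketch makes the differences concrete. AvgDemo is a hypothetical table invented for this illustration; the comments show the arithmetic each expression performs, and the precision actually displayed will vary by product.

```sql
-- Sketch only: AvgDemo is a hypothetical table used for illustration.
CREATE TABLE AvgDemo (x DECIMAL(5,2));   -- NULLs allowed on purpose

INSERT INTO AvgDemo VALUES (10.00);
INSERT INTO AvgDemo VALUES (10.00);
INSERT INTO AvgDemo VALUES (40.00);
INSERT INTO AvgDemo VALUES (NULL);

SELECT AVG(x) AS avg_all,                   -- (10 + 10 + 40) / 3 = 20
       SUM(x) / COUNT(*) AS sum_over_rows,  -- 60 / 4 = 15, NULL row counted
       SUM(x) / COUNT(x) AS sum_over_known, -- 60 / 3 = 20, matches AVG(x)
       AVG(DISTINCT x) AS avg_distinct,     -- (10 + 40) / 2 = 25
       SUM(DISTINCT x) / COUNT(*) AS distinct_sum_over_rows  -- 50 / 4 = 12.5
  FROM AvgDemo;
```

Dividing by COUNT(x) instead of COUNT(*) gives the same result as AVG(x), because both ignore the NULL.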
The SQL engine is probably using the same code for the totaling in the AVG() that it used in the SUM() function. This leads to the same problems with rounding and truncation, so you should experiment a little with your particular product to find out what happens.
But even more troublesome than those problems is the problem with the average itself, because it does not really measure central tendency and can be very misleading. Consider the chart below, from Darrell Huff’s superlative little book, How to Lie with Statistics (Huff 1954). The Sample Company has 25 employees, earning the following salaries:
Number of
Employees   Salary     Statistic
===================================
   12       $2,000     Mode, Minimum
    1       $3,000     Median
    4       $3,700
    3       $5,000
    1       $5,700     Average
    2       $10,000
    1       $15,000
    1       $45,000    Maximum
The average salary (or, more properly, the arithmetic mean) is $5,700. When the boss is trying to look good to the unions, he uses this figure. When the unions are trying to look impoverished, they use the mode, which is the most frequently occurring value, to show that the exploited workers are making $2,000 (which is also the minimum salary in this case).
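The two figures can be checked with a short sketch. SalaryBands is a hypothetical table invented to hold the chart, one row per salary level; it is not part of Huff's example.

```sql
-- Sketch only: SalaryBands is a hypothetical table holding the chart above.
CREATE TABLE SalaryBands
(emp_cnt INTEGER NOT NULL,
 salary DECIMAL(8,2) NOT NULL PRIMARY KEY);

INSERT INTO SalaryBands VALUES (12, 2000.00);
INSERT INTO SalaryBands VALUES (1, 3000.00);
INSERT INTO SalaryBands VALUES (4, 3700.00);
INSERT INTO SalaryBands VALUES (3, 5000.00);
INSERT INTO SalaryBands VALUES (1, 5700.00);
INSERT INTO SalaryBands VALUES (2, 10000.00);
INSERT INTO SalaryBands VALUES (1, 15000.00);
INSERT INTO SalaryBands VALUES (1, 45000.00);

-- Arithmetic mean: 142,500 total payroll / 25 employees = 5,700
SELECT SUM(emp_cnt * salary) / SUM(emp_cnt) AS mean_salary
  FROM SalaryBands;

-- Mode: the salary level with the most employees, here 2,000
SELECT salary AS modal_salary
  FROM SalaryBands
 WHERE emp_cnt = (SELECT MAX(emp_cnt) FROM SalaryBands);
```

The mode query returns more than one row when two salary levels tie for the largest head count.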
A better measure in this case is the median, which will be discussed later; that is, the employee with just as many cases above him as below him. That gives us $3,000. The rule for calculating the median is that if there is no actual entity with that value, you fake it. Most people take an average of the two values on either side of where the median would be; others jump to the higher or lower value. The mode also has a problem, because not every distribution of values has one mode. Imagine a country in which there are as many very poor people as there are very rich people, and there is nobody in between. This would be a bimodal distribution. If there were sharp classes of incomes, that would be a multimodal distribution.
Some SQL products have median and mode aggregate functions as extensions, but they are not part of the standard. We will discuss in detail how to write them in pure SQL in Chapter 23.
21.3.1 Averages with Empty Groups
The query used here is a bit tricky, so this section can be skipped on your first reading. Sometimes you need to count an empty set as part of the population when computing an average.
This is easier to explain with an example that was posted on CompuServe. A fish and game warden is sampling different bodies of water for fish populations. Each sample falls into one or more groups (muddy bottoms, clear water, still water, and so on), and she is trying to find the average of something that is not there. This is neither quite as strange as it first sounds, nor quite as simple, either. She is collecting sample data on fish in a table like this:
CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 fish CHAR(20) NOT NULL,
 found_cnt INTEGER NOT NULL,
 PRIMARY KEY (sample_id, fish));

CREATE TABLE SampleGroups
(group_id INTEGER NOT NULL,
 sample_id INTEGER NOT NULL,
 PRIMARY KEY (group_id, sample_id));
Assume some of the data looks like this:
Samples
sample_id  fish       found_cnt
================================
1          'Seabass'  14
1          'Minnow'   18
2          'Seabass'  19
SampleGroups
group_id sample_id
=====================
1 1
1 2
2 2
She needs to get the average number of each species of fish in the sample groups. For example, using sample group 1 as shown, which has samples 1 and 2, we could use the parameters :my_fish = 'Minnow' and :my_group = 1 to find the average number of minnows in sample group 1, thus:
SELECT fish, AVG(found_cnt)
FROM Samples
WHERE sample_id
IN (SELECT sample_id
FROM SampleGroups
WHERE group_id = :my_group)
AND fish = :my_fish
GROUP BY fish;
But this query will give us an average of 18 minnows, which is wrong. There were no minnows for sample_id = 2, so the average is ((18 + 0)/2) = 9. The other way is to do several steps to get the correct answer: first use a SELECT statement to get the number of samples involved, then another SELECT to get the sum, and then manually calculate the average.
The obvious answer is to enter a count of zero for each animal under each sample_id, instead of letting it be missing, so you can use the original query. You can create the missing rows with:
INSERT INTO Samples (sample_id, fish, found_cnt)
SELECT DISTINCT M1.sample_id, M2.fish, 0 -- DISTINCT: one row per missing pair
  FROM Samples AS M1, Samples AS M2
 WHERE NOT EXISTS (SELECT *
                     FROM Samples AS M3
                    WHERE M1.sample_id = M3.sample_id
                      AND M2.fish = M3.fish);
Unfortunately, it turns out that we have over 100,000 different species of fish and thousands of samples. This trick will fill up more disk space than we have on the machine. The best trick is to use this statement:
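SELECT fish, SUM(found_cnt)/
       (SELECT COUNT(sample_id) FROM SampleGroups WHERE group_id = :my_group)
  FROM Samples
 WHERE fish = :my_fish
   AND sample_id -- total only the samples that belong to the group
       IN (SELECT sample_id FROM SampleGroups WHERE group_id = :my_group)
 GROUP BY fish;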
This query is using the rule that the average is the sum of values divided by the count of the set. Another way to do this would be to use an OUTER JOIN and preserve all the group IDs, but that would create NULLs for the fish that are not in some of the sample groups, and you would have to handle them.
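The OUTER JOIN alternative might be sketched as follows. This is only one way to handle the NULLs, not necessarily the query the original poster used. It assumes the Samples and SampleGroups tables and the :my_fish and :my_group parameters shown earlier; the alias avg_found_cnt is made up for illustration.

```sql
-- Sketch only: preserve every sample in the group, then turn the NULL
-- produced for a sample with no row for this fish into a zero.
SELECT AVG(COALESCE(S.found_cnt, 0)) AS avg_found_cnt
  FROM SampleGroups AS G
       LEFT OUTER JOIN
       Samples AS S
       ON S.sample_id = G.sample_id
          AND S.fish = :my_fish      -- filter in the ON clause, not in WHERE
 WHERE G.group_id = :my_group;
```

With the sample data, 'Minnow' in group 1 averages 18 and 0, giving 9. Moving the fish test into the WHERE clause would discard the preserved rows and bring back the wrong answer of 18.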
21.3.2 Averages across Columns
The sum of several columns can be done with the COALESCE() function to effectively remove the NULLs by replacing them with zeros:
SELECT (COALESCE(c1, 0.0) + COALESCE(c2, 0.0) + COALESCE(c3, 0.0)) AS c_total FROM Foobar;
Likewise, the minimum and maximum values of several columns can be done with a CASE expression, or the GREATEST() and LEAST() functions.
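A CASE expression for the largest of three columns might look like the sketch below. It assumes the same Foobar table and that c1, c2, and c3 are all known; a NULL in any column makes the comparisons UNKNOWN and needs separate handling. GREATEST() and LEAST() are available in several products but not in all of them, so check your manual before relying on them.

```sql
-- Sketch only: assumes c1, c2, and c3 contain no NULLs.
SELECT CASE WHEN c1 >= c2 AND c1 >= c3 THEN c1  -- c1 is at least as big as both
            WHEN c2 >= c3 THEN c2               -- c1 lost, so compare the rest
            ELSE c3
       END AS c_max
  FROM Foobar;
```

The smallest value follows the same pattern with the comparisons reversed.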
Taking an average across several columns is easy if none of the columns are NULL. You simply add the values and divide by the number of columns. However, getting rid of NULLs is a bit harder. The first trick is to count the NULLs:
SELECT (COALESCE(c1-c1, 1) + COALESCE(c2-c2, 1) + COALESCE(c3-c3, 1)) AS null_cnt FROM Foobar;
The trick is to watch out for a row with all NULLs in it. This could lead to a division by zero error.
SELECT CASE WHEN COALESCE(c1, c2, c3) IS NULL
            THEN NULL
            ELSE (COALESCE(c1, 0.0)
                  + COALESCE(c2, 0.0)
                  + COALESCE(c3, 0.0))
                 / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
       END AS horizontal_avg
  FROM Foobar;
21.4 Extrema Functions
The MIN() and MAX() functions are known as extrema functions in mathematics. They assume that the elements of the set have an ordering, so it makes sense to select a first or last element based on its value. SQL provides two simple extrema functions, and you can write queries to generalize these to (n) elements.
21.4.1 Simple Extrema Functions
MAX([ALL | DISTINCT] <value expression>) returns the greatest known value in the <value expression> set. This function will work on character and temporal values, as well as numeric values. An empty set returns an empty result set. Technically, you can write MAX(DISTINCT <value expression>), but it is the same as MAX(<value expression>); this form exists only for completeness, and nobody ever uses it.
MIN([ALL | DISTINCT] <value expression>) returns the smallest known value in the <value expression> set. This function will also work on character and temporal values, as well as numeric values. An empty set returns a NULL. Likewise, MIN(DISTINCT <value expression>) exists, but it is defined only for completeness and nobody ever uses it.
The MAX() for a set of numeric values is the largest. The MAX() for a set of temporal data types is the one closest to 9999-12-31, which is the final date in the ISO-8601 Standard. The MAX() for a set of character strings is the last one in the ascending sort order. Likewise, the MIN() for a set of numeric values is the smallest. The MIN() for a set of temporal data types is the one furthest from 9999-12-31. The MIN() for a set of character strings is the first one in the ascending sort order, but you have to know the collation used.
People have a hard time understanding the MAX() and MIN() aggregate functions when they are applied to temporal data types. They seem to expect the MAX() to return the date closest to the current date. Likewise, if the set has no dates before the current date, they seem to expect the MIN() function to return the date closest to the current date. Human psychology wants to use the current time as an origin point for temporal reasoning.
Consider the predicate “billing_date < (CURRENT_DATE - INTERVAL '90' DAY)” as an example. Most people have to stop and figure out that this is looking for billings that are over 90 days past due. This same thing happens with MIN() and MAX() functions.
SQL also has funny rules about comparing VARCHAR strings, which can cause problems. When two strings are compared for equality, the shorter one is right-padded with blanks; then they are compared position for position. Thus, the strings 'John' and 'John   ' (with trailing blanks) are equal. You will have to check your implementation of SQL to see which string is returned as the MAX() and which as the MIN(), or whether there is any pattern to it at all.
There are some tricks with extrema functions in subqueries that differ from product to product. For example, to find the current employee status in a table of Salary Histories, the obvious query is:
SELECT * FROM SalaryHistory AS S0 WHERE S0.change_date = (SELECT MAX(S1.change_date) FROM SalaryHistory AS S1 WHERE S0.emp_id = S1.emp_id);
But you can also write the query as:
SELECT * FROM SalaryHistory AS S0 WHERE NOT EXISTS
(SELECT * FROM SalaryHistory AS S1 WHERE S0.emp_id = S1.emp_id AND S0.change_date < S1.change_date);
The correlated subquery with a MAX() will be implemented by going to the subquery and building a working table, which is grouped by emp_id. Then for each group you will keep track of the maximum and save it for the final result.
However, the NOT EXISTS version will find the first row that meets the criteria and, when found, return TRUE. Therefore, the NOT EXISTS() predicate might run faster.
21.4.2 Generalized Extrema Functions
This is known as the Top (or Bottom) (n) values problem, and it originally appeared in Explain magazine; it was submitted by Jim Wankowski of Hawthorne, CA (Wankowski n.d.). You are given a table of Personnel and their salaries. Write a single SQL query that will display the three highest salaries from that table. It is easy to find the maximum salary with the simple query SELECT MAX(salary) FROM Personnel; but SQL does not have a maximum function that will return a group of high values from a column. The trouble with this query is that the specification is bad, for several reasons.
1. How do we define “best salary” in terms of an ordering? Is it base pay or does it include commissions? For the rest of this section, assume that we are using a simple table with a column that has the salary for each employee.
2. What if we have three or fewer personnel in the company? Do we report all the personnel we do have? Or do we return a NULL, empty result set, or error message? This is the equivalent of calling the contest for lack of entries.
3. How do we handle two personnel who tied? Include them all and allow the result set to be bigger than three? Pick an arbitrary subset and exclude someone? Or do we return a NULL, empty result set, or error message?
To make these problems more explicit, consider this table:
Personnel
emp_name salary
==================
'Able' 1000.00
'Baker' 900.00