462 CHAPTER 21: AGGREGATE FUNCTIONS
or:
SELECT P2.dept_nbr, MIN(P1.salary_amt)
FROM Personnel AS P1, Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr
AND P1.salary_amt >= P2.salary_amt
GROUP BY P2.dept_nbr, P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;
21.4.4 GREATEST() and LEAST() Functions
Oracle has a proprietary pair of functions that return the greatest and least values, respectively: a sort of “horizontal” MAX() and MIN(). The syntax is GREATEST (<list of values>) and LEAST (<list of values>). Awkwardly, DB2 allows MIN and MAX as synonyms for LEAST and GREATEST.
If you have NULLs, you have to decide whether they sort high or low, and whether they are excluded or propagate to the result, so this function can be defined in several ways.
If you don’t have NULLs in the data:
CASE WHEN col1 > col2 THEN col1 ELSE col2 END
If you want the highest non-NULL value:
CASE WHEN col1 > col2 THEN col1 ELSE COALESCE(col2, col1) END
If you want to return NULL where one of the columns is NULL:
CASE WHEN col1 > col2 OR col1 IS NULL
THEN col1 ELSE col2 END
But for the rest of this section, let’s assume (a < b) and NULL is high:
GREATEST (a, b) = b
GREATEST (a, NULL) = NULL
GREATEST (NULL, b) = NULL
GREATEST (NULL, NULL) = NULL
We can write this as:
GREATEST(x, y) ::= CASE WHEN (COALESCE (x, y) > COALESCE (y, x)) OR x IS NULL THEN x
ELSE y END
The rules for LEAST() are:
LEAST (a, b) = a
LEAST (a, NULL) = a
LEAST (NULL, b) = b
LEAST (NULL, NULL) = NULL
This is written:
LEAST(x, y) ::= CASE WHEN (COALESCE (x, y) <= COALESCE (y, x)) THEN COALESCE (x, y)
ELSE COALESCE (y, x) END
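The two sets of rules above can be checked mechanically. The following is only a sketch: it uses Python's sqlite3 module as a convenient SQL engine (an assumption, not something the text relies on), binds x and y as parameters, and uses the "return NULL where one of the columns is NULL" CASE expression from earlier in this section for the "NULL is high" GREATEST().

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "NULL is high" GREATEST: a NULL in either argument makes the result NULL.
GREATEST_SQL = "SELECT CASE WHEN :x > :y OR :x IS NULL THEN :x ELSE :y END"

# LEAST as defined in the text: NULLs are skipped unless both are NULL.
LEAST_SQL = """
SELECT CASE WHEN COALESCE(:x, :y) <= COALESCE(:y, :x)
            THEN COALESCE(:x, :y)
            ELSE COALESCE(:y, :x) END
"""

def evaluate(sql, x, y):
    """Run a one-row SELECT with x and y bound as parameters."""
    return conn.execute(sql, {"x": x, "y": y}).fetchone()[0]

# a = 1 and b = 2, with None standing in for NULL.
pairs = [(1, 2), (2, 1), (1, None), (None, 2), (None, None)]
greatest_results = [evaluate(GREATEST_SQL, x, y) for x, y in pairs]
least_results = [evaluate(LEAST_SQL, x, y) for x, y in pairs]

for (x, y), g, l in zip(pairs, greatest_results, least_results):
    print(f"x={x!r:5} y={y!r:5} GREATEST={g!r:5} LEAST={l!r}")
```

Running every combination of a, b, and NULL through both expressions is a cheap way to confirm that a hand-written definition really matches the rule table you intended.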
This can be done in Standard SQL, but it takes a little bit of work. Let’s assume that we have a table that holds the scores for a player in a series of five games and we want to get his best score from all five games.
CREATE TABLE Games
(player CHAR(10) NOT NULL PRIMARY KEY,
score_1 INTEGER NOT NULL DEFAULT 0,
score_2 INTEGER NOT NULL DEFAULT 0,
score_3 INTEGER NOT NULL DEFAULT 0,
score_4 INTEGER NOT NULL DEFAULT 0,
score_5 INTEGER NOT NULL DEFAULT 0);
We want to find GREATEST (score_1, score_2, score_3, score_4, score_5):
SELECT player, MAX(CASE X.seq_nbr
WHEN 1 THEN score_1
WHEN 2 THEN score_2
WHEN 3 THEN score_3
WHEN 4 THEN score_4
WHEN 5 THEN score_5
ELSE NULL END) AS best_score
FROM Games CROSS JOIN (VALUES (1), (2), (3), (4), (5)) AS X(seq_nbr)
GROUP BY player;
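As a rough way to try the cross-join query, here is a sketch that runs it in SQLite through Python's sqlite3 module. The sample rows are invented for illustration, and the sequence table is written as a common table expression to stay within syntax SQLite is known to accept. SQLite's multi-argument MAX() is used only as a cross-check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Games
(player CHAR(10) NOT NULL PRIMARY KEY,
 score_1 INTEGER NOT NULL DEFAULT 0,
 score_2 INTEGER NOT NULL DEFAULT 0,
 score_3 INTEGER NOT NULL DEFAULT 0,
 score_4 INTEGER NOT NULL DEFAULT 0,
 score_5 INTEGER NOT NULL DEFAULT 0);
-- Made-up sample rows, for illustration only.
INSERT INTO Games VALUES ('Alice', 3, 9, 4, 1, 7), ('Bob', 5, 2, 8, 8, 0);
""")

# One copy of each player row per seq_nbr; the CASE picks one score
# per copy and MAX() collapses the copies back to a single row.
unpivot_rows = conn.execute("""
WITH X(seq_nbr) AS (VALUES (1), (2), (3), (4), (5))
SELECT player,
       MAX(CASE X.seq_nbr
           WHEN 1 THEN score_1
           WHEN 2 THEN score_2
           WHEN 3 THEN score_3
           WHEN 4 THEN score_4
           WHEN 5 THEN score_5
           ELSE NULL END) AS best_score
  FROM Games CROSS JOIN X
 GROUP BY player
 ORDER BY player
""").fetchall()

# Cross-check against SQLite's scalar multi-argument MAX().
builtin_rows = conn.execute("""
SELECT player, MAX(score_1, score_2, score_3, score_4, score_5)
  FROM Games
 ORDER BY player
""").fetchall()

print(unpivot_rows)
print(builtin_rows)
```

Replacing MAX() with MIN() in the first query gives the LEAST() of the five columns with no other changes.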
Another approach is to use a pure CASE expression. As written here with <= comparisons, it returns the least of the five scores; reverse every comparison to >= to get the greatest:
CASE WHEN score_1 <= score_2 AND score_1 <= score_3
AND score_1 <= score_4 AND score_1 <= score_5
THEN score_1
WHEN score_2 <= score_3 AND score_2 <= score_4 AND score_2 <= score_5
THEN score_2
WHEN score_3 <= score_4 AND score_3 <= score_5
THEN score_3
WHEN score_4 <= score_5
THEN score_4
ELSE score_5 END
A final trick is to use a bit of algebra. You can define:
GREATEST(a, b) ::= (a + b + ABS(a - b)) / 2
LEAST(a, b) ::= (a + b - ABS(a - b)) / 2
Then iterate on it as a recurrence relation on numeric values. For example, for three items, you can use GREATEST (GREATEST(a, b), c), which expands to:
((a + b) + ABS(a - b) + 2 * c + ABS((a + b) + ABS(a - b) - 2 * c)) / 4
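The algebraic identities are easy to sanity-check outside the database. This sketch, in Python with hypothetical helper names, compares the two-argument formulas and the three-item expansion against the built-in max() and min() over a small grid of integers.

```python
from itertools import product

def greatest2(a, b):
    """GREATEST(a, b) ::= (a + b + ABS(a - b)) / 2"""
    return (a + b + abs(a - b)) / 2

def least2(a, b):
    """LEAST(a, b) ::= (a + b - ABS(a - b)) / 2"""
    return (a + b - abs(a - b)) / 2

def greatest3_expanded(a, b, c):
    """The three-item expansion, written out as a single expression."""
    return ((a + b) + abs(a - b) + 2 * c
            + abs((a + b) + abs(a - b) - 2 * c)) / 4

sample = range(-3, 4)

# Two-argument forms agree with max() and min() on every pair.
pairs_agree = all(
    greatest2(a, b) == max(a, b) and least2(a, b) == min(a, b)
    for a, b in product(sample, repeat=2)
)

# The expansion agrees with both max() and the nested two-argument form.
triples_agree = all(
    greatest3_expanded(a, b, c) == max(a, b, c) == greatest2(greatest2(a, b), c)
    for a, b, c in product(sample, repeat=3)
)

print(pairs_agree, triples_agree)
```

Note that the check uses true division; in SQL, integer operands may be divided with truncation, but the numerator is always even for two integers, so the two-argument form stays exact.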
You need to watch for possible overflow errors if the numbers are large, and NULLs propagate in the math functions. Here is the answer for five scores:
(score_1 + score_2 + 2*score_3 + 4*score_4 + 8*score_5
 + ABS(score_1 - score_2)
 + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
 + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2)))
 + ABS(score_1 + score_2 + 2*score_3 + 4*score_4 - 8*score_5
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
       + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
             + ABS(score_1 - score_2)
             + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2))))
) / 16
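The five-score expression is just the two-argument formula folded four times. The sketch below, with hypothetical function names, names each nested ABS() term once (t2 through t5) so the structure is visible, then checks the result against both a folded greatest2() and Python's max().

```python
from functools import reduce
from itertools import product

def greatest2(a, b):
    return (a + b + abs(a - b)) / 2

def greatest5_folded(scores):
    """Apply the two-argument formula as a recurrence."""
    return reduce(greatest2, scores)

def greatest5_expanded(s1, s2, s3, s4, s5):
    """Same value as the fully expanded expression, with each nested
    ABS() term computed once and reused."""
    t2 = abs(s1 - s2)
    t3 = abs((s1 + s2) + t2 - 2 * s3)
    t4 = abs(s1 + s2 + 2 * s3 - 4 * s4 + t2 + t3)
    t5 = abs(s1 + s2 + 2 * s3 + 4 * s4 - 8 * s5 + t2 + t3 + t4)
    return (s1 + s2 + 2 * s3 + 4 * s4 + 8 * s5 + t2 + t3 + t4 + t5) / 16

# Exhaustive check over a small grid of integer scores.
all_agree = all(
    greatest5_expanded(*scores) == greatest5_folded(scores) == max(scores)
    for scores in product(range(-2, 3), repeat=5)
)

print(all_agree)
```

The numerator grows to sixteen times the largest score, which is where the overflow warning comes from: Python integers do not overflow, but a fixed-width SQL numeric type can.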
21.5 The LIST() Aggregate Function
The LIST([DISTINCT] <string expression>) function is part of Sybase’s SQL Anywhere (formerly WATCOM SQL). It is the only aggregate function to work on character strings. It takes a column of strings, removes the NULLs, and merges them into a single result string with commas between the original strings. The DISTINCT option removes duplicates as well as NULLs before concatenating the strings. This function is a generalized version of concatenation, just as SUM() is a generalized version of addition.
MySQL 4.1 extended this function into the GROUP_CONCAT() function, which does the same thing but adds options for ORDER BY and SEPARATOR.
This is handy when you use SQL to write SQL queries. As one simple example, you can apply it against the schema tables and obtain the names of all the columns in a table, then use that list to expand a SELECT * into the current column list.
One nonproprietary way of doing this query is with scalar subquery expressions. Assume we have these two tables:
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
person_name CHAR(10) NOT NULL);
INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');
CREATE TABLE Clothes
(clothes_id INTEGER NOT NULL, -- person_id of the owner
seq_nbr INTEGER NOT NULL,
item_name CHAR(10) NOT NULL,
worn_flag CHAR(1) NOT NULL
CONSTRAINT worn_flag_yes_no CHECK (worn_flag IN ('Y', 'N')),
PRIMARY KEY (clothes_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
(2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
(3, 1, 'Shoes', 'N'),
(4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
Using the LIST() function, we could get an output of the outfits of the people with the simple query:
SELECT P0.person_id, P0.person_name, LIST(item_name) AS fashion
FROM People AS P0, Clothes AS C0
WHERE P0.person_id = C0.clothes_id
AND C0.worn_flag = 'Y'
GROUP BY P0.person_id, P0.person_name;
Result
person_id  person_name  fashion
==================================
1          'John'       'Hat, Glove'
2          'Mary'       'Hat, Coat'
4          'Jane'       'Socks'
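For readers without SQL Anywhere, a similar result can be produced with SQLite's GROUP_CONCAT(), which takes the separator as a second argument. This is only a sketch run through Python's sqlite3 module; the column names person_name and clothes_id are assumptions chosen to match the queries in this section. Because the order of items inside each concatenated string is not guaranteed here, the items are compared as sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name CHAR(10) NOT NULL);
INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes
(clothes_id INTEGER NOT NULL,   -- assumed name: the owner's person_id
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (clothes_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
""")

rows = conn.execute("""
SELECT P0.person_id, P0.person_name,
       GROUP_CONCAT(C0.item_name, ', ') AS fashion
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.clothes_id
   AND C0.worn_flag = 'Y'
 GROUP BY P0.person_id, P0.person_name
 ORDER BY P0.person_id
""").fetchall()

# Compare items as sets because concatenation order is not guaranteed.
fashion = {name: set(items.split(", ")) for _, name, items in rows}
print(fashion)
```

Fred does not appear because he has no rows with worn_flag = 'Y', the same behavior as the LIST() query.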
21.5.1 The LIST() Function with a Procedure
To do this without an aggregate function, you must first know the highest sequence number, so you can create the query. In this case, the query is a simple “SELECT MAX(seq_nbr) FROM Clothes” statement, but you might have to use a COUNT(*) for other tables. Each scalar subquery is wrapped in COALESCE() because a person with no row for a given seq_nbr yields a NULL, which would otherwise turn the whole concatenation into a NULL.
SELECT DISTINCT P0.person_id, P0.person_name,
SUBSTRING (
COALESCE ((SELECT CASE WHEN C1.worn_flag = 'Y'
THEN (', ' || item_name) ELSE '' END
FROM Clothes AS C1
WHERE C1.clothes_id = C0.clothes_id
AND C1.seq_nbr = 1), '') ||
COALESCE ((SELECT CASE WHEN C2.worn_flag = 'Y'
THEN (', ' || item_name) ELSE '' END
FROM Clothes AS C2
WHERE C2.clothes_id = C0.clothes_id
AND C2.seq_nbr = 2), '') ||
COALESCE ((SELECT CASE WHEN C3.worn_flag = 'Y'
THEN (', ' || item_name) ELSE '' END
FROM Clothes AS C3
WHERE C3.clothes_id = C0.clothes_id
AND C3.seq_nbr = 3), '')
FROM 3) AS list
FROM People AS P0, Clothes AS C0
WHERE P0.person_id = C0.clothes_id;
person_id  person_name  list
===============================
1          John         Hat, Glove
2          Mary         Hat, Coat
3          Fred
4          Jane         Socks
Again, the CASE expression on worn_flag can be replaced with an IS NULL test to replace NULLs with an empty string. If you don’t want to see that Fred is naked (that is, has an empty string of clothing), then change the outermost WHERE clause to read:
WHERE P0.person_id = C0.clothes_id AND C0.worn_flag = 'Y';
Since you don’t want to see a leading comma, remember to TRIM() it off or to use the SUBSTRING() function to remove the first two characters. I opted for the SUBSTRING(), because the TRIM() function requires a scan of the string.
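The "know the highest sequence number, then create the query" step is itself a small piece of SQL that writes SQL. The sketch below shows one way to do it in Python with sqlite3: it reads MAX(seq_nbr), generates one scalar subquery per sequence number, and strips the leading comma with SUBSTR(). The column names, the COALESCE() wrapper for missing rows, and correlating on People rather than a self-join are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name CHAR(10) NOT NULL);
INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes
(clothes_id INTEGER NOT NULL,   -- assumed name: the owner's person_id
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (clothes_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
""")

# Step 1: find out how many scalar subqueries the query needs.
max_seq = conn.execute("SELECT MAX(seq_nbr) FROM Clothes").fetchone()[0]

# Step 2: one subquery per sequence number; COALESCE covers people
# who have no row for that seq_nbr.
piece = ("COALESCE((SELECT CASE WHEN C.worn_flag = 'Y' "
         "THEN ', ' || C.item_name ELSE '' END "
         "FROM Clothes AS C "
         "WHERE C.clothes_id = P0.person_id AND C.seq_nbr = {n}), '')")
pieces = " || ".join(piece.format(n=n) for n in range(1, max_seq + 1))

# Step 3: SUBSTR(..., 3) drops the leading ', ' from the result.
query = f"""
SELECT P0.person_id, P0.person_name, SUBSTR({pieces}, 3) AS item_list
  FROM People AS P0
 ORDER BY P0.person_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the subqueries are correlated directly on People, no DISTINCT is needed and Fred still shows up with an empty string.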
21.5.2 The LIST() Function by Crosstabs
Carl Federl used this to get a similar result:
CREATE TABLE Crosstabs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
seq_nbr_1 INTEGER NOT NULL,
seq_nbr_2 INTEGER NOT NULL,
seq_nbr_3 INTEGER NOT NULL,
seq_nbr_4 INTEGER NOT NULL,
seq_nbr_5 INTEGER NOT NULL);
INSERT INTO Crosstabs
VALUES (1, 1, 0, 0, 0, 0),
(2, 0, 1, 0, 0, 0),
(3, 0, 0, 1, 0, 0),
(4, 0, 0, 0, 1, 0),
(5, 0, 0, 0, 0, 1);
SELECT Clothes.clothes_id,
TRIM (MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_1 * 10)) || ' ' ||
MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_2 * 10)) || ' ' ||
MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_3 * 10)) || ' ' ||
MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_4 * 10)) || ' ' ||
MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_5 * 10)))
FROM Clothes, Crosstabs
WHERE Clothes.seq_nbr = Crosstabs.seq_nbr
AND Clothes.worn_flag = 'Y'
GROUP BY Clothes.clothes_id;
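A sketch of the crosstab technique in SQLite, run through Python's sqlite3 module, may make the mechanics clearer. SQLite's SUBSTR(string, start, length) stands in for the standard SUBSTRING ... FROM ... FOR syntax, and the column name clothes_id is an assumption. Each MAX() sees an empty string for every row except the one whose seq_nbr matches its crosstab column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Clothes
(clothes_id INTEGER NOT NULL,   -- assumed name: the owner's person_id
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (clothes_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');

CREATE TABLE Crosstabs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 seq_nbr_1 INTEGER NOT NULL,
 seq_nbr_2 INTEGER NOT NULL,
 seq_nbr_3 INTEGER NOT NULL,
 seq_nbr_4 INTEGER NOT NULL,
 seq_nbr_5 INTEGER NOT NULL);
INSERT INTO Crosstabs
VALUES (1, 1, 0, 0, 0, 0), (2, 0, 1, 0, 0, 0), (3, 0, 0, 1, 0, 0),
       (4, 0, 0, 0, 1, 0), (5, 0, 0, 0, 0, 1);
""")

# A flag of 0 gives a zero-length substring; a flag of 1 keeps
# up to ten characters of the item name.
terms = " || ' ' || ".join(
    f"MAX(SUBSTR(item_name, 1, seq_nbr_{n} * 10))" for n in range(1, 6)
)
rows = conn.execute(f"""
SELECT Clothes.clothes_id, TRIM({terms}) AS item_list
  FROM Clothes, Crosstabs
 WHERE Clothes.seq_nbr = Crosstabs.seq_nbr
   AND Clothes.worn_flag = 'Y'
 GROUP BY Clothes.clothes_id
 ORDER BY Clothes.clothes_id
""").fetchall()

# Split on whitespace, since empty slots leave extra blanks behind.
worn = {person: items.split() for person, items in rows}
print(worn)
```

Unlike the scalar subquery version, this keeps the items in seq_nbr order by construction, at the cost of a fixed upper limit of five items.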
21.6 The PRD() Aggregate Function
Bob McGowan sent me a message on CompuServe asking for help with a problem. His client, a financial institution, tracks investment performance with a table something like this:
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
execute_date DATE NOT NULL,
rate_of_return DECIMAL(13,7) NOT NULL);
To calculate a rate of return over a date range, you use the formula:
(1 + rate_of_return [day_1])
* (1 + rate_of_return [day_2])
* (1 + rate_of_return [day_3])
* (1 + rate_of_return [day_4])
* ...
* (1 + rate_of_return [day_N])
How would you construct a query that would return one row for each portfolio’s return over the date range? What Mr. McGowan really wants is an aggregate function in the SELECT clause to return a columnar product, just as SUM() returns a columnar total.
If you were a math major, you would write these functions as capital Sigma (∑) for summation and capital Pi (∏) for product. If such an aggregate function existed in SQL, the syntax for it would look something like:
PRD ([DISTINCT] <expression>)
While I am not sure that there is any use for the DISTINCT option, the new aggregate function would let us write his problem simply as:
SELECT portfolio_id, PRD(1.00 + rate_of_return)
FROM Performance
WHERE execute_date BETWEEN start_date AND end_date
GROUP BY portfolio_id;
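Some products let you supply the missing aggregate yourself. As a sketch only, Python's sqlite3 module can register a user-defined aggregate with create_aggregate(); the Prd class, the decision to skip NULLs, and the sample rows below are all assumptions made for illustration, not part of any SQL standard.

```python
import sqlite3

class Prd:
    """Hypothetical PRD() aggregate: the product of the non-NULL inputs,
    or NULL when there are none (mirroring how SUM() treats NULLs)."""

    def __init__(self):
        self.product = None

    def step(self, value):
        if value is not None:
            self.product = value if self.product is None else self.product * value

    def finalize(self):
        return self.product

conn = sqlite3.connect(":memory:")
conn.create_aggregate("PRD", 1, Prd)
conn.executescript("""
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
 execute_date DATE NOT NULL,
 rate_of_return DECIMAL(13,7) NOT NULL);
-- Made-up sample rows; the February row falls outside the date range.
INSERT INTO Performance
VALUES ('PORT-A', '2006-01-06', 0.01), ('PORT-A', '2006-01-09', 0.02),
       ('PORT-A', '2006-01-10', -0.01), ('PORT-A', '2006-02-01', 0.50),
       ('PORT-B', '2006-01-07', 0.10), ('PORT-B', '2006-01-08', 0.10);
""")

rows = conn.execute("""
SELECT portfolio_id, PRD(1.00 + rate_of_return)
  FROM Performance
 WHERE execute_date BETWEEN ? AND ?
 GROUP BY portfolio_id
 ORDER BY portfolio_id
""", ("2006-01-06", "2006-01-10")).fetchall()

returns = dict(rows)
print(returns)
```

With a real PRD() available, the query is exactly the shape Mr. McGowan asked for; the rest of this section is about getting the same answer when you cannot extend the product.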
21.6.1 PRD() Function by Expressions
There is a trick to doing this, but you need a second table that looks like this and covers a period of five days:
CREATE TABLE BigPi
(execute_date DATE NOT NULL,
day_1 INTEGER NOT NULL,
day_2 INTEGER NOT NULL,
day_3 INTEGER NOT NULL,
day_4 INTEGER NOT NULL,
day_5 INTEGER NOT NULL);
Let’s assume we wanted to look at January 6 to 10, so we need to update the execute_date column to that range, thus:
INSERT INTO BigPi
VALUES ('2006-01-06', 1, 0, 0, 0, 0),
('2006-01-07', 0, 1, 0, 0, 0),
('2006-01-08', 0, 0, 1, 0, 0),
('2006-01-09', 0, 0, 0, 1, 0),
('2006-01-10', 0, 0, 0, 0, 1);
The idea is that there is a one in the column when BigPi.execute_date is equal to the nth date in the range, and a zero otherwise. The query for this problem is:
SELECT portfolio_id,
(SUM((1.00 + P1.rate_of_return) * M1.day_1) *
SUM((1.00 + P1.rate_of_return) * M1.day_2) *
SUM((1.00 + P1.rate_of_return) * M1.day_3) *
SUM((1.00 + P1.rate_of_return) * M1.day_4) *
SUM((1.00 + P1.rate_of_return) * M1.day_5)) AS product
FROM Performance AS P1, BigPi AS M1
WHERE M1.execute_date = P1.execute_date
AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
GROUP BY portfolio_id;
If a portfolio is missing a rate_of_return entry on a date in that range, its product will be zero. That might be fine, but if you need to get a NULL when you have missing data, then replace each SUM() expression with a CASE expression like this:
CASE WHEN SUM((1.00 + P1.rate_of_return) * M1.day_N) = 0.00
THEN CAST (NULL AS DECIMAL(6, 4))
ELSE SUM((1.00 + P1.rate_of_return) * M1.day_N) END
Alternately, if your SQL has the full set of Standard SQL expressions, use this version:
NULLIF (SUM((1.00 + P1.rate_of_return) * M1.day_N), 0.00)
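To see the zero-versus-NULL behavior on missing data, here is a sketch that runs the BigPi query in SQLite via Python's sqlite3 module. The sample portfolios are invented: PORT-A has all five days and PORT-B is missing January 8. The second run wraps each SUM() in NULLIF(..., 0.00), which is one way to turn the zero into a NULL; the product_query() helper is hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
 execute_date DATE NOT NULL,
 rate_of_return DECIMAL(13,7) NOT NULL);
CREATE TABLE BigPi
(execute_date DATE NOT NULL,
 day_1 INTEGER NOT NULL, day_2 INTEGER NOT NULL, day_3 INTEGER NOT NULL,
 day_4 INTEGER NOT NULL, day_5 INTEGER NOT NULL);
INSERT INTO BigPi
VALUES ('2006-01-06', 1, 0, 0, 0, 0), ('2006-01-07', 0, 1, 0, 0, 0),
       ('2006-01-08', 0, 0, 1, 0, 0), ('2006-01-09', 0, 0, 0, 1, 0),
       ('2006-01-10', 0, 0, 0, 0, 1);
-- Made-up sample rows; PORT-B has no entry for 2006-01-08.
INSERT INTO Performance
VALUES ('PORT-A', '2006-01-06', 0.01), ('PORT-A', '2006-01-07', 0.02),
       ('PORT-A', '2006-01-08', 0.00), ('PORT-A', '2006-01-09', -0.01),
       ('PORT-A', '2006-01-10', 0.03),
       ('PORT-B', '2006-01-06', 0.05), ('PORT-B', '2006-01-07', 0.05),
       ('PORT-B', '2006-01-09', 0.05), ('PORT-B', '2006-01-10', 0.05);
""")

def product_query(term_template):
    """Build the five-term product query from a per-day term template."""
    terms = " * ".join(term_template.format(n=n) for n in range(1, 6))
    return f"""
    SELECT portfolio_id, ({terms}) AS product
      FROM Performance AS P1, BigPi AS M1
     WHERE M1.execute_date = P1.execute_date
       AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
     GROUP BY portfolio_id
     ORDER BY portfolio_id
    """

plain = dict(conn.execute(product_query(
    "SUM((1.00 + P1.rate_of_return) * M1.day_{n})")).fetchall())
nulled = dict(conn.execute(product_query(
    "NULLIF(SUM((1.00 + P1.rate_of_return) * M1.day_{n}), 0.00)")).fetchall())

expected_a = math.prod([1.01, 1.02, 1.00, 0.99, 1.03])
print(plain, nulled, expected_a)
```

A rate_of_return of exactly -1.00 would also produce a zero term, so the NULL variant cannot tell a wiped-out day from a missing one.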
21.6.2 The PRD() Aggregate Function by Logarithms
Roy Harvey, another SQL guru who answered questions on CompuServe, found a different solution, one that could only come from someone old enough to remember slide rules and multiplication by adding logs. The nice part of this solution is that you can also use the DISTINCT option in the SUM() function.
But there are a lot of warnings about this approach. Some older SQL implementations might have trouble with using an aggregate function result as a parameter. This has always been part of the standard, but some SQL products use very different mechanisms for the aggregate functions.
Another, more fundamental problem is that a log of zero or less is undefined, so your SQL might return a NULL or an error message. You will also see some SQL products that use LN() for the natural log and LOG10() for the logarithm base ten, and some SQLs that use LOG(<parameter>, <base>) for a general logarithm function. Given all those warnings, the expression for the product of a column from logarithm and exponential functions is:
SELECT ((EXP (SUM (LN (CASE WHEN nbr = 0.00
THEN CAST (NULL AS FLOAT)
ELSE ABS(nbr) END))))
* (CASE WHEN MIN (ABS (nbr)) = 0.00
THEN 0.00
ELSE 1.00 END)
* (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1
THEN 1
ELSE 0 END), 2) = 1
THEN -1.00
ELSE 1.00 END)) AS big_pi
FROM NumberTable;
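The three-factor expression above can be exercised in SQLite through Python's sqlite3 module. This is a sketch under assumptions: LN(), EXP(), MOD(), and SIGN() are registered here as Python functions because not every SQLite build ships them, the null_safe() wrapper and sql_product() helper are hypothetical, and the NumberTable contents are invented.

```python
import math
import sqlite3

def null_safe(func):
    """Wrap a Python function so that any NULL argument yields NULL."""
    def wrapper(*args):
        return None if any(a is None for a in args) else func(*args)
    return wrapper

conn = sqlite3.connect(":memory:")
conn.create_function("LN", 1, null_safe(math.log))
conn.create_function("EXP", 1, null_safe(math.exp))
conn.create_function("MOD", 2, null_safe(lambda a, b: a % b))
conn.create_function("SIGN", 1, null_safe(lambda x: (x > 0) - (x < 0)))
conn.execute("CREATE TABLE NumberTable (nbr FLOAT NOT NULL)")

PRODUCT_SQL = """
SELECT ((EXP(SUM(LN(CASE WHEN nbr = 0.00
                         THEN CAST(NULL AS FLOAT)
                         ELSE ABS(nbr) END))))
        * (CASE WHEN MIN(ABS(nbr)) = 0.00
                THEN 0.00
                ELSE 1.00 END)
        * (CASE WHEN MOD(SUM(CASE WHEN SIGN(nbr) = -1
                                  THEN 1
                                  ELSE 0 END), 2) = 1
                THEN -1.00
                ELSE 1.00 END)) AS big_pi
  FROM NumberTable
"""

def sql_product(numbers):
    """Reload NumberTable with the given numbers and return big_pi."""
    conn.execute("DELETE FROM NumberTable")
    conn.executemany("INSERT INTO NumberTable VALUES (?)",
                     [(n,) for n in numbers])
    return conn.execute(PRODUCT_SQL).fetchone()[0]

odd_negatives = sql_product([2.0, -3.0, 4.0])    # one negative number
even_negatives = sql_product([-2.0, -3.0])       # two negative numbers
with_zero = sql_product([5.0, 0.0, -2.0])        # a zero in the column
print(odd_negatives, even_negatives, with_zero)
```

Because the magnitude goes through LN() and EXP(), results are floating-point approximations, so compare them with a tolerance rather than for exact equality.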
The nice part of this is that you can also use the SUM (DISTINCT <expression>) option to get the equivalent of PRD (DISTINCT <expression>).
You should watch the data type of the column involved and use either integer 0 and 1 or decimal 0.00 and 1.00 as is appropriate in the CASE expressions. It is worth studying the three CASE expressions that make up the terms of the PRD() calculation.
The first CASE expression ensures that all zeros and negative numbers are converted to a positive number or a NULL before LN() is applied, just in case your SQL raises an exception.
The second CASE expression will return zero as the answer if there is a zero in the nbr column of any selected row. The MIN(ABS(nbr)) trick is handy for detecting the existence of a zero in a list of both positive and negative numbers with an aggregate function.
The third CASE expression will return −1 if there is an odd number of negative numbers in the nbr column. The innermost CASE expression uses a SIGN() function, which returns +1 for a positive number, −1 for a negative number, and 0 for a zero. The SUM() counts the −1 results,