Joe Celko's SQL for Smarties - Advanced SQL Programming


462 CHAPTER 21: AGGREGATE FUNCTIONS

or:

SELECT P2.dept_nbr, MIN(P1.salary_amt)
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.dept_nbr = P2.dept_nbr
   AND P1.salary_amt >= P2.salary_amt
 GROUP BY P2.dept_nbr, P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;

21.4.4 GREATEST() and LEAST() Functions

Oracle has a proprietary pair of functions that return the greatest and least values, respectively: a sort of “horizontal” MAX() and MIN(). The syntax is GREATEST (<list of values>) and LEAST (<list of values>). Awkwardly, DB2 allows MIN and MAX as synonyms for LEAST and GREATEST.

If you have NULLs, then you have to decide whether they sort high or low, and whether they will be excluded or will propagate the NULL, so this function can be defined in several ways.

If you don’t have NULL s in the data:

CASE WHEN col1 > col2 THEN col1 ELSE col2 END

If you want the highest non-NULL value:

CASE WHEN col1 > col2 THEN col1 ELSE COALESCE(col2, col1) END

If you want to return NULL where one of the columns is NULL:

CASE WHEN col1 > col2 OR col1 IS NULL
     THEN col1 ELSE col2 END

But for the rest of this section, let’s assume (a < b) and that NULL is high:

GREATEST (a, b) = b
GREATEST (a, NULL) = NULL
GREATEST (NULL, b) = NULL
GREATEST (NULL, NULL) = NULL


We can write this as:

GREATEST(x, y) ::= CASE WHEN (x > y) OR (x IS NULL)
                        THEN x
                        ELSE y END

The rules for LEAST() are:

LEAST (a, b) = a

LEAST (a, NULL) = a

LEAST (NULL, b) = b

LEAST (NULL, NULL) = NULL

This is written:

LEAST(x, y) ::= CASE WHEN (COALESCE (x, y) <= COALESCE (y, x)) THEN COALESCE (x, y)

ELSE COALESCE (y, x) END
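As a cross-check on the two rule tables, here is a minimal Python sketch that models NULL as None. The function names are hypothetical, and the code only illustrates the stated rules (NULL sorts high); it is not a replacement for the SQL definitions.

```python
from typing import Optional


def greatest_null_high(x: Optional[int], y: Optional[int]) -> Optional[int]:
    """GREATEST with NULL sorting high: any NULL argument wins, so the result is NULL."""
    if x is None or y is None:
        return None
    return x if x > y else y


def least_null_high(x: Optional[int], y: Optional[int]) -> Optional[int]:
    """LEAST with NULL sorting high: the non-NULL argument wins; NULL only if both are NULL."""
    if x is None:
        return y
    if y is None:
        return x
    return x if x <= y else y


# Hypothetical sample values with a < b, as assumed in the text.
a, b = 3, 7
greatest_results = [greatest_null_high(a, b), greatest_null_high(a, None),
                    greatest_null_high(None, b), greatest_null_high(None, None)]
least_results = [least_null_high(a, b), least_null_high(a, None),
                 least_null_high(None, b), least_null_high(None, None)]
```

The two result lists line up, row by row, with the four GREATEST() rules and the four LEAST() rules given in the text.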

This can be done in Standard SQL, but it takes a little bit of work. Let’s assume that we have a table that holds the scores for a player in a series of five games, and we want to get his best score from all five games.

CREATE TABLE Games

(player CHAR(10) NOT NULL PRIMARY KEY,

score_1 INTEGER NOT NULL DEFAULT 0,

score_5 INTEGER NOT NULL DEFAULT 0);

and we want to find the GREATEST (score_1, score_2, score_3, score_4, score_5)

SELECT player, MAX(CASE X.seq_nbr

WHEN 1 THEN score_1

WHEN 2 THEN score_2

WHEN 3 THEN score_3

WHEN 4 THEN score_4

WHEN 5 THEN score_5

ELSE NULL END) AS best_score


FROM Games
     CROSS JOIN
     (VALUES (1), (2), (3), (4), (5)) AS X(seq_nbr)
GROUP BY player;
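The cross-join query can be tried out with Python's built-in sqlite3 module. This is only a sketch under a few assumptions: the columns score_2 through score_4 are declared the same way as score_1 and score_5, the two sample players are invented, and the five-row sequence table is written as a common table expression, which is a form SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Games
(player CHAR(10) NOT NULL PRIMARY KEY,
 score_1 INTEGER NOT NULL DEFAULT 0,
 score_2 INTEGER NOT NULL DEFAULT 0,  -- assumed: same shape as score_1
 score_3 INTEGER NOT NULL DEFAULT 0,  -- assumed
 score_4 INTEGER NOT NULL DEFAULT 0,  -- assumed
 score_5 INTEGER NOT NULL DEFAULT 0);

-- Hypothetical sample data.
INSERT INTO Games VALUES ('Alice', 10, 40, 30, 20, 5);
INSERT INTO Games VALUES ('Bob', 7, 1, 2, 3, 4);
""")

# Each Games row is repeated five times; the CASE picks one score per copy,
# and MAX() then collapses the five copies back to the best score.
query = """
WITH X(seq_nbr) AS (VALUES (1), (2), (3), (4), (5))
SELECT player,
       MAX(CASE X.seq_nbr
           WHEN 1 THEN score_1
           WHEN 2 THEN score_2
           WHEN 3 THEN score_3
           WHEN 4 THEN score_4
           WHEN 5 THEN score_5
           ELSE NULL END) AS best_score
  FROM Games CROSS JOIN X
 GROUP BY player
 ORDER BY player;
"""
best_scores = conn.execute(query).fetchall()
```

Replacing MAX() with MIN() in the same query gives the LEAST() of the five columns.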

Another approach is to use a pure CASE expression:

CASE WHEN score_1 >= score_2 AND score_1 >= score_3
          AND score_1 >= score_4 AND score_1 >= score_5
     THEN score_1
     WHEN score_2 >= score_3 AND score_2 >= score_4
          AND score_2 >= score_5
     THEN score_2
     WHEN score_3 >= score_4 AND score_3 >= score_5
     THEN score_3
     WHEN score_4 >= score_5
     THEN score_4
     ELSE score_5 END

A final trick is to use a bit of algebra. You can define:

GREATEST(a, b) ::= (a + b + ABS(a - b)) / 2
LEAST(a, b) ::= (a + b - ABS(a - b)) / 2

Then iterate on it as a recurrence relation on numeric values. For example, for three items, you can use GREATEST (GREATEST (a, b), c), which expands to:

((a + b) + ABS(a - b) + 2 * c
 + ABS((a + b) + ABS(a - b) - 2 * c)) / 4

You need to watch for possible overflow errors if the numbers are large, and NULLs propagate in the math functions. Here is the answer for five scores:

(score_1 + score_2 + 2*score_3 + 4*score_4 + 8*score_5
 + ABS(score_1 - score_2)
 + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
 + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2)))
 + ABS(score_1 + score_2 + 2*score_3 + 4*score_4 - 8*score_5
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
       + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
             + ABS(score_1 - score_2)
             + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2))))
) / 16
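The recurrence is easier to trust once it has been checked mechanically. The following Python sketch (function names are hypothetical) builds the five-score expression step by step, clearing the denominators as the expansion above does, and compares it with the pairwise formula applied repeatedly.

```python
from itertools import product


def greatest2(a, b):
    """Pairwise algebraic definition: (a + b + ABS(a - b)) / 2."""
    return (a + b + abs(a - b)) / 2


def greatest_chain(*values):
    """Apply the pairwise formula repeatedly, left to right."""
    result = values[0]
    for v in values[1:]:
        result = greatest2(result, v)
    return result


def greatest5_expanded(s1, s2, s3, s4, s5):
    """Five-score expansion with the divisions postponed to a single / 16."""
    a = s1 + s2 + abs(s1 - s2)          # 2 * GREATEST(s1, s2)
    b = a + 2 * s3 + abs(a - 2 * s3)    # 4 * GREATEST(s1, s2, s3)
    c = b + 4 * s4 + abs(b - 4 * s4)    # 8 * GREATEST(s1, ..., s4)
    return (c + 8 * s5 + abs(c - 8 * s5)) / 16


# Exhaustive check over a small grid of integers, negatives included.
grid = list(product(range(-2, 3), repeat=5))
all_match = all(
    greatest5_expanded(*scores) == greatest_chain(*scores) == max(scores)
    for scores in grid
)
```

The intermediate values a, b, and c grow to as much as eight times the largest score before the final division, which is where the overflow risk mentioned above comes from.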

21.5 The LIST() Aggregate Function

The LIST([DISTINCT] <string expression>) function is part of Sybase’s SQL Anywhere (formerly WATCOM SQL). It is the only aggregate function to work on character strings. It takes a column of strings, removes the NULLs, and merges them into a single result string with commas between each of the original strings. The DISTINCT option removes duplicates as well as NULLs before concatenating the strings. This function is a generalized version of concatenation, just as SUM() is a generalized version of addition.

MySQL 4.1 extended this function into the GROUP_CONCAT() function, which does the same thing but adds options for ORDER BY and SEPARATOR.

This is handy when you use SQL to write SQL queries. As one simple example, you can apply it against the schema tables and obtain the names of all the columns in a table, then use that list to expand a SELECT * into the current column list.

One nonproprietary way of doing this query is with scalar subquery expressions. Assume we have these two tables:

CREATE TABLE People

(person_id INTEGER NOT NULL PRIMARY KEY,

person_name CHAR(10) NOT NULL);

INSERT INTO People

VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes

(person_id INTEGER NOT NULL,

seq_nbr INTEGER NOT NULL,

item_name CHAR(10) NOT NULL,

worn_flag CHAR(1) NOT NULL
   CONSTRAINT worn_flag_yes_no
   CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (person_id, seq_nbr));

INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'),
       (1, 2, 'Coat', 'N'),
       (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'),
       (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'),
       (4, 2, 'Socks', 'Y');

Using the LIST() function, we could get an output of the outfits of the people with the simple query:

SELECT P0.person_id, P0.person_name, LIST(item_name) AS fashion
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.person_id
   AND C0.worn_flag = 'Y'
 GROUP BY P0.person_id, P0.person_name;

Result

id name fashion

=======================

1 'John' 'Hat, Glove'

2 'Mary' 'Hat, Coat'

4 'Jane' 'Socks'
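LIST() itself is specific to SQL Anywhere, but the same grouping can be tried with SQLite's GROUP_CONCAT() through Python's sqlite3 module. This is a sketch, not the author's code: it assumes the People column is called person_name and that Clothes is keyed on person_id, and it sorts each item list in Python because SQLite does not promise an order inside the concatenated string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name CHAR(10) NOT NULL);

INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes
(person_id INTEGER NOT NULL,
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL
   CONSTRAINT worn_flag_yes_no CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (person_id, seq_nbr));

INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
""")

rows = conn.execute("""
SELECT P0.person_id, P0.person_name,
       GROUP_CONCAT(C0.item_name, ', ') AS fashion
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.person_id
   AND C0.worn_flag = 'Y'
 GROUP BY P0.person_id, P0.person_name
 ORDER BY P0.person_id;
""").fetchall()

# Split each list back into items so the comparison ignores item order.
fashion = {name: sorted(items.split(", ")) for _, name, items in rows}
```

Fred does not appear in the result because none of his rows pass the worn_flag test, which matches the three-row result shown above.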

21.5.1 The LIST() Function with a Procedure

To do this without an aggregate function, you must first know the highest sequence number, so you can create the query. In this case, the query is a simple “SELECT MAX(seq_nbr) FROM Clothes” statement, but you might have to use a COUNT(*) for other tables.

SELECT DISTINCT P0.person_id, P0.person_name,
       SUBSTRING ((SELECT CASE WHEN C1.worn_flag = 'Y'
                               THEN (', ' || item_name)
                               ELSE '' END
                     FROM Clothes AS C1
                    WHERE C1.person_id = C0.person_id
                      AND C1.seq_nbr = 1) ||
                  (SELECT CASE WHEN C2.worn_flag = 'Y'
                               THEN (', ' || item_name)
                               ELSE '' END
                     FROM Clothes AS C2
                    WHERE C2.person_id = C0.person_id
                      AND C2.seq_nbr = 2) ||
                  (SELECT CASE WHEN C3.worn_flag = 'Y'
                               THEN (', ' || item_name)
                               ELSE '' END
                     FROM Clothes AS C3
                    WHERE C3.person_id = C0.person_id
                      AND C3.seq_nbr = 3)
                  FROM 3) AS list
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.person_id;

id name list

===========================

1 John Hat, Glove

2 Mary Hat, Coat

3 Fred

4 Jane Socks

Again, the CASE expression on worn_flag can be replaced with an IS NULL to replace NULLs with an empty string. If you don’t want to see that Fred is naked (that is, has an empty string of clothing), then change the outermost WHERE clause to read:

WHERE P0.person_id = C0.person_id AND C0.worn_flag = 'Y';

Since you don’t want to see a leading comma, remember to TRIM() it off or to use the SUBSTRING() function to remove the first two characters. I opted for the SUBSTRING(), because the TRIM() function requires a scan of the string.
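The "find the highest sequence number, then create the query" step can be sketched in Python with sqlite3. Several things here are assumptions rather than the text's own code: the column names person_id and person_name, the COALESCE() wrapper that turns a missing sequence number into an empty string (a scalar subquery with no row yields NULL, and NULL would wipe out the whole concatenation), SQLite's SUBSTR() in place of SUBSTRING(... FROM 3), and correlating directly on People instead of joining to Clothes and using DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name CHAR(10) NOT NULL);
INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes
(person_id INTEGER NOT NULL,
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (person_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
       (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
       (3, 1, 'Shoes', 'N'),
       (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
""")

# Step 1: find out how many scalar subqueries the generated query needs.
(max_seq,) = conn.execute("SELECT MAX(seq_nbr) FROM Clothes").fetchone()


def item_subquery(n: int) -> str:
    """One scalar subquery per sequence number; '' when the row is absent or unworn."""
    return (
        "COALESCE((SELECT CASE WHEN C.worn_flag = 'Y' "
        "THEN ', ' || C.item_name ELSE '' END "
        "FROM Clothes AS C "
        f"WHERE C.person_id = P0.person_id AND C.seq_nbr = {n}), '')"
    )


# Step 2: glue the subqueries together and strip the leading ', '.
concatenated = " || ".join(item_subquery(n) for n in range(1, max_seq + 1))
query = (
    f"SELECT P0.person_id, P0.person_name, SUBSTR({concatenated}, 3) AS list "
    "FROM People AS P0 ORDER BY P0.person_id"
)
rows = conn.execute(query).fetchall()
```

With the sample data, the generated query contains three subqueries, and Fred comes back with an empty string, as in the result table above.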

21.5.2 The LIST() Function by Crosstabs

Carl Federl used this to get a similar result:

CREATE TABLE Crosstabs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 seq_nbr_1 INTEGER NOT NULL,
 seq_nbr_2 INTEGER NOT NULL,
 seq_nbr_3 INTEGER NOT NULL,
 seq_nbr_4 INTEGER NOT NULL,
 seq_nbr_5 INTEGER NOT NULL);

INSERT INTO Crosstabs
VALUES (1, 1, 0, 0, 0, 0),
       (2, 0, 1, 0, 0, 0),
       (3, 0, 0, 1, 0, 0),
       (4, 0, 0, 0, 1, 0),
       (5, 0, 0, 0, 0, 1);

SELECT Clothes.person_id,
       TRIM (MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_1 * 10)) || ' ' ||
             MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_2 * 10)) || ' ' ||
             MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_3 * 10)) || ' ' ||
             MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_4 * 10)) || ' ' ||
             MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_5 * 10)))
  FROM Clothes, Crosstabs
 WHERE Clothes.seq_nbr = Crosstabs.seq_nbr
   AND Clothes.worn_flag = 'Y'
 GROUP BY Clothes.person_id;

21.6 The PRD() Aggregate Function

Bob McGowan sent me a message on CompuServe asking for help with a problem. His client, a financial institution, tracks investment performance with a table something like this:

CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
 execute_date DATE NOT NULL,
 rate_of_return DECIMAL(13,7) NOT NULL);

To calculate a rate of return over a date range, you use the formula:

(1 + rate_of_return [day_1])
* (1 + rate_of_return [day_2])
* ...
* (1 + rate_of_return [day_N])

How would you construct a query that would return one row for each portfolio’s return over the date range? What Mr. McGowan really wants is an aggregate function in the SELECT clause to return a columnar product, like the SUM() returns a columnar total.

If you were a math major, you would write these functions as capital Sigma (∑) for summation and capital Pi (∏) for product. If such an aggregate function existed in SQL, the syntax for it would look something like:

PRD ([DISTINCT] <expression>)

While I am not sure that there is any use for the DISTINCT option, the new aggregate function would let us write his problem simply as:

SELECT portfolio_id, PRD(1.00 + rate_of_return)
  FROM Performance
 WHERE execute_date BETWEEN start_date AND end_date
 GROUP BY portfolio_id;

21.6.1 PRD() Function by Expressions

There is a trick to doing this, but you need a second table that looks like this and covers a period of five days:

CREATE TABLE BigPi

(execute_date DATE NOT NULL,

day_1 INTEGER NOT NULL,

day_5 INTEGER NOT NULL);

Let’s assume we wanted to look at January 6 to 10, so we need to update the execute_date column to that range, thus:

INSERT INTO BigPi

VALUES ('2006-01-06', 1, 0, 0, 0, 0),

('2006-01-07', 0, 1, 0, 0, 0),

('2006-01-08', 0, 0, 1, 0, 0),

('2006-01-09', 0, 0, 0, 1, 0),

('2006-01-10', 0, 0, 0, 0, 1);

The idea is that there is a one in the column when BigPi.execute_date is equal to the nth date in the range, and a zero otherwise. The query for this problem is:

SELECT portfolio_id,
       (SUM((1.00 + P1.rate_of_return) * M1.day_1)
        * SUM((1.00 + P1.rate_of_return) * M1.day_2)
        * SUM((1.00 + P1.rate_of_return) * M1.day_3)
        * SUM((1.00 + P1.rate_of_return) * M1.day_4)
        * SUM((1.00 + P1.rate_of_return) * M1.day_5)) AS product
  FROM Performance AS P1, BigPi AS M1
 WHERE M1.execute_date = P1.execute_date
   AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
 GROUP BY portfolio_id;

If anyone is missing a rate_of_return entry on a date in that range, his or her product will be zero. That might be fine, but if you need to get a NULL when you have missing data, then replace each SUM() expression with a CASE expression like this:

CASE WHEN SUM((1.00 + P1.rate_of_return) * M1.day_N) = 0.00
     THEN CAST (NULL AS DECIMAL(6, 4))
     ELSE SUM((1.00 + P1.rate_of_return) * M1.day_N) END

Alternately, if your SQL has the full SQL set of expressions, use this version:

NULLIF (SUM((1.00 + P1.rate_of_return) * M1.day_N), 0.00)
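To see the BigPi technique behave on complete and incomplete data, here is a sketch using Python's sqlite3 module. The portfolio names, the rates, and the day_2 through day_4 column declarations are assumptions made for the example; the NULLIF() form is used so that a missing day yields NULL instead of zero.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
 execute_date DATE NOT NULL,
 rate_of_return DECIMAL(13,7) NOT NULL);

CREATE TABLE BigPi
(execute_date DATE NOT NULL,
 day_1 INTEGER NOT NULL, day_2 INTEGER NOT NULL, day_3 INTEGER NOT NULL,
 day_4 INTEGER NOT NULL, day_5 INTEGER NOT NULL);

INSERT INTO BigPi
VALUES ('2006-01-06', 1, 0, 0, 0, 0),
       ('2006-01-07', 0, 1, 0, 0, 0),
       ('2006-01-08', 0, 0, 1, 0, 0),
       ('2006-01-09', 0, 0, 0, 1, 0),
       ('2006-01-10', 0, 0, 0, 0, 1);
""")

# Hypothetical data: 'FULL' has all five days, 'GAPPY' is missing January 8.
dates = ["2006-01-06", "2006-01-07", "2006-01-08", "2006-01-09", "2006-01-10"]
rates = [0.01, 0.02, -0.01, 0.005, 0.03]
for d, r in zip(dates, rates):
    conn.execute("INSERT INTO Performance VALUES ('FULL', ?, ?)", (d, r))
    if d != "2006-01-08":
        conn.execute("INSERT INTO Performance VALUES ('GAPPY', ?, ?)", (d, r))

# One SUM() per day column; each SUM() isolates that day's (1 + rate) factor.
terms = " * ".join(
    f"NULLIF(SUM((1.00 + P1.rate_of_return) * M1.day_{n}), 0.00)"
    for n in range(1, 6)
)
query = f"""
SELECT portfolio_id, ({terms}) AS product
  FROM Performance AS P1, BigPi AS M1
 WHERE M1.execute_date = P1.execute_date
   AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
 GROUP BY portfolio_id
 ORDER BY portfolio_id;
"""
products = dict(conn.execute(query).fetchall())
expected_full = math.prod(1.0 + r for r in rates)
```

In this run the 'FULL' portfolio gets the product of its five daily factors, while 'GAPPY' comes back as NULL (None in Python) because its day_3 sum is zero.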

21.6.2 The PRD() Aggregate Function by Logarithms

Roy Harvey, another SQL guru who answered questions on CompuServe, found a different solution, one that could only come from someone old enough to remember slide rules and multiplication by adding logs. The nice part of this solution is that you can also use the DISTINCT option in the SUM() function.

But there are a lot of warnings about this approach. Some older SQL implementations might have trouble with using an aggregate function result as a parameter. This has always been part of the standard, but some SQL products use very different mechanisms for the aggregate functions.

Another, more fundamental problem is that a log of zero or less is undefined, so your SQL might return a NULL or an error message. You will also see some SQL products that use LN() for the natural log and LOG10() for the logarithm base ten, and some SQLs that use LOG(<parameter>, <base>) for a general logarithm function. Given all those warnings, the expression for the product of a column from logarithm and exponential functions is:

SELECT ((EXP (SUM (LN (CASE WHEN nbr = 0.00
                            THEN CAST (NULL AS FLOAT)
                            ELSE ABS(nbr) END))))
        * (CASE WHEN MIN (ABS (nbr)) = 0.00
                THEN 0.00
                ELSE 1.00 END)
        * (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1
                                    THEN 1
                                    ELSE 0 END), 2) = 1
                THEN -1.00
                ELSE 1.00 END)) AS big_pi
  FROM NumberTable;

The nice part of this is that you can also use the SUM (DISTINCT <expression>) option to get the equivalent of PRD (DISTINCT <expression>).
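The three factors of the logarithm expression can be mirrored in plain Python to see how they cooperate. This is an illustrative sketch with a hypothetical function name, assuming a non-empty list of numbers; zeros are skipped in the log sum just as the NULLs are skipped by SUM().

```python
import math


def product_by_logs(numbers):
    """Product of a non-empty list, computed as magnitude * zero test * sign."""
    numbers = list(numbers)

    # Factor 1: EXP(SUM(LN(ABS(nbr)))), leaving zeros out of the sum.
    magnitude = math.exp(sum(math.log(abs(n)) for n in numbers if n != 0))

    # Factor 2: MIN(ABS(nbr)) = 0 means at least one zero, so the product is 0.
    zero_factor = 0.0 if min(abs(n) for n in numbers) == 0 else 1.0

    # Factor 3: an odd count of negative values flips the sign.
    negatives = sum(1 for n in numbers if n < 0)
    sign_factor = -1.0 if negatives % 2 == 1 else 1.0

    return magnitude * zero_factor * sign_factor


mixed = product_by_logs([2, -3, 4])
with_zero = product_by_logs([2, 0, -5])
two_negatives = product_by_logs([-1.5, -2])
```

Because the magnitude goes through EXP() and LN(), the results are floating-point approximations and should be compared with a tolerance, not with strict equality.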

You should watch the data type of the column involved and use either integer 0 and 1 or decimal 0.00 and 1.00 as is appropriate in the CASE statements. It is worth studying the three CASE expressions that make up the terms of the Prod calculation.

The first CASE expression is to ensure that all zeros and negative numbers are converted to a nonnegative or NULL for the SUM() function, just in case your SQL raises an exception.

The second CASE expression will return zero as the answer if there is a zero in the nbr column of any selected row. The MIN(ABS(nbr)) trick is handy for detecting the existence of a zero in a list of both positive and negative numbers with an aggregate function.

The third CASE expression will return −1 if there is an odd number of negative numbers in the nbr column. The innermost CASE expression uses a SIGN() function, which returns +1 for a positive number, −1 for a negative number, and 0 for a zero. The SUM() counts the −1 results,
