CHAPTER 19: PARTITIONING DATA IN QUERIES
SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
FROM LIFO AS L1
WHERE stock_date
= (SELECT MAX(stock_date)
FROM LIFO AS L2
WHERE tot_qty_on_hand >= :order_qty_on_hand);
This is straight algebra and a little logic. Find the most recent date on which we had enough (or more) quantity on hand to meet the order. If, by dumb blind luck, there is a day when the quantity on hand exactly matched the order, return the total cost as the answer. If the order was for more than we have in stock, then return nothing. If we go back to a day when we had more in stock than the order was for, then look at the unit price on that day, multiply by the overage, and subtract it.
Alternatively, you can use a derived table and a CASE expression. The CASE expression computes the cost of units that have a running total quantity less than the :order_qty_on_hand, and then it does algebra on the final block of inventory, which would put the running total over the limit. The outer query does a sum on these blocks.
SELECT SUM(R3.v) AS cost
FROM (SELECT R1.unit_price
* CASE WHEN SUM(R2.qty_on_hand) <= :order_qty_on_hand
THEN R1.qty_on_hand
ELSE :order_qty_on_hand
- (SUM(R2.qty_on_hand) - R1.qty_on_hand) END
FROM InventoryReceipts AS R1, InventoryReceipts AS R2
WHERE R1.purchase_date <= R2.purchase_date
GROUP BY R1.purchase_date, R1.qty_on_hand, R1.unit_price
HAVING (SUM(R2.qty_on_hand) - R1.qty_on_hand) <= :order_qty_on_hand)
AS R3(v);
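As a rough check of the derived-table version, the following sketch runs it through Python's sqlite3 module. The table definition and the three sample receipts are assumptions made for illustration (the text does not show them), and because SQLite does not accept the "AS R3(v)" column list, the column is aliased inside the derived table instead.

```python
import sqlite3

# Hypothetical schema and sample rows; not taken from the text.
SCHEMA = """
CREATE TABLE InventoryReceipts
(purchase_date TEXT NOT NULL PRIMARY KEY,
 qty_on_hand INTEGER NOT NULL,
 unit_price INTEGER NOT NULL);
INSERT INTO InventoryReceipts VALUES
 ('2024-01-01', 10, 2),
 ('2024-01-02', 10, 3),
 ('2024-01-03', 10, 4);
"""

# Same query as in the text, except that the column alias v is given
# inside the derived table (SQLite has no "AS R3(v)" syntax).
LIFO_COST = """
SELECT SUM(R3.v) AS cost
  FROM (SELECT R1.unit_price
               * CASE WHEN SUM(R2.qty_on_hand) <= :order_qty_on_hand
                      THEN R1.qty_on_hand
                      ELSE :order_qty_on_hand
                           - (SUM(R2.qty_on_hand) - R1.qty_on_hand)
                 END AS v
          FROM InventoryReceipts AS R1, InventoryReceipts AS R2
         WHERE R1.purchase_date <= R2.purchase_date
         GROUP BY R1.purchase_date, R1.qty_on_hand, R1.unit_price
        HAVING (SUM(R2.qty_on_hand) - R1.qty_on_hand)
               <= :order_qty_on_hand) AS R3;
"""


def lifo_cost(order_qty):
    """Return the LIFO cost of order_qty units against the sample receipts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    row = conn.execute(LIFO_COST, {"order_qty_on_hand": order_qty}).fetchone()
    conn.close()
    return row[0]
```

With these rows, an order of 15 units takes all 10 units of the newest receipt at 4 and 5 units of the next one at 3.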
FIFO can be done with a similar VIEW or derived table:
CREATE VIEW FIFO (stock_date, unit_price, tot_qty_on_hand, tot_cost)
AS
SELECT R1.purchase_date, R1.unit_price,
SUM(R2.qty_on_hand), SUM(R2.qty_on_hand * R2.unit_price)
FROM InventoryReceipts AS R1,
InventoryReceipts AS R2
WHERE R2.purchase_date <= R1.purchase_date
GROUP BY R1.purchase_date, R1.unit_price;
With the corresponding query:
SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
FROM FIFO AS F1
WHERE stock_date
= (SELECT MIN (stock_date)
FROM FIFO AS F2
WHERE tot_qty_on_hand >= :order_qty_on_hand);
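The FIFO view and its query can be tried out the same way. This is only a sketch under stated assumptions: the InventoryReceipts definition and the sample rows are hypothetical, and it relies on an SQLite build that accepts a column list in CREATE VIEW (3.9 or later).

```python
import sqlite3

# Hypothetical schema and sample rows; the view itself follows the text.
SCHEMA = """
CREATE TABLE InventoryReceipts
(purchase_date TEXT NOT NULL PRIMARY KEY,
 qty_on_hand INTEGER NOT NULL,
 unit_price INTEGER NOT NULL);
INSERT INTO InventoryReceipts VALUES
 ('2024-01-01', 10, 2),
 ('2024-01-02', 10, 3),
 ('2024-01-03', 10, 4);
CREATE VIEW FIFO (stock_date, unit_price, tot_qty_on_hand, tot_cost)
AS
SELECT R1.purchase_date, R1.unit_price,
       SUM(R2.qty_on_hand), SUM(R2.qty_on_hand * R2.unit_price)
  FROM InventoryReceipts AS R1, InventoryReceipts AS R2
 WHERE R2.purchase_date <= R1.purchase_date
 GROUP BY R1.purchase_date, R1.unit_price;
"""

FIFO_COST = """
SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
  FROM FIFO AS F1
 WHERE stock_date
       = (SELECT MIN(stock_date)
            FROM FIFO AS F2
           WHERE tot_qty_on_hand >= :order_qty_on_hand);
"""


def fifo_cost(order_qty):
    """Return the list of cost values (empty when stock cannot cover the order)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = conn.execute(FIFO_COST, {"order_qty_on_hand": order_qty}).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

An order of 15 units consumes the oldest 10 units at 2 and then 5 units at 3. An order larger than the 30 units in stock finds no qualifying date, so the result is empty.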
CHAPTER 20
Grouping Operations
I am separating the partitions and grouping operations based on the idea that a group has group properties that we are trying to find, so we get an answer back for each group. A partition is simply a way of subsetting the original table so that we get a table back as a result.
20.1 GROUP BY Clause
The GROUP BY clause is based on simple partitions. A partition of a set divides the set into subsets such that the union of the subsets returns the original set, and the intersection of the subsets is empty. Think of it as cutting up a pizza pie: each piece of pepperoni belongs to one and only one slice of pizza. When you get to the section on SQL-99 OLAP extensions, you will see “variations on a theme” in the ROLLUP and CUBE operators, but this is where it all starts.
The GROUP BY clause takes the result of the FROM and WHERE clauses, then puts the rows into groups defined as having the same values for the columns listed in the GROUP BY clause. Each group is reduced to a single row in the result table. This result table is called a grouped table, and all operations are now defined on groups rather than on rows.
By convention, the NULLs are treated as one group. The order of the grouping columns in the GROUP BY clause does not matter, but since all or some of the column names have to appear in the SELECT list, you should probably use the same order in both lists for readability. Please note the SELECT column names might be a subset of the GROUP BY clause column names, but never the other way around.
Let us construct a sample table called Villes to explain in detail how this works. The table is declared as:
CREATE TABLE Villes
(state_code CHAR(2) NOT NULL, -- usps codes
city_name CHAR(25) NOT NULL,
PRIMARY KEY (city_name, state_code));
We populate it with the names of cities that end in “-ville” in each state. The first problem is to find a count of the number of such cities by state_code. The immediate naive query might be:
SELECT state_code, city_name, COUNT(*) FROM Villes
GROUP BY state_code;
The groups for Tennessee would have the rows ('TN', 'Nashville') and ('TN', 'Knoxville'). The first position in the result is the grouping column, which has to be constant within the group. The third column in the SELECT clause is the COUNT(*) for the group, which is clearly two. The city_name column is a problem. Since the table is grouped by states, there can be at most 50 groups, one for each state_code. The COUNT(*) is clearly a single value, and it applies to the group as a whole. But what possible single value could I output for a city_name in each group? Pick a typical city_name and use it? If all the cities have the same name, use that name; otherwise, output a NULL? The worst possible choice would be to output both rows with the COUNT(*) of 2, since each row would imply that there are two cities named Nashville and two cities named Knoxville in Tennessee.
Each row represents a single group, so anything in it must be a characteristic of the group, not of a single row in the group. This is why there is a rule that the SELECT list must be made up only of grouping columns with optional aggregate function expressions.
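A minimal sketch of the query that respects this rule, run through Python's sqlite3 module with three hypothetical rows. One caveat about the engine chosen here: SQLite happens to accept the naive query and fills city_name from an arbitrary row of each group, which is exactly the kind of meaningless answer the rule is meant to prevent, so only the valid form is shown.

```python
import sqlite3


def villes_per_state():
    """Count the '-ville' cities per state using only the grouping column
    and an aggregate in the SELECT list."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Villes
    (state_code CHAR(2) NOT NULL,   -- usps codes
     city_name CHAR(25) NOT NULL,
     PRIMARY KEY (city_name, state_code));
    -- Sample rows are assumptions for illustration.
    INSERT INTO Villes VALUES
     ('TN', 'Nashville'), ('TN', 'Knoxville'), ('KY', 'Louisville');
    """)
    rows = conn.execute("""
    SELECT state_code, COUNT(*) AS ville_cnt
      FROM Villes
     GROUP BY state_code
     ORDER BY state_code;
    """).fetchall()
    conn.close()
    return rows
```

Each result row describes a whole group: one row for 'KY' with a count of 1 and one row for 'TN' with a count of 2.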
20.1.1 NULLs and Groups
SQL puts the NULLs into a single group, as if they were all equal. The other option, which was used in some of the first SQL implementations before the standard, was to put each NULL into a group by itself. That is not an unreasonable choice. But to make a meaningful choice between the two options, you would have to know the semantics of the data you are trying to model. SQL is a language based on syntax, not semantics.
For example, if a NULL is being used for a missing diagnosis in a medical record, you know that each patient will probably have a different disease when the NULLs are resolved. Putting the NULLs in one group would make sense if you wanted to consider unprocessed diagnosis reports as one group in a summary. Putting each NULL in its own group would make sense if you wanted to consider each unprocessed diagnosis report as an action item for treatment of the relevant class of diseases. Another example was a traffic ticket database that used NULL for a missing auto tag. Obviously, there is more than one car without a tag in the database.
The general scheme for getting separate groups for each NULL is straightforward:
SELECT x, ...
FROM Table1
WHERE x IS NOT NULL
GROUP BY x
UNION ALL
SELECT x, ...
FROM Table1
WHERE x IS NULL;
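To see the difference between the two treatments, here is a small sketch using Python's sqlite3 module. The one-column Table1, its four rows, and the choice of COUNT(*) as the aggregate (with a literal 1 standing in for it on the NULL side) are all assumptions made for illustration.

```python
import sqlite3


def null_grouping_demo():
    """Return (rows with NULLs as one group, rows with one row per NULL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Table1 (x TEXT);   -- hypothetical one-column table
    INSERT INTO Table1 VALUES ('a'), ('a'), (NULL), (NULL);
    """)
    # Standard behavior: all NULLs fall into a single group.
    one_group = conn.execute("""
    SELECT x, COUNT(*) FROM Table1 GROUP BY x;
    """).fetchall()
    # The UNION ALL scheme: group the known values, then append
    # each NULL row separately as its own "group" of size 1.
    separate = conn.execute("""
    SELECT x, COUNT(*) FROM Table1 WHERE x IS NOT NULL GROUP BY x
    UNION ALL
    SELECT x, 1 FROM Table1 WHERE x IS NULL;
    """).fetchall()
    conn.close()
    return one_group, separate
```

The first result has one NULL row with a count of 2; the second has two NULL rows, each with a count of 1.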
There will also be cases, such as the traffic tickets, where you can use another GROUP BY clause to form groups where the principal grouping columns are NULL. For example, the VIN (Vehicle Identification Number) is taken when the car is missing a tag, and it would provide a grouping column.
20.2 GROUP BY and HAVING
One of the biggest problems in working with the GROUP BY clause lies in not understanding how the WHERE and HAVING clauses work. Consider this query to find all departments with fewer than five programmers:
SELECT dept_nbr
FROM Personnel
WHERE job_title = 'Programmer'
GROUP BY dept_nbr
HAVING COUNT(*) < 5;
The result of this query does not have a row for any departments with no programmers. The WHERE clause is executed first, so those employees whose jobs are not equal to 'Programmer' are never passed to the GROUP BY clause. You have missed data that you might want to trap.
The next query will also pick up those departments that have no programmers, because the COUNT(DISTINCT x) function will return a zero for an empty set:
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE 5 > (SELECT COUNT(DISTINCT P2.emp_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr
AND P2.job_title = 'Programmer');
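The two queries can be compared side by side with a small sketch in Python's sqlite3 module. The Personnel columns beyond those named in the queries and the sample staff are assumptions: department 10 has two programmers, department 20 has only a clerk, and department 30 has five programmers.

```python
import sqlite3


def departments_with_few_programmers():
    """Return (result of the WHERE/HAVING query, result of the correlated query)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE Personnel
    (emp_nbr INTEGER NOT NULL PRIMARY KEY,
     dept_nbr INTEGER NOT NULL,
     job_title TEXT NOT NULL);
    """)
    # Hypothetical staff list.
    staff = [(1, 10, 'Programmer'), (2, 10, 'Programmer'), (3, 20, 'Clerk')]
    staff += [(n, 30, 'Programmer') for n in range(4, 9)]
    conn.executemany("INSERT INTO Personnel VALUES (?, ?, ?);", staff)

    # WHERE discards non-programmers before the groups are formed,
    # so department 20 never reaches the HAVING clause.
    having_version = [r[0] for r in conn.execute("""
    SELECT dept_nbr
      FROM Personnel
     WHERE job_title = 'Programmer'
     GROUP BY dept_nbr
    HAVING COUNT(*) < 5
     ORDER BY dept_nbr;
    """)]

    # The correlated count is zero for department 20, so it is kept.
    subquery_version = [r[0] for r in conn.execute("""
    SELECT DISTINCT dept_nbr
      FROM Personnel AS P1
     WHERE 5 > (SELECT COUNT(DISTINCT P2.emp_nbr)
                  FROM Personnel AS P2
                 WHERE P1.dept_nbr = P2.dept_nbr
                   AND P2.job_title = 'Programmer')
     ORDER BY dept_nbr;
    """)]
    conn.close()
    return having_version, subquery_version
```

Only the second query reports department 20, the one with no programmers at all.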
If there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. Many early implementations of SQL required that the HAVING clause belong to a GROUP BY clause, so you might see old code written under that assumption.
Since the HAVING clause applies only to the rows of a grouped table, it can reference only the grouping columns and aggregate functions that apply to the group. That is why this query would fail:
SELECT dept_nbr -- Invalid Query!
FROM Personnel
GROUP BY dept_nbr
HAVING COUNT(*) < 5
AND job_title = 'Programmer';
When the HAVING clause is executed, job_title is not in the grouped table as a column; it is a property of a row, not of a group. Likewise, this query would fail for much the same reason:
SELECT dept_nbr -- Invalid Query!
FROM Personnel
WHERE COUNT(*) < 5
AND job_title = 'Programmer'
GROUP BY dept_nbr;
The COUNT(*) does not exist until after the departmental groups are formed.
20.2.1 Group Characteristics and the HAVING Clause
You can use the aggregate functions and the HAVING clause to determine certain characteristics of the groups formed by the GROUP BY clause. For example, given a simple grouped table with three columns:
SELECT col1, col2
FROM Foobar
GROUP BY col1, col2
HAVING ...;
You can determine the following properties of the groups with these HAVING clauses:
HAVING COUNT (DISTINCT col_x) = COUNT (col_x)
col_x has all distinct values
HAVING COUNT(*) = COUNT(col_x)
there are no NULLs in the column
HAVING MIN(col_x - <const>) = -MAX(col_x - <const>)
col_x deviates above and below const by the same amount
HAVING MIN(col_x) * MAX(col_x) < 0
either MIN or MAX is negative, not both
HAVING MIN(col_x) * MAX(col_x) > 0
col_x is either all positive or all negative
HAVING MIN(SIGN(col_x)) = MAX(SIGN(col_x))
col_x is all positive, all negative or all zero
HAVING MIN(ABS(col_x)) = 0
col_x has at least one zero
HAVING MIN(ABS(col_x)) = MIN(col_x)
col_x >= 0 (although the WHERE clause can handle this, too)
HAVING MIN(col_x) = -MAX(col_x)
col_x deviates above and below zero by the same amount
HAVING MIN(col_x) * MAX(col_x) = 0
either one or both of MIN or MAX is zero
HAVING MIN(col_x) < MAX(col_x)
col_x has more than one value (may be faster than COUNT(*) > 1)
HAVING MIN(col_x) = MAX(col_x)
col_x has one value or NULLs
HAVING (MAX(seq) - MIN(seq) + 1) = COUNT(seq)
the sequential numbers in seq have no gaps
Tom Moreau contributed most of these suggestions.
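A few of these tests can be exercised against a toy table. The sketch below uses Python's sqlite3 module; the table GroupDemo, its column grp, the sample rows, and the helper groups_where are all hypothetical names introduced for illustration.

```python
import sqlite3

# Hypothetical sample data: four groups with different characteristics.
ROWS = [
    ('a', 1), ('a', 2), ('a', 3),   # distinct, no NULLs, no gaps
    ('b', 5), ('b', 5),             # duplicate value
    ('c', 7), ('c', None),          # contains a NULL
    ('d', 1), ('d', 3),             # gap between 1 and 3
]


def groups_where(having_clause):
    """Return the sorted grp values whose group satisfies having_clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GroupDemo (grp TEXT NOT NULL, col_x INTEGER);")
    conn.executemany("INSERT INTO GroupDemo VALUES (?, ?);", ROWS)
    # The clause is pasted into the statement; acceptable for a demo only.
    sql = f"SELECT grp FROM GroupDemo GROUP BY grp HAVING {having_clause};"
    result = sorted(r[0] for r in conn.execute(sql))
    conn.close()
    return result


all_distinct = groups_where("COUNT(DISTINCT col_x) = COUNT(col_x)")
no_nulls = groups_where("COUNT(*) = COUNT(col_x)")
several_values = groups_where("MIN(col_x) < MAX(col_x)")
no_gaps = groups_where("(MAX(col_x) - MIN(col_x) + 1) = COUNT(col_x)")
```

Note that group 'c' passes the distinct-values and no-gaps tests, because aggregates other than COUNT(*) ignore its NULL.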
Let me remind you again that if there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. This means that if you wish to apply one of the tests given above to the whole table, you will need to use a constant in the SELECT list.
This will be easier to see with an example. You are given a table with a column of unique sequential numbers that start at 1. When you go to insert a new row, you must use a sequence number that is not currently in the column; that is, fill the gaps. If there are no gaps, then and only then can you use the next highest integer in the sequence.
CREATE TABLE Foobar
(seq_nbr INTEGER NOT NULL PRIMARY KEY
CHECK (seq_nbr > 0),
junk CHAR(5) NOT NULL);
INSERT INTO Foobar VALUES (1, 'Tom'), (2, 'Dick'), (4, 'Harry'), (5, 'Moe');
How do I find if I have any gaps?
EXISTS (SELECT 'no gaps' -- TRUE only when the sequence has no gaps
FROM Foobar
HAVING COUNT(*) = MAX(seq_nbr))
You could not use “SELECT seq_nbr” because the column values will not be identical within the single group made from the table, so the subquery fails with a cardinality violation. Likewise, “SELECT *” fails because the asterisk is converted into a column name picked by the SQL engine.
Here is the insertion statement:
INSERT INTO Foobar (seq_nbr, junk)
VALUES (CASE WHEN EXISTS -- no gaps
(SELECT 'no gaps'
FROM Foobar
HAVING COUNT(*) = MAX(seq_nbr))
THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
ELSE (SELECT MIN(seq_nbr) -- gaps
FROM Foobar
WHERE (seq_nbr - 1)
NOT IN (SELECT seq_nbr FROM Foobar)
AND seq_nbr > 1) - 1
END,
'Celko');
The ELSE clause has to handle a special situation when 1 is in the seq_nbr column, so that it does not return an illegal zero. The only tricky part is waiting for the entire scalar subquery expression to compute before subtracting one; writing “MIN(seq_nbr -1)” or “MIN(seq_nbr) -1” in the SELECT list could disable the use of indexes in many SQL products.
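The gap-filling insertion can be tried with Python's sqlite3 module, with two deliberate deviations that should be read as assumptions of this sketch rather than as the statement in the text. First, older SQLite builds reject HAVING without GROUP BY, so the no-gaps test is swapped for an equivalent scalar subquery, (SELECT COUNT(*) = MAX(seq_nbr) FROM Foobar). Second, the ELSE branch filters on seq_nbr > 1 so that a table containing 1 never yields zero. The helper fill_gaps is a hypothetical name, and an empty table is not handled.

```python
import sqlite3

# Swapped-in form of the no-gaps test: a scalar subquery instead of
# EXISTS (... HAVING ...), so that it also runs on older SQLite builds.
NEXT_INSERT = """
INSERT INTO Foobar (seq_nbr, junk)
VALUES (CASE WHEN (SELECT COUNT(*) = MAX(seq_nbr) FROM Foobar)  -- no gaps
             THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
             ELSE (SELECT MIN(seq_nbr)                          -- gaps
                     FROM Foobar
                    WHERE (seq_nbr - 1)
                          NOT IN (SELECT seq_nbr FROM Foobar)
                      AND seq_nbr > 1) - 1
        END,
        :junk);
"""


def fill_gaps(start_numbers, inserts):
    """Load start_numbers, run the insertion `inserts` times,
    and return the resulting sorted list of seq_nbr values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE Foobar
    (seq_nbr INTEGER NOT NULL PRIMARY KEY
             CHECK (seq_nbr > 0),
     junk CHAR(5) NOT NULL);
    """)
    conn.executemany("INSERT INTO Foobar VALUES (?, 'old');",
                     [(n,) for n in start_numbers])
    for _ in range(inserts):
        conn.execute(NEXT_INSERT, {"junk": "Celko"})
    result = [r[0] for r in
              conn.execute("SELECT seq_nbr FROM Foobar ORDER BY seq_nbr;")]
    conn.close()
    return result
```

Starting from 1, 2, 4, 5, the first insertion fills the gap at 3 and the second appends 6. Starting from 2, 3, the insertion fills the missing 1.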
20.3 Multiple Aggregation Levels
The rule in SQL is that you cannot nest aggregate functions, such as
SELECT department, MIN(COUNT(items)) -- illegal syntax!
FROM Foobar
GROUP BY department;
The usual intent of this is to get multiple levels of aggregation; this example probably wanted the smallest count of items within each department. But this makes no sense, because a department (i.e., a