Joe Celko's SQL for Smarties - Advanced SQL Programming, P46


CHAPTER 19: PARTITIONING DATA IN QUERIES

19.5 FIFO and LIFO Subsets

SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
  FROM LIFO AS L1
 WHERE stock_date
       = (SELECT MAX(stock_date)
            FROM LIFO AS L2
           WHERE tot_qty_on_hand >= :order_qty_on_hand);

This is straight algebra and a little logic. Find the most recent date that we had enough (or more) quantity on hand to meet the order. If, by dumb blind luck, there is a day when the quantity on hand exactly matched the order, return the total cost as the answer. If the order was for more than we have in stock, then return nothing. If we go back to a day when we had more in stock than the order was for, then look at the unit price on that day, multiply by the overage, and subtract it.

Alternatively, you can use a derived table and a CASE expression. The CASE expression computes the cost of units that have a running total quantity less than the :order_qty_on_hand, and then it does algebra on the final block of inventory, which would put the running total over the limit. The outer query does a sum on these blocks.

SELECT SUM(R3.v) AS cost
  FROM (SELECT R1.unit_price
               * CASE WHEN SUM(R2.qty_on_hand) <= :order_qty_on_hand
                      THEN R1.qty_on_hand
                      ELSE :order_qty_on_hand
                           - (SUM(R2.qty_on_hand) - R1.qty_on_hand) END
          FROM InventoryReceipts AS R1, InventoryReceipts AS R2
         WHERE R1.purchase_date <= R2.purchase_date
         GROUP BY R1.purchase_date, R1.qty_on_hand, R1.unit_price
        HAVING (SUM(R2.qty_on_hand) - R1.qty_on_hand) <= :order_qty_on_hand)
       AS R3(v);
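As a rough check of the derived-table query above, here is a minimal sketch that runs it through Python's sqlite3 module. The sample receipts, the integer prices, and the helper names make_inventory and lifo_cost are assumptions made for illustration. The derived table's column is named with AS v inside the subquery rather than with the R3(v) form, a portability adjustment for SQLite.

```python
import sqlite3

def make_inventory():
    """Build an in-memory InventoryReceipts table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE InventoryReceipts
        (purchase_date TEXT NOT NULL PRIMARY KEY,
         qty_on_hand INTEGER NOT NULL,
         unit_price INTEGER NOT NULL)""")
    conn.executemany(
        "INSERT INTO InventoryReceipts VALUES (?, ?, ?)",
        [("2024-01-01", 15, 10),
         ("2024-01-02", 25, 12),
         ("2024-01-03", 40, 13),
         ("2024-01-04", 35, 12),
         ("2024-01-05", 45, 10)])
    return conn

# Same logic as the query in the text: the running total is taken from the
# newest receipt backward, so the newest stock is consumed first (LIFO).
LIFO_COST = """
SELECT SUM(R3.v) AS cost
  FROM (SELECT R1.unit_price
               * CASE WHEN SUM(R2.qty_on_hand) <= :order_qty_on_hand
                      THEN R1.qty_on_hand
                      ELSE :order_qty_on_hand
                           - (SUM(R2.qty_on_hand) - R1.qty_on_hand) END AS v
          FROM InventoryReceipts AS R1, InventoryReceipts AS R2
         WHERE R1.purchase_date <= R2.purchase_date
         GROUP BY R1.purchase_date, R1.qty_on_hand, R1.unit_price
        HAVING (SUM(R2.qty_on_hand) - R1.qty_on_hand) <= :order_qty_on_hand)
       AS R3
"""

def lifo_cost(conn, order_qty):
    """Return the LIFO cost of order_qty units."""
    row = conn.execute(LIFO_COST, {"order_qty_on_hand": order_qty}).fetchone()
    return row[0]
```

With these sample rows, an order of 50 units takes all 45 units from the newest receipt at 10 each and 5 units from the next one at 12 each, so lifo_cost(make_inventory(), 50) comes to 510.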

FIFO can be done with a similar VIEW or derived table:

CREATE VIEW FIFO (stock_date, unit_price, tot_qty_on_hand, tot_cost)
AS
SELECT R1.purchase_date, R1.unit_price,
       SUM(R2.qty_on_hand), SUM(R2.qty_on_hand * R2.unit_price)
  FROM InventoryReceipts AS R1,
       InventoryReceipts AS R2
 WHERE R2.purchase_date <= R1.purchase_date
 GROUP BY R1.purchase_date, R1.unit_price;

With the corresponding query:

SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
  FROM FIFO AS F1
 WHERE stock_date
       = (SELECT MIN(stock_date)
            FROM FIFO AS F2
           WHERE tot_qty_on_hand >= :order_qty_on_hand);
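The FIFO view and its query can be tried out the same way. The following is a sketch under assumptions: the sample receipts, the integer prices, and the helper names make_fifo and fifo_cost are invented for illustration, and SQLite stands in for whatever SQL product you actually use.

```python
import sqlite3

def make_fifo():
    """Create InventoryReceipts with made-up rows, plus the FIFO view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE InventoryReceipts
        (purchase_date TEXT NOT NULL PRIMARY KEY,
         qty_on_hand INTEGER NOT NULL,
         unit_price INTEGER NOT NULL)""")
    conn.executemany(
        "INSERT INTO InventoryReceipts VALUES (?, ?, ?)",
        [("2024-01-01", 15, 10),
         ("2024-01-02", 25, 12),
         ("2024-01-03", 40, 13),
         ("2024-01-04", 35, 12),
         ("2024-01-05", 45, 10)])
    # Running totals accumulate from the oldest receipt forward.
    conn.execute("""
        CREATE VIEW FIFO (stock_date, unit_price, tot_qty_on_hand, tot_cost)
        AS
        SELECT R1.purchase_date, R1.unit_price,
               SUM(R2.qty_on_hand), SUM(R2.qty_on_hand * R2.unit_price)
          FROM InventoryReceipts AS R1, InventoryReceipts AS R2
         WHERE R2.purchase_date <= R1.purchase_date
         GROUP BY R1.purchase_date, R1.unit_price""")
    return conn

FIFO_COST = """
SELECT (tot_cost - ((tot_qty_on_hand - :order_qty_on_hand) * unit_price)) AS cost
  FROM FIFO AS F1
 WHERE stock_date
       = (SELECT MIN(stock_date)
            FROM FIFO AS F2
           WHERE tot_qty_on_hand >= :order_qty_on_hand)
"""

def fifo_cost(conn, order_qty):
    """Return the FIFO cost, or None when the order exceeds total stock."""
    row = conn.execute(FIFO_COST, {"order_qty_on_hand": order_qty}).fetchone()
    return row[0] if row else None
```

For an order of 50 units, the earliest date with a running total of at least 50 is the third receipt (running quantity 80, running cost 970). The overage of 30 units at 13 each is subtracted, giving 580. An order larger than the 160 units in stock returns no row, which the helper reports as None.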


CHAPTER 20: GROUPING OPERATIONS

I am separating the partitions and grouping operations based on the idea that a group has group properties that we are trying to find, so we get an answer back for each group. A partition is simply a way of subsetting the original table so that we get a table back as a result.

20.1 GROUP BY Clause

The GROUP BY clause is based on simple partitions. A partition of a set divides the set into subsets such that the union of the subsets returns the original set, and the intersection of the subsets is empty. Think of it as cutting up a pizza pie: each piece of pepperoni belongs to one and only one slice of pizza. When you get to the section on SQL-99 OLAP extensions, you will see "variations on a theme" in the ROLLUP and CUBE operators, but this is where it all starts.

The GROUP BY clause takes the result of the FROM and WHERE clauses, then puts the rows into groups defined as having the same values for the columns listed in the GROUP BY clause. Each group is reduced to a single row in the result table. This result table is called a grouped table, and all operations are now defined on groups rather than on rows.

By convention, the NULLs are treated as one group. The order of the grouping columns in the GROUP BY clause does not matter, but since all or some of the column names have to appear in the SELECT list, you should probably use the same order in both lists for readability. Please note that the SELECT column names might be a subset of the GROUP BY clause column names, but never the other way around.

Let us construct a sample table called Villes to explain in detail how this works. The table is declared as:

CREATE TABLE Villes
(state_code CHAR(2) NOT NULL, -- USPS codes
 city_name CHAR(25) NOT NULL,
 PRIMARY KEY (city_name, state_code));

We populate it with the names of cities that end in "-ville" in each state. The first problem is to find a count of the number of such cities by state_code. The immediate naive query might be:

SELECT state_code, city_name, COUNT(*)
  FROM Villes
 GROUP BY state_code;

The groups for Tennessee would have the rows ('TN', 'Nashville') and ('TN', 'Knoxville'). The first position in the result is the grouping column, which has to be constant within the group. The third column in the SELECT clause is the COUNT(*) for the group, which is clearly two. The city_name column is a problem. Since the table is grouped by states, there can be at most 50 groups, one for each state_code. The COUNT(*) is clearly a single value, and it applies to the group as a whole. But what possible single value could I output for a city_name in each group? Pick a typical city_name and use it? If all the cities have the same name, use that name; otherwise, output a NULL? The worst possible choice would be to output both rows with the COUNT(*) of 2, since each row would imply that there are two cities named Nashville and two cities named Knoxville in Tennessee.

Each row represents a single group, so anything in it must be a characteristic of the group, not of a single row in the group. This is why there is a rule that the SELECT list must be made up only of grouping columns with optional aggregate function expressions.
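To make the rule concrete, here is a small sketch of the corrected count query, run through Python's sqlite3 module. The three sample cities and the helper names make_villes and cities_per_state are assumptions for illustration. The SELECT list holds only the grouping column and an aggregate, as the rule requires.

```python
import sqlite3

def make_villes():
    """Create the Villes table from the text with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Villes
        (state_code CHAR(2) NOT NULL,  -- USPS codes
         city_name CHAR(25) NOT NULL,
         PRIMARY KEY (city_name, state_code))""")
    conn.executemany(
        "INSERT INTO Villes (state_code, city_name) VALUES (?, ?)",
        [("TN", "Nashville"), ("TN", "Knoxville"), ("KY", "Louisville")])
    return conn

def cities_per_state(conn):
    """One row per state_code: the grouping column plus an aggregate."""
    return conn.execute("""
        SELECT state_code, COUNT(*) AS city_cnt
          FROM Villes
         GROUP BY state_code
         ORDER BY state_code""").fetchall()
```

With these rows the result is one row for 'KY' with a count of 1 and one row for 'TN' with a count of 2. As a side note, SQLite is one of the products that accepts the naive query with city_name in the SELECT list and returns an arbitrary city for each group, which is the "pick a typical city_name" option the text warns about.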


20.1.1 NULLs and Groups

SQL puts the NULLs into a single group, as if they were all equal. The other option, which was used in some of the first SQL implementations before the standard, was to put each NULL into a group by itself. That is not an unreasonable choice. But to make a meaningful choice between the two options, you would have to know the semantics of the data you are trying to model. SQL is a language based on syntax, not semantics.

For example, if a NULL is being used for a missing diagnosis in a medical record, you know that each patient will probably have a different disease when the NULLs are resolved. Putting the NULLs in one group would make sense if you wanted to consider unprocessed diagnosis reports as one group in a summary. Putting each NULL in its own group would make sense if you wanted to consider each unprocessed diagnosis report as an action item for treatment of the relevant class of diseases.

Another example was a traffic ticket database that used NULL for a missing auto tag. Obviously, there is more than one car without a tag in the database. The general scheme for getting separate groups for each NULL is straightforward:

SELECT x, ...
  FROM Table1
 WHERE x IS NOT NULL
 GROUP BY x
UNION ALL
SELECT x, ...
  FROM Table1
 WHERE x IS NULL;

There will also be cases, such as the traffic tickets, where you can use another GROUP BY clause to form groups where the principal grouping columns are NULL. For example, the VIN (Vehicle Identification Number) is taken when the car is missing a tag, and it would provide a grouping column.
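The UNION ALL scheme can be compared against a plain GROUP BY with a short sketch. The Tickets table, its columns, and the sample rows are hypothetical. The elided part of each SELECT list is filled in here with COUNT(*) for the grouped half and a constant 1 for the ungrouped half, which is only one plausible reading.

```python
import sqlite3

def make_tickets():
    """Hypothetical traffic-ticket table; tag is NULL when the plate is missing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Tickets
        (ticket_nbr INTEGER NOT NULL PRIMARY KEY,
         tag CHAR(6))""")
    conn.executemany(
        "INSERT INTO Tickets (ticket_nbr, tag) VALUES (?, ?)",
        [(1, "ABC123"), (2, "ABC123"), (3, "XYZ789"), (4, None), (5, None)])
    return conn

def one_group_for_nulls(conn):
    """Standard behavior: all NULL tags collapse into a single group."""
    return conn.execute(
        "SELECT tag, COUNT(*) FROM Tickets GROUP BY tag").fetchall()

def separate_group_per_null(conn):
    """The scheme from the text: group the known tags, then append
    every NULL row as its own one-row group."""
    return conn.execute("""
        SELECT tag, COUNT(*) AS ticket_cnt
          FROM Tickets
         WHERE tag IS NOT NULL
         GROUP BY tag
        UNION ALL
        SELECT tag, 1
          FROM Tickets
         WHERE tag IS NULL""").fetchall()
```

On this data the plain GROUP BY yields three groups, one of them (NULL, 2), while the UNION ALL version yields four rows, with (NULL, 1) appearing twice.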

20.2 GROUP BY and HAVING

One of the biggest problems in working with the GROUP BY clause lies in not understanding how the WHERE and HAVING clauses work. Consider this query to find all departments with fewer than five programmers:


SELECT dept_nbr
  FROM Personnel
 WHERE job_title = 'Programmer'
 GROUP BY dept_nbr
HAVING COUNT(*) < 5;

The result of this query does not have a row for any departments with no programmers. The WHERE clause is executed first, so those employees whose jobs are not equal to 'Programmer' are never passed to the GROUP BY clause. You have missed data that you might want to trap.

The next query will also pick up those departments that have no programmers, because the COUNT(DISTINCT x) function will return a zero for an empty set:

SELECT DISTINCT dept_nbr
  FROM Personnel AS P1
 WHERE 5 > (SELECT COUNT(DISTINCT P2.emp_nbr)
              FROM Personnel AS P2
             WHERE P1.dept_nbr = P2.dept_nbr
               AND P2.job_title = 'Programmer');
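The difference between the two queries shows up clearly on a small data set. This sketch uses Python's sqlite3 module; the Personnel rows and the helper names are made up for illustration, and an ORDER BY is added to the second query only to make the output order predictable.

```python
import sqlite3

def make_personnel():
    """Made-up staff: dept 10 has 2 programmers and a clerk,
    dept 20 has 5 programmers, dept 30 has only a clerk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Personnel
        (emp_nbr INTEGER NOT NULL PRIMARY KEY,
         dept_nbr INTEGER NOT NULL,
         job_title VARCHAR(20) NOT NULL)""")
    rows = [(1, 10, "Programmer"), (2, 10, "Programmer"), (3, 10, "Clerk"),
            (4, 20, "Programmer"), (5, 20, "Programmer"),
            (6, 20, "Programmer"), (7, 20, "Programmer"),
            (8, 20, "Programmer"), (9, 30, "Clerk")]
    conn.executemany("INSERT INTO Personnel VALUES (?, ?, ?)", rows)
    return conn

def where_then_having(conn):
    """WHERE filters rows first, so dept 30 never reaches GROUP BY."""
    rows = conn.execute("""
        SELECT dept_nbr
          FROM Personnel
         WHERE job_title = 'Programmer'
         GROUP BY dept_nbr
        HAVING COUNT(*) < 5
         ORDER BY dept_nbr""").fetchall()
    return [r[0] for r in rows]

def correlated_count(conn):
    """The correlated subquery counts zero programmers for dept 30."""
    rows = conn.execute("""
        SELECT DISTINCT dept_nbr
          FROM Personnel AS P1
         WHERE 5 > (SELECT COUNT(DISTINCT P2.emp_nbr)
                      FROM Personnel AS P2
                     WHERE P1.dept_nbr = P2.dept_nbr
                       AND P2.job_title = 'Programmer')
         ORDER BY dept_nbr""").fetchall()
    return [r[0] for r in rows]
```

On this data the first query returns only department 10, while the second returns departments 10 and 30.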

If there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. Many early implementations of SQL required that the HAVING clause belong to a GROUP BY clause, so you might see old code written under that assumption.

Since the HAVING clause applies only to the rows of a grouped table, it can reference only the grouping columns and aggregate functions that apply to the group. That is why this query would fail:

SELECT dept_nbr -- Invalid Query!
  FROM Personnel
 GROUP BY dept_nbr
HAVING COUNT(*) < 5
   AND job_title = 'Programmer';

When the HAVING clause is executed, job_title is not in the grouped table as a column; it is a property of a row, not of a group. Likewise, this query would fail for much the same reason:

SELECT dept_nbr -- Invalid Query!
  FROM Personnel
 WHERE COUNT(*) < 5
   AND job_title = 'Programmer'
 GROUP BY dept_nbr;

The COUNT(*) does not exist until after the departmental groups are formed.

20.2.1 Group Characteristics and the HAVING Clause

You can use the aggregate functions and the HAVING clause to determine certain characteristics of the groups formed by the GROUP BY clause.

For example, given a simple grouped table with three columns:

SELECT col1, col2
  FROM Foobar
 GROUP BY col1, col2
HAVING ... ;

You can determine the following properties of the groups with these HAVING clauses:

HAVING COUNT(DISTINCT col_x) = COUNT(col_x)
  col_x has all distinct values

HAVING COUNT(*) = COUNT(col_x)
  there are no NULLs in the column

HAVING MIN(col_x - <const>) = -MAX(col_x - <const>)
  col_x deviates above and below const by the same amount

HAVING MIN(col_x) * MAX(col_x) < 0
  either MIN or MAX is negative, not both

HAVING MIN(col_x) * MAX(col_x) > 0
  col_x is either all positive or all negative

HAVING MIN(SIGN(col_x)) = MAX(SIGN(col_x))
  col_x is all positive, all negative, or all zero

HAVING MIN(ABS(col_x)) = 0
  col_x has at least one zero

HAVING MIN(ABS(col_x)) = MIN(col_x)
  col_x >= 0 (although the WHERE clause can handle this, too)

HAVING MIN(col_x) = -MAX(col_x)
  col_x deviates above and below zero by the same amount

HAVING MIN(col_x) * MAX(col_x) = 0
  either one or both of MIN or MAX is zero

HAVING MIN(col_x) < MAX(col_x)
  col_x has more than one value (may be faster than COUNT(*) > 1)

HAVING MIN(col_x) = MAX(col_x)
  col_x has one value or NULLs

HAVING (MAX(seq) - MIN(seq) + 1) = COUNT(seq)
  the sequential numbers in seq have no gaps

Tom Moreau contributed most of these suggestions.
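A few of these tests can be tried against a toy grouped table. In this sketch the Samples table, its grp and col_x columns, the sample rows, and the helper groups_having are all hypothetical. The helper splices a fixed condition string into the HAVING clause, which is acceptable for a demonstration but not for untrusted input.

```python
import sqlite3

def make_samples():
    """Three made-up groups: 'a' is 1,2,3; 'b' is 1,1,NULL; 'c' is 1,2,4."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Samples (grp CHAR(1) NOT NULL, col_x INTEGER)")
    conn.executemany(
        "INSERT INTO Samples (grp, col_x) VALUES (?, ?)",
        [("a", 1), ("a", 2), ("a", 3),
         ("b", 1), ("b", 1), ("b", None),
         ("c", 1), ("c", 2), ("c", 4)])
    return conn

def groups_having(conn, condition):
    """Return the groups that satisfy a HAVING condition, in sorted order."""
    sql = ("SELECT grp FROM Samples GROUP BY grp HAVING "
           + condition + " ORDER BY grp")
    return [row[0] for row in conn.execute(sql)]

# Conditions taken from the list above, applied to col_x.
ALL_DISTINCT = "COUNT(DISTINCT col_x) = COUNT(col_x)"
NO_NULLS = "COUNT(*) = COUNT(col_x)"
NO_GAPS = "(MAX(col_x) - MIN(col_x) + 1) = COUNT(col_x)"
ONE_VALUE = "MIN(col_x) = MAX(col_x)"
```

On this data, groups 'a' and 'c' pass both the all-distinct and the no-NULLs tests, only 'a' passes the no-gaps test, and only 'b' passes the one-value test, since its single non-NULL value 1 is both the MIN and the MAX.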

Let me remind you again that if there is no GROUP BY clause, the HAVING clause will treat the entire table as a single group. This means that if you wish to apply one of the tests given above to the whole table, you will need to use a constant in the SELECT list.

This will be easier to see with an example. You are given a table with a column of unique sequential numbers that start at 1. When you go to insert a new row, you must use a sequence number that is not currently in the column; that is, fill the gaps. If there are no gaps, then and only then can you use the next highest integer in the sequence.

CREATE TABLE Foobar
(seq_nbr INTEGER NOT NULL PRIMARY KEY
         CHECK (seq_nbr > 0),
 junk CHAR(5) NOT NULL);

INSERT INTO Foobar VALUES (1, 'Tom'), (2, 'Dick'), (4, 'Harry'), (5, 'Moe');

How do I find if I have any gaps?


EXISTS (SELECT 'no gaps' -- TRUE only when the sequence has no gaps
          FROM Foobar
        HAVING COUNT(*) = MAX(seq_nbr))

You could not use "SELECT seq_nbr" because the column values will not be identical within the single group made from the table, so the subquery fails with a cardinality violation. Likewise, "SELECT *" fails because the asterisk is converted into a column name picked by the SQL engine. Here is the insertion statement:

INSERT INTO Foobar (seq_nbr, junk)
VALUES (CASE WHEN EXISTS -- no gaps
                  (SELECT 'no gaps'
                     FROM Foobar
                   HAVING COUNT(*) = MAX(seq_nbr))
             THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
             ELSE (SELECT MIN(seq_nbr) -- gaps
                     FROM Foobar
                    WHERE (seq_nbr - 1)
                          NOT IN (SELECT seq_nbr FROM Foobar)
                      AND seq_nbr > 1) - 1 -- skip 1 so we never return zero
        END,
        'Celko');

The ELSE clause has to handle a special situation when 1 is in the seq_nbr column, so that it does not return an illegal zero. The only tricky part is waiting for the entire scalar subquery expression to compute before subtracting one; writing "MIN(seq_nbr - 1)" or "MIN(seq_nbr) - 1" in the SELECT list could disable the use of indexes in many SQL products.
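The gap-filling logic can be exercised with a short sketch in Python's sqlite3 module. Two things here are assumptions rather than the text's own code. First, the no-gaps test is written as a scalar subquery, (SELECT COUNT(*) = MAX(seq_nbr) FROM Foobar), because some older SQLite builds reject a HAVING clause that has no GROUP BY; it tests the same condition. Second, the ELSE branch filters on seq_nbr > 1 so that a table containing 1 cannot yield zero. The helper names are invented.

```python
import sqlite3

def make_foobar(rows):
    """Create Foobar as declared in the text and load the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Foobar
        (seq_nbr INTEGER NOT NULL PRIMARY KEY
                 CHECK (seq_nbr > 0),
         junk CHAR(5) NOT NULL)""")
    conn.executemany("INSERT INTO Foobar (seq_nbr, junk) VALUES (?, ?)", rows)
    return conn

# No gaps: use the next highest integer. Otherwise: find the lowest
# seq_nbr whose predecessor is missing and step back by one.
FILL_GAP = """
INSERT INTO Foobar (seq_nbr, junk)
VALUES (CASE WHEN (SELECT COUNT(*) = MAX(seq_nbr) FROM Foobar)  -- no gaps
             THEN (SELECT MAX(seq_nbr) FROM Foobar) + 1
             ELSE (SELECT MIN(seq_nbr)                          -- gaps
                     FROM Foobar
                    WHERE (seq_nbr - 1) NOT IN (SELECT seq_nbr FROM Foobar)
                      AND seq_nbr > 1) - 1
        END,
        :junk)
"""

def insert_filling_gap(conn, junk):
    """Insert one row, reusing the lowest missing sequence number."""
    conn.execute(FILL_GAP, {"junk": junk})

def seq_numbers(conn):
    """Return all sequence numbers in ascending order."""
    rows = conn.execute("SELECT seq_nbr FROM Foobar ORDER BY seq_nbr")
    return [r[0] for r in rows]
```

Starting from the sample rows 1, 2, 4, 5, the first insert takes the missing 3 and the second takes 6. Starting from 2, 3, the first insert takes 1. The sketch does not handle an empty table, where MAX(seq_nbr) is NULL.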

20.3 Multiple Aggregation Levels

The rule in SQL is that you cannot nest aggregate functions, such as

SELECT department, MIN(COUNT(items)) -- illegal syntax!
  FROM Foobar
 GROUP BY department;

The usual intent of this is to get multiple levels of aggregation; this example probably wanted the smallest count of items within each department. But this makes no sense, because a department (i.e., a
