SELECT CASE MOD(COUNT(*), 2)
       WHEN 0 -- even sized table
       THEN (P1.weight
             + MIN(CASE WHEN P2.weight > P1.weight
                        THEN P2.weight
                        ELSE NULL END)) / 2.0
       ELSE P1.weight -- odd sized table
       END
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.weight
HAVING COUNT(CASE WHEN P1.weight >= P2.weight THEN 1
                  ELSE NULL END) = (COUNT(*) + 1) / 2;
This answer is due to Ken Henderson.
23.3.8 Celko’s Third Median
Another approach involves looking at a picture of a line of sorted values and seeing where the median would fall. Every value in the column weight partitions the table into three sections: values that are less than weight, values that are equal to weight, and values that are greater than weight. We can get a profile of each value with a tabular subquery expression.
Now the question is how to define a median in terms of the partitions. Clearly, the definition of a median means that if (lesser = greater), then weight is the median.

Now look at Figure 23.1 for the other situations. If there are more elements in the greater values than half the size of the table, then weight cannot be a median. Likewise, if there are more elements in the lesser values than half the size of the table, then weight cannot be a median.

If (lesser + equal) = greater, then weight is a left-hand median. Likewise, if (greater + equal) = lesser, then weight is a right-hand median. However, if weight is the median, then both lesser and greater must have tallies of no more than half the size of the table. That translates into the following SQL:
SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
        AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);
The reason for not expanding the VIEW in the FROM clause into a tabular subquery expression is that the table can be used for other partitions of the table, such as quintiles.
It is also worth noting that you can use either AVG(DISTINCT i) or AVG(i) in the SELECT clause. The AVG(DISTINCT i) will return the usual median when there are two values. This happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3), which has (2, 3) in the middle, which gives us 2.5 for the median. The AVG(i) will return the weighted median instead. The weighted median looks at the set of middle values and skews in favor of the more common of the two values. The table with (1, 2, 2, 3, 3, 3) would return (2, 2, 3, 3, 3) in the middle, which gives us 2.6 for the weighted median. The weighted median is a more accurate description of the data.
I sent this first attempt to Richard Romley, who invented the method of first working with groups when designing a query. It made it quite a bit simpler, but let me take you through the steps so you can see the reasoning.

Figure 23.1 Defining a Median.
Look at the WHERE clause. It could use some algebra, and since it deals only with aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

First, though, let’s do some algebra on the expression in the WHERE clause:
lesser <= (SELECT COUNT(*) FROM Parts)/2.0
Since we already have lesser, equal, and greater for every row in the derived table Partitions, and since the sum of lesser, equal, and greater must always be exactly equal to the total number of rows in the Parts table, we can replace the scalar subquery with this expression:
lesser <= (lesser + equal + greater)/2.0
But this is the same as:
2.0 * lesser <= lesser + equal + greater
which becomes:
2.0 * lesser - lesser <= equal + greater
which becomes:
lesser <= equal + greater
So the query becomes:
SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= equal + greater
        AND greater <= equal + lesser);
We can rewrite the WHERE clause by moving equal to one side of each inequality:
WHERE lesser = greater
OR (equal >= lesser - greater
AND equal >= greater - lesser)
But this is the same as:
WHERE lesser = greater
OR equal >= ABS(lesser - greater)
But if the first condition was true (lesser = greater), the second must necessarily also be true (i.e., equal >= 0), so the first clause is redundant and can be eliminated completely:
WHERE equal >= ABS(lesser - greater)
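As a quick check, take the multiset (1, 2, 2, 3, 3, 3) from the earlier example. For weight = 2, the profile (lesser, equal, greater) = (1, 2, 3) and 2 >= ABS(1 - 3) holds; for weight = 3, the profile is (3, 3, 0) and 3 >= ABS(3 - 0) holds; for weight = 1, the profile is (0, 1, 5) and 1 >= ABS(0 - 5) fails. The AVG(DISTINCT weight) over the surviving values (2, 3) is 2.5, as expected.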
So much for algebra. Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test as a HAVING clause on the inner query that derives Partitions? This eliminates all but one column from the derived table, so it will run much faster, and it simplifies the query to this:
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.part_nbr, P1.weight
HAVING SUM(CASE WHEN P2.weight = P1.weight
THEN 1 ELSE 0 END)
>= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
WHEN P2.weight > P1.weight THEN -1 ELSE 0 END)))
AS Partitions;
If you prefer to use functions instead of a CASE expression, then use this version of the query:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight
        HAVING SUM(1 - ABS(SIGN(P1.weight - P2.weight)))
               >= ABS(SUM(SIGN(P1.weight - P2.weight))))
       AS Partitions;
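As a footnote, products that support the ordered-set aggregate PERCENTILE_CONT from later editions of the SQL standard can compute the usual (interpolated) median directly, without the self-join; a minimal sketch:

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight)
       AS median
  FROM Parts;

This returns the usual median, not the weighted median discussed above.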
23.3.9 Ken Henderson’s Median
In many SQL products, the fastest way to find the median is to use a cursor and just go to the middle of the sorted table. Ken Henderson published a version of this solution with a cursor that can be translated into SQL/PSM. Assume that we wish to find the median of column “x” in table Foobar.
BEGIN
DECLARE cnt INTEGER;
DECLARE idx INTEGER;
DECLARE median NUMERIC(20,5);
DECLARE median2 NUMERIC(20,5);
DECLARE Median_Cursor CURSOR FOR
        SELECT x
          FROM Foobar
         ORDER BY x
           FOR READ ONLY;

SET cnt = (SELECT COUNT(*) FROM Foobar);
SET idx = CASE WHEN MOD(cnt, 2) = 0
               THEN cnt / 2        -- even: lower of the two middle rows
               ELSE (cnt / 2) + 1  -- odd: the single middle row
          END;

OPEN Median_Cursor;
FETCH ABSOLUTE idx FROM Median_Cursor INTO median;
IF MOD(cnt, 2) = 0 -- even row count: average the two middle values
THEN FETCH Median_Cursor INTO median2;
     SET median = (median + median2) / 2;
END IF;
CLOSE Median_Cursor;
END;
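As a quick check of the arithmetic: with seven rows, idx = (7 / 2) + 1 = 4 and the single FETCH ABSOLUTE lands on the middle row; with six rows, idx = 3 and the second FETCH averages rows 3 and 4.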
This cursor trick might not be the fastest approach in other products. Some of them have a median function that uses the balanced tree indexes to locate the middle of the distribution.
If the distribution is symmetrical and has only a single peak, then the mode, median, and mean are the same value. If not, then the distribution is somehow skewed. If (mode < median < mean), then the distribution is skewed to the right. If (mode > median > mean), then the distribution is skewed to the left. For example, the set (2, 2, 3, 4, 5, 8) has mode 2, median 3.5, and mean 4, so it is skewed to the right.
23.4 Variance and Standard Deviation
The standard deviation is a measure of how far away from the average the values in a normally distributed population are. It is hard to calculate in SQL, because it involves a square root, and standard SQL has only the basic four arithmetic operators.

Many vendors will allow you to use other math functions, but in all fairness, most SQL databases are in commercial applications and have little or no need for engineering or statistical calculations.
The usual trick is to load the raw data into an appropriate host language, such as FORTRAN, and do the work there. The formula for the sample standard deviation is:

SQRT(((n * SUM(x^2)) - (SUM(x) * SUM(x))) / (n * (n - 1)))

where (n) is the number of items in the sample set, and the xs are the values of the items.
The variance is defined as the standard deviation squared, so we can avoid taking a square root and keep the calculations in pure SQL The queries look like this:
CREATE TABLE Samples (x REAL NOT NULL);
INSERT INTO Samples (x)
VALUES (64.0), (48.0), (55.0), (68.0), (72.0),
(59.0), (57.0), (61.0), (63.0), (60.0),
(60.0), (43.0), (67.0), (70.0), (65.0),
(55.0), (56.0), (64.0), (61.0), (60.0);
SELECT ((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
/(COUNT(*) * (COUNT(*)-1)) AS variance
FROM Samples;
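If your SQL product does have a square root function, usually spelled SQRT(), the standard deviation itself is a minimal variation of the same query:

SELECT SQRT(((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
            / (COUNT(*) * (COUNT(*)-1))) AS std_dev
  FROM Samples;

On the sample data, this is the square root of 48.9894, roughly 6.999.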
If you want to check this on your own SQL product, the correct answer is 48.9894, or just 49, depending on how you handle rounding. If your SQL product has a standard deviation operator, use it instead.
If you have a version of SQL with an absolute value function, ABS(), you can also compute the average deviation following this pattern:
BEGIN
SELECT AVG(x) INTO :average FROM Samples;

SELECT SUM(ABS(x - :average)) / COUNT(x) AS AverDeviation
  FROM Samples;
END;
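Since SUM(ABS(x - :average)) / COUNT(x) is just an average, the two statements can be folded into one; a minimal sketch, assuming your product accepts a scalar subquery inside an aggregate argument:

SELECT AVG(ABS(x - (SELECT AVG(x) FROM Samples))) AS aver_deviation
  FROM Samples;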
This is a measure of how far data values drift away from the average, without any consideration of the direction of the drift.
23.6 Cumulative Statistics
A cumulative or running statistic looks at each data value and how it is related to the whole data set. The most common examples involve changes in an aggregate value over time or on some other well-ordered dimension. A bank balance, which changes with each deposit or withdrawal, is a running total over time. The total weight of a delivery truck as we add each package is a running total over the set of packages. But since two packages can have the same weight, we need a way to break ties; for example, use the arrival dates of the packages, and if that fails, use the alphabetical order of the last names of the shippers. In SQL, this means that we need a table with a key that we can use to order the rows.
Computer people classify reports as one-pass reports or two-pass reports, a terminology that comes from the number of times the computer used to have to read the data file to produce the desired results. These are really cumulative aggregate statistics.
Most report writers can produce a listing with totals and other aggregated descriptive statistics after each grouping (e.g., “Give me the total amount of sales broken down by salesmen within territories”). Such reports are called banded reports or control-break reports, depending on the vendor. The closest thing to such reports that the SQL language has is the GROUP BY clause used with aggregate functions.
The two-pass report involves finding out something about the group as a whole in the first pass, then using that information in the second pass to produce the results for each row in the group. The most common two-pass reports order the groups against each other (“Show me the total sales in each territory, ordered from high to low”) or show the cumulative totals or cumulative percentages within a group (“Show me what percentage each customer contributes to total sales”).
23.6.1 Running Totals
Running totals keep track of changes, which usually occur over time, but these could be changes on some other well-ordered dimension. A common example we all know is a bank account, for which we record withdrawals and deposits in a checkbook register. The running total is the balance of the account after each transaction. The query for the checkbook register is simply:
SELECT B0.transaction, B0.trans_date, SUM(B1.amount) AS balance
  FROM BankAccount AS B0, BankAccount AS B1
 WHERE B1.trans_date <= B0.trans_date
 GROUP BY B0.transaction, B0.trans_date;
You can use a scalar subquery instead:
SELECT B0.transaction, B0.trans_date,
(SELECT SUM(B1.amount)
FROM BankAccount AS B1
WHERE B1.trans_date <= B0.trans_date) AS balance
FROM BankAccount AS B0;
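Products that have added window functions can compute the same running total in a single pass over the table; a minimal sketch, using the transaction code to break ties within a day:

SELECT transaction, trans_date,
       SUM(amount) OVER (ORDER BY trans_date, transaction) AS balance
  FROM BankAccount;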
Which of these versions will work better depends on your SQL product. Notice that this query handles both deposits (positive numbers) and withdrawals (negative numbers). There is a problem with running totals when two items occur at the same time. In this example, the transaction code keeps the transactions unique, but it is possible to have a withdrawal and a deposit on the same day that will be aggregated together.

If we showed the withdrawals before the deposits on that day, the balance could fall below zero, which might trigger some actions we don’t want. The rule in banking is that deposits are credited before withdrawals on the same day, so simply extend the transaction date to show all deposits with a time before all withdrawals to fool the query. But remember that not all situations have a clearly defined policy like this.

Here is another version of the cumulative total problem that attempts to reduce the work done in the outermost query. Assume we have a table with data on the amount of sales to customers. We want to see each amount and the cumulative total, in order by the amount, at which the customer gave us more than $500.00 in business.
SELECT C1.cust_id, C1.sales_amt,
       SUM(C2.sales_amt) AS cumulative_amt
  FROM Customers AS C1
       INNER JOIN
       Customers AS C2
       ON C1.sales_amt <= C2.sales_amt
 WHERE C1.sales_amt
       >= (SELECT MAX(X.sales_amt)
             FROM (SELECT C3.sales_amt
                     FROM Customers AS C3
                          INNER JOIN
                          Customers AS C4
                          ON C3.sales_amt <= C4.sales_amt
                    GROUP BY C3.cust_id, C3.sales_amt
                   HAVING SUM(C4.sales_amt) >= 500.00) AS X(sales_amt))
 GROUP BY C1.cust_id, C1.sales_amt;
This query limits the processing that must be done in the outer query by first calculating the cutoff point for each customer. This sort of trick is best for larger tables, where the self-join is often very slow.
23.6.2 Running Differences
Another kind of statistic, related to running totals, is running differences. In this case, we have the actual amount of something at various points in time, and we want to compute the change since the last reading. Here is a quick scenario: we have a clipboard and a paper form on which we record the quantities of a chemical in a tank at different points in time from a gauge. We need to report the time, the gauge reading, and the difference between each reading and the preceding one. Here is some sample result data, showing the calculation we need:
tank  reading             quantity  difference
===============================================
'50A' '2005-02-01-07:30'       300        NULL  -- starting data
'50A' '2005-02-01-07:35'       500         200
'50A' '2005-02-01-07:45'      1200         700
'50A' '2005-02-01-07:50'       800        -400
'50A' '2005-02-01-08:00'      NULL        NULL
'50A' '2005-02-01-09:00'      1300         500
'51A' '2005-02-01-07:20'      6000        NULL  -- starting data
'51A' '2005-02-01-07:22'      8000        2000
'51A' '2005-02-01-09:30'      NULL        NULL
'51A' '2005-02-01-00:45'      5000       -3000
'51A' '2005-02-01-01:00'      2500       -2500
The NULL values mean that we missed taking a reading. The trick is a correlated subquery expression that computes the difference between the quantity in the current row and the quantity in the row with the largest known time value that is less than the time in the current row, on the same date and on the same tank.
SELECT tank, time,
       (quantity
        - (SELECT quantity
             FROM Deliveries AS D1
            WHERE D1.tank = D0.tank -- same tank
              AND D1.time
                  = (SELECT MAX(D2.time) -- most recent delivery
                       FROM Deliveries AS D2
                      WHERE D2.tank = D0.tank -- same tank
                        AND D2.time < D0.time))) AS difference
  FROM Deliveries AS D0;
This is a modification of the running-totals query, but it is more elaborate, since it cannot use the sum of the prior history.
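Products with window functions can fetch the prior reading with LAG() instead of the correlated subqueries; a minimal sketch:

SELECT tank, time,
       quantity - LAG(quantity)
                  OVER (PARTITION BY tank ORDER BY time) AS difference
  FROM Deliveries;

Like the subquery version, plain LAG() hands back the immediately preceding row even when its quantity is NULL; some products offer an IGNORE NULLS option on LAG() to skip over the missed readings.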
23.6.3 Cumulative Percentages
Cumulative percentages are a bit more complex than running totals or differences. They show what percentage of the whole set of data values the current subset of data values is. Again, this is easier to show with an example than to say in words. You are given a table of the sales made by your sales force, which looks like this: