SELECT CASE MOD(COUNT(*), 2)
       WHEN 0 -- even sized table
       THEN (P1.weight
             + MIN(CASE WHEN P2.weight > P1.weight
                        THEN P2.weight
                        ELSE NULL END)) / 2.0
       ELSE P1.weight -- odd sized table
       END
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.weight
HAVING COUNT(CASE WHEN P1.weight >= P2.weight THEN 1
                  ELSE NULL END) = (COUNT(*) + 1) / 2;
This answer is due to Ken Henderson.
23.3.8 Celko’s Third Median
Another approach involves looking at a picture of a line of sorted values and seeing where the median would fall. Every value in the column weight partitions the table into three sections: values that are less than weight, values that are equal to weight, and values that are greater than weight. We can get a profile of each value with a tabular subquery expression.
Now the question is how to define a median in terms of the partitions. Clearly, the definition of a median means that if (lesser = greater), then weight is the median.

Now look at Figure 23.1 for the other situations. If there are more elements in the greater values than half the size of the table, then weight cannot be a median. Likewise, if there are more elements in the lesser values than half the size of the table, then weight cannot be a median.

If (lesser + equal) = greater, then weight is a left-hand median. Likewise, if (greater + equal) = lesser, then weight is a right-hand median. However, if weight is the median, then both lesser and greater must have tallies of no more than half the size of the table. That translates into the following SQL:
SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
        AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);
The reason for not expanding the VIEW in the FROM clause into a tabular subquery expression is that the table can be used for other partitions of the table, such as quintiles.
It is also worth noting that you can use either AVG(DISTINCT i) or AVG(i) in the SELECT clause. The AVG(DISTINCT i) will return the usual median when there are two values. This happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3), which has (2, 3) in the middle, which gives us 2.5 for the median. The AVG(i) will return the weighted median instead. The weighted median looks at the set of middle values and skews in favor of the more common of the two values. The table with (1, 2, 2, 3, 3, 3) would return (2, 2, 3, 3, 3) in the middle, which gives us 2.6 for the weighted median. The weighted median is a more accurate description of the data.
I sent this first attempt to Richard Romley, who invented the method of first working with groups when designing a query. It made it quite a bit simpler, but let me take you through the steps so you can see the reasoning.

Figure 23.1 Defining a Median.
Look at the WHERE clause. It could use some algebra, and since it deals only with aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

First, though, let’s do some algebra on the expression in the WHERE clause:
lesser <= (SELECT COUNT(*) FROM Parts)/2.0
Since we already have lesser, equal, and greater for every row in the derived table Partitions, and since the sum of lesser, equal, and greater must always be exactly equal to the total number of rows in the Parts table, we can replace the scalar subquery with this expression:
lesser <= (lesser + equal + greater)/2.0
But this is the same as:
2.0 * lesser <= lesser + equal + greater
which becomes:
2.0 * lesser - lesser <= equal + greater
which becomes:
lesser <= equal + greater
So the query becomes:
SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.part_nbr, P1.weight,
               SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
               SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight)
       AS Partitions (part_nbr, weight, lesser, equal, greater)
 WHERE lesser = greater
    OR (lesser <= equal + greater
        AND greater <= equal + lesser);
We can rewrite the WHERE clause by moving equal to one side of each inequality:
WHERE lesser = greater
OR (equal >= lesser - greater
AND equal >= greater - lesser)
But this is the same as:
WHERE lesser = greater
OR equal >= ABS(lesser - greater)
But if the first condition was true (lesser = greater), the second must necessarily also be true (i.e., equal >= 0), so the first clause is redundant and can be eliminated completely:
WHERE equal >= ABS(lesser - greater)
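As a quick check, take the multiset (1, 2, 2, 3, 3, 3) from the earlier example. For weight = 2, the profile (lesser, equal, greater) = (1, 2, 3) and 2 >= ABS(1 - 3) holds; for weight = 3, the profile is (3, 3, 0) and 3 >= ABS(3 - 0) holds; for weight = 1, the profile is (0, 1, 5) and 1 >= ABS(0 - 5) fails. The AVG(DISTINCT weight) over the surviving values (2, 3) is 2.5, as expected.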
So much for algebra. Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test as a HAVING clause on the inner query that derives Partitions? This eliminates all but one column from the derived table, so it will run much faster, and it simplifies the query to this:
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.part_nbr, P1.weight
HAVING SUM(CASE WHEN P2.weight = P1.weight
THEN 1 ELSE 0 END)
>= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
WHEN P2.weight > P1.weight THEN -1 ELSE 0 END)))
AS Partitions;
If you prefer to use functions instead of a CASE expression, then use this version of the query:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.part_nbr, P1.weight
        HAVING SUM(1 - ABS(SIGN(P1.weight - P2.weight)))
               >= ABS(SUM(SIGN(P1.weight - P2.weight))))
       AS Partitions;
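As a footnote, products that support the ordered-set aggregate PERCENTILE_CONT from later editions of the SQL standard can compute the usual (interpolated) median directly, without the self-join; a minimal sketch:

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight)
       AS median
  FROM Parts;

This returns the usual median, not the weighted median discussed above.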
23.3.9 Ken Henderson’s Median
In many SQL products, the fastest way to find the median is to use a cursor and just go to the middle of the sorted table. Ken Henderson published a version of this solution with a cursor that can be translated into SQL/PSM. Assume that we wish to find the median of column “x” in table Foobar.
BEGIN
DECLARE cnt INTEGER;
DECLARE idx INTEGER;
DECLARE median NUMERIC(20,5);
DECLARE median2 NUMERIC(20,5);
DECLARE Median_Cursor CURSOR FOR
        SELECT x
          FROM Foobar
         ORDER BY x
           FOR READ ONLY;

SET cnt = (SELECT COUNT(*) FROM Foobar);
SET idx = CASE WHEN MOD(cnt, 2) = 0
               THEN cnt / 2        -- even: lower of the two middle rows
               ELSE (cnt / 2) + 1  -- odd: the single middle row
          END;

OPEN Median_Cursor;
FETCH ABSOLUTE idx FROM Median_Cursor INTO median;
IF MOD(cnt, 2) = 0 -- even row count: average the two middle values
THEN FETCH Median_Cursor INTO median2;
     SET median = (median + median2) / 2;
END IF;
CLOSE Median_Cursor;
END;
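As a quick check of the arithmetic: with seven rows, idx = (7 / 2) + 1 = 4 and the single FETCH ABSOLUTE lands on the middle row; with six rows, idx = 3 and the second FETCH averages rows 3 and 4.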
This cursor trick might not be the fastest approach in other products. Some of them have a median function that uses the balanced tree indexes to locate the middle of the distribution.
If the distribution is symmetrical and has only a single peak, then the mode, median, and mean are the same value. If not, then the distribution is somehow skewed. If (mode < median < mean), then the distribution is skewed to the right. If (mode > median > mean), then the distribution is skewed to the left. For example, the set (2, 2, 3, 4, 5, 8) has mode 2, median 3.5, and mean 4, so it is skewed to the right.
23.4 Variance and Standard Deviation
The standard deviation is a measure of how far away from the average the values in a normally distributed population are. It is hard to calculate in SQL, because it involves a square root, and standard SQL has only the basic four arithmetic operators.

Many vendors will allow you to use other math functions, but in all fairness, most SQL databases are in commercial applications and have little or no need for engineering or statistical calculations.
The usual trick is to load the raw data into an appropriate host language, such as FORTRAN, and do the work there. The formula for the sample standard deviation is:

SQRT(((n * SUM(x^2)) - (SUM(x) * SUM(x))) / (n * (n - 1)))

where (n) is the number of items in the sample set, and the xs are the values of the items.
The variance is defined as the standard deviation squared, so we can avoid taking a square root and keep the calculations in pure SQL The queries look like this:
CREATE TABLE Samples (x REAL NOT NULL);
INSERT INTO Samples (x)
VALUES (64.0), (48.0), (55.0), (68.0), (72.0),
(59.0), (57.0), (61.0), (63.0), (60.0),
(60.0), (43.0), (67.0), (70.0), (65.0),
(55.0), (56.0), (64.0), (61.0), (60.0);
SELECT ((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
/(COUNT(*) * (COUNT(*)-1)) AS variance
FROM Samples;
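If your SQL product does have a square root function, usually spelled SQRT(), the standard deviation itself is a minimal variation of the same query:

SELECT SQRT(((COUNT(*) * SUM(x*x)) - (SUM(x) * SUM(x)))
            / (COUNT(*) * (COUNT(*)-1))) AS std_dev
  FROM Samples;

On the sample data, this is the square root of 48.9894, roughly 6.999.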
If you want to check this on your own SQL product, the correct answer is 48.9894, or just 49, depending on how you handle rounding. If your SQL product has a standard deviation operator, use it instead.
If you have a version of SQL with an absolute value function, ABS(), you can also compute the average deviation following this pattern:
BEGIN
SELECT AVG(x) INTO :average FROM Samples;

SELECT SUM(ABS(x - :average)) / COUNT(x) AS AverDeviation
  FROM Samples;
END;
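Since SUM(ABS(x - :average)) / COUNT(x) is just an average, the two statements can be folded into one; a minimal sketch, assuming your product accepts a scalar subquery inside an aggregate argument:

SELECT AVG(ABS(x - (SELECT AVG(x) FROM Samples))) AS aver_deviation
  FROM Samples;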
This is a measure of how far data values drift away from the average, without any consideration of the direction of the drift.
23.6 Cumulative Statistics
A cumulative or running statistic looks at each data value and how it is related to the whole data set. The most common examples involve changes in an aggregate value over time or on some other well-ordered dimension. A bank balance, which changes with each deposit or withdrawal, is a running total over time. The total weight of a delivery truck as we add each package is a running total over the set of packages. But since two packages can have the same weight, we need a way to break ties; for example, use the arrival dates of the packages, and if that fails, use the alphabetical order of the last names of the shippers. In SQL, this means that we need a table with a key that we can use to order the rows.
Computer people classify reports as one-pass reports or two-pass reports, a terminology that comes from the number of times the computer used to have to read the data file to produce the desired results. These are really cumulative aggregate statistics.
Most report writers can produce a listing with totals and other aggregated descriptive statistics after each grouping (e.g., “Give me the total amount of sales broken down by salesmen within territories”). Such reports are called banded reports or control-break reports, depending on the vendor. The closest thing to such reports that the SQL language has is the GROUP BY clause used with aggregate functions.
The two-pass report involves finding out something about the group as a whole in the first pass, then using that information in the second pass to produce the results for each row in the group. The most common two-pass reports order the groups against each other (“Show me the total sales in each territory, ordered from high to low”) or show the cumulative totals or cumulative percentages within a group (“Show me what percentage each customer contributes to total sales”).
23.6.1 Running Totals
Running totals keep track of changes, which usually occur over time, but these could be changes on some other well-ordered dimension. A common example we all know is a bank account, for which we record withdrawals and deposits in a checkbook register. The running total is the balance of the account after each transaction. The query for the checkbook register is simply:
SELECT B0.transaction, B0.trans_date, SUM(B1.amount) AS balance
  FROM BankAccount AS B0, BankAccount AS B1
 WHERE B1.trans_date <= B0.trans_date
 GROUP BY B0.transaction, B0.trans_date;
You can use a scalar subquery instead:
SELECT B0.transaction, B0.trans_date,
(SELECT SUM(B1.amount)
FROM BankAccount AS B1
WHERE B1.trans_date <= B0.trans_date) AS balance
FROM BankAccount AS B0;
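Products that have added window functions can compute the same running total in a single pass over the table; a minimal sketch, using the transaction code to break ties within a day:

SELECT transaction, trans_date,
       SUM(amount) OVER (ORDER BY trans_date, transaction) AS balance
  FROM BankAccount;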
Which of these versions will work better depends on your SQL product. Notice that this query handles both deposits (positive numbers) and withdrawals (negative numbers). There is a problem with running totals when two items occur at the same time. In this example, the transaction code keeps the transactions unique, but it is possible to have a withdrawal and a deposit on the same day that will be aggregated together.

If we showed the withdrawals before the deposits on that day, the balance could fall below zero, which might trigger some actions we don’t want. The rule in banking is that deposits are credited before withdrawals on the same day, so simply extend the transaction date to show all deposits with a time before all withdrawals to fool the query. But remember that not all situations have a clearly defined policy like this.

Here is another version of the cumulative total problem that attempts to reduce the work done in the outermost query. Assume we have a table with data on the amount of sales to customers. We want to see each amount and the cumulative total, in order by the amount, at which the customer gave us more than $500.00 in business.
SELECT C1.cust_id, C1.sales_amt,
       SUM(C2.sales_amt) AS cumulative_amt
  FROM Customers AS C1
       INNER JOIN
       Customers AS C2
       ON C1.sales_amt <= C2.sales_amt
 WHERE C1.sales_amt
       >= (SELECT MAX(X.sales_amt)
             FROM (SELECT C3.sales_amt
                     FROM Customers AS C3
                          INNER JOIN
                          Customers AS C4
                          ON C3.sales_amt <= C4.sales_amt
                    GROUP BY C3.cust_id, C3.sales_amt
                   HAVING SUM(C4.sales_amt) >= 500.00) AS X(sales_amt))
 GROUP BY C1.cust_id, C1.sales_amt;
This query limits the processing that must be done in the outer query by first calculating the cutoff point for each customer. This sort of trick is best for larger tables, where the self-join is often very slow.
23.6.2 Running Differences
Another kind of statistic, related to running totals, is running differences. In this case, we have the actual amount of something at various points in time, and we want to compute the change since the last reading. Here is a quick scenario: we have a clipboard and a paper form on which we record the quantities of a chemical in a tank at different points in time from a gauge. We need to report the time, the gauge reading, and the difference between each reading and the preceding one. Here is some sample result data, showing the calculation we need:
tank  reading             quantity  difference
===============================================
'50A' '2005-02-01-07:30'       300        NULL  -- starting data
'50A' '2005-02-01-07:35'       500         200
'50A' '2005-02-01-07:45'      1200         700
'50A' '2005-02-01-07:50'       800        -400
'50A' '2005-02-01-08:00'      NULL        NULL
'50A' '2005-02-01-09:00'      1300         500
'51A' '2005-02-01-07:20'      6000        NULL  -- starting data
'51A' '2005-02-01-07:22'      8000        2000
'51A' '2005-02-01-09:30'      NULL        NULL
'51A' '2005-02-01-00:45'      5000       -3000
'51A' '2005-02-01-01:00'      2500       -2500
The NULL values mean that we missed taking a reading. The trick is a correlated subquery expression that computes the difference between the quantity in the current row and the quantity in the row with the largest known time value that is less than the time in the current row, on the same date and on the same tank.
SELECT tank, time,
       (quantity
        - (SELECT quantity
             FROM Deliveries AS D1
            WHERE D1.tank = D0.tank -- same tank
              AND D1.time
                  = (SELECT MAX(D2.time) -- most recent delivery
                       FROM Deliveries AS D2
                      WHERE D2.tank = D0.tank -- same tank
                        AND D2.time < D0.time))) AS difference
  FROM Deliveries AS D0;
This is a modification of the running-totals query, but it is more elaborate, since it cannot use the sum of the prior history.
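Products with window functions can fetch the prior reading with LAG() instead of the correlated subqueries; a minimal sketch:

SELECT tank, time,
       quantity - LAG(quantity)
                  OVER (PARTITION BY tank ORDER BY time) AS difference
  FROM Deliveries;

Like the subquery version, plain LAG() hands back the immediately preceding row even when its quantity is NULL; some products offer an IGNORE NULLS option on LAG() to skip over the missed readings.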
23.6.3 Cumulative Percentages
Cumulative percentages are a bit more complex than running totals or differences. They show what percentage of the whole set of data values the current subset of data values is. Again, this is easier to show with an example than to say in words. You are given a table of the sales made by your sales force, which looks like this: