Joe Celko's SQL for Smarties: Advanced SQL Programming



CHAPTER 23: STATISTICS IN SQL

This would return the result set (‘red’, ‘green’) for the example table, and would not change to (‘green’) until the ratio of ‘red’ to ‘green’ tipped by two percentage points.

Likewise, you can use a derived table to get the mode:

WITH P1 (salary, occurs)
AS (SELECT salary, COUNT(*)
      FROM Payroll
     GROUP BY salary)
SELECT salary
  FROM P1
 WHERE P1.occurs = (SELECT MAX(occurs) FROM P1);

This is probably the best approach, since the WITH clause will materialize the P1 table and can locate the MAX() while doing so.

One problem is that SQL likes to maintain the data types, so if x is an INTEGER, you may get an integer result. You can avoid this by writing AVG(1.0 * x), AVG(CAST(x AS FLOAT)), or AVG(CAST(x AS DECIMAL(p,s))) to be safe. This is implementation-defined, so check your product first.
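
A minimal sketch of the problem, assuming a hypothetical table Readings with an INTEGER column x holding the values (1, 2): the first expression may come back as 1 rather than 1.5 on some products, while the others force a decimal or approximate result.

-- AVG() on an INTEGER column may keep the integer type (implementation-defined),
-- so 1.5 can be truncated to 1; the other expressions force a non-integer type.
SELECT AVG(x)                         AS maybe_truncated,
       AVG(1.0 * x)                   AS forced_decimal,
       AVG(CAST(x AS FLOAT))          AS forced_float,
       AVG(CAST(x AS DECIMAL(10, 2))) AS forced_cast
  FROM Readings;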

Newbies tend to forget that the built-in aggregate functions drop the rows with NULLs before doing the computations. This means that (SUM(x) / COUNT(*)) is not the same as AVG(x). Consider SUM(x * 1.0) / COUNT(*) versus AVG(COALESCE(x * 1.0, 0.0)) as versions of the mean that handle NULLs differently.
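
As a worked sketch, assume a hypothetical table Samples with a nullable column x holding (10, 20, 30, NULL); the three expressions below return 20, 15, and 15 respectively.

-- AVG(x) drops the NULL row from both the sum and the count: 60 / 3 = 20.
-- SUM(x) over COUNT(*) drops it from the sum but not from the divisor: 60 / 4 = 15.
-- COALESCE() treats the NULL as zero before averaging: (10 + 20 + 30 + 0) / 4 = 15.
SELECT AVG(x)                      AS avg_ignoring_nulls,
       SUM(x * 1.0) / COUNT(*)     AS sum_over_all_rows,
       AVG(COALESCE(x * 1.0, 0.0)) AS avg_nulls_as_zero
  FROM Samples;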

Sample and population means are slightly different. A sample needs to use frequencies to adjust the estimate of the mean. The formula SUM(x * 1.0 * abs_perc/100) AS mean_p needs the VIEW we had at the start of this section. The name “mean_p” is to remind us that it is a population mean and not the simple AVG() of the sample data in the table.
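
That VIEW is not reproduced on this page. As a hedged sketch only, assume it exposes each distinct value x together with abs_perc, the percentage of the whole population carrying that value; the names Survey and RawData below are placeholders, not the book's.

-- Placeholder for the survey VIEW referenced above: one row per distinct value,
-- with abs_perc = percentage of the population having that value (sums to 100).
CREATE VIEW Survey (x, abs_perc)
AS SELECT x, (COUNT(*) * 100.0) / (SELECT COUNT(*) FROM RawData)
     FROM RawData
    GROUP BY x;

SELECT SUM(x * 1.0 * abs_perc / 100) AS mean_p
  FROM Survey;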

23.3 The Median

The median is defined as the value for which there are just as many cases with a value below it as above it. If such a value exists in the data set, this value is called the statistical median by some authors. If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size, such that all values in one half are lower than any value in the other half. The median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial median by some authors. The financial median is the most common term used for this median, so we will stick to it. Let us use Date's famous Parts table, from several of his textbooks (Date 1983, 1995a), which has a column for weight in it, like this:

Parts
part_nbr part_name part_color weight city_name
=================================================
'p1'     'Nut'     'Red'        12   'London'
'p2'     'Bolt'    'Green'      17   'Paris'
'p3'     'Cam'     'Blue'       12   'Paris'
'p4'     'Screw'   'Red'        14   'London'
'p5'     'Cam'     'Blue'       12   'Paris'
'p6'     'Cog'     'Red'        19   'London'

First, sort the table by weights and find the three rows in the lower half of the table. The greatest value in the lower half is 12; the smallest value in the upper half is 14; their average, and therefore the median, is 13. If the table had an odd number of rows, we would have looked at only one row after the sorting.

The median is a better measure of central tendency than the average, but it is also harder to calculate without sorting. This is a disadvantage of SQL, compared with procedural languages, and it might be the reason that the median is not a common vendor extension in SQL implementations. The variance and standard deviation are quite common, probably because they require no sorting and are therefore much easier to calculate; however, they are less useful to commercial users.

23.3.1 Date’s First Median

Date proposed two different solutions for the median (Date 1992a; Celko and Date 1993). His first solution was based on the fact that if you duplicate every row in a table, the median will stay the same. The duplication will guarantee that you always work with a table that has an even number of rows. The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions. Here is a corrected version of his first solution:


CREATE VIEW Temp1
AS SELECT weight FROM Parts
   UNION ALL
   SELECT weight FROM Parts;

CREATE VIEW Temp2
AS SELECT weight
     FROM Temp1
    WHERE (SELECT COUNT(*) FROM Parts)
          <= (SELECT COUNT(*)
                FROM Temp1 AS T1
               WHERE T1.weight >= Temp1.weight)
      AND (SELECT COUNT(*) FROM Parts)
          <= (SELECT COUNT(*)
                FROM Temp1 AS T2
               WHERE T2.weight <= Temp1.weight);

SELECT AVG(DISTINCT weight) AS median FROM Temp2;

This involves the construction of a doubled table of values, which can be expensive in terms of both time and storage space. The use of AVG(DISTINCT x) is important, because leaving it out would return the simple average instead of the median. Consider the set of weights (12, 17, 17, 14, 12, 19). The doubled table, Temp1, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19). But because of the duplicated values, Temp2 becomes (14, 14, 17, 17, 17, 17), not just (14, 17). The simple average is (96 / 6.0) = 16; it should be (31 / 2.0) = 15.5 instead.

23.3.2 Celko’s First Median

A slight modification of Date's solution will avoid the use of a doubled table, but it depends on a CEILING() function:

SELECT MIN(weight) -- smallest value in upper half
  FROM Parts
 WHERE weight IN (SELECT P1.weight
                    FROM Parts AS P1, Parts AS P2
                   WHERE P2.weight >= P1.weight
                   GROUP BY P1.weight
                  HAVING COUNT(*)
                         <= (SELECT CEILING(COUNT(*) / 2.0)
                               FROM Parts))
UNION
SELECT MAX(weight) -- largest value in lower half
  FROM Parts
 WHERE weight IN (SELECT P1.weight
                    FROM Parts AS P1, Parts AS P2
                   WHERE P2.weight <= P1.weight
                   GROUP BY P1.weight
                  HAVING COUNT(*)
                         <= (SELECT CEILING(COUNT(*) / 2.0)
                               FROM Parts));

Alternately, using the same idea and a CASE expression:

SELECT AVG(DISTINCT CAST(weight AS FLOAT)) AS median
  FROM (SELECT MAX(weight)
          FROM Parts AS B1
         WHERE (SELECT COUNT(*) + 1
                  FROM Parts
                 WHERE weight < B1.weight)
               <= (SELECT CEILING(COUNT(*) / 2.0)
                     FROM Parts)
        UNION ALL
        SELECT MAX(weight)
          FROM Parts AS B
         WHERE (SELECT COUNT(*) + 1
                  FROM Parts
                 WHERE weight < B.weight)
               <= CASE (SELECT MOD(COUNT(*), 2)
                          FROM Parts)
                  WHEN 0
                  THEN (SELECT CEILING(COUNT(*) / 2.0) + 1
                          FROM Parts)
                  ELSE (SELECT CEILING(COUNT(*) / 2.0)
                          FROM Parts)
                  END) AS Medians (weight);

Older versions of SQL allow a HAVING clause only with a GROUP BY; this may not work with your SQL. The CEILING() function is included to be sure that if there is an odd number of rows in Parts, the two halves will overlap on that value. Again, truncation and rounding in division are implementation-defined, so you will need to experiment with your product.

23.3.3 Date’s Second Median

Date’s second solution (Date 1995b) was based on Celko’s median, folded into one query:

SELECT AVG(DISTINCT Parts.weight) AS median
  FROM Parts
 WHERE Parts.weight IN
       (SELECT MIN(weight)
          FROM Parts
         WHERE Parts.weight IN
               (SELECT P2.weight
                  FROM Parts AS P1, Parts AS P2
                 WHERE P2.weight <= P1.weight
                 GROUP BY P2.weight
                HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0)
                                      FROM Parts))
        UNION
        SELECT MAX(weight)
          FROM Parts
         WHERE Parts.weight IN
               (SELECT P2.weight
                  FROM Parts AS P1, Parts AS P2
                 WHERE P2.weight >= P1.weight
                 GROUP BY P2.weight
                HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0)
                                      FROM Parts)));

Date mentions that this solution will return a NULL for an empty table and that it assumes there are no NULLs in the column. If there are NULLs, the WHERE clauses should be modified to remove them.
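
One hedged way to make that modification (a sketch, not Date's own code; the view name NonNullParts is invented here) is to screen out the NULLs once in a view and run the median query against it:

-- Strip NULL weights before any of the median logic sees them.
CREATE VIEW NonNullParts (part_nbr, weight)
AS SELECT part_nbr, weight
     FROM Parts
    WHERE weight IS NOT NULL;

-- Then substitute NonNullParts for Parts throughout the query above.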

23.3.4 Murchison’s Median

Rory Murchison of the Aetna Institute has a solution that modifies Date's first method by concatenating the key to each value to make sure that every value is seen as a unique entity. Selecting the middle values is then a special case of finding the nth item in the table.

SELECT AVG(weight)
  FROM Parts AS P1
 WHERE EXISTS
       (SELECT COUNT(*)
          FROM Parts AS P2
         WHERE CAST(P2.weight AS CHAR(5)) || P2.part_nbr
               >= CAST(P1.weight AS CHAR(5)) || P1.part_nbr
        HAVING COUNT(*) = (SELECT FLOOR((COUNT(*) + 1) / 2.0)
                             FROM Parts)
            OR COUNT(*) = (SELECT CEILING((COUNT(*) + 1) / 2.0)
                             FROM Parts));

This method depends on being able to have a HAVING clause without a GROUP BY, which is part of the ANSI standard but often missed by new programmers.

Another handy trick, if you don't have FLOOR() and CEILING() functions, is to use (COUNT(*) + 1) / 2.0 and COUNT(*) / 2.0 + 1 to handle the odd-and-even-elements problem. Just to work it out, consider the case where the COUNT(*) returns 8 for an answer: (8 + 1) / 2.0 = (9 / 2.0) = 4.5 and (8 / 2.0) + 1 = 4 + 1 = 5. The 4.5 will round to 4 in DB2 and other SQL implementations. The case where the COUNT(*) returns 9 would work like this: (9 + 1) / 2.0 = (10 / 2.0) = 5 and (9 / 2.0) + 1 = 4.5 + 1 = 5.5, which will likewise round to 5 in DB2.

23.3.5 Celko’s Second Median

This is another method for finding the median that uses a working table with the values, as well as a tally of their occurrences from the original table. This working table should be quite a bit smaller than the original table, and it should be very fast to construct if there is an index on the target column. The Parts table will serve as an example, thus:

-- construct Working table of occurrences by weight
CREATE TABLE Working
(weight REAL NOT NULL,
 occurs INTEGER NOT NULL);

INSERT INTO Working (weight, occurs)
SELECT weight, COUNT(*)
  FROM Parts
 GROUP BY weight;

Now that we have this table, we want to use it to construct a summary table that has the number of occurrences of each weight and the total number of data elements before and after we add them to the working table.

-- construct table of cumulative tallies
CREATE TABLE Summary
(weight REAL NOT NULL,
 occurs INTEGER NOT NULL,      -- number of occurrences
 pre_tally INTEGER NOT NULL,   -- cumulative tally before
 post_tally INTEGER NOT NULL); -- cumulative tally after

INSERT INTO Summary
SELECT S2.weight, S2.occurs,
       SUM(S1.occurs) - S2.occurs,
       SUM(S1.occurs)
  FROM Working AS S1, Working AS S2
 WHERE S1.weight <= S2.weight
 GROUP BY S2.weight, S2.occurs;

Let (n / 2.0) be the middle position in the table. There are two mutually exclusive situations. In the first case, the median lies in a position between the pre_tally and post_tally of one weight value. In the second case, the median lies on the pre_tally of one row and the post_tally of another. The middle position can be calculated by the scalar subquery (SELECT MAX(post_tally) / 2.0 FROM Summary).

SELECT AVG(S3.weight) AS median
  FROM Summary AS S3
 WHERE (S3.post_tally > (SELECT MAX(post_tally) / 2.0 FROM Summary)
        AND S3.pre_tally < (SELECT MAX(post_tally) / 2.0 FROM Summary))
    OR S3.pre_tally = (SELECT MAX(post_tally) / 2.0 FROM Summary)
    OR S3.post_tally = (SELECT MAX(post_tally) / 2.0 FROM Summary);
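
Worked against the sample Parts data (weights 12, 12, 12, 14, 17, 19), the Summary table comes out like this, and the middle position is MAX(post_tally) / 2.0 = 3.0:

Summary
weight occurs pre_tally post_tally
==================================
  12     3       0          3
  14     1       3          4
  17     1       4          5
  19     1       5          6

The row for 12 hits the predicate post_tally = 3.0 and the row for 14 hits pre_tally = 3.0, so AVG(S3.weight) returns (12 + 14) / 2.0 = 13, agreeing with the sorted-table median found earlier.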


The first predicate, with the AND operator, handles the case where the median falls inside one weight value; the other two predicates handle the case where the median is between two weights. A BETWEEN predicate will not work in this query.

These tables can be used to compute percentiles, deciles, and quartiles simply by changing the scalar subquery. For example, to find the highest tenth (first decile), the subquery would be (SELECT 9 * MAX(post_tally) / 10 FROM Summary); to find the highest two-tenths, (SELECT 8 * MAX(post_tally) / 10 FROM Summary). In general, to find the highest n-tenths, use (SELECT (10 - n) * MAX(post_tally) / 10 FROM Summary).

23.3.6 Vaughan’s Median with VIEWs

Philip Vaughan of San Jose, California, proposed a simple median technique based on all of these methods. It derives a VIEW with unique weights and number of occurrences, and then derives a VIEW of the middle set of weights.

CREATE VIEW ValueSet (weight, occurs)
AS SELECT weight, COUNT(*)
     FROM Parts
    GROUP BY weight;

The MiddleValues VIEW is used to get the median by taking an average. The clever part of this code is the way the outermost WHERE clause handles the empty result sets that come from having only one value for all weights in the table. Empty sets sum to NULL, because there is no element to map the index.

CREATE VIEW MiddleValues (weight)
AS SELECT weight
     FROM ValueSet AS VS1
    WHERE (SELECT SUM(VS2.occurs) / 2.0 + 0.25
             FROM ValueSet AS VS2)
          > (SELECT SUM(VS2.occurs)
               FROM ValueSet AS VS2
              WHERE VS1.weight <= VS2.weight) - VS1.occurs
      AND (SELECT SUM(VS2.occurs) / 2.0 + 0.25
             FROM ValueSet AS VS2)
          > (SELECT SUM(VS2.occurs)
               FROM ValueSet AS VS2
              WHERE VS1.weight >= VS2.weight) - VS1.occurs;

SELECT AVG(weight) AS median FROM MiddleValues;
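
As a worked check against the sample Parts data, ValueSet holds (12, 3), (14, 1), (17, 1), (19, 1); SUM(occurs) is 6, so the threshold is 6 / 2.0 + 0.25 = 3.25, and the two comparisons work out as follows:

weight 12: 3.25 > (6 - 3) = 3  and  3.25 > (3 - 3) = 0        -- kept
weight 14: 3.25 > (3 - 1) = 2  and  3.25 > (4 - 1) = 3        -- kept
weight 17: 3.25 > (2 - 1) = 1  but  3.25 > (5 - 1) = 4 fails  -- rejected
weight 19: 3.25 > (1 - 1) = 0  but  3.25 > (6 - 1) = 5 fails  -- rejected

MiddleValues is therefore (12, 14), and AVG(weight) returns 13 once again.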

23.3.7 Median with Characteristic Function

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the median (SQL Forum 1993, 1994). They define a characteristic function, which they call delta, using the Sybase SIGN() function. The delta or characteristic function accepts a Boolean expression as an argument, and returns one if it is TRUE and zero if it is FALSE or UNKNOWN. We can construct the delta function easily with a CASE expression.
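
As a minimal sketch of that construction (the comparison against 14 is an arbitrary example, not part of the authors' queries), the delta of a search condition is simply:

-- delta(weight <= 14): 1 when the condition is TRUE, 0 when FALSE or UNKNOWN.
SELECT part_nbr,
       CASE WHEN weight <= 14 THEN 1 ELSE 0 END AS delta
  FROM Parts;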

The authors also distinguish between the statistical median, whose value must be a member of the set, and the financial median, whose value is the average of the middle two members of the set. A statistical median exists when the number of items in the set is odd. If the number of items is even, you must decide whether you want to use the highest value in the lower half (they call this the left median) or the lowest value in the upper half (they call this the right median).

The left statistical median of a unique column can be found with this query, if you assume that we have a column called bin that represents the storage location of a part:

SELECT P1.bin
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin <= P1.bin) THEN 1 ELSE 0 END)
       = (COUNT(*) / 2.0);

Changing the direction of the theta test in the HAVING clause will allow you to pick the right statistical median if a central element does not exist in the set. You will also notice something else about the median of a set of unique values: it is usually meaningless. What does the median bin number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a median.

The statistical median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap; thus, the left statistical median is found by:

SELECT P1.weight
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.weight
HAVING SUM(CASE WHEN P2.weight <= P1.weight
                THEN 1 ELSE 0 END)
       >= (COUNT(*) / 2.0)
   AND SUM(CASE WHEN P2.weight >= P1.weight
                THEN 1 ELSE 0 END)
       >= (COUNT(*) / 2.0);

Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items. Switching the comparison operators in the two CASE expressions will give you the right statistical median. The authors' query for the financial median depends on some Sybase features that cannot be found in other products, so I would recommend using a combination of the right and left statistical medians to return a set of values about the center of the data, and then averaging them. Using a derived table, we can write the query as:

SELECT AVG(DISTINCT weight)
  FROM (SELECT P1.weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.weight
        HAVING SUM(CASE WHEN P2.weight <= P1.weight
                        THEN 1 ELSE 0 END)
               >= (COUNT(*) / 2.0)
           AND SUM(CASE WHEN P2.weight >= P1.weight
                        THEN 1 ELSE 0 END)
               >= (COUNT(*) / 2.0)) AS Medians (weight);

In doing this, we can gain some additional control over the calculations. This version will use one copy of the left and right median to compute the statistical median. However, by simply changing the AVG(DISTINCT weight) to AVG(weight), the median will favor the direction with the most occurrences. This might be easier to see with an example. Assume that we have weights (13, 13, 13, 14) in the Parts table. A pure statistical median would be (13 + 14) / 2.0 = 13.5; however, weighting it would give (13 + 13 + 13 + 14) / 4.0 = 13.25, which is more representative of central tendency.

Another version of the financial median, which uses the CASE expression in both of its forms, is:
