Joe Celko's SQL for Smarties - Advanced SQL Programming



542 CHAPTER 23: STATISTICS IN SQL

This approach works fine for two variables and would produce a table that could be sent to a report writer program to give a final version. But where are your column and row totals? This means you also need to write these two queries:

SELECT race, COUNT(*) FROM Personnel GROUP BY race;

SELECT sex, COUNT(*) FROM Personnel GROUP BY sex;

However, what I wanted was a table with a row for males and a row for females, with columns for each of the racial groups, just as I drew it. But let us assume that we want to get this information broken down within a third variable, such as a job code. I want to see the job_nbr and the total by sex and race within each job code. Our query set starts to get bigger and bigger. A crosstab can also include other summary data, such as total or average salary within each cell of the table.

23.7.1 Crosstabs by Cross Join

A solution proposed by John M. Baird of Datapoint in San Antonio, Texas, involves creating a matrix table for each variable in the crosstab, thus:

SexMatrix
sex   Male  Female
==================
'M'    1     0
'F'    0     1

RaceMatrix
race       asian  black  caucasian  latino  Other
=================================================
asian        1      0        0        0       0
black        0      1        0        0       0
caucasian    0      0        1        0       0
latino       0      0        0        1       0
Other        0      0        0        0       1

The query then constructs the cells by using a CROSS JOIN (Cartesian product) and summation for each one, thus:

SELECT job_nbr,

SUM(asian * male) AS AsianMale,

SUM(asian * female) AS AsianFemale,

SUM(black * male) AS BlackMale,

SUM(black * female) AS BlackFemale,

SUM(caucasian * male) AS CaucMale,

SUM(caucasian * female) AS CaucFemale,

SUM(latino * male) AS LatinoMale,

SUM(latino * female) AS LatinoFemale,

SUM(other * male) AS OtherMale,

SUM(other * female) AS OtherFemale

FROM Personnel, SexMatrix, RaceMatrix

WHERE (RaceMatrix.race = Personnel.race)

AND (SexMatrix.sex = Personnel.sex)

GROUP BY job_nbr;

Numeric summary data can be obtained from this table. For example, the total salary for each cell can be computed by SUM(<race> * <sex> * salary) AS <cell name> in place of what we have here.
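The matrix-table technique can be tried out on a small data set. The following is a minimal sketch using Python's standard sqlite3 module; the Personnel rows, the emp_name and salary columns, and the reduction to two racial groups are assumptions made only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,   -- hypothetical column
 job_nbr INTEGER NOT NULL,
 sex TEXT NOT NULL,
 race TEXT NOT NULL,
 salary REAL NOT NULL);                -- hypothetical column
INSERT INTO Personnel VALUES
 ('Al', 1, 'M', 'asian', 100.0),
 ('Bea', 1, 'F', 'asian', 120.0),
 ('Cy', 1, 'M', 'black', 90.0),
 ('Di', 2, 'F', 'black', 110.0),
 ('Ed', 2, 'M', 'asian', 80.0);

-- One 0/1 indicator column per category value.
CREATE TABLE SexMatrix (sex TEXT PRIMARY KEY, male INTEGER, female INTEGER);
INSERT INTO SexMatrix VALUES ('M', 1, 0), ('F', 0, 1);
CREATE TABLE RaceMatrix (race TEXT PRIMARY KEY, asian INTEGER, black INTEGER);
INSERT INTO RaceMatrix VALUES ('asian', 1, 0), ('black', 0, 1);
""")

# Each cell is the sum of the product of two indicators;
# multiplying by salary turns a head count into a salary total.
rows = conn.execute("""
SELECT job_nbr,
       SUM(asian * male) AS AsianMale,
       SUM(asian * female) AS AsianFemale,
       SUM(black * male) AS BlackMale,
       SUM(black * female) AS BlackFemale,
       SUM(asian * male * salary) AS AsianMaleSalary
  FROM Personnel, SexMatrix, RaceMatrix
 WHERE RaceMatrix.race = Personnel.race
   AND SexMatrix.sex = Personnel.sex
 GROUP BY job_nbr
 ORDER BY job_nbr
""").fetchall()

for row in rows:
    print(row)
```

Each Personnel row matches exactly one row in each matrix table, so the products act as switches that route the row into a single cell.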

23.7.2 Crosstabs by Outer Joins

Another method, due to Jim Panttaja, uses a series of temporary tables or VIEWs and then combines them with OUTER JOINs:

CREATE VIEW Guys (race, maletally)

AS SELECT race, COUNT(*)

FROM Personnel

WHERE sex = 'M'

GROUP BY race;

Correspondingly, you could have written:

CREATE VIEW Dolls (race, femaletally)

AS SELECT race, COUNT(*)

FROM Personnel

WHERE sex = 'F'

GROUP BY race;

But they can be combined for a crosstab, without column and row totals, like this:


SELECT Guys.race, maletally, femaletally FROM Guys LEFT OUTER JOIN Dolls

ON Guys.race = Dolls.race;

The idea is to build a starting column in the crosstab, then progressively add columns to it. You use the LEFT OUTER JOIN to avoid missing-data problems: a race with no rows in Dolls still appears, with a NULL femaletally.
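The behavior of the outer join on missing data can be seen in a small sketch. It uses Python's standard sqlite3 module; the sample rows are assumptions, and the view columns are named with AS inside the SELECT rather than in a column list on the view, purely for portability across SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel (sex TEXT NOT NULL, race TEXT NOT NULL);
INSERT INTO Personnel VALUES
 ('M', 'asian'), ('F', 'asian'),
 ('M', 'black'), ('M', 'black'),
 ('F', 'latino');

CREATE VIEW Guys AS
SELECT race, COUNT(*) AS maletally
  FROM Personnel
 WHERE sex = 'M'
 GROUP BY race;

CREATE VIEW Dolls AS
SELECT race, COUNT(*) AS femaletally
  FROM Personnel
 WHERE sex = 'F'
 GROUP BY race;
""")

rows = conn.execute("""
SELECT Guys.race, maletally, femaletally
  FROM Guys
       LEFT OUTER JOIN Dolls
       ON Guys.race = Dolls.race
 ORDER BY Guys.race
""").fetchall()

print(rows)  # 'black' has no female rows, so its femaletally is NULL (None)
```

In this sample, 'latino' has no male rows, so it never appears in Guys and is absent from the result. If that matters, one option is to drive the join from a table that lists every race.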

23.7.3 Crosstabs by Subquery

Another method takes advantage of the orthogonality of correlated subqueries in SQL-92. Think about what each row or column in the crosstab wants.

SELECT DISTINCT race,
       (SELECT COUNT(*)
          FROM Personnel AS P1
         WHERE P0.race = P1.race
           AND sex = 'M') AS MaleTally,
       (SELECT COUNT(*)
          FROM Personnel AS P2
         WHERE P0.race = P2.race
           AND sex = 'F') AS FemaleTally
  FROM Personnel AS P0;

An advantage of this approach is that you can attach another column to get the row tally by adding

       (SELECT COUNT(*)
          FROM Personnel AS P3
         WHERE P0.race = P3.race) AS RaceTally

Likewise, to get the column tallies, union the previous query with:

SELECT 'Summary',
       (SELECT COUNT(*)
          FROM Personnel
         WHERE sex = 'M') AS GrandMaleTally,
       (SELECT COUNT(*)
          FROM Personnel
         WHERE sex = 'F') AS GrandFemaleTally,
       (SELECT COUNT(*)
          FROM Personnel) AS GrandTally
  FROM Personnel;
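Put together, the subquery crosstab with its row tally and summary row might look like the sketch below. It runs under Python's standard sqlite3 module; the sample rows are assumptions, and the result is loaded into a dictionary because the order of rows coming out of a UNION is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel (sex TEXT NOT NULL, race TEXT NOT NULL);
INSERT INTO Personnel VALUES
 ('M', 'asian'), ('F', 'asian'),
 ('M', 'black'), ('M', 'black'),
 ('F', 'latino');
""")

rows = conn.execute("""
SELECT DISTINCT race,
       (SELECT COUNT(*) FROM Personnel AS P1
         WHERE P0.race = P1.race AND P1.sex = 'M') AS MaleTally,
       (SELECT COUNT(*) FROM Personnel AS P2
         WHERE P0.race = P2.race AND P2.sex = 'F') AS FemaleTally,
       (SELECT COUNT(*) FROM Personnel AS P3
         WHERE P0.race = P3.race) AS RaceTally
  FROM Personnel AS P0
UNION
SELECT 'Summary',
       (SELECT COUNT(*) FROM Personnel WHERE sex = 'M'),
       (SELECT COUNT(*) FROM Personnel WHERE sex = 'F'),
       (SELECT COUNT(*) FROM Personnel)
  FROM Personnel
""").fetchall()

# Key the rows by their first column: race name or 'Summary'.
crosstab = {row[0]: row[1:] for row in rows}
print(crosstab)
```

Note that the plain UNION matters here: the summary SELECT produces one identical row per Personnel row, and UNION collapses them into one.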

23.7.4 Crosstabs by CASE Expression

Probably the best method is to use the CASE expression. If you need to get the final row of the traditional crosstab (the column totals), you can add it with a UNION ALL:

SELECT sex,

SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END) AS caucasian,

SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,

SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,

SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END) AS latino,

SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END) AS other,

COUNT(*) AS row_total

FROM Personnel

GROUP BY sex

UNION ALL

SELECT ' ',

SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END),

SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END),

SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END),

SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END),

SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END),

COUNT(*) AS column_total

FROM Personnel;
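A cut-down version of the CASE crosstab can be run as follows. This is a sketch under Python's standard sqlite3 module, with assumed sample rows and only three racial groups to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel (sex TEXT NOT NULL, race TEXT NOT NULL);
INSERT INTO Personnel VALUES
 ('M', 'asian'), ('F', 'asian'),
 ('M', 'black'), ('M', 'black'),
 ('F', 'latino');
""")

rows = conn.execute("""
SELECT sex,
       SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,
       SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,
       SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END) AS latino,
       COUNT(*) AS row_total
  FROM Personnel
 GROUP BY sex
UNION ALL
SELECT ' ',
       SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END),
       SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END),
       SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END),
       COUNT(*)
  FROM Personnel
""").fetchall()

# Key by the first column; ' ' marks the column-total row.
crosstab = {row[0]: row[1:] for row in rows}
print(crosstab)
```

The whole crosstab, including both sets of totals, comes from two passes over Personnel and needs no helper tables or views.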

23.8 Harmonic Mean and Geometric Mean

The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the values of a set. It is appropriate when dealing with rates and prices. Of limited use, it is found mostly in averaging rates.

SELECT COUNT(*)/SUM(1.0/x) AS harmonic_mean

FROM Foobar;

The geometric mean is the exponential of the mean of the logs of the data items. You can also express it as the nth root of the product of the (n) data items. This second form is more subject to rounding errors than the first. The geometric mean is sometimes a better measure of central tendency than the simple arithmetic mean when you are analyzing change over time.

SELECT EXP (AVG (LOG (nbr))) AS geometric_mean FROM NumberTable;

If you have zero or negative numbers, this will blow up, because the logarithm is not defined for values less than or equal to zero.
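Both means can be checked numerically with a short sketch. It uses Python's standard sqlite3 module and registers LN and EXP from Python's math module, since a given SQLite build may not ship math functions. The function is named LN rather than LOG because the base of LOG differs among SQL products; the three sample values are assumptions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register natural log and exponential so the query does not depend
# on how this SQLite build was compiled.
conn.create_function("LN", 1, math.log)
conn.create_function("EXP", 1, math.exp)

conn.executescript("""
CREATE TABLE Foobar (x REAL NOT NULL);
INSERT INTO Foobar VALUES (2.0), (4.0), (8.0);
""")

harmonic_mean, geometric_mean = conn.execute("""
SELECT COUNT(*) / SUM(1.0 / x) AS harmonic_mean,
       EXP(AVG(LN(x))) AS geometric_mean
  FROM Foobar
""").fetchone()

# The nth-root-of-the-product form, computed directly for comparison.
values = [2.0, 4.0, 8.0]
root_form = math.prod(values) ** (1.0 / len(values))

print(harmonic_mean, geometric_mean, root_form)
```

For these values the harmonic mean is 3 / 0.875, and both forms of the geometric mean come out at the cube root of 64, up to floating-point rounding.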

23.9 Multivariable Descriptive Statistics in SQL

More and more SQL products are adding more complicated descriptive statistics to their aggregate function library. For example, CA-Ingres comes with a very nice set of such tools.

Many of the single-column aggregate functions for which we just gave code are built-in functions. If you have that advantage, then use them. They will have corrections for floating-point rounding errors and be more accurate.

Descriptive statistics are not all single-column computations. You often want to know relationships among several variables for prediction and description. Let’s pick one statistic that is representative of this class of functions and see what problems we have writing our own aggregate function for it.

23.9.1 Covariance

The covariance is defined as a measure of the extent to which two variables move together. Financial analysts use it to determine the degree to which return on two securities is related over time. A high covariance indicates similar movements. This code is due to Steve Kass:

CREATE TABLE Samples (sample_nbr INTEGER NOT NULL PRIMARY KEY,

x FLOAT NOT NULL,

y FLOAT NOT NULL);

INSERT INTO Samples VALUES (1, 3, 9), (2, 2, 7), (3, 4, 12), (4, 5, 15), (5, 6, 17);

-- population covariance: the mean of the products of the deviations
SELECT ((1.0/n) * SUM((x - xbar)*(y - ybar))) AS covariance
  FROM Samples
       CROSS JOIN
       (SELECT COUNT(*), AVG(x), AVG(y) FROM Samples)
       AS A (n, xbar, ybar)
 GROUP BY n;
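The covariance for the five sample rows can be worked out by hand and checked in a small sketch. It uses Python's standard sqlite3 module; because SQLite does not accept a column list after a derived-table alias, the columns n, xbar, and ybar are named with AS inside the subquery instead, which is an adaptation rather than the form shown in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Samples
(sample_nbr INTEGER NOT NULL PRIMARY KEY,
 x FLOAT NOT NULL,
 y FLOAT NOT NULL);
INSERT INTO Samples
VALUES (1, 3, 9), (2, 2, 7), (3, 4, 12), (4, 5, 15), (5, 6, 17);
""")

# xbar = 4 and ybar = 12, so the deviation products are
# 3, 10, 0, 3, 10; their sum is 26 and 26 / 5 = 5.2.
(covariance,) = conn.execute("""
SELECT (1.0 / n) * SUM((x - xbar) * (y - ybar)) AS covariance
  FROM Samples
       CROSS JOIN
       (SELECT COUNT(*) AS n, AVG(x) AS xbar, AVG(y) AS ybar
          FROM Samples) AS A
 GROUP BY n
""").fetchone()

print(covariance)
```

This is the population covariance, which divides by n. A sample covariance would divide by (n - 1) instead.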

23.9.2 Pearson’s r

One of the most useful statistics derived from covariance is Pearson’s r, or the linear correlation coefficient. It measures the strength of the linear association between two variables. In English, given a set of observations (x1, y1), (x2, y2), ..., (xn, yn), I want to know: when one variable goes up or down, how well does the other variable follow it?

The correlation coefficient always takes a value between +1 and -1. Positive one means that they match each other exactly. Negative one means that increasing values in one variable correspond to decreasing values in the other variable. A correlation value close to zero indicates no linear association between the variables. In the real world, you will not see +1 or -1 very often; this would mean that you are looking at a natural law, and not a statistical relationship. The values in between are much more realistic, with 0.70 or greater being a strong correlation.

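The formula translates into SQL in a straightforward manner. Because aggregate functions cannot be nested, the averages are computed once in a derived table:

CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
 x REAL,
 y REAL);

INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, 5.0), ('c', 3.0, 6.0);

SELECT SUM((x - xbar) * (y - ybar))
       / SQRT(SUM((x - xbar)^2) * SUM((y - ybar)^2))
       AS pearson_r
  FROM Samples
       CROSS JOIN
       (SELECT AVG(x), AVG(y) FROM Samples)
       AS A (xbar, ybar);

For this sample data, r = 0.9608.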

SQRT() is the square root function, which is quite common in SQL today, and ^2 is the square of the number. Some products use POWER(x, n) instead of the exponent notation. Alternatively, you can use repeated multiplication.
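The repeated-multiplication variant can be checked against the three sample rows with a short sketch. It uses Python's standard sqlite3 module, which has no ^ operator for powers, and registers SQRT from Python's math module in case the SQLite build lacks it. The averages are computed in a derived table, with column names given by AS inside the subquery; that arrangement is an assumption of this sketch.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, math.sqrt)  # in case the build has no SQRT

conn.executescript("""
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
 x REAL,
 y REAL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, 5.0), ('c', 3.0, 6.0);
""")

# Squares are written as repeated multiplication.
(pearson_r,) = conn.execute("""
SELECT SUM((x - xbar) * (y - ybar))
       / SQRT(SUM((x - xbar) * (x - xbar))
              * SUM((y - ybar) * (y - ybar))) AS pearson_r
  FROM Samples
       CROSS JOIN
       (SELECT AVG(x) AS xbar, AVG(y) AS ybar FROM Samples) AS A
""").fetchone()

print(round(pearson_r, 4))
```

The numerator is 4 and the denominator is the square root of 2 times 26/3, which gives roughly 0.9608.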


23.9.3 NULLs in Multivariable Descriptive Statistics

If (x, y) = (NULL, NULL), then the query will drop the pair in the aggregate functions, as per the usual rules of SQL But what is the correct (or reasonable) behavior if (x, y) has one and only one NULL in the pair?

We can make several arguments:

1. Drop the pairs that contain any NULLs. That is quick and easy with a “WHERE x IS NOT NULL AND y IS NOT NULL” clause added to the query. The argument is that if you don’t know one or both values, how can you know what their relationship is?

2. Convert (x, NULL) to (x, AVG(y)) and (NULL, y) to (AVG(x), y). The idea is to “smooth out” the missing values with a reasonable replacement that is based on the whole set from which known values were drawn. There might be better replacement values in a particular situation, but that idea would still hold.

3. Replace (NULL, NULL) with (a, a) for some value to say that the NULLs are in the same grouping. This kind of “pseudo-equality” is the basis for putting NULLs into one group in a GROUP BY operation. I am not sure what the correct practice for the (x, NULL) and (NULL, y) pairs is.

4. First calculate a linear regression with the known pairs, say y = (a + b*x), and then fill in the expected values. If you forgot your high school algebra, that would be y[i] = a + b * x[i] for the pair (x[i], NULL), and x[i] = (y[i] - a) / b for the pair (NULL, y[i]).

5. Catch the SQLSTATE warning code message (found in Standard SQL) to show that an aggregate function has dropped NULLs before doing the computations, and use the message to report to the user about the missing data.

I can also use COUNT(*) and COUNT(x+y) to determine how much data is missing. I think we would all agree that if I have a small subset of non-NULL pairs, then my correlation is less reliable than if I obtained it from a large subset of non-NULL pairs.

There is no right answer to this question. You will need to know the nature of your data to make a good decision.
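The COUNT(*) versus COUNT(x+y) check, along with the first two options in the list, can be illustrated with a small sketch. It uses Python's standard sqlite3 module; the sample rows and the use of COALESCE for the average-replacement option are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
 x REAL,
 y REAL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, NULL), ('c', NULL, 6.0),
       ('d', 3.0, 6.0), ('e', NULL, NULL);
""")

# x + y is NULL whenever either operand is NULL, so COUNT(x + y)
# counts only the complete pairs.
total_pairs, complete_pairs = conn.execute(
    "SELECT COUNT(*), COUNT(x + y) FROM Samples").fetchone()

# Option 1: keep only the complete pairs.
kept = conn.execute("""
SELECT sample_name FROM Samples
 WHERE x IS NOT NULL AND y IS NOT NULL
 ORDER BY sample_name
""").fetchall()

# Option 2: replace a single missing value with the column average;
# pairs that are NULL on both sides are still dropped.
filled = conn.execute("""
SELECT sample_name,
       COALESCE(x, (SELECT AVG(x) FROM Samples)),
       COALESCE(y, (SELECT AVG(y) FROM Samples))
  FROM Samples
 WHERE x IS NOT NULL OR y IS NOT NULL
 ORDER BY sample_name
""").fetchall()

print(total_pairs, complete_pairs, kept, filled)
```

Here only two of the five pairs are complete, which is the kind of ratio that should make you cautious about any correlation computed from them.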


CHAPTER 24

Regions, Runs, Gaps, Sequences, and Series

TABLES DO NOT HAVE an ordering to their rows. Yes, the physical storage of the rows in many SQL products might be ordered if the product is built on an old file system. More modern implementations might not construct and materialize the result set rows until the end of the query execution.

The first rule in a relational database is that all relationships are shown in tables by values in columns. This means that things involving an ordering must have a table with at least two columns. One column, the sequence number, is the primary key; the other column has the value that holds that position in the sequence.

The sequence column has consecutive unique integers, without any gaps in the numbering. Examples of this sort of data would be ticket numbers, time series data taken at fixed intervals, and the like. The ordering of those identifiers carries some information, such as physical or temporal location. A subsequence is a set of consecutive unique identifiers within a larger containing sequence that has some property. This property is usually consecutive numbering.

For example, consider the data:

CREATE TABLE List
(seq_nbr INTEGER NOT NULL UNIQUE,
 val INTEGER NOT NULL UNIQUE);

INSERT INTO List
VALUES (1, 99), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 0);

You can find subsequences of size three that follow the rule: (10, 11, 12), (11, 12, 13), and (12, 13, 14). But the longest sequence is (10, 11, 12, 13, 14), and it is of size five.

A run is like a sequence, but the numbers do not have to be consecutive, just increasing and contiguous. For example, given the run {(1, 1), (2, 2), (3, 12), (4, 15), (5, 23)}, you can find subruns of size three: (1, 2, 12), (2, 12, 15), and (12, 15, 23).

A region is contiguous, and all the values are the same. For example, {(1, 1), (2, 0), (3, 0), (4, 0), (5, 25)} has a region of zeros that is three items long.

In procedural languages, you would simply sort the data and scan it. In SQL, you have to define everything in terms of sets and nested sets. Some of these queries can be done with the OLAP addition to SQL-99, but they are not yet common in SQL products.
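For contrast with the set-oriented queries that follow, the procedural sort-and-scan approach might look like this minimal Python sketch. It finds the longest subsequence of consecutive values and assumes, as in the definition of a sequence, that seq_nbr has no gaps, so neighbors in sorted order are contiguous.

```python
# (seq_nbr, val) pairs, as in a table with a sequence column.
rows = [(1, 99), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 0)]

rows.sort()      # order by seq_nbr
best = []        # longest subsequence found so far
current = []     # subsequence ending at the row being scanned

for seq_nbr, val in rows:
    if current and val == current[-1] + 1:
        current.append(val)    # extends the consecutive numbering
    else:
        current = [val]        # numbering broken; start over here
    if len(current) > len(best):
        best = list(current)

print(best)
```

A single pass with two pieces of state is enough in a procedural language; the SQL versions have to express the same idea as conditions on sets of rows.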

24.1 Finding Subregions of Size (n)

This example is adapted from SQL and Its Applications (Lorie and Daudenarde 1991). You are given a table of theater seats:

CREATE TABLE Theater
(seat_nbr INTEGER NOT NULL PRIMARY KEY, -- sequencing number
 occupancy_status CHAR(1) NOT NULL -- values
 CONSTRAINT valid_occupancy_status
 CHECK (occupancy_status IN ('A', 'S')));

In this table, an occupancy_status code of ‘A’ means available, and ‘S’ means sold. Your problem is to write a query that will return the subregions of (n) consecutive seats still available. Assume for a moment that consecutive seat_nbrs means that the seats are also consecutive, ignoring rows of seating where seat_nbr(n) and seat_nbr((n) + 1) might be on different physical theater rows. For (n) = 3, we can write a self-JOIN query, thus:

SELECT T1.seat_nbr, T2.seat_nbr, T3.seat_nbr
  FROM Theater AS T1, Theater AS T2, Theater AS T3
 WHERE T1.occupancy_status = 'A'
   AND T2.occupancy_status = 'A'
   AND T3.occupancy_status = 'A'
   AND T2.seat_nbr = T1.seat_nbr + 1
   AND T3.seat_nbr = T2.seat_nbr + 1;

The trouble with this answer is that it works only for (n = 3). This pattern can be extended for any (n), but what we really want is a generalized query where we can use (n) as a parameter to the query. The solution given by Lorie and Daudenarde starts with a given seat_nbr and looks at all the available seats between it and ((n) - 1) seats further up. The real trick is switching from the English-language statement “All seats between here and there are available” to the passive-voice version, “Available is the occupancy_status of all the seats between here and there,” so that you can see the query.

SELECT seat_nbr, ' thru ', (seat_nbr + (:n - 1))
  FROM Theater AS T1
 WHERE occupancy_status = 'A'
   AND 'A' = ALL (SELECT occupancy_status
                    FROM Theater AS T2
                   WHERE T2.seat_nbr > T1.seat_nbr
                     AND T2.seat_nbr <= T1.seat_nbr + (:n - 1));

Please notice that this returns subregions. That is, if seats (1, 2, 3, 4, 5) are available, this query will return (1, 2, 3), (2, 3, 4), and (3, 4, 5) as its result set.
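The parameterized query can be exercised with a short sketch. It runs under Python's standard sqlite3 module, which does not support the = ALL quantifier, so the sketch swaps in the logically equivalent NOT EXISTS form: no seat in the range has a status other than 'A'. The seven-seat sample, with seats 6 and 7 sold, is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Theater
(seat_nbr INTEGER NOT NULL PRIMARY KEY,
 occupancy_status CHAR(1) NOT NULL
 CONSTRAINT valid_occupancy_status
 CHECK (occupancy_status IN ('A', 'S')));
INSERT INTO Theater
VALUES (1, 'A'), (2, 'A'), (3, 'A'), (4, 'A'), (5, 'A'),
       (6, 'S'), (7, 'S');
""")

# NOT EXISTS stands in for "'A' = ALL (...)"; :n is a named parameter.
find_regions = """
SELECT seat_nbr, seat_nbr + (:n - 1)
  FROM Theater AS T1
 WHERE occupancy_status = 'A'
   AND NOT EXISTS
       (SELECT *
          FROM Theater AS T2
         WHERE T2.seat_nbr > T1.seat_nbr
           AND T2.seat_nbr <= T1.seat_nbr + (:n - 1)
           AND T2.occupancy_status <> 'A')
 ORDER BY seat_nbr
"""

regions = conn.execute(find_regions, {"n": 3}).fetchall()
print(regions)  # (first seat, last seat) of each available subregion
```

One caveat worth testing on your own data: a starting seat within ((n) - 1) seats of the highest seat_nbr has fewer than ((n) - 1) rows after it, and either form of the predicate would still accept it if those few rows are all available. Here the sold seats 6 and 7 prevent that.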

24.2 Numbering Regions

Instead of looking for a region, we want to number the regions in the order in which they appear. For example, given a view or table with a payment history, we want to break it into groupings of behavior, such as whether the payments were on time or late.

CREATE TABLE PaymentHistory

(payment_nbr INTEGER NOT NULL PRIMARY KEY,

paid_on_time CHAR(1) DEFAULT 'Y' NOT NULL

CHECK(paid_on_time IN ('Y', 'N')));

INSERT INTO PaymentHistory

VALUES (1006, 'Y'), (1005, 'Y'),
