Joe Celko's SQL for Smarties - Advanced SQL Programming



562 CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES

or:

SELECT MIN(F1.seq_nbr + 1)
  FROM (SELECT seq_nbr FROM List
        UNION ALL
        VALUES (0)) AS F1 (seq_nbr)
 WHERE (F1.seq_nbr + 1) NOT IN (SELECT seq_nbr FROM List);

Finding entire gaps follows from this pattern, and we get this short piece of code:

SELECT (s + 1) AS gap_start, (e - 1) AS gap_end
  FROM (SELECT L1.seq_nbr, MIN(L2.seq_nbr)
          FROM List AS L1, List AS L2
         WHERE L1.seq_nbr < L2.seq_nbr
         GROUP BY L1.seq_nbr) AS G(s, e)
 WHERE (e - 1) > s;

Without the derived table we get:

SELECT (L1.seq_nbr + 1) AS gap_start,
       (MIN(L2.seq_nbr) - 1) AS gap_end
  FROM List AS L1, List AS L2
 WHERE L1.seq_nbr < L2.seq_nbr
 GROUP BY L1.seq_nbr
HAVING (MIN(L2.seq_nbr) - L1.seq_nbr) > 1;
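The gap query can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine, and the sample values in List are hypothetical, chosen so that the gaps are 4 to 5 and 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
# Hypothetical sample data: 4, 5 and 8 are missing.
conn.executemany("INSERT INTO List VALUES (?)",
                 [(n,) for n in (1, 2, 3, 6, 7, 9)])

# Pair each value with the next higher value; a gap exists when they
# differ by more than one.
gaps = conn.execute("""
    SELECT (L1.seq_nbr + 1) AS gap_start,
           (MIN(L2.seq_nbr) - 1) AS gap_end
      FROM List AS L1, List AS L2
     WHERE L1.seq_nbr < L2.seq_nbr
     GROUP BY L1.seq_nbr
    HAVING (MIN(L2.seq_nbr) - L1.seq_nbr) > 1
     ORDER BY gap_start
""").fetchall()

print(gaps)  # [(4, 5), (8, 8)]
```

A one-number gap comes back with gap_start equal to gap_end.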

24.6 Summation of a Series

While this topic is a bit more mathematical than most SQL programmers actually have to use in their work, it does demonstrate the power of SQL and a little knowledge of some basic college math.

The summation of a series builds a running total of the values in a table and shows the cumulative total for each value in the series. Let's create a table and some sample data.

CREATE TABLE Series
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 val INTEGER NOT NULL,
 answer INTEGER -- null means not computed yet
);

Series

seq_nbr  val  answer
====================
   1       6      6
   2      41     47
   3      12     59
   4      51    110
   5      21    131
   6      70    201
   7      79    280
   8      62    342

This simple summation is not a problem.

UPDATE Series
   SET answer = (SELECT SUM(S1.val)
                   FROM Series AS S1
                  WHERE S1.seq_nbr <= Series.seq_nbr)
 WHERE answer IS NULL;
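As a quick check, the running-total update can be run against the sample data. This is a minimal sketch that assumes Python's sqlite3 module as the engine; the correlated subquery uses one alias (S1) for the inner copy of Series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Series
                (seq_nbr INTEGER NOT NULL PRIMARY KEY,
                 val INTEGER NOT NULL,
                 answer INTEGER)""")
vals = [6, 41, 12, 51, 21, 70, 79, 62]  # sample data from the text
conn.executemany("INSERT INTO Series (seq_nbr, val) VALUES (?, ?)",
                 list(enumerate(vals, start=1)))

# For each row, sum every val at or before its position.
conn.execute("""
    UPDATE Series
       SET answer = (SELECT SUM(S1.val)
                       FROM Series AS S1
                      WHERE S1.seq_nbr <= Series.seq_nbr)
     WHERE answer IS NULL
""")

totals = [a for (a,) in
          conn.execute("SELECT answer FROM Series ORDER BY seq_nbr")]
print(totals)  # [6, 47, 59, 110, 131, 201, 280, 342]
```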

This is the form we can use for most problems of this type with only one level of summation. But things can be worse. This problem came from Francisco Moreno, and on the surface it sounds easy. First, create the usual table and populate it.

DROP TABLE Series;

CREATE TABLE Series

(seq_nbr INTEGER NOT NULL,

val REAL NOT NULL,

answer REAL);

INSERT INTO Series

VALUES (0, 6.0, NULL),

(1, 6.0, NULL),

(2, 10.0, NULL),

(3, 12.0, NULL),

(4, 14.0, NULL);


The goal is to compute the average of the first two terms, then add the third value to the result and average the two of them, and so forth. In this data, we would have:

seq_nbr  val   answer
=====================
   0      6.0   NULL
   1      6.0    6.0
   2     10.0    8.0
   3     12.0   10.0
   4     14.0   12.0

The first thing we need to do is get rid of the value where (seq_nbr = 0) and change the table to read:

seq_nbr  val   answer
=====================
   1     12.0   NULL
   2     10.0   NULL
   3     12.0   NULL
   4     14.0   NULL

The obvious approach is to do the calculations directly.

UPDATE Series
   SET answer = (Series.val
                 + (SELECT S1.answer
                      FROM Series AS S1
                     WHERE S1.seq_nbr = Series.seq_nbr - 1)) / 2.0
 WHERE answer IS NULL;

But there is a problem with this approach. It will only calculate one value at a time. The reason is that this series is much more complex than a simple running total.

What we have is actually a double summation, in which the terms are defined by a continued fraction. Let's work out the first four answers by brute force and see if we can find a pattern.

answer1 = (12)/2 = 6

answer2 = ((12)/2 + 10)/2 = 8

answer3 = (((12)/2 + 10)/2 + 12)/2 = 10

answer4 = (((((12)/2 + 10)/2 + 12)/2) + 14)/2 = 12


The real trick is to do some algebra and get rid of the nested parentheses:

answer1 = (12)/2 = 6

answer2 = (12/4) + (10/2) = 8

answer3 = (12/8) + (10/4) + (12/2) = 10

answer4 = (12/16) + (10/8) + (12/4) + (14/2) = 12

When we see powers of 2, we know we can rewrite them with a formula:

answer1 = (12)/2^1 = 6

answer2 = (12/(2^2)) + (10/(2^1)) = 8

answer3 = (12/(2^3)) + (10/(2^2)) + (12/(2^1)) = 10

answer4 = (12/2^4) + (10/(2^3)) + (12/(2^2)) + (14/(2^1)) = 12

The problem is that you need to “count backwards” from the current value to compute higher powers for the previous terms of the summation. That exponent is simply (current seq_nbr - previous seq_nbr + 1). Putting it all together, we get this expression:

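UPDATE Series
   SET answer
       = (SELECT SUM(S1.val
                     * POWER(0.5,  -- multiplying by 0.5^k divides by 2^k
                             CASE WHEN S1.seq_nbr > 0
                                  THEN S2.seq_nbr - S1.seq_nbr + 1
                                  ELSE NULL END))
            FROM Series AS S1, Series AS S2
           WHERE S2.seq_nbr = Series.seq_nbr  -- S2 stands in for the outer row
             AND S1.seq_nbr <= S2.seq_nbr);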

This assumes that we have a POWER(base, exponent) function in our implementation. The reason for the second copy of Series under the name S2 in the SUM() expression is that an aggregate function cannot have an outer reference.
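The closed form can be checked against the step-by-step averaging. The sketch below is only an illustration under a few assumptions: it uses Python's sqlite3 module, registers a stand-in POWER() function because SQLite builds may not ship one, and divides each term by 2 raised to (current seq_nbr - previous seq_nbr + 1), as in the expanded formulas. The second copy of Series (S2) carries the outer row's position into the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a stand-in POWER(), since SQLite builds may lack one.
conn.create_function("POWER", 2,
                     lambda base, exponent: float(base) ** exponent)
conn.execute("""CREATE TABLE Series
                (seq_nbr INTEGER NOT NULL,
                 val REAL NOT NULL,
                 answer REAL)""")
conn.executemany("INSERT INTO Series VALUES (?, ?, NULL)",
                 [(1, 12.0), (2, 10.0), (3, 12.0), (4, 14.0)])

# One statement fills every row: each earlier term is divided by a
# power of two that grows with its distance from the current row.
conn.execute("""
    UPDATE Series
       SET answer = (SELECT SUM(S1.val
                                / POWER(2, S2.seq_nbr - S1.seq_nbr + 1))
                       FROM Series AS S1, Series AS S2
                      WHERE S2.seq_nbr = Series.seq_nbr
                        AND S1.seq_nbr <= S2.seq_nbr)
""")
answers = [a for (a,) in
           conn.execute("SELECT answer FROM Series ORDER BY seq_nbr")]

# Cross-check with the recurrence: halve (previous answer + val).
expected, running = [], 0.0
for val in (12.0, 10.0, 12.0, 14.0):
    running = (running + val) / 2.0
    expected.append(running)

print(answers)  # [6.0, 8.0, 10.0, 12.0]
```

Both routes give the same four answers, which is what the algebra predicts.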

24.7 Swapping and Sliding Values in a List

You will often want to manipulate a list of values, changing their sequence position numbers. The simplest such operation is to swap two values in your table.


CREATE PROCEDURE SwapValues
(IN low_seq_nbr INTEGER, IN high_seq_nbr INTEGER)
LANGUAGE SQL
BEGIN
-- put them in order, using local variables so that the second CASE
-- still sees the original parameter values
DECLARE first_seq_nbr INTEGER;
DECLARE last_seq_nbr INTEGER;
SET first_seq_nbr = CASE WHEN low_seq_nbr <= high_seq_nbr THEN low_seq_nbr ELSE high_seq_nbr END;
SET last_seq_nbr = CASE WHEN low_seq_nbr <= high_seq_nbr THEN high_seq_nbr ELSE low_seq_nbr END;
UPDATE Runs -- swap
   SET seq_nbr = first_seq_nbr + ABS(seq_nbr - last_seq_nbr)
 WHERE seq_nbr IN (first_seq_nbr, last_seq_nbr);
END;

Inserting a new value into the table is easy:

CREATE PROCEDURE InsertValue (IN new_value INTEGER) LANGUAGE SQL

INSERT INTO Runs (seq_nbr, val) VALUES ((SELECT MAX(seq_nbr) FROM Runs) + 1, new_value);

A bit trickier procedure is to move one value to a new position and slide the remaining values either up or down. This mimics the way a physical queue would act. Here is a solution from Dave Portas.

CREATE PROCEDURE SlideValues

(IN old_seq_nbr INTEGER, IN new_seq_nbr INTEGER)

LANGUAGE SQL

UPDATE Runs

SET seq_nbr

= CASE

WHEN seq_nbr = old_seq_nbr THEN new_seq_nbr

WHEN seq_nbr BETWEEN old_seq_nbr AND new_seq_nbr THEN seq_nbr - 1

WHEN seq_nbr BETWEEN new_seq_nbr AND old_seq_nbr THEN seq_nbr + 1

ELSE seq_nbr END

WHERE seq_nbr BETWEEN old_seq_nbr AND new_seq_nbr

OR seq_nbr BETWEEN new_seq_nbr AND old_seq_nbr;


This handles moving a value to a higher or to a lower position in the table. You can see how calls or slight changes to these procedures could do other related operations.
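The slide can be exercised in both directions with a short sketch. It assumes Python's sqlite3 module and a hypothetical five-row Runs table; the procedure body is run as a plain parameterized UPDATE. The demo table deliberately has no key on seq_nbr, because SQLite checks uniqueness row by row and may reject the intermediate states of this statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: no PRIMARY KEY on seq_nbr, to avoid row-by-row
# uniqueness errors while positions are being shifted.
conn.execute("CREATE TABLE Runs (seq_nbr INTEGER NOT NULL, val TEXT NOT NULL)")
conn.executemany("INSERT INTO Runs VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")])

SLIDE = """
UPDATE Runs
   SET seq_nbr
       = CASE WHEN seq_nbr = :old_seq_nbr THEN :new_seq_nbr
              WHEN seq_nbr BETWEEN :old_seq_nbr AND :new_seq_nbr
              THEN seq_nbr - 1
              WHEN seq_nbr BETWEEN :new_seq_nbr AND :old_seq_nbr
              THEN seq_nbr + 1
              ELSE seq_nbr END
 WHERE seq_nbr BETWEEN :old_seq_nbr AND :new_seq_nbr
    OR seq_nbr BETWEEN :new_seq_nbr AND :old_seq_nbr
"""

def slide(old_seq_nbr, new_seq_nbr):
    """Move one row to a new position and return the values in order."""
    conn.execute(SLIDE, {"old_seq_nbr": old_seq_nbr,
                         "new_seq_nbr": new_seq_nbr})
    return [v for (v,) in
            conn.execute("SELECT val FROM Runs ORDER BY seq_nbr")]

moved_up = slide(2, 4)    # 'b' moves from position 2 to position 4
moved_back = slide(4, 2)  # and back again
print(moved_up)    # ['a', 'c', 'd', 'b', 'e']
print(moved_back)  # ['a', 'b', 'c', 'd', 'e']
```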

One of the most useful tricks is to have a calendar table with a Julianized date column. Instead of trying to manipulate temporal data, convert the dates to a sequence of integers and treat the queries as regions, runs, gaps, and so forth.

The sequence can be made up of calendar days or Julianized business days, which do not include holidays and weekends. There are a lot of possible methods.

24.8 Condensing a List of Numbers

The goal is to take a list of numbers and condense them into contiguous ranges. Show the high and low values for each range; if the range has one number, then the high and low values will be the same. This answer is due to Steve Kass.

SELECT MIN(i) AS low, MAX(i) AS high

FROM (SELECT N1.i, COUNT(N2.i) - N1.i

FROM Numbers AS N1, Numbers AS N2

WHERE N2.i <= N1.i

GROUP BY N1.i)

AS N(i, gp)

GROUP BY gp;
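The trick is that (rank - value) stays constant inside a run of consecutive numbers and changes at every break, so it can serve as a grouping key. A minimal sketch, assuming Python's sqlite3 module and hypothetical sample numbers; SQLite does not accept a column list after a derived-table alias, so the aliases are written inline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Numbers (i INTEGER NOT NULL PRIMARY KEY)")
# Hypothetical sample data: runs 1-3, 5-6 and the single value 9.
conn.executemany("INSERT INTO Numbers VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 9)])

# COUNT(N2.i) is the rank of N1.i; rank minus value is the group key.
ranges = conn.execute("""
    SELECT MIN(i) AS low, MAX(i) AS high
      FROM (SELECT N1.i AS i, COUNT(N2.i) - N1.i AS gp
              FROM Numbers AS N1, Numbers AS N2
             WHERE N2.i <= N1.i
             GROUP BY N1.i) AS N
     GROUP BY gp
     ORDER BY low
""").fetchall()

print(ranges)  # [(1, 3), (5, 6), (9, 9)]
```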

24.9 Folding a List of Numbers

It is possible to use the Sequence table, with a little math instead of self-joins, to place related values in columns of the same row.

For example, given the numbers 1 to (n), you might want to spread them out across (k) columns. Let (k = 3) so we can see the pattern.

SELECT seq_nbr,

CASE WHEN MOD((seq_nbr + 1), 3) = 2

AND seq_nbr + 1 <= :n

THEN (seq_nbr + 1)

ELSE NULL END AS second,

CASE WHEN MOD((seq_nbr + 2), 3) = 0

AND (seq_nbr + 2) <= :n

THEN (seq_nbr + 2)

ELSE NULL END AS third

FROM Sequence

WHERE MOD((seq_nbr + 3), 3) = 1 AND seq_nbr <= :n;

Columns which have no value assigned to them will get a NULL. That is, for (n = 8) the incomplete row will be (7, 8, NULL) and for (n = 7) it would be (7, NULL, NULL). We never get a row with (NULL, NULL, NULL).
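The two cases mentioned above can be reproduced with a short sketch. It assumes Python's sqlite3 module, a Sequence table filled with 1 to 20, and a registered stand-in for MOD(), which SQLite builds may lack. The output columns are named col1, col2 and col3 here only to stay clear of keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a stand-in MOD(), since SQLite builds may lack one.
conn.create_function("MOD", 2, lambda a, b: a % b)
conn.execute("CREATE TABLE Sequence (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO Sequence VALUES (?)",
                 [(n,) for n in range(1, 21)])

FOLD = """
SELECT seq_nbr AS col1,
       CASE WHEN MOD(seq_nbr + 1, 3) = 2 AND seq_nbr + 1 <= :n
            THEN seq_nbr + 1 ELSE NULL END AS col2,
       CASE WHEN MOD(seq_nbr + 2, 3) = 0 AND seq_nbr + 2 <= :n
            THEN seq_nbr + 2 ELSE NULL END AS col3
  FROM Sequence
 WHERE MOD(seq_nbr + 3, 3) = 1 AND seq_nbr <= :n
 ORDER BY seq_nbr
"""

def fold(n):
    """Spread the numbers 1..n across three columns."""
    return conn.execute(FOLD, {"n": n}).fetchall()

print(fold(8))  # [(1, 2, 3), (4, 5, 6), (7, 8, None)]
print(fold(7))  # [(1, 2, 3), (4, 5, 6), (7, None, None)]
```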

Using math can be fancier. In a golf tournament, the players with the lowest and highest scores are matched together for the next round. Then the players with the second lowest and second highest scores are matched together, and so forth. If the number of players is odd, the player with the middle score sits out that round. These pairs can be built with a simple query.

SELECT seq_nbr AS low_score,
       CASE WHEN seq_nbr <= (:n - seq_nbr)
            THEN (:n - seq_nbr) + 1
            ELSE NULL END AS high_score
  FROM Sequence AS S1
 WHERE S1.seq_nbr
       <= CASE WHEN MOD(:n, 2) = 1
               THEN FLOOR(:n/2) + 1
               ELSE (:n/2) END;

If you play around with the basic math functions, you can do quite a bit.
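The golf pairing can be checked for an odd and an even field. This sketch assumes Python's sqlite3 module and a Sequence table of rank positions; it swaps in SQLite's % operator and integer division for MOD() and FLOOR(), which gives the same result for positive integer :n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO Sequence VALUES (?)",
                 [(n,) for n in range(1, 21)])

# Assumption: :n is bound as an integer, so :n / 2 truncates.
PAIRS = """
SELECT seq_nbr AS low_score,
       CASE WHEN seq_nbr <= (:n - seq_nbr)
            THEN (:n - seq_nbr) + 1
            ELSE NULL END AS high_score
  FROM Sequence AS S1
 WHERE S1.seq_nbr <= CASE WHEN :n % 2 = 1
                          THEN (:n / 2) + 1
                          ELSE (:n / 2) END
 ORDER BY seq_nbr
"""

def pairs(n):
    """Match rank k with rank n - k + 1; the middle rank sits out."""
    return conn.execute(PAIRS, {"n": n}).fetchall()

print(pairs(5))  # [(1, 5), (2, 4), (3, None)]
print(pairs(4))  # [(1, 4), (2, 3)]
```

With five players the third-ranked player is paired with NULL, which is the "sits out" case.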

24.10 Coverings

Mikito Harakiri proposed the problem of writing the shortest SQL query that would return a minimal cover of a set of intervals. For example, given this table, how do you find the contiguous numbers that are completely covered by the given intervals?

CREATE TABLE Intervals
(x INTEGER NOT NULL,
 y INTEGER NOT NULL,
 CHECK (x <= y),
 PRIMARY KEY (x, y));


INSERT INTO Intervals VALUES (1, 3);

The query should return:

Results

min_x  MAX(y)
=============
    1     12
   20     21
  120    132

Dieter Nöth found an answer with OLAP functions:

SELECT min_x, MAX(y)

FROM (SELECT x, y,

MAX(CASE WHEN x <= max_y THEN NULL ELSE x END)

OVER (ORDER BY x, y

ROWS UNBOUNDED PRECEDING) AS min_x

FROM (SELECT x, y,

MAX(y)

OVER(ORDER BY x, y

ROWS BETWEEN UNBOUNDED PRECEDING

AND 1 PRECEDING) AS max_y

FROM Intervals)

AS DT)

AS DT

GROUP BY min_x;
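The window-function answer can be traced on a small data set. This is a sketch under stated assumptions: it uses Python's sqlite3 module (window functions need SQLite 3.25 or later), the interval rows are hypothetical and were chosen only so that the cover matches the results shown earlier, and the two derived tables are given distinct aliases (DT1, DT2) for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Intervals
                (x INTEGER NOT NULL,
                 y INTEGER NOT NULL,
                 CHECK (x <= y),
                 PRIMARY KEY (x, y))""")
# Hypothetical sample rows, not the book's data.
conn.executemany("INSERT INTO Intervals VALUES (?, ?)",
                 [(1, 3), (2, 5), (4, 11), (10, 12),
                  (20, 21), (120, 130), (121, 132)])

# Inner query: highest y seen before this row.
# Middle query: a row starts a new cover when its x lies beyond that
# y; the running MAX carries the latest start forward as the group key.
cover = conn.execute("""
    SELECT min_x, MAX(y)
      FROM (SELECT x, y,
                   MAX(CASE WHEN x <= max_y THEN NULL ELSE x END)
                     OVER (ORDER BY x, y
                           ROWS UNBOUNDED PRECEDING) AS min_x
              FROM (SELECT x, y,
                           MAX(y)
                             OVER (ORDER BY x, y
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND 1 PRECEDING) AS max_y
                      FROM Intervals) AS DT1) AS DT2
     GROUP BY min_x
     ORDER BY min_x
""").fetchall()

print(cover)  # [(1, 12), (20, 21), (120, 132)]
```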


Here is a query that uses a self-join and a three-level nested correlated subquery to apply the same approach:

SELECT I1.x, MAX(I2.y) AS y
  FROM Intervals AS I1
       INNER JOIN
       Intervals AS I2
       ON I2.y > I1.x
 WHERE NOT EXISTS
       (SELECT *
          FROM Intervals AS I3
         WHERE I1.x - 1 BETWEEN I3.x AND I3.y)
   AND NOT EXISTS
       (SELECT *
          FROM Intervals AS I4
         WHERE I4.y > I1.x
           AND I4.y < I2.y
           AND NOT EXISTS
               (SELECT *
                  FROM Intervals AS I5
                 WHERE I4.y + 1 BETWEEN I5.x AND I5.y))
 GROUP BY I1.x;
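The subquery version needs no window functions, so it can be run on almost any engine. The sketch below assumes Python's sqlite3 module and hypothetical interval rows chosen so that the cover comes out as 1 to 12, 20 to 21 and 120 to 132. The first NOT EXISTS keeps only true starting points; the second rejects any end point that has an uncovered break before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Intervals
                (x INTEGER NOT NULL,
                 y INTEGER NOT NULL,
                 CHECK (x <= y),
                 PRIMARY KEY (x, y))""")
# Hypothetical sample rows, not the book's data.
conn.executemany("INSERT INTO Intervals VALUES (?, ?)",
                 [(1, 3), (2, 5), (4, 11), (10, 12),
                  (20, 21), (120, 130), (121, 132)])

cover = conn.execute("""
    SELECT I1.x, MAX(I2.y) AS y
      FROM Intervals AS I1
           INNER JOIN Intervals AS I2
           ON I2.y > I1.x
     WHERE NOT EXISTS          -- nothing covers the point before I1.x
           (SELECT * FROM Intervals AS I3
             WHERE I1.x - 1 BETWEEN I3.x AND I3.y)
       AND NOT EXISTS          -- no uncovered break between I1.x and I2.y
           (SELECT * FROM Intervals AS I4
             WHERE I4.y > I1.x
               AND I4.y < I2.y
               AND NOT EXISTS
                   (SELECT * FROM Intervals AS I5
                     WHERE I4.y + 1 BETWEEN I5.x AND I5.y))
     GROUP BY I1.x
     ORDER BY I1.x
""").fetchall()

print(cover)  # [(1, 12), (20, 21), (120, 132)]
```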

This is essentially the same format, but converted to use left anti-semi-joins instead of subqueries. I do not think it is shorter, but it might execute better on some platforms, and some people prefer this format to subqueries.

SELECT I1.x, MAX(I2.y) AS y FROM Intervals AS I1 INNER JOIN Intervals AS I2

ON I2.y > I1.x LEFT OUTER JOIN Intervals AS I3

ON I1.x - 1 BETWEEN I3.x AND I3.y LEFT OUTER JOIN

(Intervals AS I4 LEFT OUTER JOIN Intervals AS I5

ON I4.y + 1 BETWEEN I5.x AND I5.y)


ON I4.y > I1.x

AND I4.y < I2.y

AND I5.x IS NULL

WHERE I3.x IS NULL

AND I4.x IS NULL

GROUP BY I1.x;

If the table is large, the correlated subqueries (version 1) or the quintuple self-join (version 2) will probably make it slow. But we were asked for a short query, not for a quick one.

Tony Andrews came up with this answer:

SELECT Starts.x, Ends.y

FROM (SELECT x, ROW_NUMBER() OVER(ORDER BY x) AS rn

FROM (SELECT x, y,

LAG(y) OVER(ORDER BY x) AS prev_y

FROM Intervals)

WHERE prev_y IS NULL

OR prev_y < x) AS Starts,

(SELECT y, ROW_NUMBER() OVER(ORDER BY y) AS rn

FROM (SELECT x, y,

LEAD(x) OVER(ORDER BY y) AS next_x

FROM Intervals)

WHERE next_x IS NULL

OR y < next_x) AS Ends

WHERE Starts.rn = Ends.rn;

John Gilson decided that using recursion is an interesting take on this problem and made this offering:

WITH RECURSIVE Cover (x, y, n)

AS (SELECT x, y, (SELECT COUNT(*) FROM Intervals)

FROM Intervals

UNION ALL

SELECT CASE WHEN C.x <= I.x THEN C.x ELSE I.x END,

CASE WHEN C.y >= I.y THEN C.y ELSE I.y END,

C.n - 1

FROM Intervals AS I, Cover AS C

WHERE I.x <= C.y

AND I.y >= C.x

AND (I.x <> C.x OR I.y <> C.y)
