CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES
or:
SELECT MIN(F1.seq_nbr + 1)
  FROM (SELECT seq_nbr FROM List
        UNION ALL
        VALUES (0)) AS F1 (seq_nbr)
 WHERE (F1.seq_nbr + 1)
       NOT IN (SELECT seq_nbr FROM List);
Finding entire gaps follows from this pattern, and we get this short piece of code:
SELECT (s + 1) AS gap_start, (e - 1) AS gap_end
  FROM (SELECT L1.seq_nbr, MIN(L2.seq_nbr)
          FROM List AS L1, List AS L2
         WHERE L1.seq_nbr < L2.seq_nbr
         GROUP BY L1.seq_nbr) AS G(s, e)
 WHERE (e - 1) > s;
Without the derived table we get:
SELECT (L1.seq_nbr + 1) AS gap_start,
       (MIN(L2.seq_nbr) - 1) AS gap_end
  FROM List AS L1, List AS L2
 WHERE L1.seq_nbr < L2.seq_nbr
 GROUP BY L1.seq_nbr
HAVING (MIN(L2.seq_nbr) - L1.seq_nbr) > 1;
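The gap query can be tried out on a small list. The following is a minimal sketch that runs the version without the derived table from Python against an in-memory SQLite database. The SQLite harness, the function name find_gaps, the added ORDER BY, and the sample numbers are assumptions made for illustration only.

```python
import sqlite3


def find_gaps(numbers):
    """Return (gap_start, gap_end) pairs for the runs missing from numbers."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    con.executemany("INSERT INTO List VALUES (?)", [(n,) for n in numbers])
    rows = con.execute(
        """
        SELECT (L1.seq_nbr + 1) AS gap_start,
               (MIN(L2.seq_nbr) - 1) AS gap_end
          FROM List AS L1, List AS L2
         WHERE L1.seq_nbr < L2.seq_nbr
         GROUP BY L1.seq_nbr
        HAVING (MIN(L2.seq_nbr) - L1.seq_nbr) > 1
         ORDER BY gap_start  -- ordering added only to make the output stable
        """
    ).fetchall()
    con.close()
    return rows


# Hypothetical sample: 3-4 and 7-9 are missing.
print(find_gaps([1, 2, 5, 6, 10]))  # [(3, 4), (7, 9)]
```

Each row pairs a stored number with the next stored number above it, so only gaps between stored values are reported; nothing is said about numbers below the minimum or above the maximum.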
24.6 Summation of a Series
While this topic is a bit more mathematical than most SQL programmers actually have to use in their work, it does demonstrate the power of SQL and a little knowledge of some basic college math.
The summation of a series builds a running total of the values in a table and shows the cumulative total for each value in the series. Let’s create a table and some sample data.
CREATE TABLE Series
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 val INTEGER NOT NULL,
 answer INTEGER -- null means not computed yet
);
Series
seq_nbr val answer
======================
1 6 6
2 41 47
3 12 59
4 51 110
5 21 131
6 70 201
7 79 280
8 62 342
This simple summation is not a problem.
UPDATE Series
   SET answer = (SELECT SUM(S1.val)
                   FROM Series AS S1
                  WHERE S1.seq_nbr <= Series.seq_nbr)
 WHERE answer IS NULL;
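To see the running total reproduce the answer column shown in the sample table, here is a small sketch that loads the eight sample values and runs the UPDATE from Python with an in-memory SQLite database. The harness and the variable names vals and answers are assumptions for illustration; the correlation name S1 is used consistently inside the subquery.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE Series
    (seq_nbr INTEGER NOT NULL PRIMARY KEY,
     val INTEGER NOT NULL,
     answer INTEGER)  -- null means not computed yet
    """
)

# The val column from the sample data, numbered from 1.
vals = [6, 41, 12, 51, 21, 70, 79, 62]
con.executemany(
    "INSERT INTO Series (seq_nbr, val) VALUES (?, ?)",
    list(enumerate(vals, start=1)),
)

# Each row gets the sum of every val at or before its own position.
con.execute(
    """
    UPDATE Series
       SET answer = (SELECT SUM(S1.val)
                       FROM Series AS S1
                      WHERE S1.seq_nbr <= Series.seq_nbr)
     WHERE answer IS NULL
    """
)

answers = [row[0] for row in
           con.execute("SELECT answer FROM Series ORDER BY seq_nbr")]
print(answers)  # [6, 47, 59, 110, 131, 201, 280, 342]
```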
This is the form we can use for most problems of this type with only one level of summation. But things can be worse. This problem came from Francisco Moreno, and on the surface it sounds easy. First, create the usual table and populate it.
DROP TABLE Series;
CREATE TABLE Series
(seq_nbr INTEGER NOT NULL,
val REAL NOT NULL,
answer REAL);
INSERT INTO Series
VALUES (0, 6.0, NULL),
(1, 6.0, NULL),
(2, 10.0, NULL),
(3, 12.0, NULL),
(4, 14.0, NULL);
The goal is to compute the average of the first two terms, then add the third value to the result and average the two of them, and so forth. In this data, we would have:
seq_nbr   val    answer
=======================
0          6.0   NULL
1          6.0    6.0
2         10.0    8.0
3         12.0   10.0
4         14.0   12.0
The first thing we need to do is get rid of the row where (seq_nbr = 0) by adding its value to the row where (seq_nbr = 1), since 6.0 + 6.0 = 12.0, and change the table to read:
seq_nbr   val    answer
=======================
1         12.0   NULL
2         10.0   NULL
3         12.0   NULL
4         14.0   NULL
The obvious approach is to do the calculations directly.
UPDATE Series
   SET answer
       = (Series.val
          + (SELECT S1.answer
               FROM Series AS S1
              WHERE S1.seq_nbr = Series.seq_nbr - 1)) / 2.0
 WHERE answer IS NULL;
But there is a problem with this approach. It will only calculate one value at a time. The reason is that this series is much more complex than a simple running total.
What we have is actually a double summation, in which the terms are defined by a continued fraction. Let’s work out the first four answers by brute force and see if we can find a pattern.
answer1 = (12)/2 = 6
answer2 = ((12)/2 + 10)/2 = 8
answer3 = (((12)/2 + 10)/2 + 12)/2 = 10
answer4 = ((((12)/2 + 10)/2 + 12)/2 + 14)/2 = 12
The real trick is to do some algebra and get rid of the nested parentheses:
answer1 = (12)/2 = 6
answer2 = (12/4) + (10/2) = 8
answer3 = (12/8) + (10/4) + (12/2) = 10
answer4 = (12/16) + (10/8) + (12/4) + (14/2) = 12
When we see powers of 2, we know we can replace them with a formula:
answer1 = (12/(2^1)) = 6
answer2 = (12/(2^2)) + (10/(2^1)) = 8
answer3 = (12/(2^3)) + (10/(2^2)) + (12/(2^1)) = 10
answer4 = (12/(2^4)) + (10/(2^3)) + (12/(2^2)) + (14/(2^1)) = 12
The problem is that you need to “count backwards” from the current value to compute higher powers for the previous terms of the summation. The exponent is simply (current seq_nbr - previous seq_nbr + 1). Putting it all together, we get this expression:
UPDATE Series
   SET answer
       = (SELECT SUM(S1.val
                     / POWER(2,
                             CASE WHEN S1.seq_nbr > 0
                                  THEN S2.seq_nbr - S1.seq_nbr + 1
                                  ELSE NULL END))
            FROM Series AS S1, Series AS S2
           WHERE S1.seq_nbr <= S2.seq_nbr
             AND S2.seq_nbr = Series.seq_nbr);
This assumes that we have a POWER(base, exponent) function in our implementation. The reason for the second copy of Series under the name S2 in the SUM() expression is that an aggregate function cannot have an outer reference.
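It is worth checking that the closed form agrees with the step-by-step averaging. The sketch below is one way to do that from Python with an in-memory SQLite database. It divides each val by the matching power of 2, as in the algebra worked out above. Registering POWER through create_function, using S2 as the inner copy of the outer row, and the variable names closed_form and step_by_step are assumptions for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Not every SQLite build ships the math functions, so POWER is supplied here.
con.create_function("POWER", 2, lambda base, exponent: float(base) ** exponent)

con.execute(
    """
    CREATE TABLE Series
    (seq_nbr INTEGER NOT NULL PRIMARY KEY,
     val REAL NOT NULL,
     answer REAL)
    """
)
vals = [12.0, 10.0, 12.0, 14.0]  # the table after row 0 has been folded in
con.executemany(
    "INSERT INTO Series (seq_nbr, val) VALUES (?, ?)",
    list(enumerate(vals, start=1)),
)

# Closed form: term k of answer n is val_k / 2^(n - k + 1).
con.execute(
    """
    UPDATE Series
       SET answer
           = (SELECT SUM(S1.val / POWER(2, S2.seq_nbr - S1.seq_nbr + 1))
                FROM Series AS S1, Series AS S2
               WHERE S1.seq_nbr <= S2.seq_nbr
                 AND S2.seq_nbr = Series.seq_nbr)
    """
)
closed_form = [row[0] for row in
               con.execute("SELECT answer FROM Series ORDER BY seq_nbr")]

# Brute force: average the previous answer with the next value.
step_by_step = []
running = 0.0
for val in vals:
    running = (running + val) / 2.0
    step_by_step.append(running)

print(closed_form)   # [6.0, 8.0, 10.0, 12.0]
print(step_by_step)  # [6.0, 8.0, 10.0, 12.0]
```

Unlike the one-row-at-a-time UPDATE, the closed form fills every answer in a single statement because it never reads the answer column.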
24.7 Swapping and Sliding Values in a List
You will often want to manipulate a list of values, changing their sequence position numbers. The simplest such operation is to swap two values in your table.
CREATE PROCEDURE SwapValues
(IN low_seq_nbr INTEGER, IN high_seq_nbr INTEGER)
LANGUAGE SQL
BEGIN
-- put them in order; the temporary keeps the original first argument
DECLARE temp_seq_nbr INTEGER;
SET temp_seq_nbr = low_seq_nbr;
SET low_seq_nbr
    = CASE WHEN temp_seq_nbr <= high_seq_nbr
           THEN temp_seq_nbr ELSE high_seq_nbr END;
SET high_seq_nbr
    = CASE WHEN temp_seq_nbr <= high_seq_nbr
           THEN high_seq_nbr ELSE temp_seq_nbr END;
-- swap
UPDATE Runs
   SET seq_nbr = low_seq_nbr + ABS(seq_nbr - high_seq_nbr)
 WHERE seq_nbr IN (low_seq_nbr, high_seq_nbr);
END;
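The swap arithmetic is easy to check on a short list. This is a rough sketch that runs only the UPDATE from Python against an in-memory SQLite database, with the two positions put in order by Python instead of by the procedure. The function name swap and the sample values are hypothetical. The demo table has no key on seq_nbr, because SQLite checks a uniqueness constraint as each row is updated rather than at the end of the statement.

```python
import sqlite3


def swap(values, first_seq_nbr, second_seq_nbr):
    """Swap the rows at two positions and return the vals in seq_nbr order."""
    low, high = sorted((first_seq_nbr, second_seq_nbr))
    con = sqlite3.connect(":memory:")
    # No key on seq_nbr in this demo table (see the note above the code).
    con.execute(
        "CREATE TABLE Runs (seq_nbr INTEGER NOT NULL, val INTEGER NOT NULL)"
    )
    con.executemany("INSERT INTO Runs VALUES (?, ?)",
                    list(enumerate(values, start=1)))
    # low maps to low + (high - low) = high; high maps to low + 0 = low.
    con.execute(
        """
        UPDATE Runs
           SET seq_nbr = :low + ABS(seq_nbr - :high)
         WHERE seq_nbr IN (:low, :high)
        """,
        {"low": low, "high": high},
    )
    result = [row[0] for row in
              con.execute("SELECT val FROM Runs ORDER BY seq_nbr")]
    con.close()
    return result


print(swap([10, 20, 30, 40], 4, 2))  # [10, 40, 30, 20]
```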
Inserting a new value into the table is easy:
CREATE PROCEDURE InsertValue (IN new_value INTEGER)
LANGUAGE SQL
INSERT INTO Runs (seq_nbr, val)
VALUES ((SELECT MAX(seq_nbr) FROM Runs) + 1, new_value);
A bit trickier procedure is to move one value to a new position and slide the remaining values either up or down. This mimics the way a physical queue would act. Here is a solution from Dave Portas:
CREATE PROCEDURE SlideValues
(IN old_seq_nbr INTEGER, IN new_seq_nbr INTEGER)
LANGUAGE SQL
UPDATE Runs
SET seq_nbr
= CASE
WHEN seq_nbr = old_seq_nbr THEN new_seq_nbr
WHEN seq_nbr BETWEEN old_seq_nbr AND new_seq_nbr THEN seq_nbr - 1
WHEN seq_nbr BETWEEN new_seq_nbr AND old_seq_nbr THEN seq_nbr + 1
ELSE seq_nbr END
WHERE seq_nbr BETWEEN old_seq_nbr AND new_seq_nbr
OR seq_nbr BETWEEN new_seq_nbr AND old_seq_nbr;
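A short experiment shows the slide working in both directions. The sketch below runs the body of the procedure from Python against an in-memory SQLite database, with named parameters standing in for the procedure arguments. The function name slide and the sample values are hypothetical, and the demo table deliberately has no key on seq_nbr, because SQLite checks uniqueness row by row during an UPDATE and could reject the intermediate states.

```python
import sqlite3


def slide(values, old_seq_nbr, new_seq_nbr):
    """Move one row to a new position and return the vals in seq_nbr order."""
    con = sqlite3.connect(":memory:")
    # No key on seq_nbr in this demo table (see the note above the code).
    con.execute(
        "CREATE TABLE Runs (seq_nbr INTEGER NOT NULL, val INTEGER NOT NULL)"
    )
    con.executemany("INSERT INTO Runs VALUES (?, ?)",
                    list(enumerate(values, start=1)))
    # Only one of the two BETWEEN ranges is non-empty for a given call,
    # so rows shift down when moving up the list and up when moving down.
    con.execute(
        """
        UPDATE Runs
           SET seq_nbr
               = CASE
                 WHEN seq_nbr = :old THEN :new
                 WHEN seq_nbr BETWEEN :old AND :new THEN seq_nbr - 1
                 WHEN seq_nbr BETWEEN :new AND :old THEN seq_nbr + 1
                 ELSE seq_nbr END
         WHERE seq_nbr BETWEEN :old AND :new
            OR seq_nbr BETWEEN :new AND :old
        """,
        {"old": old_seq_nbr, "new": new_seq_nbr},
    )
    result = [row[0] for row in
              con.execute("SELECT val FROM Runs ORDER BY seq_nbr")]
    con.close()
    return result


print(slide([10, 20, 30, 40, 50], 2, 4))  # [10, 30, 40, 20, 50]
print(slide([10, 20, 30, 40, 50], 4, 2))  # [10, 40, 20, 30, 50]
```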
This handles moving a value to a higher or to a lower position in the table. You can see how calls or slight changes to these procedures could do other related operations.
One of the most useful tricks is to have a calendar table with a Julianized date column. Instead of trying to manipulate temporal data, convert the dates to a sequence of integers and treat the queries as regions, runs, gaps, and so forth.
The sequence can be made up of calendar days or Julianized business days, which do not include holidays and weekends. There are a lot of possible methods.
24.8 Condensing a List of Numbers
The goal is to take a list of numbers and condense them into contiguous ranges. Show the high and low values for each range; if the range has one number, then the high and low values will be the same. This answer is due to Steve Kass:
SELECT MIN(i) AS low, MAX(i) AS high
FROM (SELECT N1.i, COUNT(N2.i) - N1.i
FROM Numbers AS N1, Numbers AS N2
WHERE N2.i <= N1.i
GROUP BY N1.i)
AS N(i, gp)
GROUP BY gp;
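The trick in this query is that, within a contiguous run, the count of numbers at or below i grows in step with i, so their difference stays constant and can serve as a grouping key. Here is a minimal sketch that runs the query from Python against an in-memory SQLite database. The function name condense and the sample list are hypothetical, the ORDER BY is added for stable output, and the derived-table columns are named inside the subquery because SQLite does not accept the AS N(i, gp) column list.

```python
import sqlite3


def condense(numbers):
    """Return (low, high) pairs for each contiguous run in numbers."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Numbers (i INTEGER NOT NULL PRIMARY KEY)")
    con.executemany("INSERT INTO Numbers VALUES (?)", [(n,) for n in numbers])
    rows = con.execute(
        """
        SELECT MIN(i) AS low, MAX(i) AS high
          FROM (SELECT N1.i AS i,
                       COUNT(N2.i) - N1.i AS gp  -- rank minus value
                  FROM Numbers AS N1, Numbers AS N2
                 WHERE N2.i <= N1.i
                 GROUP BY N1.i) AS N
         GROUP BY gp
         ORDER BY low
        """
    ).fetchall()
    con.close()
    return rows


print(condense([1, 2, 3, 5, 7, 8, 9, 10, 15]))
# [(1, 3), (5, 5), (7, 10), (15, 15)]
```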
24.9 Folding a List of Numbers
It is possible to use the Sequence table to fill related columns in the same row with values computed by a little math instead of by self-joins.
For example, given the numbers 1 to (n), you might want to spread them out across (k) columns. Let (k = 3) so we can see the pattern.
SELECT seq_nbr,
CASE WHEN MOD((seq_nbr + 1), 3) = 2
AND seq_nbr + 1 <= :n
THEN (seq_nbr + 1)
ELSE NULL END AS second,
CASE WHEN MOD((seq_nbr + 2), 3) = 0
AND (seq_nbr + 2) <= :n
THEN (seq_nbr + 2)
ELSE NULL END AS third
FROM Sequence
WHERE MOD((seq_nbr + 3), 3) = 1
AND seq_nbr <= :n;
Columns that have no value assigned to them will get a NULL. That is, for (n = 8) the incomplete row will be (7, 8, NULL), and for (n = 7) it would be (7, NULL, NULL). We never get a row with (NULL, NULL, NULL).
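The two incomplete rows described above can be reproduced directly. This is a small sketch that runs the folding query from Python against an in-memory SQLite database. The function name fold_into_three is hypothetical, the Sequence table is filled with 1 to n for the demo, and the % operator stands in for MOD(), which is not available in every SQLite build.

```python
import sqlite3


def fold_into_three(n):
    """Spread 1..n across rows of three columns; empty cells come back as None."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Sequence (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    con.executemany("INSERT INTO Sequence VALUES (?)",
                    [(i,) for i in range(1, n + 1)])
    rows = con.execute(
        """
        SELECT seq_nbr,
               CASE WHEN (seq_nbr + 1) % 3 = 2
                         AND (seq_nbr + 1) <= :n
                    THEN (seq_nbr + 1)
                    ELSE NULL END AS second,
               CASE WHEN (seq_nbr + 2) % 3 = 0
                         AND (seq_nbr + 2) <= :n
                    THEN (seq_nbr + 2)
                    ELSE NULL END AS third
          FROM Sequence
         WHERE (seq_nbr + 3) % 3 = 1  -- rows start at 1, 4, 7, ...
           AND seq_nbr <= :n
         ORDER BY seq_nbr
        """,
        {"n": n},
    ).fetchall()
    con.close()
    return rows


print(fold_into_three(8))  # [(1, 2, 3), (4, 5, 6), (7, 8, None)]
print(fold_into_three(7))  # [(1, 2, 3), (4, 5, 6), (7, None, None)]
```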
Using math can be fancier. In a golf tournament, the players with the lowest and highest scores are matched together for the next round. Then the players with the second lowest and second highest scores are matched together, and so forth. If the number of players is odd, the player with the middle score sits out that round. These pairs can be built with a simple query:
SELECT seq_nbr AS low_score,
       CASE WHEN seq_nbr <= (:n - seq_nbr)
            THEN (:n - seq_nbr) + 1
            ELSE NULL END AS high_score
  FROM Sequence AS S1
 WHERE S1.seq_nbr
       <= CASE WHEN MOD(:n, 2) = 1
               THEN FLOOR(:n/2) + 1
               ELSE (:n/2) END;
If you play around with the basic math functions, you can do quite a bit.
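As one example of playing with the math, the pairing query can be run for an odd and an even number of players. The sketch below does this from Python against an in-memory SQLite database. The function name pair_players is hypothetical, and the cutoff is written as (:n + 1) / 2 with integer division, which is an assumed substitute for the MOD/FLOOR expression because those functions are not available in every SQLite build; it gives the same limit for both odd and even n.

```python
import sqlite3


def pair_players(n):
    """Pair rank k with rank n - k + 1; the middle rank of an odd field gets None."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Sequence (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    con.executemany("INSERT INTO Sequence VALUES (?)",
                    [(i,) for i in range(1, n + 1)])
    rows = con.execute(
        """
        SELECT seq_nbr AS low_score,
               CASE WHEN seq_nbr <= (:n - seq_nbr)
                    THEN (:n - seq_nbr) + 1
                    ELSE NULL END AS high_score
          FROM Sequence AS S1
         WHERE S1.seq_nbr <= (:n + 1) / 2  -- integer division in SQLite
         ORDER BY seq_nbr
        """,
        {"n": n},
    ).fetchall()
    con.close()
    return rows


print(pair_players(5))  # [(1, 5), (2, 4), (3, None)]
print(pair_players(4))  # [(1, 4), (2, 3)]
```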
24.10 Coverings
Mikito Harakiri proposed the problem of writing the shortest SQL query that would return a minimal cover of a set of intervals. For example, given this table, how do you find the contiguous numbers that are completely covered by the given intervals?
CREATE TABLE Intervals
(x INTEGER NOT NULL,
 y INTEGER NOT NULL,
 CHECK (x <= y),
 PRIMARY KEY (x, y));
INSERT INTO Intervals VALUES (1, 3);
INSERT INTO Intervals VALUES (2, 5);
INSERT INTO Intervals VALUES (4, 11);
INSERT INTO Intervals VALUES (10, 12);
INSERT INTO Intervals VALUES (20, 21);
INSERT INTO Intervals VALUES (120, 130);
INSERT INTO Intervals VALUES (120, 128);
INSERT INTO Intervals VALUES (120, 122);
INSERT INTO Intervals VALUES (121, 132);
INSERT INTO Intervals VALUES (121, 122);
INSERT INTO Intervals VALUES (121, 124);
INSERT INTO Intervals VALUES (121, 123);
INSERT INTO Intervals VALUES (126, 127);
The query should return:
Results
min_x MAX(y)
================
1 12
20 21
120 132
Dieter Nöth found an answer with OLAP functions:
SELECT min_x, MAX(y)
FROM (SELECT x, y,
MAX(CASE WHEN x <= max_y THEN NULL ELSE x END)
OVER (ORDER BY x, y
ROWS UNBOUNDED PRECEDING) AS min_x
FROM (SELECT x, y,
MAX(y)
OVER(ORDER BY x, y
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING) AS max_y
FROM Intervals)
AS DT)
AS DT
GROUP BY min_x;
Here is a query that uses a self-join and a three-level nested correlated subquery, following the same approach:
SELECT I1.x, MAX(I2.y) AS y
  FROM Intervals AS I1
       INNER JOIN Intervals AS I2
       ON I2.y > I1.x
 WHERE NOT EXISTS
       (SELECT *
          FROM Intervals AS I3
         WHERE I1.x - 1 BETWEEN I3.x AND I3.y)
   AND NOT EXISTS
       (SELECT *
          FROM Intervals AS I4
         WHERE I4.y > I1.x
           AND I4.y < I2.y
           AND NOT EXISTS
               (SELECT *
                  FROM Intervals AS I5
                 WHERE I4.y + 1 BETWEEN I5.x AND I5.y))
 GROUP BY I1.x;
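Because this version needs only joins and subqueries, it is easy to run against the sample data. The following sketch does so from Python with an in-memory SQLite database. The function name minimal_cover and the constant INTERVALS are hypothetical, and the ORDER BY is added only to make the output stable.

```python
import sqlite3

# The thirteen sample rows from the Intervals table.
INTERVALS = [(1, 3), (2, 5), (4, 11), (10, 12), (20, 21),
             (120, 130), (120, 128), (120, 122), (121, 132),
             (121, 122), (121, 124), (121, 123), (126, 127)]


def minimal_cover(intervals):
    """Return the (x, y) pairs of the minimal cover of the given intervals."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """
        CREATE TABLE Intervals
        (x INTEGER NOT NULL,
         y INTEGER NOT NULL,
         CHECK (x <= y),
         PRIMARY KEY (x, y))
        """
    )
    con.executemany("INSERT INTO Intervals VALUES (?, ?)", intervals)
    rows = con.execute(
        """
        SELECT I1.x, MAX(I2.y) AS y
          FROM Intervals AS I1
               INNER JOIN Intervals AS I2
               ON I2.y > I1.x
         WHERE NOT EXISTS            -- I1.x starts a cover
               (SELECT * FROM Intervals AS I3
                 WHERE I1.x - 1 BETWEEN I3.x AND I3.y)
           AND NOT EXISTS            -- no cover ends before I2.y
               (SELECT * FROM Intervals AS I4
                 WHERE I4.y > I1.x
                   AND I4.y < I2.y
                   AND NOT EXISTS
                       (SELECT * FROM Intervals AS I5
                         WHERE I4.y + 1 BETWEEN I5.x AND I5.y))
         GROUP BY I1.x
         ORDER BY I1.x
        """
    ).fetchall()
    con.close()
    return rows


print(minimal_cover(INTERVALS))  # [(1, 12), (20, 21), (120, 132)]
```

Note that the tests on (x - 1) and (y + 1) treat the values as integers, so two intervals that merely touch, such as (1, 3) and (4, 5), are merged into one cover by this query.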
This is essentially the same format, but converted to use left anti-semi-joins instead of subqueries. I do not think it is shorter, but it might execute better on some platforms, and some people prefer this format to subqueries.
SELECT I1.x, MAX(I2.y) AS y
  FROM Intervals AS I1
       INNER JOIN Intervals AS I2
       ON I2.y > I1.x
       LEFT OUTER JOIN Intervals AS I3
       ON I1.x - 1 BETWEEN I3.x AND I3.y
       LEFT OUTER JOIN
       (Intervals AS I4
        LEFT OUTER JOIN Intervals AS I5
        ON I4.y + 1 BETWEEN I5.x AND I5.y)
       ON I4.y > I1.x
          AND I4.y < I2.y
          AND I5.x IS NULL
 WHERE I3.x IS NULL
   AND I4.x IS NULL
 GROUP BY I1.x;
If the table is large, the correlated subqueries (version 1) or the quintuple self-join (version 2) will probably make it slow. But we were asked for a short query, not for a quick one.
Tony Andrews came up with this answer:
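SELECT Starts.x, Ends.y
  FROM (SELECT x, ROW_NUMBER() OVER(ORDER BY x) AS rn
          FROM (SELECT x, y,
                       -- highest y seen so far, so nested intervals are handled
                       MAX(y) OVER(ORDER BY x, y
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND 1 PRECEDING) AS prev_y
                  FROM Intervals) AS P
         WHERE prev_y IS NULL
            OR prev_y < x) AS Starts,
       (SELECT y, ROW_NUMBER() OVER(ORDER BY y) AS rn
          FROM (SELECT x, y,
                       -- lowest x still to come
                       MIN(x) OVER(ORDER BY y, x
                                   ROWS BETWEEN 1 FOLLOWING
                                            AND UNBOUNDED FOLLOWING) AS next_x
                  FROM Intervals) AS N
         WHERE next_x IS NULL
            OR y < next_x) AS Ends
 WHERE Starts.rn = Ends.rn;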
John Gilson decided that using recursion is an interesting take on this problem and made this offering:
WITH RECURSIVE Cover (x, y, n)
AS (SELECT x, y, (SELECT COUNT(*) FROM Intervals)
FROM Intervals
UNION ALL
SELECT CASE WHEN C.x <= I.x THEN C.x ELSE I.x END,
CASE WHEN C.y >= I.y THEN C.y ELSE I.y END,
C.n - 1
FROM Intervals AS I, Cover AS C
WHERE I.x <= C.y
AND I.y >= C.x
AND (I.x <> C.x OR I.y <> C.y)