Trang 1552 CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES
(1004, 'N'), (1003, 'Y'), (1002, 'Y'), (1001, 'Y'), (1000, 'N');
The results we want assign a grouping number to each run of on-time/late payments, thus:
Results
grping  payment_nbr  paid_on_time
=================================
1       1006         'Y'
1       1005         'Y'
2       1004         'N'
3       1003         'Y'
3       1002         'Y'
3       1001         'Y'
4       1000         'N'
A solution by Hugo Kornelis depends on the payments always being numbered consecutively:
SELECT (SELECT COUNT(*)
          FROM PaymentHistory AS H2, PaymentHistory AS H3
         WHERE H3.payment_nbr = H2.payment_nbr + 1
           AND H3.paid_on_time <> H2.paid_on_time
           AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,
       payment_nbr, paid_on_time
  FROM PaymentHistory AS H1;
This can be modified for more types of behavior.
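As a quick check, the grouping query can be run against the sample rows. The sketch below uses Python's sqlite3 module with an in-memory database. The column types in the CREATE TABLE are an assumption, since only the data values appear in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the text shows only the data values.
conn.execute("""CREATE TABLE PaymentHistory
                (payment_nbr INTEGER NOT NULL PRIMARY KEY,
                 paid_on_time CHAR(1) NOT NULL)""")
conn.executemany("INSERT INTO PaymentHistory VALUES (?, ?)",
                 [(1006, 'Y'), (1005, 'Y'), (1004, 'N'), (1003, 'Y'),
                  (1002, 'Y'), (1001, 'Y'), (1000, 'N')])

# For each payment, count the status changes at or after it, then add one.
rows = conn.execute("""
    SELECT (SELECT COUNT(*)
              FROM PaymentHistory AS H2, PaymentHistory AS H3
             WHERE H3.payment_nbr = H2.payment_nbr + 1
               AND H3.paid_on_time <> H2.paid_on_time
               AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,
           payment_nbr, paid_on_time
      FROM PaymentHistory AS H1
     ORDER BY payment_nbr DESC""").fetchall()

for row in rows:
    print(row)
```

The grouping numbers count the status changes between each payment and the most recent one, which is why the numbering starts at the newest payment.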
24.3 Finding Regions of Maximum Size
A query to find a region, rather than a subregion of a known size, of seats
1993)
SELECT T1.seat_nbr, ' thru ', T2.seat_nbr
  FROM Theater AS T1, Theater AS T2
 WHERE T1.seat_nbr < T2.seat_nbr
   AND NOT EXISTS
       (SELECT *
          FROM Theater AS T3
         WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr
                AND T3.occupancy_status <> 'A')
            OR (T3.seat_nbr = T2.seat_nbr + 1
                AND T3.occupancy_status = 'A')
            OR (T3.seat_nbr = T1.seat_nbr - 1
                AND T3.occupancy_status = 'A'));
The trick here is to look for the starting and ending seats in the region. The starting seat_nbr of a region is to the right of a sold seat_nbr, and the ending seat_nbr is to the left of a sold seat_nbr. No seat_nbr between the start and the end has been sold.
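To see the three conditions at work, here is a minimal sketch that runs the region query with Python's sqlite3 module. The Theater layout, the eight sample seats, and the status code 'S' for a sold seat are assumptions made for illustration; only 'A' (available) appears in the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: 'A' = available, 'S' = sold (the 'S' code is assumed).
conn.execute("""CREATE TABLE Theater
                (seat_nbr INTEGER NOT NULL PRIMARY KEY,
                 occupancy_status CHAR(1) NOT NULL)""")
conn.executemany("INSERT INTO Theater VALUES (?, ?)",
                 [(1, 'A'), (2, 'A'), (3, 'S'), (4, 'A'),
                  (5, 'A'), (6, 'A'), (7, 'S'), (8, 'A')])

# A pair (T1, T2) is a region when no seat inside it is sold and
# neither neighbor just outside it is available.
regions = conn.execute("""
    SELECT T1.seat_nbr, T2.seat_nbr
      FROM Theater AS T1, Theater AS T2
     WHERE T1.seat_nbr < T2.seat_nbr
       AND NOT EXISTS
           (SELECT *
              FROM Theater AS T3
             WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr
                    AND T3.occupancy_status <> 'A')
                OR (T3.seat_nbr = T2.seat_nbr + 1
                    AND T3.occupancy_status = 'A')
                OR (T3.seat_nbr = T1.seat_nbr - 1
                    AND T3.occupancy_status = 'A'))
     ORDER BY T1.seat_nbr""").fetchall()

print(regions)
```

With this sample data the query reports seats 1 thru 2 and 4 thru 6. Seat 8 is available but stands alone, and the condition T1.seat_nbr < T2.seat_nbr leaves single-seat regions out of the result.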
If you only keep the available seat_nbrs in a table, the solution is a bit easier. It is also a more general problem that applies to any table of sequential, possibly noncontiguous, data:
CREATE TABLE AvailableSeating
(seat_nbr INTEGER NOT NULL
CONSTRAINT valid_seat_nbr
CHECK (seat_nbr BETWEEN 001 AND 999));
INSERT INTO AvailableSeating
VALUES (199), (200), (201), (202), (204),
(210), (211), (212), (214), (218);
You need to create a result that will show the start and finish values of each sequence in the table, thus:
Results
start finish
============
199 202
204 204
210 212
214 214
218 218
This is a common way of finding the missing values in a sequence of tickets sold, unaccounted-for invoices, and so forth. Imagine a number line with closed dots for the numbers that are in the table and open dots for the numbers that are not. What do you see about a sequence? Well, we can start with a fact that anyone who has done inventory knows: the number of elements in a sequence is equal to the ending sequence number minus the starting sequence number plus one. This is a basic property of ordinal numbers:
(finish - start + 1) = Length of open seats
We therefore need two copies of the table, one for the starting value and one for the ending value of each sequence. Once we have those two items, we can compute the length with our formula and see if it is equal to the count of the items between the start and finish.
SELECT S1.seat_nbr, MAX(S2.seat_nbr) -- start and rightmost item
  FROM AvailableSeating AS S1
       INNER JOIN
       AvailableSeating AS S2 -- self-join
       ON S1.seat_nbr <= S2.seat_nbr
          AND (S2.seat_nbr - S1.seat_nbr + 1) -- formula for length
              = (SELECT COUNT(*) -- items in the sequence
                   FROM AvailableSeating AS S3
                  WHERE S3.seat_nbr BETWEEN S1.seat_nbr AND S2.seat_nbr)
          AND NOT EXISTS
              (SELECT *
                 FROM AvailableSeating AS S4
                WHERE S1.seat_nbr - 1 = S4.seat_nbr)
 GROUP BY S1.seat_nbr;
Finally, we need to be sure that we have the furthest item to the right as the end item. Each sequence of (n) items has (n) subsequences that all start with the same item, so we group on the starting item and take the MAX() of the ending items.
However, there is a faster version with three tables. This solution is based on another property of the longest possible sequences. If you look to the right of the last item, you do not find anything. Likewise, if you look to the left of the first item, you do not find anything either. These missing items that are “just over the border” define a sequence by framing it. There also cannot be any “gaps” (missing items) inside those borders. That translates into SQL as:
SELECT S1.seat_nbr, MIN(S2.seat_nbr) -- start and leftmost border
  FROM AvailableSeating AS S1, AvailableSeating AS S2
 WHERE S1.seat_nbr <= S2.seat_nbr
   AND NOT EXISTS -- border items of the sequence
       (SELECT *
          FROM AvailableSeating AS S3
         WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr
           AND (S3.seat_nbr = S1.seat_nbr - 1
                OR S3.seat_nbr = S2.seat_nbr + 1))
 GROUP BY S1.seat_nbr;
We do not have to worry about getting the rightmost item in the sequence, but we do have to worry about getting the leftmost border.
Since the second approach uses only three copies of the original table, it should be a bit faster. Also, EXISTS() predicates can often take advantage of indexing and thus run faster than subquery expressions, which require a table scan.
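Both formulations can be checked against the sample seat numbers. The following sketch, using Python's sqlite3 module, loads the ten values into AvailableSeating and runs the four-copy and three-copy queries side by side. The ORDER BY clauses are added only to make the output order predictable, and the sketch says nothing about relative speed on a real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE AvailableSeating
                (seat_nbr INTEGER NOT NULL
                 CHECK (seat_nbr BETWEEN 1 AND 999))""")
conn.executemany("INSERT INTO AvailableSeating VALUES (?)",
                 [(n,) for n in (199, 200, 201, 202, 204,
                                 210, 211, 212, 214, 218)])

# Length formula: (finish - start + 1) must equal the row count in between.
by_length = conn.execute("""
    SELECT S1.seat_nbr, MAX(S2.seat_nbr)
      FROM AvailableSeating AS S1
           INNER JOIN AvailableSeating AS S2
           ON S1.seat_nbr <= S2.seat_nbr
              AND (S2.seat_nbr - S1.seat_nbr + 1)
                  = (SELECT COUNT(*)
                       FROM AvailableSeating AS S3
                      WHERE S3.seat_nbr BETWEEN S1.seat_nbr
                                            AND S2.seat_nbr)
              AND NOT EXISTS
                  (SELECT * FROM AvailableSeating AS S4
                    WHERE S1.seat_nbr - 1 = S4.seat_nbr)
     GROUP BY S1.seat_nbr
     ORDER BY S1.seat_nbr""").fetchall()

# Border test: nothing may sit just outside either end of the pair.
by_border = conn.execute("""
    SELECT S1.seat_nbr, MIN(S2.seat_nbr)
      FROM AvailableSeating AS S1, AvailableSeating AS S2
     WHERE S1.seat_nbr <= S2.seat_nbr
       AND NOT EXISTS
           (SELECT * FROM AvailableSeating AS S3
             WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr
               AND (S3.seat_nbr = S1.seat_nbr - 1
                    OR S3.seat_nbr = S2.seat_nbr + 1))
     GROUP BY S1.seat_nbr
     ORDER BY S1.seat_nbr""").fetchall()

print(by_length)
print(by_border)
```

Both queries return the same five start and finish pairs for this data.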
Michel Walsh came up with two novel ways of getting the range of seat numbers that have been used in the table. He saw that the difference between the value and its rank is a constant for all values in the same consecutive sequence, so we just have to group, and count, on the value minus its rank to get the various consecutive runs (or just keep the maximum). It is so simple, an example will show everything:
data = {1, 2, 5, 6, 7, 8, 9, 11, 12, 22}
data  rank  (data - rank) AS absent
===================================
 1      1     0
 2      2     0
 5      3     2
 6      4     2
 7      5     2
 8      6     2
 9      7     2
11      8     3
12      9     3
22     10    12

absent  COUNT(*)
================
 0      2
 2      5
 3      2
12      1
As you can see, the maximum contiguous sequence is 5 (for the rows with values 5 through 9). The rank is computed as the number of values less than or equal to the actual value, with the assumption of a set of integers without repeated values. This is the query:
SELECT X.absent, COUNT(*)
  FROM (SELECT my_data,
               (SELECT COUNT(*)
                  FROM Foobar AS F2
                 WHERE F2.my_data <= F1.my_data),
               F1.my_data
               - (SELECT COUNT(*)
                    FROM Foobar AS F2
                   WHERE F2.my_data <= F1.my_data)
          FROM Foobar AS F1)
       AS X(my_data, rank, absent)
 GROUP BY X.absent;
Playing with this basic idea, Mr. Walsh came up with this second query:
SELECT MIN(Z.seat_nbr), MAX(Z.seat_nbr)
  FROM (SELECT S1.seat_nbr,
               S1.seat_nbr
               - (SELECT COUNT(*)
                    FROM Seating AS S2
                   WHERE S2.seat_nbr <= S1.seat_nbr)
          FROM Seating AS S1)
       AS Z (seat_nbr, dif_rank)
 GROUP BY Z.dif_rank;
The derived table finds the lengths of the blocks of seats to the left of each seat_nbr and uses that length to form groups.
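The value-minus-rank idea can be tried on the ten sample values. The sketch below uses Python's sqlite3 module and applies both of Mr. Walsh's queries to the same Foobar table, which is an adaptation, since the second query is written against a seating table in the text. SQLite does not accept a column list after a derived-table alias, so the columns are named with AS inside the derived table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foobar (my_data INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO Foobar VALUES (?)",
                 [(v,) for v in (1, 2, 5, 6, 7, 8, 9, 11, 12, 22)])

# Derived table: each value paired with (value - rank), named absent.
ranked = """
    SELECT F1.my_data,
           F1.my_data - (SELECT COUNT(*)
                           FROM Foobar AS F2
                          WHERE F2.my_data <= F1.my_data) AS absent
      FROM Foobar AS F1"""

# First query: how many rows share each value of absent.
run_sizes = conn.execute(f"""
    SELECT X.absent, COUNT(*)
      FROM ({ranked}) AS X
     GROUP BY X.absent
     ORDER BY X.absent""").fetchall()

# Second query: first and last value of each consecutive run.
run_bounds = conn.execute(f"""
    SELECT MIN(X.my_data), MAX(X.my_data)
      FROM ({ranked}) AS X
     GROUP BY X.absent
     ORDER BY 1""").fetchall()

print(run_sizes)
print(run_bounds)
```

The largest count belongs to the group whose absent value is 2, which is the run from 5 through 9.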
24.4 Bound Queries
Another form of query asks whether there is an overall trend between two points in time bounded by a low value and a high value in the sequence of data. This is easier to show with an example. Let us assume that we have data on the selling prices of a stock in a table. We want to find periods of time when the price was generally increasing.
Consider this data:
MyStock
sale_date price
=====================
'2006-12-01' 10.00
'2006-12-02' 15.00
'2006-12-03' 13.00
'2006-12-04' 12.00
'2006-12-05' 20.00
The stock was generally increasing in all the periods that began on December 1 or ended on December 5—that is, it finished higher at the ends of those periods, in spite of the slump in the middle A query for this problem is:
SELECT S1.sale_date AS start_date, S2.sale_date AS finish_date
  FROM MyStock AS S1, MyStock AS S2
 WHERE S1.sale_date < S2.sale_date
   AND NOT EXISTS
       (SELECT *
          FROM MyStock AS S3
         WHERE S3.sale_date BETWEEN S1.sale_date AND S2.sale_date
           AND S3.price NOT BETWEEN S1.price AND S2.price);
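A short run of the bound query against the five sample prices shows which periods it accepts. The sketch uses Python's sqlite3 module. Storing the dates as ISO-8601 text and the prices as REAL is an assumption of the sketch, chosen so that plain comparisons order the rows correctly in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates are kept as ISO-8601 strings so that < and BETWEEN sort correctly.
conn.execute("""CREATE TABLE MyStock
                (sale_date TEXT NOT NULL PRIMARY KEY,
                 price REAL NOT NULL)""")
conn.executemany("INSERT INTO MyStock VALUES (?, ?)",
                 [('2006-12-01', 10.00), ('2006-12-02', 15.00),
                  ('2006-12-03', 13.00), ('2006-12-04', 12.00),
                  ('2006-12-05', 20.00)])

# Keep a period only if every price inside it lies between the
# starting price and the finishing price.
periods = conn.execute("""
    SELECT S1.sale_date AS start_date, S2.sale_date AS finish_date
      FROM MyStock AS S1, MyStock AS S2
     WHERE S1.sale_date < S2.sale_date
       AND NOT EXISTS
           (SELECT *
              FROM MyStock AS S3
             WHERE S3.sale_date BETWEEN S1.sale_date AND S2.sale_date
               AND S3.price NOT BETWEEN S1.price AND S2.price)
     ORDER BY start_date, finish_date""").fetchall()

for period in periods:
    print(period)
```

For this data the query keeps three periods: December 1 to 2, December 1 to 5, and December 4 to 5. A period such as December 1 to 3 is dropped because the December 2 price of 15.00 lies outside the bounds 10.00 and 13.00, so the query demands more than a finish price above the start price.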
24.5 Run and Sequence Queries
Runs are informally defined as sequences with gaps. That is, we have a set of unique numbers whose order has some meaning, but the numbers are not all consecutive. Time series information in which the samples are taken at irregular intervals is an example of this sort of data. Runs can be constructed in the same manner as the sequences by making a minor change in the search condition. Let’s do these queries with an abstract table made up of a sequence number and a value:
CREATE TABLE Runs (seq_nbr INTEGER NOT NULL PRIMARY KEY, val INTEGER NOT NULL);
Runs
seq_nbr val
===========
1 6
2 41
3 12
4 51
5 21
6 70
7 79
8 62
9 30
10 31
11 32
12 34
13 35
14 57
15 19
16 84
17 80
18 90
19 63
20 53
21 3
22 59
23 69
24 27
25 33
One problem is that we do not want to get back all the runs and sequences of length one. Ideally, the length (n) of the run should be adjustable. This query will find runs of length (n) or greater; if you want runs of exactly (n), change the “greater than” sign to an equal sign.
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr -- start and end points
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1) -- length restrictions
   AND NOT EXISTS -- ordering within the end points
       (SELECT *
          FROM Runs AS R3, Runs AS R4
         WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr < R4.seq_nbr
           AND R3.val > R4.val);
This query sets up the R1 sequence number as the starting point and the R2 sequence number as the ending point of the run. The monster subquery in the NOT EXISTS() predicate looks for a pair of rows in the middle of the run that violates the ordering of the run. If there is none, the run is valid. The best way to understand what is happening is to draw a linear diagram. This shows that as the ordering (seq_nbr) increases, so must the corresponding values (val).
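The run query can be exercised on the 25 sample rows. In this sketch, which uses Python's sqlite3 module, the host parameter :n is bound to 4 purely as an illustration, so the filter (R2.seq_nbr - R1.seq_nbr) > 3 keeps only pairs that are at least four positions apart.

```python
import sqlite3

vals = [6, 41, 12, 51, 21, 70, 79, 62, 30, 31, 32, 34, 35,
        57, 19, 84, 80, 90, 63, 53, 3, 59, 69, 27, 33]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Runs
                (seq_nbr INTEGER NOT NULL PRIMARY KEY,
                 val INTEGER NOT NULL)""")
# seq_nbr runs from 1 to 25, in the order the values are listed.
conn.executemany("INSERT INTO Runs VALUES (?, ?)",
                 list(enumerate(vals, start=1)))

runs = conn.execute("""
    SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
      FROM Runs AS R1, Runs AS R2
     WHERE R1.seq_nbr < R2.seq_nbr
       AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
       AND NOT EXISTS
           (SELECT *
              FROM Runs AS R3, Runs AS R4
             WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND R3.seq_nbr < R4.seq_nbr
               AND R3.val > R4.val)
     ORDER BY start_seq_nbr, end_seq_nbr""", {"n": 4}).fetchall()

print(runs)
```

All three pairs returned lie inside the stretch from seq_nbr 9 to 14 (values 30, 31, 32, 34, 35, 57), which is the only place in the sample where the values rise for that long. The query reports every qualifying pair of end points, not just the longest one.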
A sequence has the additional restriction that every value increases by one as you scan the run from left to right. This means that in a sequence, the highest value minus the lowest value, plus one, is the length of the sequence.
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val) -- order condition
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1) -- length restrictions
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3
         WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)));
The NOT EXISTS predicate checks that there is no point in between the start and the end of the sequence that violates the ordering condition.
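The sequence query can be tried on the same 25 rows. This sketch uses Python's sqlite3 module and wraps the query in a small helper, find_sequences, which is a hypothetical name introduced here so that two different values of :n can be compared.

```python
import sqlite3

vals = [6, 41, 12, 51, 21, 70, 79, 62, 30, 31, 32, 34, 35,
        57, 19, 84, 80, 90, 63, 53, 3, 59, 69, 27, 33]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Runs
                (seq_nbr INTEGER NOT NULL PRIMARY KEY,
                 val INTEGER NOT NULL)""")
conn.executemany("INSERT INTO Runs VALUES (?, ?)",
                 list(enumerate(vals, start=1)))

SEQUENCE_QUERY = """
    SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
      FROM Runs AS R1, Runs AS R2
     WHERE R1.seq_nbr < R2.seq_nbr
       AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val)
       AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
       AND NOT EXISTS
           (SELECT *
              FROM Runs AS R3
             WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
               AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                    OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)))
     ORDER BY start_seq_nbr, end_seq_nbr"""

def find_sequences(n):
    """Hypothetical helper: run the sequence query with :n bound to n."""
    return conn.execute(SEQUENCE_QUERY, {"n": n}).fetchall()

print(find_sequences(1))
print(find_sequences(2))
```

In the sample data the values step up by exactly one only at 30, 31, 32 (seq_nbr 9 to 11) and at 34, 35 (seq_nbr 12 to 13), so those are the only places where the order condition can hold.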
Obviously, any of these queries can be changed from increasing to decreasing, or from strictly increasing to simply increasing or simply decreasing, and so on, by changing the comparison predicates. You can also change the query for finding sequences in a table by altering the size of the step from 1 to k, so that k times the difference between the starting position and the ending position equals the difference between the starting value and the ending value.
24.5.1 Filling in Sequence Numbers
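A fair number of SQL programmers want to reuse a sequence of numbers for keys. While I do not approve of the practice of generating a meaningless, unverifiable key after the creation of an entity, the problem of inserting missing numbers is interesting. The usual specifications are:
1. The sequence starts with 1, if it is missing or the table is empty.
2. We want to reuse the lowest missing number first.
3. Do not exceed some maximum value; if the sequence is full, then give us a warning or a NULL. Another option is to give us (MAX(seq_nbr) + 1), so we can add to the high end of the list.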
This answer is a good example of thinking in terms of sets rather than doing row-at-a-time processing:
SELECT MIN(new_seq_nbr)
  FROM (SELECT CASE
               WHEN 1 NOT IN (SELECT seq_nbr FROM List)
               THEN 1 -- tested first so a missing 1 always wins
               WHEN (seq_nbr + 1) NOT IN (SELECT seq_nbr FROM List)
               THEN (seq_nbr + 1)
               WHEN seq_nbr > 1 -- guard so that zero is never proposed
                    AND (seq_nbr - 1) NOT IN (SELECT seq_nbr FROM List)
               THEN (seq_nbr - 1)
               ELSE NULL END
          FROM List
         WHERE seq_nbr BETWEEN 1
                           AND (SELECT MAX(seq_nbr) FROM List))
       AS P(new_seq_nbr);
The idea is to build a table expression of some of the missing values, then pick the minimum one. The starting value, one, is treated as an exception. Since an aggregate function cannot take a query expression as a parameter, we have to use a derived table.
Along the same lines, we can use aggregate functions in a CASE expression:
SELECT CASE WHEN MAX(seq_nbr) = COUNT(*)
            THEN CAST(NULL AS INTEGER)
            -- THEN MAX(seq_nbr) + 1 as other option
            WHEN MIN(seq_nbr) > 1
            THEN 1
            WHEN MAX(seq_nbr) <> COUNT(*)
            THEN (SELECT MIN(seq_nbr) + 1
                    FROM List
                   WHERE (seq_nbr + 1)
                         NOT IN (SELECT seq_nbr FROM List))
            ELSE NULL END
  FROM List;
The first WHEN clause sees whether the table is already full and returns a NULL; the alternative noted in the query is to increment the list by the next value.
The second WHEN clause looks to see whether the minimum sequence number is one or not. If it is not, it uses one as the next value.
The third WHEN clause handles the situation when there is a gap in the middle of the sequence; it picks the lowest missing number. The ELSE clause is in case of errors and should not be executed.
The order of the WHEN clauses matters: it is a way of forcing an inspection of the table’s values from front to back.
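The three WHEN clauses can be traced with one small list per case. The sketch below uses Python's sqlite3 module; next_seq_nbr is a hypothetical helper that loads a fresh in-memory List table and returns the single value the CASE expression produces.

```python
import sqlite3

NEXT_SEQ_NBR = """
    SELECT CASE WHEN MAX(seq_nbr) = COUNT(*)
                THEN CAST(NULL AS INTEGER)
                WHEN MIN(seq_nbr) > 1
                THEN 1
                WHEN MAX(seq_nbr) <> COUNT(*)
                THEN (SELECT MIN(seq_nbr) + 1
                        FROM List
                       WHERE (seq_nbr + 1)
                             NOT IN (SELECT seq_nbr FROM List))
                ELSE NULL END
      FROM List"""

def next_seq_nbr(values):
    """Hypothetical helper: load values into List and evaluate the CASE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(v,) for v in values])
    return conn.execute(NEXT_SEQ_NBR).fetchone()[0]

print(next_seq_nbr([1, 2, 3]))  # full list: first WHEN clause
print(next_seq_nbr([2, 3]))     # 1 is missing: second WHEN clause
print(next_seq_nbr([1, 2, 4]))  # gap in the middle: third WHEN clause
```

The full list yields NULL (None in Python), the list without a 1 yields 1, and the list with a hole yields the lowest missing number, 3.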
Simpler methods based on group characteristics would be:
SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
FROM List AS L1
LEFT OUTER JOIN
List AS L2
ON L1.seq_nbr = L2.seq_nbr - 1
WHERE L2.seq_nbr IS NULL;
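The outer join version can be tried on a few small lists. The sketch below uses Python's sqlite3 module; lowest_free_nbr is a hypothetical helper name that builds a fresh in-memory List table for each call.

```python
import sqlite3

OUTER_JOIN_QUERY = """
    SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
      FROM List AS L1
           LEFT OUTER JOIN
           List AS L2
           ON L1.seq_nbr = L2.seq_nbr - 1
     WHERE L2.seq_nbr IS NULL"""

def lowest_free_nbr(values):
    """Hypothetical helper: load values into List and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(v,) for v in values])
    return conn.execute(OUTER_JOIN_QUERY).fetchone()[0]

print(lowest_free_nbr([]))         # empty table
print(lowest_free_nbr([1, 2, 4]))  # gap in the middle
print(lowest_free_nbr([1, 2, 3]))  # full list
print(lowest_free_nbr([2, 3]))     # list that starts above 1
```

The rows that survive the WHERE clause are the ones with no successor in the table, so the smallest of them plus one is the first free number after the lowest stored value. An empty table gives 1 through the COALESCE(), and a full list gives MAX(seq_nbr) + 1 rather than NULL. For a list that starts above 1, this form returns the first gap after the existing values (4 for the last call) instead of 1.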