Joe Celko's SQL for Smarties - Advanced SQL Programming (P59)


CHAPTER 24: REGIONS, RUNS, GAPS, SEQUENCES, AND SERIES

(1004, 'N'), (1003, 'Y'), (1002, 'Y'), (1001, 'Y'), (1000, 'N');

The results we want assign a grouping number to each run of on-time/late payments, thus:

Results
grping  payment_nbr  paid_on_time
=================================
  1        1006         'Y'
  1        1005         'Y'
  2        1004         'N'
  3        1003         'Y'
  3        1002         'Y'
  3        1001         'Y'
  4        1000         'N'

A solution by Hugo Kornelis depends on the payments always being numbered consecutively:

SELECT (SELECT COUNT(*)
          FROM PaymentHistory AS H2, PaymentHistory AS H3
         WHERE H3.payment_nbr = H2.payment_nbr + 1
           AND H3.paid_on_time <> H2.paid_on_time
           AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,
       payment_nbr, paid_on_time
  FROM PaymentHistory AS H1;

This can be modified for more types of behavior.
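As a rough check of the grouping query, the following sketch runs it against the sample rows using Python's built-in sqlite3 module. The column types and the PRIMARY KEY are assumptions (the table definition is not shown in this excerpt), the first two rows are taken from the results table above, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory table holding the sample payment history (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PaymentHistory "
    "(payment_nbr INTEGER NOT NULL PRIMARY KEY, "
    " paid_on_time CHAR(1) NOT NULL)"
)
conn.executemany(
    "INSERT INTO PaymentHistory VALUES (?, ?)",
    [(1006, "Y"), (1005, "Y"), (1004, "N"), (1003, "Y"),
     (1002, "Y"), (1001, "Y"), (1000, "N")],
)

# For each payment, count the on-time/late changes between consecutive
# payments at or after it, then add one to get the group number.
GROUPING_QUERY = """
SELECT (SELECT COUNT(*)
          FROM PaymentHistory AS H2, PaymentHistory AS H3
         WHERE H3.payment_nbr = H2.payment_nbr + 1
           AND H3.paid_on_time <> H2.paid_on_time
           AND H2.payment_nbr >= H1.payment_nbr) + 1 AS grping,
       payment_nbr, paid_on_time
  FROM PaymentHistory AS H1
 ORDER BY payment_nbr DESC
"""

payment_groups = conn.execute(GROUPING_QUERY).fetchall()
for row in payment_groups:
    print(row)
```

Because the subquery pairs each payment with payment_nbr + 1, a missing payment number would hide a change of status, which is why the numbering has to be consecutive.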

24.3 Finding Regions of Maximum Size

A query to find a region, rather than a subregion of a known size, of seats

1993)

SELECT T1.seat_nbr, ' thru ', T2.seat_nbr
  FROM Theater AS T1, Theater AS T2
 WHERE T1.seat_nbr < T2.seat_nbr
   AND NOT EXISTS
       (SELECT *
          FROM Theater AS T3
         WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr
                AND T3.occupancy_status <> 'A')
            OR (T3.seat_nbr = T2.seat_nbr + 1
                AND T3.occupancy_status = 'A')
            OR (T3.seat_nbr = T1.seat_nbr - 1
                AND T3.occupancy_status = 'A'));

The trick here is to look for the starting and ending seats in the region. The starting seat_nbr of a region is to the right of a sold seat_nbr, and the ending seat_nbr is to the left of a sold seat_nbr. No seat_nbr between the start and the end has been sold.
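A minimal sketch of the region query, run through Python's sqlite3 module. The Theater schema and the eight sample seats are hypothetical ('A' for available, 'S' for sold), since the table is not defined in this excerpt; the ' thru ' literal is dropped and an ORDER BY is added for readable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data: 'A' = available, 'S' = sold.
conn.execute(
    "CREATE TABLE Theater "
    "(seat_nbr INTEGER NOT NULL PRIMARY KEY, "
    " occupancy_status CHAR(1) NOT NULL)"
)
conn.executemany(
    "INSERT INTO Theater VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "S"), (4, "A"),
     (5, "A"), (6, "A"), (7, "S"), (8, "A")],
)

# A pair (T1, T2) is a maximal region when no seat inside it is sold
# and the seats just outside it are not available.
REGION_QUERY = """
SELECT T1.seat_nbr, T2.seat_nbr
  FROM Theater AS T1, Theater AS T2
 WHERE T1.seat_nbr < T2.seat_nbr
   AND NOT EXISTS
       (SELECT *
          FROM Theater AS T3
         WHERE (T3.seat_nbr BETWEEN T1.seat_nbr AND T2.seat_nbr
                AND T3.occupancy_status <> 'A')
            OR (T3.seat_nbr = T2.seat_nbr + 1
                AND T3.occupancy_status = 'A')
            OR (T3.seat_nbr = T1.seat_nbr - 1
                AND T3.occupancy_status = 'A'))
 ORDER BY T1.seat_nbr
"""

regions = conn.execute(REGION_QUERY).fetchall()
print(regions)
```

With this data the regions are seats 1 thru 2 and 4 thru 6. Seat 8 is available but isolated, and it is not reported because the query requires T1.seat_nbr < T2.seat_nbr.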

If you only keep the available seat_nbrs in a table, the solution is a bit easier. It is also a more general problem that applies to any table of sequential, possibly noncontiguous, data:

CREATE TABLE AvailableSeating
(seat_nbr INTEGER NOT NULL
   CONSTRAINT valid_seat_nbr
   CHECK (seat_nbr BETWEEN 001 AND 999));

INSERT INTO AvailableSeating
VALUES (199), (200), (201), (202), (204),
       (210), (211), (212), (214), (218);

You need to create a result that will show the start and finish values of each sequence in the table, thus:

Results
start  finish
=============
 199    202
 204    204
 210    212
 214    214
 218    218


This is a common way of finding the missing values in a sequence of tickets sold, unaccounted-for invoices, and so forth. Imagine a number line with closed dots for the numbers that are in the table and open dots for the numbers that are not. What do you see about a sequence? Well, we can start with a fact that anyone who has done inventory knows: the number of elements in a sequence is equal to the ending sequence number minus the starting sequence number plus one. This is a basic property of ordinal numbers:

(finish - start + 1) = Length of open seats

This suggests using two copies of the table, one for the starting value and one for the ending value of each sequence. Once we have those two items, we can compute the length with our formula and see if it is equal to the count of the items between the start and finish.

SELECT S1.seat_nbr, MAX(S2.seat_nbr) -- start and rightmost item
  FROM AvailableSeating AS S1
       INNER JOIN
       AvailableSeating AS S2 -- self-join
       ON S1.seat_nbr <= S2.seat_nbr
          AND (S2.seat_nbr - S1.seat_nbr + 1) -- formula for length
              = (SELECT COUNT(*) -- items in the sequence
                   FROM AvailableSeating AS S3
                  WHERE S3.seat_nbr BETWEEN S1.seat_nbr AND S2.seat_nbr)
          AND NOT EXISTS
              (SELECT *
                 FROM AvailableSeating AS S4
                WHERE S1.seat_nbr - 1 = S4.seat_nbr)
 GROUP BY S1.seat_nbr;

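Finally, we need to be sure that we have the furthest item to the right as the end item. Each sequence of (n) items has (n) subsequences that all start with the same item, so the query groups on the starting item and uses MAX() to keep the rightmost end.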

However, there is a faster version with three tables. This solution is based on another property of the longest possible sequences. If you look to the right of the last item, you do not find anything. Likewise, if you look to the left of the first item, you do not find anything either. These missing items that are “just over the border” define a sequence by framing it. There also cannot be any “gaps” (missing items) inside those borders. That translates into SQL as:

SELECT S1.seat_nbr, MIN(S2.seat_nbr) -- start and leftmost border
  FROM AvailableSeating AS S1, AvailableSeating AS S2
 WHERE S1.seat_nbr <= S2.seat_nbr
   AND NOT EXISTS -- border items of the sequence
       (SELECT *
          FROM AvailableSeating AS S3
         WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr
           AND (S3.seat_nbr = S1.seat_nbr - 1
                OR S3.seat_nbr = S2.seat_nbr + 1))
 GROUP BY S1.seat_nbr;

We do not have to worry about getting the rightmost item in the sequence, but we do have to worry about getting the leftmost border.

Since the second approach uses only three copies of the original table, it should be faster. Its EXISTS predicate can also take advantage of indexing and thus run faster than subquery expressions, which require a table scan.
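To compare the two approaches, here is a small sketch that loads the AvailableSeating sample into an in-memory SQLite database and runs both queries. The ORDER BY clauses are additions for deterministic output, and nothing here measures speed; it only checks that both queries produce the start and finish pairs listed in the Results table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AvailableSeating "
    "(seat_nbr INTEGER NOT NULL "
    " CONSTRAINT valid_seat_nbr CHECK (seat_nbr BETWEEN 1 AND 999))"
)
conn.executemany(
    "INSERT INTO AvailableSeating VALUES (?)",
    [(n,) for n in (199, 200, 201, 202, 204, 210, 211, 212, 214, 218)],
)

# First approach: (finish - start + 1) must equal the number of rows
# between start and finish, and the start has no left neighbor.
LENGTH_QUERY = """
SELECT S1.seat_nbr, MAX(S2.seat_nbr)
  FROM AvailableSeating AS S1
       INNER JOIN
       AvailableSeating AS S2
       ON S1.seat_nbr <= S2.seat_nbr
          AND (S2.seat_nbr - S1.seat_nbr + 1)
              = (SELECT COUNT(*)
                   FROM AvailableSeating AS S3
                  WHERE S3.seat_nbr BETWEEN S1.seat_nbr AND S2.seat_nbr)
          AND NOT EXISTS
              (SELECT *
                 FROM AvailableSeating AS S4
                WHERE S1.seat_nbr - 1 = S4.seat_nbr)
 GROUP BY S1.seat_nbr
 ORDER BY S1.seat_nbr
"""

# Second approach: nothing sits just outside either border.
BORDER_QUERY = """
SELECT S1.seat_nbr, MIN(S2.seat_nbr)
  FROM AvailableSeating AS S1, AvailableSeating AS S2
 WHERE S1.seat_nbr <= S2.seat_nbr
   AND NOT EXISTS
       (SELECT *
          FROM AvailableSeating AS S3
         WHERE S3.seat_nbr NOT BETWEEN S1.seat_nbr AND S2.seat_nbr
           AND (S3.seat_nbr = S1.seat_nbr - 1
                OR S3.seat_nbr = S2.seat_nbr + 1))
 GROUP BY S1.seat_nbr
 ORDER BY S1.seat_nbr
"""

by_length = conn.execute(LENGTH_QUERY).fetchall()
by_border = conn.execute(BORDER_QUERY).fetchall()
print(by_length)
print(by_border)
```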

Michel Walsh came up with two novel ways of getting the range of seat numbers that have been used in the table. He saw that the difference between the value and its rank is a constant for all values in the same consecutive sequence, so we just have to group, and count, on the value minus its rank to get the various consecutive runs (or just keep the maximum). It is so simple, an example will show everything:

data = {1, 2, 5, 6, 7, 8, 9, 11, 12, 22}

data  rank  (data - rank) AS absent
===================================
  1     1      0
  2     2      0
  5     3      2
  6     4      2
  7     5      2
  8     6      2
  9     7      2
 11     8      3
 12     9      3
 22    10     12

absent  COUNT(*)
================
   0       2
   2       5
   3       2
  12       1

As you can see, the maximum contiguous sequence is 5 (for the rows where absent = 2). The rank is computed as the number of values less than or equal to the actual value, with the assumption of a set of integers without repeated values. This is the query:

SELECT X.absent, COUNT(*)
  FROM (SELECT F1.my_data,
               (SELECT COUNT(*)
                  FROM Foobar AS F2
                 WHERE F2.my_data <= F1.my_data),
               F1.my_data
               - (SELECT COUNT(*)
                    FROM Foobar AS F2
                   WHERE F2.my_data <= F1.my_data)
          FROM Foobar AS F1)
       AS X(my_data, rank, absent)
 GROUP BY X.absent;

Playing with this basic idea, Mr. Walsh came up with this second query:

SELECT MIN(Z.seat_nbr), MAX(Z.seat_nbr)
  FROM (SELECT S1.seat_nbr,
               S1.seat_nbr
               - (SELECT COUNT(*)
                    FROM Seating AS S2
                   WHERE S2.seat_nbr <= S1.seat_nbr)
          FROM Seating AS S1)
       AS Z (seat_nbr, dif_rank)
 GROUP BY Z.dif_rank;

The derived table finds the lengths of the blocks of seats to the left of each seat_nbr and uses that length to form groups.
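The value-minus-rank idea can be tried on the ten sample values with the sketch below. It is an adaptation rather than the exact queries above: SQLite does not accept a column list after a derived table alias, so the columns are named with AS inside the subquery, and the count and the MIN/MAX range are combined into one query for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foobar (my_data INTEGER NOT NULL PRIMARY KEY)")
conn.executemany(
    "INSERT INTO Foobar VALUES (?)",
    [(n,) for n in (1, 2, 5, 6, 7, 8, 9, 11, 12, 22)],
)

# rank = number of values <= my_data; (my_data - rank) is constant
# within each block of consecutive values, so it works as a group key.
RUNS_BY_RANK = """
SELECT X.absent, COUNT(*), MIN(X.my_data), MAX(X.my_data)
  FROM (SELECT F1.my_data AS my_data,
               F1.my_data
               - (SELECT COUNT(*)
                    FROM Foobar AS F2
                   WHERE F2.my_data <= F1.my_data) AS absent
          FROM Foobar AS F1) AS X
 GROUP BY X.absent
 ORDER BY X.absent
"""

rank_groups = conn.execute(RUNS_BY_RANK).fetchall()
for absent, size, first, last in rank_groups:
    print(absent, size, first, last)
```

Each output row gives the group key, the size of the consecutive block, and its first and last values; the largest block here is 5 thru 9.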


24.4 Bound Queries

Another form of query asks whether there is an overall trend between two points in time bounded by a low value and a high value in the sequence of data. This is easier to show with an example. Let us assume that we have data on the selling prices of a stock in a table. We want to find periods of time when the price was generally increasing.

Consider this data:

MyStock
sale_date     price
===================
'2006-12-01'  10.00
'2006-12-02'  15.00
'2006-12-03'  13.00
'2006-12-04'  12.00
'2006-12-05'  20.00

The stock was generally increasing in all the periods that began on December 1 or ended on December 5; that is, it finished higher at the ends of those periods, in spite of the slump in the middle. A query for this problem is:

SELECT S1.sale_date AS start_date, S2.sale_date AS finish_date
  FROM MyStock AS S1, MyStock AS S2
 WHERE S1.sale_date < S2.sale_date
   AND NOT EXISTS
       (SELECT *
          FROM MyStock AS S3
         WHERE S3.sale_date BETWEEN S1.sale_date AND S2.sale_date
           AND S3.price NOT BETWEEN S1.price AND S2.price);
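The following sketch runs the bound query on the five sample rows. The column types and key are assumptions, dates are stored as ISO-format text (which SQLite compares correctly as strings), and an ORDER BY is added. Note that the predicate requires every price inside the period to lie between the starting and ending prices, so it is stricter than merely finishing higher.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the table definition is not shown in the text.
conn.execute(
    "CREATE TABLE MyStock "
    "(sale_date DATE NOT NULL PRIMARY KEY, "
    " price DECIMAL(10, 2) NOT NULL)"
)
conn.executemany(
    "INSERT INTO MyStock VALUES (?, ?)",
    [("2006-12-01", 10.00), ("2006-12-02", 15.00), ("2006-12-03", 13.00),
     ("2006-12-04", 12.00), ("2006-12-05", 20.00)],
)

# Keep a (start, finish) pair only if no price within the period falls
# outside the range set by the starting and ending prices.
BOUND_QUERY = """
SELECT S1.sale_date AS start_date, S2.sale_date AS finish_date
  FROM MyStock AS S1, MyStock AS S2
 WHERE S1.sale_date < S2.sale_date
   AND NOT EXISTS
       (SELECT *
          FROM MyStock AS S3
         WHERE S3.sale_date BETWEEN S1.sale_date AND S2.sale_date
           AND S3.price NOT BETWEEN S1.price AND S2.price)
 ORDER BY start_date, finish_date
"""

bounded_periods = conn.execute(BOUND_QUERY).fetchall()
for start_date, finish_date in bounded_periods:
    print(start_date, "to", finish_date)
```

On this data the query returns three periods: December 1 to 2, December 1 to 5, and December 4 to 5.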

24.5 Run and Sequence Queries

Runs are informally defined as sequences with gaps. That is, we have a set of unique numbers whose order has some meaning, but the numbers are not all consecutive. Time series information in which the samples are taken at irregular intervals is an example of this sort of data. Runs can be constructed in the same manner as the sequences by making a minor change in the search condition. Let’s do these queries with an abstract table made up of a sequence number and a value:


CREATE TABLE Runs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 val INTEGER NOT NULL);

Runs
seq_nbr  val
============
   1      6
   2     41
   3     12
   4     51
   5     21
   6     70
   7     79
   8     62
   9     30
  10     31
  11     32
  12     34
  13     35
  14     57
  15     19
  16     84
  17     80
  18     90
  19     63
  20     53
  21      3
  22     59
  23     69
  24     27
  25     33

One problem is that we do not want to get back all the runs and sequences of length one; ideally, the length (n) should be adjustable. This query will find runs of length (n) or greater; if you want runs of exactly (n), change the “greater than” sign to an equal sign.

SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr -- start and end points
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1) -- length restrictions
   AND NOT EXISTS -- ordering within the end points
       (SELECT *
          FROM Runs AS R3, Runs AS R4
         WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr < R4.seq_nbr
           AND R3.val > R4.val);

This query sets up the R1 sequence number as the starting point and the R2 sequence number as the ending point of the run. The monster subquery in the NOT EXISTS predicate looks for a pair of rows in the middle of the run that violates the ordering of the run. If there is none, the run is valid. The best way to understand what is happening is to draw a linear diagram. This shows that as the ordering (seq_nbr) increases, so must the corresponding values (val).
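Here is a sketch of the run query against the 25-row Runs sample, with the host parameter passed as a named sqlite3 parameter. The helper name find_runs and the ORDER BY are additions for illustration. As the length predicate is written, a pair is returned only when its end points are at least n positions apart.

```python
import sqlite3

# val column of the Runs sample, in seq_nbr order 1..25.
VALS = [6, 41, 12, 51, 21, 70, 79, 62, 30, 31, 32, 34, 35,
        57, 19, 84, 80, 90, 63, 53, 3, 59, 69, 27, 33]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Runs "
    "(seq_nbr INTEGER NOT NULL PRIMARY KEY, val INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO Runs VALUES (?, ?)", enumerate(VALS, start=1))

RUN_QUERY = """
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3, Runs AS R4
         WHERE R4.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND R3.seq_nbr < R4.seq_nbr
           AND R3.val > R4.val)
 ORDER BY start_seq_nbr, end_seq_nbr
"""


def find_runs(n):
    """Return (start, end) pairs of non-decreasing runs for parameter n."""
    return conn.execute(RUN_QUERY, {"n": n}).fetchall()


print(find_runs(5))
```

In the sample, rows 9 through 14 (values 30, 31, 32, 34, 35, 57) form the longest increasing stretch, so find_runs(5) reports only the pair (9, 14).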

A sequence has the additional restriction that every value increases by one as you scan the run from left to right. This means that in a sequence, the highest value minus the lowest value, plus one, is the length of the sequence.

SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val) -- order condition
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1) -- length restrictions
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3
         WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)));

The subquery in the NOT EXISTS predicate looks for a point in between the start and the end of the sequence that violates the ordering condition.
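The sequence query can be exercised on the same Runs sample with the sketch below. The helper name find_sequences and the ORDER BY are illustrative additions. Rows 7 and 18 (values 79 and 90) happen to satisfy the order condition at the end points, which shows why the NOT EXISTS check on the rows in between is still needed.

```python
import sqlite3

# val column of the Runs sample, in seq_nbr order 1..25.
VALS = [6, 41, 12, 51, 21, 70, 79, 62, 30, 31, 32, 34, 35,
        57, 19, 84, 80, 90, 63, 53, 3, 59, 69, 27, 33]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Runs "
    "(seq_nbr INTEGER NOT NULL PRIMARY KEY, val INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO Runs VALUES (?, ?)", enumerate(VALS, start=1))

SEQUENCE_QUERY = """
SELECT R1.seq_nbr AS start_seq_nbr, R2.seq_nbr AS end_seq_nbr
  FROM Runs AS R1, Runs AS R2
 WHERE R1.seq_nbr < R2.seq_nbr
   AND (R2.seq_nbr - R1.seq_nbr) = (R2.val - R1.val)
   AND (R2.seq_nbr - R1.seq_nbr) > (:n - 1)
   AND NOT EXISTS
       (SELECT *
          FROM Runs AS R3
         WHERE R3.seq_nbr BETWEEN R1.seq_nbr AND R2.seq_nbr
           AND ((R3.seq_nbr - R1.seq_nbr) <> (R3.val - R1.val)
                OR (R2.seq_nbr - R3.seq_nbr) <> (R2.val - R3.val)))
 ORDER BY start_seq_nbr, end_seq_nbr
"""


def find_sequences(n):
    """Return (start, end) pairs whose values rise by exactly one per row."""
    return conn.execute(SEQUENCE_QUERY, {"n": n}).fetchall()


print(find_sequences(2))
print(find_sequences(1))
```

With n = 2 only the pair (9, 11) is returned (values 30, 31, 32); with n = 1 the shorter pairs inside it and the pair (12, 13) appear as well.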


Obviously, any of these queries can be changed from increasing to decreasing, or from strictly increasing to simply increasing or simply decreasing, and so on, by changing the comparison predicates. You can also change the query for finding sequences in a table by altering the size difference between the starting value and the ending value.

24.5.1 Filling in Sequence Numbers

A fair number of SQL programmers want to reuse a sequence of numbers for keys. While I do not approve of the practice of generating a meaningless, unverifiable key after the creation of an entity, the problem of inserting missing numbers is interesting. The usual specifications are:

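1. The sequence starts with 1, if it is missing or the table is empty.

2. Otherwise, reuse the lowest missing number.

3. If the sequence is already full, then give us a warning or a NULL. Another option is to give us (MAX(seq_nbr) + 1), so we can add to the high end of the list.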

This answer is a good example of thinking in terms of sets rather than doing row-at-a-time processing:

SELECT MIN(new_seq_nbr)
  FROM (SELECT CASE
               WHEN (seq_nbr + 1) NOT IN (SELECT seq_nbr FROM List)
               THEN (seq_nbr + 1)
               WHEN (seq_nbr - 1) NOT IN (SELECT seq_nbr FROM List)
               THEN (seq_nbr - 1)
               WHEN 1 NOT IN (SELECT seq_nbr FROM List)
               THEN 1
               ELSE NULL END
          FROM List
         WHERE seq_nbr BETWEEN 1 AND (SELECT MAX(seq_nbr) FROM List))
       AS P(new_seq_nbr);

The idea is to build a table expression of some of the missing values, then pick the minimum one. The starting value, one, is treated as an exception. Since an aggregate function cannot take a query expression as a parameter, we have to use a derived table.

The same problem can also be handled with aggregate functions inside a CASE expression:

SELECT CASE WHEN MAX(seq_nbr) = COUNT(*)
            THEN CAST(NULL AS INTEGER)
            -- THEN MAX(seq_nbr) + 1 as other option
            WHEN MIN(seq_nbr) > 1
            THEN 1
            WHEN MAX(seq_nbr) <> COUNT(*)
            THEN (SELECT MIN(seq_nbr) + 1
                    FROM List
                   WHERE (seq_nbr) + 1
                         NOT IN (SELECT seq_nbr FROM List))
            ELSE NULL END
  FROM List;

The first WHEN clause sees whether the table is already full and returns a NULL; the commented-out alternative would instead increment the list by the next value.

The second WHEN clause looks to see whether the minimum sequence number is one or not. If it is not, it uses one as the next value.

The third WHEN clause handles the situation when there is a gap in the sequence; it picks the lowest missing number. The ELSE clause is in case of errors and should not be executed.

The order of the WHEN clauses is a way of forcing an inspection of the table’s values from front to back.
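The three WHEN clauses can be traced with a small sketch that rebuilds the List table for a few inputs and evaluates the CASE expression. The helper next_seq_nbr, the sample lists, and the table's column definition are assumptions made for illustration.

```python
import sqlite3

CASE_QUERY = """
SELECT CASE WHEN MAX(seq_nbr) = COUNT(*)
            THEN CAST(NULL AS INTEGER)
            WHEN MIN(seq_nbr) > 1
            THEN 1
            WHEN MAX(seq_nbr) <> COUNT(*)
            THEN (SELECT MIN(seq_nbr) + 1
                    FROM List
                   WHERE (seq_nbr) + 1
                         NOT IN (SELECT seq_nbr FROM List))
            ELSE NULL END
  FROM List
"""


def next_seq_nbr(values):
    """Load values into a fresh List table and evaluate the CASE query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(v,) for v in values])
    result = conn.execute(CASE_QUERY).fetchone()[0]
    conn.close()
    return result


print(next_seq_nbr([1, 2, 3]))     # full list: first WHEN clause
print(next_seq_nbr([2, 3]))        # list does not start at 1: second WHEN
print(next_seq_nbr([1, 2, 3, 5]))  # gap in the middle: third WHEN
```

The full list yields NULL (None in Python), the list starting at 2 yields 1, and the list with a gap yields 4, the lowest missing number.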

Simpler methods based on group characteristics would be:

SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
  FROM List AS L1
       LEFT OUTER JOIN
       List AS L2
       ON L1.seq_nbr = L2.seq_nbr - 1
 WHERE L2.seq_nbr IS NULL;
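A short sketch of the outer-join version, using the same kind of throwaway List table. The helper name next_by_outer_join and the sample inputs are hypothetical. The join keeps the rows whose successor is missing, and COALESCE supplies 1 when the table is empty.

```python
import sqlite3

OUTER_JOIN_QUERY = """
SELECT COALESCE(MIN(L1.seq_nbr) + 1, 1)
  FROM List AS L1
       LEFT OUTER JOIN
       List AS L2
       ON L1.seq_nbr = L2.seq_nbr - 1
 WHERE L2.seq_nbr IS NULL
"""


def next_by_outer_join(values):
    """Load values into a fresh List table and run the outer-join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE List (seq_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO List VALUES (?)", [(v,) for v in values])
    result = conn.execute(OUTER_JOIN_QUERY).fetchone()[0]
    conn.close()
    return result


print(next_by_outer_join([1, 2, 3, 5]))  # lowest gap after the first run
print(next_by_outer_join([1, 2, 3]))     # no gap: one past the maximum
print(next_by_outer_join([]))            # empty table: COALESCE gives 1
```

These calls give 4, 4, and 1. As written, this version always looks to the right of an existing row, so a non-empty list that starts above 1 gets the number after its first run rather than 1.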

Date posted: 06/07/2014, 09:20
