
Joe Celko's SQL for Smarties - Advanced SQL Programming P68


642 CHAPTER 29: TEMPORAL QUERIES

1½ hens * 1½ days * rate = 1½ eggs; multiply by eggs per hen

1½ days * rate = 1 egg per hen; divide by the number of hens

rate = ⅔ egg per hen per day; divide by 1½ days

If you still do not get it, draw a graph.

Almost every SQL implementation has a DATE data type, but the functions available for that data type vary quite a bit. The most common ones are a constructor that builds a date from integers or strings; extractors to pull out the month, day, or year; and some display options to format output.

You can assume that your SQL implementation has simple date arithmetic functions, although with different syntax from product to product, such as:

1. A date plus or minus a number of days yields a new date.

2. A date minus a second date yields an integer number of days.

Table 29.1 shows the valid combinations of <datetime> and <interval> data types in Standard SQL:

Table 29.1 Valid Combinations of Temporal Data Types in Standard SQL

<datetime> - <datetime> = <interval>
<datetime> + <interval> = <datetime>
<interval> (* or /) <numeric> = <interval>
<interval> + <datetime> = <datetime>
<interval> + <interval> = <interval>
<numeric> * <interval> = <interval>


Other rules, dealing with time zones and the relative precision of the two operands, are intuitively obvious.
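The combinations in Table 29.1 can be tried outside of SQL as well. The following is a minimal sketch in Python, assuming that `datetime.date` stands in for <datetime> and `datetime.timedelta` for a day-based <interval>; the variable names are illustrative only and do not come from the text.

```python
from datetime import date, timedelta

trade_date = date(2005, 2, 1)

# <datetime> + <interval> = <datetime>
later = trade_date + timedelta(days=3)

# <interval> + <datetime> = <datetime> (the operands commute)
same_later = timedelta(days=3) + trade_date

# <datetime> - <datetime> = <interval>
gap = date(2005, 3, 1) - trade_date

# <interval> * <numeric> = <interval> and <numeric> * <interval> = <interval>
doubled = gap * 2
also_doubled = 2 * gap

# <interval> / <numeric> = <interval>
halved = gap / 2

# <interval> + <interval> = <interval>
total = gap + timedelta(days=1)

print(later, gap.days, doubled.days, halved.days, total.days)
```

Note that adding two dates is not in the table, and Python rejects it with a TypeError for the same reason: the sum of two points in time has no meaning.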

There should also be a function that returns the current date from the system clock. This function has a different name with each vendor: TODAY, SYSDATE, NOW(), CURRENT DATE, and getdate() are some examples. There may also be a function to return the day of the week from a date, which is sometimes called DOW() or WEEKDAY(). Standard SQL provides for CURRENT_DATE, CURRENT_TIME [(<time precision>)], and CURRENT_TIMESTAMP [(<timestamp precision>)] functions, which are self-explanatory.

29.2 Personal Calendars

One of the most common applications of dates is to build calendars that list upcoming events or actions to be taken by their user. People have no trouble with using a paper calendar to trigger their own actions, but the idea of having an internal calendar as a table in their database is somehow strange. Programmers seem to prefer to write a function that calculates the date and matches it to events.

It is easier to create a table for cyclic data than people initially think. The months and days of the week within a year repeat themselves in a cycle of 28 years, as long as the span does not cross a century year that is not a leap year (such as 1900 or 2100). A table of just more than 10,000 rows can hold a complete cycle. The full Gregorian cycle repeats itself every 400 years, so today is on the same day of the week that it was on 400 years ago.
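The two cycle lengths are easy to check with ordinary date arithmetic. The sketch below uses Python's proleptic Gregorian `datetime.date`; the sample dates are arbitrary choices made for illustration.

```python
from datetime import date

# One full 400-year Gregorian cycle: 400 * 365 days plus 97 leap days.
days_in_400_years = (date(2400, 1, 1) - date(2000, 1, 1)).days
whole_weeks = days_in_400_years % 7 == 0

# Any date and the same date 400 years later fall on the same weekday.
same_weekday_400 = date(1614, 7, 6).weekday() == date(2014, 7, 6).weekday()

# Between 1901 and 2099 every fourth year is a leap year,
# so a 28-year span is also a whole number of weeks.
days_in_28_years = (date(2028, 1, 1) - date(2000, 1, 1)).days
same_weekday_28 = date(2000, 3, 1).weekday() == date(2028, 3, 1).weekday()

# Across 1900, which was not a leap year, the 28-year shortcut breaks.
breaks_across_1900 = date(1890, 3, 1).weekday() != date(1918, 3, 1).weekday()

print(days_in_400_years, days_in_28_years)
```

The 28-year span comes to 10,227 days, which is where the figure of just more than 10,000 rows comes from.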

As an example, consider the rule that a stockbroker must settle a transaction within three business days after a trade. Business days are defined as excluding Saturdays, Sundays, and certain holidays. The holidays are determined at the start of the year by the New York Stock Exchange, but this can be changed by an act of Congress or presidential decree, or the SEC can order that trading stop in a security. The problem is how to write an SQL query that will return the proper settlement date given a trade date.

There are several tricks in this problem. The real trick is to decide what you want and not to be fooled by what you have. You have a list of holidays, but you want a list of settlement days. Let's start with a table of the given holidays and their names:

CREATE TABLE Holidays -- insert holiday list into this table
(holiday_date DATE NOT NULL PRIMARY KEY,
 holiday_name CHAR(20) NOT NULL);


The next step is to build a table of trade and settlement dates for the whole year. Building the INSERT INTO statements to load the second table is easily done with a spreadsheet; these always have good date functions.

Let's start by building a simple list of the dates over the range we want to use and putting them into a table called Settlements:

CREATE TABLE Settlements
(trade_date DATE NOT NULL PRIMARY KEY,
 settle_date DATE NOT NULL);

INSERT INTO Settlements
VALUES ('2005-02-01', '2005-02-01'),
       ('2005-02-02', '2005-02-02'),
       ('2005-02-03', '2005-02-03');
-- etc.

We know that we cannot trade on a holiday or weekend. You probably could have excluded weekends in the spreadsheet, but if not, use this statement:

DELETE FROM Settlements
 WHERE trade_date IN (SELECT holiday_date FROM Holidays)
    OR DayOfWeek(trade_date) IN ('Saturday', 'Sunday');

This does not handle the holiday settlements, however. A holiday can fall on a weekend, in which case we just handled it; it can last only one day; or it can last any number of days. The table of holidays is built on the assumption that each day of a multiday holiday has a row in the table.

We now have to update the table so that the regular settlement days are three business days forward of the trade date. But we have all the business days in the trade_date column of the Settlements table now.

UPDATE Settlements
   SET settle_date
       = (SELECT S1.trade_date
            FROM Settlements AS S1
           WHERE Settlements.trade_date < S1.trade_date
             AND (SELECT COUNT(*)
                    FROM Settlements AS S2
                   WHERE S2.trade_date > Settlements.trade_date -- do not count the trade date itself
                     AND S2.trade_date <= S1.trade_date) = 3);

The final settlement table will be about 250 rows per year and only two columns wide. This is quite small; it will fit into main storage easily on any machine. Finding the settlement day is a straight, simple query; if you had built only the Holidays table, you would have had to provide procedural code.
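The whole procedure can be tried end to end in a small script. The following is a sketch only, under several assumptions: it uses SQLite through Python's `sqlite3` module, stores dates as ISO-8601 text, replaces the vendor DayOfWeek() call with SQLite's `strftime('%w', ...)`, loads a single illustrative holiday, and leaves settle_date nullable because the last few trade dates in the loaded range have no third business day after them. The `settle` helper is a hypothetical name.

```python
import sqlite3
from datetime import date, timedelta

# Illustrative holiday list; a real table would hold the exchange's full list.
holidays = [("2005-02-21", "Presidents Day")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Holidays (holiday_date TEXT NOT NULL PRIMARY KEY,"
             " holiday_name TEXT NOT NULL)")
conn.execute("CREATE TABLE Settlements (trade_date TEXT NOT NULL PRIMARY KEY,"
             " settle_date TEXT)")
conn.executemany("INSERT INTO Holidays VALUES (?, ?)", holidays)

# Load every calendar day of February 2005.
day = date(2005, 2, 1)
while day.month == 2:
    conn.execute("INSERT INTO Settlements VALUES (?, ?)",
                 (day.isoformat(), day.isoformat()))
    day += timedelta(days=1)

# Drop holidays and weekends; '%w' yields '0' for Sunday and '6' for Saturday.
conn.execute("""
    DELETE FROM Settlements
     WHERE trade_date IN (SELECT holiday_date FROM Holidays)
        OR strftime('%w', trade_date) IN ('0', '6')""")

# The settlement date is the third business day after the trade date.
conn.execute("""
    UPDATE Settlements
       SET settle_date =
           (SELECT S1.trade_date
              FROM Settlements AS S1
             WHERE S1.trade_date > Settlements.trade_date
               AND (SELECT COUNT(*)
                      FROM Settlements AS S2
                     WHERE S2.trade_date > Settlements.trade_date
                       AND S2.trade_date <= S1.trade_date) = 3)""")


def settle(trade_date):
    """Look up the settlement date; None if the day is not a trade date."""
    row = conn.execute(
        "SELECT settle_date FROM Settlements WHERE trade_date = ?",
        (trade_date,)).fetchone()
    return row[0] if row else None


print(settle("2005-02-01"), settle("2005-02-17"))
```

Once the table is loaded, the lookup is a single-row primary key search, which is the point of building the table instead of computing the date each time.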

29.3 Time Series

One of the major problems in the real world is how to handle a series of events that occur in the same time period or in some particular order. The code is tricky and a bit hard to understand, but the basic idea is that you have a table with start and stop times for events, and you want to get information about them as a group.

The timeline can be partitioned into intervals, and a set of intervals can be drawn from that partition for reporting. One of the stock questions on an employment form asks the prospective employee to explain any gaps in his record of employment. Most of the time this gap means that you were unemployed. If you are in data processing, you answer that you were consulting, which is a synonym for unemployed.

Given this table, how would you write an SQL query to display the time periods and their durations for each of the candidates? You will have to assume that your version of SQL has DATE functions that can do some simple calendar math.

CREATE TABLE JobApps
(candidate CHAR(25) NOT NULL,
 jobtitle CHAR(15) NOT NULL,
 start_date DATE NOT NULL,
 end_date DATE, -- null means still employed
 CONSTRAINT started_before_ended
 CHECK (start_date <= end_date));


Notice that the end date of the current job is set to NULL because SQL does not support an "eternity" or "end of time" value for temporal data types. Using '9999-12-31 23:59:59.999999', the highest possible date value that SQL can represent, is not a correct model and can cause problems when you do temporal arithmetic. The NULL can be handled with a COALESCE() function in the code, as I will demonstrate later.

It is obvious that this has to be a self-JOIN query, so you have to do some date arithmetic. The first day of each gap is the last day of an employment period plus one day, and the last day of each gap is the first day of the next job minus one day. This start-point and end-point problem is the reason that SQL defined the OVERLAPS predicate the way it did.

All versions of SQL support temporal data types and arithmetic. But unfortunately, no two implementations look alike, and few look like the ANSI standard. The first attempt at this query is usually something like the following, which will produce the right results, but with a lot of extra rows that are just plain wrong. Assume that if I add a number of days to a date, or subtract a number of days from it, I get a new date.

SELECT J1.candidate,
       (J1.end_date + INTERVAL '1' DAY) AS gap_start,
       (J2.start_date - INTERVAL '1' DAY) AS gap_end,
       (J2.start_date - J1.end_date) AS gaplength
  FROM JobApps AS J1, JobApps AS J2
 WHERE J1.candidate = J2.candidate
   AND (J1.end_date + INTERVAL '1' DAY) < J2.start_date;

Here is why this does not work. Imagine that we have a table that includes a candidate named 'Bill Jones' with the following work history:

JobApps
candidate     jobtitle        start_date    end_date
=======================================================
'John Smith'  'Vice Pres'     '1999-01-10'  '1999-12-31'
'John Smith'  'President'     '2000-01-12'  '2001-12-31'
'Bill Jones'  'Scut Worker'   '2000-02-24'  '2000-04-21'
'Bill Jones'  'Manager'       '2001-01-01'  '2001-01-05'
'Bill Jones'  'Grand Poobah'  '2001-04-04'  '2001-05-15'


We would get this as a result:

Result
candidate     gap_start     gap_end       gaplength
==================================================
'John Smith'  '2000-01-01'  '2000-01-11'  12
'Bill Jones'  '2000-04-22'  '2000-12-31'  255
'Bill Jones'  '2001-01-06'  '2001-04-03'  89
'Bill Jones'  '2000-04-22'  '2001-04-03'  348 <= false data

The problem is that the 'John Smith' row looks just fine and can fool you into thinking that you are doing fine. He had two jobs; therefore, there was one gap in between. However, 'Bill Jones' cannot be right: only two gaps separate three jobs, yet the query shows three gaps. The query does its JOIN on all possible combinations of start and end dates in the original table. This gives false data in the results by counting the end of one job, 'Scut Worker', and the start of another, 'Grand Poobah', as a gap. The idea is to pair each job only with the next job that starts after it ends. This can be done with a MIN() function and a correlated subquery. The final result is this:

SELECT J1.candidate,
       (J1.end_date + INTERVAL '1' DAY) AS gap_start,
       (J2.start_date - INTERVAL '1' DAY) AS gap_end
  FROM JobApps AS J1, JobApps AS J2
 WHERE J1.candidate = J2.candidate
   AND J2.start_date
       = (SELECT MIN(J3.start_date)
            FROM JobApps AS J3
           WHERE J3.candidate = J1.candidate
             AND J3.start_date > J1.end_date)
   AND (J1.end_date + INTERVAL '1' DAY)
       <= (J2.start_date - INTERVAL '1' DAY) -- keep one-day gaps
UNION ALL
SELECT J1.candidate,
       MAX(J1.end_date) + INTERVAL '1' DAY,
       CURRENT_DATE
  FROM JobApps AS J1
 GROUP BY J1.candidate
HAVING COUNT(*) = COUNT(DISTINCT J1.end_date); -- no NULL end_date, so no current job


The length of the gap can be determined with simple temporal arithmetic. The purpose of the UNION ALL is to add the current period of unemployment, if any, to the final answer.
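The effect of the MIN() subquery can be seen by running the first half of the query against the sample work history. This is a sketch in the SQLite dialect via Python's `sqlite3` module, not Standard SQL: dates are stored as ISO-8601 text, SQLite's `date(x, '+1 day')` modifier replaces the INTERVAL arithmetic, and the UNION ALL branch for current unemployment is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE JobApps
    (candidate TEXT NOT NULL,
     jobtitle TEXT NOT NULL,
     start_date TEXT NOT NULL,
     end_date TEXT,  -- NULL means still employed
     CHECK (start_date <= end_date))""")
conn.executemany("INSERT INTO JobApps VALUES (?, ?, ?, ?)", [
    ("John Smith", "Vice Pres", "1999-01-10", "1999-12-31"),
    ("John Smith", "President", "2000-01-12", "2001-12-31"),
    ("Bill Jones", "Scut Worker", "2000-02-24", "2000-04-21"),
    ("Bill Jones", "Manager", "2001-01-01", "2001-01-05"),
    ("Bill Jones", "Grand Poobah", "2001-04-04", "2001-05-15"),
])

# Pair each job only with the earliest job that starts after it ends.
gaps = conn.execute("""
    SELECT J1.candidate,
           date(J1.end_date, '+1 day') AS gap_start,
           date(J2.start_date, '-1 day') AS gap_end
      FROM JobApps AS J1, JobApps AS J2
     WHERE J1.candidate = J2.candidate
       AND J2.start_date = (SELECT MIN(J3.start_date)
                              FROM JobApps AS J3
                             WHERE J3.candidate = J1.candidate
                               AND J3.start_date > J1.end_date)
       AND date(J1.end_date, '+1 day') <= date(J2.start_date, '-1 day')
     ORDER BY J1.candidate, gap_start""").fetchall()

for row in gaps:
    print(row)
```

With the subquery in place, 'Bill Jones' gets exactly two gaps, and the false gap running from 'Scut Worker' to 'Grand Poobah' is gone.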

Given a series of jobs that can start and stop at any time, how can you be sure that an employee doing all these jobs was really working without any gaps? Let's build a table of timesheets for one employee.

CREATE TABLE Timesheets
(job_code CHAR(5) NOT NULL PRIMARY KEY,
 start_date DATE NOT NULL,
 end_date DATE NOT NULL,
 CHECK (start_date <= end_date));

INSERT INTO Timesheets (job_code, start_date, end_date)
VALUES ('j1', '2008-01-01', '2008-01-03'),
       ('j2', '2008-01-06', '2008-01-10'),
       ('j3', '2008-01-05', '2008-01-08'),
       ('j4', '2008-01-20', '2008-01-25'),
       ('j5', '2008-01-18', '2008-01-23'),
       ('j6', '2008-02-01', '2008-02-05'),
       ('j7', '2008-02-03', '2008-02-08'),
       ('j8', '2008-02-07', '2008-02-11'),
       ('j9', '2008-02-09', '2008-02-10'),
       ('j10', '2008-02-01', '2008-02-11'),
       ('j11', '2008-03-01', '2008-03-05'),
       ('j12', '2008-03-04', '2008-03-09'),
       ('j13', '2008-03-08', '2008-03-14'),
       ('j14', '2008-03-13', '2008-03-20');

The most immediate answer is to build a search condition for all of the characteristics of a continuous time period.

This algorithm is due to Mike Arney, a DBA at BORN Consulting. It uses derived tables to get the extreme start and ending dates of a contiguous run of durations.

SELECT Early.start_date, MIN(Latest.end_date) AS end_date
  FROM (SELECT DISTINCT start_date
          FROM Timesheets AS T1
         WHERE NOT EXISTS
               (SELECT *
                  FROM Timesheets AS T2
                 WHERE T2.start_date < T1.start_date
                   AND T2.end_date >= T1.start_date)
       ) AS Early (start_date)
       INNER JOIN
       (SELECT DISTINCT end_date
          FROM Timesheets AS T3
         WHERE NOT EXISTS
               (SELECT *
                  FROM Timesheets AS T4
                 WHERE T4.end_date > T3.end_date
                   AND T4.start_date <= T3.end_date)
       ) AS Latest (end_date)
       ON Early.start_date <= Latest.end_date
 GROUP BY Early.start_date;

Result

start_date end_date

===========================

'2008-01-01' '2008-01-03'

'2008-01-05' '2008-01-10'

'2008-01-18' '2008-01-25'

'2008-02-01' '2008-02-11'

'2008-03-01' '2008-03-20'
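The result can be reproduced with a short script. This is a sketch in the SQLite dialect through Python's `sqlite3` module, with dates stored as ISO-8601 text so that string comparison matches date order. SQLite does not accept a column list after a derived-table alias, so `AS Early (start_date)` is written as plain `AS Early` here; the inner column names carry through unchanged.

```python
import sqlite3

timesheets = [
    ("j1", "2008-01-01", "2008-01-03"), ("j2", "2008-01-06", "2008-01-10"),
    ("j3", "2008-01-05", "2008-01-08"), ("j4", "2008-01-20", "2008-01-25"),
    ("j5", "2008-01-18", "2008-01-23"), ("j6", "2008-02-01", "2008-02-05"),
    ("j7", "2008-02-03", "2008-02-08"), ("j8", "2008-02-07", "2008-02-11"),
    ("j9", "2008-02-09", "2008-02-10"), ("j10", "2008-02-01", "2008-02-11"),
    ("j11", "2008-03-01", "2008-03-05"), ("j12", "2008-03-04", "2008-03-09"),
    ("j13", "2008-03-08", "2008-03-14"), ("j14", "2008-03-13", "2008-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Timesheets (job_code TEXT NOT NULL PRIMARY KEY,"
             " start_date TEXT NOT NULL, end_date TEXT NOT NULL,"
             " CHECK (start_date <= end_date))")
conn.executemany("INSERT INTO Timesheets VALUES (?, ?, ?)", timesheets)

# Early: start dates not covered by a job that began sooner.
# Latest: end dates not covered by a job that finishes later.
runs = conn.execute("""
    SELECT Early.start_date, MIN(Latest.end_date) AS end_date
      FROM (SELECT DISTINCT start_date
              FROM Timesheets AS T1
             WHERE NOT EXISTS
                   (SELECT * FROM Timesheets AS T2
                     WHERE T2.start_date < T1.start_date
                       AND T2.end_date >= T1.start_date)) AS Early
           INNER JOIN
           (SELECT DISTINCT end_date
              FROM Timesheets AS T3
             WHERE NOT EXISTS
                   (SELECT * FROM Timesheets AS T4
                     WHERE T4.end_date > T3.end_date
                       AND T4.start_date <= T3.end_date)) AS Latest
           ON Early.start_date <= Latest.end_date
     GROUP BY Early.start_date
     ORDER BY Early.start_date""").fetchall()

for start_date, end_date in runs:
    print(start_date, end_date)
```

Each run start is matched with the nearest run end on or after it, which is why MIN() over the joined end dates is enough.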

Another way of doing this is the following query, which also tells you which jobs bound the continuous periods.

SELECT T2.start_date,
       MAX(T1.end_date) AS finish_date,
       MAX(T1.job_code || ' to ' || T2.job_code) AS job_code_pair
  FROM Timesheets AS T1, Timesheets AS T2
 WHERE T2.job_code <> T1.job_code
   AND T1.start_date BETWEEN T2.start_date AND T2.end_date
   AND T2.end_date BETWEEN T1.start_date AND T1.end_date
 GROUP BY T2.start_date;


Result
start_date    finish_date   job_code_pair
=========================================
'2008-01-05'  '2008-01-10'  'j2 to j3'
'2008-01-18'  '2008-01-25'  'j4 to j5'
'2008-02-01'  '2008-02-08'  'j7 to j6'
'2008-02-03'  '2008-02-11'  'j8 to j7'

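-- Remove any row whose job list is contained in a longer one.
DELETE FROM Results
 WHERE EXISTS
       (SELECT R1.job_code_list
          FROM Results AS R1
         WHERE R1.job_code_list <> Results.job_code_list -- a row always contains itself
           AND POSITION(Results.job_code_list IN R1.job_code_list) > 0);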

A third solution will handle an isolated job like 'j1', as well as three or more overlapping jobs, like 'j6', 'j7', and 'j8'.

SELECT T1.start_date,
       MIN(T2.end_date) AS finish_date,
       MIN(T2.end_date + INTERVAL '1' DAY)
       - MIN(T1.start_date) AS duration -- find any (T1.start_date)
  FROM Timesheets AS T1, Timesheets AS T2
 WHERE T2.start_date >= T1.start_date
   AND T2.end_date >= T1.end_date
   AND NOT EXISTS
       (SELECT *
          FROM Timesheets AS T3
         WHERE (T3.start_date <= T2.end_date
                AND T3.end_date > T2.end_date)
            OR (T3.end_date >= T1.start_date
                AND T3.start_date < T1.start_date))
 GROUP BY T1.start_date;
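To see how the third solution treats the isolated job and the overlapping runs, it can be run against the sample timesheets. The sketch below again assumes SQLite through Python's `sqlite3` module with ISO-8601 text dates; because SQLite has no INTERVAL type, the duration is computed with `julianday()` instead, which is a substitution and not part of the original query.

```python
import sqlite3

timesheets = [
    ("j1", "2008-01-01", "2008-01-03"), ("j2", "2008-01-06", "2008-01-10"),
    ("j3", "2008-01-05", "2008-01-08"), ("j4", "2008-01-20", "2008-01-25"),
    ("j5", "2008-01-18", "2008-01-23"), ("j6", "2008-02-01", "2008-02-05"),
    ("j7", "2008-02-03", "2008-02-08"), ("j8", "2008-02-07", "2008-02-11"),
    ("j9", "2008-02-09", "2008-02-10"), ("j10", "2008-02-01", "2008-02-11"),
    ("j11", "2008-03-01", "2008-03-05"), ("j12", "2008-03-04", "2008-03-09"),
    ("j13", "2008-03-08", "2008-03-14"), ("j14", "2008-03-13", "2008-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Timesheets (job_code TEXT NOT NULL PRIMARY KEY,"
             " start_date TEXT NOT NULL, end_date TEXT NOT NULL)")
conn.executemany("INSERT INTO Timesheets VALUES (?, ?, ?)", timesheets)

# T1 must open a run and T2 must close one; T3 is any job that would
# extend the run past T2's end or back before T1's start.
periods = conn.execute("""
    SELECT T1.start_date,
           MIN(T2.end_date) AS finish_date,
           CAST(julianday(MIN(T2.end_date), '+1 day')
                - julianday(MIN(T1.start_date)) AS INTEGER) AS duration
      FROM Timesheets AS T1, Timesheets AS T2
     WHERE T2.start_date >= T1.start_date
       AND T2.end_date >= T1.end_date
       AND NOT EXISTS
           (SELECT * FROM Timesheets AS T3
             WHERE (T3.start_date <= T2.end_date
                    AND T3.end_date > T2.end_date)
                OR (T3.end_date >= T1.start_date
                    AND T3.start_date < T1.start_date))
     GROUP BY T1.start_date
     ORDER BY T1.start_date""").fetchall()

for row in periods:
    print(row)
```

The isolated job 'j1' pairs with itself as both T1 and T2, so it shows up as a three-day period of its own.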

You will also want to look at how to consolidate overlapping intervals of integers.

A fourth solution uses the auxiliary Calendar table (see Section 29.9 for details) to find the dates that are and are not covered by any of the durations. The coverage flag and calendar date can then be used directly by other queries that need to look at the status of single days instead of date ranges.

SELECT C1.cal_date,
       SUM(DISTINCT
           CASE
           WHEN C1.cal_date BETWEEN T1.start_date AND T1.end_date
           THEN 1 ELSE 0 END) AS covered_date_flag
  FROM Calendar AS C1, Timesheets AS T1
 WHERE C1.cal_date BETWEEN (SELECT MIN(start_date) FROM Timesheets)
                       AND (SELECT MAX(end_date) FROM Timesheets)
 GROUP BY C1.cal_date;

This is reasonably fast because the WHERE clause uses static scalar queries to set the bounds, and the Calendar table uses cal_date as a primary key, so it will have an index.
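A small run shows what the coverage flag looks like. This is a sketch under stated assumptions: the Calendar table here is a minimal stand-in with only a cal_date column covering 2008 (the real table from Section 29.9 is not shown in this excerpt), only the first three sample jobs are loaded, and the engine is SQLite via Python's `sqlite3` module with ISO-8601 text dates.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calendar (cal_date TEXT NOT NULL PRIMARY KEY)")
conn.execute("CREATE TABLE Timesheets (job_code TEXT NOT NULL PRIMARY KEY,"
             " start_date TEXT NOT NULL, end_date TEXT NOT NULL)")

# Stand-in Calendar: one row per day of 2008.
day = date(2008, 1, 1)
while day.year == 2008:
    conn.execute("INSERT INTO Calendar VALUES (?)", (day.isoformat(),))
    day += timedelta(days=1)

conn.executemany("INSERT INTO Timesheets VALUES (?, ?, ?)", [
    ("j1", "2008-01-01", "2008-01-03"),
    ("j2", "2008-01-06", "2008-01-10"),
    ("j3", "2008-01-05", "2008-01-08"),
])

# SUM(DISTINCT ...) over the values {0, 1} is 1 when any job covers
# the date and 0 when none does.
flags = dict(conn.execute("""
    SELECT C1.cal_date,
           SUM(DISTINCT
               CASE WHEN C1.cal_date BETWEEN T1.start_date AND T1.end_date
                    THEN 1 ELSE 0 END) AS covered_date_flag
      FROM Calendar AS C1, Timesheets AS T1
     WHERE C1.cal_date BETWEEN (SELECT MIN(start_date) FROM Timesheets)
                           AND (SELECT MAX(end_date) FROM Timesheets)
     GROUP BY C1.cal_date""").fetchall())

uncovered = sorted(d for d, flag in flags.items() if flag == 0)
print(uncovered)
```

Only the dates between the earliest start and the latest end are returned, so the rest of the Calendar rows never reach the GROUP BY.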

A slightly different version of the problem is to group contiguous measurements into durations that have the value of that measurement. I have the following table:

CREATE TABLE Calibrations
(start_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
   NOT NULL PRIMARY KEY,
 end_time TIMESTAMP NOT NULL,
 CHECK (end_time = start_time + INTERVAL '1' MINUTE),
 cal_value INTEGER NOT NULL);

The table has this data:

Calibrations

start_time end_time cal_value

==========================================================

'2005-05-11 02:52:00.000' '2005-05-11 02:53:00.000' 8

'2005-05-11 02:53:00.000' '2005-05-11 02:54:00.000' 8

'2005-05-11 02:54:00.000' '2005-05-11 02:55:00.000' 8

'2005-05-11 02:55:00.000' '2005-05-11 02:56:00.000' 8

'2005-05-11 02:56:00.000' '2005-05-11 02:57:00.000' 8

'2005-05-11 02:57:00.000' '2005-05-11 02:58:00.000' 9

'2005-05-11 02:58:00.000' '2005-05-11 02:59:00.000' 9

'2005-05-11 02:59:00.000' '2005-05-11 03:00:00.000' 9
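To make the goal concrete, the desired output for this data can be worked out procedurally. The following Python sketch is not the SQL solution; it only illustrates what "grouping contiguous measurements into durations" should produce, assuming that rows are contiguous when one row's end_time equals the next row's start_time. The `collapse` function is a hypothetical helper.

```python
# (start_time, end_time, cal_value) rows copied from the Calibrations data.
calibrations = [
    ("2005-05-11 02:52:00.000", "2005-05-11 02:53:00.000", 8),
    ("2005-05-11 02:53:00.000", "2005-05-11 02:54:00.000", 8),
    ("2005-05-11 02:54:00.000", "2005-05-11 02:55:00.000", 8),
    ("2005-05-11 02:55:00.000", "2005-05-11 02:56:00.000", 8),
    ("2005-05-11 02:56:00.000", "2005-05-11 02:57:00.000", 8),
    ("2005-05-11 02:57:00.000", "2005-05-11 02:58:00.000", 9),
    ("2005-05-11 02:58:00.000", "2005-05-11 02:59:00.000", 9),
    ("2005-05-11 02:59:00.000", "2005-05-11 03:00:00.000", 9),
]


def collapse(rows):
    """Merge adjacent rows that share a cal_value into one duration."""
    durations = []
    for start, end, value in sorted(rows):
        last = durations[-1] if durations else None
        # Extend the current duration only if the value is unchanged and
        # this row starts exactly where the previous one ended.
        if last and last[2] == value and last[1] == start:
            durations[-1] = (last[0], end, value)
        else:
            durations.append((start, end, value))
    return durations


for duration in collapse(calibrations):
    print(duration)
```

For the sample rows this yields one duration for the value 8 and one for the value 9.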

Posted: 06/07/2014, 09:20
