celebration. A little algebra tells you that the length of an event is (Event.finish_date - Event.start_date + INTERVAL '1' DAY) and that the length of a guest’s stay is (Guest.depart_date - Guest.arrival_date + INTERVAL '1' DAY). Let’s do one of those timeline charts again (Figure 13.3).

What we want is the part of the Guests interval that is inside the Celebrations interval. Guests 1 and 2 spent only part of their time at the celebration; Guest 3 spent all of his time at the celebration, and Guest 4 stayed even longer than the celebration. That interval is defined by the two points (GREATEST(arrival_date, start_date), LEAST(depart_date, finish_date)). Not every SQL product provides GREATEST() and LEAST(), however. Instead, you can use the aggregate functions in SQL to build a VIEW on a VIEW, like this:
CREATE VIEW Working (guest_name, celeb_name, entered, exited)
AS SELECT GE.guest_name, GE.celeb_name, start_date, finish_date
FROM GuestCelebrations AS GE, Celebrations AS E1
WHERE E1.celeb_name = GE.celeb_name
UNION
SELECT GE.guest_name, GE.celeb_name, arrival_date, depart_date FROM GuestCelebrations AS GE, Guests AS G1
WHERE G1.guest_name = GE.guest_name;
Figure 13.3 Timeline Diagram.

VIEW Working
guest_name       celeb_name            entered       exited
================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Dorothy Gale'   'St Fred's Day'       '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-28'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Apple Month'         '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-10-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Don Quixote'    'St Fred's Day'       '2005-01-01'  '2005-10-01'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T Kirk'   'Apple Month'         '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'James T Kirk'   'St Fred's Day'       '2005-02-01'  '2005-02-28'
'James T Kirk'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'James T Kirk'   'Year of the Prune'   '2005-02-01'  '2005-02-28'
'James T Kirk'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
This puts the starting points of both intervals into one column (entered) and the ending points of both intervals into another (exited). Now we can construct a VIEW like this:
CREATE VIEW Attendees (guest_name, celeb_name, entered, exited)
AS SELECT guest_name, celeb_name, MAX(entered), MIN(exited)
FROM Working
GROUP BY guest_name, celeb_name;
VIEW Attendees
guest_name       celeb_name            entered       exited
===============================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'Dorothy Gale'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T Kirk'   'Apple Month'         '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'James T Kirk'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'James T Kirk'   'Year of the Prune'   '2005-02-01'  '2005-02-28'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
The Attendees VIEW can be used to compute the total number of room days for each celebration. Assume that the difference between two dates will return an integer that is the number of days between them:
SELECT celeb_name,
SUM(exited - entered + INTERVAL '1' DAY) AS roomdays
FROM Attendees
GROUP BY celeb_name;
Result
celeb_name            roomdays
============================
'Apple Month'               85
'Christmas Season'          25
'Garlic Festival'           63
'National Pear Week'         7
'New Year's Day'             1
'St Fred's Day'              3
'Year of the Prune'        602
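The whole chain, from base tables through Working and Attendees to the room-day totals, can be tried out with a small script. What follows is only a sketch under several assumptions: it uses SQLite through Python's sqlite3 module, the stay and event dates are read back from the Working listing, GuestCelebrations is filled with every guest and celebration pair whose intervals overlap (which happens to give the same pairs as the listing), and SQLite's julianday() stands in for the date subtraction, since SQLite has no DATE type or INTERVAL arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Guests
(guest_name TEXT PRIMARY KEY, arrival_date TEXT, depart_date TEXT);
CREATE TABLE Celebrations
(celeb_name TEXT PRIMARY KEY, start_date TEXT, finish_date TEXT);
CREATE TABLE GuestCelebrations
(guest_name TEXT, celeb_name TEXT, PRIMARY KEY (guest_name, celeb_name));
""")

# Stays and events as they can be read off the Working listing.
conn.executemany("INSERT INTO Guests VALUES (?, ?, ?)", [
    ("Dorothy Gale", "2005-02-01", "2005-11-01"),
    ("Indiana Jones", "2005-02-01", "2005-02-01"),
    ("Don Quixote", "2005-01-01", "2005-10-01"),
    ("James T Kirk", "2005-02-01", "2005-02-28"),
    ("Santa Claus", "2005-12-01", "2005-12-25"),
])
conn.executemany("INSERT INTO Celebrations VALUES (?, ?, ?)", [
    ("Apple Month", "2005-02-01", "2005-02-28"),
    ("Christmas Season", "2005-12-01", "2005-12-25"),
    ("Garlic Festival", "2005-01-15", "2005-02-15"),
    ("National Pear Week", "2005-01-01", "2005-01-07"),
    ("New Year's Day", "2005-01-01", "2005-01-01"),
    ("St Fred's Day", "2005-02-24", "2005-02-24"),
    ("Year of the Prune", "2005-01-01", "2005-12-31"),
])

# Assumption: a guest is paired with every celebration that overlaps the stay.
conn.execute("""
INSERT INTO GuestCelebrations (guest_name, celeb_name)
SELECT guest_name, celeb_name
  FROM Guests, Celebrations
 WHERE arrival_date <= finish_date
   AND start_date <= depart_date
""")

# Column names are given with AS aliases rather than a view column list.
conn.executescript("""
CREATE VIEW Working AS
SELECT GE.guest_name AS guest_name, GE.celeb_name AS celeb_name,
       start_date AS entered, finish_date AS exited
  FROM GuestCelebrations AS GE, Celebrations AS E1
 WHERE E1.celeb_name = GE.celeb_name
UNION
SELECT GE.guest_name, GE.celeb_name, arrival_date, depart_date
  FROM GuestCelebrations AS GE, Guests AS G1
 WHERE G1.guest_name = GE.guest_name;

CREATE VIEW Attendees AS
SELECT guest_name, celeb_name,
       MAX(entered) AS entered, MIN(exited) AS exited
  FROM Working
 GROUP BY guest_name, celeb_name;
""")

# julianday() turns ISO-8601 date strings into day numbers that can be subtracted.
room_days = dict(conn.execute("""
SELECT celeb_name,
       CAST(SUM(julianday(exited) - julianday(entered) + 1) AS INTEGER)
  FROM Attendees
 GROUP BY celeb_name
"""))

for celeb_name, days in sorted(room_days.items()):
    print(f"{celeb_name:<20} {days:>4}")
```

Storing the dates as ISO-8601 strings is what lets MAX() and MIN() pick the later start and the earlier finish by plain string comparison; a product with a real DATE type would not need that convention.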
If you would like to get a count of the room days sold in the month of January, you could use this query, which avoids a BETWEEN or
OVERLAPS predicate completely:
SELECT SUM(CASE WHEN depart_date > DATE '2005-01-31'
THEN DATE '2005-01-31'
ELSE depart_date END
- CASE WHEN arrival_date < DATE '2005-01-01'
THEN DATE '2005-01-01'
ELSE arrival_date END + INTERVAL '1' DAY) AS room_days
FROM Guests
WHERE depart_date >= DATE '2005-01-01' AND arrival_date <= DATE '2005-01-31';
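To see the clipping at work, here is a minimal sketch of the January query, again assuming SQLite through Python's sqlite3 module with julianday() in place of date subtraction. The five stays are the ones readable from the Working listing; the sixth guest is hypothetical and was added only to exercise the arrival-date clipping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Guests
(guest_name TEXT PRIMARY KEY, arrival_date TEXT, depart_date TEXT)
""")
conn.executemany("INSERT INTO Guests VALUES (?, ?, ?)", [
    ("Dorothy Gale", "2005-02-01", "2005-11-01"),
    ("Indiana Jones", "2005-02-01", "2005-02-01"),
    ("Don Quixote", "2005-01-01", "2005-10-01"),
    ("James T Kirk", "2005-02-01", "2005-02-28"),
    ("Santa Claus", "2005-12-01", "2005-12-25"),
    # Hypothetical guest whose stay straddles New Year.
    ("Hypothetical Guest", "2004-12-30", "2005-01-02"),
])

# Clip each stay to January, then count days inclusively (+ 1).
# The >= keeps a stay that ends on January 1, which still uses one room day.
january_room_days = int(conn.execute("""
SELECT SUM(julianday(CASE WHEN depart_date > '2005-01-31'
                          THEN '2005-01-31'
                          ELSE depart_date END)
         - julianday(CASE WHEN arrival_date < '2005-01-01'
                          THEN '2005-01-01'
                          ELSE arrival_date END)
         + 1)
  FROM Guests
 WHERE depart_date >= '2005-01-01'
   AND arrival_date <= '2005-01-31'
""").fetchone()[0])

print(january_room_days)
```

With this data only two stays touch January: Don Quixote contributes all 31 days of the month, and the hypothetical guest contributes January 1 and 2.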
CHAPTER 14: The [NOT] IN() Predicate
THE IN() PREDICATE IS very natural. It takes a value and sees whether that value is in a list of comparable values. Standard SQL allows value expressions in the list, or for you to use a query to construct the list. The syntax is:
<in predicate> ::=
<row value constructor> [NOT] IN <in predicate value>
<in predicate value> ::=
<table subquery> | (<in value list>)
<in value list> ::=
<row value expression> { <comma> <row value expression> }...
The expression <row value constructor> NOT IN <in predicate value> has the same effect as NOT (<row value constructor> IN <in predicate value>). This pattern for the use of the keyword NOT is found in most of the other predicates. The expression <row value constructor> IN <in predicate value> has, by definition, the same effect as <row value constructor> = ANY <in predicate value>. Most optimizers will recognize this and execute the same code for both expressions. This means that if the <in predicate value> is empty, such as one you would get from a subquery that returns no rows, the result is FALSE, because = ANY cannot find a match in an empty set. If the <in predicate value> is an explicit list of NULLs, the results will be equivalent to (<row value constructor> = (NULL, ..., NULL)), which is always evaluated to UNKNOWN. Please remember this difference between an empty table and a table with rows of all NULLs.

IN() predicates with a subquery can sometimes be converted into EXISTS predicates, but there are some problems and differences in the predicates. The conversion to an EXISTS predicate is often a good way to improve performance, but it will not be as easy to read as the original IN() predicate. An EXISTS predicate can use indexes to find (or fail to find) a single value that confirms (or denies) the predicate, whereas the IN() predicate often has to build the results of the subquery in a working table.
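The difference between an empty table and a table of NULLs is easy to check on a real engine. The sketch below is one such check, assuming SQLite through Python's sqlite3 module; the table names EmptyList and NullList and the helper truth() are made up for the illustration, and other products should be tested rather than assumed to agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmptyList (x INTEGER);   -- hypothetical table with no rows
CREATE TABLE NullList (x INTEGER);    -- hypothetical table with only NULLs
INSERT INTO NullList VALUES (NULL), (NULL);
""")

def truth(sql):
    """Map SQLite's 1 / 0 / NULL results onto TRUE / FALSE / UNKNOWN."""
    value = conn.execute(sql).fetchone()[0]
    return {1: "TRUE", 0: "FALSE", None: "UNKNOWN"}[value]

results = {
    "1 IN empty": truth("SELECT 1 IN (SELECT x FROM EmptyList)"),
    "1 NOT IN empty": truth("SELECT 1 NOT IN (SELECT x FROM EmptyList)"),
    "1 IN nulls": truth("SELECT 1 IN (SELECT x FROM NullList)"),
    "1 NOT IN nulls": truth("SELECT 1 NOT IN (SELECT x FROM NullList)"),
}

for test, outcome in results.items():
    print(f"{test:<16} {outcome}")
```

In SQLite the empty subquery gives FALSE for IN and TRUE for NOT IN, while the all-NULL subquery gives UNKNOWN for both, so a NOT IN() over a column that may hold NULLs can silently return no rows.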
14.1 Optimizing the IN() Predicate
Most database engines have no statistics about the relative frequency of the values in a list of constants, so they will scan them in the order in which they appear in the list. People like to order lists alphabetically or by magnitude, but it would be better to order the list from most frequently occurring values to least frequent. It is also pointless to have duplicate values in the constant list, since the predicate will return TRUE if it matches the first duplicate it finds, and never get to the second occurrence. Likewise, if the predicate is FALSE for that value, it wastes computer time to traverse a needlessly long list.

Many SQL engines perform an IN() predicate with a subquery by building the result set of the subquery first as a temporary working table, then scanning that result table from left to right. This can be expensive in many cases; for example, in a query to find employees in a city with a major sports team (we want them to get tickets for us), we could write (assuming that city names are unique):
SELECT *
FROM Personnel
WHERE city_name
IN (SELECT city_name FROM SportTeams);
But let us further assume that our personnel are located in (n) cities and the sports teams are in (m) cities, where (m) is much greater than (n). If the matching cities appear near the front of the list generated by the subquery expression, it will perform much faster than if they appear at the end of the list. In the case of a subquery expression, you have no control over how the subquery is presented back in the containing query.

However, you can order the expressions in a list in the order in which they are most likely to occur, such as:
SELECT *
FROM Personnel
WHERE city_name
IN ('New York', 'Chicago', 'Atlanta', ..., 'Austin');
Incidentally, Standard SQL allows row expression comparisons, so if you have a Standard SQL implementation with separate columns for the city and state, you could write:
SELECT *
FROM Personnel
WHERE (city_name, state)
IN (SELECT city_name, state
FROM SportTeams);
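The row comparison matters as soon as city names stop being unique. Here is a small sketch of the difference, assuming SQLite (which accepts row values from version 3.15 on) through Python's sqlite3 module; the emp_name and team_name columns and every row of data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel (emp_name TEXT, city_name TEXT, state TEXT);
CREATE TABLE SportTeams (team_name TEXT, city_name TEXT, state TEXT);

-- Hypothetical data: two Springfields and two Portlands in different states.
INSERT INTO Personnel VALUES
 ('Alice', 'Springfield', 'IL'),
 ('Bob', 'Springfield', 'MO'),
 ('Carol', 'Portland', 'OR');
INSERT INTO SportTeams VALUES
 ('Hypothetical Team A', 'Springfield', 'IL'),
 ('Hypothetical Team B', 'Portland', 'ME');
""")

# Matching on the city name alone picks up every same-named city.
by_city = [row[0] for row in conn.execute("""
SELECT emp_name
  FROM Personnel
 WHERE city_name
    IN (SELECT city_name FROM SportTeams)
 ORDER BY emp_name
""")]

# Matching on the (city_name, state) row keeps only true matches.
by_city_state = [row[0] for row in conn.execute("""
SELECT emp_name
  FROM Personnel
 WHERE (city_name, state)
    IN (SELECT city_name, state FROM SportTeams)
 ORDER BY emp_name
""")]

print(by_city)
print(by_city_state)
```

The single-column version returns all three employees, while the row version returns only the one whose city and state both match a team.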
Teradata did not get correlated subqueries until 1996, so they often used this syntax as a workaround. I am not sure if you should count them as being ahead of or behind the technology for that.
Today, all major versions of SQL remove duplicates in the result table of the subquery, so you do not have to use a SELECT DISTINCT in the subquery. You might see this in legacy code.

A trick that can work for large lists on some products is to force the engine to construct a list ordered by frequency. This involves first constructing a VIEW that has an ORDER BY clause; this practice is not part of the SQL standard, which does not allow a VIEW to have an ORDER BY clause. For example, a paint company wants to find all the products offered by their competitors who use the same color as one of their products. First construct a VIEW that orders the colors by frequency of appearance:
CREATE VIEW PopColor (color, tally)
AS SELECT color, COUNT(*) AS tally
FROM Paints
GROUP BY color
ORDER BY tally DESC;
Then go to the Competitor data and do a simple column SELECT on the VIEW, thus:
SELECT * FROM Competitor WHERE color IN (SELECT color FROM PopColor);
The VIEW is grouped, so it will be materialized in sort order. The subquery will then be executed and (we hope) the sort order will be maintained and passed along to the IN() predicate.

Another trick is to replace the IN() predicate with a JOIN operation. For example, you have a table of restaurant telephone numbers and a guidebook, and you want to pick out the four-star places, so you write this query:
SELECT restaurant_name, phone_nbr FROM Restaurants
WHERE restaurant_name
IN (SELECT restaurant_name FROM QualityGuide WHERE stars = 4);
If there is an index on QualityGuide.stars, the SQL engine will probably build a temporary table of the four-star places and pass it on to the outer query. The outer query will then handle it as if it were a list of constants.

However, this is not the sort of column that you would normally index. Without an index on stars, the engine will simply do a sequential search of the QualityGuide table. This query can be replaced with a JOIN query, thus:
SELECT restaurant_name, phone_nbr FROM Restaurants, QualityGuide WHERE stars = 4
AND Restaurants.restaurant_name = QualityGuide.restaurant_name;
This query should run faster, since restaurant_name is a key for both tables and will be indexed to ensure uniqueness. However, a JOIN can return duplicate rows in the result table, which you can handle with a SELECT DISTINCT. Consider a more budget-minded query, where we want places with a meal that costs $10 or less, and the menu guidebook lists all the meals. The query looks about the same:
SELECT restaurant_name, phone_nbr
FROM Restaurants
WHERE restaurant_name
IN (SELECT restaurant_name
FROM MenuGuide
WHERE price <= 10.00);
And you would expect to be able to replace it with:
SELECT restaurant_name, phone_nbr
FROM Restaurants, MenuGuide
WHERE price <= 10.00
AND Restaurants.restaurant_name = MenuGuide.restaurant_name;
Every item in Murphy’s Two-Dollar Hash House will get a line in the results of the JOINed version. However, this can be fixed by changing SELECT restaurant_name, phone_nbr to SELECT DISTINCT restaurant_name, phone_nbr, but it will cost more time to do a sort to remove the duplicates. There is no good general advice, except to experiment with your particular product.
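The duplicate-row effect is easy to reproduce. The following is a minimal sketch assuming SQLite through Python's sqlite3 module; the meal_name column, the phone numbers, the second restaurant, and the menu rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Restaurants (restaurant_name TEXT PRIMARY KEY, phone_nbr TEXT);
CREATE TABLE MenuGuide (restaurant_name TEXT, meal_name TEXT, price REAL);

-- Hypothetical data: three cheap meals at one place, one costly meal at another.
INSERT INTO Restaurants VALUES
 ('Murphy''s Two-Dollar Hash House', '555-0100'),
 ('Chez Expensive', '555-0199');
INSERT INTO MenuGuide VALUES
 ('Murphy''s Two-Dollar Hash House', 'hash', 2.00),
 ('Murphy''s Two-Dollar Hash House', 'eggs', 2.00),
 ('Murphy''s Two-Dollar Hash House', 'coffee', 1.00),
 ('Chez Expensive', 'truffles', 95.00);
""")

# IN() version: one row per restaurant, however many meals qualify.
in_rows = conn.execute("""
SELECT restaurant_name, phone_nbr
  FROM Restaurants
 WHERE restaurant_name
    IN (SELECT restaurant_name FROM MenuGuide WHERE price <= 10.00)
""").fetchall()

# JOIN version: one row per qualifying meal.
join_rows = conn.execute("""
SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name
""").fetchall()

# SELECT DISTINCT collapses the JOIN result back to one row per restaurant.
distinct_rows = conn.execute("""
SELECT DISTINCT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name
""").fetchall()

print(len(in_rows), len(join_rows), len(distinct_rows))
```

The JOIN version has to qualify restaurant_name in the SELECT list because the column exists in both tables; the row counts say nothing about speed, which still has to be measured on the product in question.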
The NOT IN() predicate is probably better replaced with a NOT EXISTS predicate. Using the restaurant example again, our friend John has a list of eateries and we want to see those that are not in the guidebook. The natural formation of the query is:
SELECT *
FROM JohnsBook
WHERE restaurant_name
NOT IN (SELECT restaurant_name
FROM QualityGuide);
But you can write the same query with a NOT EXISTS predicate and
it will probably run faster: