Joe Celko's SQL for Smarties - Advanced SQL Programming P32


celebration. A little algebra tells you that the length of an event is (Event.finish_date - Event.start_date + INTERVAL '1' DAY) and that the length of a guest's stay is (Guest.depart_date - Guest.arrival_date + INTERVAL '1' DAY). Let's do one of those timeline charts again:

What we want is the part of the Guests interval that is inside the Celebrations interval. Guests 1 and 2 spent only part of their time at the celebration; Guest 3 spent all of his time at the celebration, and Guest 4 stayed even longer than the celebration. That interval is defined by the two points (GREATEST(arrival_date, start_date), LEAST(depart_date, finish_date)). Where GREATEST() and LEAST() are not available, you can instead use the aggregate functions in SQL to build a VIEW on a VIEW, like this:

CREATE VIEW Working (guest_name, celeb_name, entered, exited)
AS SELECT GE.guest_name, GE.celeb_name, start_date, finish_date
     FROM GuestCelebrations AS GE, Celebrations AS E1
    WHERE E1.celeb_name = GE.celeb_name
   UNION
   SELECT GE.guest_name, GE.celeb_name, arrival_date, depart_date
     FROM GuestCelebrations AS GE, Guests AS G1
    WHERE G1.guest_name = GE.guest_name;

VIEW Working

guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Dorothy Gale'   'St Fred's Day'       '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-28'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Apple Month'         '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-10-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Don Quixote'    'St Fred's Day'       '2005-01-01'  '2005-10-01'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T Kirk'   'Apple Month'         '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'James T Kirk'   'St Fred's Day'       '2005-02-01'  '2005-02-28'
'James T Kirk'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'James T Kirk'   'Year of the Prune'   '2005-02-01'  '2005-02-28'
'James T Kirk'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-01-01'  '2005-12-31'

Figure 13.3 Timeline Diagram.

This will put the earliest and latest points in both intervals into one column. Now we can construct a VIEW like this:

CREATE VIEW Attendees (guest_name, celeb_name, entered, exited)
AS SELECT guest_name, celeb_name, MAX(entered), MIN(exited)
     FROM Working
    GROUP BY guest_name, celeb_name;

VIEW Attendees

guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'Dorothy Gale'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'St Fred's Day'       '2005-02-24'  '2005-02-24'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T Kirk'   'Apple Month'         '2005-02-01'  '2005-02-28'
'James T Kirk'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'James T Kirk'   'St Fred's Day'       '2005-02-24'  '2005-02-24'
'James T Kirk'   'Year of the Prune'   '2005-02-01'  '2005-02-28'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
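
As a rough check, the two views can be run against a small in-memory SQLite database driven from Python. This is only a sketch under stated assumptions: the table layouts and the sample rows are inferred from the listings above, the view columns are named with aliases rather than a VIEW column list, and SQLite's two-argument MAX() and MIN() stand in for GREATEST() and LEAST().

```python
import sqlite3

# Assumed schemas; the sample dates are read off the listings above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Celebrations (celeb_name TEXT PRIMARY KEY,
                           start_date TEXT, finish_date TEXT);
CREATE TABLE Guests (guest_name TEXT PRIMARY KEY,
                     arrival_date TEXT, depart_date TEXT);
CREATE TABLE GuestCelebrations (guest_name TEXT, celeb_name TEXT);

INSERT INTO Celebrations VALUES
  ('Apple Month', '2005-02-01', '2005-02-28'),
  ('Garlic Festival', '2005-01-15', '2005-02-15');
INSERT INTO Guests VALUES
  ('Dorothy Gale', '2005-02-01', '2005-11-01'),
  ('Don Quixote', '2005-01-01', '2005-10-01');
INSERT INTO GuestCelebrations VALUES
  ('Dorothy Gale', 'Apple Month'), ('Dorothy Gale', 'Garlic Festival'),
  ('Don Quixote', 'Apple Month'), ('Don Quixote', 'Garlic Festival');

-- Both intervals stacked into one pair of columns.
CREATE VIEW Working AS
SELECT GE.guest_name AS guest_name, GE.celeb_name AS celeb_name,
       E1.start_date AS entered, E1.finish_date AS exited
  FROM GuestCelebrations AS GE, Celebrations AS E1
 WHERE E1.celeb_name = GE.celeb_name
UNION
SELECT GE.guest_name, GE.celeb_name, G1.arrival_date, G1.depart_date
  FROM GuestCelebrations AS GE, Guests AS G1
 WHERE G1.guest_name = GE.guest_name;

-- Latest start and earliest finish per (guest, celebration) pair.
CREATE VIEW Attendees AS
SELECT guest_name, celeb_name,
       MAX(entered) AS entered, MIN(exited) AS exited
  FROM Working
 GROUP BY guest_name, celeb_name;
""")

attendees = con.execute(
    "SELECT * FROM Attendees ORDER BY guest_name, celeb_name").fetchall()

# The same interval computed directly with scalar MAX()/MIN(),
# which play the role of GREATEST()/LEAST() in SQLite.
direct = con.execute("""
SELECT GE.guest_name, GE.celeb_name,
       MAX(G1.arrival_date, E1.start_date),
       MIN(G1.depart_date, E1.finish_date)
  FROM GuestCelebrations AS GE, Guests AS G1, Celebrations AS E1
 WHERE G1.guest_name = GE.guest_name
   AND E1.celeb_name = GE.celeb_name
 ORDER BY GE.guest_name, GE.celeb_name
""").fetchall()

for row in attendees:
    print(row)
```

For this sample the aggregate route and the direct route give the same four intervals. The ISO-8601 date strings compare correctly as text, which is why plain MAX() and MIN() are enough here.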

The Attendees VIEW can be used to compute the total number of room days for each celebration. Assume that the difference between two dates will return an integer that is the number of days between them:

SELECT celeb_name,
       SUM(exited - entered + INTERVAL '1' DAY) AS roomdays
  FROM Attendees
 GROUP BY celeb_name;

Result

celeb_name            roomdays
==============================
'Apple Month'               85
'Christmas Season'          25
'Garlic Festival'           63
'National Pear Week'         7
'New Year's Day'             1
'St Fred's Day'              3
'Year of the Prune'        602
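
The room-day arithmetic can be spot-checked in the same way. The sketch below loads a few rows of the Attendees listing into a plain table (an assumption made to keep the example short) and uses SQLite's julianday() to obtain the integer day difference that the text assumes; other products will have their own date-difference functions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Attendees
               (guest_name TEXT, celeb_name TEXT, entered TEXT, exited TEXT)""")
# Rows copied from the Attendees listing for three of the celebrations.
con.executemany("INSERT INTO Attendees VALUES (?, ?, ?, ?)", [
    ("Dorothy Gale", "Apple Month", "2005-02-01", "2005-02-28"),
    ("Indiana Jones", "Apple Month", "2005-02-01", "2005-02-01"),
    ("Don Quixote", "Apple Month", "2005-02-01", "2005-02-28"),
    ("James T Kirk", "Apple Month", "2005-02-01", "2005-02-28"),
    ("Dorothy Gale", "St Fred's Day", "2005-02-24", "2005-02-24"),
    ("Don Quixote", "St Fred's Day", "2005-02-24", "2005-02-24"),
    ("James T Kirk", "St Fred's Day", "2005-02-24", "2005-02-24"),
    ("Don Quixote", "National Pear Week", "2005-01-01", "2005-01-07"),
])

# Both end points count as room days, hence the "+ 1".
roomdays = dict(con.execute("""
SELECT celeb_name,
       SUM(CAST(ROUND(julianday(exited) - julianday(entered)) AS INTEGER) + 1)
  FROM Attendees
 GROUP BY celeb_name
"""))
print(roomdays)
```

The totals for these three celebrations agree with the Result listing above.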


13.2 OVERLAPS Predicate

If you would like to get a count of the room days sold in the month of January, you could use this query, which avoids a BETWEEN or OVERLAPS predicate completely:

SELECT SUM(CASE WHEN depart_date > DATE '2005-01-31'
                THEN DATE '2005-01-31'
                ELSE depart_date END
         - CASE WHEN arrival_date < DATE '2005-01-01'
                THEN DATE '2005-01-01'
                ELSE arrival_date END
         + INTERVAL '1' DAY) AS room_days
  FROM Guests
 WHERE depart_date > DATE '2005-01-01'
   AND arrival_date <= DATE '2005-01-31';
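
Here is a minimal sketch of the January query run against SQLite from Python. The Guests rows are read off the Working listing earlier in the section, the departure column is assumed to be named depart_date as in the length-of-stay formula, and julianday() again replaces the date subtraction that the text assumes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Guests
               (guest_name TEXT, arrival_date TEXT, depart_date TEXT)""")
con.executemany("INSERT INTO Guests VALUES (?, ?, ?)", [
    ("Dorothy Gale", "2005-02-01", "2005-11-01"),
    ("Indiana Jones", "2005-02-01", "2005-02-01"),
    ("Don Quixote", "2005-01-01", "2005-10-01"),
    ("James T Kirk", "2005-02-01", "2005-02-28"),
    ("Santa Claus", "2005-12-01", "2005-12-25"),
])

# Clip each stay to the month of January with CASE, then count the days.
(room_days,) = con.execute("""
SELECT SUM(CAST(ROUND(
         julianday(CASE WHEN depart_date > '2005-01-31'
                        THEN '2005-01-31' ELSE depart_date END)
       - julianday(CASE WHEN arrival_date < '2005-01-01'
                        THEN '2005-01-01' ELSE arrival_date END)
       ) AS INTEGER) + 1) AS room_days
  FROM Guests
 WHERE depart_date > '2005-01-01'
   AND arrival_date <= '2005-01-31'
""").fetchone()
print(room_days)
```

With this sample only Don Quixote is in residence during January, and his stay is clipped to the 31 days of the month.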


CHAPTER 14: The [NOT] IN() Predicate

The IN() predicate is very natural. It takes a value and sees whether that value is in a list of comparable values. Standard SQL allows value expressions in the list, or for you to use a query to construct the list. The syntax is:

<in predicate> ::=
    <row value constructor> [NOT] IN <in predicate value>

<in predicate value> ::=
    <table subquery> | (<in value list>)

<in value list> ::=
    <row value expression> { <comma> <row value expression> }...

The expression <row value constructor> NOT IN <in predicate value> has the same effect as NOT (<row value constructor> IN <in predicate value>). This pattern for the use of the keyword NOT is found in most of the other predicates. The expression <row value constructor> IN <in predicate value> has, by definition, the same effect as <row value constructor> = ANY <in predicate value>. Most optimizers will recognize this and execute the same code for both expressions. This means that if the <in predicate value> is empty, such as one you would get from a subquery that returns no rows, the result is FALSE, because = ANY over an empty table has no row that can satisfy the comparison. If the <in predicate value> is an explicit list of NULLs, each comparison is equivalent to (<row value constructor> = (NULL, ..., NULL)), so the results will be UNKNOWN. Please remember that there is a difference between an empty table and a table with rows of all NULLs.
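
The difference between an empty table and a table of NULLs is easy to probe. The sketch below uses SQLite through Python as a stand-in for a Standard SQL engine (an assumption; other products may behave differently) and maps its 1, 0, and NULL results onto TRUE, FALSE, and UNKNOWN. The table names EmptyList and NullList are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EmptyList (x INTEGER);   -- no rows at all
CREATE TABLE NullList (x INTEGER);    -- rows exist, but every value is NULL
INSERT INTO NullList VALUES (NULL), (NULL);
""")

def truth(predicate):
    """Evaluate a predicate and name its three-valued logic result."""
    (value,) = con.execute("SELECT " + predicate).fetchone()
    return {1: "TRUE", 0: "FALSE", None: "UNKNOWN"}[value]

empty_in = truth("1 IN (SELECT x FROM EmptyList)")
null_in = truth("1 IN (SELECT x FROM NullList)")
null_not_in = truth("1 NOT IN (SELECT x FROM NullList)")
explicit_nulls = truth("1 IN (NULL, NULL)")

print(empty_in, null_in, null_not_in, explicit_nulls)
```

In SQLite, at least, the subquery with no rows yields FALSE, while the lists of NULLs yield UNKNOWN for both IN and NOT IN.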

IN() predicates with a subquery can sometimes be converted into EXISTS predicates, but there are some problems and differences in the predicates. The conversion to an EXISTS predicate is often a good way to improve performance, but it will not be as easy to read as the original IN() predicate. An EXISTS predicate can use indexes to find (or fail to find) a single value that confirms (or denies) the predicate, whereas the IN() predicate often has to build the results of the subquery in a working table.

14.1 Optimizing the IN() Predicate

Most database engines have no statistics about the relative frequency of the values in a list of constants, so they will scan them in the order in which they appear in the list. People like to order lists alphabetically or by magnitude, but it would be better to order the list from most frequently occurring values to least frequent. It is also pointless to have duplicate values in the constant list, since the predicate will return TRUE if it matches the first duplicate it finds, and never gets to the second occurrence. Likewise, if the predicate is FALSE for that value, it wastes computer time to traverse a needlessly long list.
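
The argument can be made concrete with a toy cost model. The Python sketch below simply counts equality tests for a naive left-to-right scan of the list; it is an illustration of the reasoning, not a description of how any particular engine evaluates IN(), and the workload of ten probes is hypothetical.

```python
def comparisons(value, in_list):
    """Count the equality tests a left-to-right scan makes for one probe."""
    for position, candidate in enumerate(in_list, start=1):
        if candidate == value:
            return position      # stop at the first match
    return len(in_list)          # no match: the whole list was scanned

# Hypothetical workload in which 'New York' is by far the most common value.
probes = ["New York"] * 6 + ["Chicago"] * 3 + ["Austin"]

alphabetical = ["Atlanta", "Austin", "Chicago", "New York"]
by_frequency = ["New York", "Chicago", "Austin", "Atlanta"]

cost_alpha = sum(comparisons(p, alphabetical) for p in probes)
cost_freq = sum(comparisons(p, by_frequency) for p in probes)
print(cost_alpha, cost_freq)
```

Under this model the frequency-ordered list needs fewer than half the comparisons of the alphabetical one, a duplicate entry is never reached, and a value that is not in the list always pays for the full scan.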

Many SQL engines perform an IN() predicate with a subquery by building the result set of the subquery first as a temporary working table, then scanning that result table from left to right. This can be expensive in many cases; for example, in a query to find employees in a city with a major sports team (we want them to get tickets for us), we could write (assuming that city names are unique):

SELECT *
  FROM Personnel
 WHERE city_name
       IN (SELECT city_name FROM SportTeams);


But let us further assume that our personnel are located in (n) cities and the sports teams are in (m) cities, where (m) is much greater than (n). If the matching cities appear near the front of the list generated by the subquery expression, it will perform much faster than if they appear at the end of the list. In the case of a subquery expression, you have no control over how the subquery is presented back in the containing query. However, you can order the expressions in a list in the order in which they are most likely to occur, such as:

SELECT *
  FROM Personnel
 WHERE city_name
       IN ('New York', 'Chicago', 'Atlanta', ..., 'Austin');

Incidentally, Standard SQL allows row expression comparisons, so if you have a Standard SQL implementation with separate columns for the city and state, you could write:

SELECT *
  FROM Personnel
 WHERE (city_name, state)
       IN (SELECT city_name, state
             FROM SportTeams);
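
A short sketch of why the row form matters when city names are not unique. It assumes an SQLite build with row-value support (version 3.15 or later) driven from Python; the columns emp_name and team_name and all of the sample rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Personnel (emp_name TEXT, city_name TEXT, state TEXT);
CREATE TABLE SportTeams (team_name TEXT, city_name TEXT, state TEXT);
INSERT INTO Personnel VALUES
  ('Alice', 'Springfield', 'IL'),
  ('Bob', 'Springfield', 'MA');
INSERT INTO SportTeams VALUES ('Hypothetical Hoops', 'Springfield', 'MA');
""")

# Matching on the city alone cannot tell the two Springfields apart.
by_city = [row[0] for row in con.execute("""
SELECT emp_name
  FROM Personnel
 WHERE city_name IN (SELECT city_name FROM SportTeams)
 ORDER BY emp_name""")]

# The row comparison matches city and state together.
by_city_and_state = [row[0] for row in con.execute("""
SELECT emp_name
  FROM Personnel
 WHERE (city_name, state) IN (SELECT city_name, state FROM SportTeams)
 ORDER BY emp_name""")]

print(by_city, by_city_and_state)
```

The single-column predicate returns both employees, while the row comparison keeps only the one whose city and state both match.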

Teradata did not get correlated subqueries until 1996, so they often used this syntax as a workaround. I am not sure if you should count them as being ahead or behind the technology for that.

Today, all major versions of SQL remove duplicates in the result table of the subquery, so you do not have to use a SELECT DISTINCT in the subquery. You might see this in legacy code.

A trick that can work for large lists on some products is to force the engine to construct a list ordered by frequency. This involves first constructing a VIEW that has an ORDER BY clause; this practice is not part of the SQL standard, which does not allow a VIEW to have an ORDER BY clause. For example, a paint company wants to find all the products offered by their competitors who use the same color as one of their products. First construct a VIEW that orders the colors by frequency of appearance:

CREATE VIEW PopColor (color, tally)
AS SELECT color, COUNT(*) AS tally
     FROM Paints
    GROUP BY color
    ORDER BY tally DESC;

Then go to the Competitor data and do a simple column SELECT on the VIEW, thus:

SELECT * FROM Competitor WHERE color IN (SELECT color FROM PopColor);
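
SQLite happens to accept an ORDER BY inside a VIEW, so the trick can at least be tried out from Python. This sketch only shows that the answer is the same whatever order the subquery rows arrive in; whether the ordering survives and helps performance is product-specific and is not measured here. The product_name column and the sample rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Paints (product_name TEXT, color TEXT);
CREATE TABLE Competitor (product_name TEXT, color TEXT);
INSERT INTO Paints VALUES
  ('P1', 'white'), ('P2', 'white'), ('P3', 'white'),
  ('P4', 'red'), ('P5', 'red'), ('P6', 'teal');
INSERT INTO Competitor VALUES
  ('C1', 'red'), ('C2', 'mauve'), ('C3', 'white');

-- Non-standard: a VIEW with an ORDER BY clause.
CREATE VIEW PopColor AS
SELECT color, COUNT(*) AS tally
  FROM Paints
 GROUP BY color
 ORDER BY tally DESC;
""")

tallies = dict(con.execute("SELECT color, tally FROM PopColor"))

# The IN() predicate is a set membership test, so the sort order of the
# VIEW can change the speed of the query but not its result.
matches = sorted(row[0] for row in con.execute(
    "SELECT product_name FROM Competitor "
    "WHERE color IN (SELECT color FROM PopColor)"))

print(tallies, matches)
```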

The VIEW is grouped, so it will be materialized in sort order. The subquery will then be executed and (we hope) the sort order will be maintained and passed along to the IN() predicate.

Another trick is to replace the IN() predicate with a JOIN operation. For example, you have a table of restaurant telephone numbers and a guidebook, and you want to pick out the four-star places, so you write this query:

SELECT restaurant_name, phone_nbr
  FROM Restaurants
 WHERE restaurant_name
       IN (SELECT restaurant_name
             FROM QualityGuide
            WHERE stars = 4);

If there is an index on QualityGuide.stars, the SQL engine will probably build a temporary table of the four-star places and pass it on to the outer query. The outer query will then handle it as if it were a list of constants. However, this is not the sort of column that you would normally index. Without an index on stars, the engine will simply do a sequential search of the QualityGuide table. This query can be replaced with a JOIN query, thus:

SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, QualityGuide
 WHERE stars = 4
   AND Restaurants.restaurant_name = QualityGuide.restaurant_name;


This query should run faster, since restaurant_name is a key for both tables and will be indexed to ensure uniqueness. However, this can return duplicate rows in the result table that you can handle with a SELECT DISTINCT.

Consider a more budget-minded query, where we want places with a meal that costs $10 or less, and the menu guidebook lists all the meals. The query looks about the same:

SELECT restaurant_name, phone_nbr
  FROM Restaurants
 WHERE restaurant_name
       IN (SELECT restaurant_name
             FROM MenuGuide
            WHERE price <= 10.00);

And you would expect to be able to replace it with:

SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name;

Every item in Murphy's Two-Dollar Hash House will get a line in the results of the JOINed version. However, this can be fixed by changing SELECT restaurant_name, phone_nbr to SELECT DISTINCT restaurant_name, phone_nbr, but it will cost more time to do a sort to remove the duplicates. There is no good general advice, except to experiment with your particular product.
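
The duplicate-row effect is easy to reproduce. The sketch below runs the IN(), JOIN, and SELECT DISTINCT versions against SQLite from Python; the meal_name column, the phone numbers, and the second restaurant are hypothetical sample data, and the restaurant_name in the SELECT list is qualified so that the JOINed versions are unambiguous.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Restaurants (restaurant_name TEXT PRIMARY KEY, phone_nbr TEXT);
CREATE TABLE MenuGuide (restaurant_name TEXT, meal_name TEXT, price REAL);
INSERT INTO Restaurants VALUES
  ('Murphy''s Two-Dollar Hash House', '555-0100'),
  ('Chez Expensive', '555-0199');
INSERT INTO MenuGuide VALUES
  ('Murphy''s Two-Dollar Hash House', 'hash', 2.00),
  ('Murphy''s Two-Dollar Hash House', 'eggs', 2.00),
  ('Murphy''s Two-Dollar Hash House', 'toast', 2.00),
  ('Chez Expensive', 'tasting menu', 95.00);
""")

# IN(): one row per restaurant, however many cheap meals it offers.
in_rows = con.execute("""
SELECT restaurant_name, phone_nbr
  FROM Restaurants
 WHERE restaurant_name
       IN (SELECT restaurant_name FROM MenuGuide WHERE price <= 10.00)
""").fetchall()

# JOIN: one row per qualifying meal.
join_rows = con.execute("""
SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name
""").fetchall()

# JOIN with DISTINCT: the duplicates are removed again.
distinct_rows = con.execute("""
SELECT DISTINCT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name
""").fetchall()

print(len(in_rows), len(join_rows), len(distinct_rows))
```

With three cheap meals on Murphy's menu, the plain JOIN returns three copies of the same restaurant, while the IN() and SELECT DISTINCT versions return it once.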

The NOT IN() predicate is probably better replaced with a NOT EXISTS predicate. Using the restaurant example again, our friend John has a list of eateries and we want to see those that are not in the guidebook. The natural formation of the query is:

SELECT *
  FROM JohnsBook
 WHERE restaurant_name
       NOT IN (SELECT restaurant_name
                 FROM QualityGuide);

But you can write the same query with a NOT EXISTS predicate and it will probably run faster:
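
A minimal sketch of what such a NOT EXISTS rewrite might look like, run against SQLite from Python. The correlation names J1 and Q1, the single-column layout of JohnsBook, and the sample rows are all assumptions made for illustration. The sketch also shows a caveat: the two forms agree only while QualityGuide.restaurant_name contains no NULLs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE JohnsBook (restaurant_name TEXT);
CREATE TABLE QualityGuide (restaurant_name TEXT, stars INTEGER);
INSERT INTO JohnsBook VALUES ('Chez Paul'), ('Joe''s Diner');
INSERT INTO QualityGuide VALUES ('Chez Paul', 4);
""")

NOT_IN = """
SELECT restaurant_name
  FROM JohnsBook
 WHERE restaurant_name
       NOT IN (SELECT restaurant_name FROM QualityGuide)
"""

NOT_EXISTS = """
SELECT J1.restaurant_name
  FROM JohnsBook AS J1
 WHERE NOT EXISTS
       (SELECT *
          FROM QualityGuide AS Q1
         WHERE Q1.restaurant_name = J1.restaurant_name)
"""

def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in con.execute(sql)]

before = (names(NOT_IN), names(NOT_EXISTS))

# A NULL name in the guidebook turns NOT IN into UNKNOWN for every
# non-matching row, while NOT EXISTS is unaffected.
con.execute("INSERT INTO QualityGuide VALUES (NULL, 3)")
after = (names(NOT_IN), names(NOT_EXISTS))

print(before, after)
```

Before the NULL is inserted both queries return the one eatery that is missing from the guidebook; afterwards the NOT IN() version returns no rows at all, so the rewrite is only safe when the subquery column is declared NOT NULL or is otherwise known to be free of NULLs.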
