CHAPTER 15: EXISTS() PREDICATE
SELECT P1.emp_name, ' was born on a day without a famous New Yorker!'
FROM Personnel AS P1
WHERE P1.birthday
NOT IN (SELECT C1.birthday
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York');
and you would think that the EXISTS version would be:
SELECT P1.emp_name, ' was born on a day without a famous New Yorker!'
FROM Personnel AS P1
WHERE NOT EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York'
AND C1.birthday = P1.birthday);
Assume that Gloria Glamour is our only New Yorker and we still do not know her birthday. The subquery will be empty for every employee in the NOT EXISTS predicate version, because her NULL birthday will not test equal to the known employee birthdays. That means that the NOT EXISTS predicate will return TRUE, and every employee will be reported as born on a day without a famous New Yorker. But now look at the IN predicate version, which will have a single NULL in the subquery result. This predicate will be equivalent to NOT (P1.birthday = NULL), which is always UNKNOWN, and we will get no employees back.
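The difference between the two versions can be checked with a small experiment. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for a Standard SQL product; the employee rows, the TEXT column types, and the ISO date strings are hypothetical, and only Gloria Glamour comes from the discussion above.

```python
import sqlite3

# In-memory database; the employees and column types are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT NOT NULL);
CREATE TABLE Celebrities
(emp_name TEXT NOT NULL PRIMARY KEY,
 birth_city TEXT NOT NULL,
 birthday TEXT);  -- NULL means the birthday is unknown
INSERT INTO Personnel VALUES ('Alice', '1970-01-15'), ('Bob', '1982-06-30');
INSERT INTO Celebrities VALUES ('Gloria Glamour', 'New York', NULL);
""")

# NOT IN version: the subquery result is a single NULL, so the test is UNKNOWN.
not_in_rows = conn.execute("""
SELECT P1.emp_name
FROM Personnel AS P1
WHERE P1.birthday
NOT IN (SELECT C1.birthday
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York')
ORDER BY P1.emp_name""").fetchall()

# NOT EXISTS version: the correlated subquery is empty for every employee.
not_exists_rows = conn.execute("""
SELECT P1.emp_name
FROM Personnel AS P1
WHERE NOT EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York'
AND C1.birthday = P1.birthday)
ORDER BY P1.emp_name""").fetchall()

print(not_in_rows)      # no employees come back
print(not_exists_rows)  # every employee comes back
```

Once Ms. Glamour's birthday is filled in with a known value, the two queries should agree again, which is why the discrepancy is easy to miss in testing.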
Likewise, you cannot, in general, transform the quantified comparison predicates into EXISTS predicates, because of the possibility of NULL values. Remember that x <> ALL <subquery> is shorthand for x NOT IN <subquery>, and x = ANY <subquery> is shorthand for x IN <subquery>, so this should not surprise you.
In general, the EXISTS predicates will run faster than the IN predicates. The problem is in deciding whether to build the query or the subquery first; the optimal approach depends on the size and distribution of values in each, and that cannot usually be known until runtime.
15.2 EXISTS and INNER JOINs
The [NOT] EXISTS predicate is almost always used with a correlated subquery. Very often the subquery can be “flattened” into a JOIN, which will frequently run faster than the original query. Our sample query can be converted into:
SELECT P1.emp_name, ' has the same birthday as a famous person!'
FROM Personnel AS P1, Celebrities AS C1
WHERE P1.birthday = C1.birthday;
The advantage of the JOIN version is that it allows us to show columns from both tables. We should make the query more informative by rewriting it:
SELECT P1.emp_name, ' has the same birthday as ', C1.emp_name
FROM Personnel AS P1, Celebrities AS C1
WHERE P1.birthday = C1.birthday;
This new query could be written with an EXISTS() predicate, but that is a waste of resources:
SELECT P1.emp_name, ' has the same birthday as ', C1.emp_name
FROM Personnel AS P1, Celebrities AS C1
WHERE EXISTS
(SELECT *
FROM Celebrities AS C2
WHERE P1.birthday = C2.birthday
AND C1.emp_name = C2.emp_name);
15.3 NOT EXISTS and OUTER JOINs
The NOT EXISTS version of this predicate is almost always used with a correlated subquery. Very often the subquery can be “flattened” into an OUTER JOIN, which will frequently run faster than the original query. Our other sample query was:
SELECT P1.emp_name, ' was born on a day without a famous New Yorker!'
FROM Personnel AS P1
WHERE NOT EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York'
AND C1.birthday = P1.birthday);
Which we can replace with:
SELECT P1.emp_name, ' was born on a day without a famous New Yorker!'
FROM Personnel AS P1
LEFT OUTER JOIN
Celebrities AS C1
ON C1.birth_city = 'New York'
AND C1.birthday = P1.birthday
WHERE C1.emp_name IS NULL;
This assumes that we know each and every celebrity name in the Celebrities table. If the column tested in the WHERE clause could have NULLs in its base table, then we could not tell those NULLs apart from the NULLs generated by the OUTER JOIN. The test for NULL should always be on (a column of) the primary key, which cannot be NULL. Relating this back to the example, how could a celebrity be a celebrity with an unknown name? Even The Unknown Comic had a name (“The Unknown Comic”).
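To see that the OUTER JOIN form and the NOT EXISTS form pick out the same employees, here is a small sketch, again assuming SQLite through Python's sqlite3 module. All sample rows are hypothetical; the celebrity name column is declared as the primary key so that a NULL in it can only have been generated by the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT NOT NULL);
CREATE TABLE Celebrities
(emp_name TEXT NOT NULL PRIMARY KEY,  -- key column: never NULL in the base table
 birth_city TEXT NOT NULL,
 birthday TEXT NOT NULL);
-- Hypothetical rows: Alice shares a birthday with a New Yorker, Bob does not.
INSERT INTO Personnel VALUES ('Alice', '1970-01-15'), ('Bob', '1982-06-30');
INSERT INTO Celebrities VALUES
 ('Nick Knickerbocker', 'New York', '1970-01-15'),
 ('Holly Wood', 'Los Angeles', '1982-06-30');
""")

not_exists_rows = conn.execute("""
SELECT P1.emp_name
FROM Personnel AS P1
WHERE NOT EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE C1.birth_city = 'New York'
AND C1.birthday = P1.birthday)
ORDER BY P1.emp_name""").fetchall()

# Unmatched Personnel rows are padded with NULLs; testing the key column finds them.
outer_join_rows = conn.execute("""
SELECT P1.emp_name
FROM Personnel AS P1
LEFT OUTER JOIN
Celebrities AS C1
ON C1.birth_city = 'New York'
AND C1.birthday = P1.birthday
WHERE C1.emp_name IS NULL
ORDER BY P1.emp_name""").fetchall()

print(not_exists_rows, outer_join_rows)
```

Note that the birth_city test sits in the ON clause rather than the WHERE clause; moving it to the WHERE clause would discard the padded rows the query is looking for.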
15.4 EXISTS() and Quantifiers
Formal logic makes use of quantifiers that can be applied to propositions. The two forms are “For all x, P(x)” and “For some x, P(x).” The first is written as ∀ (an inverted uppercase A) and the second is written as ∃ (a reversed uppercase E), if you want to look up formulas in a textbook. The quantifiers put into symbols such statements as “all men are mortal” or “some Cretans are liars” so they can be manipulated.
The big question more than 100 years ago was that of existential import in formal logic. Everyone agreed that saying “all men are mortal” implies that “no men are not mortal,” but does it also imply that “some men are mortal,” that is, that we have to have at least one man who is mortal?
Existential import lost the battle, and the modern convention is that “All men are mortal” has the same meaning as “There are no men who are immortal,” but does not imply that any men exist at all. This is the convention followed in the design of SQL. Consider the statement “some salesmen are liars” and the way we would write it with the EXISTS() predicate in SQL:
EXISTS(SELECT *
FROM Personnel AS P1, Liars AS L1
WHERE P1.job = 'Salesman'
AND P1.emp_name = L1.emp_name);
If we are more cynical about salesmen, we might want to formulate the predicate “all salesmen are liars” with the EXISTS predicate in SQL, using the transform rule just discussed:
NOT EXISTS(SELECT *
FROM Personnel AS P1
WHERE P1.job = 'Salesman'
AND P1.emp_name
NOT IN
(SELECT L1.emp_name
FROM Liars AS L1));
That says, informally, “there are no salesmen who are not liars.” In this case, the IN predicate can be changed into a JOIN, which should improve performance and be a bit easier to read.
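The lack of existential import can be observed directly. The sketch below, assuming SQLite through Python's sqlite3 module and hypothetical employees, evaluates both predicates against three states of the data: no salesmen at all, one honest salesman, and one lying salesman. The helper name quantifiers() is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 job TEXT NOT NULL);
CREATE TABLE Liars
(emp_name TEXT NOT NULL PRIMARY KEY);
INSERT INTO Personnel VALUES ('Carol', 'Engineer');  -- hypothetical: no salesmen yet
""")

SOME_SQL = """
SELECT EXISTS(SELECT *
FROM Personnel AS P1, Liars AS L1
WHERE P1.job = 'Salesman'
AND P1.emp_name = L1.emp_name)"""

ALL_SQL = """
SELECT NOT EXISTS(SELECT *
FROM Personnel AS P1
WHERE P1.job = 'Salesman'
AND P1.emp_name
NOT IN
(SELECT L1.emp_name
FROM Liars AS L1))"""

def quantifiers():
    """Return (some salesmen are liars, all salesmen are liars) as booleans."""
    some = bool(conn.execute(SOME_SQL).fetchone()[0])
    every = bool(conn.execute(ALL_SQL).fetchone()[0])
    return some, every

no_salesmen = quantifiers()

conn.execute("INSERT INTO Personnel VALUES ('Dave', 'Salesman')")
honest_salesman = quantifiers()

conn.execute("INSERT INTO Liars VALUES ('Dave')")
lying_salesman = quantifiers()

print(no_salesmen, honest_salesman, lying_salesman)
```

With no salesmen in the table, “all salesmen are liars” holds while “some salesmen are liars” does not, which is exactly the modern convention described above.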
15.5 EXISTS() and Referential Constraints
Standard SQL was designed so that the declarative referential constraints could be expressed as EXISTS() predicates in a CHECK() clause. For example:
CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
street_loc CHAR(25) NOT NULL,
city_name CHAR(20) NOT NULL,
state_code CHAR(2) NOT NULL
REFERENCES ZipCodeData(state_code),
...);
could be written as:
CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
street_loc CHAR(25) NOT NULL,
city_name CHAR(20) NOT NULL,
state_code CHAR(2) NOT NULL,
CONSTRAINT valid_state_code
CHECK (EXISTS(SELECT *
FROM ZipCodeData AS Z1
WHERE Z1.state_code = Addresses.state_code)),
...);
There is no advantage to this expression for the DBA, since you cannot attach referential actions to the CHECK() constraint. However, an SQL database can use the same mechanisms in the SQL compiler for both constructions.
15.6 EXISTS and Three-Valued Logic
This example is due to an article by Lee Fesperman at FirstSQL. It uses Chris Date’s “SupplierParts” table with three rows:
CREATE TABLE SupplierParts
(sup_nbr CHAR(2) NOT NULL PRIMARY KEY,
part_nbr CHAR(2) NOT NULL,
qty INTEGER CHECK (qty > 0));
sup_nbr  part_nbr  qty
======================
'S1'     'P1'      NULL
'S2'     'P1'      200
'S3'     'P1'      1000
The row (‘S1’, ‘P1’, NULL) means that supplier ‘S1’ supplies part ‘P1’, but we do not know what quantity he has.
The query we wish to answer is “Find suppliers of part ‘P1’, but not in a quantity of 1000 on hand.” The correct answer is ‘S2’. All suppliers in the table supply ‘P1’, but we do know that ‘S3’ supplies the part in quantity 1000, and we do not know in what quantity ‘S1’ supplies the part. The only supplier we eliminate for certain is ‘S3’.
An SQL query to retrieve this result would be:
SELECT spx.sup_nbr
FROM SupplierParts AS spx
WHERE spx.part_nbr = 'P1'
AND 1000
NOT IN (SELECT spy.qty
FROM SupplierParts AS spy
WHERE spy.sup_nbr = spx.sup_nbr
AND spy.part_nbr = 'P1');
According to Standard SQL, this query should return only ‘S2’, but when we transform the query into an equivalent version, using EXISTS instead, we obtain:
SELECT spx.sup_nbr
FROM SupplierParts AS spx
WHERE spx.part_nbr = 'P1'
AND NOT EXISTS
(SELECT *
FROM SupplierParts AS spy
WHERE spy.sup_nbr = spx.sup_nbr
AND spy.part_nbr = 'P1'
AND spy.qty = 1000);
This will return (‘S1’, ‘S2’). You can argue that this is the wrong answer, because we do not definitely know whether or not ‘S1’ supplies ‘P1’ in quantity 1000. The EXISTS() predicate will return TRUE or FALSE, even in situations where a subquery’s predicate returns an UNKNOWN (i.e., NULL = 1000).
The solution is to modify the predicate that deals with the quantity in the subquery to say explicitly whether or not you want to give the “benefit of the doubt” to the NULL. You have several alternatives:
1. (spy.qty = 1000) IS NOT FALSE
This uses the new predicates in Standard SQL for testing logical values. Frankly, this is confusing to read and worse to maintain.
2. (spy.qty = 1000 OR spy.qty IS NULL)
This uses another test predicate, but the optimizer can probably use any index on the qty column.
3. (COALESCE(spy.qty, 1000) = 1000)
This is portable and easy to maintain. The only disadvantage is that some SQL products might not be able to use an index on the qty column, because it is in an expression.
The real problem is that the query was formed with a double negative in the form of a NOT EXISTS and an implicit IS NOT FALSE condition. The problem stems from the fact that the EXISTS() predicate is one of the few two-valued predicates in SQL, and that (NOT (NOT UNKNOWN)) = UNKNOWN.
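The behavior of the NOT EXISTS query and the three alternatives can be compared side by side. This is a minimal sketch, assuming SQLite through Python's sqlite3 module (the IS NOT FALSE test needs SQLite 3.23 or later); the helper name suppliers() is hypothetical, and the table and rows are the ones given in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SupplierParts
(sup_nbr CHAR(2) NOT NULL PRIMARY KEY,
 part_nbr CHAR(2) NOT NULL,
 qty INTEGER CHECK (qty > 0));
INSERT INTO SupplierParts VALUES
 ('S1', 'P1', NULL), ('S2', 'P1', 200), ('S3', 'P1', 1000);
""")

def suppliers(qty_test):
    """Run the NOT EXISTS query with the given test on spy.qty in the subquery."""
    sql = f"""
    SELECT spx.sup_nbr
    FROM SupplierParts AS spx
    WHERE spx.part_nbr = 'P1'
    AND NOT EXISTS
    (SELECT *
    FROM SupplierParts AS spy
    WHERE spy.sup_nbr = spx.sup_nbr
    AND spy.part_nbr = 'P1'
    AND {qty_test})
    ORDER BY spx.sup_nbr"""
    return [row[0] for row in conn.execute(sql)]

naive = suppliers("spy.qty = 1000")                    # UNKNOWN row is dropped silently
alt1 = suppliers("(spy.qty = 1000) IS NOT FALSE")      # alternative 1
alt2 = suppliers("(spy.qty = 1000 OR spy.qty IS NULL)")  # alternative 2
alt3 = suppliers("(COALESCE(spy.qty, 1000) = 1000)")   # alternative 3

print(naive, alt1, alt2, alt3)
```

All three alternatives treat the NULL quantity as a possible 1000, so supplier ‘S1’ is removed along with ‘S3’ and the result matches the NOT IN version.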
For another approach based on Dr. Codd’s second relational model, visit www.FirstSQL.com and read some of the white papers by Lee Fesperman. He used the two NULLs Codd proposed to develop a product.
CHAPTER 16
Quantified Subquery Predicates
A quantifier is a logical operator that states the quantity of objects for which a statement is TRUE. This is a logical quantity, not a numeric quantity; it relates a statement to the whole set of possible objects. In everyday life, you see statements like “There is only one mouthwash that stops dinosaur breath,” “All doctors drive Mercedes,” or “Some people got rich investing in cattle futures,” which are quantified.
The first statement, about the mouthwash, is a uniqueness quantifier. If there were two or more products that could save us from dinosaur breath, the statement would be FALSE. The second statement has what is called a universal quantifier, since it deals with all doctors: find one exception and the statement is FALSE. The last statement has an existential quantifier, since it asserts that one or more people exist who got rich on cattle futures: find one example and the statement is TRUE.
SQL has forms of these quantifiers that are not quite like those in formal logic. They are based on extending the use of comparison predicates to allow result sets to be quantified, and they use SQL’s three-valued logic, so they do not return just TRUE or FALSE.
16.1 Scalar Subquery Comparisons
Standard SQL allows both scalar and row comparisons, but most queries use only scalar expressions. If a subquery returns a single-row, single-column result table, it is treated as a scalar value in Standard SQL in virtually any place a scalar could appear. For example, to find out if we have any teachers who are more than one year older than the students, I could write:
SELECT T1.teacher_name
FROM Teachers AS T1
WHERE T1.birthday > (SELECT MAX(S1.birthday) - INTERVAL '365' DAY
FROM Students AS S1);
In this case, the scalar subquery will be run only once and reduced to a constant value by the optimizer before scanning the Teachers table.
A correlated subquery is more complex, because it will have to be executed for each value from the containing query. For example, to find which suppliers have sent us fewer than 100 parts, we would use this query. Notice how the SUM(quantity) has to be computed for each supplier number, sup_nbr:
SELECT sup_nbr, sup_name
FROM Suppliers
WHERE 100 > (SELECT SUM(quantity)
FROM Shipments
WHERE Shipments.sup_nbr = Suppliers.sup_nbr);
If a scalar subquery returns a NULL, we have rules for handling comparison with NULLs. But what if it returns an empty result, such as for a supplier that has not shipped us anything? In Standard SQL, the empty result table is converted to a NULL of the appropriate data type.
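The effect of these rules on the supplier query can be tried out with a few rows. The following sketch assumes SQLite through Python's sqlite3 module; the supplier names and shipment quantities are hypothetical. Supplier ‘S3’ has no shipments. Note that SUM() over no rows already yields a NULL, so the second statement uses a non-aggregate scalar subquery to show the conversion of an empty result to NULL by itself, and the third shows one possible way to count a missing total as zero if that is what is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers
(sup_nbr TEXT NOT NULL PRIMARY KEY,
 sup_name TEXT NOT NULL);
CREATE TABLE Shipments
(sup_nbr TEXT NOT NULL,
 quantity INTEGER NOT NULL);
-- Hypothetical data: S1 shipped 70 parts in total, S2 shipped 500, S3 shipped nothing.
INSERT INTO Suppliers VALUES ('S1', 'Acme'), ('S2', 'Bolt'), ('S3', 'Cobalt');
INSERT INTO Shipments VALUES ('S1', 40), ('S1', 30), ('S2', 500);
""")

# The query from the text: 100 > NULL is UNKNOWN, so S3 is not returned.
small = conn.execute("""
SELECT sup_nbr, sup_name
FROM Suppliers
WHERE 100 > (SELECT SUM(quantity)
FROM Shipments
WHERE Shipments.sup_nbr = Suppliers.sup_nbr)
ORDER BY sup_nbr""").fetchall()

# A scalar subquery with an empty result is treated as NULL (None in Python).
empty_scalar = conn.execute("""
SELECT (SELECT quantity FROM Shipments WHERE sup_nbr = 'S3')""").fetchone()[0]

# Treating "no shipments" as a total of zero brings S3 into the answer.
with_zero = conn.execute("""
SELECT sup_nbr, sup_name
FROM Suppliers
WHERE 100 > COALESCE((SELECT SUM(quantity)
FROM Shipments
WHERE Shipments.sup_nbr = Suppliers.sup_nbr), 0)
ORDER BY sup_nbr""").fetchall()

print(small, empty_scalar, with_zero)
```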
In Standard SQL, you can place scalar or row subqueries on either side of a comparison predicate as long as they return comparable results. But you must be aware of the rules for row comparisons. For example, the following query will find the product manager who has more of his product at the stores than in the warehouse:
SELECT manager_name, product_nbr FROM Stores AS S1
WHERE (SELECT SUM(qty)
FROM Warehouses AS W1
WHERE S1.product_nbr = W1.product_nbr)
< (SELECT SUM(qty)
FROM RetailStores AS R1
WHERE S1.product_nbr = R1.product_nbr);
Here is a programming tip: the main problem with writing these queries is getting a result with more than one row in it. You can guarantee uniqueness in several ways. An aggregate function on an ungrouped table will always be a single value. A JOIN with the containing query based on a key will always be a single value.
16.2 Quantifiers and Missing Data
The quantified predicates are used with subquery expressions to compare a single value to those of the subquery, and take the general form <value expression> <comp op> <quantifier> <subquery>. The predicate "<value expression> <comp op> [ANY|SOME] <table expression>" is equivalent to taking each row, s (assume that they are numbered from 1 to n), of <table expression> and testing "<value expression> <comp op> s" with ORs between the expanded expressions:
((<value expression> <comp op> s1)
OR (<value expression> <comp op> s2)
...
OR (<value expression> <comp op> sn))
When you get a single TRUE result, the whole predicate is TRUE. As long as <table expression> has cardinality greater than zero and at least one non-NULL value, you will get a result of TRUE or FALSE. The keyword SOME is the same as ANY, and the choice is just a matter of style and readability. Likewise, "<value expression> <comp op> ALL <table expression>" takes each row, s, of <table expression> and tests "<value expression> <comp op> s" with ANDs between the expanded expressions:
((<value expression> <comp op> s1)
AND (<value expression> <comp op> s2)
...
AND (<value expression> <comp op> sn))
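The two expansions can be modeled in a few lines of ordinary code. The following is a sketch in Python rather than SQL, using None to stand for both NULL and UNKNOWN; the function names are hypothetical. It folds the comparisons together with three-valued OR and AND, starting from FALSE for ANY and TRUE for ALL, which is also what Standard SQL returns when the table expression is empty.

```python
import operator

def compare(op, x, s):
    """SQL-style comparison: UNKNOWN (None) if either operand is NULL (None)."""
    if x is None or s is None:
        return None
    return op(x, s)

def or3(a, b):
    """Three-valued OR: TRUE wins, then UNKNOWN, else FALSE."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def and3(a, b):
    """Three-valued AND: FALSE wins, then UNKNOWN, else TRUE."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def any_quantified(x, op, rows):
    """Model of x <comp op> ANY (rows): comparisons joined by ORs."""
    result = False  # an OR over zero comparisons is FALSE
    for s in rows:
        result = or3(result, compare(op, x, s))
    return result

def all_quantified(x, op, rows):
    """Model of x <comp op> ALL (rows): comparisons joined by ANDs."""
    result = True  # an AND over zero comparisons is TRUE
    for s in rows:
        result = and3(result, compare(op, x, s))
    return result

print(any_quantified(5, operator.eq, [1, 5, None]))  # one TRUE decides the ORs
print(any_quantified(5, operator.eq, [1, None]))     # no TRUE, one UNKNOWN
print(all_quantified(5, operator.gt, [9, None]))     # one FALSE decides the ANDs
print(all_quantified(5, operator.gt, [1, None]))     # no FALSE, one UNKNOWN
```

A single TRUE settles an ANY and a single FALSE settles an ALL regardless of any NULLs in the other rows; the NULLs only matter when no deciding row is found.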