CHAPTER 27: SUBSETS
= (SELECT MIN(keycol) + (MAX(keycol) - MIN(keycol)) * RANDOM()
     FROM LotteryDrawing AS L2);
Here is a version which uses the COUNT(*) functions and a self-join instead:
SELECT L1.*
  FROM LotteryDrawing AS L1
 WHERE CEILING((SELECT COUNT(*) FROM LotteryDrawing) * RANDOM())
       = (SELECT COUNT(*)
            FROM LotteryDrawing AS L2
           WHERE L1.keycol <= L2.keycol);
The rounding away from zero is important, since we are in effect numbering the rows from one. The idea is to use the decimal fraction to hit the row that is that fraction of the way into the table when the rows are ordered by the key.
Having shown you this code, I have to warn you that the pure SQL has a good number of self-joins, and they will be expensive to run.
27.3 The CONTAINS Operators
Set theory has two symbols for subsets. One, ⊂, means that set A is contained within set B; this is sometimes said to denote a proper subset.
The other, ⊆, means “is contained in or equal to,” and is sometimes called just a subset or containment operator.
Standard SQL has never had an operator to compare tables against each other for equality or containment. Several college textbooks on relational databases mention a CONTAINS predicate, which does not exist in Standard SQL. This predicate existed in the original System R, IBM’s first experimental SQL system, but it was dropped from later SQL implementations because of the expense of running it.
27.3.1 Proper Subset Operators
The IN predicate is a test for membership. For those of you who remember your high school set theory, membership is shown with a stylized epsilon with the containing set on the right side: a ∈ A.
Membership is for one element, whereas a subset is itself a set, not just an element. As an example of a subset predicate, consider a query to tell you the names of each employee who works on all of the projects in department 5. Using the System R syntax:
SELECT name -- Not valid SQL!
FROM Personnel
WHERE (SELECT project_nbr
FROM WorksOn
WHERE Personnel.emp_nbr = WorksOn.emp_nbr)
CONTAINS
(SELECT project_nbr
FROM Projects
WHERE dept_nbr = 5);
In the second SELECT statement of the CONTAINS predicate, we build a table of all the projects in department 5. In the first SELECT statement of the CONTAINS predicate, we have a correlated subquery that will build a table of all the projects each employee works on. If the table of the employee’s projects is equal to or a superset of the department 5 table, the predicate is TRUE.
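In Standard SQL the same question is usually answered with relational division, written as a pair of nested NOT EXISTS predicates. The following is a minimal sketch that assumes the tables and columns named in the System R example; the correlation names P, PR, and W are illustrative only.

```sql
-- "There is no department 5 project that this employee does not work on."
SELECT P.name
  FROM Personnel AS P
 WHERE NOT EXISTS
       (SELECT *
          FROM Projects AS PR
         WHERE PR.dept_nbr = 5
           AND NOT EXISTS
               (SELECT *
                  FROM WorksOn AS W
                 WHERE W.emp_nbr = P.emp_nbr
                   AND W.project_nbr = PR.project_nbr));
```

Note that if department 5 has no projects at all, every employee qualifies, which agrees with the set-theoretic rule that every set contains the empty set.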
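You must first decide what you are going to do about duplicate rows in either or both tables. That is, does the set { a, b, c } contain the multiset { a, b, b } or not? Some SQL set operations, such as SELECT and UNION, have options to remove or keep duplicates (SELECT DISTINCT versus SELECT ALL, UNION versus UNION ALL).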
I would argue that duplicates should be ignored, and that the multiset is a subset of the other. For our example, let us use a table of employees and another table with the names of the company bowling team members, which should be a proper subset of the Personnel table. For the bowling team to be contained in the set of employees, each bowler must be an employee; or, to put it another way, there must be no bowler who is not an employee:
NOT EXISTS (SELECT *
FROM Bowling AS B1
WHERE B1.emp_nbr NOT IN (SELECT emp_nbr FROM Personnel))
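The same containment test can also be sketched with the EXCEPT operator, in products that support it. This is an illustration rather than the only formulation; it assumes that both tables carry the emp_nbr column used in the predicate.

```sql
-- Bowlers who are not employees.
-- An empty result means Bowling is contained in Personnel.
SELECT emp_nbr FROM Bowling
EXCEPT
SELECT emp_nbr FROM Personnel;

-- Employees who do not bowl.
-- If the first query is empty and this one is not,
-- Bowling is a proper subset of Personnel.
SELECT emp_nbr FROM Personnel
EXCEPT
SELECT emp_nbr FROM Bowling;
```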
27.3.2 Table Equality
How can I find out if two tables are equal to each other? This is a common programming problem, and the specification sounds obvious. When two sets, A and B, are equal, then we know that:
1. Both have the same number of elements
2. No elements in A are not in B
3. No elements in B are not in A
4. Set A is equal to the intersection of A and B
5. Set B is equal to the intersection of A and B
6. Set B is a subset of A
7. Set A is a subset of B
as well as probably a few other things vaguely remembered from an old math class. But equality is not as easy as it sounds in SQL, because the language is based on multisets or bags, which allow duplicate elements, and the language has NULLs. Given this list of multisets, which pairs are equal to each other?
S0 = {a, b, c}
S1 = {a, b, NULL}
S2 = {a, b, b, c, c}
S3 = {a, b, NULL}
S4 = {a, b, c}
S5 = {x, y, z}
Everyone will agree that S0 = S4, because they are identical. Everyone will agree that S5 is not equal to any other set because it has no elements in common with any of them. How do you handle redundant duplicates? If you ignore them, then S0 = S2. Should NULLs be given the benefit of the doubt and matched to any known value or not? If so, then S0 = S1 and S0 = S3. But then do you want to say that S1 = S3 because we can pair up the NULLs with each other?
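SQL itself gives two different answers to the S1 = S3 question, depending on which construct does the comparing. The sketch below builds S1 and S3 as one-column derived tables; the column name x and the CAST used to give the NULL a data type are assumptions made for the illustration.

```sql
-- Set operators compare rows with "not distinct" semantics, so the two
-- NULLs are treated as duplicates of each other: this returns no rows.
SELECT x
  FROM (VALUES ('a'), ('b'), (CAST(NULL AS CHAR(1)))) AS S1 (x)
EXCEPT
SELECT x
  FROM (VALUES ('a'), ('b'), (CAST(NULL AS CHAR(1)))) AS S3 (x);

-- An equality test evaluates NULL = NULL to UNKNOWN, so the join pairs
-- only the 'a' and 'b' rows and leaves the NULL rows unmatched.
SELECT S1.x
  FROM (VALUES ('a'), ('b'), (CAST(NULL AS CHAR(1)))) AS S1 (x)
       INNER JOIN
       (VALUES ('a'), ('b'), (CAST(NULL AS CHAR(1)))) AS S3 (x)
       ON S1.x = S3.x;
```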
To make matters even worse: are two rows equal if they match on just their keys, on a particular subset of their columns, or on all their columns? The reason this question comes up in practice is that you often have to match up data from two sources that have slightly different versions of the same information (e.g., “Joe F Celko” and “Joseph Frank Celko” are probably the same person).
The good part about matching things on the keys is that you do have a true set: keys are unique and cannot have NULLs. If you go back to the list of set equality tests that I gave at the start of this section, you can see some possible ways to code a solution.
If you use facts 2 and 3 in the list, then you might use NOT EXISTS predicates:
WHERE NOT EXISTS
      (SELECT *
         FROM A
        WHERE A.keycol
              NOT IN (SELECT keycol
                        FROM B
                       WHERE A.keycol = B.keycol))
  AND NOT EXISTS
      (SELECT *
         FROM B
        WHERE B.keycol
              NOT IN (SELECT keycol
                        FROM A
                       WHERE A.keycol = B.keycol))
This query can also be written as:
WHERE NOT EXISTS
      ((SELECT keycol
          FROM A
        EXCEPT [ALL]
        SELECT keycol
          FROM B)
       UNION
       (SELECT keycol
          FROM B
        EXCEPT [ALL]
        SELECT keycol
          FROM A))
The optional ALL in the EXCEPT [ALL] operators determines how duplicates are handled.
However, if you look at 1, 4, and 5, you might come up with this answer:
WHERE (SELECT COUNT(*) FROM A)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)
  AND (SELECT COUNT(*) FROM B)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)
The following query will produce a list of the unmatched values; you might want to keep them in two columns instead of coalescing them as I have shown here:
SELECT DISTINCT COALESCE(A.keycol, B.keycol) AS non_matched_key
  FROM A
       FULL OUTER JOIN B
       ON A.keycol = B.keycol
 WHERE A.keycol IS NULL
    OR B.keycol IS NULL;
Eventually, you will be able to handle this with the INTERSECT [ALL] and EXCEPT [ALL] operators and tune the query to whatever definition of equality you wish to use.
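Where INTERSECT is available, facts 1, 4, and 5 from the list can be tested directly. The following is only a sketch for the key-matching case: the derived table name AB, the one-row table OneRow, and the result column key_comparison are all illustrative assumptions.

```sql
-- A and B hold the same keys when each table has exactly as many rows
-- as the intersection of their key columns.
SELECT CASE WHEN (SELECT COUNT(*) FROM A)
                 = (SELECT COUNT(*)
                      FROM (SELECT keycol FROM A
                            INTERSECT
                            SELECT keycol FROM B) AS AB)
             AND (SELECT COUNT(*) FROM B)
                 = (SELECT COUNT(*)
                      FROM (SELECT keycol FROM A
                            INTERSECT
                            SELECT keycol FROM B) AS AB)
            THEN 'equal'
            ELSE 'not equal' END AS key_comparison
  FROM (VALUES (1)) AS OneRow (dummy);
```

This relies on keycol being a key in both tables, so that COUNT(*) counts distinct keys.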
Unfortunately, these examples are for just comparing the keys. What do we do if we have tables without keys, or if we want to compare all the columns?
The GROUP BY clause treats NULLs as if they were equal to each other. This is probably the definition of equality we would like to use.
Remember that if one table has more columns or more rows than the other, we can stop right there, since they cannot possibly be equal under that definition. We have to assume that the tables have the same number of columns, of the same type, and in the same positions. But row counts look useful. Imagine that there are two children, each with a bag of candy. To determine that both bags are identical, the first child can start by pulling a piece of candy out and asking the other, “How many red ones do you have?” If the two counts disagree, we know that the bags are different. Now ask about the green pieces. We do not have to match each particular piece of candy in one bag with a particular piece of candy in the other bag. The counts are enough information only if they differ. If the counts are the same, more work needs to be done. We could each have one brown piece of candy, but mine could be an M&M, and yours could be a malted milk ball.
Now, generalize that idea. Let’s combine the two tables into one big table, with an extra column, x0, to show from where each row originally came.
Now form groups based on all the original columns. Within each group, count the number of rows from one table and the number of rows from the second table. If the counts are different, there are unmatched rows.
This will handle redundant duplicate rows within one table. This query does not require that the tables have keys. The assumption in a
Here is the final query:
-- SUM adds 1 for each row from the named table; COUNT would also
-- count the zeros, making the two tallies always equal.
SELECT x1, x2, ..., xn,
       SUM(CASE WHEN x0 = 'A'
                THEN 1 ELSE 0 END) AS a_tally,
       SUM(CASE WHEN x0 = 'B'
                THEN 1 ELSE 0 END) AS b_tally
  FROM (SELECT 'A', A.* FROM A
        UNION ALL
        SELECT 'B', B.* FROM B) AS X (x0, x1, x2, ..., xn)
 GROUP BY x1, x2, ..., xn
HAVING SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END)
       <> SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END);
You might want to think about the differences that changing the expression for the derived table X can make. If you use a UNION instead of a UNION ALL, then the row count for each group in both tables will be one. If you use a SELECT DISTINCT instead of a SELECT, then the row count in just that table will be one for each group.
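The grouping technique is easy to try on a small pair of tables. The sketch below is only an illustration: the two-column layout (x1, x2) and the sample rows are invented, and the tallies use SUM, because COUNT would count every non-NULL value, including the zeros.

```sql
-- Hypothetical two-column versions of A and B, without keys.
CREATE TABLE A (x1 CHAR(1), x2 INTEGER);
CREATE TABLE B (x1 CHAR(1), x2 INTEGER);

INSERT INTO A VALUES ('a', 1), ('a', 1), ('b', NULL), ('c', 3);
INSERT INTO B VALUES ('a', 1), ('b', NULL), ('c', 3), ('d', 4);

-- Report every distinct row whose number of copies differs between A and B.
SELECT x1, x2,
       SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END) AS a_tally,
       SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END) AS b_tally
  FROM (SELECT 'A', A.* FROM A
        UNION ALL
        SELECT 'B', B.* FROM B) AS X (x0, x1, x2)
 GROUP BY x1, x2
HAVING SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END)
       <> SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END);

-- Expected rows (in any order):
--   x1   x2   a_tally  b_tally
--   'a'  1    2        1
--   'd'  4    0        1
```

The ('b', NULL) rows fall into the same group and cancel out, because grouping treats the two NULLs as equal. The duplicated ('a', 1) row in A is reported because the tallies differ.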
27.3.3 Subset Equality
A surprisingly usable version of set equality is finding identical subsets within the same table. These identical subsets can build partitions that are known as equivalence classes in set theory. Let’s use Chris Date’s suppliers-and-parts table to find pairs of suppliers who provide exactly the same parts; that is, the set of parts from one supplier is equal to the set of parts from the other supplier.
CREATE TABLE SupParts
(sup_nbr CHAR(2) NOT NULL,
 part_nbr CHAR(2) NOT NULL,
 PRIMARY KEY (sup_nbr, part_nbr));
The usual way of proving that two sets are equal is to show that set A contains set B and set B contains set A.
Any of the methods given above can be modified to handle two copies of the same table under aliases. Instead, consider another approach. First JOIN one supplier to another on their common parts, eliminating the situation where the first supplier is also the second supplier, so that you have the intersection of the two subsets. If the intersection has the same number of pairs as each of the two subsets has elements, the two subsets are equal.
SELECT SP1.sup_nbr, SP2.sup_nbr, COUNT(*) AS part_count
  FROM SupParts AS SP1
       INNER JOIN SupParts AS SP2
       ON SP1.part_nbr = SP2.part_nbr
          AND SP1.sup_nbr < SP2.sup_nbr
 GROUP BY SP1.sup_nbr, SP2.sup_nbr
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP3
                    WHERE SP3.sup_nbr = SP1.sup_nbr)
   AND COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP4
                    WHERE SP4.sup_nbr = SP2.sup_nbr);
If there is an index on the supplier number in the SupParts table, it can provide the counts directly, as well as helping with the JOIN operation. The only problem with this answer is that it is hard to see the groups of suppliers among the pairs. The part_count column helps a bit, but it does not assign a grouping identifier to the rows.
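One possible way to get a grouping identifier, offered here only as a sketch and not as part of the original solution, is to let each supplier match itself as well and then label every supplier with the lowest supplier number that has the same set of parts. The names Matches, sup_a, sup_b, and class_id, along with the sample rows, are assumptions made for the illustration.

```sql
-- Hypothetical sample data: s1 and s2 supply exactly the same parts.
INSERT INTO SupParts (sup_nbr, part_nbr)
VALUES ('s1', 'p1'), ('s1', 'p2'),
       ('s2', 'p1'), ('s2', 'p2'),
       ('s3', 'p1'),
       ('s4', 'p3');

WITH Matches (sup_a, sup_b)
AS (SELECT SP1.sup_nbr, SP2.sup_nbr
      FROM SupParts AS SP1
           INNER JOIN SupParts AS SP2
           ON SP1.part_nbr = SP2.part_nbr  -- no "<" test, so self-pairs stay
     GROUP BY SP1.sup_nbr, SP2.sup_nbr
    HAVING COUNT(*) = (SELECT COUNT(*)
                         FROM SupParts AS SP3
                        WHERE SP3.sup_nbr = SP1.sup_nbr)
       AND COUNT(*) = (SELECT COUNT(*)
                         FROM SupParts AS SP4
                        WHERE SP4.sup_nbr = SP2.sup_nbr))
SELECT sup_a AS sup_nbr, MIN(sup_b) AS class_id
  FROM Matches
 GROUP BY sup_a;

-- Expected rows (in any order):
--   sup_nbr  class_id
--   's1'     's1'
--   's2'     's1'
--   's3'     's3'
--   's4'     's4'
```

Suppliers with the same class_id belong to the same equivalence class. The price is that the join now produces every pair twice, plus the self-pairs.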
27.4 Picking a Representative Subset
This problem and solution for it are due to Ross Presser. The problem is to find a subset of rows such that each value in each of two columns appears in at least one row. The purpose is to produce a set of samples from a large table. The table has a club_name column and an ifc column; I want a set of samples that contains at least one of each club_name and at least one of each ifc, but no more than necessary.
CREATE TABLE Memberships
(member_id INTEGER NOT NULL PRIMARY KEY,
club_name CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);
CREATE TABLE Samples
(member_id INTEGER NOT NULL PRIMARY KEY,
club_name CHAR(7) NOT NULL,
ifc CHAR(4) NOT NULL);
INSERT INTO Memberships
VALUES (6401715, 'aarprat', 'ic17'),
(1058337, 'aarprat', 'ic17'),
(459443, 'aarpprt', 'ic25'),
(4018210, 'aarpbas', 'ig21'),
(2430656, 'aarpbas', 'ig21'),
(6802081, 'aarpprd', 'ig29'),
(4236511, 'aarpprd', 'ig29'),
(2162104, 'aarpbas', 'ig21'),
(2073679, 'aarpprd', 'ig29'),
(8148891, 'aarpbas', 'ig21'),
(1868445, 'aarpbas', 'ig21'),
(6749213, 'aarpbas', 'ig21'),
(8363621, 'aarppup', 'ig29'),
(9999, 'aarppup', 'ic17'); -- redundant
To help frame the problem better, consider this subset, which has a row with both a redundant club_name value and ifc value:
Non-Minimal subset
member_id club_name ifc
=========================
9999 aarppup ic17 <== redundant row
1058337 aarprat ic17 <== ifc
459443 aarpprt ic25
1868445 aarpbas ig21
2073679 aarpprd ig29
8363621 aarppup ig29 <== club_name
There can be more than one minimal solution. But we would be happy to simply find a near-minimal solution.
David Portas came up with a query that gives a near-minimal solution. This will produce a sample containing at least one row of each value in the two columns. It is not guaranteed to give the minimum subset, but it should contain at most (c + i − 1) rows, where (c) is the number of distinct clubs and (i) the number of distinct ifcs.
SELECT member_id, club_name, ifc
  FROM Memberships AS M
 WHERE member_id IN
       (SELECT MIN(member_id)
          FROM Memberships
         GROUP BY club_name
        UNION ALL
        SELECT MIN(member_id)
          FROM Memberships AS M2
         GROUP BY ifc
        HAVING NOT EXISTS
               (SELECT *
                  FROM Memberships
                 WHERE member_id
                       IN (SELECT MIN(member_id)
                             FROM Memberships
                            GROUP BY club_name)
                   AND ifc = M2.ifc));
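Whatever method produces the sample, it is worth checking that it really covers both columns. The following sketch assumes the candidate rows have been loaded into the Samples table declared earlier; the result column names missing_column and missing_value are illustrative.

```sql
-- Lists every club_name or ifc value that has no row in Samples.
-- An empty result means the sample covers both columns.
SELECT 'club_name' AS missing_column, M.club_name AS missing_value
  FROM Memberships AS M
 WHERE NOT EXISTS
       (SELECT *
          FROM Samples AS S
         WHERE S.club_name = M.club_name)
UNION
SELECT 'ifc', M.ifc
  FROM Memberships AS M
 WHERE NOT EXISTS
       (SELECT *
          FROM Samples AS S
         WHERE S.ifc = M.ifc);
```

This tests coverage only; it says nothing about whether the sample is minimal.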
I am not sure it’s possible to find the minimum subset every time, unless you use an iterative solution. The results are very dependent on the exact data involved.
Ross Presser’s iterative solution used the six-step system below, and found that the number of rows resulting depended on both the order of the insert queries and on whether we used MAX() or MIN(). That said, the resulting row count only varied from 403 to 410 rows on a real run of 52,776 invoices for a set where (c = 325) and (i = 117). Portas’s query gave a result of 405 rows, which is worse but not fatally worse.
-- first step: unique clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(member_id), club_name, MIN(ifc)
FROM Memberships
GROUP BY club_name
HAVING COUNT(*) = 1;
-- second step: unique ifcs where club_name not already there
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name),
Memberships.ifc
FROM Memberships
GROUP BY Memberships.ifc
HAVING MIN(Memberships.club_name)
NOT IN (SELECT club_name FROM Samples)
AND COUNT(*) = 1;
-- intermezzo: views for missing ifcs, missing clubs
CREATE VIEW MissingClubs (club_name)
AS
SELECT Memberships.club_name
FROM Memberships
LEFT OUTER JOIN
Samples
ON Memberships.club_name = Samples.club_name
WHERE Samples.club_name IS NULL
GROUP BY Memberships.club_name;
CREATE VIEW MissingIfcs (ifc)
AS
SELECT Memberships.ifc
FROM Memberships
LEFT OUTER JOIN
Samples
ON Memberships.ifc = Samples.ifc
WHERE Samples.ifc IS NULL
GROUP BY Memberships.ifc;
-- third step: distinct missing clubs that are also missing ifcs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), Memberships.club_name,
MIN(Memberships.ifc)
FROM Memberships, MissingClubs, MissingIfcs
WHERE Memberships.club_name = MissingClubs.club_name
AND Memberships.ifc = MissingIfcs.ifc
GROUP BY Memberships.club_name;