Joe Celko's SQL for Smarties: Advanced SQL Programming


612 CHAPTER 27: SUBSETS

= (SELECT MIN(keycol) + (MAX(keycol) - MIN(keycol)) * RANDOM()
     FROM LotteryDrawing AS L2);

Here is a version which uses the COUNT(*) functions and a self-join instead:

SELECT L1.*
  FROM LotteryDrawing AS L1
 WHERE CEILING((SELECT COUNT(*) FROM LotteryDrawing) * RANDOM())
       = (SELECT COUNT(*)
            FROM LotteryDrawing AS L2
           WHERE L1.keycol <= L2.keycol);

The rounding away from zero is important, since we are in effect numbering the rows from one. The idea is to use the decimal fraction to hit the row that is that far into the table when the rows are ordered by the key.

Having shown you this code, I have to warn you that the pure SQL has a good number of self-joins, and they will be expensive to run.

27.3 The CONTAINS Operators

Set theory has two symbols for subsets. One, ⊂, means that set A is contained within set B; this is sometimes said to denote a proper subset.

The other, ⊆, means “is contained in or equal to,” and is sometimes called just a subset or containment operator.

Standard SQL has never had an operator to compare tables against each other for equality or containment. Several college textbooks on relational databases mention a CONTAINS predicate, which does not exist in Standard SQL. This predicate existed in the original System R, IBM’s first experimental SQL system, but it was dropped from later SQL implementations because of the expense of running it.

27.3.1 Proper Subset Operators

The IN predicate is a test for membership. For those of you who remember your high school set theory, membership is shown with a stylized epsilon with the containing set on the right side: a ∈ A.

Membership is for one element, whereas a subset is itself a set, not just an element. As an example of a subset predicate, consider a query to tell you the names of each employee who works on all of the projects in department 5. Using the System R syntax:

SELECT name  -- Not valid SQL!
  FROM Personnel
 WHERE (SELECT project_nbr
          FROM WorksOn
         WHERE Personnel.emp_nbr = WorksOn.emp_nbr)
       CONTAINS
       (SELECT project_nbr
          FROM Projects
         WHERE dept_nbr = 5);

In the second SELECT statement of the CONTAINS predicate, we build a table of all the projects in department 5. In the first SELECT statement of the CONTAINS predicate, we have a correlated subquery that will build a table of all the projects each employee works on. If the table of the employee’s projects is equal to or a superset of the department 5 table, the predicate is TRUE.
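Since CONTAINS is not available, the same question is commonly answered with relational division. The following is a minimal sketch, assuming the Personnel, WorksOn, and Projects tables have exactly the columns used in the System R example above; it asks for each employee for whom there is no department 5 project that the employee does not work on.

```sql
-- Employees whose project list contains every project in department 5.
-- Sketch only: table and column names are taken from the example above.
SELECT P.name
  FROM Personnel AS P
 WHERE NOT EXISTS
       (SELECT *
          FROM Projects AS PJ
         WHERE PJ.dept_nbr = 5
           AND NOT EXISTS
               (SELECT *
                  FROM WorksOn AS W
                 WHERE W.emp_nbr = P.emp_nbr
                   AND W.project_nbr = PJ.project_nbr));
```

Note that if department 5 has no projects at all, every employee qualifies, which matches the idea that any set contains the empty set.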

You must first decide what you are going to do about duplicate rows in either or both tables. That is, does the set { a, b, c } contain the multiset { a, b, b } or not? Some SQL set operations, such as SELECT and UNION, have options to keep or remove duplicates.

I would argue that duplicates should be ignored, and that the multiset is a subset of the other. For our example, let us use a table of employees and another table with the names of the company bowling team members, which should be a proper subset of the Personnel table. For the bowling team to be contained in the set of employees, each bowler must be an employee; or, to put it another way, there must be no bowler who is not an employee:

NOT EXISTS (SELECT *
              FROM Bowling AS B1
             WHERE B1.emp_nbr NOT IN (SELECT emp_nbr FROM Personnel))

27.3.2 Table Equality

How can I find out if two tables are equal to each other? This is a common programming problem, and the specification sounds obvious. When two sets, A and B, are equal, then we know that:

1. Both have the same number of elements.
2. No elements in A are not in B.
3. No elements in B are not in A.
4. Set A is equal to the intersection of A and B.
5. Set B is equal to the intersection of A and B.
6. Set B is a subset of A.
7. Set A is a subset of B.

as well as probably a few other things vaguely remembered from an old math class. But equality is not as easy as it sounds in SQL, because the language is based on multisets or bags, which allow duplicate elements, and the language has NULLs. Given this list of multisets, which pairs are equal to each other?

S0 = {a, b, c}

S1 = {a, b, NULL}

S2 = {a, b, b, c, c}

S3 = {a, b, NULL}

S4 = {a, b, c}

S5 = {x, y, z}

Everyone will agree that S0 = S4, because they are identical. Everyone will agree that S5 is not equal to any other set because it has no elements in common with any of them. How do you handle redundant duplicates? If you ignore them, then S0 = S2. Should NULLs be given the benefit of the doubt and matched to any known value or not? If so, then S0 = S1 and S0 = S3. But then do you want to say that S1 = S3 because we can pair up the NULLs with each other?
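To make the NULL question concrete, here is a small sketch that loads S1 and S3 as one-column tables. The table layout and the column name elem are assumptions made for illustration. It shows that SQL itself gives two different answers, depending on which operator does the comparing.

```sql
-- S1 and S3 from the list above, one element per row (assumed layout).
CREATE TABLE S1 (elem CHAR(1));
CREATE TABLE S3 (elem CHAR(1));

INSERT INTO S1 VALUES ('a'), ('b'), (NULL);
INSERT INTO S3 VALUES ('a'), ('b'), (NULL);

-- A join uses the = comparison. NULL = NULL is UNKNOWN, so only
-- 'a' and 'b' pair up and the NULL rows are left unmatched.
SELECT COUNT(*) AS matched_by_join   -- 2
  FROM S1
       INNER JOIN S3
       ON S1.elem = S3.elem;

-- EXCEPT treats NULLs as duplicates of each other, so nothing
-- is left over and the result is empty.
SELECT elem FROM S1
EXCEPT
SELECT elem FROM S3;
```

Under the join, S1 and S3 look different; under EXCEPT, they look equal. Which behavior you want is part of choosing a definition of equality.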

To make matters even worse: are two rows equal if they match on just their keys, on a particular subset of their columns, or on all their columns? The reason this question comes up in practice is that you often have to match up data from two sources that have slightly different versions of the same information (e.g., “Joe F Celko” and “Joseph Frank Celko” are probably the same person).

The good part about matching things on the keys is that you do have a true set: keys are unique and cannot have NULLs. If you go back to the list of set equality tests that I gave at the start of this section, you can see some possible ways to code a solution.

If you use facts 2 and 3 in the list, then you might use NOT EXISTS predicates:

WHERE NOT EXISTS (SELECT *
                    FROM A
                   WHERE A.keycol
                         NOT IN (SELECT keycol
                                   FROM B
                                  WHERE A.keycol = B.keycol))
  AND NOT EXISTS (SELECT *
                    FROM B
                   WHERE B.keycol
                         NOT IN (SELECT keycol
                                   FROM A
                                  WHERE A.keycol = B.keycol))

This query can also be written as:

WHERE NOT EXISTS
      ((SELECT * FROM A
        EXCEPT [ALL]
        SELECT * FROM B)
       UNION
       (SELECT * FROM B
        EXCEPT [ALL]
        SELECT * FROM A))

The use of the optional EXCEPT ALL operators will determine how duplicates are handled.

However, if you look at 1, 4, and 5, you might come up with this answer:

WHERE (SELECT COUNT(*) FROM A)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)
  AND (SELECT COUNT(*) FROM B)
      = (SELECT COUNT(*)
           FROM A INNER JOIN B
                ON A.keycol = B.keycol)

This query will produce a list of the unmatched values; you might want to keep them in two columns instead of coalescing them as I have shown here:

SELECT DISTINCT COALESCE(A.keycol, B.keycol) AS non_matched_key
  FROM A
       FULL OUTER JOIN B
       ON A.keycol = B.keycol
 WHERE A.keycol IS NULL
    OR B.keycol IS NULL;

Eventually, you will be able to handle this with the INTERSECT operator and tune the query to whatever definition of equality you wish to use.

Unfortunately, these examples are for just comparing the keys. What do we do if we have tables without keys, or if we want to compare all the columns?

One answer is to compare entire rows, treating NULLs as if they were equal to each other. This is probably the definition of equality we would like to use.

Remember that if one table has more columns or more rows than the other, we can stop right there, since they cannot possibly be equal under that definition. We have to assume that the tables have the same number of columns, of the same type, and in the same positions. But row counts look useful. Imagine that there are two children, each with a bag of candy. To determine that both bags are identical, the first child can start by pulling a piece of candy out and asking the other, “How many red ones do you have?” If the two counts disagree, we know that the bags are different. Now ask about the green pieces. We do not have to match each particular piece of candy in one bag with a particular piece of candy in the other bag. The counts are enough information only if they differ. If the counts are the same, more work needs to be done. We could each have one brown piece of candy, but mine could be an M&M, and yours could be a malted milk ball.

Now, generalize that idea. Let’s combine the two tables into one big table, with an extra column, x0, to show from where each row originally came.

Now form groups based on all the original columns. Within each group, count the number of rows from one table and the number of rows from the second table. If the counts are different, there are unmatched rows.

This will handle redundant duplicate rows within one table. This query does not require that the tables have keys. The assumption in a GROUP BY is that NULLs are grouped together.

Here is the final query:

-- SUM is used for the tallies; COUNT would also count the ELSE 0 rows.
SELECT x1, x2, ..., xn,
       SUM(CASE WHEN x0 = 'A'
                THEN 1 ELSE 0 END) AS a_tally,
       SUM(CASE WHEN x0 = 'B'
                THEN 1 ELSE 0 END) AS b_tally
  FROM (SELECT 'A', A.* FROM A
        UNION ALL
        SELECT 'B', B.* FROM B) AS X (x0, x1, x2, ..., xn)
 GROUP BY x1, x2, ..., xn
HAVING SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END)
       <> SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END);

You might want to think about the differences that changing the expression for the derived table X can make. If you use a UNION instead of a UNION ALL, then the row count for each group in both tables will be one. If you use a SELECT DISTINCT instead of a SELECT, then the row count in just that table will be one for each group.
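As a concrete instance of the grouping query, here is a minimal sketch with two small two-column tables. The column names c1 and c2 and the sample rows are made up for illustration. The tallies are computed with SUM over the CASE expressions, so that a row from the other table adds zero to a tally.

```sql
-- Hypothetical sample tables; neither has a key, and A has a duplicate row.
CREATE TABLE A (c1 INTEGER, c2 CHAR(1));
CREATE TABLE B (c1 INTEGER, c2 CHAR(1));

INSERT INTO A VALUES (1, 'a'), (2, 'b'), (2, 'b');
INSERT INTO B VALUES (1, 'a'), (2, 'b'), (3, 'c');

SELECT x1, x2,
       SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END) AS a_tally,
       SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END) AS b_tally
  FROM (SELECT 'A', A.c1, A.c2 FROM A
        UNION ALL
        SELECT 'B', B.c1, B.c2 FROM B) AS X (x0, x1, x2)
 GROUP BY x1, x2
HAVING SUM(CASE WHEN x0 = 'A' THEN 1 ELSE 0 END)
       <> SUM(CASE WHEN x0 = 'B' THEN 1 ELSE 0 END);

-- Groups returned (row order is not guaranteed):
--   x1  x2  a_tally  b_tally
--   2   b   2        1
--   3   c   0        1
```

The group (1, 'a') does not appear because it occurs once in each table. The group (2, 'b') appears only because the duplicate in A is counted, which is the effect that UNION ALL preserves.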

27.3.3 Subset Equality

A surprisingly usable version of set equality is finding identical subsets within the same table. These identical subsets can build partitions that are known as equivalence classes in set theory. Let’s use Chris Date’s suppliers-and-parts table to find pairs of suppliers who provide exactly the same parts; that is, the set of parts from one supplier is equal to the set of parts from the other supplier.

CREATE TABLE SupParts
(sup_nbr CHAR(2) NOT NULL,
 part_nbr CHAR(2) NOT NULL,
 PRIMARY KEY (sup_nbr, part_nbr));

The usual way of proving that two sets are equal is to show that set A contains set B and set B contains set A.

Any of the methods given above can be modified to handle two copies of the same table under aliases. Instead, consider another approach. First JOIN one supplier to another on their common parts, eliminating the situation where the first supplier is also the second supplier, so that you have the intersection of the two subsets. If the intersection has the same number of pairs as each of the two subsets has elements, the two subsets are equal:

SELECT SP1.sup_nbr, SP2.sup_nbr, COUNT(*) AS part_count
  FROM SupParts AS SP1
       INNER JOIN SupParts AS SP2
       ON SP1.part_nbr = SP2.part_nbr
          AND SP1.sup_nbr < SP2.sup_nbr
 GROUP BY SP1.sup_nbr, SP2.sup_nbr
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP3
                    WHERE SP3.sup_nbr = SP1.sup_nbr)
   AND COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP4
                    WHERE SP4.sup_nbr = SP2.sup_nbr);

If there is an index on the supplier number in the SupParts table, it can provide the counts directly, as well as helping with the JOIN operation. The only problem with this answer is that it is hard to see the groups of suppliers among the pairs. The part_count column helps a bit, but it does not assign a grouping identifier to the rows.
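To see the pairwise query at work, here is a minimal sketch on a handful of rows. The supplier and part codes are invented for illustration, and the table definition is repeated so the example stands alone. The output columns are given the aliases sup1 and sup2, which are not in the original query, to keep the two supplier columns apart.

```sql
CREATE TABLE SupParts
(sup_nbr CHAR(2) NOT NULL,
 part_nbr CHAR(2) NOT NULL,
 PRIMARY KEY (sup_nbr, part_nbr));

-- s1 and s2 supply the same two parts; s3 supplies only one of them.
INSERT INTO SupParts
VALUES ('s1', 'p1'), ('s1', 'p2'),
       ('s2', 'p1'), ('s2', 'p2'),
       ('s3', 'p1');

SELECT SP1.sup_nbr AS sup1, SP2.sup_nbr AS sup2, COUNT(*) AS part_count
  FROM SupParts AS SP1
       INNER JOIN SupParts AS SP2
       ON SP1.part_nbr = SP2.part_nbr
          AND SP1.sup_nbr < SP2.sup_nbr
 GROUP BY SP1.sup_nbr, SP2.sup_nbr
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP3
                    WHERE SP3.sup_nbr = SP1.sup_nbr)
   AND COUNT(*) = (SELECT COUNT(*)
                     FROM SupParts AS SP4
                    WHERE SP4.sup_nbr = SP2.sup_nbr);

-- Result: one row, (s1, s2, 2).
-- The pairs (s1, s3) and (s2, s3) share only p1, so their intersection
-- count of 1 does not match the 2 parts that s1 and s2 each supply.
```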

27.4 Picking a Representative Subset

This problem and solution for it are due to Ross Presser. The problem is to find a subset of rows such that each value in each of two columns appears in at least one row. The purpose is to produce a set of samples from a large table. The table has a club_name column and an ifc column; I want a set of samples that contains at least one of each club_name and at least one of each ifc, but no more than necessary.

CREATE TABLE Memberships

(member_id INTEGER NOT NULL PRIMARY KEY,

club_name CHAR(7) NOT NULL,

ifc CHAR(4) NOT NULL);

CREATE TABLE Samples

(member_id INTEGER NOT NULL PRIMARY KEY,

club_name CHAR(7) NOT NULL,

ifc CHAR(4) NOT NULL);

INSERT INTO Memberships

VALUES (6401715, 'aarprat', 'ic17'),

(1058337, 'aarprat', 'ic17'),

(459443, 'aarpprt', 'ic25'),

(4018210, 'aarpbas', 'ig21'),

(2430656, 'aarpbas', 'ig21'),

(6802081, 'aarpprd', 'ig29'),

(4236511, 'aarpprd', 'ig29'),

(2162104, 'aarpbas', 'ig21'),

(2073679, 'aarpprd', 'ig29'),

(8148891, 'aarpbas', 'ig21'),

(1868445, 'aarpbas', 'ig21'),

(6749213, 'aarpbas', 'ig21'),

(8363621, 'aarppup', 'ig29'),

(9999, 'aarppup', 'ic17'); -- redundant

To help frame the problem better, consider this subset, which has a row with both a redundant club_name value and ifc value.

Non-Minimal subset
member_id  club_name  ifc
=========================
9999       aarppup    ic17  <== redundant row
1058337    aarprat    ic17  <== ifc
459443     aarpprt    ic25
1868445    aarpbas    ig21
2073679    aarpprd    ig29
8363621    aarppup    ig29  <== club_name
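One way to spot removal candidates in a sample is to look for rows whose club_name and ifc are both covered by some other row. The following is a minimal sketch of that check, not part of Presser's or Portas's solutions; it assumes the Samples table defined earlier is empty and loads it with the six rows of the non-minimal subset.

```sql
-- Load the non-minimal subset into the (assumed empty) Samples table.
INSERT INTO Samples
VALUES (9999, 'aarppup', 'ic17'),
       (1058337, 'aarprat', 'ic17'),
       (459443, 'aarpprt', 'ic25'),
       (1868445, 'aarpbas', 'ig21'),
       (2073679, 'aarpprd', 'ig29'),
       (8363621, 'aarppup', 'ig29');

-- Rows whose club_name AND ifc each appear in at least one other row.
SELECT S1.member_id, S1.club_name, S1.ifc
  FROM Samples AS S1
 WHERE EXISTS (SELECT *
                 FROM Samples AS S2
                WHERE S2.club_name = S1.club_name
                  AND S2.member_id <> S1.member_id)
   AND EXISTS (SELECT *
                 FROM Samples AS S3
                WHERE S3.ifc = S1.ifc
                  AND S3.member_id <> S1.member_id);

-- Returns members 9999 and 8363621.
```

Each row returned could be dropped on its own, but not both together: once 9999 is removed, 8363621 becomes the only row for aarppup, and the reverse also holds. This is one reason the minimal subset is not unique.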

There can be more than one minimal solution. But we would be happy to simply find a near-minimal solution.

David Portas came up with a query that gives a near-minimal solution. This will produce a sample containing at least one row of each value in the two columns. It isn’t guaranteed to give the minimum subset, but it should contain at most (c + i − 1) rows, where (c) is the number of distinct clubs and (i) the number of distinct ifcs.

SELECT member_id, club_name, ifc
  FROM Memberships AS M
 WHERE member_id IN
       (SELECT MIN(member_id)
          FROM Memberships
         GROUP BY club_name
        UNION ALL
        SELECT MIN(member_id)
          FROM Memberships AS M2
         GROUP BY ifc
        HAVING NOT EXISTS
               (SELECT *
                  FROM Memberships
                 WHERE member_id
                       IN (SELECT MIN(member_id)
                             FROM Memberships
                            GROUP BY club_name)
                   AND ifc = M2.ifc));

I am not sure it’s possible to find the minimum subset every time, unless you use an iterative solution. The results are very dependent on the exact data involved.

Ross Presser’s iterative solution used the six-step system below, and found that the number of rows resulting depended on both the order of the insert queries and on whether we used MAX() or MIN(). That said, the resulting row count only varied from 403 to 410 rows on a real run of 52,776 invoices for a set where (c = 325) and (i = 117). Portas’s query gave a result of 405 rows, which is worse but not fatally worse.

-- first step: unique clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(member_id), club_name, MIN(ifc)
  FROM Memberships
 GROUP BY club_name
HAVING COUNT(*) = 1;

-- second step: unique ifcs where club_name not already there
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name),
       Memberships.ifc
  FROM Memberships
 GROUP BY Memberships.ifc
HAVING MIN(Memberships.club_name)
       NOT IN (SELECT club_name FROM Samples)
   AND COUNT(*) = 1;

-- intermezzo: views for missing ifcs, missing clubs

CREATE VIEW MissingClubs (club_name)

AS

SELECT Memberships.club_name

FROM Memberships

LEFT OUTER JOIN

Samples

ON Memberships.club_name = Samples.club_name

WHERE Samples.club_name IS NULL

GROUP BY Memberships.club_name;

CREATE VIEW MissingIfcs (ifc)

AS

SELECT Memberships.ifc

FROM Memberships

LEFT OUTER JOIN

Samples

ON Memberships.ifc = Samples.ifc

WHERE Samples.ifc IS NULL

GROUP BY Memberships.ifc;

-- third step: distinct missing clubs that are also missing ifcs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), Memberships.club_name,
       MIN(Memberships.ifc)
  FROM Memberships, MissingClubs, MissingIfcs
 WHERE Memberships.club_name = MissingClubs.club_name
   AND Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.club_name;
