CHAPTER 26: SET OPERATIONS
For the rest of this discussion, let us create two tables with the same structure, which we can use for examples:
CREATE TABLE S1 (a1 CHAR(1));
INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c');
CREATE TABLE S2 (a2 CHAR(1));
INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d');
26.1 UNION and UNION ALL
<table expression> UNION [ALL] <table expression>
The two versions of the UNION statement take two tables and build a result table from them. The two tables must be union-compatible, which means that they have exactly the same number of columns, and that each column in the first table has the same data type as (or can be automatically cast to the type of) the column in the same position in the second table. That is, their rows must have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this depends on your implementation, and you should check it out for yourself.
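Since automatic conversion varies by product, one defensive habit is to spell out the conversion yourself. The sketch below assumes two hypothetical tables, Prices2023 and Prices2024, that hold the same logical column with different declared types; the table and column names are invented for illustration.

```sql
-- Hypothetical tables: same logical columns, different declared types.
CREATE TABLE Prices2023
(item_name VARCHAR(20) NOT NULL,
 price INTEGER NOT NULL);

CREATE TABLE Prices2024
(item_name VARCHAR(20) NOT NULL,
 price DECIMAL(8,2) NOT NULL);

-- Casting both sides to one type removes any dependence on the
-- product's implicit conversion rules.
SELECT item_name, CAST(price AS DECIMAL(8,2)) AS price
  FROM Prices2023
UNION ALL
SELECT item_name, CAST(price AS DECIMAL(8,2))
  FROM Prices2024;
```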
There are two forms of the UNION statement: the UNION and the UNION ALL. The plain UNION is the same operator you had in high school set theory; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table. The phrase “redundant duplicates” sounds funny, but it means that you leave one copy of the row in the table. The sample tables will yield:
(SELECT a1 FROM S1
UNION
SELECT a2 FROM S2)
====
'a'
'b'
'c'
'd'
In many early SQL implementations, this removal was done by merge-sorting the two tables and discarding duplicates during the sort. This had the side effect that the result table was sorted, but you could not depend on that. Later implementations use hashing, indexing, and parallel processing to find the duplicates.
The UNION ALL preserves the duplicates from both tables in the result table. Most early implementations simply appended one table to the other in physical storage. They used file systems based on physically contiguous storage, so this was easy and used the file system code. But, again, you cannot depend on any ordering in the results of either version of the UNION statement. Again, the sample tables will yield:
(SELECT a1 FROM S1
UNION ALL
SELECT a2 FROM S2)
====
'a'
'a'
'a'
'b'
'b'
'b'
'b'
'b'
'c'
'c'
'd'
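The listing above happens to be shown in sorted order, but nothing guarantees that. If the order matters, ask for it explicitly. A minimal sketch, assuming we give both columns the same alias (a) so that the ORDER BY has an unambiguous name to refer to:

```sql
-- The ORDER BY applies to the whole UNION ALL result,
-- not just to the second SELECT.
SELECT a1 AS a FROM S1
UNION ALL
SELECT a2 AS a FROM S2
ORDER BY a;
```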
You can assign names to the columns by using the AS operator to make the result set into a derived table, thus:
SELECT rent, utilities, phone
FROM
(SELECT a, b, c FROM OldLocations WHERE city = 'Boston'
UNION
SELECT x, y, z FROM NewLocations WHERE city = 'New York')
AS Cities (rent, utilities, phone);
A few SQL products will attempt to optimize UNIONs if they are made on the same table. Those UNIONs can often be replaced with ORed predicates. For example:
SELECT city_name, 'Western' FROM Cities
WHERE market_code = 't'
UNION ALL
SELECT city_name, 'Eastern' FROM Cities
WHERE market_code = 'v';
This could be rewritten (probably more efficiently) as:
SELECT city_name,
CASE market_code
WHEN 't' THEN 'Western'
WHEN 'v' THEN 'Eastern' END
FROM Cities
WHERE market_code IN ('v', 't');
A system architecture based on domains rather than tables is necessary to optimize UNIONs if they are made on different tables. Doing a UNION to the same table is the same as a SELECT DISTINCT, but the SELECT DISTINCT will probably run faster and preserve the column names too.
26.1.1 Order of Execution
UNION and UNION ALL operators are executed from left to right unless parentheses change the order of execution. Since the UNION operator is associative and commutative, the order of a chain of UNIONs will not affect the results. However, order and grouping can affect performance. Consider two small tables that have many duplicates between them. If the optimizer does not consider table sizes, consider this query:
( SELECT * FROM SmallTable1)
UNION
( SELECT * FROM BigTable)
UNION
( SELECT * FROM SmallTable2);
It will merge SmallTable1 into BigTable, then merge SmallTable2 into that first result. If the rows of SmallTable1 are spread out in the first result table, locating duplicates from SmallTable2 will take longer than if we had written the query thus:
( SELECT * FROM SmallTable1)
UNION
( SELECT * FROM SmallTable2)
UNION
( SELECT * FROM BigTable);
Again, optimization of UNIONs is highly product-dependent, so you should experiment with it.
26.1.2 Mixed UNION and UNION ALL Operators
If you know that there are no duplicates, or that duplicates are not a problem in your situation, use UNION ALL instead of UNION, for speed. For example, if we are sure that BigTable has no duplicates in common with SmallTable1 and SmallTable2, this query will produce the same results as before, but should run much faster:
(( SELECT * FROM SmallTable1)
UNION
( SELECT * FROM SmallTable2))
UNION ALL
( SELECT * FROM BigTable);
But be careful when mixing UNION and UNION ALL operators. The left-to-right order of execution will cause the last operator in the chain to have an effect on the results.
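To see how much the last operator matters, here is a small sketch that uses the original sample tables S1 and S2 from the start of the chapter and swaps only the position of the ALL. The parentheses just make the left-to-right grouping explicit.

```sql
-- UNION last: the final duplicate removal applies to everything before it.
(SELECT a1 FROM S1
 UNION ALL
 SELECT a2 FROM S2)
UNION
SELECT a1 FROM S1;
-- four rows: 'a', 'b', 'c', 'd'

-- UNION ALL last: the rows of S1 are appended after duplicates were removed.
(SELECT a1 FROM S1
 UNION
 SELECT a2 FROM S2)
UNION ALL
SELECT a1 FROM S1;
-- nine rows: the four distinct values plus the five rows of S1
```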
26.1.3 UNION of Columns from the Same Table
A useful trick for building the union of columns from the same table is
to use a CROSS JOIN and a CASE expression:
SELECT CASE WHEN S1.seq_nbr = 1 THEN F1.col1
WHEN S1.seq_nbr = 2 THEN F1.col2
ELSE NULL END
FROM Foobar AS F1
CROSS JOIN
Sequence AS S1(seq_nbr)
WHERE S1.seq_nbr IN (1, 2);
This query acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes only one pass through the table. If you are working with a large table, that can be important for good performance.
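Adding DISTINCT to the same query gives UNION rather than UNION ALL behavior. The following is a small self-contained sketch; the table definitions and sample rows are assumptions made up for the example, since the text does not give them.

```sql
-- Assumed definitions, for illustration only.
CREATE TABLE Foobar (col1 CHAR(1), col2 CHAR(1));
INSERT INTO Foobar VALUES ('a', 'b'), ('b', 'c');

CREATE TABLE Sequence (seq_nbr INTEGER NOT NULL PRIMARY KEY);
INSERT INTO Sequence VALUES (1), (2);

-- Each Foobar row is paired with seq_nbr 1 and 2, so col1 and col2
-- each come out once per row; DISTINCT then removes the repeated 'b'.
SELECT DISTINCT
       CASE WHEN S1.seq_nbr = 1 THEN F1.col1
            WHEN S1.seq_nbr = 2 THEN F1.col2
            ELSE NULL END AS col_value
  FROM Foobar AS F1
       CROSS JOIN
       Sequence AS S1
 WHERE S1.seq_nbr IN (1, 2);
-- three rows: 'a', 'b', 'c'
```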
26.2 INTERSECT and EXCEPT
Intersection and set difference are part of Standard SQL, but few products have implemented them yet. The INTERSECT and EXCEPT operators take two tables and build a new table from them. The two tables must be “union-compatible,” which means that they have the same number of columns, and that each column in the first table has the same data type as (or can be automatically cast to the type of) the column in the same position in the second table. That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this is very implementation-dependent, and you should check it out for yourself. Like the UNION, the result of an INTERSECT or EXCEPT should use an AS operator if you want to have names for the result table and its columns.
Oracle was the first major vendor to have the EXCEPT operator, with the keyword MINUS. The set difference is the rows in the first table, except for those that also appear in the second table. It answers requests like “Give me all the employees except the salesmen” in a natural manner.
Let’s take our two multisets and use them to explain the basic model,
by making a mapping between them:
S1 = {a, a, b, b,    c   }
      |     |  |     |
S2 = {a,    b, b, b, c, d}
The INTERSECT and EXCEPT operators remove all duplicates from both sets, so we would have:
S1 = {a, b, c   }
      |  |  |
S2 = {a, b, c, d}
Therefore,
S1 INTERSECT S2 = {a, b, c}
and
S2 EXCEPT S1 = {d}
S1 EXCEPT S2 = {}
When you add the ALL option, things are trickier. The mapped pairs become the unit of work. The INTERSECT ALL keeps each pairing, so that:
S1 INTERSECT ALL S2 = {a, b, b, c}
The EXCEPT ALL throws away the pairs and keeps whatever is left over in the first set, thus:
S2 EXCEPT ALL S1 = {b, d}
Trying to write the INTERSECT and EXCEPT with other operators is trickier than it looks. It must be general enough to handle situations where there is no key available and the number of columns is not known.
Standard SQL defines the actions for duplicates in terms of the count of duplicates of matching rows. Let (m) be the number of rows of one kind in S1 and (n) be the number in S2. The UNION ALL will have (m+n) copies of the row. The INTERSECT ALL will have LEAST(m, n) copies. The EXCEPT ALL will have GREATEST(m-n, 0) copies; that is, the first table’s count minus the second table’s count, or zero if that difference is negative.
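Where a product supports the ALL options natively (PostgreSQL does, for example; check your own product), the counting rules can be checked directly against the original sample tables. This is only a sketch to tie the arithmetic to the data; the comments work out the tallies by hand.

```sql
-- Original sample tables:
-- S1 = {a, a, b, b, c} and S2 = {a, b, b, b, c, d}

SELECT a1 FROM S1
INTERSECT ALL
SELECT a2 FROM S2;
-- 'a' once  (LEAST(2, 1)),
-- 'b' twice (LEAST(2, 3)),
-- 'c' once  (LEAST(1, 1))

SELECT a2 FROM S2
EXCEPT ALL
SELECT a1 FROM S1;
-- 'b' once (3 - 2) and 'd' once (1 - 0);
-- 'a' gives 1 - 2, which counts as zero, and 'c' gives 1 - 1 = 0
```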
The immediate impulse of a programmer is to write the code with EXISTS() predicates, but that approach breaks down in the presence of NULLs. This is easier to show with code. Let’s redo our two sample tables:
CREATE TABLE S1 (a1 CHAR(1));
INSERT INTO S1
VALUES ('a'), ('a'), ('b'), ('b'), ('c'), (NULL), (NULL);
CREATE TABLE S2 (a2 CHAR(1));
INSERT INTO S2
VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'), (NULL);
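Before going on, it is worth seeing why the NULLs matter. In a comparison predicate, NULL = NULL is UNKNOWN, but the set operators treat NULLs as duplicates of each other. A short sketch against the two tables just created:

```sql
-- EXISTS version: S1.a1 = S2.a2 is UNKNOWN when either side is NULL,
-- so the NULL rows of S1 never qualify.
SELECT DISTINCT S1.a1
  FROM S1
 WHERE EXISTS
       (SELECT *
          FROM S2
         WHERE S1.a1 = S2.a2);
-- 'a', 'b', 'c'

-- INTERSECT matches the NULL in S1 with the NULL in S2.
SELECT a1 FROM S1
INTERSECT
SELECT a2 FROM S2;
-- 'a', 'b', 'c', NULL
```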
Now build a view to hold the tally of each value in each table:
CREATE VIEW DupCounts (a, s1_dup, s2_dup)
AS
SELECT S.a, SUM(S.s1_dup), SUM(S.s2_dup)
  FROM (SELECT S1.a1, 1, 0
          FROM S1
        UNION ALL
        SELECT S2.a2, 0, 1
          FROM S2) AS S(a, s1_dup, s2_dup)
 GROUP BY S.a; -- group on the value only, so each value gets one row holding both tallies
The GROUP BY will put the NULLs into a single group of their own, giving them the right tallies. Now the code is a straightforward implementation of the definitions in Standard SQL.
S1 EXCEPT ALL S2
SELECT DISTINCT D1.a, (s1_dup - s2_dup) AS dups
  FROM DupCounts AS D1, Sequence AS S1
 WHERE S1.seq_nbr <= (s1_dup - s2_dup);

S1 INTERSECT ALL S2
SELECT DISTINCT D1.a,
       CASE WHEN s1_dup <= s2_dup
            THEN s1_dup ELSE s2_dup END AS tally
  FROM DupCounts AS D1, Sequence AS S1
 WHERE S1.seq_nbr <= CASE WHEN s1_dup <= s2_dup
                          THEN s1_dup ELSE s2_dup END;
Notice that we had to use SELECT DISTINCT. Without it, the sample data will produce this table:
a tally
===========
NULL 1
a 1
b 2
b 2 <== redundant row
c 1
The nonduplicated versions are easy to write from the definitions in the Standards. In effect, their duplication tallies are set to one.
S1 INTERSECT S2
SELECT D1.a
FROM DupCounts AS D1
WHERE s1_dup > 0
AND s2_dup > 0;
S1 EXCEPT S2
SELECT D1.a
FROM DupCounts AS D1
WHERE s1_dup > 0
AND s2_dup = 0;
S2 EXCEPT S1
SELECT D1.a
FROM DupCounts AS D1
WHERE s2_dup > 0
AND s1_dup = 0;
26.2.1 INTERSECT and EXCEPT without NULLs and Duplicates
INTERSECT and EXCEPT are much easier if the two tables do not have NULLs and duplicate values in them. Intersection is simply done thus:
SELECT *
FROM S1
WHERE EXISTS
(SELECT *
FROM S2
WHERE S1.a1 = S2.a2);
or
SELECT *
FROM S2
WHERE EXISTS
(SELECT *
FROM S1
WHERE S1.a1 = S2.a2);
You can also use the following:
SELECT DISTINCT S2.*
FROM (S2 INNER JOIN S1 ON S1.a1 = S2.a2);
This is given as a motivation for the next piece of code, but you may find that some SQL engines do joins faster than EXISTS() predicates, and vice versa, so it is a good idea to have more than one trick in your bag.
The set difference can be written with an OUTER JOIN operator. This code is due to Jim Panttaja.
SELECT DISTINCT S2.*
FROM (S2 LEFT OUTER JOIN S1
ON S1.a1 = S2.a2)
WHERE S1.a1 IS NULL;
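In keeping with having more than one trick in your bag, the same set difference (S2 EXCEPT S1) can also be sketched with a NOT EXISTS() predicate. This is an alternative phrasing, not part of the code credited above, and like the rest of this section it assumes that neither table contains NULLs.

```sql
-- S2 EXCEPT S1: keep the S2 values that have no match in S1.
SELECT DISTINCT S2.a2
  FROM S2
 WHERE NOT EXISTS
       (SELECT *
          FROM S1
         WHERE S1.a1 = S2.a2);
-- With the original NULL-free sample tables, only 'd' qualifies.
```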
26.2.2 INTERSECT and EXCEPT with NULLs and Duplicates
These versions of INTERSECT and EXCEPT are due to Itzik Ben-Gan. They make very good use of the UNION and DISTINCT operators to implement set theory definitions.
S1 INTERSECT S2
SELECT D.a
  FROM (SELECT DISTINCT a1 FROM S1
        UNION ALL
        SELECT DISTINCT a2 FROM S2) AS D(a)
 GROUP BY D.a
HAVING COUNT(*) > 1;
S1 INTERSECT ALL S2
SELECT D2.a
  FROM (SELECT D1.a, MIN(cnt) AS mincnt
          FROM (SELECT a1, COUNT(*)
                  FROM S1
                 GROUP BY a1
                UNION ALL
                SELECT a2, COUNT(*)
                  FROM S2
                 GROUP BY a2) AS D1(a, cnt)
         GROUP BY D1.a
        HAVING COUNT(*) > 1) AS D2
       INNER JOIN
       Sequence
       ON seq_nbr <= mincnt;
S1 EXCEPT ALL S2
SELECT D2.a
  FROM (SELECT D1.a, SUM(cnt)
          FROM (SELECT a1, COUNT(*)
                  FROM S1
                 GROUP BY a1
                UNION ALL
                SELECT a2, -COUNT(*)
                  FROM S2
                 GROUP BY a2) AS D1(a, cnt)
         GROUP BY D1.a
        HAVING SUM(cnt) > 0) AS D2(a, dups)
       INNER JOIN
       Sequence
       ON seq_nbr <= D2.dups;
The Sequence table is discussed in other places in this book. It is a table of integers from 1 to (n) that is used to replace iteration and counting in SQL. Obviously, (n) must be large enough for these statements to work.
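For readers who want to run the statements above, here is one minimal way to build such a table. It is a sketch, not necessarily the construction used elsewhere in the book: the helper table Digits is a hypothetical name, and an upper limit of 100 is an arbitrary choice that is more than enough for the sample data.

```sql
CREATE TABLE Sequence
(seq_nbr INTEGER NOT NULL PRIMARY KEY);

-- Hypothetical helper table holding the ten decimal digits.
CREATE TABLE Digits
(d INTEGER NOT NULL PRIMARY KEY);

INSERT INTO Digits
VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);

-- Units digit plus tens digit gives 0 to 99; adding 1 gives 1 to 100.
INSERT INTO Sequence (seq_nbr)
SELECT D1.d + (10 * D2.d) + 1
  FROM Digits AS D1
       CROSS JOIN
       Digits AS D2;
```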
26.3 A Note on ALL and SELECT DISTINCT
Here is a series of observations, from Beught Gunne, about the relationship between the ALL option in set operations and the SELECT DISTINCT option in a query.
Given two tables with duplicate values:
CREATE TABLE A (i INTEGER NOT NULL);
INSERT INTO A VALUES (1), (1), (2), (2), (4), (4);