Joe Celko's SQL for Smarties: Advanced SQL Programming


CHAPTER 26: SET OPERATIONS

For the rest of this discussion, let us create two tables with the same structure, which we can use for examples.

CREATE TABLE S1 (a1 CHAR(1));
INSERT INTO S1 VALUES ('a'), ('a'), ('b'), ('b'), ('c');

CREATE TABLE S2 (a2 CHAR(1));
INSERT INTO S2 VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d');

26.1 UNION and UNION ALL

<table expression> UNION [ALL] <table expression>

The two versions of the UNION statement take two tables and build a result table from them. The two tables must be union-compatible, which means that they have exactly the same number of columns, and that each column in the first table has the same data type as (or can be automatically cast to the data type of) the column in the same position in the second table. That is, their rows must have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this depends on your implementation, and you should check it out for yourself.

There are two forms of the UNION statement: the UNION and the UNION ALL. The simple UNION is the same operator you met in high school set theory; it returns the rows that appear in either or both tables and removes redundant duplicates from the result table. The phrase “redundant duplicates” sounds funny, but it means that you leave one copy of the row in the table. The sample tables will yield:

(SELECT a1 FROM S1
 UNION
 SELECT a2 FROM S2)

====
'a'
'b'
'c'
'd'

In many early SQL implementations, this removal was done by merge-sorting the two tables and discarding duplicates during the sort. This had the side effect that the result table was sorted, but you could not depend on that. Later implementations use hashing, indexing, and parallel processing to find the duplicates.

The UNION ALL preserves the duplicates from both tables in the result table. Most early implementations simply appended one table to the other in physical storage. They used file systems based on physically contiguous storage, so this was easy and used the file system code. But, again, you cannot depend on any ordering in the results of either version of the UNION statement. For the sample tables, UNION ALL will yield:

(SELECT a1 FROM S1
 UNION ALL
 SELECT a2 FROM S2)

====
'a'
'a'
'a'
'b'
'b'
'b'
'b'
'b'
'c'
'c'
'd'

You can assign names to the columns by using the AS operator to make the result set into a derived table, thus:

SELECT rent, utilities, phone
  FROM (SELECT a, b, c
          FROM OldLocations
         WHERE city = 'Boston'
        UNION
        SELECT x, y, z
          FROM NewLocations
         WHERE city = 'New York')
       AS Cities (rent, utilities, phone);

A few SQL products will attempt to optimize UNIONs if they are made on the same table. Those UNIONs can often be replaced with ORed predicates. For example:

SELECT city_name, 'Western'
  FROM Cities
 WHERE market_code = 't'
UNION ALL
SELECT city_name, 'Eastern'
  FROM Cities
 WHERE market_code = 'v';

This could be rewritten (probably more efficiently) as:

SELECT city_name,
       CASE market_code
            WHEN 't' THEN 'Western'
            WHEN 'v' THEN 'Eastern' END
  FROM Cities
 WHERE market_code IN ('v', 't');

A system architecture based on domains rather than tables is necessary to optimize UNIONs if they are made on different tables. Doing a UNION to the same table is the same as a SELECT DISTINCT, but the SELECT DISTINCT will probably run faster and preserve the column names too.
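The equivalence between a UNION of a table with itself and a SELECT DISTINCT can be checked with a short sketch. The S1 sample data is inlined as a common table expression so that each statement runs on its own; this assumes a product that accepts a VALUES list inside a WITH clause (PostgreSQL and SQLite do), and it is meant only as an illustration.

```sql
-- UNION of a table with itself: the duplicates are removed.
WITH S1(a1) AS (VALUES ('a'), ('a'), ('b'), ('b'), ('c'))
SELECT a1 FROM S1
UNION
SELECT a1 FROM S1;
-- three rows: a, b, c (in no guaranteed order)

-- The same rows with a single reference to the table.
WITH S1(a1) AS (VALUES ('a'), ('a'), ('b'), ('b'), ('c'))
SELECT DISTINCT a1 FROM S1;
-- three rows: a, b, c (in no guaranteed order)
```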

26.1.1 Order of Execution

UNION and UNION ALL operators are executed from left to right unless parentheses change the order of execution. Since the UNION operator is associative and commutative, the order of a chain of UNIONs will not affect the results. However, order and grouping can affect performance. Consider two small tables that have many duplicates between them, plus one big table. If the optimizer does not consider table sizes, look at this query:

(SELECT * FROM SmallTable1)
UNION
(SELECT * FROM BigTable)
UNION
(SELECT * FROM SmallTable2);

It will merge SmallTable1 into BigTable, then merge SmallTable2 into that first result. If the rows of SmallTable1 are spread out in the first result table, locating duplicates from SmallTable2 will take longer than if we had written the query thus:

(SELECT * FROM SmallTable1)
UNION
(SELECT * FROM SmallTable2)
UNION
(SELECT * FROM BigTable);

Again, optimization of UNIONs is highly product-dependent, so you should experiment with it.

26.1.2 Mixed UNION and UNION ALL Operators

If you know that there are no duplicates, or that duplicates are not a problem in your situation, use UNION ALL instead of UNION for speed. For example, if we are sure that BigTable contains no duplicate rows and has no rows in common with SmallTable1 and SmallTable2, this query will produce the same results as before, but should run much faster:

((SELECT * FROM SmallTable1)
  UNION
 (SELECT * FROM SmallTable2))
UNION ALL
(SELECT * FROM BigTable);

But be careful when mixing UNION and UNION ALL operators. Because a chain is executed from left to right, the last operator in the chain determines whether duplicates survive in the final result: a trailing UNION removes all duplicates, including those that an earlier UNION ALL kept, while a trailing UNION ALL keeps whatever duplicates its last operand adds.
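A minimal sketch of that effect, using the S1 and S2 sample data inlined as common table expressions (an assumption: the product must accept a VALUES list inside a WITH clause, as PostgreSQL and SQLite do). The two chains contain the same three operands and differ only in where the UNION ALL sits; no parentheses are used, so each chain is evaluated strictly left to right.

```sql
WITH S1(a1) AS (VALUES ('a'), ('a'), ('b'), ('b'), ('c')),
     S2(a2) AS (VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'))
SELECT 'UNION then UNION ALL' AS chain, COUNT(*) AS row_cnt
  FROM (SELECT a1 FROM S1
        UNION
        SELECT a2 FROM S2
        UNION ALL
        SELECT a1 FROM S1) AS X
UNION ALL
SELECT 'UNION ALL then UNION', COUNT(*)
  FROM (SELECT a1 FROM S1
        UNION ALL
        SELECT a2 FROM S2
        UNION
        SELECT a1 FROM S1) AS Y;
-- UNION then UNION ALL: 9 rows (a, b, c, d plus the five rows of S1)
-- UNION ALL then UNION: 4 rows (the final UNION removes every duplicate)
```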

26.1.3 UNION of Columns from the Same Table

A useful trick for building the union of columns from the same table is to use a CROSS JOIN and a CASE expression:

SELECT CASE WHEN S1.seq_nbr = 1 THEN F1.col1
            WHEN S1.seq_nbr = 2 THEN F1.col2
            ELSE NULL END
  FROM Foobar AS F1
       CROSS JOIN
       Sequence AS S1(seq_nbr)
 WHERE S1.seq_nbr IN (1, 2);

This query acts like the UNION ALL statement, but change the SELECT to SELECT DISTINCT and you have a UNION. The advantage of this statement over the more obvious UNION is that it makes only one pass through the table. If you are working with a large table, that can be important for good performance.
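A small worked sketch of the CROSS JOIN trick follows. The two rows in Foobar and the three rows in Sequence are made-up sample data, inlined as common table expressions (which assumes a product that accepts a VALUES list inside a WITH clause, such as PostgreSQL or SQLite), and the column alias col_value is a hypothetical name.

```sql
WITH Foobar(col1, col2) AS (VALUES ('a', 'b'), ('c', 'a')),
     Sequence(seq_nbr) AS (VALUES (1), (2), (3))
SELECT CASE WHEN S1.seq_nbr = 1 THEN F1.col1
            WHEN S1.seq_nbr = 2 THEN F1.col2
            ELSE NULL END AS col_value
  FROM Foobar AS F1
       CROSS JOIN
       Sequence AS S1
 WHERE S1.seq_nbr IN (1, 2);
-- four rows: a, b, c, a (in no guaranteed order), the same multiset as
--   SELECT col1 FROM Foobar UNION ALL SELECT col2 FROM Foobar
-- With SELECT DISTINCT the result is a, b, c, the same as the UNION.
```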

26.2 INTERSECT and EXCEPT

Intersection and set difference are part of Standard SQL, but few products have implemented them yet. The INTERSECT and EXCEPT operators take two tables and build a new table from them. The two tables must be “union-compatible,” which means that they have the same number of columns, and that each column in the first table has the same data type as (or automatically casts to the data type of) the column in the same position in the second table. That is, their rows have the same structure, so they can be put in the same final result table. Most implementations will do some data type conversions to create the result table, but this is very implementation-dependent, and you should check it out for yourself. Like the UNION, the result of an INTERSECT or EXCEPT should use an AS operator if you want to have names for the result table and its columns.

Oracle was the first major vendor to have the EXCEPT operator, under the keyword MINUS. The set difference is the rows in the first table, except for those that also appear in the second table. It answers requests like “Give me all the employees except the salesmen” in a natural manner.

Let’s take our two multisets and use them to explain the basic model, by making a mapping between them:

S1 = {a, a, b, b,    c   }
      |     |  |     |
S2 = {a,    b, b, b, c, d}

The INTERSECT and EXCEPT operators remove all duplicates from both sets, so we would have:

S1 = {a, b, c   }
      |  |  |
S2 = {a, b, c, d}

Therefore,


S1 INTERSECT S2 = {a, b, c}

and

S2 EXCEPT S1 = {d}

S1 EXCEPT S2 = {}

When you add the ALL option, things are trickier. The mapped pairs become the unit of work. The INTERSECT ALL keeps each pairing, so that:

S1 INTERSECT ALL S2 = {a, b, b, c}

The EXCEPT ALL throws the pairings away and keeps what is left in the first set, thus:

S2 EXCEPT ALL S1 = {b, d}

Trying to write the INTERSECT and EXCEPT with other operators is trickier than it looks. The code must be general enough to handle situations where there is no key available and the number of columns is not known.

Standard SQL defines the actions for duplicates in terms of the count of duplicates of matching rows. Let (m) be the number of rows of one kind in S1 and (n) be the number in S2. The UNION ALL will have (m + n) copies of the row. The INTERSECT ALL will have LEAST(m, n) copies. The EXCEPT ALL will have GREATEST(m - n, 0) copies; that is, the first table’s count minus the second table’s count, or zero if that difference is negative.
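These counting rules can be checked against the sample data. The sketch below assumes a product that implements INTERSECT ALL and EXCEPT ALL and accepts a VALUES list inside a WITH clause (PostgreSQL is one such product); the derived table name D and the column name copies are hypothetical.

```sql
-- INTERSECT ALL: LEAST(m, n) copies of each value.
WITH S1(a1) AS (VALUES ('a'), ('a'), ('b'), ('b'), ('c')),
     S2(a2) AS (VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'))
SELECT D.a, COUNT(*) AS copies
  FROM (SELECT a1 FROM S1
        INTERSECT ALL
        SELECT a2 FROM S2) AS D(a)
 GROUP BY D.a;
-- a: 1 = LEAST(2, 1);  b: 2 = LEAST(2, 3);  c: 1 = LEAST(1, 1)

-- EXCEPT ALL: GREATEST(m - n, 0) copies, counting from the first operand (S2 here).
WITH S1(a1) AS (VALUES ('a'), ('a'), ('b'), ('b'), ('c')),
     S2(a2) AS (VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'))
SELECT D.a, COUNT(*) AS copies
  FROM (SELECT a2 FROM S2
        EXCEPT ALL
        SELECT a1 FROM S1) AS D(a)
 GROUP BY D.a;
-- b: 1 = 3 - 2;  d: 1 = 1 - 0;  a and c drop out
```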

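The immediate impulse of a programmer is to write the code with EXISTS() predicates that look for matching rows, but that approach breaks down when the tables contain NULLs. This is easier to show with code. Let’s redo our two sample tables, starting from empty tables, with NULLs in them.

INSERT INTO S1
VALUES ('a'), ('a'), ('b'), ('b'), ('c'), (NULL), (NULL);

INSERT INTO S2
VALUES ('a'), ('b'), ('b'), ('b'), ('c'), ('d'), (NULL);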

Now build a view to hold the tally of each value in each table.

CREATE VIEW DupCounts (a, s1_dup, s2_dup)
AS
SELECT S.a, SUM(s1_dup), SUM(s2_dup)
  FROM (SELECT S1.a1, 1, 0
          FROM S1
        UNION ALL
        SELECT S2.a2, 0, 1
          FROM S2) AS S(a, s1_dup, s2_dup)
 GROUP BY S.a;

The GROUP BY will put the NULLs into a separate group, giving them the right tallies. Now the code is a straightforward implementation of the definitions in Standard SQL.

S1 EXCEPT ALL S2

SELECT DISTINCT D1.a, (s1_dup - s2_dup) AS dups
  FROM DupCounts AS D1, Sequence AS S1
 WHERE S1.seq_nbr <= (s1_dup - s2_dup);

S1 INTERSECT ALL S2

SELECT DISTINCT D1.a,
       CASE WHEN s1_dup <= s2_dup
            THEN s1_dup ELSE s2_dup END AS tally
  FROM DupCounts AS D1, Sequence AS S1
 WHERE S1.seq_nbr
       <= CASE WHEN s1_dup <= s2_dup
               THEN s1_dup ELSE s2_dup END;

Notice that we had to use SELECT DISTINCT. Without it, the sample data will produce this table for the INTERSECT ALL query:

a     tally
===========
NULL  1
a     1
b     2
b     2  <== redundant row
c     1

The nonduplicated versions are easy to write from the definitions in the Standards. In effect, their duplication tallies are set to one.

S1 INTERSECT S2

SELECT D1.a
  FROM DupCounts AS D1
 WHERE s1_dup > 0
   AND s2_dup > 0;

S1 EXCEPT S2

SELECT D1.a
  FROM DupCounts AS D1
 WHERE s1_dup > 0
   AND s2_dup = 0;

S2 EXCEPT S1

SELECT D1.a
  FROM DupCounts AS D1
 WHERE s2_dup > 0
   AND s1_dup = 0;

26.2.1 INTERSECT and EXCEPT without NULLs and Duplicates

INTERSECT and EXCEPT are much easier to write if the two tables do not have NULLs and duplicate values in them. Intersection is simply done thus:

SELECT *
  FROM S1
 WHERE EXISTS
       (SELECT *
          FROM S2
         WHERE S1.a1 = S2.a2);

or

SELECT *
  FROM S2
 WHERE EXISTS
       (SELECT *
          FROM S1
         WHERE S1.a1 = S2.a2);

You can also use the following:

SELECT DISTINCT S2.*

FROM (S2 INNER JOIN S1 ON S1.a1 = S2.a2);

This is given as a motivation for the next piece of code, but you may find that some SQL engines do joins faster than EXISTS() predicates, and vice versa, so it is a good idea to have more than one trick in your bag.

The set difference can be written with an OUTER JOIN operator. This code is due to Jim Panttaja.

SELECT DISTINCT S2.*
  FROM (S2 LEFT OUTER JOIN S1
        ON S1.a1 = S2.a2)
 WHERE S1.a1 IS NULL;
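The restriction to tables without NULLs matters because the equality test in these queries is never TRUE for a NULL, while the set operators treat two NULLs as duplicates of each other. A minimal sketch with made-up two-row tables, inlined as common table expressions (assuming a product that accepts a VALUES list inside a WITH clause and implements INTERSECT, such as PostgreSQL or SQLite):

```sql
-- EXISTS version: NULL = NULL is UNKNOWN, so the NULL row is rejected.
WITH S1(a1) AS (VALUES ('a'), (NULL)),
     S2(a2) AS (VALUES ('a'), (NULL))
SELECT a1
  FROM S1
 WHERE EXISTS
       (SELECT *
          FROM S2
         WHERE S1.a1 = S2.a2);
-- one row: a

-- INTERSECT: the two NULLs are matched with each other.
WITH S1(a1) AS (VALUES ('a'), (NULL)),
     S2(a2) AS (VALUES ('a'), (NULL))
SELECT a1 FROM S1
INTERSECT
SELECT a2 FROM S2;
-- two rows: a and NULL
```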

26.2.2 INTERSECT and EXCEPT with NULLs and Duplicates

These versions of INTERSECT and EXCEPT are due to Itzik Ben-Gan. They make very good use of the UNION and DISTINCT operators to implement set theory definitions.

S1 INTERSECT S2

SELECT D.a
  FROM (SELECT DISTINCT a1 FROM S1
        UNION ALL
        SELECT DISTINCT a2 FROM S2) AS D(a)
 GROUP BY D.a
HAVING COUNT(*) > 1;

S1 INTERSECT ALL S2

SELECT D2.a
  FROM (SELECT D1.a, MIN(cnt) AS mincnt
          FROM (SELECT a1, COUNT(*)
                  FROM S1
                 GROUP BY a1
                UNION ALL
                SELECT a2, COUNT(*)
                  FROM S2
                 GROUP BY a2) AS D1(a, cnt)
         GROUP BY D1.a
        HAVING COUNT(*) > 1) AS D2
       INNER JOIN
       Sequence
       ON seq_nbr <= mincnt;

S1 EXCEPT ALL S2

SELECT D2.a
  FROM (SELECT D1.a, SUM(cnt)
          FROM (SELECT a1, COUNT(*)
                  FROM S1
                 GROUP BY a1
                UNION ALL
                SELECT a2, -COUNT(*)
                  FROM S2
                 GROUP BY a2) AS D1(a, cnt)
         GROUP BY D1.a
        HAVING SUM(cnt) > 0) AS D2(a, dups)
       INNER JOIN
       Sequence
       ON seq_nbr <= D2.dups;

The Sequence table is discussed in other places in this book. It is a table of integers from 1 to (n) that is used to replace iteration and counting in SQL. Obviously, (n) must be large enough for these statements to work.
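For readers who do not have such a table at hand, here is a minimal sketch of one. The constraints and the upper limit of ten are assumptions made for illustration; for the queries in this section, (n) has to be at least the largest duplicate count in either table.

```sql
-- A small Sequence table holding the integers 1 through 10.
CREATE TABLE Sequence
(seq_nbr INTEGER NOT NULL PRIMARY KEY
    CHECK (seq_nbr > 0));

INSERT INTO Sequence (seq_nbr)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
```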

26.3 A Note on ALL and SELECT DISTINCT

Here is a series of observations, from Beught Gunne, about the relationship between the ALL option in set operations and the SELECT DISTINCT option in a query.

Given two tables with duplicate values:

CREATE TABLE A (i INTEGER NOT NULL);

INSERT INTO A VALUES (1), (1), (2), (2), (4), (4);
