CHAPTER 17: THE SELECT STATEMENT
implemented in actual products yet, and nobody seems to be missing the OUTER UNION or CORRESPONDING clause.
The INNER JOIN operator did get to be popular. This was fairly easy to implement, since vendors only had to extend the parser without having to add more functionality. Additionally, it is a binary operator, and programmers are used to binary operators: add, subtract, multiply, and divide are all binary operators. E-R diagrams use lines between tables to show a relational schema.
But this leads to a linear approach to problem solving that might not be such a good thing in SQL. Consider this statement, which would have been written in the traditional syntax as:
SELECT a, b, c FROM Foo, Bar, Flub WHERE Foo.y BETWEEN Bar.x AND Flub.z;
With the infixed syntax, I can write this same statement in any of several ways. For example:
SELECT *
  FROM Foo
       INNER JOIN
       Bar ON Foo.y >= Bar.x
       INNER JOIN
       Flub ON Foo.y <= Flub.z;
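One way to convince yourself that the traditional and infixed forms describe the same result is to run both against a small sample. The following is only a sketch, written in Python with the standard sqlite3 module; the rows are invented for illustration, and only the table and column names (Foo.y, Bar.x, Flub.z) come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sample rows are hypothetical; only the names come from the text.
    CREATE TABLE Foo  (y INTEGER NOT NULL);
    CREATE TABLE Bar  (x INTEGER NOT NULL);
    CREATE TABLE Flub (z INTEGER NOT NULL);
    INSERT INTO Foo  VALUES (5), (20);
    INSERT INTO Bar  VALUES (1), (10);
    INSERT INTO Flub VALUES (7), (30);
""")

# Traditional syntax: one WHERE clause shows the three-way BETWEEN.
traditional = conn.execute("""
    SELECT Foo.y, Bar.x, Flub.z
      FROM Foo, Bar, Flub
     WHERE Foo.y BETWEEN Bar.x AND Flub.z
     ORDER BY Foo.y, Bar.x, Flub.z
""").fetchall()

# Infixed syntax: the same predicate split into two pairwise joins.
infixed = conn.execute("""
    SELECT Foo.y, Bar.x, Flub.z
      FROM Foo
           INNER JOIN Bar  ON Foo.y >= Bar.x
           INNER JOIN Flub ON Foo.y <= Flub.z
     ORDER BY Foo.y, Bar.x, Flub.z
""").fetchall()

print(traditional == infixed)
```

Both queries return the same four rows; the difference is only in how easily a reader can see the BETWEEN relationship.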
Humans tend to see things that are close together as a unit or as having a relationship. The extra reserved words in the infixed notation tend to work against that perception.
The infixed notation invites a programmer to add one table at a time to the chain of joins. First I built and tested the Foo-Bar join, and when I was happy with the results, I added Flub. “Step-wise” program refinement was one of the mantras of structured programming.
But look at the code; can you see that there is a BETWEEN relationship among the three tables? It is not easy, is it? In effect, you see only pairs of tables and not the whole problem. SQL is an “all-at-once” set-oriented language, not a “step-wise” language.
Technically, the SQL engine is supposed to perform the infixed joins in left-to-right order as they appear in the FROM clause. It is free to rearrange the order of the joins, if the rearrangement does not change the results. Order of execution does not make a difference with INNER JOINs, but it is very important with OUTER JOINs.
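To see why the order matters once an OUTER JOIN is involved, here is a minimal sketch in Python with the standard sqlite3 module. The tables A, B, and C and their rows are hypothetical and are not part of the chapter's examples. The second query uses a derived table to force the inner pair to be built first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, invented for this illustration.
    CREATE TABLE A (k INTEGER NOT NULL);
    CREATE TABLE B (k INTEGER NOT NULL);
    CREATE TABLE C (k INTEGER NOT NULL);
    INSERT INTO A VALUES (1), (2);
    INSERT INTO B VALUES (1), (2);
    INSERT INTO C VALUES (1);
""")

# Left to right: (A LEFT OUTER JOIN B) first, then INNER JOIN C.
left_to_right = conn.execute("""
    SELECT A.k, B.k, C.k
      FROM A
           LEFT OUTER JOIN B ON A.k = B.k
           INNER JOIN C ON B.k = C.k
     ORDER BY A.k
""").fetchall()

# Inner pair first: (B INNER JOIN C) is built, then outer-joined to A.
inner_first = conn.execute("""
    SELECT A.k, BC.b_k, BC.c_k
      FROM A
           LEFT OUTER JOIN
           (SELECT B.k AS b_k, C.k AS c_k
              FROM B INNER JOIN C ON B.k = C.k) AS BC
           ON A.k = BC.b_k
     ORDER BY A.k
""").fetchall()

print(left_to_right)
print(inner_first)
```

The first form loses the row for A.k = 2 because the trailing INNER JOIN discards it; the second form preserves it with NULLs.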
Another problem is that many SQL programmers do not fully understand the rules for the scope of names. If an infixed join is given a derived table name, then all of the table names inside it are hidden from containing expressions. For example, this will fail:
SELECT a, b, c -- wrong!
FROM (Foo
INNER JOIN
Bar ON Foo.y >= Bar.x) AS Foobar (x, y)
INNER JOIN
Flub ON Foo.y <= Flub.z;
It fails because the table name Foo is not available to the second INNER JOIN. However, this will work:
SELECT a, b, c
FROM (Foo
INNER JOIN
Bar ON Foo.y >= Bar.x) AS Foobar (x, y)
INNER JOIN
Flub ON Foobar.y <= Flub.z;
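The scoping rule can be checked with a small experiment. This is a sketch in Python with the standard sqlite3 module, under one loudly stated assumption: SQLite does not accept a column list after a derived table name, so the derived table is written as a subquery with column aliases instead of the "AS Foobar (x, y)" form shown above. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sample rows are hypothetical.
    CREATE TABLE Foo  (y INTEGER NOT NULL);
    CREATE TABLE Bar  (x INTEGER NOT NULL);
    CREATE TABLE Flub (z INTEGER NOT NULL);
    INSERT INTO Foo  VALUES (5);
    INSERT INTO Bar  VALUES (1);
    INSERT INTO Flub VALUES (7);
""")

# Subquery stand-in for "(Foo INNER JOIN Bar ON ...) AS Foobar (x, y)".
derived = """
    (SELECT Bar.x AS x, Foo.y AS y
       FROM Foo INNER JOIN Bar ON Foo.y >= Bar.x) AS Foobar
"""

# Foo is hidden behind the name Foobar, so this reference cannot be resolved.
try:
    conn.execute(f"SELECT * FROM {derived} INNER JOIN Flub ON Foo.y <= Flub.z")
    hidden_name_error = None
except sqlite3.OperationalError as exc:
    hidden_name_error = str(exc)

# Referring to the column through the derived table's own name works.
visible_rows = conn.execute(
    f"SELECT Foobar.x, Foobar.y, Flub.z FROM {derived} "
    "INNER JOIN Flub ON Foobar.y <= Flub.z"
).fetchall()

print(hidden_name_error)
print(visible_rows)
```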
If you start nesting lots of derived table expressions, you can force an order of execution in the query. It is generally not a good idea to try to outguess the optimizer.
So far, I have shown fully qualified column names. It is a good programming practice, but it is not required. Assume that Foo and Bar both have a column named w. These statements will produce an ambiguous name error:
SELECT a, b, c
FROM Foo
INNER JOIN
Bar ON y >= x
INNER JOIN
Flub ON y <= w;
SELECT a, b, c FROM Foo, Bar, Flub WHERE y BETWEEN x AND w;
But this statement works from inside the parentheses first and does the outermost INNER JOIN last:
SELECT a, b, c FROM Foo INNER JOIN (Bar INNER JOIN Flub ON y <= w)
ON y >= x;
If Bar did not have a column named w, then the parser would go to the next containing expression, find Foo.w, and use it.
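The ambiguity error is easy to reproduce. The sketch below uses Python with the standard sqlite3 module; the column layouts and rows are assumptions made for the demonstration, and only the fact that Foo and Bar both carry a column named w comes from the text. It does not try to reproduce the parenthesized form, because products differ in how they scope names inside nested joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layouts: Foo and Bar both have a column named w.
    CREATE TABLE Foo  (y INTEGER NOT NULL, w INTEGER NOT NULL);
    CREATE TABLE Bar  (x INTEGER NOT NULL, w INTEGER NOT NULL);
    CREATE TABLE Flub (z INTEGER NOT NULL);
    INSERT INTO Foo  VALUES (5, 9);
    INSERT INTO Bar  VALUES (1, 3);
    INSERT INTO Flub VALUES (7);
""")

# Unqualified w could mean Foo.w or Bar.w, so the statement is rejected.
try:
    conn.execute(
        "SELECT * FROM Foo INNER JOIN Bar ON y >= x INNER JOIN Flub ON y <= w"
    )
    ambiguity_error = None
except sqlite3.OperationalError as exc:
    ambiguity_error = str(exc)

# Qualifying every column removes the ambiguity.
qualified_rows = conn.execute("""
    SELECT Foo.y, Bar.x, Foo.w
      FROM Foo
           INNER JOIN Bar  ON Foo.y >= Bar.x
           INNER JOIN Flub ON Foo.y <= Foo.w
""").fetchall()

print(ambiguity_error)
print(qualified_rows)
```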
As an aside, there is a myth among new SQL programmers that the join conditions must be in the ON clause, and the search argument predicates (SARGs) must be in the WHERE clause. It is a nice programming style and isolates the search arguments to one location for easy changes. But it is not a requirement.
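A short experiment shows that, for INNER JOINs, the placement is a matter of style. This is a sketch in Python with the standard sqlite3 module; the rows and the search argument (Bar.x > 5) are invented for illustration. The equivalence shown here should not be assumed for OUTER JOINs, where moving a predicate between ON and WHERE can change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sample rows are hypothetical.
    CREATE TABLE Foo (y INTEGER NOT NULL);
    CREATE TABLE Bar (x INTEGER NOT NULL);
    INSERT INTO Foo VALUES (5), (20);
    INSERT INTO Bar VALUES (1), (10);
""")

queries = {
    # Join condition in ON, search argument in WHERE (the "nice style").
    "on_plus_where": """
        SELECT Foo.y, Bar.x
          FROM Foo INNER JOIN Bar ON Foo.y >= Bar.x
         WHERE Bar.x > 5""",
    # Everything in the ON clause.
    "all_in_on": """
        SELECT Foo.y, Bar.x
          FROM Foo INNER JOIN Bar ON Foo.y >= Bar.x AND Bar.x > 5""",
    # Everything in the WHERE clause.
    "all_in_where": """
        SELECT Foo.y, Bar.x
          FROM Foo CROSS JOIN Bar
         WHERE Foo.y >= Bar.x AND Bar.x > 5""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
print(results)
```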
Am I against infixed joins? No, but they are a bit more complicated than they first appear, and if there are some OUTER JOINs in the mix, things can be very complicated. Just be careful with the new toys, kids.
17.5 JOINs by Function Calls
JOINs can also be done inside functions that relate columns from one or more tables in their parameters. This is easier to explain with an actual example, from John Botibol of Deverill plc in Dorset, U.K. His problem was how to “flatten” legacy data stored in a flat file database into a relational format for a data warehouse. The data included a vast amount of demographic information on people, related to their subjects of interest. The subjects of interest were selected from a list; some subjects required just one answer, and others allowed multiple selections. The problem was that the data for multiple selections was stored as a string with a one or a zero in positional places to indicate “interested” or “not interested” in that item. The actual list of products was stored in another file as a list. Thus, for one person we might have something like ‘101110’ together with a list like 1 = Bananas, 2 = Apples, 3 = Bread, 4 = Fish, 5 = Meat, 6 = Butter, if the subject area was foods.
The data was first moved into working tables like this:
CREATE TABLE RawSurvey
(rawkey INTEGER NOT NULL PRIMARY KEY,
rawstring CHAR(20) NOT NULL);
CREATE TABLE SurveyList
(survey_id INTEGER NOT NULL PRIMARY KEY,
surveytext CHAR(30) NOT NULL);
There was always the correct number of ones and zeros for the number of question options in any group (thus, in this case, the answer strings always have six characters), and the list was in the correct order to match the positions in the string. The data had to be ported into SQL, which meant that each survey had to be broken down into a row for each response.
CREATE TABLE Surveys
(survey_id INTEGER NOT NULL,
 surveytext CHAR(30) NOT NULL,
 ticked INTEGER DEFAULT 0 NOT NULL
   CONSTRAINT tick_mark
   CHECK (ticked IN (0, 1)),
 PRIMARY KEY (survey_id, surveytext));
This table can be loaded with the query:
INSERT INTO Surveys(survey_id, surveytext, ticked)
SELECT rawkey, surveytext,
SUBSTRING(rawstring FROM survey_id FOR 1)
FROM RawSurvey, SurveyList;
The tables are joined in the SUBSTRING() function, instead of with a theta operator. The SUBSTRING() function returns an empty string if survey_id goes beyond the end of the string. The query will always return a number of rows that is equal to or less than the number of characters in rawstring. The technique will adjust itself correctly for any number of possible survey answers.
In the real problem, the table SurveyList always contained exactly the right number of entries for the length of the string to be exploded, and the string to be exploded always had exactly the right number of characters, so you did not need a WHERE clause to check for bad data.
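The string explosion can be tried end to end on the foods example. This is a sketch in Python with the standard sqlite3 module, with two stated assumptions: SQLite spells the function SUBSTR(string, start, length) rather than SUBSTRING(... FROM ... FOR ...), and the extracted character is passed through an explicit CAST because ticked is declared INTEGER. The single survey row is the ‘101110’ sample from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RawSurvey
    (rawkey INTEGER NOT NULL PRIMARY KEY,
     rawstring CHAR(20) NOT NULL);
    CREATE TABLE SurveyList
    (survey_id INTEGER NOT NULL PRIMARY KEY,
     surveytext CHAR(30) NOT NULL);
    CREATE TABLE Surveys
    (survey_id INTEGER NOT NULL,
     surveytext CHAR(30) NOT NULL,
     ticked INTEGER DEFAULT 0 NOT NULL
       CONSTRAINT tick_mark CHECK (ticked IN (0, 1)),
     PRIMARY KEY (survey_id, surveytext));

    INSERT INTO RawSurvey VALUES (1, '101110');
    INSERT INTO SurveyList VALUES
      (1, 'Bananas'), (2, 'Apples'), (3, 'Bread'),
      (4, 'Fish'), (5, 'Meat'), (6, 'Butter');

    -- The join happens inside SUBSTR(): survey_id picks the position.
    INSERT INTO Surveys (survey_id, surveytext, ticked)
    SELECT rawkey, surveytext,
           CAST(SUBSTR(rawstring, survey_id, 1) AS INTEGER)
      FROM RawSurvey, SurveyList;
""")

# Map each food to its tick mark for survey 1.
ticks = dict(conn.execute(
    "SELECT surveytext, ticked FROM Surveys WHERE survey_id = 1"))
print(ticks)
```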
The UNION JOIN was defined in Standard SQL, but I know of no SQL product that has implemented it. As the name implies, it is a cross between a UNION and a FULL OUTER JOIN. The definition followed easily from the other infixed JOIN operators. The syntax has no searched clause:
<table expression 1> UNION JOIN <table expression 2>
The statement takes two dissimilar tables and puts them into one result table. It preserves all the rows from both tables and does not try to consolidate them. Columns that do not exist in one table are simply padded out with NULLs in the result rows. Columns with the same names in the tables have to be renamed differently in the result. It is equivalent to:
<table expression 1>
FULL OUTER JOIN <table expression 2>
ON 1 = 2;
Any searched expression that is always FALSE will work. As an example of this, you might want to combine the medical records of male and female patients into one table with this query:
SELECT *
  FROM (SELECT 'male', prostate FROM Males)
       UNION JOIN
       (SELECT 'female', pregnancy FROM Females);
to get a result table like this:
Result
male    prostate  female    pregnancy
=====================================
'male'  no        NULL      NULL
'male'  no        NULL      NULL
'male'  yes       NULL      NULL
'male'  yes       NULL      NULL
NULL    NULL      'female'  no
NULL    NULL      'female'  no
NULL    NULL      'female'  yes
NULL    NULL      'female'  yes
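Since no product implements UNION JOIN, the same shape of result has to be built by hand. The sketch below, in Python with the standard sqlite3 module, pads each side with NULLs and glues the halves together with UNION ALL. That is a substitute technique chosen because it runs on any SQLite version, not the FULL OUTER JOIN form given above. The patient_id column and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- patient_id and the sample rows are hypothetical.
    CREATE TABLE Males
    (patient_id INTEGER NOT NULL PRIMARY KEY,
     prostate CHAR(3) NOT NULL);
    CREATE TABLE Females
    (patient_id INTEGER NOT NULL PRIMARY KEY,
     pregnancy CHAR(3) NOT NULL);
    INSERT INTO Males   VALUES (1, 'no'), (2, 'yes');
    INSERT INTO Females VALUES (3, 'no'), (4, 'yes');
""")

# Every row from both tables survives; missing columns become NULL.
rows = conn.execute("""
    SELECT 'male' AS male, prostate, NULL AS female, NULL AS pregnancy
      FROM Males
    UNION ALL
    SELECT NULL, NULL, 'female', pregnancy
      FROM Females
""").fetchall()

for row in rows:
    print(row)
```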
Frédéric Brouard came up with a nice trick for writing a similar join; that is, a join on one table, say a basic table of student data, with either a table of data particular to domestic students or another table of data particular to foreign students, based on the value of a parameter. This differs from a true UNION JOIN in that it must have a “root” table to use for the outer joins.
CREATE TABLE Students
(student_nbr INTEGER NOT NULL PRIMARY KEY,
 student_type CHAR(1) DEFAULT 'D' NOT NULL
   CHECK (student_type IN ('D', 'F'))
);
CREATE TABLE DomesticStudents
(student_nbr INTEGER NOT NULL PRIMARY KEY
   REFERENCES Students(student_nbr)
);
CREATE TABLE ForeignStudents
(student_nbr INTEGER NOT NULL PRIMARY KEY
   REFERENCES Students(student_nbr)
);
SELECT Students.*, DomesticStudents.*, ForeignStudents.*
  FROM Students
       LEFT OUTER JOIN
       DomesticStudents
       ON CASE Students.student_type
               WHEN 'D' THEN 1 ELSE NULL END = 1
       LEFT OUTER JOIN
       ForeignStudents
       ON CASE Students.student_type
               WHEN 'F' THEN 1 ELSE NULL END = 1;
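Here is a small run of the trick, as a sketch in Python with the standard sqlite3 module. Two things in it are my assumptions rather than part of the query as printed: the ON clauses also match on student_nbr, so that each student picks up only his or her own detail row, and the detail columns home_state and visa_type are hypothetical. A stale ForeignStudents row for a domestic student is included on purpose to show that the CASE expression keeps it out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students
    (student_nbr INTEGER NOT NULL PRIMARY KEY,
     student_type CHAR(1) DEFAULT 'D' NOT NULL
       CHECK (student_type IN ('D', 'F')));
    CREATE TABLE DomesticStudents
    (student_nbr INTEGER NOT NULL PRIMARY KEY
       REFERENCES Students (student_nbr),
     home_state CHAR(2) NOT NULL);   -- hypothetical column
    CREATE TABLE ForeignStudents
    (student_nbr INTEGER NOT NULL PRIMARY KEY
       REFERENCES Students (student_nbr),
     visa_type CHAR(2) NOT NULL);    -- hypothetical column

    INSERT INTO Students VALUES (1, 'D'), (2, 'F');
    INSERT INTO DomesticStudents VALUES (1, 'TX');
    -- Student 1 is domestic; the row for 1 here is deliberately stale.
    INSERT INTO ForeignStudents VALUES (1, 'J1'), (2, 'F1');
""")

rows = conn.execute("""
    SELECT S.student_nbr, S.student_type, D.home_state, F.visa_type
      FROM Students AS S
           LEFT OUTER JOIN DomesticStudents AS D
           ON D.student_nbr = S.student_nbr   -- assumed key match
              AND CASE S.student_type WHEN 'D' THEN 1 ELSE NULL END = 1
           LEFT OUTER JOIN ForeignStudents AS F
           ON F.student_nbr = S.student_nbr   -- assumed key match
              AND CASE S.student_type WHEN 'F' THEN 1 ELSE NULL END = 1
     ORDER BY S.student_nbr
""").fetchall()

for row in rows:
    print(row)
```

The CASE expression yields NULL for the wrong student type, the comparison to 1 is then UNKNOWN, and the outer join pads that side with NULLs.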
We can relate two tables together based on quantities in each of them. The simplest example is filling customer orders from our inventories at various stores. To make life easier, let’s assume that we have only one product, we process orders in increasing customer_id order, and we draw from store inventory by increasing store_id.
CREATE TABLE Inventory
(store_id INTEGER NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty >= 0));
INSERT INTO Inventory (store_id, item_qty)
VALUES (10, 2), (20, 3), (30, 2);
CREATE TABLE Orders
(customer_id CHAR(5) NOT NULL PRIMARY KEY,
 item_qty INTEGER NOT NULL CHECK (item_qty > 0));
INSERT INTO Orders (customer_id, item_qty)
VALUES ('Bill', 4), ('Fred', 2);
What we want to do is fill Bill’s order for four units by taking two units from store 10 and two units from store 20. Next we process Fred’s order with the one unit left in store 20 and one unit from store 30.
SELECT I.store_id, O.customer_id,
       (CASE WHEN O.end_running_qty <= I.end_running_qty
             THEN O.end_running_qty
             ELSE I.end_running_qty END
        - CASE WHEN O.start_running_qty >= I.start_running_qty
               THEN O.start_running_qty
               ELSE I.start_running_qty END)
       AS items_consumed_tally
  FROM (SELECT I1.store_id,
               SUM(I2.item_qty) - I1.item_qty,
               SUM(I2.item_qty)
          FROM Inventory AS I1, Inventory AS I2
         WHERE I2.store_id <= I1.store_id
         GROUP BY I1.store_id, I1.item_qty)
       AS I (store_id, start_running_qty, end_running_qty)
       INNER JOIN
       (SELECT O1.customer_id,
               SUM(O2.item_qty) - O1.item_qty,
               SUM(O2.item_qty) AS end_running_qty
          FROM Orders AS O1, Orders AS O2
         WHERE O2.customer_id <= O1.customer_id
         GROUP BY O1.customer_id, O1.item_qty)
       AS O (customer_id, start_running_qty, end_running_qty)
       ON O.start_running_qty < I.end_running_qty
          AND O.end_running_qty > I.start_running_qty
 ORDER BY store_id, customer_id;
This can also be done with the new SQL-99 OLAP operators.
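One possible OLAP formulation is sketched below in Python with the standard sqlite3 module. It assumes a SQLite build with window function support (version 3.25 or later) and is my own rewrite, not a query from the text: the self-joins that compute the running totals are replaced by SUM() OVER (ORDER BY ...), and the overlap test on the two running ranges is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Inventory
    (store_id INTEGER NOT NULL PRIMARY KEY,
     item_qty INTEGER NOT NULL CHECK (item_qty >= 0));
    INSERT INTO Inventory (store_id, item_qty)
    VALUES (10, 2), (20, 3), (30, 2);
    CREATE TABLE Orders
    (customer_id CHAR(5) NOT NULL PRIMARY KEY,
     item_qty INTEGER NOT NULL CHECK (item_qty > 0));
    INSERT INTO Orders (customer_id, item_qty)
    VALUES ('Bill', 4), ('Fred', 2);
""")

rows = conn.execute("""
    WITH I AS
      (SELECT store_id,
              SUM(item_qty) OVER (ORDER BY store_id) - item_qty
                AS start_running_qty,
              SUM(item_qty) OVER (ORDER BY store_id)
                AS end_running_qty
         FROM Inventory),
    O AS
      (SELECT customer_id,
              SUM(item_qty) OVER (ORDER BY customer_id) - item_qty
                AS start_running_qty,
              SUM(item_qty) OVER (ORDER BY customer_id)
                AS end_running_qty
         FROM Orders)
    SELECT I.store_id, O.customer_id,
           (CASE WHEN O.end_running_qty <= I.end_running_qty
                 THEN O.end_running_qty
                 ELSE I.end_running_qty END
            - CASE WHEN O.start_running_qty >= I.start_running_qty
                   THEN O.start_running_qty
                   ELSE I.start_running_qty END)
           AS items_consumed_tally
      FROM I
           INNER JOIN O
           ON O.start_running_qty < I.end_running_qty
              AND O.end_running_qty > I.start_running_qty
     ORDER BY I.store_id, O.customer_id
""").fetchall()

for row in rows:
    print(row)
```

With the sample data, Bill draws two units each from stores 10 and 20, and Fred draws one unit each from stores 20 and 30.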
17.8 Dr. Codd’s T-Join
Dr. E. F. Codd introduced a set of new theta operators, called T-operators, which were based on the idea of a best-fit or approximate equality (Codd 1990). The algorithm for the operators is easier to understand with an example modified from Dr. Codd (Codd 1990). The problem is to assign the classes to the available classrooms. We want (class_size < room_size) to be true after the assignments are made. This will allow us a few empty seats in each room for late students. We can do this in one of two ways. The first way is to sort the tables in ascending order by classroom size and the number of students in a class. We start with the following tables:
CREATE TABLE Rooms
(room_nbr CHAR(2) PRIMARY KEY,
room_size INTEGER NOT NULL);
CREATE TABLE Classes
(class_nbr CHAR(2) PRIMARY KEY,
class_size INTEGER NOT NULL);
These tables have the following rows in them:
Classes
class_nbr  class_size
=====================
'c1'       80
'c2'       70
'c3'       65
'c4'       55
'c5'       50
'c6'       40

Rooms
room_nbr  room_size
===================
'r1'      70
'r2'      40
'r3'      50
'r4'      85
'r5'      30
'r6'      65
'r7'      55
The goal of the T-Join problem is to assign a class that is smaller than the classroom given it (class_size < room_size). Dr. Codd gives two approaches to the problem.
1. Ascending Order Algorithm: Sort both tables into ascending order. Reading from the top of the Rooms table, match each class with the first room that will fit.
Classes                  Rooms
class_nbr class_size     room_nbr room_size
====================     ===================
'c6'      40             'r5'     30
'c5'      50             'r2'     40
'c4'      55             'r3'     50
'c3'      65             'r7'     55
'c2'      70             'r6'     65
'c1'      80             'r1'     70
                         'r4'     85
Results
class_nbr class_size room_nbr room_size
========================================
'c2'      70         'r4'     85
'c3'      65         'r1'     70
'c4'      55         'r6'     65
'c5'      50         'r7'     55
'c6'      40         'r3'     50
2. Descending Order Algorithm: Sort both tables into descending order. Reading from the top of the Classes table, match each class with the first room that will fit.
Classes                  Rooms
class_nbr class_size     room_nbr room_size
=====================    ===================
'c1'      80             'r4'     85
'c2'      70             'r1'     70
'c3'      65             'r6'     65
'c4'      55             'r7'     55
'c5'      50             'r3'     50
'c6'      40             'r2'     40
                         'r5'     30
Results
class_nbr class_size room_nbr room_size
=========================================
'c1' 80 'r4' 85
'c3' 65 'r1' 70
'c4' 55 'r6' 65
'c5' 50 'r7' 55
'c6' 40 'r3' 50
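The two procedures can be stated as a short two-pointer scan. The sketch below is in plain Python and is my reading of the algorithms, inferred from the results tables above rather than taken from Dr. Codd; the function name t_join_first_fit is hypothetical. In ascending order a room that is too small is skipped; in descending order a class that is too big for the largest remaining room is skipped.

```python
def t_join_first_fit(classes, rooms, descending=False):
    """Pair (class_nbr, class_size) with (room_nbr, room_size) rows so that
    class_size < room_size, scanning both sorted lists once."""
    cs = sorted(classes, key=lambda c: c[1], reverse=descending)
    rs = sorted(rooms, key=lambda r: r[1], reverse=descending)
    pairs = []
    ci = ri = 0
    while ci < len(cs) and ri < len(rs):
        if cs[ci][1] < rs[ri][1]:
            pairs.append(cs[ci] + rs[ri])  # the class fits: assign the room
            ci += 1
            ri += 1
        elif descending:
            ci += 1  # class too big for the largest remaining room: drop it
        else:
            ri += 1  # room too small for the smallest remaining class: skip it
    return pairs


CLASSES = [("c1", 80), ("c2", 70), ("c3", 65),
           ("c4", 55), ("c5", 50), ("c6", 40)]
ROOMS = [("r1", 70), ("r2", 40), ("r3", 50), ("r4", 85),
         ("r5", 30), ("r6", 65), ("r7", 55)]

ascending_result = t_join_first_fit(CLASSES, ROOMS)
descending_result = t_join_first_fit(CLASSES, ROOMS, descending=True)

print(ascending_result)
print(descending_result)
```

On the sample data the ascending scan leaves class c1 without a room, while the descending scan leaves class c2 without one, which is the difference between the two result tables.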
Notice that the answers are different! Dr. Codd has never given a definition in relational algebra of the T-Join, so I propose that we need one. Informally, for each class, we want the smallest room that will hold it, while maintaining the T-Join condition. Or for each room, we want the largest class that will fill it, while maintaining the T-Join condition. These can be two different things, so you must decide which table is the driver. But either way, I advocate a “best fit” over Codd’s “first fit” approach.
In effect, the Swedish and Croatian solutions given later in this section use my definition instead of Dr. Codd’s; the Colombian solution is true to the algorithmic approach.
Other theta conditions can be used in place of the “less than” shown here. If “less than or equal” is used, all the classes are assigned to a room in this case, but not in all cases. This is left to the reader as an exercise. The first attempts in standard SQL are versions grouped by queries. They can, however, produce some rows that would be left out of the answers Dr. Codd was expecting. The first JOIN can be written as