CHAPTER 26: SET OPERATIONS
CREATE TABLE B (i INTEGER NOT NULL);
INSERT INTO B VALUES (2), (2), (3), (3);
The UNION and INTERSECT operations have regular behavior in that:
(A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3))
and
(A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)
However,
(A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)
Or, more literally, (1) <> ((1), (2)) for the tables given in the example. In the other direction, we do have:
(B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)
but only by a coincidence of the particular values used in these tables.
26.4 Equality and Proper Subsets
At one point, when SQL was still in the laboratory at IBM, there was a CONTAINS operator that would tell you if one table was a subset of another. It disappeared in later versions of the language, and no vendor picked it up. Set equality was never part of SQL as an operator, so you would have had to use the two expressions ((A CONTAINS B) AND (B CONTAINS A)) to find out.
Today, you can use the methods shown in the section on Relational Division to determine containment or equality. However, Itzik Ben-Gan came up with a novel approach for finding containment and equality that is worth a mention.
SELECT SUM(DISTINCT match_col)
  FROM (SELECT CASE
               WHEN S1.col IN (SELECT S2.col FROM S2)
               THEN 1 ELSE -1 END
          FROM S1) AS X(match_col)
HAVING SUM(DISTINCT match_col) = :n;
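You can set (:n) to 1, 0, or −1 for each particular test.
When I find a matching row in S1, I get a +1; when I find a mismatched row in S1, I get a −1. Because the sum is taken over the distinct values, a result of zero means that S1 has both matched and mismatched rows, so it overlaps S2 without being contained in it. If the sum is +1, then every row of S1 appears in S2, so S1 is a subset of S2; if the same test with the two tables swapped also gives +1, then they are equal. If the sum is −1, they are disjoint.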
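As a rough check of how the three outcomes arise, the test can be run against small tables. The sketch below uses Python with its standard sqlite3 module, which is an assumption of this illustration and not part of the original example; the helper name match_sum is hypothetical. It names the CASE column inside the derived table and leaves out the HAVING clause so that the raw sum can be inspected.

```python
import sqlite3

def match_sum(s1_rows, s2_rows):
    """Return SUM(DISTINCT match_col) for S1 tested against S2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE S1 (col INTEGER NOT NULL)")
    conn.execute("CREATE TABLE S2 (col INTEGER NOT NULL)")
    conn.executemany("INSERT INTO S1 VALUES (?)", [(v,) for v in s1_rows])
    conn.executemany("INSERT INTO S2 VALUES (?)", [(v,) for v in s2_rows])
    # +1 for each S1 row found in S2, -1 for each row that is not;
    # DISTINCT collapses the column to at most the two values {1, -1}.
    (total,) = conn.execute("""
        SELECT SUM(DISTINCT match_col)
          FROM (SELECT CASE
                       WHEN S1.col IN (SELECT S2.col FROM S2)
                       THEN 1 ELSE -1 END AS match_col
                  FROM S1) AS X
    """).fetchone()
    conn.close()
    return total

print(match_sum([1, 2], [1, 2, 3]))   # every S1 row matches
print(match_sum([1, 4], [1, 2, 3]))   # some rows match, some do not
print(match_sum([4, 5], [1, 2, 3]))   # no S1 row matches
```

The sum is +1 when every S1 row has a match in S2, −1 when none does, and 0 when both kinds of rows occur.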
CHAPTER 27
Subsets
I AM DEFINING SUBSET operations as queries that extract a particular subset from a given set, as opposed to set operations, which work among sets. The obvious way to extract a subset from a table is just to use a WHERE clause, which will pull out the rows that meet its search condition. But not all the subsets we want are easily defined by such a simple predicate. This chapter is a collection of tricks for constructing useful, but not obvious, subsets from a table.
27.1 Every nth Item in a Table
SQL is a set-oriented language, which cannot identify individual rows by their physical positions in a disk file that holds a table. Instead, a unique logical key is detected by logical expressions, and a row is retrieved. If you are given a file of employees in which the ordering of the file is based on their employee numbers, and you want to pick out every nth employee record for a survey, the job is easy. You write a procedure that loops through the file and writes every nth one to a second file.
The immediate thought of how this should be done in SQL is to simply compute MOD (emp_nbr, :n), where MOD() is the modulo function found in most SQL implementations, and save those employee rows where this function is zero. The trouble is that
employees are not issued consecutive identification numbers. The identification numbers are unique, but uniqueness alone does not make MOD() select every nth row.
Vendor extensions often include an exposed physical row locator that gives a sequential numbering to the physical records; this sequential numbering can be used to perform these functions. This practice is a complete violation of Dr. Codd’s definition of a relational database, and it requires that the underlying physical implementation use a contiguous sequential record for each row. Such things are highly proprietary, but because these features are so low-level, they will run very fast on that one particular product.
Row numbers have more problems than being nonstandard. If the physical storage is rearranged, then the row numbers have to change. Users logged on and looking at the same base table through different VIEWs may or may not get the same row number for the same physical row. One of the advantages of an RDBMS was supposed to be that the logical view of the data would be consistent, even when the physical storage changed.
You can get similar results with a self-JOIN on the Personnel table to partition it into a nested series of grouped tables, just as we did for the “top n” problem. You then pick out the largest value in each group. There may be an index or a uniqueness constraint on the emp_nbr column to ensure uniqueness, so the EXISTS predicate will get a performance boost.
SELECT P1.emp_nbr
  FROM Personnel AS P1
 WHERE EXISTS
       (SELECT MAX(emp_nbr)
          FROM Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
        HAVING MOD (COUNT(*), :n) = 0);
A nonnested version of the same query looks like this:
SELECT P1.emp_nbr
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_nbr >= P2.emp_nbr
 GROUP BY P1.emp_nbr
HAVING MOD (COUNT(*), :n) = 0;
Both queries count the number of P2 rows with a value less than or equal to that of the P1 row, and they keep the P1 row when that count is a multiple of n.
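The nonnested query can be tried on a handful of employee numbers with gaps. This is a minimal sketch in Python with the standard sqlite3 module, both of which are assumptions of the illustration; the helper name every_nth and the sample numbers are hypothetical, and the % operator stands in for MOD(), which not every SQLite build provides.

```python
import sqlite3

def every_nth(emp_nbrs, n):
    """Return the employee numbers whose position in key order is a multiple of n."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Personnel (emp_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO Personnel VALUES (?)", [(e,) for e in emp_nbrs])
    # For each P1 row, COUNT(*) is the number of P2 rows at or below it,
    # which is its position when the rows are sorted by emp_nbr.
    rows = conn.execute("""
        SELECT P1.emp_nbr
          FROM Personnel AS P1, Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
         GROUP BY P1.emp_nbr
        HAVING COUNT(*) % :n = 0
         ORDER BY P1.emp_nbr
    """, {"n": n}).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Gaps in the numbering do not matter; only the relative order does.
print(every_nth([3, 7, 8, 15, 21, 40, 42], 3))  # → [8, 40]
```

With these seven numbers and n = 3, the third and sixth values in key order are 8 and 40, which is what the query returns.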
27.2 Picking Random Rows from a Table
The answer is that, basically, you cannot directly pick a set of random rows from a table in SQL. There is no randomize operator in the standard, and you don’t often find the same pseudo-random number generator function in various vendor extensions, either.
Picking random rows from a table for a statistical sample is a handy thing, and you do it in other languages with a pseudo-random number generator. There are two kinds of random drawings from a set: with or without replacement. If SQL had random number functions, I suppose they would be shown as RANDOM(x) and RANDOM(DISTINCT x). But there is no such function in SQL, and none is planned. Examples from the real world include dealing a poker hand (a random with no replacement situation) and shooting craps (a random with replacement situation). If two players in a poker game get identical cards, you are using a pinochle deck. In a craps game, each roll of the dice is independent of the previous one and can repeat it.
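The two kinds of drawing can be illustrated in a host language. The following is a small sketch in Python, chosen here only for illustration; the function names deal_hand and roll_dice and the seed value are hypothetical. It uses random.sample for a draw without replacement and random.choices for a draw with replacement.

```python
import random

def deal_hand(rng, size=5):
    """Draw cards without replacement: no card can appear twice."""
    deck = [rank + suit for suit in "SHDC" for rank in "A23456789TJQK"]
    return rng.sample(deck, size)

def roll_dice(rng, count=2):
    """Draw die faces with replacement: each roll is independent."""
    return rng.choices(range(1, 7), k=count)

rng = random.Random(2024)          # fixed seed so a run can be repeated
hand = deal_hand(rng)
rolls = roll_dice(rng, count=20)
print(hand)
print(rolls)
```

The poker hand never contains a duplicate card, while the twenty rolls of a six-sided die must repeat some face.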
The problem is that SQL is a set-oriented language and wants to do an operation “all at once” on a well-defined set of rows. Random sets are, by definition, produced by a nondeterministic procedure instead of a deterministic logic expression.
The SQL/PSM language does have an option to declare or create a procedure that is DETERMINISTIC or NOT DETERMINISTIC. The DETERMINISTIC option means that the optimizer can compute this function once for a set of input parameter values and then use that result everywhere in the current SQL statement that a call to the procedure with those parameters appears. The NOT DETERMINISTIC option means that, given the same parameters, you might not get the same results for each call to the procedure within the same SQL statement.
Unfortunately, most SQL products do not have this feature in their proprietary procedural languages. Thus, the random number function in Oracle is nondeterministic and the one in SQL Server is deterministic. For example,
CREATE TABLE RandomNbrs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
randomizer FLOAT NOT NULL);
INSERT INTO RandomNbrs VALUES (1, RANDOM()), (2, RANDOM()), (3, RANDOM());
This statement will result in the three rows all getting the same value in the randomizer column in a version of SQL Server, but three different numbers in a version of Oracle.
While subqueries are not allowed in DEFAULT clauses, system-related functions such as CURRENT_TIMESTAMP and CURRENT_USER are allowed. In some SQL implementations, this includes the RANDOM() function.
CREATE TABLE RandomNbrs2
(seq_nbr INTEGER PRIMARY KEY,
 randomizer FLOAT -- warning !! not standard SQL
   DEFAULT (
    (CASE (CAST(RANDOM() + 0.5 AS INTEGER) * -1)
     WHEN 0.0 THEN 1.0 ELSE -1.0 END)
    * MOD (CAST(RANDOM() * 100000 AS INTEGER), 10000)
    * RANDOM())
   NOT NULL);
INSERT INTO RandomNbrs2
VALUES (1, DEFAULT), (2, DEFAULT), (3, DEFAULT), (4, DEFAULT), (5, DEFAULT),
       (6, DEFAULT), (7, DEFAULT), (8, DEFAULT), (9, DEFAULT), (10, DEFAULT);
Here is a sample output from an SQL Server 7.0 implementation:
seq_nbr  randomizer
============================
1        -121.89758452446999
2        -425.61113508053933
3        3918.1554683876675
4        9335.2668286173412
5        54.463890640027664
6        -5.0169085346410522
7        -5430.63417246276
8        915.9835973796487
9        28.109161998753301
10       741.79452047043048
The best way to do this is to add a column to the table to hold a random number, then use a cursor in a host language that has a good pseudo-random number generator in its function library to load the new column with random values. You have to do it this way because random number generators work differently from other function calls. They start with an initial value called a “seed” (shown as Random[0] in the rest of this discussion) provided by the user or the system clock. The seed is used to create the first number in the sequence, Random[1]. Then each call, Random[n], to the function uses the previous number to generate the next one, Random[n+1].
There is no way to do a sequence of actions in SQL without a cursor, so you are in procedural code.
The term “pseudo-random number generator” is often shortened to just “random number generator,” but this is technically wrong. All of the generators will eventually return a value that appeared in the sequence earlier, and the procedure will hang in a cycle. Procedures are deterministic, and we are living in a mathematical heresy when we try to use them to produce truly random results. However, if the sequence has a very long cycle and meets some other tests for randomness over the range of the cycle, then we can use it.
There are many kinds of generators. The linear congruence pseudo-random number generator family has generator formulas of the form:
Random[n+1] := MOD ((x * Random[n] + y), m);
There are restrictions on the relationships among x, y, and m that deal with their relative primality. Knuth gives a proof that if
Random[0] is not a multiple of 2 or 5
m = 10^e where (e >= 5)
y = 0
MOD (x, 200) is in the set (3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91, 109, 117, 123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197)
then the period will be 5 * 10^(e-2).
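The period claim can be checked by brute force for the smallest allowed case, e = 5, where it predicts 5 * 10^3 = 5,000. The sketch below is in Python, which is an assumption of this illustration; the function name lcg_period is hypothetical. It iterates the generator with y = 0 until the seed comes back and counts the steps.

```python
def lcg_period(x, seed, m):
    """Count the steps until Random[n+1] = MOD(x * Random[n], m) returns to the seed."""
    value = (x * seed) % m
    steps = 1
    while value != seed:
        if steps > m:
            # A seed or multiplier that shares a factor with m never cycles back.
            raise ValueError("sequence does not return to the seed")
        value = (x * value) % m
        steps += 1
    return steps

m = 10 ** 5                      # e = 5, so the predicted period is 5 * 10**3
print(lcg_period(3, 1, m))       # multiplier 3 is in the set
print(lcg_period(197, 7, m))     # multiplier 197 is in the set; seed 7 is not a multiple of 2 or 5
print(lcg_period(7, 1, m))       # multiplier 7 is not in the set
```

The multipliers taken from the set give a cycle of 5,000 values, while a multiplier outside it, such as 7, gives a shorter cycle.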
There are old favorites that many C programmers use from this family, such as:
Random(n+1) := (Random(n) * 1103515245) + 12345;
Random(n+1) := MOD ((16807 * Random(n)), ((2^31) - 1));
The first formula has the advantage of not requiring a MOD function, so it can be written in standard SQL. However, the simplest generator that can be recommended (Park and Miller) uses:
Random(n+1) := MOD ((48271 * Random(n)), ((2^31) - 1));
Notice that the modulus is a prime number; this is important. The period of this generator is ((2^31) − 2), which is 2,147,483,646, or more than two billion numbers before this generator repeats. You must determine whether this is long enough for your application.
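As a minimal sketch of how such a generator would be driven from a host language, here is the Park and Miller formula in Python; the language choice and the names next_random and nth_random are assumptions of this illustration. Each call feeds the previous value back in, which is why the numbers have to be produced in sequence.

```python
MODULUS = 2 ** 31 - 1      # the prime modulus from the formula
MULTIPLIER = 48271

def next_random(previous):
    """Random(n+1) := MOD((48271 * Random(n)), (2^31) - 1)."""
    return (MULTIPLIER * previous) % MODULUS

def nth_random(seed, n):
    """Apply the generator n times, starting from the seed Random(0)."""
    value = seed
    for _ in range(n):
        value = next_random(value)
    return value

# The seed must be between 1 and MODULUS - 1; zero would stay zero forever.
sequence = [nth_random(1, n) for n in range(1, 6)]
print(sequence)
```

Python integers do not overflow, so the product can be taken directly; in a language with 32-bit integers the multiplication needs a wider type or a decomposition of the product.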
If you have an XOR function in your SQL, then you can also use shift register algorithms. The XOR is the bitwise exclusive OR that works on an integer as it is stored in the hardware; I would assume 32 bits on most small computers. Some usable shift register algorithms are:
Random(n+1) := Random(n-103) XOR Random(n-250);
Random(n+1) := Random(n-1063) XOR Random(n-1279);
One method for writing a random number generator on the fly when the vendor’s library does not have one is to pick a seed using one or more key columns and a call to the system clock’s fractional seconds, such as RANDOM(keycol + EXTRACT (SECOND FROM CURRENT_TIME)) * 1000. This avoids problems with patterns in the keys, while the key column values ensure uniqueness of the seed values.
Another method is to use a PRIMARY KEY or UNIQUE column(s) and apply a hashing algorithm. You can pick one of the random number generator functions already discussed and use the unique value, as if it were the seed, as a quick way to get a hashing function. Hashing algorithms try to be uniformly distributed, so if you can find a good one, you will approach nearly unique random selection. The trick is that the hashing algorithm has to be simple enough to be written in the limited math available in SQL.
Once you have a column of random numbers, you can convert the random numbers into a randomly ordered sequence with this statement:
UPDATE RandomNbrs
   SET randomizer = (SELECT COUNT(*)
                       FROM RandomNbrs AS S1
                      WHERE S1.randomizer <= RandomNbrs.randomizer);
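The idea of loading random values from a host language and then ranking them can be sketched as follows. This is an illustration in Python with the standard random and sqlite3 modules, both assumptions here; the function name random_ordering and the column alias draw_position are hypothetical. It computes the rank in a SELECT, not an UPDATE, so that the result does not depend on how a particular product evaluates an UPDATE that reads its own table.

```python
import random
import sqlite3

def random_ordering(row_count, seed=None):
    """Load random values from the host language, then rank them in SQL."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE RandomNbrs
        (seq_nbr INTEGER NOT NULL PRIMARY KEY,
         randomizer FLOAT NOT NULL)""")
    conn.executemany(
        "INSERT INTO RandomNbrs VALUES (?, ?)",
        [(i, rng.random()) for i in range(1, row_count + 1)])
    # The rank of each random value is the count of values at or below it.
    rows = conn.execute("""
        SELECT R1.seq_nbr,
               (SELECT COUNT(*)
                  FROM RandomNbrs AS S1
                 WHERE S1.randomizer <= R1.randomizer) AS draw_position
          FROM RandomNbrs AS R1
         ORDER BY R1.seq_nbr
    """).fetchall()
    conn.close()
    return rows

for seq_nbr, draw_position in random_ordering(10, seed=7):
    print(seq_nbr, draw_position)
```

As long as the random values are all different, the positions form a permutation of 1 through the number of rows.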
To get one random row from a table, you can use this approach:
CREATE VIEW LotteryDrawing (keycol, ..., spin)
AS SELECT LotteryTickets.*,
(RANDOM(<keycol> + <fractional seconds from clock>))
FROM LotteryTickets
GROUP BY spin
HAVING COUNT(*) = 1;
Then simply use this query:
SELECT *
FROM LotteryDrawing
WHERE spin = (SELECT MAX(spin)
FROM LotteryDrawing);
The pseudo-random number function is not standard SQL, but it is common enough. Using the keycol as the seed probably means that you will get a different value for each row, but we can avoid duplicates with the GROUP BY and HAVING clauses. Adding the fractional seconds will change the result every time, but it might be illegal in some SQL products, which disallow variable elements in VIEW definitions.
Let’s assume you have a function called RANDOM() that returns a random number between 0.00 and 1.00. If you just want one random row out of the table, and you have a numeric key column, Tom Moreau proposed that you could find the MAX() and MIN(), then calculate a random number between them.
SELECT L1.*
FROM LotteryDrawing AS L1
WHERE col_1