Joe Celko's SQL for Smarties: Advanced SQL Programming (P64)

602 CHAPTER 26: SET OPERATIONS

CREATE TABLE B (i INTEGER NOT NULL);

INSERT INTO B VALUES (2), (2), (3), (3);

The UNION and INTERSECT operations have regular behavior in that:

(A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3))

and

(A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)

However,

(A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)

Or, more literally, (1) <> ((1), (2)) for the tables given in the example. Likewise, we have:

(B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)

by a coincidence of the particular values used in these tables.
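
The asymmetry can be checked outside the database. What follows is a minimal sketch in Python that uses collections.Counter as a stand-in for SQL multisets. Table B is the one created above. The contents assumed for table A (one 1 and three 2s) do not appear in this excerpt; they are only a guess that is consistent with the results quoted, and the helper name distinct is made up for the illustration.

```python
from collections import Counter

# Assumed contents of A (not shown in this excerpt); B matches the INSERT above.
A = Counter([1, 2, 2, 2])
B = Counter([2, 2, 3, 3])

def distinct(bag):
    """SELECT DISTINCT: the set of values present in a multiset."""
    return set(bag.elements())

union_all = A + B          # UNION ALL adds the row counts
intersect_all = A & B      # INTERSECT ALL keeps the smaller count
a_except_all_b = A - B     # EXCEPT ALL subtracts counts and drops anything <= 0
b_except_all_a = B - A

a_except_b = set(A) - set(B)   # plain EXCEPT works on distinct rows
b_except_a = set(B) - set(A)

print(distinct(union_all))                     # {1, 2, 3}
print(distinct(intersect_all))                 # {2}
print(a_except_b, distinct(a_except_all_b))    # {1} {1, 2}
print(b_except_a, distinct(b_except_all_a))    # {3} {3}
```

With these values, EXCEPT ALL leaves one surplus 2 in A, which is why the DISTINCT of the ALL form differs from the plain form in one direction only.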

26.4 Equality and Proper Subsets

At one point, when SQL was still in the laboratory at IBM, there was a CONTAINS operator that would tell you if one table was a subset of another. It disappeared in later versions of the language, and no vendor picked it up. Set equality was never part of SQL as an operator, so you would have had to use the two expressions ((A CONTAINS B) AND (B CONTAINS A)) to find out.

Today, you can use the methods shown in the section on Relational Division to determine containment or equality. However, Itzik Ben-Gan came up with a novel approach for finding containment and equality that is worth a mention.

SELECT SUM(DISTINCT match_col)
  FROM (SELECT CASE
               WHEN S1.col IN (SELECT S2.col FROM S2)
               THEN 1 ELSE -1 END
          FROM S1) AS X(match_col)
HAVING SUM(DISTINCT match_col) = :n;

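You can set (:n) to 1, 0, or −1 for each particular test.

When I find a matching row in S1, I get a +1; when I find a mismatched row in S1, I get a −1. If S1 has both kinds of rows, the distinct values sum to zero, so the tables overlap but S1 is not contained in S2. If they sum to +1, every row of S1 appears in S2, so S1 is a subset of S2; the same test with the tables swapped tells you whether the two are equal or S1 is a proper subset. If they sum to −1, they are disjoint.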
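
The three outcomes can be tried on small tables. This is a rough sketch using Python's built-in sqlite3 module, not the author's code: the function name match_sum and the sample values are invented for the illustration, the derived column is named inside the subquery because SQLite does not accept the X(match_col) form, and the comparison with :n is left to the caller instead of a HAVING clause.

```python
import sqlite3

def match_sum(s1_values, s2_values):
    """Return SUM(DISTINCT match_col) for single-column tables S1 and S2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE S1 (col INTEGER NOT NULL)")
    conn.execute("CREATE TABLE S2 (col INTEGER NOT NULL)")
    conn.executemany("INSERT INTO S1 VALUES (?)", [(v,) for v in s1_values])
    conn.executemany("INSERT INTO S2 VALUES (?)", [(v,) for v in s2_values])
    row = conn.execute(
        """
        SELECT SUM(DISTINCT match_col)
          FROM (SELECT CASE WHEN S1.col IN (SELECT S2.col FROM S2)
                            THEN 1 ELSE -1 END AS match_col
                  FROM S1)
        """
    ).fetchone()
    conn.close()
    return row[0]

print(match_sum([1, 2], [1, 2, 3]))   # 1: every S1 value is found in S2
print(match_sum([1, 4], [1, 2, 3]))   # 0: some S1 values match, some do not
print(match_sum([4, 5], [1, 2, 3]))   # -1: no S1 value is found in S2
```

Because the test only looks at S1 from the inside, a result of +1 by itself says nothing about rows that exist only in S2.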

CHAPTER 27: SUBSETS

I am defining subset operations as queries that extract a particular subset from a given set, as opposed to set operations, which work among sets. The obvious way to extract a subset from a table is just to use a WHERE clause, which will pull out the rows that meet its criterion. But not all the subsets we want are easily defined by such a simple predicate. This chapter is a collection of tricks for constructing useful, but not obvious, subsets from a table.

27.1 Every nth Item in a Table

SQL is a set-oriented language, which cannot identify individual rows by their physical positions in a disk file that holds a table. Instead, a unique logical key is detected by logical expressions, and a row is retrieved. If you are given a file of employees in which the ordering of the file is based on their employee numbers, and you want to pick out every nth employee record for a survey, the job is easy. You write a procedure that loops through the file and writes every nth one to a second file.

The immediate thought of how this should be done in SQL is to simply compute MOD (emp_nbr, :n), where MOD() is the modulo function found in most SQL implementations, and save those employee rows where this function is zero. The trouble is that employees are not issued consecutive identification numbers. The identification numbers are unique, but the gaps between them mean that MOD() on the number itself does not pick out every nth row.

Vendor extensions often include an exposed physical row locator that gives a sequential numbering to the physical records; this sequential numbering can be used to perform these functions. This practice is a complete violation of Dr. Codd's definition of a relational database, and it requires that the underlying physical implementation use a contiguous sequential record for each row. Such things are highly proprietary, but because these features are so low-level, they will run very fast on that one particular product.

Row numbers have more problems than being nonstandard. If the physical storage is rearranged, then the row numbers have to change. Users logged on and looking at the same base table through different VIEWs may or may not get the same row number for the same physical row. One of the advantages of an RDBMS was supposed to be that the logical view of the data would be consistent, even when the physical storage changed.

You can get similar results with a self-JOIN on the Personnel table to partition it into a nested series of grouped tables, just as we did for the “top n” problem. You then pick out the largest value in each group. There may be an index or a uniqueness constraint on the emp_nbr column to ensure uniqueness, so the EXISTS predicate will get a performance boost.

SELECT P1.emp_nbr
  FROM Personnel AS P1
 WHERE EXISTS
       (SELECT MAX(emp_nbr)
          FROM Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
        HAVING MOD (COUNT(*), :n) = 0);

A nonnested version of the same query looks like this:

SELECT P1.emp_nbr
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_nbr >= P2.emp_nbr
 GROUP BY P1.emp_nbr
HAVING MOD (COUNT(*), :n) = 0;

Both queries count the number of P2 rows with a value less than or equal to the value in the P1 row.
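
To see the counting at work on employee numbers with gaps, here is a small sketch that runs the nonnested form through Python's built-in sqlite3 module. The function name every_nth and the sample employee numbers are invented for the illustration. SQLite writes the modulo as the % operator, and an ORDER BY is added only to make the printed result predictable.

```python
import sqlite3

def every_nth(emp_nbrs, n):
    """Return every nth emp_nbr, counting rows in ascending emp_nbr order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Personnel (emp_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO Personnel VALUES (?)", [(e,) for e in emp_nbrs])
    rows = conn.execute(
        """
        SELECT P1.emp_nbr
          FROM Personnel AS P1, Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
         GROUP BY P1.emp_nbr
        HAVING COUNT(*) % ? = 0
         ORDER BY P1.emp_nbr
        """,
        (n,),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

# The numbers have gaps, yet the second, fourth, and sixth rows are selected.
print(every_nth([3, 7, 12, 20, 31, 45], 2))   # [7, 20, 45]
```

The self-join compares every row with every row at or below it, so the work grows roughly with the square of the table size unless the optimizer can use the index on emp_nbr.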

27.2 Picking Random Rows from a Table

The answer is that, basically, you cannot directly pick a set of random rows from a table in SQL. There is no randomize operator in the standard, and you don't often find the same pseudo-random number generator function in various vendor extensions, either.

Picking random rows from a table for a statistical sample is a handy thing, and you do it in other languages with a pseudo-random number generator. There are two kinds of random drawings from a set, with or without replacement. If SQL had random number functions, I suppose they would be shown as RANDOM(x) and RANDOM(DISTINCT x). But there is no such function in SQL, and none is planned. Examples from the real world include dealing a poker hand (a random with no replacement situation) and shooting craps (a random with replacement situation). If two players in a poker game get identical cards, you are using a pinochle deck. In a craps game, each roll of the dice is independent of the previous one and can repeat it.

The problem is that SQL is a set-oriented language, and wants to do an operation “all at once” on a well-defined set of rows. Random sets are defined by a nondeterministic procedure by definition, instead of a deterministic logic expression.

The SQL/PSM language does have an option to declare or create a procedure that is DETERMINISTIC or NOT DETERMINISTIC. The DETERMINISTIC option means that the optimizer can compute this function once for a set of input parameter values and then use that result everywhere in the current SQL statement that a call to the procedure with those parameters appears. The NOT DETERMINISTIC option means that, given the same parameters, you might not get the same results for each call to the procedure within the same SQL statement.

Unfortunately, most SQL products do not have this feature in their proprietary procedural languages. Thus, the random number function in Oracle is nondeterministic and the one in SQL Server is deterministic. For example,

CREATE TABLE RandomNbrs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 randomizer FLOAT NOT NULL);

INSERT INTO RandomNbrs
VALUES (1, RANDOM()), (2, RANDOM()), (3, RANDOM());

This statement will result in the three rows all getting the same value in the randomizer column in a version of SQL Server, but three different numbers in a version of Oracle.

While subqueries are not allowed in DEFAULT clauses, system-related functions such as CURRENT_TIMESTAMP and CURRENT_USER are allowed. In some SQL implementations, this includes the RANDOM() function.

CREATE TABLE RandomNbrs2
(seq_nbr INTEGER PRIMARY KEY,
 randomizer FLOAT -- warning !! not standard SQL
   DEFAULT (
     (CASE (CAST(RANDOM() + 0.5 AS INTEGER) * -1)
      WHEN 0.0 THEN 1.0 ELSE -1.0 END)
     * MOD (CAST(RANDOM() * 100000 AS INTEGER), 10000)
     * RANDOM())
   NOT NULL);

INSERT INTO RandomNbrs2
VALUES (1, DEFAULT), (2, DEFAULT), (3, DEFAULT), (4, DEFAULT), (5, DEFAULT),
       (6, DEFAULT), (7, DEFAULT), (8, DEFAULT), (9, DEFAULT), (10, DEFAULT);

Here is a sample output from an SQL Server 7.0 implementation.

seq_nbr  randomizer
============================
 1       -121.89758452446999
 2       -425.61113508053933
 3       3918.1554683876675
 4       9335.2668286173412
 5       54.463890640027664
 6       -5.0169085346410522
 7       -5430.63417246276
 8       915.9835973796487
 9       28.109161998753301
10       741.79452047043048

The best way to do this is to add a column to the table to hold a random number, then use an external language with a good pseudo-random number generator in its function library to load the new column with random values with a cursor in a host language. You have to do it this way, because random number generators work differently from other function calls. They start with an initial value called a “seed” (shown as Random[0] in the rest of this discussion) provided by the user or the system clock. The seed is used to create the first number in the sequence, Random[1]. Then each call, Random[n], to the function uses the previous number to generate the next one, Random[n+1].

There is no way to do a sequence of actions in SQL without a cursor, so you are in procedural code.

The term “pseudo-random number generator” is often shortened to just “random number generator,” but this is technically wrong. All of the generators will eventually return a value that appeared in the sequence earlier, and the procedure will hang in a cycle. Procedures are deterministic, and we are living in a mathematical heresy when we try to use them to produce truly random results. However, if the sequence has a very long cycle and meets some other tests for randomness over the range of the cycle, then we can use it.

There are many kinds of generators. The linear congruence pseudo-random number generator family has generator formulas of the form:

Random[n+1] := MOD ((x * Random[n] + y), m);

There are restrictions on the relationships among x, y, and m that deal with their relative primality. Knuth gives a proof that if

Random[0] is not a multiple of 2 or 5
m = 10^e where (e >= 5)
y = 0
MOD (x, 200) is in the set (3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 77, 83, 91, 109, 117, 123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197)

then the period will be 5 * 10^(e-2).
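
The period claim is easy to check by brute force for the smallest allowed modulus. The sketch below is only an illustration in Python: the function name lcg_period is invented, and the choices x = 3, e = 5, and a seed of 1 are simply one combination that meets the conditions listed above.

```python
def lcg_period(x, m, seed):
    """Count the steps until Random[n+1] = MOD(x * Random[n], m) returns to the seed."""
    value = (x * seed) % m
    steps = 1
    while value != seed:
        value = (x * value) % m
        steps += 1
        if steps > m:   # safety stop: the seed is not on a cycle
            raise ValueError("sequence never returned to the seed")
    return steps

e = 5
print(lcg_period(3, 10**e, 1))   # 5000, which is 5 * 10^(e-2)
```

A period of 5,000 out of 100,000 possible values shows how much of the modulus a multiplicative generator with y = 0 can leave unused.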

There are old favorites that many C programmers use from this family, such as:

Random[n+1] := (Random[n] * 1103515245) + 12345;

Random[n+1] := MOD ((16807 * Random[n]), ((2^31) - 1));

The first formula has the advantage of not requiring a MOD function, so it can be written in standard SQL. However, the simplest generator that can be recommended (Park and Miller) uses:

Random[n+1] := MOD ((48271 * Random[n]), ((2^31) - 1));

Notice that the modulus is a prime number; this is important. The period of this generator is ((2^31) − 2), which is 2,147,483,646, or more than two billion numbers before this generator repeats. You must determine whether this is long enough for your application.
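
As a concrete reading of the 48271 formula, here is a short Python sketch. The names park_miller, MODULUS, and MULTIPLIER are invented for the illustration, and the seed of 1 is an arbitrary choice; any seed from 1 to (2^31) − 2 would do.

```python
MODULUS = 2**31 - 1     # the prime 2,147,483,647
MULTIPLIER = 48271

def park_miller(seed):
    """Yield Random[1], Random[2], ... for a seed between 1 and MODULUS - 1."""
    if not 0 < seed < MODULUS:
        raise ValueError("seed must be between 1 and MODULUS - 1")
    value = seed
    while True:
        value = (MULTIPLIER * value) % MODULUS
        yield value

gen = park_miller(1)
first = next(gen)
second = next(gen)
print(first, second)   # 48271 182605794
```

Python integers do not overflow, so the product needs no special care here; in a language with 32-bit integers, the multiplication has to be done in a wider type before the modulo is taken.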

If you have an XOR function in your SQL, then you can also use shift register algorithms. The XOR is the bitwise exclusive OR that works on an integer as it is stored in the hardware; I would assume 32 bits on most small computers. Some usable shift register algorithms are:

Random[n+1] := Random[n-103] XOR Random[n-250];

Random[n+1] := Random[n-1063] XOR Random[n-1279];
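
A shift register generator needs a history of earlier values before it can produce anything. The Python sketch below follows the first recurrence exactly as it is written here, which reaches back 251 positions from the new value. The function name shift_register is invented, and filling the starting history from the 48271 multiplicative generator is an assumption made only so that the example can run.

```python
def shift_register(history, count):
    """Extend history with count values of Random[n+1] = Random[n-103] XOR Random[n-250]."""
    if len(history) < 251:
        raise ValueError("need at least 251 starting values")
    seq = list(history)
    for _ in range(count):
        n = len(seq) - 1                        # index of the latest value, Random[n]
        seq.append(seq[n - 103] ^ seq[n - 250])
    return seq

# Hypothetical starting values: 251 steps of the 48271 generator, seeded with 1.
start, value = [], 1
for _ in range(251):
    value = (48271 * value) % (2**31 - 1)
    start.append(value)

seq = shift_register(start, 5)
print(seq[-5:])   # the five newly generated values
```

The quality of the output depends heavily on the starting values, so the seeding step deserves as much attention as the recurrence.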

One method for writing a random number generator on the fly when the vendor's library does not have one is to pick a seed using one or more key columns and a call to the system clock's fractional seconds, such as RANDOM(keycol + EXTRACT (SECOND FROM CURRENT_TIME)) * 1000. This avoids problems with patterns in the keys, while the key column values ensure uniqueness of the seed values.

Another method is to use a PRIMARY KEY or UNIQUE column(s) and apply a hashing algorithm. You can pick one of the random number generator functions already discussed and use the unique value, as if it were the seed, as a quick way to get a hashing function. Hashing algorithms try to be uniformly distributed, so if you can find a good one, you will approach nearly unique random selection. The trick is that the hashing algorithm has to be simple enough to be written in the limited math available in SQL.
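
To illustrate treating a key as if it were the seed, the Python sketch below applies one step of the 48271 generator to each key. The function name key_hash and the key range 1 to 1000 are invented for the example. Because the modulus is prime, distinct keys between 1 and (2^31) − 2 cannot collide under this particular formula.

```python
MODULUS = 2**31 - 1

def key_hash(key):
    """One multiplicative-generator step, used as a hash of a key in 1 .. MODULUS - 1."""
    return (48271 * key) % MODULUS

keys = range(1, 1001)
hashes = [key_hash(k) for k in keys]
print(len(set(hashes)) == len(hashes))   # True: no collisions among these keys
```

The same expression needs only multiplication and a modulo, which keeps it within the limited math available in SQL.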

Once you have a column of random numbers, you can convert the random numbers into a randomly ordered sequence with this statement:

UPDATE RandomNbrs
   SET randomizer = (SELECT COUNT(*)
                       FROM RandomNbrs AS S1
                      WHERE S1.randomizer <= RandomNbrs.randomizer);
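
The ranking idea can be tried out with Python's built-in sqlite3 module. This sketch is an illustration only: it computes each row's position with a SELECT instead of an UPDATE, the column alias draw_order is invented, and the four randomizer values are fixed by hand so that the result is repeatable. Each row's position is the count of rows whose random value is less than or equal to its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RandomNbrs "
    "(seq_nbr INTEGER NOT NULL PRIMARY KEY, randomizer FLOAT NOT NULL)"
)
# Fixed illustrative values instead of real random draws.
conn.executemany(
    "INSERT INTO RandomNbrs VALUES (?, ?)",
    [(1, 0.42), (2, 0.07), (3, 0.91), (4, 0.55)],
)
positions = conn.execute(
    """
    SELECT R.seq_nbr,
           (SELECT COUNT(*)
              FROM RandomNbrs AS S1
             WHERE S1.randomizer <= R.randomizer) AS draw_order
      FROM RandomNbrs AS R
     ORDER BY R.seq_nbr
    """
).fetchall()
conn.close()
print(positions)   # [(1, 2), (2, 1), (3, 4), (4, 3)]
```

If two rows held the same random value they would receive the same position, so the method relies on the random values being distinct.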

To get one random row from a table, you can use this approach:

CREATE VIEW LotteryDrawing (keycol, ..., spin)
AS SELECT LotteryTickets.*,
          (RANDOM(<keycol> + <fractional seconds from clock>))
     FROM LotteryTickets
    GROUP BY spin
   HAVING COUNT(*) = 1;

Then simply use this query:

SELECT *
  FROM LotteryDrawing
 WHERE spin = (SELECT MAX(spin)
                 FROM LotteryDrawing);

The pseudo-random number function is not standard SQL, but it is common enough. Using the keycol as the seed probably means that you will get a different value for each row, but we can avoid duplicates with the GROUP BY ... HAVING. Adding the fractional seconds will change the result every time, but it might be illegal in some SQL products, which disallow variable elements in VIEW definitions.

Let's assume you have a function called RANDOM() that returns a random number between 0.00 and 1.00. If you just want one random row out of the table, and you have a numeric key column, Tom Moreau proposed that you could find the MAX() and MIN(), then calculate a random number between them.

SELECT L1.*
  FROM LotteryDrawing AS L1
 WHERE col_1

Date posted: 06/07/2014, 09:20