
Joe Celko's SQL for Smarties - Advanced SQL Programming (P49)


'Charles' 900.00
'Delta' 800.00
'Eddy' 700.00
'Fred' 700.00
'George' 700.00

Able, Baker, and Charles are the three highest paid personnel, but $1,000.00, $900.00, and $800.00 are the three highest salaries. The highest salaries belong to Able, Baker, Charles, and Delta, a set with four elements.

Most new SQL programmers solve this in other SQL products by producing a result with an ORDER BY clause, then reading the first so many rows from that cursor result. In Standard SQL, cursors have an ORDER BY clause but no way to return a fixed number of rows. However, most SQL products have proprietary syntax to clip the result set at exactly some number of rows. Oh, yes, did I mention that the whole table has to be sorted, and that this can take some time if the table is large?
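The sort-and-clip habit described above can be sketched with Python's standard sqlite3 module. The table layout and the seven sample rows are assumptions taken from the names and salaries mentioned in the surrounding text; this illustrates the problem and is not a recommended solution.

```python
import sqlite3

# Assumed sample data: the Personnel rows implied by the surrounding text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

# The whole table is sorted, then only the first three rows are read.
cursor = conn.execute(
    "SELECT emp_name, salary FROM Personnel ORDER BY salary DESC, emp_name"
)
first_three = cursor.fetchmany(3)
names = [name for name, _ in first_three]
salaries = [salary for _, salary in first_three]
```

Reading exactly three rows returns the three highest paid people, but the third highest salary (800.00) never appears, which is the ambiguity the text points out.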

The best algorithm for this problem is the Partition algorithm by C. A. R. Hoare. This is the procedure in QuickSort that splits a set of values into three partitions: those greater than a pivot value, those less than the pivot, and those equal to the pivot. The expected run time is only (2*n) operations.

In practice, it is a good idea to start with a pivot at or near the kth position you seek, because real data tends to have some ordering already in it. If the file is already in sorted order, this trick will return an answer in one pass. Here is the algorithm in Pascal:

CONST list_length = { some large number };

TYPE LIST = ARRAY [1..list_length] OF REAL;

PROCEDURE FindTopK (Kth : INTEGER; VAR records : LIST);
VAR pivot : REAL;
    left, right, start, finish : INTEGER;
BEGIN
start := 1;
finish := list_length;
WHILE start < finish
DO BEGIN
   pivot := records[Kth];
   left := start;
   right := finish;
   REPEAT
     { skip values already on the correct side of the pivot }
     WHILE (records[left] > pivot) DO left := left + 1;
     WHILE (records[right] < pivot) DO right := right - 1;
     IF (left <= right)
     THEN BEGIN { swap right and left elements }
          Swap (records[left], records[right]);
          left := left + 1;
          right := right - 1;
          END;
   UNTIL (left > right);
   { narrow the search to the partition that contains position Kth }
   IF (right < Kth) THEN start := left;
   IF (left > Kth) THEN finish := right;
   END;
{ the first k numbers are in positions 1 through kth, in no particular order except that the kth highest number is in position kth }
END;
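For readers who do not have a Pascal compiler at hand, the same partitioning idea can be sketched in Python. The function name find_top_k and the zero-based indexing are choices made for this sketch, not part of the original listing; the comparisons are arranged so that larger values move to the front.

```python
def find_top_k(records, k):
    """Rearrange records in place so the k largest values occupy records[:k].

    Hoare-style selection: repeatedly partition around the value currently
    sitting at the k-th position, then keep only the side that contains it.
    Assumes 1 <= k <= len(records).
    """
    kth = k - 1                      # zero-based position of the k-th highest value
    start, finish = 0, len(records) - 1
    while start < finish:
        pivot = records[kth]
        left, right = start, finish
        while True:
            while records[left] > pivot:     # already on the "high" side
                left += 1
            while records[right] < pivot:    # already on the "low" side
                right -= 1
            if left <= right:
                records[left], records[right] = records[right], records[left]
                left += 1
                right -= 1
            if left > right:
                break
        if right < kth:
            start = left
        if left > kth:
            finish = right
    return records[:k]


salaries = [700.0, 900.0, 1000.0, 700.0, 800.0, 900.0, 700.0]
top_three = find_top_k(salaries, 3)
```

After the call, the first three positions hold the three largest values in no guaranteed order, and position three holds the third highest value.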

The original articles in Explain magazine gave several solutions (Murchison n.d.; Wankowski n.d.).

One involved UNION operations on nested subqueries. The first result table was the maximum for the whole table; the second result table was the maximum for the table entries less than the first maximum; and so forth. The pattern is extensible. It looked like this:

SELECT MAX(salary)

FROM Personnel

UNION

SELECT MAX(salary)

FROM Personnel

WHERE salary < (SELECT MAX(salary)

FROM Personnel)

UNION

SELECT MAX(salary)

FROM Personnel

WHERE salary < (SELECT MAX(salary)

FROM Personnel

WHERE salary

< (SELECT MAX(salary) FROM Personnel));


This answer can give you a pretty serious performance problem because of the subquery nesting and the UNION operations. Every UNION will trigger a sort to remove duplicate rows from the results, since salary is not a UNIQUE column.
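To see what the UNION pattern returns, here is a small sketch that runs it in SQLite through Python's sqlite3 module. The sample rows are assumed from the names and salaries used elsewhere in this section, and SQLite stands in for whatever SQL product you actually use.

```python
import sqlite3

# Assumed sample data: the Personnel rows implied by the surrounding text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

union_query = """
SELECT MAX(salary) FROM Personnel
UNION
SELECT MAX(salary) FROM Personnel
 WHERE salary < (SELECT MAX(salary) FROM Personnel)
UNION
SELECT MAX(salary) FROM Personnel
 WHERE salary < (SELECT MAX(salary) FROM Personnel
                  WHERE salary < (SELECT MAX(salary) FROM Personnel))
"""
# UNION gives no ordering guarantee, so sort on the client side.
top_salaries = sorted((row[0] for row in conn.execute(union_query)), reverse=True)
```

Each branch peels off one more maximum, so the result is the three highest distinct salary values rather than three employees.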

A special case of the use of the scalar subquery with the MAX() function is finding the last two values in a set to look for a change. This is most often done with date values for time series work. For example, to get the last two reviews for an employee:

SELECT :search_name, MAX(P1.review_date), P2.review_date
FROM Personnel AS P1, Personnel AS P2
WHERE P1.review_date < P2.review_date
AND P1.emp_name = :search_name
AND P2.review_date = (SELECT MAX(review_date) FROM Personnel)
GROUP BY P2.review_date;

The scalar subquery is not correlated, so it should run pretty fast and be executed only once.

An improvement on the UNION approach was to find the third highest salary with a subquery, then return all the records with salaries that were equal or higher; this would handle ties. It looked like this:

SELECT DISTINCT salary
FROM Personnel
WHERE salary >=
(SELECT MAX(salary)
FROM Personnel
WHERE salary < (SELECT MAX(salary)
FROM Personnel
WHERE salary <
(SELECT MAX(salary) FROM Personnel)));
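A quick way to check how the threshold handles ties is to run it against the sample data. This sketch uses Python's sqlite3 module; the table layout and rows are assumptions based on the names and salaries in this section, and the second query (selecting names instead of DISTINCT salaries) is added here only for illustration.

```python
import sqlite3

# Assumed sample data: the Personnel rows implied by the surrounding text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

# The nested MAX() subqueries compute the third highest distinct salary.
threshold = """
(SELECT MAX(salary) FROM Personnel
  WHERE salary < (SELECT MAX(salary) FROM Personnel
                   WHERE salary < (SELECT MAX(salary) FROM Personnel)))
"""
top_salaries = [row[0] for row in conn.execute(
    "SELECT DISTINCT salary FROM Personnel WHERE salary >= "
    + threshold + " ORDER BY salary DESC")]
top_earners = [row[0] for row in conn.execute(
    "SELECT emp_name FROM Personnel WHERE salary >= "
    + threshold + " ORDER BY emp_name")]
```

With DISTINCT the query yields three salary values; selecting the names instead yields four people, because Baker and Charles tie at 900.00.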

Another answer was to use correlation names and return a single-row result table. This pattern is more easily extensible to larger groups; it also presents the results in sorted order without requiring the use of an ORDER BY clause. The disadvantage of this answer is that it will return a single row, not a column result. That might make it unusable for joining to other queries. It looked like this:


SELECT MAX(P1.salary_amt), MAX(P2.salary_amt),

MAX(P3.salary_amt)

FROM Personnel AS P1, Personnel AS P2, Personnel AS P3

WHERE P1.salary_amt > P2.salary_amt

AND P2.salary_amt > P3.salary_amt;

This approach will return the three highest salaries.
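The single-row shape of that answer is easy to confirm in SQLite via Python's sqlite3 module. The rows below are assumed from the sample names and salaries in this section, with the column named salary_amt to match the query.

```python
import sqlite3

# Assumed sample data, using the salary_amt column name from the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

# Three copies of the table, chained by strict "greater than" comparisons.
single_row = conn.execute("""
SELECT MAX(P1.salary_amt), MAX(P2.salary_amt), MAX(P3.salary_amt)
  FROM Personnel AS P1, Personnel AS P2, Personnel AS P3
 WHERE P1.salary_amt > P2.salary_amt
   AND P2.salary_amt > P3.salary_amt
""").fetchone()
```

The three values come back as columns of one row, already in descending order, which is convenient to read but awkward to join against.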

The best variation on the single row approach is done with the scalar subquery expressions in SQL. The query becomes:

SELECT (SELECT MAX (salary)

FROM Personnel) AS s1,

(SELECT MAX (salary)

FROM Personnel

WHERE salary NOT IN (s1)) AS s2,

(SELECT MAX (salary)

FROM Personnel

WHERE salary NOT IN (s1, s2)) AS s3,

...

(SELECT MAX (salary)

FROM Personnel

WHERE salary NOT IN (s1, s2, ..., s[n-1])) AS sn

FROM Dummy;

In this case, the table Dummy is anything, even an empty table.

There are single column answers based on the fact that SQL is a set-oriented language, so we ought to use a set-oriented specification. We want to get a subset of salary values that has a count of (n), has the greatest value from the original set as an element, and includes all values greater than its least element.

The idea is to take each salary and build a group of other salaries that are greater than or equal to it; this value is the boundary of the subset. The groups with three or fewer rows are what we want to see. The third element of an ordered list is also the maximum or minimum element of a set of three unique elements, depending on the ordering. Think of concentric sets, nested inside each other. This query gives a columnar answer, and the query can be extended to other numbers by changing the constant in the HAVING clause:

SELECT MIN(P1.salary_amt) -- the element on the boundary
FROM Personnel AS P1, -- P2 gives the elements of the subset
Personnel AS P2 -- P1 gives the boundary of the subset
WHERE P1.salary_amt >= P2.salary_amt
GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;
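The nested-set idea can be tried out with a short sketch in SQLite through Python's sqlite3 module. The rows are assumed from the sample names and salaries in this section; the query is the grouped self-join described above.

```python
import sqlite3

# Assumed sample data, using the salary_amt column name from the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

# One group per salary value; each group holds the salaries at or above it.
# Groups with three or fewer distinct members mark the top three values.
rows = conn.execute("""
SELECT MIN(P1.salary_amt)
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3
""").fetchall()
boundaries = sorted(row[0] for row in rows)
```

The group built around 700.00 contains four distinct salaries, so it is filtered out; the groups built around 800.00, 900.00, and 1000.00 survive.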

This can also be written as:

SELECT P1.salary_amt
FROM Personnel AS P1
WHERE (SELECT COUNT(*)
FROM Personnel AS P2
WHERE P2.salary_amt >= P1.salary_amt) <= 3;

However, the correlated subquery might be more expensive than the GROUP BY clause.

If you would like to know how many ties you have for each value, the query can be modified to this:

SELECT MIN(P1.salary_amt) AS top,
COUNT (CASE WHEN P1.salary_amt = P2.salary_amt
THEN 1 ELSE NULL END) / 2 AS ties
FROM Personnel AS P1, Personnel AS P2
WHERE P1.salary_amt >= P2.salary_amt
GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;

If the salary is unique, the ties column will return a zero; otherwise, you will get the number of occurrences of that value on each row of the result table.

Or if you would like to see the ranking next to the employees, here is another version using a GROUP BY:

SELECT P1.emp_name,
SUM (CASE WHEN (P1.salary_amt || P1.emp_name)
< (P2.salary_amt || P2.emp_name)
THEN 1 ELSE 0 END) + 1 AS rank
FROM Personnel AS P1, Personnel AS P2
WHERE P1.emp_name <> P2.emp_name
GROUP BY P1.emp_name
HAVING SUM (CASE WHEN (P1.salary_amt || P1.emp_name)
< (P2.salary_amt || P2.emp_name)
THEN 1 ELSE 0 END) <= (:n - 1);


The concatenation is to make ties in salary different by adding the key to a string conversion. This query assumes automatic data type conversion, but you can use an explicit CAST() function. This also assumes that the collation has a particular ordering of digits and letters, the old “ASCII versus EBCDIC” problem. You can use nested CASE expressions to get around this:

SELECT P1.emp_name,

SUM (CASE WHEN P1.salary_amt < P2.salary_amt THEN 1
WHEN P1.salary_amt > P2.salary_amt THEN 0
ELSE CASE WHEN P1.emp_name < P2.emp_name THEN 1 ELSE 0 END
END) + 1 AS rank

FROM

Here is another version that will produce the ties on separate lines with the names of the personnel who made the cut. This answer is due to Pierre Boutquin.

SELECT P1.emp_name, P1.salary_amt

FROM Personnel AS P1, Personnel AS P2

WHERE P1.salary_amt >= P2.salary_amt

GROUP BY P1.emp_name, P1.salary_amt

HAVING (SELECT COUNT(*) FROM Personnel) - COUNT(*) + 1 <= :n;

The idea is to use a little algebra. If we want to find (n of k) things, then the rejected subset of the set is of size (k-n). Using the sample data, we would get this result:

Results

name salary

==================

'Able' 1000.00

'Baker' 900.00

'Charles' 900.00

If we add a new employee at $900, we would also get him, but we would not get a new employee at $800 or less. In many ways, this is the most satisfying answer.
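The behavior described for the Boutquin query can be checked with a sketch in SQLite through Python's sqlite3 module. The sample rows are assumed from this section's data, and the added employee "Harry" is a hypothetical name invented for the test.

```python
import sqlite3

# Assumed sample data, using the salary_amt column name from the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

boutquin_query = """
SELECT P1.emp_name, P1.salary_amt
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P1.emp_name, P1.salary_amt
HAVING (SELECT COUNT(*) FROM Personnel) - COUNT(*) + 1 <= :n
"""

def top_names(n):
    # COUNT(*) is how many rows sit at or below this salary, so
    # (total - COUNT(*)) is how many rows sit strictly above it.
    return sorted(row[0] for row in conn.execute(boutquin_query, {"n": n}))

before = top_names(3)
conn.execute("INSERT INTO Personnel VALUES ('Harry', 900.0)")  # hypothetical new hire
after = top_names(3)
```

Adding a fourth person at 900.00 widens the result, while Delta at 800.00 stays out because more than two rows now sit above that salary.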

Here are two more versions of the solution:


SELECT P1.emp_name, P1.salary_amt
FROM Personnel AS P1, Personnel AS P2
GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(CASE WHEN P1.salary_amt < P2.salary_amt
THEN 1 ELSE NULL END) + 1 <= :n;

SELECT P1.emp_name, P1.salary_amt
FROM Personnel AS P1
LEFT OUTER JOIN Personnel AS P2
ON P1.salary_amt < P2.salary_amt
GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(P2.salary_amt) + 1 <= :n;

The subquery is unnecessary and can be eliminated with either of the above solutions.

As an aside, if you were awake during your college set theory course, you will remember that John von Neumann’s definition of ordinal numbers is based on nested sets. You can get a lot of ideas for self-joins from set theory theorems. John von Neumann was one of the greatest mathematicians of this century; he was the inventor of the modern stored program computer and Game Theory. Know your nerd heritage!

It should be obvious that any number can replace three in the query.

A subtle point is that the predicate “P1.salary_amt <= P2.salary_amt” will include the boundary value, and therefore implies that if we have three or fewer personnel, then we still have a result. If you want to call off the competition for lack of a quorum, then change the predicate to “P1.salary_amt < P2.salary_amt” instead.

Another way to express the query would be:

SELECT Elements.name, Elements.salary_amt
FROM Personnel AS Elements
WHERE (SELECT COUNT(*)
FROM Personnel AS Boundary
WHERE Elements.salary_amt < Boundary.salary_amt) < 3;

Likewise, the COUNT(*) and comparisons in the scalar subquery expression can be changed to give slightly different results.


You might want to test each version to see which one runs faster on your particular SQL product. If you want to swap the subquery and the constant for readability, you may do so in SQL, but not in SQL-89. What if I want to allow ties? Then just change COUNT() to a COUNT(DISTINCT) function in the HAVING clause, thus:

SELECT Elements.name, Elements.salary_amt

FROM Personnel AS Elements, Personnel AS Boundary

WHERE Elements.salary_amt <= Boundary.salary_amt

GROUP BY Elements.name, Elements.salary_amt

HAVING COUNT(DISTINCT Boundary.salary_amt) <= 3;

This says that I want to count the values of salary, not the salespersons, so that if two or more of the crew hit the same total, I will include them in the report as tied for a particular position. This also means that the results can be more than three rows, because I can have ties. As you can see, it is easy to get a subtle change in the results with just a few simple changes in predicates.

Notice that you can change the comparisons from “<=” to “<” and the “COUNT(*)” to “COUNT(DISTINCT P2.salary_amt)” to change the specification.
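The difference between counting rows and counting distinct values is easiest to see side by side. This sketch runs both variants in SQLite through Python's sqlite3 module; the sample rows are assumed from this section's data, and the name column is called emp_name here as an assumption.

```python
import sqlite3

# Assumed sample data; emp_name is an assumed column name for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?)",
    [("Able", 1000.0), ("Baker", 900.0), ("Charles", 900.0), ("Delta", 800.0),
     ("Eddy", 700.0), ("Fred", 700.0), ("George", 700.0)],
)

template = """
SELECT Elements.emp_name
  FROM Personnel AS Elements, Personnel AS Boundary
 WHERE Elements.salary_amt <= Boundary.salary_amt
 GROUP BY Elements.emp_name, Elements.salary_amt
HAVING {counter} <= 3
"""

def run(counter):
    # counter is the aggregate expression placed in the HAVING clause
    return sorted(row[0] for row in conn.execute(template.format(counter=counter)))

by_rows = run("COUNT(*)")                               # counts people
by_values = run("COUNT(DISTINCT Boundary.salary_amt)")  # counts salary levels
```

Counting rows stops at three people; counting distinct salary values admits Delta as well, because only three salary levels sit at or above 800.00.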

Ken Henderson came up with another version that uses derived tables and scalar subquery expressions in SQL:

SELECT P2.salary_amt

FROM (SELECT (SELECT COUNT(DISTINCT P1.salary_amt)

FROM Personnel AS P1

WHERE P3.salary_amt >= P1.salary_amt) AS ranking,

P3.salary_amt

FROM Personnel AS P3) AS P2

WHERE P2.ranking <= 3;

You can get other aggregate functions by using this query with the IN predicate. Assume that I have a SalaryHistory table from which I wish to determine the average pay for the three most recent pay changes of each employee. I am going to further assume that if you had three or fewer old salaries, you would still want to average the first, second, or third values you have on record.


SELECT S0.emp_nbr, AVG(S0.last_salary)
FROM SalaryHistory AS S0
WHERE S0.change_date
IN (SELECT P1.change_date
FROM SalaryHistory AS P1, SalaryHistory AS P2
WHERE P1.change_date <= P2.change_date
GROUP BY P1.change_date
HAVING COUNT(*) <= 3)
GROUP BY S0.emp_nbr;

21.4.3 Multiple Criteria Extrema Functions

Since the generalized extrema functions are based on sorting the data, it stands to reason that you could further generalize them to use multiple columns in a table. This can be done by changing the WHERE search condition. For example, to locate the top (n) tall and heavy employees for the basketball team, we could write:

SELECT P1.emp_id
FROM Personnel AS P1, Personnel AS P2
WHERE P2.height > P1.height -- major sort term
OR (P2.height = P1.height -- next sort term
AND P2.weight >= P1.weight)
GROUP BY P1.emp_id
HAVING COUNT(*) <= :n;

Procedural programmers will recognize this predicate, because it is what they used to write to do a sort on more than one field in a file system. Now it is very important to look at the predicates at each level of nesting to be sure that you have the right theta operator. The ordering of the predicates is also critical: there is a difference between ordering by height within weight or by weight within height.

One improvement would be to use row comparisons:

SELECT P1.emp_id
FROM Personnel AS P1, Personnel AS P2
WHERE (P2.height, P2.weight) <= (P1.height, P1.weight)
GROUP BY P1.emp_id
HAVING COUNT(*) <= 4;

The down side of this approach is that you cannot easily mix ascending and descending comparisons in the same comparison predicate. The trick is to make numeric columns negative to reverse the sense of the theta operator.
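The negation trick is not specific to SQL, since a row comparison behaves like a lexicographic tuple comparison. The following Python sketch uses tuples to show how negating one column lets a single "<=" express "tallest first, then lightest first". The roster, the column choice, and the helper names are all hypothetical.

```python
# Hypothetical roster: (name, height_cm, weight_kg).
roster = [("Ann", 200, 90), ("Bob", 200, 85), ("Cy", 195, 100), ("Dee", 195, 95)]


def sort_key(person):
    _, height, weight = person
    # Negating height turns "taller first" into an ascending comparison,
    # so one lexicographic "<=" covers height descending, weight ascending.
    return (-height, weight)


def top_n(people, n):
    # Keep each person who has at most n people (themselves included)
    # sorting at or before them, mirroring the COUNT(*) <= n test.
    return [p[0] for p in people
            if sum(sort_key(q) <= sort_key(p) for q in people) <= n]


ranked = [p[0] for p in sorted(roster, key=sort_key)]
best_three = top_n(roster, 3)
```

The same idea carries over to a row comparison such as (-P2.height, P2.weight) <= (-P1.height, P1.weight), provided your SQL product supports row comparisons.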

Before you attempt it, here is the scalar subquery version of the multiple extrema problems:

SELECT

(SELECT MAX(P0.height)

FROM Personnel AS P0

WHERE P0.weight = (SELECT MAX(weight)

FROM Personnel AS P1)) AS s1,

(SELECT MAX(P0.height)

FROM Personnel AS P0

WHERE height NOT IN (s1)

AND P0.weight = (SELECT MAX(weight)

FROM Personnel AS P1

WHERE height NOT IN (s1))) AS s2,

(SELECT MAX(P0.height)

FROM Personnel AS P0

WHERE height NOT IN (s1, s2)

AND P0.weight = (SELECT MAX(weight)

FROM Personnel AS P1

WHERE height NOT IN (s1, s2))) AS s3

FROM Dummy;

Again, multiple criteria and their ordering would be expressed as multiple levels of subquery nesting. This picks the tallest people and decides ties with the greatest weight within that subset of personnel. While this looks awful and is hard to read, it does run fairly fast, because the predicates are repeated and can be factored out by the optimizer.

Another form of multiple criteria is finding the generalized extrema functions within groupings; for example, finding the top three salaries in each department. Adding the grouping constraints to the subquery expressions gives us an answer:

SELECT dept_nbr, salary_amt

FROM Personnel AS P1

WHERE (SELECT COUNT(*)

FROM Personnel AS P2

WHERE P2.dept_nbr = P1.dept_nbr

AND P2.salary_amt > P1.salary_amt) < :n;
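As a rough check of the per-department pattern, here is a sketch in SQLite through Python's sqlite3 module. The department numbers, names, and salaries are hypothetical, and the subquery here counts colleagues in the same department who earn more, which is the direction that yields the highest salaries.

```python
import sqlite3

# Hypothetical data: two departments with four and two employees.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Personnel (emp_name TEXT PRIMARY KEY, "
    "dept_nbr INTEGER NOT NULL, salary_amt REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?, ?)",
    [("A", 10, 500.0), ("B", 10, 400.0), ("C", 10, 300.0), ("D", 10, 200.0),
     ("E", 20, 900.0), ("F", 20, 800.0)],
)

# A row qualifies when fewer than :n colleagues in its own department earn more.
top_per_dept = conn.execute("""
SELECT dept_nbr, salary_amt
  FROM Personnel AS P1
 WHERE (SELECT COUNT(*)
          FROM Personnel AS P2
         WHERE P2.dept_nbr = P1.dept_nbr
           AND P2.salary_amt > P1.salary_amt) < :n
 ORDER BY dept_nbr, salary_amt DESC
""", {"n": 3}).fetchall()
```

Department 20 has only two employees, so both appear; department 10 drops its lowest salary.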

Posted: 06/07/2014, 09:20
