Joe Celko's SQL for Smarties - Advanced SQL Programming P57


CREATE TABLE Sales
(salesman CHAR(10),
 client_name CHAR(10),
 sales_amt DECIMAL(9,2) NOT NULL,
 PRIMARY KEY (salesman, client_name));

The problem is to show each salesman, his client, the amount of that sale, what percentage of his total sales volume that one sale represents, and the cumulative percentage of his total sales we have reached at that point. We will sort the clients from the largest amount to the smallest. This problem is based on a salesman’s report originally written for a small commercial printing company. The idea was to show the salesmen where their business was coming from and to persuade them to give up their smaller accounts (defined as the lower 20%) to new salesmen. The report lets the salesman run his finger down the page and see which customers represented the top 80% of his income.

We can use derived tables to build layers of aggregation in the same query.

SELECT S0.salesman, S0.client_name, S0.sales_amt,
       ((S0.sales_amt * 100) / ST.salesman_total)
       AS percent_of_total,
       ((SUM(S1.sales_amt) * 100) / ST.salesman_total)
       AS cum_percent
  FROM Sales AS S0
       INNER JOIN
       Sales AS S1 -- the same salesman's sales at or above this one
       ON S0.salesman = S1.salesman
          AND (S1.sales_amt > S0.sales_amt
               OR (S1.sales_amt = S0.sales_amt
                   AND S1.client_name >= S0.client_name))
       INNER JOIN
       (SELECT S2.salesman, SUM(S2.sales_amt)
          FROM Sales AS S2
         GROUP BY S2.salesman) AS ST(salesman, salesman_total)
       ON S0.salesman = ST.salesman
 GROUP BY S0.salesman, S0.client_name, S0.sales_amt, ST.salesman_total;

However, if your SQL allows subqueries in the SELECT clause but not in the FROM clause, you can fake it with this query:


SELECT S0.salesman, S0.client_name, S0.sales_amt,
       (S0.sales_amt * 100.00
        / (SELECT SUM(S1.sales_amt)
             FROM Sales AS S1
            WHERE S0.salesman = S1.salesman))
       AS percentage_of_total,
       (SELECT SUM(S3.sales_amt)
          FROM Sales AS S3
         WHERE S0.salesman = S3.salesman
           AND (S3.sales_amt > S0.sales_amt
                OR (S3.sales_amt = S0.sales_amt
                    AND S3.client_name >= S0.client_name))) * 100.00
       / (SELECT SUM(S2.sales_amt)
            FROM Sales AS S2
           WHERE S0.salesman = S2.salesman) AS cum_percent
  FROM Sales AS S0;

This query will probably run like glue.
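If you want to check the arithmetic, the scalar-subquery version can be tried in a small in-memory database. The following is only a sketch: it assumes Python's sqlite3 module as the SQL engine, and the salesmen, clients, and amounts are made up for illustration and are not from the original report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sales
    (salesman CHAR(10), client_name CHAR(10),
     sales_amt DECIMAL(9,2) NOT NULL,
     PRIMARY KEY (salesman, client_name))""")
# Hypothetical sample rows (assumption, not from the original report).
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                 [("Able", "Acme", 500.0), ("Able", "Bolt", 300.0),
                  ("Able", "Cord", 200.0), ("Baker", "Dyne", 100.0)])

query = """
SELECT S0.salesman, S0.client_name, S0.sales_amt,
       S0.sales_amt * 100.0
         / (SELECT SUM(S1.sales_amt) FROM Sales AS S1
             WHERE S1.salesman = S0.salesman) AS percent_of_total,
       (SELECT SUM(S3.sales_amt) FROM Sales AS S3
         WHERE S3.salesman = S0.salesman
           AND (S3.sales_amt > S0.sales_amt
                OR (S3.sales_amt = S0.sales_amt
                    AND S3.client_name >= S0.client_name))) * 100.0
         / (SELECT SUM(S2.sales_amt) FROM Sales AS S2
             WHERE S2.salesman = S0.salesman) AS cum_percent
  FROM Sales AS S0
 ORDER BY S0.salesman, S0.sales_amt DESC, S0.client_name DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # salesman, client, amount, percent of total, cumulative percent
```

With this data, the cumulative percentage for the first salesman climbs to 100 as you read down his clients from largest to smallest, which is exactly the "run your finger down the page" reading of the report.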

23.6.4 Rankings and Related Statistics

Martin Tillinger posted this problem on the MSACCESS forum of CompuServe in early 1995. How do you rank your salesmen in each territory, given a SalesReport table that looks like this?

CREATE TABLE SalesReport
(salesman CHAR(20) NOT NULL PRIMARY KEY
    REFERENCES Salesforce(salesman),
 territory INTEGER NOT NULL,
 sales_tot DECIMAL (8,2) NOT NULL);

This statistic is called a ranking. A ranking is shown as integers that represent the ordinal values (first, second, third, and so on) of the elements of a set based on one of the values. In this case, sales personnel are ranked by their total sales within a territory. The one with the highest total sales is in first place, the next highest is in second place, and so forth.

The hard question is how to handle ties. The rule is that if two salespersons have the same value, they have the same ranking, and there are no gaps in the rankings. This is the nature of ordinal numbers: there cannot be a third place without a first and a second place. A query that will do this for us is:


SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

You might also remember that this is really a version of the generalized extrema functions we already discussed. Another way to write this query is thus:

SELECT S1.salesman, S1.territory, MAX(S1.sales_tot),
       SUM(CASE
           WHEN (S1.sales_tot || S1.salesman)
                <= (S2.sales_tot || S2.salesman)
           THEN 1 ELSE 0 END) AS rank
  FROM SalesReport AS S1, SalesReport AS S2
 WHERE S1.salesman <> S2.salesman
   AND S1.territory = S2.territory
 GROUP BY S1.salesman, S1.territory;

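This query uses the MAX() function on the nongrouping columns in the SalesReport to display them so that the aggregation will work. Note that the concatenation compares the two values as character strings, so it gives a numeric ordering only if sales_tot converts to a fixed-width string; Standard SQL also requires an explicit CAST to concatenate a DECIMAL with a CHAR column.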

It is worth looking at the four possible variations on this basic query to see what each change does to the result set.

Version 1: COUNT(DISTINCT) and >= yields a ranking.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

salesman   territory  sales_tot  rank
=====================================
'Wilson'           1     990.00     1
'Smith'            1     950.00     2
'Richard'          1     800.00     3
'Quinn'            1     700.00     4
'Parker'           1     345.00     5
'Jones'            1     345.00     5
'Hubbard'          1     345.00     5
'Date'             1     200.00     6
'Codd'             1     200.00     6
'Blake'            1     100.00     7

Version 2: COUNT(DISTINCT) and > yields a ranking, but it starts at zero.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(DISTINCT sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) AS rank
  FROM SalesReport AS S1;

salesman   territory  sales_tot  rank
=====================================
'Wilson'           1     990.00     0
'Smith'            1     950.00     1
'Richard'          1     800.00     2
'Quinn'            1     700.00     3
'Parker'           1     345.00     4
'Jones'            1     345.00     4
'Hubbard'          1     345.00     4
'Date'             1     200.00     5
'Codd'             1     200.00     5
'Blake'            1     100.00     6

Version 3: COUNT(ALL) and >= yields a standing which starts at one.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot >= S1.sales_tot
           AND S2.territory = S1.territory) AS standing
  FROM SalesReport AS S1;

salesman   territory  sales_tot  standing
=========================================
'Wilson'           1     990.00         1
'Smith'            1     950.00         2
'Richard'          1     800.00         3
'Quinn'            1     700.00         4
'Parker'           1     345.00         7
'Jones'            1     345.00         7
'Hubbard'          1     345.00         7
'Date'             1     200.00         9
'Codd'             1     200.00         9
'Blake'            1     100.00        10

Version 4: COUNT(ALL) and > yields a standing that starts at zero.

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) AS standing
  FROM SalesReport AS S1;

salesman   territory  sales_tot  standing
=========================================
'Wilson'           1     990.00         0
'Smith'            1     950.00         1
'Richard'          1     800.00         2
'Quinn'            1     700.00         3
'Parker'           1     345.00         4
'Jones'            1     345.00         4
'Hubbard'          1     345.00         4
'Date'             1     200.00         7
'Codd'             1     200.00         7
'Blake'            1     100.00         9
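The four variations differ only in the counting expression and the comparison operator, so they are easy to compare side by side. The sketch below assumes Python's sqlite3 module as the SQL engine; the helper name positions() is hypothetical, and the REFERENCES clause is dropped because there is no Salesforce table in this little test database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same columns as SalesReport; the foreign key is omitted (assumption).
conn.execute("""CREATE TABLE SalesReport
                (salesman CHAR(20) NOT NULL PRIMARY KEY,
                 territory INTEGER NOT NULL,
                 sales_tot DECIMAL(8,2) NOT NULL)""")
data = [("Wilson", 990), ("Smith", 950), ("Richard", 800), ("Quinn", 700),
        ("Parker", 345), ("Jones", 345), ("Hubbard", 345),
        ("Date", 200), ("Codd", 200), ("Blake", 100)]
conn.executemany("INSERT INTO SalesReport VALUES (?, 1, ?)", data)

def positions(count_expr, comparison):
    """Return the computed positions, highest sales first."""
    sql = f"""
        SELECT (SELECT {count_expr}
                  FROM SalesReport AS S2
                 WHERE S2.sales_tot {comparison} S1.sales_tot
                   AND S2.territory = S1.territory) AS pos
          FROM SalesReport AS S1
         ORDER BY S1.sales_tot DESC, S1.salesman DESC"""
    return [row[0] for row in conn.execute(sql)]

v1 = positions("COUNT(DISTINCT sales_tot)", ">=")  # ranking from one
v2 = positions("COUNT(DISTINCT sales_tot)", ">")   # ranking from zero
v3 = positions("COUNT(sales_tot)", ">=")           # standing from one
v4 = positions("COUNT(sales_tot)", ">")            # standing from zero
for name, result in [("v1", v1), ("v2", v2), ("v3", v3), ("v4", v4)]:
    print(name, result)
```

Each list lines up with the last column of the corresponding table above, which makes it easy to see that DISTINCT removes the gaps and that > versus >= moves the starting point.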

Another system, used in some British schools and in horse racing, will also leave gaps in the numbers, but in a different direction. For example, given this set of marks:

Marks  class_standing
======================
 100         1
  90         2
  90         2
  70         4


Both students with 90 were second because only one person had a higher mark. The student with 70 was fourth because there were three people ahead of him. With our data, that would be:

SELECT S1.salesman, S1.territory, S1.sales_tot,
       (SELECT COUNT(S2.sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot > S1.sales_tot
           AND S2.territory = S1.territory) + 1 AS british
  FROM SalesReport AS S1;

salesman   territory  sales_tot  british
========================================
'Wilson'           1     990.00        1
'Smith'            1     950.00        2
'Richard'          1     800.00        3
'Quinn'            1     700.00        4
'Parker'           1     345.00        5
'Jones'            1     345.00        5
'Hubbard'          1     345.00        5
'Date'             1     200.00        8
'Codd'             1     200.00        8
'Blake'            1     100.00       10
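The same "count the people ahead of you, then add one" rule can be checked against the school marks example. This is a sketch only: it assumes Python's sqlite3 module, and the Marks table with its student and mark columns is a hypothetical layout invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for the marks example (assumption).
conn.execute("CREATE TABLE Marks (student TEXT PRIMARY KEY, "
             "mark INTEGER NOT NULL)")
conn.executemany("INSERT INTO Marks VALUES (?, ?)",
                 [("a", 100), ("b", 90), ("c", 90), ("d", 70)])

standings = conn.execute("""
    SELECT M1.mark,
           (SELECT COUNT(M2.mark)
              FROM Marks AS M2
             WHERE M2.mark > M1.mark) + 1 AS class_standing
      FROM Marks AS M1
     ORDER BY M1.mark DESC, M1.student""").fetchall()
print(standings)
```

Both students with 90 have exactly one person ahead of them, so both come out second, and the student with 70 has three people ahead and comes out fourth.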

As an aside for the mathematicians among the readers, I always use the heuristic that it helps solve an SQL problem to think in terms of sets. What we are looking for in these ranking queries is how to assign an ordinal number to a subset of the SalesReport table. This subset is the rows that have an equal or higher sales volume than the salesman at whom we are looking. Or in other words, one copy of the SalesReport table provides the elements of the subsets, and the other copy provides the boundary of the subsets. This count is really a sequence of nested subsets.

If you happen to have had a good set theory course, you would remember John von Neumann’s definition of the nth ordinal number; it is the set of all ordinal numbers less than the nth number.
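For readers who have not seen that construction, here is a minimal sketch of it in Python. The function name ordinal() is my own, and frozenset is used only because Python sets must be hashable to be members of other sets.

```python
def ordinal(n):
    """Return the von Neumann ordinal n as a frozenset of smaller ordinals."""
    current = frozenset()               # zero is the empty set
    for _ in range(n):
        current = current | {current}   # successor: n + 1 = n UNION {n}
    return current

three = ordinal(3)
print(len(three))                # the ordinal n has exactly n elements
print(ordinal(2) in three)       # every smaller ordinal is a member
print(ordinal(2) < three)        # and also a proper subset
```

The last line is the nested-subsets idea again: each ordinal contains all the ones before it, just as each salesman's counted subset contains the subsets of everyone ranked above him.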

23.6.5 Quintiles and Related Statistics

Once you have the ranking, it is fairly easy to classify the data set into percentiles, quintiles, or deciles. These are coarser versions of a ranking that use subsets of roughly equal size. A quintile is 1/5 of the population, a decile is 1/10 of the population, and a percentile is 1/100 of the population. I will present quintiles here, since whatever we do for them can be generalized to other partitionings. This statistic is popular with schools, so I will use the SAT scores for an imaginary group of students for my example.

SELECT T1.student_id, T1.score, T1.rank,
       CASE WHEN T1.rank <= 0.2 * T2.population_size THEN 1
            WHEN T1.rank <= 0.4 * T2.population_size THEN 2
            WHEN T1.rank <= 0.6 * T2.population_size THEN 3
            WHEN T1.rank <= 0.8 * T2.population_size THEN 4
            ELSE 5 END AS quintile
  FROM (SELECT S1.student_id, S1.score,
               (SELECT COUNT(*)
                  FROM SAT_Scores AS S2
                 WHERE S2.score >= S1.score)
          FROM SAT_Scores AS S1) AS T1(student_id, score, rank)
       CROSS JOIN
       (SELECT COUNT(*) FROM SAT_Scores) AS T2(population_size);

The idea is straightforward: compute the rank for each element and then put it into a bucket whose size is determined by the population size. There are the same problems with ties that we had with rankings, as well as problems about what to do when the population is skewed.
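The bucketing can be tried on a toy population. This sketch assumes Python's sqlite3 module; because SQLite does not accept a column list after a derived table alias, the columns are named inside the subqueries instead, and the rank column is called rnk here. The ten students and their scores are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SAT_Scores "
             "(student_id INTEGER PRIMARY KEY, score INTEGER NOT NULL)")
# Ten hypothetical students with distinct scores: 1550, 1500, ..., 1100.
conn.executemany("INSERT INTO SAT_Scores VALUES (?, ?)",
                 [(i, 1600 - 50 * i) for i in range(1, 11)])

query = """
SELECT T1.student_id, T1.score, T1.rnk,
       CASE WHEN T1.rnk <= 0.2 * T2.population_size THEN 1
            WHEN T1.rnk <= 0.4 * T2.population_size THEN 2
            WHEN T1.rnk <= 0.6 * T2.population_size THEN 3
            WHEN T1.rnk <= 0.8 * T2.population_size THEN 4
            ELSE 5 END AS quintile
  FROM (SELECT S1.student_id, S1.score,
               (SELECT COUNT(*) FROM SAT_Scores AS S2
                 WHERE S2.score >= S1.score) AS rnk
          FROM SAT_Scores AS S1) AS T1
       CROSS JOIN
       (SELECT COUNT(*) AS population_size FROM SAT_Scores) AS T2
 ORDER BY T1.rnk
"""
rows = conn.execute(query).fetchall()
quintiles = [row[3] for row in rows]
print(quintiles)
```

With ten distinct scores each bucket gets two students. Tied scores would share a rank and could push a whole group across a bucket boundary, which is the tie problem mentioned above.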

A cross tabulation, or crosstab for short, is a common statistical report. It can be done in IBM’s QMF tool, using the ACROSS summary option, and in many other SQL-based reporting packages. SPSS, SAS, and other statistical packages have library procedures or language constructs for crosstabs. Many spreadsheets can load the results of SQL queries and perform a crosstab within the spreadsheet.

If you can use a reporting package on the server in a client/server system instead of the following method, do so. It will run faster and in less space than the method discussed here.

However, if you have to use the reporting package on the client side, the extra time required to transfer data will make these methods on the server side much faster.


A one-way crosstab “flattens out” a table to display it in a report format. Assume that we have a table of sales by product and the dates the sales were made. We want to print out a report of the sales of products by years for a full decade. The solution is to create a table and populate it to look like an identity matrix (all elements on the diagonal are one, all others zero) with a rightmost column of all ones to give a row total, then JOIN the Sales table to it.

CREATE TABLE Sales
(product_name CHAR(15) NOT NULL,
 product_price DECIMAL(5,2) NOT NULL,
 qty INTEGER NOT NULL,
 sales_year INTEGER NOT NULL);

CREATE TABLE Crosstabs
(year INTEGER NOT NULL,
 year1 INTEGER NOT NULL,
 year2 INTEGER NOT NULL,
 year3 INTEGER NOT NULL,
 year4 INTEGER NOT NULL,
 year5 INTEGER NOT NULL,
 row_total INTEGER NOT NULL);

The table would be populated as follows:

year   year1  year2  year3  year4  year5  row_total
===================================================
1990     1      0      0      0      0        1
1991     0      1      0      0      0        1
1992     0      0      1      0      0        1
1993     0      0      0      1      0        1
1994     0      0      0      0      1        1

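The query to produce the report table is:

SELECT S1.product_name,
       SUM(S1.qty * S1.product_price * C1.year1),
       SUM(S1.qty * S1.product_price * C1.year2),
       SUM(S1.qty * S1.product_price * C1.year3),
       SUM(S1.qty * S1.product_price * C1.year4),
       SUM(S1.qty * S1.product_price * C1.year5),
       SUM(S1.qty * S1.product_price * C1.row_total)
  FROM Sales AS S1, Crosstabs AS C1
 WHERE S1.sales_year = C1.year
 GROUP BY S1.product_name;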

Obviously, (S1.product_price * S1.qty) is the total dollar amount of each product in each year. The year n column will be either a one or a zero. If it is a zero, the total dollar amount in the SUM() is zero; if it is a one, the total dollar amount in the SUM() is unchanged.
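To watch the zeros and ones do their work, the technique can be tried with just two years. This is a sketch under assumptions: Python's sqlite3 module is the engine, the Crosstabs table is cut down to year1 and year2, the output column names are my own, and the products and prices are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales
(product_name CHAR(15) NOT NULL,
 product_price DECIMAL(5,2) NOT NULL,
 qty INTEGER NOT NULL,
 sales_year INTEGER NOT NULL);

-- Two-year version of the identity matrix plus the column of ones.
CREATE TABLE Crosstabs
(year INTEGER NOT NULL,
 year1 INTEGER NOT NULL,
 year2 INTEGER NOT NULL,
 row_total INTEGER NOT NULL);
INSERT INTO Crosstabs VALUES (1990, 1, 0, 1), (1991, 0, 1, 1);
""")
# Hypothetical sales rows (assumption).
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)",
                 [("widget", 2.50, 10, 1990),
                  ("widget", 2.50, 5, 1991),
                  ("gadget", 10.00, 1, 1991)])

report = conn.execute("""
    SELECT S1.product_name,
           SUM(S1.qty * S1.product_price * C1.year1) AS total_1990,
           SUM(S1.qty * S1.product_price * C1.year2) AS total_1991,
           SUM(S1.qty * S1.product_price * C1.row_total) AS row_total
      FROM Sales AS S1, Crosstabs AS C1
     WHERE S1.sales_year = C1.year
     GROUP BY S1.product_name
     ORDER BY S1.product_name""").fetchall()
for line in report:
    print(line)
```

Each sale lands in exactly one year column because only one flag in its matching Crosstabs row is a one, while the row_total column picks up every sale.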

This solution lets you adjust the time frame being shown in the report by replacing the values in the year column with whatever consecutive years you wish.

A two-way crosstab takes two variables and produces a spreadsheet with all values of one variable on the rows and all values of the other represented by the columns. Each cell in the table holds the COUNT of entities that have those values for the two variables. NULLs will not fit into a crosstab very well, unless you decide to make them a group of their own or to remove them.

Another trick is to use the POSITION() function to convert a string into a one or a zero. For example, assume we have a “day of the week” function that returns a three-letter abbreviation and we want to report the sales of items by day of the week in a horizontal list.

CREATE TABLE Weekdays
(day_name CHAR(3) NOT NULL PRIMARY KEY,
 mon INTEGER NOT NULL,
 tue INTEGER NOT NULL,
 wed INTEGER NOT NULL,
 thu INTEGER NOT NULL,
 fri INTEGER NOT NULL,
 sat INTEGER NOT NULL,
 sun INTEGER NOT NULL);

INSERT INTO Weekdays
VALUES ('MON', 1, 0, 0, 0, 0, 0, 0),
       ('TUE', 0, 1, 0, 0, 0, 0, 0),
       ('WED', 0, 0, 1, 0, 0, 0, 0),
       ('THU', 0, 0, 0, 1, 0, 0, 0),
       ('FRI', 0, 0, 0, 0, 1, 0, 0),
       ('SAT', 0, 0, 0, 0, 0, 1, 0),
       ('SUN', 0, 0, 0, 0, 0, 0, 1);


SELECT item,
       SUM(amt * qty
           * mon * POSITION('MON' IN DOW(sales_date))) AS mon_tot,
       SUM(amt * qty
           * tue * POSITION('TUE' IN DOW(sales_date))) AS tue_tot,
       SUM(amt * qty
           * wed * POSITION('WED' IN DOW(sales_date))) AS wed_tot,
       SUM(amt * qty
           * thu * POSITION('THU' IN DOW(sales_date))) AS thu_tot,
       SUM(amt * qty
           * fri * POSITION('FRI' IN DOW(sales_date))) AS fri_tot,
       SUM(amt * qty
           * sat * POSITION('SAT' IN DOW(sales_date))) AS sat_tot,
       SUM(amt * qty
           * sun * POSITION('SUN' IN DOW(sales_date))) AS sun_tot
  FROM Weekdays, Sales
 GROUP BY item;
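Here is a rough way to try the trick. SQLite has neither POSITION() nor a DOW() function, so this sketch swaps in SQLite's instr() for POSITION() and assumes the three-letter day is already stored in a hypothetical dow column instead of being computed from sales_date. The items and amounts are invented.

```python
import sqlite3

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]
conn = sqlite3.connect(":memory:")

# Build the Weekdays identity matrix described in the text.
cols = ", ".join(f"{d.lower()} INTEGER NOT NULL" for d in DAYS)
conn.execute("CREATE TABLE Weekdays "
             f"(day_name CHAR(3) NOT NULL PRIMARY KEY, {cols})")
for i, day in enumerate(DAYS):
    flags = [1 if j == i else 0 for j in range(7)]
    conn.execute("INSERT INTO Weekdays VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 [day] + flags)

# Hypothetical Sales layout; dow stands in for DOW(sales_date).
conn.execute("CREATE TABLE Sales "
             "(item TEXT, amt REAL, qty INTEGER, dow CHAR(3))")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)",
                 [("pen", 2.0, 3, "MON"), ("pen", 2.0, 1, "TUE"),
                  ("ink", 5.0, 2, "MON")])

# instr(dow, 'MON') is 1 when dow is 'MON' and 0 when it is another day.
sums = ", ".join(
    f"SUM(amt * qty * {d.lower()} * instr(dow, '{d}')) AS {d.lower()}_tot"
    for d in DAYS)
rows = conn.execute(
    f"SELECT item, {sums} FROM Weekdays, Sales "
    "GROUP BY item ORDER BY item").fetchall()
for row in rows:
    print(row)
```

A sale contributes to a day's total only on the one Weekdays row whose flag is a one and whose abbreviation is found in the sale's day string; every other product in the cross join is zero.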

There are also totals for each column and each row, as well as a grand total. Crosstabs of (n) variables are defined by building an n-dimensional spreadsheet. But you cannot easily print (n) dimensions on two-dimensional paper. The usual trick is to display the results as a two-dimensional grid with one or both axes as a tree structure. The way the values are nested on the axis is usually under program control; thus, “race within sex” shows sex broken down by race, whereas “sex within race” shows race broken down by sex.

Assume that we have a table, Personnel (emp_nbr, sex, race, job_nbr, salary_amt), keyed on employee number, with no NULLs in any columns. We wish to write a crosstab of employees by sex and race, which would look like this:

         asian  black  caucasian  latino  Other  TOTALS
=======================================================
Male        3      2         12       5      5      27
Female      1     10         20       2      9      42
TOTAL       4     12         32       7     14      69

The first thought is to use a GROUP BY and write a simple query, thus:

SELECT sex, race, COUNT(*)
  FROM Personnel
 GROUP BY sex, race;
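It helps to see the shape of what this query returns. The sketch below assumes Python's sqlite3 module and a handful of invented Personnel rows; the step that folds the result into a nested dictionary is plain Python added only to compare the output with the grid above, and is not part of the SQL.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Personnel
                (emp_nbr INTEGER PRIMARY KEY, sex TEXT NOT NULL,
                 race TEXT NOT NULL, job_nbr INTEGER NOT NULL,
                 salary_amt REAL NOT NULL)""")
# Hypothetical employees; job number and salary are placeholders.
people = [(1, "Male", "asian"), (2, "Male", "black"),
          (3, "Female", "asian"), (4, "Female", "asian"),
          (5, "Female", "latino")]
conn.executemany("INSERT INTO Personnel VALUES (?, ?, ?, 1, 1000.0)",
                 people)

counts = conn.execute("""SELECT sex, race, COUNT(*)
                           FROM Personnel
                          GROUP BY sex, race
                          ORDER BY sex, race""").fetchall()
print(counts)   # one row per (sex, race) pair that actually occurs

# Fold the rows into a sex-by-race grid for comparison with the report.
grid = defaultdict(dict)
for sex, race, n in counts:
    grid[sex][race] = n
print(dict(grid))
```

The query gives one row per combination rather than one row per sex, and a combination with no employees simply does not appear, so there are no zero cells and no totals.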
