Assigning Ranks
Ranking, which allocates the numbers 1, 2, 3, … to sorted values, is related to top-n queries and shares the problem of interpreting ties. The following queries calculate ranks for sales values in the table empsales from the preceding section.
Listings 15.30a to 15.30e rank employees by sales. The first two queries show the most commonly accepted ways to rank values. The other queries show variations on them. Figure 15.30 shows the result of each ranking method, a to e, combined for brevity and ease of comparison. These queries rank highest to lowest; to reverse the order, change > (or >=) to < (or <=) in the WHERE comparisons.
Listing 15.30a Rank employees by sales (method a).
See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales >= e1.sales)
      AS ranking
  FROM empsales e1;
Listing 15.30b Rank employees by sales (method b).
See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales) + 1
      AS ranking
  FROM empsales e1;
Listing 15.30c Rank employees by sales (method c).
See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales)
      AS ranking
  FROM empsales e1;
Listing 15.30d Rank employees by sales (method d).
See Figure 15.30 for the result.
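Column d of Figure 15.30 is consistent with counting the distinct sales values that are greater than or equal to each employee's sales. The query below is a sketch inferred from that figure and from the other listings in this set, not a verbatim copy of the printed listing:

```sql
-- Method d (inferred): ties share a rank and no rank numbers are skipped.
SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(DISTINCT sales)
       FROM empsales e2
       WHERE e2.sales >= e1.sales)
      AS ranking
  FROM empsales e1;
```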
These queries use correlated subqueries and so run slowly. If you’re ranking a large number of items, you should use a built-in rank function or OLAP component, if available. The SQL:2003 standard introduced the functions RANK() and DENSE_RANK(), which Microsoft SQL Server 2005 and later, Oracle, and DB2 support. For Microsoft SQL Server 2000, look at the Analysis Services (OLAP) function RANK(). Alternatively, you can use your DBMS’s SQL extensions to calculate ranks efficiently. The following MySQL script, for example, is equivalent to Listing 15.30b:
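SET @rownum = 0;
SET @rank = 0;
SET @prev_val = NULL;
-- <=> is MySQL's null-safe equality, so the first row
-- (where @prev_val is still NULL) receives rank 1.
-- The aliases avoid ROW and RANK, which are reserved words in MySQL 8.0.
SELECT
    @rownum := @rownum + 1 AS row_num,
    @rank := IF(@prev_val <=> sales,
      @rank, @rownum) AS ranking,
    @prev_val := sales AS sales
  FROM empsales
  ORDER BY sales DESC;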
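Where the SQL:2003 window functions are available, the correlated subqueries can be replaced by a single pass over the table. A minimal sketch, assuming the same empsales table; the column aliases rank_b and rank_d are illustrative names chosen to match the methods they should reproduce in Figure 15.30:

```sql
-- RANK() leaves gaps after ties (method b);
-- DENSE_RANK() does not (method d).
SELECT emp_id, sales,
    RANK()       OVER (ORDER BY sales DESC) AS rank_b,
    DENSE_RANK() OVER (ORDER BY sales DESC) AS rank_d
  FROM empsales
  ORDER BY sales DESC;
```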
✔ Tips
■ You can add the clause ORDER BY ranking ASC to a query’s outer SELECT to sort the results by rank.
■ A DBMS that doesn’t support COUNT(DISTINCT) won’t run Listings 15.30d and 15.30e. For a workaround, see “Aggregating Distinct Values with DISTINCT” in Chapter 6.
Listing 15.30e Rank employees by sales (method e).
See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(DISTINCT sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales)
      AS ranking
  FROM empsales e1;
emp_id sales  a  b  c  d  e
------ ----- -- -- -- -- --
E09      900  1  1  0  1  0
E02      800  2  2  1  2  1
E10      700  4  3  2  3  2
E05      700  4  3  2  3  2
E01      600  5  5  4  4  3
E04      500  8  6  5  5  4
E03      500  8  6  5  5  4
E06      500  8  6  5  5  4
E08      400  9  9  8  6  5
E07      300 10 10  9  7  6
Figure 15.30 Compilation of results of Listings 15.30a to 15.30e.
Calculating a Trimmed Mean
The trimmed mean is a robust order statistic that is the mean (average) of the data if the k smallest values and k largest values are discarded. The idea is to avoid influence of extreme observations.
Listing 15.31 calculates the trimmed mean of book sales in the sample database by omitting the top three and bottom three sales figures. See Figure 15.31 for the result. For reference, the 12 sorted sales values are 566, 4095, 5000, 9566, 10467, 11320, 13001, 25667, 94123, 100001, 201440, and 1500200. This query discards 566, 4095, 5000, 100001, 201440, and 1500200 and calculates the mean in the usual way by using the remaining six middle values. Nulls are ignored. Duplicate values are either all removed or all retained. (If all sales are the same, none of them will be trimmed no matter what k is, for example.)
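The arithmetic can be checked by hand with the six retained values. A small sketch; the literal 6.0 forces a non-integer division, and some DBMSs (Oracle, for example) need FROM DUAL appended to a SELECT that has no table:

```sql
-- Sum of the six middle values is 164144; dividing by 6.0
-- gives about 27357.33, matching Figure 15.31.
SELECT (9566 + 10467 + 11320 + 13001 + 25667 + 94123) / 6.0
    AS TrimmedMean;
```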
Listing 15.32 is similar to Listing 15.31 but trims a fixed percentage of the extreme values rather than a fixed number. Trimming by 0.25 (25%), for example, discards the sales in the top and bottom quartiles and averages what’s left. See Figure 15.32 for the result.
✔ Tip
■ DB2 returns an integer for the trimmed mean because the column sales is defined as an INTEGER. To get a floating-point value, change AVG(sales) so that it averages a floating-point expression, for example AVG(CAST(sales AS FLOAT)).
Listing 15.31 Calculate the trimmed mean for k = 3.
See Figure 15.31 for the result.
SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE
    (SELECT COUNT(*)
       FROM titles t2
       WHERE t2.sales <= t1.sales) > 3
    AND
    (SELECT COUNT(*)
       FROM titles t3
       WHERE t3.sales >= t1.sales) > 3;
TrimmedMean
-----------
 27357.3333
Figure 15.31 Result of Listing 15.31.
Listing 15.32 Calculate the trimmed mean by discarding the lower and upper 25% of values. See Figure 15.32 for the result.
SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE
    (SELECT COUNT(*)
       FROM titles t2
       WHERE t2.sales <= t1.sales) >=
      (SELECT 0.25 * COUNT(*) FROM titles)
    AND
    (SELECT COUNT(*)
       FROM titles t3
       WHERE t3.sales >= t1.sales) >=
      (SELECT 0.25 * COUNT(*) FROM titles);
Picking Random Rows
Some databases are so large, and queries on them so complex, that often it’s impractical (and unnecessary) to retrieve all the data relevant to a query. If you’re interested in finding an overall trend or pattern, for example, an approximate answer within some margin of error usually will do. One way to speed such queries is to select a random sample of rows. An efficient sample can improve performance by orders of magnitude yet still yield accurate results.
Standard SQL’s TABLESAMPLE clause returns a random subset of rows. DB2 and SQL Server 2005 and later support TABLESAMPLE, and Oracle has something similar. For the other DBMSs, use a (nonstandard) function that returns a uniform random number between 0 and 1 (Table 15.1).
Listing 15.33a randomly picks about 25% (0.25) of the rows from the sample-database table titles. If necessary, change RAND() to the function that appears in Table 15.1 for your DBMS. For Oracle, use Listing 15.33b. For SQL Server 2005 and later and DB2, use Listing 15.33c.
Table 15.1
Randomization Features
DBMS                  Clause or Function
--------------------  ------------------------------------
Access                RND() function
SQL Server 2000       RAND() function
SQL Server 2005/2008  TABLESAMPLE clause
Oracle                SAMPLE clause or DBMS_RANDOM package
MySQL                 RAND() function
PostgreSQL            RANDOM() function
Listing 15.33a Select about 25 percent of the rows in the table titles at random. See Figure 15.33 for a possible result.
SELECT title_id, type, sales
  FROM titles
  WHERE RAND() < 0.25;
Listing 15.33b Select about 25 percent of the rows in the table titles at random (Oracle only). See Figure 15.33 for a possible result.
SELECT title_id, type, sales
  FROM titles
  SAMPLE (25);
Listing 15.33c Select about 25 percent of the rows in the table titles at random (SQL Server 2005 and later and DB2 only). See Figure 15.33 for a possible result.
SELECT title_id, type, sales
  FROM titles
  TABLESAMPLE SYSTEM (25);
Figure 15.33 shows one possible result of a random selection. The rows and the number of rows returned will change each time you run the query. If you need an exact number of random rows, increase the sampling percentage and use one of the techniques described in “Limiting the Number of Rows Returned” earlier in this chapter.
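As a rough illustration of that advice, the sketch below oversamples and then caps the row count. It assumes MySQL, where RAND() and LIMIT are available; the 50 percent sampling rate and the limit of 3 are illustrative values, and the query can still return fewer than 3 rows if the random sample happens to be small:

```sql
-- MySQL syntax (assumed). Sample about half the rows,
-- then keep at most 3 of the sampled rows.
SELECT title_id, type, sales
  FROM titles
  WHERE RAND() < 0.50
  LIMIT 3;
```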
✔ Tips
■ Randomizers take an optional seed argument or setting that sets the starting value for a random-number sequence. Identical seeds yield identical sequences (handy for testing). By default, the DBMS sets the seed based on the system time to generate different sequences every time.
■ Listing 15.33a won’t run correctly on Microsoft Access or Microsoft SQL Server 2000 because the random-number function returns the same “random” number for each selected row. In Access, use Visual Basic or C# to pick random rows. For SQL Server 2000, search for the article “Returning Rows in Random Order” at www.sqlteam.com.
To use the NEWID() function to pick n random rows in Microsoft SQL Server:
SELECT TOP n title_id, type, sales
  FROM titles
  ORDER BY NEWID();
To use the VALUE() function in the DBMS_RANDOM package to pick n random rows in Oracle:
SELECT * FROM
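One common Oracle idiom for picking n random rows sorts the rows by DBMS_RANDOM.VALUE in a subquery and then filters on ROWNUM in the outer query. The sketch below follows that idiom and is not necessarily the original listing; the literal 3 stands in for n:

```sql
-- Oracle only. Sort the rows in random order, then keep the first 3.
SELECT *
  FROM (SELECT title_id, type, sales
          FROM titles
          ORDER BY DBMS_RANDOM.VALUE)
  WHERE ROWNUM <= 3;
```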
title_id type       sales
-------- ---------- -----
T03      computer   25667
T04      psychology 13001
T11      psychology 94123
Figure 15.33 One possible result of Listing 15.33a/b/c.
Selecting Every nth Row
Instead of picking random rows, you can pick every nth row by using a modulo expression:
◆ m MOD n (Microsoft Access)
◆ m % n (Microsoft SQL Server)
◆ MOD(m,n) (other DBMSs)
This expression returns the remainder of m divided by n. For example, MOD(20,6) is 2 because 20 equals (3 × 6) + 2. MOD(a,2) is 0 if a is an even number.
The condition MOD(rownumber,n) = 0 picks every nth row, where rownumber is a column of consecutive integers or row identifiers. This Oracle query picks every third row in a table, for example:
SELECT *
FROM table
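Where no column of consecutive integers exists, one can be generated on the fly. A sketch assuming a DBMS that supports both the window function ROW_NUMBER() and MOD(m,n); the alias rownumber, the derived-table name numbered, and the ordering by title_id are illustrative choices:

```sql
-- Number the rows by title_id, then keep rows 3, 6, 9, ...
SELECT title_id, type, sales
  FROM (SELECT title_id, type, sales,
               ROW_NUMBER() OVER (ORDER BY title_id) AS rownumber
          FROM titles) numbered
  WHERE MOD(rownumber, 3) = 0;
```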
Handling Duplicates
Normally you use SQL’s PRIMARY KEY or UNIQUE constraints (see Chapter 11) to prevent duplicate rows from appearing in production tables. But you need to know how to handle duplicates that appear when you accidentally import the same data twice or import data from a nonrelational environment, such as a spreadsheet or accounting package, where redundant information is rampant. This section describes how to detect, count, and remove duplicates.
Suppose that you import rows into a staging table to detect and eliminate any duplicates before inserting the data into a production table (Listing 15.34 and Figure 15.34). The column id is a unique row identifier that lets you identify and select rows that otherwise would be duplicates. If your imported rows don’t already have an identity column, you can add one yourself; see “Unique Identifiers” in Chapter 3 and “Generating Sequences” earlier in this chapter. It’s a good practice to add an identity column to even short-lived working tables, but in this case it also makes deleting duplicates easy. The imported data might include other columns too, but you’ve decided that the combination of only book title, book type, and price determines whether a row is a duplicate, regardless of the values in any other columns. Before you identify or delete duplicates, you must define exactly what it means for two rows to be considered “duplicates” of each other.
Listing 15.35 lists only the duplicates by counting the number of occurrences of each unique combination of title_name, type, and price. See Figure 15.35 for the result. If this query returns an empty result, the table contains no duplicates. To list only the nonduplicates, change COUNT(*) > 1 to COUNT(*) = 1.
Listing 15.34 List the imported rows See Figure 15.34
for the result.
SELECT id, title_name, type, price
FROM dups;
id title_name   type      price
-- ------------ --------- -----
 1 Book Title 5 children  15.00
 2 Book Title 3 biography  7.00
 3 Book Title 1 history   10.00
 4 Book Title 2 children  20.00
 5 Book Title 4 history   15.00
 6 Book Title 1 history   10.00
 7 Book Title 3 biography  7.00
 8 Book Title 1 history   10.00
Figure 15.34 Result of Listing 15.34.
Listing 15.35 List only duplicates See Figure 15.35 for
the result.
SELECT title_name, type, price
FROM dups
GROUP BY title_name, type, price
HAVING COUNT(*) > 1;
title_name   type      price
------------ --------- -----
Book Title 1 history   10.00
Book Title 3 biography  7.00
Figure 15.35 Result of Listing 15.35.
Listing 15.36 uses a similar technique to list each row and its duplicate count. See Figure 15.36 for the result. To list only the duplicates, change COUNT(*) >= 1 to COUNT(*) > 1.
Listing 15.37 deletes duplicate rows from dups in place. This statement uses the column id to leave exactly one occurrence (the one with the highest ID) of each duplicate. Figure 15.37 shows the table dups after running this statement. See also “Deleting Rows with DELETE” in Chapter 10.
Listing 15.36 List each row and its number of repetitions. See Figure 15.36 for the result.
SELECT title_name, type, price,
    COUNT(*) AS NumDups
  FROM dups
  GROUP BY title_name, type, price
  HAVING COUNT(*) >= 1
  ORDER BY COUNT(*) DESC;
title_name   type      price NumDups
------------ --------- ----- -------
Book Title 1 history   10.00       3
Book Title 3 biography  7.00       2
Book Title 4 history   15.00       1
Book Title 2 children  20.00       1
Book Title 5 children  15.00       1
Figure 15.36 Result of Listing 15.36.
Listing 15.37 Remove the redundant duplicates in place. See Figure 15.37 for the result.
DELETE FROM dups
  WHERE id < (
    SELECT MAX(d.id)
      FROM dups d
      WHERE dups.title_name = d.title_name
        AND dups.type = d.type
        AND dups.price = d.price);
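id title_name   type      price
-- ------------ --------- -----
 1 Book Title 5 children  15.00
 4 Book Title 2 children  20.00
 5 Book Title 4 history   15.00
 7 Book Title 3 biography  7.00
 8 Book Title 1 history   10.00
Figure 15.37 Result of Listing 15.37.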
✔ Tips
■ If you define a duplicate to span every column in a row (not just a subset of columns), you can drop the column id and use SELECT DISTINCT * FROM table to delete duplicates. See “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.
■ If your DBMS offers a built-in unique row identifier, you can drop the column id and still delete duplicates in place. In Oracle, for example, you can replace id with the ROWID pseudocolumn in Listing 15.37; change the outer WHERE clause to:
WHERE ROWID < (SELECT MAX(d.ROWID)
■ To run Listing 15.36 in MySQL, change ORDER BY COUNT(*) DESC to ORDER BY NumDups DESC. You can’t use Listing 15.37 to do an in-place deletion because MySQL won’t let you use the same table for both the subquery’s FROM clause and the DELETE target.
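One workaround that is often used in MySQL wraps the subquery in a derived table, so that the DELETE target is not read directly by the subquery. A sketch, not taken from the listings above; the names keepers and keep_id are illustrative:

```sql
-- Keep the highest id in each (title_name, type, price) group
-- and delete every other row.
DELETE FROM dups
  WHERE id NOT IN
    (SELECT keep_id
       FROM (SELECT MAX(id) AS keep_id
               FROM dups
               GROUP BY title_name, type, price) AS keepers);
```

Like Listing 15.37, this keeps the row with the highest id in each group of duplicates.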
Messy Data
Deleting duplicates gets harder as data get messier. It’s not unusual to buy a mailing list with entries that look like this:
name         address1
-----------  ------------------
John Smith   123 Main St
John Smith   123 Main St, Apt 1
Jack Smiht   121 Main Rd
John Symthe  123 Main St.
Jon Smith    123 Mian Street
DBMSs offer nonstandard tools such as Soundex (phonetic) functions to suppress spelling variations, but creating an automated deletion program that works over thousands or millions of rows is a major project.
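As a rough illustration of the phonetic approach, the sketch below groups names by their Soundex code. It assumes a hypothetical table mailing_list with a column name and a DBMS that provides a SOUNDEX() function; the groups it reports are only candidates for manual review, not confirmed duplicates:

```sql
-- mailing_list and name are hypothetical; SOUNDEX() is nonstandard.
SELECT SOUNDEX(name) AS name_code,
    COUNT(*) AS num_similar
  FROM mailing_list
  GROUP BY SOUNDEX(name)
  HAVING COUNT(*) > 1;
```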
Creating a Telephone List
You can use the function COALESCE() with a left outer join to create a convenient telephone listing from a normalized table of phone numbers. Suppose that the sample database has an extra table named telephones that stores the authors’ work and home telephone numbers:
au_id tel_type tel_no
----- -------- ------------
A01   H        111-111-1111
A01   W        222-222-2222
A02   W        333-333-3333
A04   H        444-444-4444
A04   W        555-555-5555
A05   H        666-666-6666
The table’s composite primary key is (au_id, tel_type), where tel_type indicates whether tel_no is a work (W) or home (H) number. Listing 15.38 lists the authors’ names and numbers. If an author has only one number, that number is listed. If an author has both home and work numbers, only the work number is listed. Authors with no numbers aren’t listed. See Figure 15.38 for the result.
The first left join picks out the work numbers, and the second picks out the home numbers. The WHERE clause filters out authors with no numbers. (You can extend this query to add cell-phone and other numbers.)
■ For more information about COALESCE(),
see “Checking for Nulls with COALESCE()”
Listing 15.38 List the authors’ names and telephone numbers, favoring work numbers over home numbers. See Figure 15.38 for the result.
SELECT a.au_id AS "ID",
    a.au_fname AS "FirstName",
    a.au_lname AS "LastName",
    COALESCE(twork.tel_no, thome.tel_no)
      AS "TelNo",
    COALESCE(twork.tel_type, thome.tel_type)
      AS "TelType"
  FROM authors a
  LEFT OUTER JOIN telephones twork
    ON a.au_id = twork.au_id
    AND twork.tel_type = 'W'
  LEFT OUTER JOIN telephones thome
    ON a.au_id = thome.au_id
    AND thome.tel_type = 'H'
  WHERE COALESCE(twork.tel_no, thome.tel_no)
    IS NOT NULL
  ORDER BY a.au_fname ASC, a.au_lname ASC;
ID  FirstName LastName  TelNo        TelType
--- --------- --------- ------------ -------
A05 Christian Kells     666-666-6666 H
A04 Klee      Hull      555-555-5555 W
A01 Sarah     Buchman   222-222-2222 W
A02 Wendy     Heydemark 333-333-3333 W
Figure 15.38 Result of Listing 15.38.
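Extending the query to a third kind of number follows the same pattern: add another left outer join and another argument to each COALESCE(). A sketch assuming a hypothetical tel_type value 'C' for cell-phone numbers, used only when an author has neither a work nor a home number:

```sql
SELECT a.au_id AS "ID",
    a.au_fname AS "FirstName",
    a.au_lname AS "LastName",
    COALESCE(twork.tel_no, thome.tel_no, tcell.tel_no)
      AS "TelNo",
    COALESCE(twork.tel_type, thome.tel_type, tcell.tel_type)
      AS "TelType"
  FROM authors a
  LEFT OUTER JOIN telephones twork
    ON a.au_id = twork.au_id
    AND twork.tel_type = 'W'
  LEFT OUTER JOIN telephones thome
    ON a.au_id = thome.au_id
    AND thome.tel_type = 'H'
  LEFT OUTER JOIN telephones tcell   -- 'C' is an assumed code
    ON a.au_id = tcell.au_id
    AND tcell.tel_type = 'C'
  WHERE COALESCE(twork.tel_no, thome.tel_no, tcell.tel_no)
    IS NOT NULL
  ORDER BY a.au_fname ASC, a.au_lname ASC;
```

The order of the COALESCE() arguments sets the preference among the number types.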
Retrieving Metadata
Metadata are data about data. In DBMSs, metadata include information about schemas, databases, users, tables, columns, and so on. You already saw metadata in “Getting User Information” in Chapter 5 and “Displaying Table Definitions” in Chapter 10. The first thing to do when meeting a new database is to inspect its metadata: What’s in the database? How big is it? How are the tables organized?
Metadata, like other data, are stored in tables and so can be accessed via SELECT queries. Metadata also can be accessed, often more conveniently, by using command-line and graphical tools. The following listings show DBMS-specific examples for viewing metadata. The DBMS itself maintains metadata: look, but don’t touch.
■ The SQL standard calls a set of metadata a catalog and specifies that it be accessed through the schema INFORMATION_SCHEMA. Not all DBMSs implement this schema or use the same terms. In Microsoft SQL Server, for example, the equivalent term for a catalog is a database and for a schema, an owner. In Oracle, the repository of metadata is the data dictionary.