SQL VISUAL QUICKSTART GUIDE- P46 pps


Assigning Ranks

Ranking, which allocates the numbers 1, 2, 3, … to sorted values, is related to top-n queries and shares the problem of interpreting ties. The following queries calculate ranks for sales values in the table empsales from the preceding section.

Listings 15.30a to 15.30e rank employees by sales. The first two queries show the most commonly accepted ways to rank values. The other queries show variations on them. Figure 15.30 shows the result of each ranking method, a to e, combined for brevity and ease of comparison. These queries rank highest to lowest; to reverse the order, change > (or >=) to < (or <=) in the WHERE comparisons.

Listing 15.30a Rank employees by sales (method a). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales, (SELECT COUNT(sales) FROM empsales e2

AS ranking FROM empsales e1;

Listing 15.30b Rank employees by sales (method b).

Listing 15.30c Rank employees by sales (method c).

Listing 15.30d Rank employees by sales (method d).

These queries use correlated subqueries and so run slowly. If you’re ranking a large number of items, you should use a built-in rank function or OLAP component, if available. The SQL:2003 standard introduced the functions RANK() and DENSE_RANK(), which Microsoft SQL Server 2005 and later, Oracle, and DB2 support. For Microsoft SQL Server 2000, look at the Analysis Services (OLAP) function RANK(). Alternatively, you can use your DBMS’s SQL extensions to calculate ranks efficiently. The following MySQL script, for example, is equivalent to Listing 15.30b:

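SET @rownum = 0;
SET @rank = 0;
SET @prev_val = NULL;

-- <=> is MySQL's null-safe equality, so the first row (where @prev_val
-- is still NULL) gets rank 1 instead of keeping the initial 0.
-- The aliases avoid row and rank, which are reserved words in MySQL 8.0.
SELECT
  @rownum := @rownum + 1 AS row_num,
  @rank := IF(@prev_val <=> sales, @rank, @rownum) AS ranking,
  @prev_val := sales AS sales
FROM empsales
ORDER BY sales DESC;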
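As a rough illustration of the built-in rank functions mentioned above, the following sketch runs RANK() and DENSE_RANK() over the empsales values shown in Figure 15.30. Python's standard sqlite3 module is used here only as a convenient way to make the SQL runnable; SQLite itself, the column types, and the helper name window_ranks are assumptions for illustration, and the window functions need SQLite 3.25 or later.

```python
import sqlite3

# Sales figures as they appear in Figure 15.30 (emp_id, sales).
EMPSALES = [
    ("E09", 900), ("E02", 800), ("E10", 700), ("E05", 700), ("E01", 600),
    ("E04", 500), ("E03", 500), ("E06", 500), ("E08", 400), ("E07", 300),
]


def window_ranks():
    """Return (emp_id, sales, rank, dense_rank) rows, highest sales first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table definition; the source does not show the column types.
    conn.execute("CREATE TABLE empsales (emp_id TEXT PRIMARY KEY, sales INTEGER)")
    conn.executemany("INSERT INTO empsales VALUES (?, ?)", EMPSALES)
    rows = conn.execute(
        """
        SELECT emp_id, sales,
               RANK()       OVER (ORDER BY sales DESC) AS rnk,
               DENSE_RANK() OVER (ORDER BY sales DESC) AS dense_rnk
          FROM empsales
          ORDER BY sales DESC, emp_id
        """
    ).fetchall()
    conn.close()
    return rows
```

In this sketch, RANK() leaves gaps after ties and so reproduces column b of Figure 15.30, whereas DENSE_RANK() leaves no gaps and reproduces column d.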

✔ Tips

■ You can add the clause ORDER BY ranking ASC to a query’s outer SELECT to sort the results by rank.

■ … support COUNT(DISTINCT) and won’t run Listings 15.30d and 15.30e. For a workaround, see “Aggregating Distinct Values with DISTINCT” in Chapter 6.

Listing 15.30e Rank employees by sales (method e).

SELECT e1.emp_id, e1.sales,

FROM empsales e2

AS ranking

FROM empsales e1;

emp_id sales  a  b  c  d  e
------ ----- -- -- -- -- --
E09      900  1  1  0  1  0
E02      800  2  2  1  2  1
E10      700  4  3  2  3  2
E05      700  4  3  2  3  2
E01      600  5  5  4  4  3
E04      500  8  6  5  5  4
E03      500  8  6  5  5  4
E06      500  8  6  5  5  4
E08      400  9  9  8  6  5
E07      300 10 10  9  7  6

Figure 15.30 Compilation of results of Listings 15.30a to 15.30e.
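The five columns of Figure 15.30 can be reproduced with one correlated subquery per method. The sketch below is a reconstruction, not the book's listings: the aggregate, comparison operator, and offset for each method are inferred from the figure's columns, and the harness (Python's sqlite3 module, the table definition, and the names METHODS and rank_by) is hypothetical.

```python
import sqlite3

# Sales figures as they appear in Figure 15.30 (emp_id, sales).
EMPSALES = [
    ("E09", 900), ("E02", 800), ("E10", 700), ("E05", 700), ("E01", 600),
    ("E04", 500), ("E03", 500), ("E06", 500), ("E08", 400), ("E07", 300),
]

# (aggregate, comparison, offset) per method. These are assumptions inferred
# from the result columns in Figure 15.30, not copied from the listings.
METHODS = {
    "a": ("COUNT(sales)", ">=", 0),
    "b": ("COUNT(sales)", ">", 1),
    "c": ("COUNT(sales)", ">", 0),
    "d": ("COUNT(DISTINCT sales)", ">=", 0),
    "e": ("COUNT(DISTINCT sales)", ">", 0),
}


def rank_by(method):
    """Return the rankings for one method, ordered from highest sales down."""
    agg, op, offset = METHODS[method]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE empsales (emp_id TEXT PRIMARY KEY, sales INTEGER)")
    conn.executemany("INSERT INTO empsales VALUES (?, ?)", EMPSALES)
    # Only the fixed fragments from METHODS are interpolated into the SQL text.
    sql = f"""
        SELECT e1.emp_id, e1.sales,
               (SELECT {agg}
                  FROM empsales e2
                  WHERE e2.sales {op} e1.sales) + {offset} AS ranking
          FROM empsales e1
          ORDER BY e1.sales DESC
    """
    rows = conn.execute(sql).fetchall()
    conn.close()
    return [ranking for _emp_id, _sales, ranking in rows]
```

Calling rank_by("b") gives the ranking that leaves gaps after ties, and rank_by("d") gives the dense ranking; the other three methods differ only in whether ties take the higher number and whether counting starts at 0.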

Calculating a Trimmed Mean

The trimmed mean is a robust order statistic that is the mean (average) of the data if the k smallest values and k largest values are discarded. The idea is to avoid the influence of extreme observations.

Listing 15.31 calculates the trimmed mean of book sales in the sample database by omitting the top three and bottom three sales figures. See Figure 15.31 for the result. For reference, the 12 sorted sales values are 566, 4095, 5000, 9566, 10467, 11320, 13001, 25667, 94123, 100001, 201440, and 1500200. This query discards 566, 4095, 5000, 100001, 201440, and 1500200 and calculates the mean in the usual way by using the remaining six middle values. Nulls are ignored. Duplicate values are either all removed or all retained. (If all sales are the same, none of them will be trimmed no matter what k is, for example.)

Listing 15.32 is similar to Listing 15.31 but trims a fixed percentage of the extreme values rather than a fixed number. Trimming by 0.25 (25%), for example, discards the sales in the top and bottom quartiles and averages what’s left. See Figure 15.32 for the result.

✔ Tip

… DB2 return an integer for the trimmed mean because the column sales is defined as an INTEGER. To get a floating-point value, change AVG(sales) …

Listing 15.31 Calculate the trimmed mean for k = 3.

SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE (SELECT COUNT(*)
           FROM titles t2
           WHERE t2.sales <= t1.sales) > 3
    AND (SELECT COUNT(*)
           FROM titles t3
           WHERE t3.sales >= t1.sales) > 3;

TrimmedMean
-----------
27357.3333

Figure 15.31 Result of Listing 15.31.

Listing 15.32 Calculate the trimmed mean by discarding the lower and upper 25% of values. See Figure 15.32 for the result.

SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE (SELECT COUNT(*)
           FROM titles t2
           WHERE t2.sales <= t1.sales) >=
        (SELECT 0.25 * COUNT(*) FROM titles)
    AND (SELECT COUNT(*)
           FROM titles t3
           WHERE t3.sales >= t1.sales) >=
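To make the two trimming rules concrete, here is a minimal sketch that runs both against the 12 sales values quoted in the text. Several things in it are assumptions: Python's sqlite3 module is only a test harness, the cut-down titles table and the function names are hypothetical, one NULL row is added purely to show that nulls drop out, and the second comparison of the percentage query is written by symmetry with the first.

```python
import sqlite3

# The 12 sales values quoted in the text, plus one NULL (an assumption, added
# only to show that nulls drop out of the comparisons and the average).
SALES = [566, 4095, 5000, 9566, 10467, 11320, 13001,
         25667, 94123, 100001, 201440, 1500200, None]


def _connect():
    conn = sqlite3.connect(":memory:")
    # Hypothetical cut-down version of the titles table.
    conn.execute("CREATE TABLE titles (title_id INTEGER PRIMARY KEY, sales INTEGER)")
    conn.executemany("INSERT INTO titles (sales) VALUES (?)", [(s,) for s in SALES])
    return conn


def trimmed_mean_k(k):
    """Mean after discarding the k smallest and k largest sales values."""
    conn = _connect()
    (mean,) = conn.execute(
        """
        SELECT AVG(sales)
          FROM titles t1
          WHERE (SELECT COUNT(*) FROM titles t2 WHERE t2.sales <= t1.sales) > ?
            AND (SELECT COUNT(*) FROM titles t3 WHERE t3.sales >= t1.sales) > ?
        """,
        (k, k),
    ).fetchone()
    conn.close()
    return mean


def trimmed_mean_fraction(fraction):
    """Mean after discarding roughly `fraction` of the rows from each end."""
    conn = _connect()
    (mean,) = conn.execute(
        """
        SELECT AVG(sales)
          FROM titles t1
          WHERE (SELECT COUNT(*) FROM titles t2 WHERE t2.sales <= t1.sales)
                  >= (SELECT ? * COUNT(*) FROM titles)
            AND (SELECT COUNT(*) FROM titles t3 WHERE t3.sales >= t1.sales)
                  >= (SELECT ? * COUNT(*) FROM titles)
        """,
        (fraction, fraction),
    ).fetchone()
    conn.close()
    return mean
```

With k = 3 the six middle values sum to 164144, so the mean is 164144 / 6, about 27357.33. In this sketch the table has 13 rows including the NULL, so a fraction of 0.25 sets the threshold at 3.25 and happens to trim the same six values.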

Picking Random Rows

Some databases are so large, and queries on them so complex, that often it’s impractical (and unnecessary) to retrieve all the data relevant to a query. If you’re interested in finding an overall trend or pattern, for example, an approximate answer within some margin of error usually will do. One way to speed such queries is to select a random sample of rows. An efficient sample can improve performance by orders of magnitude yet still yield accurate results.

Standard SQL’s TABLESAMPLE clause returns a random subset of rows. DB2 and SQL Server 2005 and later support TABLESAMPLE, and Oracle has something similar. For the other DBMSs, use a (nonstandard) function that returns a uniform random number between 0 and 1 (Table 15.1).

Listing 15.33a randomly picks about 25% (0.25) of the rows from the sample-database table titles. If necessary, change RAND() to the function that appears in Table 15.1 for your DBMS. For Oracle, use Listing 15.33b. For SQL Server 2005 and later and DB2, use Listing 15.33c.

Table 15.1 Randomization Features

DBMS                  Clause or Function
--------------------  ------------------------------------
Access                RND() function
SQL Server 2000       RAND() function
SQL Server 2005/2008  TABLESAMPLE clause
Oracle                SAMPLE clause or DBMS_RANDOM package
MySQL                 RAND() function
PostgreSQL            RANDOM() function

Listing 15.33a Select about 25 percent of the rows in the table titles at random. See Figure 15.33 for a possible result.

SELECT title_id, type, sales

FROM titles

WHERE RAND() < 0.25;

Listing 15.33b Select about 25 percent of the rows in the table titles at random (Oracle only). See Figure 15.33 for a possible result.

FROM titles

SAMPLE (25);

Listing 15.33c Select about 25 percent of the rows in the table titles at random (SQL Server 2005 and later and DB2 only). See Figure 15.33 for a possible result.

FROM titles

TABLESAMPLE SYSTEM (25);

Figure 15.33 shows one possible result of a random selection. The rows and the number of rows returned will change each time you run the query. If you need an exact number of random rows, increase the sampling percentage and use one of the techniques described in “Limiting the Number of Rows Returned” earlier in this chapter.
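The difference between an approximate sample and an exact-size sample can be sketched as follows. This is an illustration under stated assumptions rather than one of the book's listings: it uses Python's sqlite3 module as a harness, a hypothetical one-column titles table with made-up ids T01 to T13, and SQLite's RANDOM(), which returns a signed 64-bit integer and so has to be scaled into the range 0 to 1. The exact-size version swaps in a different technique from the one described above: it sorts by a random key and keeps the first n rows.

```python
import sqlite3

# Hypothetical title ids T01..T13; only the key column is needed here.
TITLE_IDS = [f"T{n:02d}" for n in range(1, 14)]


def _connect():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titles (title_id TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO titles VALUES (?)", [(t,) for t in TITLE_IDS])
    return conn


def sample_fraction(fraction):
    """Keep each row with probability of about `fraction`; the count varies."""
    conn = _connect()
    rows = conn.execute(
        # RANDOM() % 1000000 lies between -999999 and 999999, so the scaled
        # absolute value is an approximately uniform number in [0, 1).
        "SELECT title_id FROM titles "
        "WHERE ABS(RANDOM() % 1000000) / 1000000.0 < ?",
        (fraction,),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


def sample_exact(n):
    """Return exactly n rows (or every row, if fewer exist) in random order."""
    conn = _connect()
    rows = conn.execute(
        "SELECT title_id FROM titles ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

Sorting by a random key touches every row, so sample_exact is a reasonable choice for small tables but can be costly on the very large tables that motivate sampling in the first place.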

✔ Tips

■ Randomizers take an optional seed argument or setting that sets the starting value for a random-number sequence. Identical seeds yield identical sequences (handy for testing). By default, the DBMS sets the seed based on the system time to generate different sequences every time.

■ Listing 15.33a won’t run correctly on Microsoft Access or Microsoft SQL Server 2000 because the random-number function returns the same “random” number for each selected row. In Access, use Visual Basic or C# to pick random rows. For SQL Server 2000, search for the article “Returning Rows in Random Order” at www.sqlteam.com.

To use the NEWID() function to pick n random rows in Microsoft SQL Server:

SELECT TOP n title_id, type, sales
  FROM titles
  ORDER BY NEWID();

To use the VALUE() function in the DBMS_RANDOM package to pick n random rows in Oracle:

SELECT * FROM

title_id type       sales
-------- ---------- -----
T03      computer   25667
T04      psychology 13001
T11      psychology 94123

Figure 15.33 One possible result of Listing 15.33a/b/c.

Selecting Every nth Row

Instead of picking random rows, you can pick every nth row by using a modulo expression:

◆ m MOD n (Microsoft Access)

◆ m % n (Microsoft SQL Server)

◆ MOD(m,n) (other DBMSs)

This expression returns the remainder of m divided by n. For example, MOD(20,6) is 2 because 20 equals (3 × 6) + 2. MOD(a,2) is 0 if a is an even number.

The condition MOD(rownumber,n) = 0 picks every nth row, where rownumber is a column of consecutive integers or row identifiers. This Oracle query picks every third row in a table, for example:

SELECT *

FROM table
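The modulo condition can be tried out with a small sketch. The assumptions here are the harness (Python's sqlite3 module), a hypothetical table named items whose rownumber column holds consecutive integers starting at 1, and SQLite's % operator standing in for MOD(m,n).

```python
import sqlite3


def sql_mod(m, n):
    """Evaluate m modulo n in SQL, using SQLite's % operator."""
    conn = sqlite3.connect(":memory:")
    (remainder,) = conn.execute("SELECT ? % ?", (m, n)).fetchone()
    conn.close()
    return remainder


def every_nth(n, total=10):
    """Return the rownumber of every nth row from a table of `total` rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: rownumber is a column of consecutive integers.
    conn.execute("CREATE TABLE items (rownumber INTEGER PRIMARY KEY, label TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(i, f"row {i}") for i in range(1, total + 1)],
    )
    rows = conn.execute(
        "SELECT rownumber FROM items WHERE rownumber % ? = 0 ORDER BY rownumber",
        (n,),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

The approach depends on rownumber having no gaps; if rows have been deleted, the surviving multiples of n are no longer evenly spaced through the table.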

Handling Duplicates

Normally you use SQL’s PRIMARY KEY or UNIQUE constraints (see Chapter 11) to prevent duplicate rows from appearing in production tables. But you need to know how to handle duplicates that appear when you accidentally import the same data twice or import data from a nonrelational environment, such as a spreadsheet or accounting package, where redundant information is rampant. This section describes how to detect, count, and remove duplicates.

Suppose that you import rows into a staging table to detect and eliminate any duplicates before inserting the data into a production table (Listing 15.34 and Figure 15.34). The column id is a unique row identifier that lets you identify and select rows that otherwise would be duplicates. If your imported rows don’t already have an identity column, you can add one yourself; see “Unique Identifiers” in Chapter 3 and “Generating Sequences” earlier in this chapter. It’s a good practice to add an identity column to even short-lived working tables, but in this case it also makes deleting duplicates easy. The imported data might include other columns too, but you’ve decided that the combination of only book title, book type, and price determines whether a row is a duplicate, regardless of the values in any other columns. Before you identify or delete duplicates, you must define exactly what it means for two rows to be considered “duplicates” of each other.

Listing 15.35 lists only the duplicates by counting the number of occurrences of each unique combination of title_name, type, and price. See Figure 15.35 for the result. If this query returns an empty result, the table contains no duplicates. To list only the nonduplicates, change COUNT(*) > 1 to COUNT(*) = 1.

Listing 15.34 List the imported rows. See Figure 15.34 for the result.

SELECT id, title_name, type, price
  FROM dups;

id title_name   type      price
-- ------------ --------- -----
 1 Book Title 5 children  15.00
 2 Book Title 3 biography  7.00
 3 Book Title 1 history   10.00
 4 Book Title 2 children  20.00
 7 Book Title 3 biography  7.00

Listing 15.35 List only duplicates. See Figure 15.35 for the result.

SELECT title_name, type, price
  FROM dups
  GROUP BY title_name, type, price
  HAVING COUNT(*) > 1;

title_name   type      price
------------ --------- -----
Book Title 1 history   10.00
Book Title 3 biography  7.00

Listing 15.36 uses a similar technique to list each row and its duplicate count. See Figure 15.36 for the result. To list only the duplicates, change COUNT(*) >= 1 to COUNT(*) > 1.

Listing 15.37 deletes duplicate rows from dups in place. This statement uses the column id to leave exactly one occurrence (the one with the highest ID) of each duplicate. Figure 15.37 shows the table dups after running this statement. See also “Deleting Rows with DELETE” in Chapter 10.

Listing 15.36 List each row and its number of repetitions. See Figure 15.36 for the result.

SELECT title_name, type, price, COUNT(*) AS NumDups
  FROM dups
  GROUP BY title_name, type, price
  HAVING COUNT(*) >= 1
  ORDER BY COUNT(*) DESC;

title_name   type      price NumDups
------------ --------- ----- -------
Book Title 1 history   10.00       3
Book Title 3 biography  7.00       2
Book Title 4 history   15.00       1
Book Title 2 children  20.00       1
Book Title 5 children  15.00       1
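The grouping technique behind these two listings can be sketched with a single parameterized query. The sketch rests on some assumptions: Python's sqlite3 module is only a harness, the function name duplicate_counts is hypothetical, and the rows with ids 5, 6, and 8 are made up so that the group counts agree with Figure 15.36 (the other rows come from Figure 15.34).

```python
import sqlite3

# Rows 1-4 and 7 come from Figure 15.34. Rows 5, 6, and 8 are assumptions,
# chosen so that the group counts agree with Figure 15.36.
DUPS = [
    (1, "Book Title 5", "children", 15.00),
    (2, "Book Title 3", "biography", 7.00),
    (3, "Book Title 1", "history", 10.00),
    (4, "Book Title 2", "children", 20.00),
    (5, "Book Title 4", "history", 15.00),
    (6, "Book Title 1", "history", 10.00),
    (7, "Book Title 3", "biography", 7.00),
    (8, "Book Title 1", "history", 10.00),
]


def duplicate_counts(min_count=1):
    """Return (title_name, type, price, NumDups) for groups of min_count or more."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dups (id INTEGER PRIMARY KEY, title_name TEXT, "
        "type TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO dups VALUES (?, ?, ?, ?)", DUPS)
    rows = conn.execute(
        """
        SELECT title_name, type, price, COUNT(*) AS NumDups
          FROM dups
          GROUP BY title_name, type, price
          HAVING COUNT(*) >= ?
          ORDER BY NumDups DESC, title_name ASC
        """,
        (min_count,),
    ).fetchall()
    conn.close()
    return rows
```

Calling duplicate_counts(2) lists only the combinations that occur more than once, and duplicate_counts(1) lists every combination with its count. The extra sort on title_name is added here only to make the order of the single-occurrence rows predictable.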

Listing 15.37 Remove the redundant duplicates in place. See Figure 15.37 for the result.

DELETE FROM dups
  WHERE id < (
    SELECT MAX(d.id)
      FROM dups d
      WHERE dups.title_name = d.title_name
        AND dups.type = d.type
        AND dups.price = d.price);

id title_name type price
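To see which rows an in-place deletion of this kind keeps, the following sketch runs the same DELETE statement against a small staging table. As before, Python's sqlite3 module is only a harness, the function name delete_duplicates is hypothetical, and the rows with ids 5, 6, and 8 are assumptions chosen to be consistent with the counts in Figure 15.36; the surviving ids shown by the sketch follow from that assumed data.

```python
import sqlite3

# Rows 1-4 and 7 come from Figure 15.34. Rows 5, 6, and 8 are assumptions,
# chosen so that the group counts agree with Figure 15.36.
DUPS = [
    (1, "Book Title 5", "children", 15.00),
    (2, "Book Title 3", "biography", 7.00),
    (3, "Book Title 1", "history", 10.00),
    (4, "Book Title 2", "children", 20.00),
    (5, "Book Title 4", "history", 15.00),
    (6, "Book Title 1", "history", 10.00),
    (7, "Book Title 3", "biography", 7.00),
    (8, "Book Title 1", "history", 10.00),
]


def delete_duplicates():
    """Delete all but the highest-id row of each group.

    Returns (number_of_rows_deleted, ids_of_remaining_rows).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dups (id INTEGER PRIMARY KEY, title_name TEXT, "
        "type TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO dups VALUES (?, ?, ?, ?)", DUPS)
    cursor = conn.execute(
        """
        DELETE FROM dups
          WHERE id < (SELECT MAX(d.id)
                        FROM dups d
                        WHERE dups.title_name = d.title_name
                          AND dups.type = d.type
                          AND dups.price = d.price)
        """
    )
    deleted = cursor.rowcount
    remaining = [r[0] for r in conn.execute("SELECT id FROM dups ORDER BY id")]
    conn.close()
    return deleted, remaining
```

A row is deleted only when another row in the same group has a higher id, so rows that were never duplicated are left alone and each duplicated group keeps its most recently numbered row.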

✔ Tips

■ If you define a duplicate to span every column in a row (not just a subset of columns), you can drop the column id and use SELECT DISTINCT * FROM table to delete duplicates. See “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.

■ If your DBMS offers a built-in unique row identifier, you can drop the column id and still delete duplicates in place. In Oracle, for example, you can replace id with the ROWID pseudocolumn in Listing 15.37; change the outer WHERE clause to:

WHERE ROWID < (SELECT MAX(d.ROWID)

■ To run Listing 15.36 in MySQL, change ORDER BY COUNT(*) DESC to ORDER BY NumDups DESC. You can’t use Listing 15.37 to do an in-place deletion because MySQL won’t let you use the same table for both the subquery’s FROM clause and the DELETE target.

Messy Data

Deleting duplicates gets harder as data get messier. It’s not unusual to buy a mailing list with entries that look like this:

name         address1
-----------  ------------------
John Smith   123 Main St
John Smith   123 Main St, Apt 1
Jack Smiht   121 Main Rd
John Symthe  123 Main St.
Jon Smith    123 Mian Street

DBMSs offer nonstandard tools such as Soundex (phonetic) functions to suppress spelling variations, but creating an automated deletion program that works over thousands or millions of rows is a major project.

Creating a Telephone List

You can use the function COALESCE() with a left outer join to create a convenient telephone listing from a normalized table of phone numbers. Suppose that the sample database has an extra table named telephones that stores the authors’ work and home telephone numbers:

au_id tel_type tel_no
----- -------- ------------
A01   H        111-111-1111
A01   W        222-222-2222
A02   W        333-333-3333
A04   H        444-444-4444
A04   W        555-555-5555
A05   H        666-666-6666

The table’s composite primary key is (au_id, tel_type), where tel_type indicates whether tel_no is a work (W) or home (H) number. Listing 15.38 lists the authors’ names and numbers. If an author has only one number, that number is listed. If an author has both home and work numbers, only the work number is listed. Authors with no numbers aren’t listed. See Figure 15.38 for the result. The first left join picks out the work numbers, and the second picks out the home numbers. The WHERE clause filters out authors with no numbers. (You can extend this query to add cell-phone and other numbers.)

■ For more information about COALESCE(), see “Checking for Nulls with COALESCE()”

Listing 15.38 Lists the authors’ names and telephone numbers, favoring work numbers over home numbers. See Figure 15.38 for the result.

SELECT a.au_id AS "ID",
       a.au_fname AS "FirstName",
       a.au_lname AS "LastName",
       COALESCE(twork.tel_no, thome.tel_no) AS "TelNo",
       COALESCE(twork.tel_type, thome.tel_type) AS "TelType"
  FROM authors a
  LEFT OUTER JOIN telephones twork
    ON a.au_id = twork.au_id AND twork.tel_type = 'W'
  LEFT OUTER JOIN telephones thome
    ON a.au_id = thome.au_id AND thome.tel_type = 'H'
  WHERE COALESCE(twork.tel_no, thome.tel_no) IS NOT NULL
  ORDER BY a.au_fname ASC, a.au_lname ASC;

ID  FirstName LastName  TelNo        TelType
--- --------- --------- ------------ -------
A05 Christian Kells     666-666-6666 H
A04 Klee      Hull      555-555-5555 W
A01 Sarah     Buchman   222-222-2222 W
A02 Wendy     Heydemark 333-333-3333 W
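The behavior of the two left joins and COALESCE() can be checked end to end with a short sketch. It uses Python's sqlite3 module as a harness, which is an assumption, as are the cut-down authors table and the function name telephone_list. The telephone rows are those shown in the text and the four named authors come from the figure; author A03 is a made-up row with no telephone numbers, included only to show that such authors are filtered out.

```python
import sqlite3

# The four authors named in the figure, plus A03, an assumed author with no
# telephone numbers who should not appear in the result.
AUTHORS = [
    ("A01", "Sarah", "Buchman"),
    ("A02", "Wendy", "Heydemark"),
    ("A03", "Hallie", "Hull"),
    ("A04", "Klee", "Hull"),
    ("A05", "Christian", "Kells"),
]

# The telephones table as shown in the text.
TELEPHONES = [
    ("A01", "H", "111-111-1111"),
    ("A01", "W", "222-222-2222"),
    ("A02", "W", "333-333-3333"),
    ("A04", "H", "444-444-4444"),
    ("A04", "W", "555-555-5555"),
    ("A05", "H", "666-666-6666"),
]


def telephone_list():
    """Return one row per author with a number, preferring work over home."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE authors (au_id TEXT PRIMARY KEY, au_fname TEXT, au_lname TEXT)"
    )
    conn.execute(
        "CREATE TABLE telephones (au_id TEXT, tel_type TEXT, tel_no TEXT, "
        "PRIMARY KEY (au_id, tel_type))"
    )
    conn.executemany("INSERT INTO authors VALUES (?, ?, ?)", AUTHORS)
    conn.executemany("INSERT INTO telephones VALUES (?, ?, ?)", TELEPHONES)
    rows = conn.execute(
        """
        SELECT a.au_id, a.au_fname, a.au_lname,
               COALESCE(twork.tel_no, thome.tel_no)     AS tel_no,
               COALESCE(twork.tel_type, thome.tel_type) AS tel_type
          FROM authors a
          LEFT OUTER JOIN telephones twork
            ON a.au_id = twork.au_id AND twork.tel_type = 'W'
          LEFT OUTER JOIN telephones thome
            ON a.au_id = thome.au_id AND thome.tel_type = 'H'
          WHERE COALESCE(twork.tel_no, thome.tel_no) IS NOT NULL
          ORDER BY a.au_fname ASC, a.au_lname ASC
        """
    ).fetchall()
    conn.close()
    return rows
```

Because the work join comes first in each COALESCE() call, an author with both numbers (A01 or A04) is listed with the work number, and the home number is used only when the work join finds nothing.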

Retrieving Metadata

Metadata are data about data. In DBMSs, metadata include information about schemas, databases, users, tables, columns, and so on. You already saw metadata in “Getting User Information” in Chapter 5 and “Displaying Table Definitions” in Chapter 10. The first thing to do when meeting a new database is to inspect its metadata: What’s in the database? How big is it? How are the tables organized?

Metadata, like other data, are stored in tables and so can be accessed via SELECT queries. Metadata also can be accessed, often more conveniently, by using command-line and graphical tools. The following listings show DBMS-specific examples for viewing metadata. The DBMS itself maintains metadata: look, but don’t touch.

■ The SQL standard calls a set of metadata a catalog and specifies that it be accessed through the schema INFORMATION_SCHEMA. Not all DBMSs implement this schema or use the same terms. In Microsoft SQL Server, for example, the equivalent term for a catalog is a database and for a schema, an owner. In Oracle, the repository of metadata is the data dictionary.
