Joe Celko's SQL for Smarties: Advanced SQL Programming, Part 22


182 CHAPTER 5: CHARACTER DATA TYPES IN SQL

d. SCH => SSS
   PH => FF

e. H => If the previous or next character is a consonant, use the previous character

f. W => If the previous character is a vowel, use the previous character

Add the current character to the result if the current character is not equal to the last key character

(5) If the last character is S, remove it

(6) If the last characters are AY, replace them with Y

(7) If the last character is A, remove it
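The closing steps of this listing are simple enough to sketch in code. The Python fragment below is only an illustration of the append rule and of steps (5) through (7) as worded here. The function names are hypothetical, the earlier transcoding steps are not reproduced, and I am assuming each step is applied once, in order.

```python
def append_if_new(key: str, ch: str) -> str:
    """Add ch to the key unless it repeats the last key character."""
    if key and key[-1] == ch:
        return key
    return key + ch


def finish_key(key: str) -> str:
    """Apply steps (5), (6), and (7) once each, in that order (an assumption)."""
    if key.endswith("S"):       # (5) drop a trailing S
        key = key[:-1]
    if key.endswith("AY"):      # (6) a trailing AY becomes Y
        key = key[:-2] + "Y"
    if key.endswith("A"):       # (7) drop a trailing A
        key = key[:-1]
    return key
```

A production NYSIIS routine would also need the opening steps and whatever length limits the full specification sets, none of which appear on this page.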

The stated reliability of NYSIIS is 98.72%, with a selectivity factor of .164% for a name inquiry. This was taken from Robert L. Taft, “Name Search Techniques,” New York State Identification and Intelligence System.

5.4 Cutter Tables

Another encoding scheme for names has been used for libraries for more than 100 years. The catalog number of a book often needs to reduce an author’s name to a simple fixed-length code. While the results of a Cutter table look much like those of a Soundex, their goal is different. They attempt to preserve the original alphabetical order of the names in the encodings.

But the librarian cannot just attach the author’s name to the classification code. Names are not the same length, nor are they unique within their first letters. For example, “Smith, John A.” and “Smith, John B.” are not unique until the last letter.

What librarians have done about this problem is to use Cutter tables. These tables map authors’ full names into letter-and-digit codes. There are several versions of the Cutter tables. The older tables tended to use a mix of letters (both upper- and lowercase) followed by digits. The three-figure version uses a single letter followed by three digits. For example, using that table:

"Adams, J" becomes "A214"

"Adams, M" becomes "A215"

"Arnold" becomes "A752"

"Dana" becomes "D168"

"Sherman" becomes "S553"

"Scanlon" becomes "S283"


The distribution of these numbers is based on the actual distribution of names of authors in English-speaking countries. You simply scan down the table until you find the place where your name would fall and use that code.

Cutter tables have two important properties. The first is that they preserve the alphabetical ordering of the original name list, which means that you can do a rough sort on them. The second is that each grouping tends to be of approximately the same size as the set of names gets larger. These properties can be handy for building indexes in a database.
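The table scan and the order-preserving property can be sketched in a few lines of Python. This is a toy, not a real Cutter table: it holds only the six sample entries quoted above, and it assumes that "the place where your name would fall" means the nearest entry at or before the name. The function name cutter_code is hypothetical.

```python
from bisect import bisect_right

# The six sample entries quoted in the text, kept in alphabetical order.
CUTTER_SAMPLE = [
    ("Adams, J", "A214"),
    ("Adams, M", "A215"),
    ("Arnold", "A752"),
    ("Dana", "D168"),
    ("Scanlon", "S283"),
    ("Sherman", "S553"),
]
NAMES = [name for name, _ in CUTTER_SAMPLE]


def cutter_code(author: str):
    """Return the code of the last table entry that does not sort after author.

    Assumption: the nearest entry at or before the name is the right one;
    a real table may define the rule differently. Returns None when the
    name sorts before every entry in this tiny sample.
    """
    pos = bisect_right(NAMES, author)
    return CUTTER_SAMPLE[pos - 1][1] if pos else None


# Sorting by code gives the same order as sorting by name.
authors = ["Sherman", "Arnold", "Dana", "Adams, M", "Scanlon", "Adams, J"]
by_name = sorted(authors)
by_code = sorted(authors, key=cutter_code)
```

Because the codes sort the same way the names do, an index built on the short fixed-length code can stand in for one built on the full name, at least for a rough ordering.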

If you would like copies of the Cutter tables, you can find some of them on the Internet. Princeton University Library has posted its rules for names, locations, regions, and other things on its Web site, http://infoshare1.princeton.edu/katmandu/class/cutter.html

You can also get hard copies from this publisher:

Hargrave House

7312 Firethorn

Littleton, CO 80125

Web site = www.cuttertables.com


CHAPTER 6

NULLs: Missing Data in SQL

A DISCUSSION OF HOW missing data should be handled enters a sensitive area in relational database circles. Dr. E. F. Codd, creator of the relational model, favored two types of missing-value tokens in his book on the second version of the relational model: one for “unknown” (the eye color of a man wearing sunglasses) and one for “not applicable” (the eye color of an automobile). Chris Date, leading author on relational databases, advocates not using any general-purpose tokens for missing values at all. Standard SQL uses one token, based on Dr. Codd’s original relational model.

Perhaps Dr. Codd was right, again. In Standard SQL, adding ROLLUP and CUBE created a need for a function to test NULLs to see if they were in fact “real NULLs” (i.e., present in the data and therefore assumed to model a missing value) or “created NULLs” (i.e., created as place holders for summary rows in the result set).

In their book A Guide to Sybase and SQL Server, David McGoveran and C. J. Date said: “It is this writer’s opinion that NULLs, at least as currently defined and implemented in SQL, are far more trouble than they are worth and should be avoided; they display very strange and inconsistent behavior and can be a rich source of error and confusion. (Please note that these comments and criticisms apply to any system that supports SQL-style NULLs, not just to SQL Server specifically.)”


SQL takes the middle ground and has a single general-purpose NULL for missing values. Rules for NULLs in particular statements appear in the appropriate sections of this book. This section will discuss NULLs and missing values in general.

People have trouble with things that are not there. There is no concept of zero in Roman numerals and in other traditional numeral systems. It was centuries before Hindu-Arabic numerals became popular in Europe. In fact, many early Renaissance accounting firms advertised that they did not use the fancy, newfangled notation and kept records in well-understood Roman numerals instead.

Many of the conceptual problems with zero arose from not knowing the difference between ordinal and cardinal numbers. Ordinal numbers measure position; cardinal numbers measure quantity or magnitude. The argument against the zero was this: if there is no quantity or magnitude there, how can you count or measure it? What does it mean to multiply or divide a number by zero? There was considerable linguistic confusion over words that deal with the lack of something.

As the Greek paradox says:

1. No cat has 12 tails.

2. A cat has one more tail than no cat.

3. Therefore, a cat has 13 tails.

Likewise, it was a long time before the idea of an empty set found its way into mathematics. The argument was that if there are no elements, how could you have a set of them? Is the empty set a subset of itself? Is the empty set a subset of all other sets? Is there only one universal empty set or one empty set for each type of set?

Computer science now has its own problem with missing data. The Interim Report 75-02-08 to the ANSI X3 (SPARC Study Group 1975) identified 14 different kinds of incomplete data that could appear as the result of queries or as attribute values. These types included overflows, underflows, errors, and other problems in trying to represent the real world within the limits of a computer.

Instead of discussing the theory for the different models and approaches to missing data, I would rather explain why and how to use NULLs in SQL. In the rest of this book, I will be urging you not to use them, which may seem contradictory, but it is not. Think of a NULL as a drug; use it properly and it works for you, but abuse it and it can ruin everything. Your best policy is to avoid NULLs when you can and use them properly when you have to.

6.1 Empty and Missing Tables

An empty table or view is a different concept from a missing table. An empty table is one that is defined with columns and constraints, but that has zero rows in it. This can happen when a table or view is created for the first time, or when all the rows are deleted from the table. It is a perfectly good table. By definition, all of its constraints are TRUE.

A missing table has been removed from the database schema with a DROP TABLE statement, or it never existed at all (you probably typed the name wrong). A missing view is a bit different. It, too, can be absent because of a DROP VIEW statement or a typing error. But it can also be absent because a table or view from which it was built has been removed. This means that the view cannot be constructed at runtime, and the database reports a failure. If you used CASCADE behavior when you dropped a table, the view would also be gone; but we’ll explore that later.

The behavior of an empty TABLE or VIEW will vary with the way it is used. The reader should look at sections of this book that deal with predicates that use a subquery. In general, an empty table can be treated either as a NULL or as an empty set, depending on context.
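A small experiment shows both readings of an empty table, and how a missing table differs. The sketch below uses Python's sqlite3 module with an in-memory SQLite database. The table names are hypothetical, and other SQL products may word the error differently.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EmptyTable (col1 INTEGER NOT NULL CHECK (col1 > 0))")

# An empty table is still a perfectly good table: it can be queried.
row_count = con.execute("SELECT COUNT(*) FROM EmptyTable").fetchone()[0]

# Treated as an empty set: EXISTS and IN find nothing.
exists_flag = con.execute(
    "SELECT EXISTS (SELECT * FROM EmptyTable)").fetchone()[0]
in_flag = con.execute(
    "SELECT 1 IN (SELECT col1 FROM EmptyTable)").fetchone()[0]

# Treated as a NULL: a scalar subquery and SUM() both yield NULL (None).
scalar_value = con.execute(
    "SELECT (SELECT col1 FROM EmptyTable)").fetchone()[0]
sum_value = con.execute("SELECT SUM(col1) FROM EmptyTable").fetchone()[0]

# A missing table is an error, not an empty result.
try:
    con.execute("SELECT * FROM NoSuchTable")
    missing_error = None
except sqlite3.OperationalError as exc:
    missing_error = str(exc)
```

Here the count is zero and the two set tests come back false, while the scalar subquery and the SUM() come back as NULL. The query against the misspelled name never produces a result at all.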

6.2 Missing Values in Columns

The usual description of NULLs is that they represent currently unknown values that may be replaced later with real values when we know something. Actually, the NULL covers a lot of territory, since it is the only way of showing any missing values. Going back to basics for a moment, we can define a row in a database as an entity, which has one or more attributes (columns), each of which is drawn from some domain. Let us use the notation E(A) = V to represent the idea that an entity, E, has an attribute, A, which has a value, V. For example, I could write “John(hair) = black” to say that John has black hair.

SQL’s general-purpose NULLs do not quite fit this model. If you have defined a domain for hair color and one for car color, then a hair color should not be comparable to a car color, because they are drawn from two different domains. You would need to make their domains comparable with an implicit or explicit casting function. This is now being done in Standard SQL, which has a CREATE DOMAIN statement, but most implementations do not have this feature yet. Trying to find out which employees drive cars that match their hair is a bit weird outside of Los Angeles, but in the case of NULLs, do we have a hit when a bald-headed man walks to work? Are no hair and no car somehow equal in color? In SQL, we would get an UNKNOWN result, rather than an error, if we compared these two NULLs directly. The domain-specific NULLs are conceptually different from the general NULL, because we know what kind of thing is UNKNOWN. This could be shown in our notation as E(A) = NULL to mean that we know the entity, and we know the attribute, but we do not know the value.

Another flavor of NULL is “Not Applicable” (shown as N/A on forms and spreadsheets and called “I-marks” by Dr. E. F. Codd in his second version of the Relational Model), which we have been using on paper forms and in some spreadsheets for years. For example, a bald man’s hair-color attribute is a missing-value NULL drawn from the hair-color domain, but his feather-color attribute is a Not Applicable NULL. The attribute itself is missing, not just the value. This missing-attribute NULL could be written as E(NULL) = NULL in the formula notation.

How could an attribute not belonging to an entity show up in a table? Consolidate medical records and put everyone together for statistical purposes. You should not find any male pregnancies in the result table. The programmer has a choice as to how to handle pregnancies. He can have a column in the consolidated table for “number of pregnancies,” put a zero or a NULL in the rows where sex = ‘male’, and then add some CHECK() clauses to enforce that rule.

The other way is to have a column for “medical condition” and one for “number of occurrences” beside it. Another CHECK() clause would make sure male pregnancies do not appear. But what happens when the sex is unknown and all we have is a name like ‘Alex Morgan’, which could belong to either gender? Can we use the presence of one or more pregnancies to determine that Alex is a woman? What if Alex is a woman who has never borne children?

The case where we have NULL(A) = V is a bit strange. It means that we do not know the entity, but we are looking for a known attribute, A, which has a value of V. This is like asking “What things are colored red?”, a perfectly good question, but one that is very hard to ask in an SQL database.
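Returning to the consolidated medical table for a moment, the first design (a “number of pregnancies” column guarded by a CHECK() clause) might be sketched as follows. This runs on SQLite through Python's sqlite3 module. The table, the column names, and every sample row except ‘Alex Morgan’ are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Patients (
        patient_name  VARCHAR(30) NOT NULL,
        sex           VARCHAR(6)
                      CHECK (sex IN ('male', 'female')),
        pregnancy_cnt INTEGER
                      CHECK (pregnancy_cnt >= 0),
        -- the integrity rule: no male pregnancies
        CHECK (NOT (sex = 'male' AND pregnancy_cnt > 0))
    )""")


def try_insert(name, sex, cnt):
    """Return True if the row is accepted, False if a CHECK rejects it."""
    try:
        con.execute("INSERT INTO Patients VALUES (?, ?, ?)", (name, sex, cnt))
        return True
    except sqlite3.IntegrityError:
        return False


ok_female = try_insert("Mary Smith", "female", 2)
ok_male = try_insert("John Smith", "male", 0)
bad_male = try_insert("Bob Jones", "male", 1)
# Sex unknown: the CHECK evaluates to UNKNOWN, which SQL lets through.
ok_unknown = try_insert("Alex Morgan", None, 1)
```

The last row is the interesting one. A CHECK() rejects a row only when its predicate is FALSE, so an unknown sex gets the benefit of the doubt and the constraint cannot answer the question about Alex.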

If you want to try writing such a query in SQL, you have to get to the system tables to get the table and column names, then JOIN them to the rows in the tables and come back with the PRIMARY KEY of that row.

For completeness, we could play with all eight possible combinations of known and unknown values in the basic E(A) = V formula. But such combinations are of little use or meaning. For example, NULL(NULL) = V would mean that we know a value, but not the entity or the attribute. This is like the running joke from The Hitchhiker’s Guide to the Galaxy (Adams 1979), in which the answer to the question, “What is the meaning of life, the universe, and everything?” is 42. Likewise, “total ignorance” NULL, shown as NULL(NULL) = NULL, means that we have no information about the entity, even about its existence, its attributes, or their values.
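To show how awkward the “What things are colored red?” query is, here is one possible sketch of the system-table approach. It uses SQLite's own catalog (the sqlite_master table and PRAGMA table_info) as a stand-in for whatever system tables your product provides. The Cars and Hats tables and the find_value function are hypothetical, and the sketch handles only single-column keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Cars (vin TEXT PRIMARY KEY, body_color TEXT);
    CREATE TABLE Hats (hat_id INTEGER PRIMARY KEY, hat_color TEXT, maker TEXT);
    INSERT INTO Cars VALUES ('V1', 'red'), ('V2', 'blue');
    INSERT INTO Hats VALUES (1, 'green', 'Acme'), (2, 'red', 'Acme');
""")


def find_value(con, wanted):
    """Return (table, column, key value) for every cell equal to wanted."""
    hits = []
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        cols = con.execute(f'PRAGMA table_info("{table}")').fetchall()
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        key_cols = [c[1] for c in cols if c[5] > 0]
        if len(key_cols) != 1:
            continue  # this sketch handles single-column keys only
        key = key_cols[0]
        for col in (c[1] for c in cols if c[1] != key):
            sql = f'SELECT "{key}" FROM "{table}" WHERE "{col}" = ?'
            hits += [(table, col, r[0]) for r in con.execute(sql, (wanted,))]
    return hits


red_things = find_value(con, "red")
```

Notice that the work is done by host-language code that builds one query per column. Nothing in a single SQL statement can range over entities this way.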

6.3 Context and Missing Values

Create a domain called Tricolor that is limited to the values ‘Red’, ‘White’, and ‘Blue’, and a column in a table drawn from that domain with a UNIQUE constraint on it. If my table has a ‘Red’ and two NULL values in that column, I have some information about the two NULLs. I know they will be either (‘White’, ‘Blue’) or (‘Blue’, ‘White’) when their rows are resolved. This is what Chris Date calls a “distinguished NULL,” which means we have some information in it.

If my table has a ‘Red’, a ‘White’, and a NULL value in that column, can I change the last NULL to ‘Blue’ because it can only be ‘Blue’ under the rule? Or do I have to wait until I see an actual value for that row? There is no clear way to handle this in SQL. Multiple values cannot be put in a column, nor can the database automatically change values as part of the column declaration.
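The reasoning about the Tricolor column can be imitated outside the database. In this sketch a CHECK() list stands in for the domain, since SQLite has no CREATE DOMAIN. The Flags table is hypothetical, and SQLite's UNIQUE constraint is assumed to accept several NULLs, which is not true of every product.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Flags (
        flag_id INTEGER PRIMARY KEY,
        color   VARCHAR(5) UNIQUE
                CHECK (color IN ('Red', 'White', 'Blue'))
    );
    INSERT INTO Flags VALUES (1, 'Red'), (2, NULL), (3, NULL);
""")

TRICOLOR = {"Red", "White", "Blue"}

# Values already taken cannot be what the NULLs resolve to.
used = {row[0] for row in con.execute(
    "SELECT color FROM Flags WHERE color IS NOT NULL")}
candidates = TRICOLOR - used
null_count = con.execute(
    "SELECT COUNT(*) FROM Flags WHERE color IS NULL").fetchone()[0]
```

The set of candidates is something the program computes; the database itself stores only two indistinguishable NULLs.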

This idea can be carried farther with marked NULL values. For example, we are given a table of hotel rooms that has columns for check-in date and checkout date. We know the check-in date for each visitor, but we do not know his or her checkout date. Instead, we know relationships among the NULLs. We can put them into groups: Mr. and Mrs. X will check out on the same day, members of tour group Y will check out on the same day, and so forth. We can also add conditions on them: nobody checks out before his check-in date, tour group Y will leave after January 7, 2005, and so forth. Such rules can be put into SQL database schemas, but it is very hard to do. The usual method is to use procedural code in a host language to handle such things.

David McGoveran has proposed that each column that can have missing data should be paired with a column that encodes the reason for the absence of a value (McGoveran 1993, 1994 January, February, March). The cost is a bit of extra logic, but the extra column makes it easy to write queries that include or exclude values based on the semantics of the situation.
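One way to picture the paired-column idea is with the hotel example. The sketch below is my own illustration, not McGoveran's published design. The HotelStays table, its columns, and the two reason codes are hypothetical, and it runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE HotelStays (
        guest_name      VARCHAR(30) NOT NULL PRIMARY KEY,
        checkin_date    DATE NOT NULL,
        checkout_date   DATE,
        -- hypothetical reason codes for an absent checkout date
        checkout_reason VARCHAR(15)
            CHECK (checkout_reason IN ('not yet known', 'open-ended')),
        -- exactly one of the pair is present
        CHECK ((checkout_date IS NULL) <> (checkout_reason IS NULL)),
        CHECK (checkout_date >= checkin_date)
    );
    INSERT INTO HotelStays VALUES
        ('Mr. X',  '2005-01-03', '2005-01-05', NULL),
        ('Mrs. X', '2005-01-03', NULL, 'not yet known'),
        ('Ms. Z',  '2005-01-04', NULL, 'open-ended');
""")

# Include or exclude rows by the meaning of the missing value.
undecided = [row[0] for row in con.execute(
    "SELECT guest_name FROM HotelStays "
    "WHERE checkout_reason = 'not yet known' ORDER BY guest_name")]

# A missing date with no stated reason is rejected by the pairing rule.
try:
    con.execute(
        "INSERT INTO HotelStays VALUES ('Mr. W', '2005-01-04', NULL, NULL)")
    both_missing_rejected = False
except sqlite3.IntegrityError:
    both_missing_rejected = True
```

The extra logic is the pairing CHECK(). In return, a query can ask for the guests whose dates are merely undecided without sweeping in the open-ended stays.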

Finally, you might want to look at solutions statisticians have used for missing data. In many kinds of computations, the missing values are replaced by an average, median, or other value constructed from the data set.
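As a small illustration of the statisticians' approach, the query below fills each missing reading with the average of the known ones. The Readings table and its rows are invented for the example, and whether mean substitution is appropriate depends entirely on the data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Readings (reading_id INTEGER PRIMARY KEY, reading REAL);
    INSERT INTO Readings VALUES (1, 10.0), (2, NULL), (3, 20.0), (4, NULL);
""")

# AVG() skips NULLs, so the mean is taken over the known values only.
filled = con.execute("""
    SELECT reading_id,
           COALESCE(reading, (SELECT AVG(reading) FROM Readings)) AS reading
      FROM Readings
     ORDER BY reading_id
""").fetchall()
```

The base table is left alone. The substitution happens only in the query result, so the fact that two readings were never taken is not lost.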

A NULL cannot be compared to another NULL (equal, not equal, less than, greater than, and so forth). This is where we get SQL’s three-valued logic instead of two-valued logic. Most programmers do not easily think in three values. But think about it for a minute. Imagine that you are looking at brown paper bags and are asked to compare them without seeing inside of either of them. What can you say about the predicate “Bag A has more tuna fish than Bag B”: is it TRUE or FALSE? You cannot say one way or the other, so you use a third logical value, UNKNOWN.

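If I execute SELECT * FROM SomeTable WHERE SomeColumn = 2; and then execute SELECT * FROM SomeTable WHERE SomeColumn <> 2; I expect to see all the rows of SomeTable between these two queries. However, I also need to execute SELECT * FROM SomeTable WHERE SomeColumn IS NULL; to see the rows in which SomeColumn is NULL.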
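A quick count makes the point concrete. In this sketch the sample rows are invented, and the helper count_where is hypothetical. Rows holding a NULL satisfy neither the equality test nor the inequality test.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SomeTable (SomeColumn INTEGER);
    INSERT INTO SomeTable VALUES (1), (2), (2), (NULL), (3), (NULL);
""")


def count_where(predicate):
    """Count the rows of SomeTable for which the predicate is TRUE."""
    sql = f"SELECT COUNT(*) FROM SomeTable WHERE {predicate}"
    return con.execute(sql).fetchone()[0]


total = count_where("1 = 1")
equal = count_where("SomeColumn = 2")
not_equal = count_where("SomeColumn <> 2")
is_null = count_where("SomeColumn IS NULL")
```

The first two predicates account for four of the six rows. Only when the IS NULL rows are added does the total come out right.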

6.5 NULLs and Logic

George Boole developed two-valued logic and attached his name to Boolean algebra forever (Boole 1854). This is not the only possible system, but it is the one that works best with a binary (two-state) computer and with a lot of mathematics. SQL has three-valued logic: TRUE, FALSE, and UNKNOWN. The UNKNOWN value results from using NULLs in comparisons and other predicates, but UNKNOWN is a logical value and not the same as a NULL, which is a data value marker. That is why you have to say (x IS [NOT] NULL) in SQL and not use (x = NULL) instead. Table 6.1 shows the three logical operators that come with SQL.

Table 6.1 SQL’s Three Operators

x      | NOT
==================
TRUE   | FALSE
UNK    | UNK
FALSE  | TRUE

AND    | TRUE   UNK    FALSE
=============================
TRUE   | TRUE   UNK    FALSE
UNK    | UNK    UNK    FALSE
FALSE  | FALSE  FALSE  FALSE

OR     | TRUE   UNK    FALSE
============================
TRUE   | TRUE   TRUE   TRUE
UNK    | TRUE   UNK    UNK
FALSE  | TRUE   UNK    FALSE

All other predicates in SQL resolve themselves to chains of these three operators. But that resolution is not immediately clear in all cases, since it is done at run time in the case of predicates that use subqueries.
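The truth tables for NOT, AND, and OR can be checked against a real engine. The sketch below asks SQLite to evaluate every combination. SQLite has no separate Boolean type, so it represents TRUE as 1, FALSE as 0, and UNKNOWN as NULL; the helper evaluate and the name mapping are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
NAMES = {1: "TRUE", 0: "FALSE", None: "UNK"}
VALUES = [1, None, 0]          # TRUE, UNKNOWN, FALSE


def evaluate(expr, *args):
    """Evaluate a logical expression in SQLite and name the result."""
    return NAMES[con.execute(f"SELECT {expr}", args).fetchone()[0]]


not_table = {NAMES[x]: evaluate("NOT ?", x) for x in VALUES}
and_table = {(NAMES[x], NAMES[y]): evaluate("? AND ?", x, y)
             for x in VALUES for y in VALUES}
or_table = {(NAMES[x], NAMES[y]): evaluate("? OR ?", x, y)
            for x in VALUES for y in VALUES}
```

Two entries are worth remembering: UNKNOWN AND FALSE is FALSE, and UNKNOWN OR TRUE is TRUE, because in each case the known operand settles the answer no matter what the unknown one turns out to be.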

6.5.1 NULLS in Subquery Predicates

People forget that a subquery often hides a comparison with a NULL. Consider these two tables:

CREATE TABLE Table1 (col1 INTEGER);
INSERT INTO Table1 (col1) VALUES (1);

CREATE TABLE Table2 (col1 INTEGER);
-- The rows of Table2 are missing from this copy; the values below are assumed so that the query returns the result shown
INSERT INTO Table2 (col1) VALUES (1), (3);

Notice that the columns are NULL-able. Execute this query:

SELECT col1
  FROM Table2
 WHERE col1 NOT IN (SELECT col1 FROM Table1);

Result
col1
======
3
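To see the hidden comparison at work, run the query, put a NULL into Table1, and run it again. The sketch below does this on SQLite through Python's sqlite3 module. The rows of Table2 are an assumption chosen to reproduce the result shown.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (col1 INTEGER);
    CREATE TABLE Table2 (col1 INTEGER);
    INSERT INTO Table1 (col1) VALUES (1);
    INSERT INTO Table2 (col1) VALUES (1), (3);   -- assumed rows
""")

QUERY = "SELECT col1 FROM Table2 WHERE col1 NOT IN (SELECT col1 FROM Table1)"

before = con.execute(QUERY).fetchall()

# One NULL in the subquery turns every "no match" into UNKNOWN.
con.execute("INSERT INTO Table1 (col1) VALUES (NULL)")
after = con.execute(QUERY).fetchall()
```

The first run returns the single row 3. After the NULL goes in, the second run returns nothing: 3 NOT IN (1, NULL) expands to NOT (3 = 1 OR 3 = NULL), which is NOT (FALSE OR UNKNOWN), and that is UNKNOWN, so the row is dropped.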
