CHAPTER 5: CHARACTER DATA TYPES IN SQL
d. SCH => SSS; PH => FF
e. H => If the previous or next character is a consonant, use the previous character
f. W => If the previous character is a vowel, use the previous character
Add the current character to the result if the current character is not equal to the last key character.
(5) If the last character is S, remove it.
(6) If the last characters are AY, replace them with Y.
(7) If the last character is A, remove it.
The stated reliability of NYSIIS is 98.72%, with a selectivity factor of .164% for a name inquiry. This was taken from Robert L. Taft, “Name Search Techniques,” New York State Identification and Intelligence System.
5.4 Cutter Tables
Another encoding scheme for names has been used for libraries for more than 100 years. The catalog number of a book often needs to reduce an author’s name to a simple fixed-length code. While the results of a Cutter table look much like those of a Soundex, their goal is different. They attempt to preserve the original alphabetical order of the names in the encodings.
But the librarian cannot just attach the author’s name to the classification code. Names are not the same length, nor are they unique within their first letters. For example, “Smith, John A.” and “Smith, John B.” are not unique until the last letter.
What librarians have done about this problem is to use Cutter tables. These tables map authors’ full names into letter-and-digit codes. There are several versions of the Cutter tables. The older tables tended to use a mix of letters (both upper- and lowercase) followed by digits. The three-figure version uses a single letter followed by three digits. For example, using that table:
"Adams, J" becomes "A214"
"Adams, M" becomes "A215"
"Arnold" becomes "A752"
"Dana" becomes "D168"
"Sherman" becomes "S553"
"Scanlon" becomes "S283"
The distribution of these numbers is based on the actual distribution of names of authors in English-speaking countries. You simply scan down the table until you find the place where your name would fall and use that code.
Cutter tables have two important properties. The first is that they preserve the alphabetical ordering of the original name list, which means that you can do a rough sort on them. The second is that each grouping tends to be of approximately the same size as the set of names gets larger. These properties can be handy for building indexes in a database.
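The lookup described above, scanning down the table to the place where a name would fall, can be sketched in SQL. The table and column names (CutterTable, name_prefix, cutter_code) are hypothetical, and the six rows are only the examples quoted earlier, not a real Cutter table. The sketch also assumes a collation that sorts names in ordinary alphabetical order.

```sql
-- Hypothetical lookup table; a real Cutter table has far more entries.
CREATE TABLE CutterTable
(name_prefix VARCHAR(30) NOT NULL PRIMARY KEY,
 cutter_code CHAR(4) NOT NULL UNIQUE);

INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Adams, J', 'A214');
INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Adams, M', 'A215');
INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Arnold', 'A752');
INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Dana', 'D168');
INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Scanlon', 'S283');
INSERT INTO CutterTable (name_prefix, cutter_code) VALUES ('Sherman', 'S553');

-- Find the entry at or just before the place where the name would fall.
SELECT C1.cutter_code
  FROM CutterTable AS C1
 WHERE C1.name_prefix
       = (SELECT MAX(C2.name_prefix)
            FROM CutterTable AS C2
           WHERE C2.name_prefix <= 'Adams, John Q');
```

With these sample rows the search name sorts between 'Adams, J' and 'Adams, M', so the query should pick the 'Adams, J' entry. Because the codes keep the alphabetical order of the names, an ORDER BY on cutter_code gives roughly the same sequence as an ORDER BY on the names themselves.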
If you would like copies of the Cutter tables, you can find some of them on the Internet. Princeton University Library has posted its rules for names, locations, regions, and other things on its Web site: http://infoshare1.princeton.edu/katmandu/class/cutter.html
You can also get hard copies from this publisher:
Hargrave House
7312 Firethorn
Littleton, CO 80125
Web site = www.cuttertables.com
CHAPTER 6
NULLs: Missing Data in SQL
A DISCUSSION OF HOW missing data should be handled enters a sensitive area in relational database circles. Dr. E. F. Codd, creator of the relational model, favored two types of missing-value tokens in his book on the second version of the relational model: one for “unknown” (the eye color of a man wearing sunglasses) and one for “not applicable” (the eye color of an automobile). Chris Date, leading author on relational databases, advocates not using any general-purpose tokens for missing values at all. Standard SQL uses one token, based on Dr. Codd’s original relational model.
Perhaps Dr. Codd was right, again. In Standard SQL, adding ROLLUP and CUBE created a need for a function to test NULLs to see if they were in fact “real NULLs” (i.e., present in the data and therefore assumed to model a missing value) or “created NULLs” (i.e., created as place holders for summary rows in the result set).
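The difference between a real NULL and a created NULL can be seen in a small grouped query. This is a sketch only: the Sales table and its columns are hypothetical, and it assumes a product that supports the Standard SQL ROLLUP grouping and the GROUPING() function.

```sql
-- Hypothetical table; region is NULL-able on purpose.
CREATE TABLE Sales
(region VARCHAR(10),
 sale_amt DECIMAL(8,2) NOT NULL);

INSERT INTO Sales (region, sale_amt) VALUES ('East', 100.00);
INSERT INTO Sales (region, sale_amt) VALUES ('West', 250.00);
INSERT INTO Sales (region, sale_amt) VALUES (NULL, 75.00);

-- GROUPING(region) is 0 where region comes from the data
-- (including the real NULL) and 1 on the summary row,
-- where the NULL was created as a place holder.
SELECT region,
       GROUPING(region) AS is_created_null,
       SUM(sale_amt) AS total_amt
  FROM Sales
 GROUP BY ROLLUP (region);
```

Two rows in the result show a NULL region: the group for the row whose region is missing, and the grand total. Only the second one has is_created_null = 1.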
In their book A Guide to Sybase and SQL Server, David McGoveran and C. J. Date said: “It is this writer’s opinion that NULLs, at least as currently defined and implemented in SQL, are far more trouble than they are worth and should be avoided; they display very strange and inconsistent behavior and can be a rich source of error and confusion. (Please note that these comments and criticisms apply to any system that supports SQL-style NULLs, not just to SQL Server specifically.)”
SQL takes the middle ground and has a single general-purpose NULL for missing values. Rules for NULLs in particular statements appear in the appropriate sections of this book. This section will discuss NULLs and missing values in general.
People have trouble with things that are not there. There is no concept of zero in Roman numerals and in other traditional numeral systems. It was centuries before Hindu-Arabic numerals became popular in Europe. In fact, many early Renaissance accounting firms advertised that they did not use the fancy, newfangled notation and kept records in well-understood Roman numerals instead.
Many of the conceptual problems with zero arose from not knowing the difference between ordinal and cardinal numbers. Ordinal numbers measure position; cardinal numbers measure quantity or magnitude. The argument against the zero was this: if there is no quantity or magnitude there, how can you count or measure it? What does it mean to multiply or divide a number by zero? There was considerable linguistic confusion over words that deal with the lack of something.
As the Greek paradox says:
1. No cat has 12 tails.
2. A cat has one more tail than no cat.
3. Therefore, a cat has 13 tails.
Likewise, it was a long time before the idea of an empty set found its way into mathematics. The argument was that if there are no elements, how could you have a set of them? Is the empty set a subset of itself? Is the empty set a subset of all other sets? Is there only one universal empty set or one empty set for each type of set?
Computer science now has its own problem with missing data. The Interim Report 75-02-08 to the ANSI X3 (SPARC Study Group 1975) identified 14 different kinds of incomplete data that could appear as the result of queries or as attribute values. These types included overflows, underflows, errors, and other problems in trying to represent the real world within the limits of a computer.
Instead of discussing the theory for the different models and approaches to missing data, I would rather explain why and how to use NULLs in SQL. In the rest of this book, I will be urging you not to use them, which may seem contradictory, but it is not. Think of a NULL as a drug; use it properly and it works for you, but abuse it and it can ruin everything. Your best policy is to avoid NULLs when you can and use them properly when you have to.
6.1 Empty and Missing Tables
An empty table or view is a different concept from a missing table. An empty table is one that is defined with columns and constraints, but that has zero rows in it. This can happen when a table or view is created for the first time, or when all the rows are deleted from the table. It is a perfectly good table. By definition, all of its constraints are TRUE.
A missing table has been removed from the database schema with a DROP TABLE statement, or it never existed at all (you probably typed the name wrong). A missing view is a bit different. It, too, can be absent because of a DROP VIEW statement or a typing error. But it can also be absent because a table or view from which it was built has been removed. This means that the view cannot be constructed at runtime, and the database reports a failure. If you used CASCADE behavior when you dropped a table, the view would also be gone; but we’ll explore that later.
The behavior of an empty TABLE or VIEW will vary with the way it is used. The reader should look at sections of this book that deal with predicates that use a subquery. In general, an empty table can be treated either as a NULL or as an empty set, depending on context.
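A minimal sketch of the two treatments of an empty table follows. The table name EmptyTable is hypothetical; the behavior of COUNT(*) and SUM() on zero rows is standard.

```sql
-- A perfectly good table that happens to have no rows.
CREATE TABLE EmptyTable (col1 INTEGER NOT NULL);

-- Treated as an empty set: the row count is zero.
SELECT COUNT(*) AS row_cnt
  FROM EmptyTable;

-- Treated as a NULL: SUM() over no rows yields NULL, not zero,
-- even though col1 itself is declared NOT NULL.
SELECT SUM(col1) AS col1_tot
  FROM EmptyTable;
```

The second query is a common surprise, because a NOT NULL column can still produce a NULL once the table it lives in is empty.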
6.2 Missing Values in Columns
The usual description of NULLs is that they represent currently unknown values that may be replaced later with real values when we know something. Actually, the NULL covers a lot of territory, since it is the only way of showing any missing values. Going back to basics for a moment, we can define a row in a database as an entity, which has one or more attributes (columns), each of which is drawn from some domain. Let us use the notation E(A) = V to represent the idea that an entity, E, has an attribute, A, which has a value, V. For example, I could write “John(hair) = black” to say that John has black hair.
SQL’s general-purpose NULLs do not quite fit this model. If you have defined a domain for hair color and one for car color, then a hair color should not be comparable to a car color, because they are drawn from two different domains. You would need to make their domains comparable with an implicit or explicit casting function. This is now being done in Standard SQL, which has a CREATE DOMAIN statement, but most implementations do not have this feature yet. Trying to find out which employees drive cars that match their hair is a bit weird outside of Los Angeles, but in the case of NULLs, do we have a hit when a bald-headed man walks to work? Are no hair and no car somehow equal in color? In SQL, we would get an UNKNOWN result, rather than an error, if we compared these two NULLs directly. The domain-specific NULLs are conceptually different from the general NULL, because we know what kind of thing is UNKNOWN. This could be shown in our notation as E(A) = NULL to mean that we know the entity, and we know the attribute, but we do not know the value.
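The bald man who walks to work can be tried out directly. In this sketch the Commuters table and its rows are made up, and plain VARCHAR columns stand in for the two color domains so that it runs on products without CREATE DOMAIN.

```sql
CREATE TABLE Commuters
(emp_name VARCHAR(30) NOT NULL PRIMARY KEY,
 hair_color VARCHAR(10),   -- NULL for a bald employee
 car_color VARCHAR(10));   -- NULL for an employee who walks to work

INSERT INTO Commuters (emp_name, hair_color, car_color)
VALUES ('John', 'black', 'black');
INSERT INTO Commuters (emp_name, hair_color, car_color)
VALUES ('Mary', 'red', 'blue');
INSERT INTO Commuters (emp_name, hair_color, car_color)
VALUES ('Baldy', NULL, NULL);

-- Only John is returned. For Baldy, NULL = NULL is UNKNOWN,
-- and the WHERE clause keeps only the rows that test TRUE.
SELECT emp_name
  FROM Commuters
 WHERE hair_color = car_color;
```

Reversing the test to hair_color <> car_color returns Mary but still not Baldy, since that comparison is UNKNOWN for him as well.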
Another flavor of NULL is “Not Applicable” (shown as N/A on forms and spreadsheets and called “I-marks” by Dr. E. F. Codd in his second version of the Relational Model), which we have been using on paper forms and in some spreadsheets for years. For example, a bald man’s hair-color attribute is a missing-value NULL drawn from the hair-color domain, but his feather-color attribute is a Not Applicable NULL. The attribute itself is missing, not just the value. This missing-attribute NULL could be written as E(NULL) = NULL in the formula notation.
How could an attribute not belonging to an entity show up in a table? Consolidate medical records and put everyone together for statistical purposes. You should not find any male pregnancies in the result table. The programmer has a choice as to how to handle pregnancies. He can have a column in the consolidated table for “number of pregnancies,” put a zero or a NULL in the rows where sex = ‘male’, and then add a CHECK() clause to enforce that rule.
The other way is to have a column for “medical condition” and one for “number of occurrences” beside it. Another CHECK() clause would make sure male pregnancies do not appear. But what happens when the sex is unknown and all we have is a name like ‘Alex Morgan’, which could belong to either gender? Can we use the presence of one or more pregnancies to determine that Alex is a woman? What if Alex is a woman who has never borne children?
The case where we have NULL(A) = V is a bit strange. It means that we do not know the entity, but we are looking for a known attribute, A, which has a value of V. This is like asking “What things are colored red?”, which is a perfectly good question, but one that is very hard to ask in an SQL database.
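The first design, a pregnancy count column guarded by a CHECK() clause, might look like the sketch below. The table, column, and constraint names are hypothetical. Note that a CHECK() constraint rejects a row only when its condition is FALSE, so a row with an unknown sex is accepted.

```sql
CREATE TABLE MedicalRecords
(patient_name VARCHAR(30) NOT NULL PRIMARY KEY,
 sex VARCHAR(6)                         -- NULL when the sex is unknown
   CHECK (sex IN ('male', 'female')),
 pregnancy_cnt INTEGER
   CHECK (pregnancy_cnt >= 0),
 CONSTRAINT no_male_pregnancies
   CHECK (sex <> 'male'
          OR pregnancy_cnt IS NULL
          OR pregnancy_cnt = 0));

-- Accepted: the sex is unknown, so (sex <> 'male') is UNKNOWN,
-- and UNKNOWN does not violate a CHECK() constraint.
INSERT INTO MedicalRecords (patient_name, sex, pregnancy_cnt)
VALUES ('Alex Morgan', NULL, 2);

-- Rejected if uncommented: all three tests are FALSE.
-- INSERT INTO MedicalRecords (patient_name, sex, pregnancy_cnt)
-- VALUES ('John Smith', 'male', 1);
```

The constraint keeps male pregnancies out, but it does nothing to infer that Alex is a woman; that decision stays with the programmer.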
If you want to try writing such a query in SQL, you have to get to the system tables to get the table and column names, then JOIN them to the rows in the tables and come back with the PRIMARY KEY of that row.
For completeness, we could play with all eight possible combinations of known and unknown values in the basic E(A) = V formula. But such combinations are of little use or meaning. For example, NULL(NULL) = V would mean that we know a value, but not the entity or the attribute. This is like the running joke from The Hitchhiker’s Guide to the Galaxy (Adams 1979), in which the answer to the question, “What is the meaning of life, the universe, and everything?” is 42. Likewise, the “total ignorance” NULL, shown as NULL(NULL) = NULL, means that we have no information about the entity, even about its existence, its attributes, or their values.
6.3 Context and Missing Values
Create a domain called Tricolor that is limited to the values ‘Red’, ‘White’, and ‘Blue’, and a column in a table drawn from that domain with a UNIQUE constraint on it. If my table has a ‘Red’ and two NULL values in that column, I have some information about the two NULLs. I know they will be either (‘White’, ‘Blue’) or (‘Blue’, ‘White’) when their rows are resolved. This is what Chris Date calls a “distinguished NULL,” which means we have some information in it.
If my table has a ‘Red’, a ‘White’, and a NULL value in that column, can I change the last NULL to ‘Blue’ because it can only be ‘Blue’ under the rule? Or do I have to wait until I see an actual value for that row? There is no clear way to handle this in SQL. Multiple values cannot be put in a column, nor can the database automatically change values as part of the column declaration.
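The Tricolor situation can be set up as a sketch in which the domain is modeled as a one-column lookup table, so that the values still open to the NULLs can be queried. All table and column names are hypothetical. The sketch assumes the Standard SQL rule that a UNIQUE column may hold more than one NULL; a few products allow only one.

```sql
-- The three legal values, kept in a lookup table.
CREATE TABLE TricolorValues
(color VARCHAR(5) NOT NULL PRIMARY KEY);

INSERT INTO TricolorValues (color) VALUES ('Red');
INSERT INTO TricolorValues (color) VALUES ('White');
INSERT INTO TricolorValues (color) VALUES ('Blue');

CREATE TABLE Stripes
(stripe_nbr INTEGER NOT NULL PRIMARY KEY,
 color VARCHAR(5) UNIQUE
   REFERENCES TricolorValues (color));

INSERT INTO Stripes (stripe_nbr, color) VALUES (1, 'Red');
INSERT INTO Stripes (stripe_nbr, color) VALUES (2, NULL);
INSERT INTO Stripes (stripe_nbr, color) VALUES (3, NULL);

-- Values the NULLs could still take: every legal color not yet used.
SELECT T.color
  FROM TricolorValues AS T
 WHERE NOT EXISTS
       (SELECT *
          FROM Stripes AS S
         WHERE S.color = T.color);
```

The query reports ‘White’ and ‘Blue’ as the remaining candidates, but nothing in the schema assigns them to the two NULL rows; that step is still left to the user.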
This idea can be carried farther with marked NULL values. For example, we are given a table of hotel rooms that has columns for check-in date and checkout date. We know the check-in date for each visitor, but we do not know his or her checkout date. Instead, we know relationships among the NULLs. We can put them into groups: Mr. and Mrs. X will check out on the same day, members of tour group Y will check out on the same day, and so forth. We can also add conditions on them: nobody checks out before his check-in date, tour group Y will leave after January 7, 2005, and so forth. Such rules can be put into SQL database schemas, but it is very hard to do. The usual method is to use procedural code in a host language to handle such things.
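The two simple conditions on the checkout dates can be written as CHECK() constraints, as in this sketch. The table, column, and constraint names are hypothetical. The “same day” rules for couples and tour groups compare rows with each other and are not covered here; they are the part that usually ends up in procedural code.

```sql
CREATE TABLE HotelStays
(guest_name VARCHAR(30) NOT NULL PRIMARY KEY,
 tour_group CHAR(1),                 -- NULL for guests not on a tour
 checkin_date DATE NOT NULL,
 checkout_date DATE,                 -- NULL until the date is known
 CONSTRAINT nobody_leaves_before_arriving
   CHECK (checkin_date <= checkout_date),
 CONSTRAINT tour_group_y_leaves_late
   CHECK (tour_group <> 'Y'
          OR checkout_date > DATE '2005-01-07'));

-- Accepted: with a NULL checkout date both constraints are UNKNOWN,
-- and only a FALSE result rejects a row.
INSERT INTO HotelStays (guest_name, tour_group, checkin_date, checkout_date)
VALUES ('Mr. X', 'Y', DATE '2005-01-03', NULL);
```

When the NULL is later replaced by an actual date, the same constraints reject any value that breaks the known conditions.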
David McGoveran has proposed that each column that can have missing data should be paired with a column that encodes the reason for the absence of a value (McGoveran 1993, 1994 January, February, March). The cost is a bit of extra logic, but the extra column makes it easy to write queries that include or exclude values based on the semantics of the situation.
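A paired reason column might be declared as in this sketch. The table name, the column names, and the status codes ('K' for known, 'U' for unknown, 'N' for not applicable) are assumptions made for illustration and are not taken from McGoveran’s articles.

```sql
CREATE TABLE PersonnelHair
(emp_name VARCHAR(30) NOT NULL PRIMARY KEY,
 hair_color VARCHAR(10),
 hair_color_status CHAR(1) DEFAULT 'K' NOT NULL
   CHECK (hair_color_status IN ('K', 'U', 'N')),
 -- A value must be present exactly when the status says it is known.
 CONSTRAINT status_matches_value
   CHECK ((hair_color_status = 'K' AND hair_color IS NOT NULL)
       OR (hair_color_status <> 'K' AND hair_color IS NULL)));

INSERT INTO PersonnelHair (emp_name, hair_color, hair_color_status)
VALUES ('John', 'black', 'K');
INSERT INTO PersonnelHair (emp_name, hair_color, hair_color_status)
VALUES ('Sunglasses Sam', NULL, 'U');
INSERT INTO PersonnelHair (emp_name, hair_color, hair_color_status)
VALUES ('Baldy', NULL, 'N');

-- Employees whose hair color exists but has not been recorded yet.
SELECT emp_name
  FROM PersonnelHair
 WHERE hair_color_status = 'U';
```

Because the status column is NOT NULL, queries against it use ordinary two-valued comparisons.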
Finally, you might want to look at solutions statisticians have used for missing data. In many kinds of computations, the missing values are replaced by an average, median, or other value constructed from the data set.
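Replacing missing values with the average can be sketched with COALESCE() and a scalar subquery. The Samples table and its rows are hypothetical. AVG() skips NULLs, so the average is built only from the readings that are present.

```sql
CREATE TABLE Samples
(sample_id INTEGER NOT NULL PRIMARY KEY,
 reading DECIMAL(8,2));              -- NULL when the reading is missing

INSERT INTO Samples (sample_id, reading) VALUES (1, 10.00);
INSERT INTO Samples (sample_id, reading) VALUES (2, 20.00);
INSERT INTO Samples (sample_id, reading) VALUES (3, NULL);

-- Sample 3 is shown with the average of samples 1 and 2.
SELECT sample_id,
       COALESCE(reading,
                (SELECT AVG(reading) FROM Samples)) AS filled_reading
  FROM Samples;
```

The substitution happens only in the query result; the stored row still holds the NULL, so the fact that the value was missing is not lost.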
A NULL cannot be compared to another NULL (equal, not equal, less than, greater than, and so forth). This is where we get SQL’s three-valued logic instead of two-valued logic. Most programmers do not easily think in three values. But think about it for a minute. Imagine that you are looking at brown paper bags and are asked to compare them without seeing inside of either of them. What can you say about the predicate “Bag A has more tuna fish than Bag B”: is it TRUE or FALSE? You cannot say one way or the other, so you use a third logical value, UNKNOWN.
If I execute SELECT * FROM SomeTable WHERE SomeColumn = 2; and then execute SELECT * FROM SomeTable WHERE SomeColumn <> 2; I expect to see all the rows of SomeTable between these two queries. However, I also need to execute SELECT * FROM SomeTable WHERE SomeColumn IS NULL; to do that.
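The point about rows that fall between two complementary queries can be tried on a three-row table. This sketch uses the SomeTable and SomeColumn names from the text; the sample rows are made up.

```sql
CREATE TABLE SomeTable (SomeColumn INTEGER);

INSERT INTO SomeTable (SomeColumn) VALUES (1);
INSERT INTO SomeTable (SomeColumn) VALUES (2);
INSERT INTO SomeTable (SomeColumn) VALUES (NULL);

-- Returns the row holding 2.
SELECT * FROM SomeTable WHERE SomeColumn = 2;

-- Returns the row holding 1. The NULL row is not here either,
-- because (NULL <> 2) is UNKNOWN rather than TRUE.
SELECT * FROM SomeTable WHERE SomeColumn <> 2;

-- Only this query picks up the remaining row.
SELECT * FROM SomeTable WHERE SomeColumn IS NULL;
```

It takes all three queries together to account for every row in the table.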
6.5 NULLs and Logic
George Boole developed two-valued logic and attached his name to Boolean algebra forever (Boole 1854). This is not the only possible system, but it is the one that works best with a binary (two-state) computer and with a lot of mathematics. SQL has three-valued logic: TRUE, FALSE, and UNKNOWN. The UNKNOWN value results from using NULLs in comparisons and other predicates, but UNKNOWN is a logical value and not the same as a NULL, which is a data value marker. That is why you have to say (x IS [NOT] NULL) in SQL and not use (x = NULL) instead. Table 6.1 shows the three logical operators that come with SQL.
Table 6.1 SQL’s Three Operators

x     | NOT
===============
TRUE  | FALSE
UNK   | UNK
FALSE | TRUE

AND   | TRUE   UNK    FALSE
=============================
TRUE  | TRUE   UNK    FALSE
UNK   | UNK    UNK    FALSE
FALSE | FALSE  FALSE  FALSE

OR    | TRUE   UNK    FALSE
=============================
TRUE  | TRUE   TRUE   TRUE
UNK   | TRUE   UNK    UNK
FALSE | TRUE   UNK    FALSE
All other predicates in SQL resolve themselves to chains of these three operators. But that resolution is not immediately clear in all cases, since it is done at run time in the case of predicates that use subqueries.
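A few entries of the AND and OR tables can be checked with CASE expressions, which separate the three logical values by testing a predicate and then its negation. This is a sketch; the OneRow table is hypothetical, and the comparison (x = 1) is UNKNOWN because x is NULL.

```sql
CREATE TABLE OneRow (x INTEGER);
INSERT INTO OneRow (x) VALUES (NULL);

-- If neither the predicate nor its negation tests TRUE,
-- the predicate must be UNKNOWN.
SELECT CASE WHEN (x = 1) AND (1 = 0) THEN 'TRUE'
            WHEN NOT ((x = 1) AND (1 = 0)) THEN 'FALSE'
            ELSE 'UNKNOWN' END AS unk_and_false,  -- 'FALSE'
       CASE WHEN (x = 1) OR (1 = 1) THEN 'TRUE'
            WHEN NOT ((x = 1) OR (1 = 1)) THEN 'FALSE'
            ELSE 'UNKNOWN' END AS unk_or_true,    -- 'TRUE'
       CASE WHEN (x = 1) AND (1 = 1) THEN 'TRUE'
            WHEN NOT ((x = 1) AND (1 = 1)) THEN 'FALSE'
            ELSE 'UNKNOWN' END AS unk_and_true    -- 'UNKNOWN'
  FROM OneRow;
```

The first two columns show that an UNKNOWN operand does not always make the whole expression UNKNOWN: a FALSE settles an AND, and a TRUE settles an OR.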
6.5.1 NULLs in Subquery Predicates
People forget that a subquery often hides a comparison with a NULL. Consider these two tables:
CREATE TABLE Table1 (col1 INTEGER);
INSERT INTO Table1 (col1) VALUES (1);
INSERT INTO Table1 (col1) VALUES (2);
CREATE TABLE Table2 (col1 INTEGER);
INSERT INTO Table2 (col1) VALUES (1);
INSERT INTO Table2 (col1) VALUES (2);
INSERT INTO Table2 (col1) VALUES (3);
INSERT INTO Table2 (col1) VALUES (4);
INSERT INTO Table2 (col1) VALUES (5);
Notice that the columns are NULL-able. Execute this query:
SELECT col1
FROM Table2
WHERE col1 NOT IN (SELECT col1 FROM Table1);
Result
col1
======
3