The choice of primary key is largely a matter of convenience and what is easiest to use. We’ll discuss primary keys later in this chapter in the context of relationships. The important thing to remember is that when you have values that should exist only once in the database, you need to protect against duplicates.
Choosing Keys
While keys can consist of any number of columns, it is best to limit the number of columns in a key as much as possible. For example, you may have a Book table with the columns Publisher_Name, Publisher_City, ISBN_Number, Book_Name, and Edition. From these attributes, the following three keys might be defined:
• Publisher_Name, Book_Name, Edition: A publisher will likely publish more than one book. Also, it is safe to assume that book names are not unique across all books. However, it is probably true that the same publisher will not publish two books with the same title and the same edition (at least, we assume that this is true!).
• ISBN_Number: The ISBN number is the unique identification number assigned to a book when it is published.
• Publisher_City, ISBN_Number: Because ISBN_Number is unique, it follows that Publisher_City and ISBN_Number combined is also unique.
The choice of (Publisher_Name, Book_Name, Edition) as a composite candidate key seems valid, but the (Publisher_City, ISBN_Number) key requires more thought. The implication of this key is that in every city, ISBN_Number can be used again, a conclusion that is obviously not appropriate. This is a common problem with composite keys, which are often not thought out properly. In this case, you might choose ISBN_Number as the PK and (Publisher_Name, Book_Name, Edition) as the AK.
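As a minimal sketch of that choice, the following Python script declares ISBN_Number as the primary key and the three-column publisher/title/edition combination from the list above as an alternate key. It uses SQLite through Python's standard sqlite3 module purely so the example can be run anywhere; SQL Server syntax differs in places, and the publisher names, cities, titles, and ISBN values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ISBN_Number is the primary key; the UNIQUE constraint is the alternate key.
conn.execute("""
    CREATE TABLE Book (
        ISBN_Number    TEXT    NOT NULL PRIMARY KEY,
        Publisher_Name TEXT    NOT NULL,
        Publisher_City TEXT    NOT NULL,
        Book_Name      TEXT    NOT NULL,
        Edition        INTEGER NOT NULL,
        UNIQUE (Publisher_Name, Book_Name, Edition)
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if one of the keys rejects it."""
    try:
        conn.execute("INSERT INTO Book VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Hypothetical sample rows: (ISBN, publisher, city, title, edition).
first = try_insert(("111", "Example Press", "Springfield", "Database Design", 1))
dup_isbn = try_insert(("111", "Other House", "Shelbyville", "Another Title", 1))
dup_alternate = try_insert(("222", "Example Press", "Springfield", "Database Design", 1))
new_edition = try_insert(("333", "Example Press", "Springfield", "Database Design", 2))

print(first, dup_isbn, dup_alternate, new_edition)
```

The second row is rejected by the primary key and the third by the alternate key, while a new edition of the same title is accepted. Note that nothing here declares (Publisher_City, ISBN_Number) as a key, since that combination adds no protection beyond ISBN_Number alone.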
■ Note It is important to not confuse unique indexes with keys. There may be valid performance-based reasons to implement the Publisher_City, ISBN_Number index in your SQL Server database. However, this would not be identified as a key of a table. In Chapter 6, we’ll discuss implementing keys, and in Chapter 8, we’ll cover implementing indexes for data access enhancement.
Having established what keys are, we’ll next discuss the two main types of keys: natural keys (including smart keys) and surrogate keys.
Natural Keys
Wikipedia (http://www.wikipedia.com) defines the term natural key as “a candidate key that has a logical relationship to the attributes within that row” (at least it did when this chapter was written). In other words, it is a “real” attribute of an entity that the user logically uses to uniquely identify each instance of an entity. From our previous examples, all of our candidate keys so far have been natural keys: employee number, Social Security number (SSN), ISBN, and the (Publisher_Name, Book_Name, Edition) composite key.
Some common examples of good natural keys are as follows:
• For people: Driver’s license numbers (including the state of issue), company identification
number, or other assigned IDs (e.g., customer numbers or employee numbers)
• For transactional documents (e.g., invoices, bills, and computer-generated notices): These usually have some sort of number assigned when they are printed.
• For products for sale: These could be product numbers (product names are likely not unique).
• For companies that clients deal with: These are commonly assigned a customer/client number for tracking.
• For buildings: This is usually the complete address, including the postal code.
• For mail: These could be the addressee’s name and address and the date the item was sent.
Be careful when choosing a natural key. Ideally, you are looking for something that is stable, that you can control, and that is definitely going to allow you to uniquely identify every row in your database.
One thing of interest here is that what might be considered a natural key in your database is often not actually a natural key in the place where it is defined, for example, the driver’s license number of a person. In the example database, this is a number that every person has (or may need before inclusion in our database, perhaps). However, the value of the driver’s license number is just a series of digits. This number did not actually occur in nature tattooed on the back of the person’s neck at birth. In the database where that number was created, it was actually more of a surrogate key (which we will define in a later section).
Given that three-part names are common in the United States, it is relatively rare that you’ll have two people working in the same company or attending the same school who have the same three names. (Of course, if you work in a company with 200,000 people, the odds go up that you will have duplicates.) If you include prefixes and suffixes, duplicates are a bit less likely, but “rare” or even “extremely rare” is not a guarantee, so a name cannot be implemented as a safe key. If you happen to hire two people called Sir Lester James Fredingston III, then the second of them probably isn’t going to take kindly to being called Les for short just so your database system isn’t compromised.
One notable profession where names must be unique is acting. No two actors who have their union cards can have the same name. Some change their names from Archibald Leach to something more pleasant like Cary Grant, but in some cases the person wants to keep his or her name, so in the actors database they add a uniquifier to the name to make it unique.
A uniquifier might be some meaningless value added to a column or set of columns to give you a unique key. For example, five people (up from four, last edition) are listed on the Internet Movie Database site (http://www.imdb.com) with the name Gary Grant (not Cary, but Gary). Each has a different number associated with his name to make him a unique Gary Grant. (Of course, none of these people has hit the big time, but watch out; it could be happening soon!)
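To make the idea of a uniquifier concrete, here is a small sketch, again using SQLite through Python's sqlite3 module as a stand-in for a real database server. The Actor table, its column names, and the add_actor helper are hypothetical; the point is only that the name plus a meaningless number forms the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The key is the name plus a meaningless number that distinguishes namesakes.
conn.execute("""
    CREATE TABLE Actor (
        Actor_Name TEXT    NOT NULL,
        Uniquifier INTEGER NOT NULL,
        PRIMARY KEY (Actor_Name, Uniquifier)
    )
""")

def add_actor(name):
    """Insert an actor and return the uniquifier assigned for that name."""
    (next_value,) = conn.execute(
        "SELECT COALESCE(MAX(Uniquifier), 0) + 1 FROM Actor WHERE Actor_Name = ?",
        (name,),
    ).fetchone()
    conn.execute("INSERT INTO Actor VALUES (?, ?)", (name, next_value))
    return next_value

gary_numbers = [add_actor("Gary Grant") for _ in range(3)]
cary_number = add_actor("Cary Grant")

print(gary_numbers, cary_number)
```

Each Gary Grant receives the next unused number for that name, while Cary Grant starts again at 1. This single-connection sketch ignores concurrency; a multiuser system would need to protect the read-then-insert step.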
■ Tip We tend to think of names in most systems as a kind of semiunique natural key. This isn’t good enough for identifying a single row, but it’s great for a human to find a value. The phone book is a good example of this. Say you need to find Ray Janakowski in the phone book. There might be more than one person with this name, but it might be a “good enough” way to look up a person’s phone number. This semiuniqueness is a very interesting attribute of a table and should be documented for later use, but only in rare cases would you use the semiunique values and make a key from them using a uniquifier.
Smart Keys
A commonly occurring type of natural key in computer systems is a smart or intelligent key. Some identifiers will have additional information embedded in them, often as an easy way to build a unique value for helping a human identify some real-world thing. In most cases, the smart key can be disassembled into its parts. In some cases, however, the data will probably not jump out at you. Take the following example of the fictitious product serial number XJV102329392000123:
• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production
The simple-to-use smart key values serve an important purpose to the end user, in that the technician who received the product can decipher the value and see that in fact this product was built in a lot that contained defective whatchamajiggers, and he needs to replace it. The essential thing for us during the logical design phase is to find all the bits of information that make up the smart keys because each of these values is likely going to need to be stored in its own column.

Smart keys, while useful in some cases, often present the database implementor with problems that will occur over time. When at all possible, instead of implementing a single column with all of these values, consider having multiple column values for each of the different pieces of information and calculating the value of the smart key. The end user gets what they need, and you in turn get what you need, a column value that never needs to be broken down into parts to work with.
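The following sketch illustrates storing each piece of the smart key separately and calculating the serial number from the parts. The ProductUnit class, its field names, and the field widths (1, 2, 4, 3, 1, 1, and 6 characters) are assumptions inferred from the single example value XJV102329392000123, not a documented format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductUnit:
    """Each piece of information lives in its own field (column)."""
    product_type: str      # 1 character, e.g. "X"
    product_subtype: str   # 2 characters, e.g. "JV"
    lot: int               # 4 digits
    day_of_year: int       # 3 digits
    year_digit: int        # 1 digit
    color: int             # 1 digit
    production_order: int  # 6 digits

    @property
    def serial_number(self) -> str:
        """Calculate the smart key from the stored parts."""
        return (
            f"{self.product_type}{self.product_subtype}"
            f"{self.lot:04d}{self.day_of_year:03d}"
            f"{self.year_digit:d}{self.color:d}{self.production_order:06d}"
        )

def parse_serial_number(serial: str) -> ProductUnit:
    """Split an existing 18-character serial number into its parts."""
    if len(serial) != 18:
        raise ValueError("expected an 18-character serial number")
    return ProductUnit(
        product_type=serial[0],
        product_subtype=serial[1:3],
        lot=int(serial[3:7]),
        day_of_year=int(serial[7:10]),
        year_digit=int(serial[10]),
        color=int(serial[11]),
        production_order=int(serial[12:18]),
    )

unit = parse_serial_number("XJV102329392000123")
print(unit.lot, unit.day_of_year, unit.serial_number)
```

Parsing is needed only once, when legacy values are loaded. After that, queries can filter on the lot or the color directly, and the serial number is derived for display.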
A big problem with smart keys is that it is possible to run out of unique values for the constituent parts, or some part of the key (e.g., the product type or subtype) may change. It is imperative that you be very careful and plan ahead if you use smart keys to represent multiple pieces of information. When you have to change the format of smart keys, it often becomes a large validation problem to make sure that different values of the smart key are actually valid.
■ Note Smart keys are useful tools to communicate a lot of information to the user in a small package. However, all the bits of information that make up the smart key need to be identified, documented, and implemented in a straightforward manner. Optimum SQL code expects the data to all be stored in individual columns, and as such, it is of great importance that you never have to base computing decisions on decoding the value. We will talk more about the subject of choosing implementation keys in Chapter 5.
Surrogate Keys
Surrogate keys (sometimes described as artificial keys) are kind of the opposite of natural keys. The word surrogate means “something that substitutes for,” and in this case, a surrogate key substitutes for a natural key. Sometimes there may not be a natural key that you think is stable or reliable enough to use, in which case you may decide to use a surrogate key. In reality, many of our examples of natural keys were actually surrogate keys in their original database but were elevated to a natural status by usage in the “real” world.
A surrogate key can uniquely identify each instance of an entity, but it has no actual meaning with regard to that entity other than to represent existence. Surrogate keys are usually maintained by the system. Common methods for creating surrogate key values are using a monotonically increasing number (e.g., an Identity column), some form of hash function, or even a globally unique identifier (GUID), which is a very long identifier that is unique on all machines in the world.

The concept of a surrogate key can be troubling to purists. Since the surrogate key does not describe the row at all, can it really be an attribute of the row? Nevertheless, an exceptionally nice aspect of a surrogate key is that the value of the key should never change. This, coupled with the fact that surrogate keys are always a single column, makes several aspects of implementation far easier.

The only reason for the existence of the surrogate key is to identify a row. The main reason for an artificial key is to provide a key that an end user never has to view and never has to interact with. Think of it like your driver’s license number, an ID number that is given to you when you begin to drive. It may have no other meaning than a number that helps a police officer look up who you are when you’ve been testing to see just how fast you can go in sixth gear (although in the United Kingdom it is a scrambled version of the date of birth). The surrogate key should always have some element that is just randomly chosen, and it should never be based on data that can change. If your driver’s license number were a smart key and decoded to include your hair color, the driver’s license number might change frequently (for some youth and those of us whose hair has turned a different color). No, this value is good only for looking you up in a database.
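Two of the generation methods mentioned earlier, a monotonically increasing number and a GUID, can be sketched as follows. SQLite's AUTOINCREMENT column stands in here for a SQL Server Identity column, and Python's uuid.uuid4 supplies the GUID; the Driver table and its columns are hypothetical names chosen to match the driver's license analogy.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# Driver_Id is generated by the database; Driver_Guid is generated by the caller.
conn.execute("""
    CREATE TABLE Driver (
        Driver_Id   INTEGER PRIMARY KEY AUTOINCREMENT,
        Driver_Guid TEXT NOT NULL UNIQUE,
        Driver_Name TEXT NOT NULL
    )
""")

def add_driver(name):
    """Insert a driver and return the system-generated surrogate value."""
    cursor = conn.execute(
        "INSERT INTO Driver (Driver_Guid, Driver_Name) VALUES (?, ?)",
        (str(uuid.uuid4()), name),
    )
    return cursor.lastrowid

first_id = add_driver("Driver A")
second_id = add_driver("Driver B")

# Changing descriptive data leaves the surrogate values untouched.
conn.execute(
    "UPDATE Driver SET Driver_Name = ? WHERE Driver_Id = ?",
    ("Driver A (renamed)", first_id),
)
ids_after_update = [
    row[0] for row in conn.execute("SELECT Driver_Id FROM Driver ORDER BY Driver_Id")
]

print(first_id, second_id, ids_after_update)
```

Neither value is derived from anything about the driver, so a change of name, address, or hair color never forces a key change.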
Usually a true surrogate key is never shared with any users. It will be a value generated on the computer system that is hidden from use, while the user directly accesses only the natural keys’ values. Probably the best reason for this definition is that once a user has access to a value, it then may need to be modified. For example, if you were customer 0000013 or customer 00000666, you might request a change.
■ Note In some ways, surrogate keys should probably not even be mentioned in the logical design section of this book, but it is important to know of their existence, since they will undoubtedly still crop up in some logical designs. A typical flame war on the newsgroups (and amongst the tech reviewers of this book) is concerning whether surrogate keys are a good idea. I’m a proponent of their use (as you will see), but I try to be fairly open in my approach in the book to demonstrate both ways of doing things. Generally speaking, if a value is going to be accessible to the end user, my preference is that it really needs to be modifiable and readable. You can also have two surrogate keys in a table: one that is the unchanging “address” of a value, the other that is built for user consumption (that is compact, readable, and changeable if it somehow offends your user).
Just as the driver’s license number probably has no meaning to the police officer other than a means to quickly call up and check your records, the surrogate is used to make working with the data programmatically easier. Since the source of the value for the surrogate key does not have any correspondence to something a user might care about, once a value has been associated with a row, there is never a reason to change the value. This is an exceptionally nice aspect of surrogate keys. The fact that the value of the key does not change, coupled with the fact that it is always a single column, makes several aspects of implementation far easier. This will be made clearer later in the book when we choose a primary key.
Thinking back to the driver’s license analogy, if the driver’s license card has just a single value (the surrogate key) on it, how would Officer Uberter Sloudoun determine whether you were actually the person identified? He couldn’t, so there are other attributes listed, such as name, birth date, and usually your picture, which is an excellent unique key for a human to deal with (except possibly for identical twins, of course). In this very same way, a table ought to have other keys defined as well, or it is not a proper table.
Consider the earlier example of a product identifier consisting of seven parts:
• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production
A natural key would consist of these seven parts. There is also a product serial number, which is the concatenation of the values, such as XJV102329392000123, to identify the row. Say you also have a surrogate key on the table that has a value of 3384038483. If the only key defined on the rows is the surrogate, the following situation might occur:
SurrogateKey ProductSerialNumber
–––––––––––– –––––––––––––––––––
3384038483 XJV102329392000123
3384434222 ZJV104329382043534
The first two rows are not duplicates, but since the surrogate key values have no real meaning, in essence these are duplicate rows, since the user could not effectively tell them apart.
This sort of problem is common, because most people using surrogate keys do not understand that only having a surrogate key opens them up to having rows with duplicate data in the columns where the data has some logical relationship to each other. A user looking at the preceding table would have no clue which row actually represented the product he or she was after, or if both rows did.
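The risk can be demonstrated with a short sketch that compares a table keyed only by a surrogate with one that also declares the serial number as a key. As before, SQLite via Python's sqlite3 module is used only so the example is runnable; the two table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Only the surrogate is a key here, so nothing protects the serial number.
conn.execute("""
    CREATE TABLE ProductSurrogateOnly (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL
    )
""")

# Here the natural key is declared as well, alongside the surrogate.
conn.execute("""
    CREATE TABLE ProductWithNaturalKey (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL UNIQUE
    )
""")

serial = "XJV102329392000123"

# Both inserts succeed: each row simply receives a new surrogate value.
for _ in range(2):
    conn.execute(
        "INSERT INTO ProductSurrogateOnly (ProductSerialNumber) VALUES (?)",
        (serial,),
    )
duplicate_count = conn.execute(
    "SELECT COUNT(*) FROM ProductSurrogateOnly WHERE ProductSerialNumber = ?",
    (serial,),
).fetchone()[0]

# With the natural key declared, the second insert is rejected.
conn.execute(
    "INSERT INTO ProductWithNaturalKey (ProductSerialNumber) VALUES (?)",
    (serial,),
)
try:
    conn.execute(
        "INSERT INTO ProductWithNaturalKey (ProductSerialNumber) VALUES (?)",
        (serial,),
    )
    second_accepted = True
except sqlite3.IntegrityError:
    second_accepted = False

print(duplicate_count, second_accepted)
```

The surrogate-only table happily stores the same product twice under two different surrogate values, which is exactly the situation a user cannot untangle.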
■ Note When doing logical design, I tend to model each table with a surrogate key, since during the design process I may not yet know what the final keys will in fact turn out to be. This approach will become obvious throughout the book, especially in the case study presented throughout much of the book.
Missing Values (NULLs)
If you look up the definition of a “loaded subject” in a computer dictionary, you will likely find the word NULL. In the database, there must exist some way to say that the value of a given column is not known or that the value is irrelevant. Often, a value outside of legitimate actual range (sometimes referred to as a sentinel value) is used to denote this value. For decades, programmers have used ancient dates in a date column to indicate that a certain value does not matter, they use a negative value where it does not make sense in the context of a column, or they simply use a text string of 'UNKNOWN' or 'N/A'. These approaches are fine, but special coding is required to deal with these values, for example:
IF (value<>'UNKNOWN') THEN
This is OK if it needs to be done only once. The problem, of course, is that this special coding is needed every time a new type of column is added. Instead, it is common to use a value of NULL, which in relational theory means an empty set or a set with no value. Going back to Codd’s rules, the third rule states the following:
NULL values (distinct from empty character string or a string of blank characters or zero) are supported in the RDBMS for representing missing information in a systematic way, independent of data type.
There are a couple of properties of NULL that you need to consider:
• Any value concatenated with NULL is NULL. NULL can represent any valid value, so if an unknown value is concatenated with a known value, the result is still an unknown value.
• All math operations with NULL will return NULL, for the very same reason that any value concatenated with NULL returns NULL.
• Logical comparisons can get tricky when NULL is introduced.
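These properties are easy to observe directly. The sketch below runs a few expressions in SQLite through Python's sqlite3 module, where SQL NULL comes back as Python's None. SQLite is used only as a convenient stand-in: its concatenation operator is ||, and other products, including SQL Server, use different operators and have settings that can affect NULL handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a single-value query and return that value (NULL arrives as None)."""
    return conn.execute(sql).fetchone()[0]

concatenated = scalar("SELECT 'abc' || NULL")  # known value joined to unknown
summed = scalar("SELECT 10 + NULL")            # math with an unknown operand
compared = scalar("SELECT NULL = NULL")        # neither true nor false
is_null = scalar("SELECT NULL IS NULL")        # the dedicated test for NULL

# A WHERE clause keeps only rows where the condition is true, so an
# unknown comparison filters the row out.
rows = conn.execute("SELECT 1 WHERE NULL = NULL").fetchall()

print(concatenated, summed, compared, is_null, rows)
```

The comparison NULL = NULL does not evaluate to true, which is why a separate IS NULL test exists and why queries that compare nullable columns need extra care.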