Apress - Pro SQL Server 2008 Relational Database Design and Implementation (2008)04


The choice of primary key is largely a matter of convenience and what is easiest to use. We’ll discuss primary keys later in this chapter in the context of relationships. The important thing to remember is that when you have values that should exist only once in the database, you need to protect against duplicates.

Choosing Keys

While keys can consist of any number of columns, it is best to try to limit the number of columns in a key as much as possible. For example, you may have a Book table with the columns Publisher_Name, Publisher_City, ISBN_Number, Book_Name, and Edition. From these attributes, the following three keys might be defined:

• Publisher_Name, Book_Name, Edition: A publisher will likely publish more than one book. Also, it is safe to assume that book names are not unique across all books. However, it is probably true that the same publisher will not publish two books with the same title and the same edition (at least, we assume that this is true!).

• ISBN_Number: The ISBN number is the unique identification number assigned to a book when it is published.

• Publisher_City, ISBN_Number: Because ISBN_Number is unique, it follows that Publisher_City and ISBN_Number combined is also unique.

The choice of (Publisher_Name, Book_Name, Edition) as a composite candidate key seems valid, but the (Publisher_City, ISBN_Number) key requires more thought. The implication of this key is that in every city, ISBN_Number can be used again, a conclusion that is obviously not appropriate. This is a common problem with composite keys, which are often not thought out properly. In this case, you might choose ISBN_Number as the primary key (PK) and (Publisher_Name, Book_Name, Edition) as the alternate key (AK).
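To make the distinction between the PK and the AK concrete, here is a minimal sketch of the Book table. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the column types are assumptions, and the sample rows and the try_insert helper are invented for illustration. The AK includes Edition, following the bulleted list of keys above.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Book (
        ISBN_Number    TEXT    NOT NULL PRIMARY KEY,  -- the PK
        Publisher_Name TEXT    NOT NULL,
        Publisher_City TEXT    NOT NULL,
        Book_Name      TEXT    NOT NULL,
        Edition        INTEGER NOT NULL,
        UNIQUE (Publisher_Name, Book_Name, Edition)   -- the AK
    )
""")

def try_insert(row):
    """Hypothetical helper: True if the row is accepted, False if a key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Book VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Invented sample data: (ISBN, publisher, city, title, edition).
first       = try_insert(("111", "Example Press", "Springfield", "Keys 101", 1))
same_isbn   = try_insert(("111", "Other House", "Shelbyville", "Other Book", 1))
same_title  = try_insert(("222", "Example Press", "Springfield", "Keys 101", 1))
new_edition = try_insert(("333", "Example Press", "Springfield", "Keys 101", 2))
```

The second row is rejected by the PK, the third by the AK, and the fourth is accepted because a new edition makes the AK value different. Note that nothing here declares (Publisher_City, ISBN_Number) as a key.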

■ Note It is important not to confuse unique indexes with keys. There may be valid performance-based reasons to implement the Publisher_City, ISBN_Number index in your SQL Server database. However, this would not be identified as a key of a table. In Chapter 6, we’ll discuss implementing keys, and in Chapter 8, we’ll cover implementing indexes for data access enhancement.

Having established what keys are, we’ll next discuss the two main types of keys: natural keys (including smart keys) and surrogate keys.

Natural Keys

Wikipedia (http://www.wikipedia.com) defines the term natural key as “a candidate key that has a logical relationship to the attributes within that row” (at least it did when this chapter was written). In other words, it is a “real” attribute of an entity that the user logically uses to uniquely identify each instance of an entity. From our previous examples, all of our candidate keys so far, namely employee number, Social Security number (SSN), ISBN, and the (Publisher_Name, Book_Name, Edition) composite key, have been examples of natural keys.

Some common examples of good natural keys are as follows:

• For people: Driver’s license numbers (including the state of issue), company identification number, or other assigned IDs (e.g., customer numbers or employee numbers).

• For transactional documents (e.g., invoices, bills, and computer-generated notices): These usually have some sort of number assigned when they are printed.

• For products for sale: These could be product numbers (product names are likely not unique).


• For companies that clients deal with: These are commonly assigned a customer/client number for tracking.

• For buildings: This is usually the complete address, including the postal code.

• For mail: These could be the addressee’s name and address and the date the item was sent.

Be careful when choosing a natural key. Ideally, you are looking for something that is stable, that you can control, and that is definitely going to allow you to uniquely identify every row in your database.

One thing of interest here is that what might be considered a natural key in your database is often not actually a natural key in the place where it is defined, for example, the driver’s license number of a person. In the example database, this is a number that every person has (or may need before inclusion in our database, perhaps). However, the value of the driver’s license number is just a series of integers. This number did not actually occur in nature tattooed on the back of the person’s neck at birth. In the database where that number was created, it was actually more of a surrogate key (which we will define in a later section).

Given that three-part names are common in the United States, it is relatively rare that you’ll have two people working in the same company or attending the same school who have the same three names. (Of course, if you work in a company with 200,000 people, the odds will go up that you will have duplicates.) If you include prefixes and suffixes, it is a bit less likely, but “rare” or even “extremely rare” cannot be implemented in a manner that makes it a safe key. If you happen to hire two people called Sir Lester James Fredingston III, then the second of them probably isn’t going to take kindly to being called Les for short just so your database system isn’t compromised.

One notable profession where names must be unique is acting. No two actors who have their union cards can have the same name. Some change their names from Archibald Leach to something more pleasant like Cary Grant, but in some cases the person wants to keep his or her name, so in the actors database they add a uniquifier to the name to make it unique.

A uniquifier might be some meaningless value added to a column or set of columns to give you a unique key. For example, five people (up from four, last edition) are listed on the Internet Movie Database site (http://www.imdb.com) with the name Gary Grant (not Cary, but Gary). Each has a different number associated with his name to make him a unique Gary Grant. (Of course, none of these people has hit the big time, but watch out: it could be happening soon!)

■ Tip We tend to think of names in most systems as a kind of semiunique natural key. This isn’t good enough for identifying a single row, but it’s great for a human to find a value. The phone book is a good example of this. Say you need to find Ray Janakowski in the phone book. There might be more than one person with this name, but it might be a “good enough” way to look up a person’s phone number. This semiuniqueness is a very interesting attribute of a table and should be documented for later use, but only in rare cases would you use the semiunique values and make a key from them using a uniquifier.

Smart Keys

A commonly occurring type of natural key in computer systems is a smart or intelligent key. Some identifiers will have additional information embedded in them, often as an easy way to build a unique value for helping a human identify some real-world thing. In most cases, the smart key can be disassembled into its parts. In some cases, however, the data will probably not jump out at you. Take the following example of the fictitious product serial number XJV102329392000123:

• X: Type of product (LCD television)

• JV: Subtype of product (32-inch console)


• 1023: Lot that the product was produced in (the 1023rd batch produced)

• 293: Day of year

• 9: Last digit of year

• 2: Color

• 000123: Order of production

The simple-to-use smart key values serve an important purpose to the end user, in that the technician who received the product can decipher the value and see that in fact this product was built in a lot that contained defective whatchamajiggers, and he needs to replace it. The essential thing for us during the logical design phase is to find all the bits of information that make up the smart keys because each of these values is likely going to need to be stored in its own column.

Smart keys, while useful in some cases, often present the database implementor with problems that will occur over time. When at all possible, instead of implementing a single column with all of these values, consider having multiple column values for each of the different pieces of information and calculating the value of the smart key. The end user gets what they need, and you in turn get what you need, a column value that never needs to be broken down into parts to work with.

A big problem with smart keys is that it is possible to run out of unique values for the constituent parts, or some part of the key (e.g., the product type or subtype) may change. It is imperative that you be very careful and plan ahead if you use smart keys to represent multiple pieces of information. When you have to change the format of smart keys, it often becomes a large validation problem to make sure that different values of the smart key are actually valid.
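As a rough illustration of storing the parts separately and calculating the smart key from them, consider the following Python sketch. The ProductParts class, the decode_serial function, and the fixed field widths (1, 2, 4, 3, 1, 1, and 6 characters) are all assumptions inferred from the single example serial number; a real format would need to be documented by whoever issues the numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductParts:
    """Hypothetical holder for the parts of the serial number, one field per part."""
    product_type: str      # e.g., "X"  (LCD television)
    product_subtype: str   # e.g., "JV" (32-inch console)
    lot: int               # production lot
    day_of_year: int
    year_digit: int        # last digit of year
    color: int
    production_order: int

    def serial_number(self) -> str:
        # Calculate the smart key from the individual values,
        # zero-padding each numeric part to its assumed width.
        return (
            f"{self.product_type}{self.product_subtype}"
            f"{self.lot:04d}{self.day_of_year:03d}"
            f"{self.year_digit:1d}{self.color:1d}"
            f"{self.production_order:06d}"
        )

def decode_serial(serial: str) -> ProductParts:
    """Split a serial number into its parts using the assumed fixed widths."""
    if len(serial) != 18:
        raise ValueError("expected an 18-character serial number")
    return ProductParts(
        product_type=serial[0],
        product_subtype=serial[1:3],
        lot=int(serial[3:7]),
        day_of_year=int(serial[7:10]),
        year_digit=int(serial[10]),
        color=int(serial[11]),
        production_order=int(serial[12:18]),
    )

parts = decode_serial("XJV102329392000123")
rebuilt = parts.serial_number()
```

With the parts held in their own fields, a question such as "which products came from lot 1023?" becomes a comparison on one value rather than a substring search, and the smart key shown to the user is simply recomputed when needed.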

■ Note Smart keys are useful tools to communicate a lot of information to the user in a small package. However, all the bits of information that make up the smart key need to be identified, documented, and implemented in a straightforward manner. Optimum SQL code expects the data to all be stored in individual columns, and as such, it is of great importance that you never need to base computing decisions on decoding the value. We will talk more about the subject of choosing implementation keys in Chapter 5.

Surrogate Keys

Surrogate keys (sometimes described as artificial keys) are kind of the opposite of natural keys. The word surrogate means “something that substitutes for,” and in this case, a surrogate key substitutes for a natural key. Sometimes there may not be a natural key that you think is stable or reliable enough to use, in which case you may decide to use a surrogate key. In reality, many of our examples of natural keys were actually surrogate keys in their original database but were elevated to a natural status by usage in the “real” world.

A surrogate key can uniquely identify each instance of an entity, but it has no actual meaning with regard to that entity other than to represent existence. Surrogate keys are usually maintained by the system. Common methods for creating surrogate key values are using a monotonically increasing number (e.g., an Identity column), some form of hash function, or even a globally unique identifier (GUID), which is a very long identifier that is unique on all machines in the world.

The concept of a surrogate key can be troubling to purists. Since the surrogate key does not describe the row at all, can it really be an attribute of the row? Nevertheless, an exceptionally nice aspect of a surrogate key is that the value of the key should never change. This, coupled with the fact that surrogate keys are always a single column, makes several aspects of implementation far easier.

The only reason for the existence of the surrogate key is to identify a row. The main reason for an artificial key is to provide a key that an end user never has to view and never has to interact with. Think of it like your driver’s license number, an ID number that is given to you when you begin to drive. It may have no other meaning than a number that helps a police officer look up who you are when you’ve been testing to see just how fast you can go in sixth gear (although in the United Kingdom it is a scrambled version of the date of birth). The surrogate key should always have some element that is just randomly chosen, and it should never be based on data that can change. If your driver’s license number were a smart key and decoded to include your hair color, the driver’s license number might change frequently (for some youths and those of us whose hair has turned a different color). No, this value is good only for looking you up in a database.

Usually a true surrogate key is never shared with any users. It will be a value generated on the computer system that is hidden from use, while the user directly accesses only the natural keys’ values. Probably the best reason for this definition is that once a user has access to a value, it then may need to be modified. For example, if you were customer 0000013 or customer 00000666, you might request a change.

■ Note In some ways, surrogate keys should probably not even be mentioned in the logical design section of this book, but it is important to know of their existence, since they will undoubtedly still crop up in some logical designs. A typical flame war on the newsgroups (and amongst the tech reviewers of this book) is concerning whether surrogate keys are a good idea. I’m a proponent of their use (as you will see), but I try to be fairly open in my approach in the book to demonstrate both ways of doing things. Generally speaking, if a value is going to be accessible to the end user, my preference is that it really needs to be modifiable and readable. You can also have two surrogate keys in a table: one that is the unchanging “address” of a value, the other that is built for user consumption (that is compact, readable, and changeable if it somehow offends your user).

Just as the driver’s license number probably has no meaning to the police officer other than a means to quickly call up and check your records, the surrogate is used to make working with the data programmatically easier. Since the source of the value for the surrogate key does not have any correspondence to something a user might care about, once a value has been associated with a row, there is not ever a reason to change the value. This is an exceptionally nice aspect of surrogate keys. The fact that the value of the key does not change, coupled with the fact that it is always a single column, makes several aspects of implementation far easier. This will be made clearer later in the book when choosing a primary key.

Thinking back to the driver’s license analogy, if the driver’s license card has just a single value (the surrogate key) on it, how would Officer Uberter Sloudoun determine whether you were actually the person identified? He couldn’t, so there are other attributes listed, such as name, birth date, and usually your picture, which is an excellent unique key for a human to deal with (except possibly for identical twins, of course). In this very same way, a table ought to have other keys defined as well, or it is not a proper table.

Consider the earlier example of a product identifier consisting of seven parts:

• X: Type of product (LCD television)

• JV: Subtype of product (32-inch console)

• 1023: Lot that the product was produced in (the 1023rd batch produced)

• 293: Day of year

• 9: Last digit of year

• 2: Color

• 000123: Order of production

A natural key would consist of these seven parts. There is also a product serial number, which is the concatenation of the values such as XJV102329392000123 to identify the row. Say you also have a surrogate key on the table that has a value of 3384038483. If the only key defined on the rows is the surrogate, the following situation might occur:

SurrogateKey  ProductSerialNumber
------------  -------------------
3384038483    XJV102329392000123
3384434222    ZJV104329382043534

The first two rows are not duplicates, but since the surrogate key values have no real meaning, in essence these are duplicate rows, since the user could not effectively tell them apart.

This sort of problem is common, because most people using surrogate keys do not understand that only having a surrogate key opens them up to having rows with duplicate data in the columns where the data has some logical relationship to each other. A user looking at the preceding table would have no clue which row actually represented the product he or she was after, or if both rows did.
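The effect of relying on a surrogate key alone can be sketched as follows. This again uses Python's sqlite3 module as a stand-in for SQL Server; the two table names and the insert_twice helper are hypothetical, and SQLite's INTEGER PRIMARY KEY plays the role that an Identity column would play in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only the surrogate key is declared.
    CREATE TABLE ProductSurrogateOnly (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL
    );
    -- The surrogate key plus an alternate key on the natural value.
    CREATE TABLE ProductWithAlternateKey (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL UNIQUE
    );
""")

def insert_twice(table):
    """Insert the same serial number twice; return how many rows were stored."""
    for _ in range(2):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    f"INSERT INTO {table} (ProductSerialNumber) VALUES (?)",
                    ("XJV102329392000123",),
                )
        except sqlite3.IntegrityError:
            pass  # the alternate key rejected the duplicate
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

surrogate_only_rows = insert_twice("ProductSurrogateOnly")
alternate_key_rows = insert_twice("ProductWithAlternateKey")
```

The first table ends up with two rows that differ only in a value the user never sees, while the second table keeps a single row because the serial number is also declared unique.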

■ Note When doing logical design, I tend to model each table with a surrogate key, since during the design process I may not yet know what the final keys will in fact turn out to be. This approach will become obvious throughout the book, especially in the case study presented throughout much of the book.

Missing Values (NULLs)

If you look up the definition of a “loaded subject” in a computer dictionary, you will likely find the word NULL. In the database, there must exist some way to say that the value of a given column is not known or that the value is irrelevant. Often, a value outside the legitimate range (sometimes referred to as a sentinel value) is used to denote this. For decades, programmers have used ancient dates in a date column to indicate that a certain value does not matter, used a negative value where it does not make sense in the context of a column, or simply used a text string of 'UNKNOWN' or 'N/A'. These approaches are fine, but special coding is required to deal with these values, for example:

IF (value<>'UNKNOWN') THEN

This is OK if it needs to be done only once. The problem, of course, is that this special coding is needed every time a new type of column is added. Instead, it is common to use a value of NULL, which in relational theory means an empty set or a set with no value. Going back to Codd’s rules, the third rule states the following:

NULL values (distinct from empty character string or a string of blank characters or zero) are supported in the RDBMS for representing missing information in a systematic way, independent of data type.

There are a couple of properties of NULL that you need to consider:

• Any value concatenated with NULL is NULL. NULL can represent any valid value, so if an unknown value is concatenated with a known value, the result is still an unknown value.

• All math operations with NULL will return NULL, for the very same reason that any value concatenated with NULL returns NULL.

• Logical comparisons can get tricky when NULL is introduced.
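These three properties can be observed directly. The sketch below uses Python's sqlite3 module rather than SQL Server, so treat the details as assumptions: SQLite concatenates with the || operator, whereas SQL Server uses +, and SQL Server's behavior for concatenation can depend on session settings. The scalar helper is hypothetical. SQLite returns NULL to Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Hypothetical helper: run a query and return its single value."""
    return conn.execute(sql).fetchone()[0]

# Concatenating a known value with NULL yields NULL.
concatenated = scalar("SELECT 'Fred' || NULL")

# Math with NULL yields NULL.
added = scalar("SELECT 1 + NULL")

# Comparing NULL with NULL is neither true nor false; it is NULL.
compared = scalar("SELECT NULL = NULL")

# IS NULL is the test that does return a definite answer.
is_null_check = scalar("SELECT NULL IS NULL")
```

The first three results are all None (NULL), which is why a comparison such as value = NULL never matches any row; only the IS NULL test returns a definite true.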
