Joe Celko's SQL for Smarties - Advanced SQL Programming, Part 12


Given:

1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight

prove that:

(day, hour, gate) -> flight.

3) (day, hour) -> (day, hour); Reflexive
4) (day, hour, gate) -> (day, hour); Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight; Transitive 2 and 5
Q.E.D.
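The same conclusion can be checked mechanically. What follows is a minimal sketch, not part of the original proof: it computes the attribute closure of (day, hour, gate) under the two given FDs and tests whether flight lands in it. The function name closure and the frozenset representation of an FD are my own assumptions for illustration.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left side is already in the result.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# The two given FDs, each as a (left side, right side) pair.
GIVEN = [
    (frozenset({"day", "hour", "gate"}), frozenset({"pilot"})),
    (frozenset({"day", "hour", "pilot"}), frozenset({"flight"})),
]

gate_closure = closure({"day", "hour", "gate"}, GIVEN)
proved = "flight" in gate_closure
print(sorted(gate_closure), proved)
```

The closure test tells you that a derivation exists; it does not show the individual axiom steps the way the hand proof does.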

The answer is to start by attempting to derive each of the FDs from the rest of the set. What we get is several short proofs, each requiring different “given” FDs in order to get to the derived FD.

Here is a list of each of the proofs used to derive the ten fragmented FDs in the problem. With each derivation, we include every derivation step and the legal FD calculus operation that allows us to make that step.

An additional operation that we include here, which was not included in the axioms we listed earlier, is left reduction. Left reduction says that if XX → Y, then X → Y. The reason it was not included is that this is actually a theorem, and not one of the basic axioms (a side problem: can you derive left reduction?).

Prove: (day, hour, pilot) -> gate
a) day -> day; Reflexive
b) (day, hour, pilot) -> day; Augmentation (a)
c) (day, hour, pilot) -> (day, flight); Union (6, b)
d) (day, hour, pilot) -> gate; Transitive (c, 3)
Q.E.D.

Prove: (day, hour, gate) -> pilot
a) day -> day; Reflexive
b) (day, hour, gate) -> day; Augmentation (a)
c) (day, hour, gate) -> (day, flight); Union (9, b)
d) (day, hour, gate) -> pilot; Transitive (c, 4)
Q.E.D.

Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate; Left reduction (b)
Q.E.D.

Prove: (day, flight) -> pilot
a) (day, flight, gate) -> pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot; Left reduction (b)
Q.E.D.

Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, gate) -> (day, hour); Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) -> flight; Transitivity (c, 6)
Q.E.D.

Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> (day, hour, gate); Union (b, 5)
d) (day, hour, pilot) -> flight; Transitivity (c, 9)
Q.E.D.

Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination; Transitivity (6, 1)
Q.E.D.
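As a cross-check on the eight proofs, here is a hedged sketch that tries to derive each FD from the other nine by attribute closure. The list of ten FDs is reconstructed from the FD numbers cited in the proof steps, so treat it as an assumption; the helper names fd, closure, and derivable are hypothetical.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build an FD from two space-separated attribute lists."""
    return (frozenset(lhs.split()), frozenset(rhs.split()))


# Assumed problem statement, rebuilt from the numbers used in the proofs.
TEN_FDS = [
    fd("flight", "destination"),
    fd("flight", "hour"),
    fd("day flight", "gate"),
    fd("day flight", "pilot"),
    fd("day hour pilot", "gate"),
    fd("day hour pilot", "flight"),
    fd("day hour pilot", "destination"),
    fd("day hour gate", "pilot"),
    fd("day hour gate", "flight"),
    fd("day hour gate", "destination"),
]


def derivable(target, fds):
    """True when target follows from the other FDs in fds."""
    rest = [f for f in fds if f != target]
    lhs, rhs = target
    return rhs <= closure(lhs, rest)


underivable = [f for f in TEN_FDS if not derivable(f, TEN_FDS)]
for lhs, rhs in underivable:
    print(sorted(lhs), "->", sorted(rhs))
```

Under that reconstruction, only the two FDs with flight alone on the left side fail the test, which agrees with the eight derivations given here.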

Now that we’ve shown you how to derive eight of the ten FDs from other FDs, you can try mixing and matching the FDs into sets so that each set meets the following criteria:

1. Each attribute must be represented on either the left or right side of at least one FD in the set.

2. If a given FD is included in the set, then all the FDs needed to derive it cannot also be included.

3. If a given FD is excluded from the set, then the FDs used to derive it must be included.

This produces a set of “nonredundant covers,” which can be found through trial and error and common sense. For example, if we exclude (day, hour, gate) → flight, we must then include (day, hour, gate) → pilot, and vice versa, because each is used in the other’s derivation. If you want to be sure your search was exhaustive, however, you may want to apply a more mechanical method, which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all the combinations of sets of the FDs. The FDs (flight → destination) and (flight → hour) are excluded from the combination generation because they cannot be derived. This gives us 2^8, or 256, combinations of FDs. Each combination is then tested against the criteria.

Fortunately, a simple spreadsheet does all the tedious work. In this problem, the first criterion eliminates only 15 sets. Then the second criterion eliminates 152 sets, and the third criterion drops another 67. This leaves us with 22 possible covers, 5 of which are the answers we are looking for (we will explain the other 17 later).
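If you would rather not build the spreadsheet, the combination generation can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the text: it keeps the two underivable FDs in every candidate, generates the 256 combinations of the other eight, and applies the usual closure-based test (the candidate must imply all ten FDs, and no FD in it may follow from the rest) in place of the three criteria. Because the test is different, the sets it reports need not coincide with the ones listed in this section. The FD list is reconstructed from the proofs, and all helper names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build an FD from two space-separated attribute lists."""
    return (frozenset(lhs.split()), frozenset(rhs.split()))


FIXED = [fd("flight", "destination"), fd("flight", "hour")]
DERIVABLE = [
    fd("day flight", "gate"),
    fd("day flight", "pilot"),
    fd("day hour pilot", "gate"),
    fd("day hour pilot", "flight"),
    fd("day hour pilot", "destination"),
    fd("day hour gate", "pilot"),
    fd("day hour gate", "flight"),
    fd("day hour gate", "destination"),
]
ALL_FDS = FIXED + DERIVABLE


def implies(fds, target):
    lhs, rhs = target
    return rhs <= closure(lhs, fds)


def is_nonredundant_cover(candidate):
    covers_all = all(implies(candidate, f) for f in ALL_FDS)
    nothing_spare = all(
        not implies([g for g in candidate if g != f], f) for f in candidate
    )
    return covers_all and nothing_spare


tested = 0
covers = []
for size in range(len(DERIVABLE) + 1):
    for chosen in combinations(DERIVABLE, size):
        tested += 1
        candidate = FIXED + list(chosen)
        if is_nonredundant_cover(candidate):
            covers.append(candidate)

print(tested, "combinations tested")
for cover in covers:
    print([(sorted(l), sorted(r)) for l, r in cover])
```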

These five nonredundant covers are:

Set I:

flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate

Set II:

(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate

Set III:

(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight

Set IV:

flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight

Set V:

flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
(day, hour, pilot) -> flight

At this point, we perform unions on FDs with the same left-hand side and make tables for each grouping with the left-hand side as a key. We can also eliminate symmetrical FDs (defined as X → Y and Y → X, and written with a two-headed arrow, X ↔ Y) by collapsing them into the same table.
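The grouping step is mechanical enough to sketch. The following is an illustration only: it unions the FDs of Set IV by left-hand side and prints one shorthand table per group. It does not collapse symmetrical FDs or add the extra UNIQUE constraints, which still have to be worked out by hand. The names SET_IV, group_by_key, and shorthand_ddl are hypothetical.

```python
from collections import defaultdict

# Set IV, each FD as (left-hand side, single right-hand attribute).
SET_IV = [
    (("flight",), "destination"),
    (("flight",), "hour"),
    (("day", "flight"), "gate"),
    (("day", "hour", "gate"), "pilot"),
    (("day", "hour", "pilot"), "flight"),
]


def group_by_key(fds):
    """Union the right-hand sides of all FDs that share a left-hand side."""
    tables = defaultdict(list)
    for lhs, rhs in fds:
        tables[lhs].append(rhs)
    return dict(tables)


def shorthand_ddl(fds):
    """Emit one typeless CREATE TABLE per left-hand side, keyed on it."""
    statements = []
    for n, (key, dependents) in enumerate(group_by_key(fds).items(), start=1):
        columns = ", ".join(list(key) + dependents)
        key_list = ", ".join(key)
        statements.append(
            f"CREATE TABLE R{n} ({columns}, PRIMARY KEY ({key_list}));"
        )
    return statements


ddl = shorthand_ddl(SET_IV)
for statement in ddl:
    print(statement)
```

The four skeletons this prints can be compared with the tables of Solution 3, which carry the same columns and primary keys plus additional UNIQUE constraints.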

These possible schemas are at least in 3NF. They are given in shorthand SQL DDL (Data Declaration Language) without data type declarations.

Solution 1:

CREATE TABLE R1 (flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2 (day, hour, gate, flight, pilot,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (day, hour, pilot),
 UNIQUE (day, flight),
 UNIQUE (flight, hour));

Solution 2:

CREATE TABLE R1 (flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2 (day, flight, gate, pilot,
 PRIMARY KEY (day, flight));

CREATE TABLE R3 (day, hour, gate, flight,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (flight, hour));

CREATE TABLE R4 (day, hour, pilot, flight,
 PRIMARY KEY (day, hour, pilot));

Solution 3:

CREATE TABLE R1 (flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2 (day, flight, gate,
 PRIMARY KEY (day, flight));

CREATE TABLE R3 (day, hour, gate, pilot,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (day, hour, pilot),
 UNIQUE (day, hour, gate));

CREATE TABLE R4 (day, hour, pilot, flight,
 PRIMARY KEY (day, hour, pilot),
 UNIQUE (day, flight),
 UNIQUE (flight, hour));

Solution 4:

CREATE TABLE R1 (flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE R2 (day, flight, pilot,
 PRIMARY KEY (day, flight));

CREATE TABLE R3 (day, hour, gate, flight,
 PRIMARY KEY (day, hour, gate),
 UNIQUE (flight, hour));

CREATE TABLE R4 (day, hour, pilot, gate,
 PRIMARY KEY (day, hour, pilot));

These solutions are a mess, but they are a 3NF mess! Is there a better answer? Here is one in BCNF and only two tables, proposed by Chris Date (Date 1995, p. 224):

CREATE TABLE DailySchedules (flight, destination, hour,
 PRIMARY KEY (flight));

CREATE TABLE PilotSchedules (day, flight, gate, pilot,
 PRIMARY KEY (day, flight));
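To see what the two keys do and do not police, here is a small sketch that loads the two-table schema into an in-memory SQLite database through Python's sqlite3 module. The data types, the NOT NULL constraints, and the sample rows are assumptions added only so that the shorthand DDL will run; they are not part of Date's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DailySchedules
(flight TEXT NOT NULL,
 destination TEXT NOT NULL,
 hour INTEGER NOT NULL,
 PRIMARY KEY (flight));

CREATE TABLE PilotSchedules
(day TEXT NOT NULL,
 flight TEXT NOT NULL,
 gate TEXT NOT NULL,
 pilot TEXT NOT NULL,
 PRIMARY KEY (day, flight));
""")

# Invented sample rows.
conn.execute("INSERT INTO DailySchedules VALUES ('F100', 'ATL', 9)")
conn.execute("INSERT INTO PilotSchedules VALUES ('Mon', 'F100', 'G1', 'Smith')")


def try_insert(row):
    """Report whether PilotSchedules accepts the row."""
    try:
        conn.execute("INSERT INTO PilotSchedules VALUES (?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# Same (day, flight) again: the primary key stops it.
same_flight_again = try_insert(("Mon", "F100", "G2", "Jones"))

# Another flight at the same gate on the same day: nothing here stops it,
# because hour lives in the other table.
gate_clash = try_insert(("Mon", "F200", "G1", "Jones"))

print(same_flight_again, gate_clash)
```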

This is a workable schema, but we could expand the constraints to give us better performance and more precise error messages, since schedules are not likely to change:

CREATE TABLE DailySchedules (flight, hour, destination,
 UNIQUE (flight, hour, destination),
 UNIQUE (flight, hour),
 UNIQUE (flight));

CREATE TABLE PilotSchedules (day, flight, hour, gate, pilot,
 UNIQUE (day, flight, gate),
 UNIQUE (day, flight, pilot),
 FOREIGN KEY (flight, hour)
 REFERENCES DailySchedules (flight, hour));
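The two-column foreign key only works because the referenced table declares UNIQUE (flight, hour). Here is a hedged sketch of that behavior in SQLite via Python's sqlite3 module. The data types and sample rows are assumptions, the reference is written against DailySchedules, and SQLite enforces foreign keys only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE DailySchedules
(flight TEXT NOT NULL,
 hour INTEGER NOT NULL,
 destination TEXT NOT NULL,
 UNIQUE (flight, hour, destination),
 UNIQUE (flight, hour),
 UNIQUE (flight));

CREATE TABLE PilotSchedules
(day TEXT NOT NULL,
 flight TEXT NOT NULL,
 hour INTEGER NOT NULL,
 gate TEXT NOT NULL,
 pilot TEXT NOT NULL,
 UNIQUE (day, flight, gate),
 UNIQUE (day, flight, pilot),
 FOREIGN KEY (flight, hour)
  REFERENCES DailySchedules (flight, hour));
""")

# Invented sample row: flight F100 always leaves at hour 9.
conn.execute("INSERT INTO DailySchedules VALUES ('F100', 9, 'ATL')")


def try_insert(row):
    """Report whether PilotSchedules accepts the row."""
    try:
        conn.execute("INSERT INTO PilotSchedules VALUES (?, ?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


matching_hour = try_insert(("Mon", "F100", 9, "G1", "Smith"))
# Hour 10 does not match the daily schedule, so the foreign key fails.
wrong_hour = try_insert(("Tue", "F100", 10, "G1", "Smith"))

print(matching_hour, wrong_hour)
```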

2.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization. In particular, E-R (entity-relationship) diagrams are very useful for this process. However, a few informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, which is why E-R diagrams work so well as a design tool. Tables that represent entities should have a simple, immediate name suggested by their contents: a table named Students has student data in it, not student data and bowling scores. It is also a good idea to use plural or collective nouns as the names of such tables to remind you that a table is a set of entities; the rows are the single instances of them.

Tables that represent many-to-many relationships should be named by their contents, and should be as minimal as possible. For example, Students are related to Classes by a third (relationship) table for their attendance. These tables might represent a pure relationship, or they might contain attributes that exist within the relationship, such as a grade for the class attended. Since the only way to get a grade is to attend the class, the relationship is going to have a column for it, and will be named “ReportCards,” “Grades,” or something similar. Avoid naming entities based on many-to-many relationships by combining the two table names. For example, Student_Course is a bad name for the Enrollment entity.

Avoid NULLs whenever possible. If a table has too many NULL-able columns, it is probably not normalized properly. Try to use a NULL only for a value that is missing now, but which will be resolved later. Even better, you can put missing values into the encoding schemes for that column, as discussed in Section 5.2 of SQL Programming Style, ISBN 0-12-088797-5, on encoding schemes.

A normalized database will tend to have a lot of tables with a small number of columns per table. Don’t panic when you see that happen. People who first worked with file systems (particularly on computers that used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same.

The same nonkey attribute in more than one table is probably a normalization problem. This is not a certainty, just a guideline. The key that determines that attribute should be in only one table, and therefore the attribute should be with it.

As a practical matter, you are apt to see the same attribute under different names, and you will need to make the names uniform throughout the entire database. The columns date_of_birth, birthdate, birthday, and dob are very likely the same attribute for an employee.

2.11 Key Types

The logical and physical keys for a table can be classified by their behavior and their source. Table 2.1 is a quick table of my classification system.

Table 2.1 Classification System for Key Types

                             |                       Exposed
                             |                       Physical
                             | Natural   Artificial  Locator    Surrogate
==========================================================================
Constructed from attributes  |
in the reality               |
of the data model            |    Y          N          N          Y
                             |
Verifiable in reality        |    Y          N          N          N
                             |
Verifiable in itself         |    Y          Y          N          N
                             |
Visible to the user          |    Y          Y          Y          N

2.11.1 Natural Keys

A natural key is a subset of attributes that occur in a table and act as a unique identifier. The user sees them. You can go to external reality and verify them. Examples of natural keys include the UPC codes on consumer goods (read the package barcode) and coordinates (get a GPS).

Newbies worry about a natural compound key becoming very long. My answer is, “So what?” This is the 21st century; we have much better computers than we did in the 1950s, when key size was a real physical issue. To replace a natural two- or three-integer compound key with a huge GUID that no human being or other system can possibly understand, because they think it will be faster, only cripples the system and makes it more prone to errors. I know how to verify the (longitude, latitude) pair of a location; how do you verify the GUID assigned to it?

A long key is not always a bad thing for performance. For example, if I use (city, state) as my key, I get a free index on just (city) in many systems. I can also add extra columns to the key to make it a super-key, when such a super-key gives me a covering index (i.e., an index that contains all of the columns required for a query, so that the base table does not have to be accessed at all).
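You can watch one product make use of the key's index. The sketch below uses SQLite through Python's sqlite3 module purely as an example of "many systems"; the Cities table, its columns, and the query are invented for illustration. EXPLAIN QUERY PLAN is SQLite-specific, and the wording of its output varies between releases, so the code looks only for a mention of an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Cities
(city TEXT NOT NULL,
 state TEXT NOT NULL,
 population INTEGER NOT NULL,
 PRIMARY KEY (city, state))
""")

# The query touches only key columns and searches on the leading one,
# so the index that backs the compound key can answer it.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT city, state FROM Cities WHERE city = ?",
    ("Austin",),
).fetchall()

# The last column of each plan row holds the human-readable detail.
plan_text = " ".join(str(row[-1]) for row in plan_rows)
uses_key_index = "INDEX" in plan_text
print(plan_text)
```

Other products expose the same information through their own plan displays, so check your vendor's documentation before relying on this behavior.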

2.11.2 Artificial Keys

An artificial key is an extra attribute added to the table that is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. One example of an artificial key is the open codes in the UPC/EAN scheme that a user can assign to his own stuff. The check digits still work, but you have to verify them inside your own enterprise.
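“Verifiable in itself” is easy to show with the UPC-A check digit. The sketch below implements the standard calculation (three times the sum of the digits in odd positions, plus the sum of the digits in even positions, rounded up to the next multiple of ten). The function names are my own, and the sample code is a commonly cited example barcode rather than anything taken from this text.

```python
def upc_a_check_digit(first_eleven):
    """Compute the twelfth digit of a UPC-A code from its first eleven."""
    if len(first_eleven) != 11 or not first_eleven.isdigit():
        raise ValueError("expected exactly eleven digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc_a(code):
    """True when code is twelve digits and its check digit agrees."""
    return (
        len(code) == 12
        and code.isdigit()
        and upc_a_check_digit(code[:11]) == int(code[11])
    )


print(is_valid_upc_a("036000291452"))
print(is_valid_upc_a("036000291453"))
```

Passing the check says only that the code is well formed. Whether it identifies the right product is something you still have to verify against a trusted source.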

Experienced database designers tend toward keys they find in industry standard codes, such as UPC/EAN, VIN, GTIN, ISBN, etc. They know that they need to verify the data against the reality they are modeling. A trusted external source is a good thing to have. I know why this VIN is associated with this car, but why is an auto-number value of 42 associated with this car? Try to verify the relationship in the reality you are modeling. It makes as much sense as locating a car by its parking space number.

2.11.3 Exposed Physical Locators

An exposed physical locator is not based on attributes in the data model and is exposed to the user. There is no way to predict it or verify it. The system obtains a value through some physical process totally unrelated to the logical data model. The user cannot change the locators without destroying the relationships among the data elements.

Examples of exposed physical locators would be physical row locations encoded as a number, string, or proprietary data type. If hashing tables were accessible in an SQL product they would qualify, but they are usually hidden from the user.

Many programmers object to putting IDENTITY and other auto-numbering devices into this category. To convert the number into a physical location requires a search rather than a hashing table lookup or positioning a read/write head on a disk drive, but the concept is the same. The hardware gives you a way to go to a physical location that has nothing to do with the logical data model, and that cannot be changed in the physical database or verified externally.

Most of the time, exposed physical locators are used for faking a sequential file’s positional record number, so I can reference the physical storage location (a 1960s ISAM file in SQL). You lose all the advantages of an abstract data model and SQL set-oriented programming, because you carry extra data and destroy the portability of code.

The early SQLs were based on preexisting file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns; in short, just like a deck of punch cards or a magnetic tape. Most programmers still carry that mental model, which is why I keep ranting about file versus table, row versus record, and column versus field.

But physically contiguous storage is only one way of building a relational database, and it is not the best one. The basic idea of a relational database is that the user is not supposed to know how or where things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product on particular hardware at a particular time. This is discussed further in Section 1.2.1, “IDENTITY Columns.”

Finally, an appeal to authority, with a quote from Dr. Codd:

“Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ...”

This means that a surrogate ought to act like an index: created by the user, managed by the system, and never seen by a user. That means never used in code, DRI, or anything else that a user writes.

Codd also wrote the following:

“There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g., if two companies merge, the two employee databases might be combined, with the result that some or all of the serial numbers might be changed).

2. Two relations may have user-controlled keys defined on distinct domains (e.g., one uses Social Security numbers, while the other uses employee serial numbers), and yet the entities denoted are the same.

3. It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value, or after it has ceased to have one (e.g., an applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. One solution, proposed in Chapter 4 and more fully in Chapter 14, is to introduce entity domains, which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ...” (Codd 1979)

2.11.4 Practical Hints for Denormalization

The subject of denormalization is a great way to get into religious wars. At one extreme, you will find relational purists who think that the idea of not carrying a database design to at least 3NF is a crime against nature. At the other extreme, you will find people who simply add and move columns all over the database with ALTER statements, never keeping the schema stable.

The reason given for denormalization is performance. A fully normalized database requires a lot of JOINs to construct common VIEWs of data from its components. JOINs used to be very costly in terms of time and computer resources, so “preconstructing” the JOIN in a denormalized table can save quite a bit.

Today, only data warehouses should be denormalized, never a production OLTP system.
