Despite these provisos, certain database design principles have evolved over time
to guide us in our quest for an optimal design structure. It should be said from the outset that the most influential architect of relational database design is E.F. Codd, who published his groundbreaking article, “A Relational Model of Data for Large Shared Data Banks,” in 1970. This article laid the foundation for what
we now call the relational model and the concept of normalization.
Goals of Normalization
The term normalization refers to a specific process that allows designers to turn
unstructured data into a properly designed set of tables and data elements. The best way to understand normalization is to illustrate what it isn’t. To do this, we’ll start with a poorly designed table that has a number of obvious problems. The following table, named Grades, attempts to present information about all of the grades that students have received for the tests they’ve taken. Each row represents a grade for a particular student.
Test          Student  Date        TotalPoints  Grade  TestFormat              Teacher  Assistant
Pronoun Quiz  Amy      2009-03-02  10           8      Multiple Choice         Smith    Collins
Pronoun Quiz  Jon      2009-03-02  10           6      Multiple Choice         Jones    Brown
Solids Quiz   Beth     2009-03-03  20           17     Multiple Choice         Kaplan   NULL
China Test    Karen    2009-02-04  50           45     Essay                   Harris   Taylor
China Test    Alex     2009-03-04  50           38     Essay                   Harris   Taylor
Grammar Test  Karen    2009-03-05  100          88     Multiple Choice, Essay  Smith    Collins
Let’s first list the information that each column in this table is meant to provide. The columns are:
■ Test: A description of the test or quiz given
■ Student: The student who took the test
■ Date: The date on which the test was taken
■ TotalPoints: The total number of possible points for the test
■ Grade: The number of points that the student received
Chapter 19 ■ Principles of Database Design
■ TestFormat: The format of the test, either essay, multiple choice, or both
■ Teacher: The teacher who gave the test
■ Assistant: The person who assisted the teacher in this class
We’re going to assume that the primary key for this table is a composite primary
key consisting of the Test and Student columns. Each row in the table is meant to
express a grade for a specific test and student.
Let’s now discuss two obvious problems with this table. First, certain data is
unnecessarily duplicated. For example, you can see that the Pronoun Quiz,
which was given on 2009-03-02, had a total of 10 points. The problem, however,
is that this information needs to be repeated on every row for that quiz. It
would be better if we could simply record the total points for that particular
quiz once.
A second problem is that data is repeated within certain single cells. We have a
row for which the TestFormat is both Multiple Choice and Essay. This was done
because the test had both types of questions. But this makes the data difficult to
use. If we wanted to retrieve all tests with essay questions, how could we
do that?
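The retrieval difficulty can be sketched with a few rows of the Grades table. This is a minimal illustration using Python's built-in sqlite3 module, not part of the original design; only three of the table's columns are included, and the column types are assumptions.

```python
import sqlite3

# In-memory database holding a cut-down version of the Grades table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Grades (
        Test       TEXT NOT NULL,
        Student    TEXT NOT NULL,
        TestFormat TEXT,
        PRIMARY KEY (Test, Student)
    )
""")
conn.executemany(
    "INSERT INTO Grades (Test, Student, TestFormat) VALUES (?, ?, ?)",
    [
        ("Pronoun Quiz", "Amy", "Multiple Choice"),
        ("China Test", "Karen", "Essay"),
        ("Grammar Test", "Karen", "Multiple Choice, Essay"),
    ],
)

# An exact match misses the test whose cell holds both formats.
exact = conn.execute(
    "SELECT Test FROM Grades WHERE TestFormat = 'Essay' ORDER BY Test"
).fetchall()

# Substring matching finds it, but only as long as every cell
# spells and punctuates the format names the same way.
pattern = conn.execute(
    "SELECT Test FROM Grades WHERE TestFormat LIKE '%Essay%' ORDER BY Test"
).fetchall()

print(exact)    # [('China Test',)]
print(pattern)  # [('China Test',), ('Grammar Test',)]
```

The pattern match works here, but it treats structured data as free text, which is exactly what a sound design tries to avoid.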
To be more general, the main problem with this table is that it attempts to put
all information into a single table. It would be much better to break down the
information in this table into separate entities, such as students, grades, and
teachers, representing each entity as a separate table. The power of SQL can then
be used to join tables together to retrieve any needed information.
With this discussion in mind, let’s now formalize what the process of
normalization hopes to accomplish. There are two main goals:
■ Eliminate redundant data. The above example clearly illustrates the issue of
redundant data. But why is this important? What exactly is the problem
with listing the same data on multiple rows? Besides the obvious duplication
of effort, one answer is that redundancy reduces flexibility. When data is
repeated, any change to a particular value affects multiple
rows rather than just one.
■ Eliminate insert, delete, and update anomalies. The problem of redundant
data also relates to this second goal, which is to eliminate insert, delete, and
update anomalies. Let’s say, for example, that one particular teacher gets married and changes her name. You would like the data to reflect the new name. You now need to do an update on all rows that contain her name. Because the data is stored redundantly, you need to update a large amount
of data, rather than just one row.
There are also insert and delete anomalies. For example, let’s say you just hired a new teacher to teach music. You would like to record that somewhere in your database. However, since that teacher hasn’t yet given any tests, there is nowhere to put this information, because you don’t have a table specific to the entity of teachers.
Similarly, a delete anomaly would occur if you wanted to delete a row, but
doing so would also eliminate some related piece of information. To use another example, if you had a database of books and wanted to delete a row for a book by Nathaniel Hawthorne, and if that were the only book for
Mr. Hawthorne, then that row deletion would eliminate not only the book, but also the fact that Nathaniel Hawthorne is an author whose other books you might acquire in the future.
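The three anomalies can be sketched against a cut-down Grades table. This is a minimal illustration using Python's sqlite3 module; the column types, the new name "Harris-Lee", and the new teacher "Nguyen" are hypothetical values chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Grades (
        Test    TEXT NOT NULL,
        Student TEXT NOT NULL,
        Grade   INTEGER,
        Teacher TEXT,
        PRIMARY KEY (Test, Student)
    )
""")
conn.executemany(
    "INSERT INTO Grades VALUES (?, ?, ?, ?)",
    [
        ("China Test", "Karen", 45, "Harris"),
        ("China Test", "Alex", 38, "Harris"),
        ("Solids Quiz", "Beth", 17, "Kaplan"),
    ],
)

# Update anomaly: one name change touches every row that repeats the name.
# "Harris-Lee" is a made-up new name.
cur = conn.execute(
    "UPDATE Grades SET Teacher = 'Harris-Lee' WHERE Teacher = 'Harris'"
)
rows_updated = cur.rowcount

# Insert anomaly: a teacher who has given no tests has no Test or Student
# value, so the key columns reject the row. "Nguyen" is a made-up teacher.
try:
    conn.execute("INSERT INTO Grades (Teacher) VALUES ('Nguyen')")
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# Delete anomaly: removing Beth's only grade also removes every trace
# of the teacher Kaplan.
conn.execute("DELETE FROM Grades WHERE Student = 'Beth'")
kaplan_rows = conn.execute(
    "SELECT COUNT(*) FROM Grades WHERE Teacher = 'Kaplan'"
).fetchone()[0]

print(rows_updated, insert_rejected, kaplan_rows)  # 2 True 0
```

With a separate table for teachers, each of these operations would touch exactly one row and lose nothing else.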
How to Normalize Data
We’ve been throwing around the term normalization for a while. It’s now time to
be more specific about what it means.
The term itself originates with E.F. Codd, and it refers to a series of recommended steps taken to remove redundancy and update anomalies from a database design.
The steps involved in the normalization process are commonly referred to as first normal form, second normal form, third normal form, and so on. Although certain individuals have described steps up to sixth normal form, the usual practice is to
go only through first, second, and third normal form. When data is in third normal form, it is generally said to be sufficiently normalized.
We are not going to describe the entire set of rules and procedures for converting data into first, second, and third normal form. There are texts that will lead you through the process in great detail, showing you how to transform data first into first normal form, then into second normal form, and then finally into third normal form. Instead, we are going to summarize the rules for getting your data into third normal form. In practice, an experienced database administrator can jump from unstructured data to third normal form without having to follow every intermediate procedure. We will do the same thing here.
The three main rules for normalizing your data are as follows:
■ Eliminate repeating data. This rule means that no multivalued attributes
are allowed. In the previous example, we cannot allow a value such as
Multiple Choice, Essay to exist in a single data cell. The existence of multiple
values in a single cell creates obvious difficulties in retrieving data by any
given specified value.
A corollary to this rule is that repeated columns are not allowed. In our
example, the database might have been designed so that, rather than a single
column named TestFormat, we had two separate columns named
TestFormat1 and TestFormat2. With this alternative approach, we might have
placed the value Multiple Choice in the TestFormat1 column and Essay in
the TestFormat2 column. This would not be permitted. We don’t want to
have repeated data, whether it is multiple values in a single column or
multiple columns to handle similar data.
■ Eliminate partial dependencies. This rule refers primarily to situations
where the primary key for a table is a composite primary key, meaning a key
composed of multiple columns. The rule states that no column in the table
can be related only to part of the primary key.
Let’s illustrate with an example. As mentioned, the primary key in the
Grades table is a composite key consisting of the Student and Test columns.
The problem occurs with columns such as TotalPoints. The TotalPoints
column is really an attribute of the test and has nothing to do with students.
This rule mandates that all non-key columns in a table refer to the entire
key, and not just a part of the key. In essence, partial dependencies indicate
that the data in the table relates to more than one entity.
■ Eliminate transitive dependencies. This rule refers to situations where a
column in the table refers not to the primary key, but to another non-key
column in the same table. In this example, the Assistant column is really an
attribute of the Teacher column. The fact that Assistant relates to the teacher
and not to anything in the primary key (test or student) indicates that the
information doesn’t belong in this table.
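The partial and transitive dependencies described in these rules can be probed directly in the sample data. The following is a rough sketch in Python, not a formal normalization tool: the helper name determines is hypothetical, and the rows are copied from the Grades table with the Date and TestFormat columns left out. Sample data can only refute a dependency, never prove one, so a True result is a hint rather than a guarantee.

```python
from collections import defaultdict

# Rows copied from the Grades table (Date and TestFormat omitted).
COLUMNS = ("Test", "Student", "TotalPoints", "Grade", "Teacher", "Assistant")
ROWS = [
    ("Pronoun Quiz", "Amy", 10, 8, "Smith", "Collins"),
    ("Pronoun Quiz", "Jon", 10, 6, "Jones", "Brown"),
    ("Solids Quiz", "Beth", 20, 17, "Kaplan", None),
    ("China Test", "Karen", 50, 45, "Harris", "Taylor"),
    ("China Test", "Alex", 50, 38, "Harris", "Taylor"),
    ("Grammar Test", "Karen", 100, 88, "Smith", "Collins"),
]


def determines(lhs, rhs):
    """Return True if, within ROWS, equal values in the lhs columns
    always appear with the same value in the rhs column."""
    lhs_idx = [COLUMNS.index(c) for c in lhs]
    rhs_idx = COLUMNS.index(rhs)
    seen = defaultdict(set)
    for row in ROWS:
        key = tuple(row[i] for i in lhs_idx)
        seen[key].add(row[rhs_idx])
    return all(len(values) == 1 for values in seen.values())


# Partial dependency: TotalPoints follows from Test alone, half of the key.
partial = determines(["Test"], "TotalPoints")

# Transitive dependency: Assistant follows from Teacher, a non-key column.
transitive = determines(["Teacher"], "Assistant")

# Grade does not follow from Test alone; it needs the whole key.
grade_from_test = determines(["Test"], "Grade")
grade_from_key = determines(["Test", "Student"], "Grade")

print(partial, transitive, grade_from_test, grade_from_key)
# True True False True
```

Columns like TotalPoints and Assistant, which are determined by something other than the full key, are the ones that should move to tables of their own.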
So we’ve seen the problems and have talked about the rules for fixing the data. How are proper database design changes actually determined? This is where experience comes in, and there is generally not a single solution to the problem. That said, the following is one solution to this design problem. In this new design, several tables have been created from the one original table, and all data is now in normalized form. Figure 19.1 shows the tables in the new design, without data.
The primary keys in each table are shown in bold. A number of ID columns with auto-increment values have been added to the tables, allowing relationships between the tables to be defined. All the other columns are the same as shown before.
The main point to notice is that every entity discussed in this example has been broken out into a separate table. The Students table has information about each student. The only attribute in this table is the student name. The Grades table has information about each grade. It has a composite primary key of StudentID and TestID because each grade is tied to a student and to a specific test given.
The Tests table has information about each test given, such as the date, TeacherID, the test description, and the total points for the test.
The Formats table has information about the test formats. A test can have multiple rows in this table, one for each format it uses, to show whether the test is multiple choice, essay, or both.
The Teachers table has information about each teacher, including the teacher’s assistant, if there is one.
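The design just described can be sketched as table definitions. This is one possible rendering using Python's sqlite3 module, not a reproduction of Figure 19.1: the column names and types are assumptions inferred from the descriptions above, the primary key chosen for Formats is a guess, and only a few rows of the original data are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teachers (
        TeacherID INTEGER PRIMARY KEY,
        Teacher   TEXT NOT NULL,
        Assistant TEXT
    );
    CREATE TABLE Students (
        StudentID INTEGER PRIMARY KEY,
        Student   TEXT NOT NULL
    );
    CREATE TABLE Tests (
        TestID      INTEGER PRIMARY KEY,
        Test        TEXT NOT NULL,
        Date        TEXT NOT NULL,
        TotalPoints INTEGER NOT NULL,
        TeacherID   INTEGER NOT NULL REFERENCES Teachers (TeacherID)
    );
    -- One row per format a test uses (assumed key: TestID plus TestFormat).
    CREATE TABLE Formats (
        TestID     INTEGER NOT NULL REFERENCES Tests (TestID),
        TestFormat TEXT NOT NULL,
        PRIMARY KEY (TestID, TestFormat)
    );
    CREATE TABLE Grades (
        StudentID INTEGER NOT NULL REFERENCES Students (StudentID),
        TestID    INTEGER NOT NULL REFERENCES Tests (TestID),
        Grade     INTEGER NOT NULL,
        PRIMARY KEY (StudentID, TestID)
    );

    INSERT INTO Teachers VALUES (1, 'Smith', 'Collins'), (2, 'Harris', 'Taylor');
    INSERT INTO Students VALUES (1, 'Karen'), (2, 'Alex');
    INSERT INTO Tests VALUES
        (1, 'China Test', '2009-03-04', 50, 2),
        (2, 'Grammar Test', '2009-03-05', 100, 1);
    INSERT INTO Formats VALUES
        (1, 'Essay'), (2, 'Multiple Choice'), (2, 'Essay');
    INSERT INTO Grades VALUES (2, 1, 38), (1, 2, 88);
""")

# Tests with essay questions: an exact match now suffices.
essay_tests = conn.execute("""
    SELECT t.Test
    FROM Tests AS t
    JOIN Formats AS f ON f.TestID = t.TestID
    WHERE f.TestFormat = 'Essay'
    ORDER BY t.Test
""").fetchall()

# Joins reassemble the kind of row the original Grades table stored.
report = conn.execute("""
    SELECT t.Test, s.Student, g.Grade, t.TotalPoints, te.Teacher, te.Assistant
    FROM Grades AS g
    JOIN Students AS s ON s.StudentID = g.StudentID
    JOIN Tests AS t ON t.TestID = g.TestID
    JOIN Teachers AS te ON te.TeacherID = t.TeacherID
    ORDER BY t.Test, s.Student
""").fetchall()

print(essay_tests)  # [('China Test',), ('Grammar Test',)]
print(report)
```

Each fact is now stored once: the total points for a test live in a single Tests row, and a teacher's name lives in a single Teachers row.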
The following shows the data contained in these new tables, corresponding to the data in the original Grades table.
Figure 19.1
Normalized design.