6 Normalization his chapter focuses on the fundamentals of normal forms for rela-tional databases and the database design step that normalizes the candidate tables [step IId of the datab
Trang 16 Normalization
his chapter focuses on the fundamentals of normal forms for rela-tional databases and the database design step that normalizes the candidate tables [step II(d) of the database design life cycle] It also inves-tigates the equivalence between the conceptual data model (e.g., the ER model) and normal forms for tables As we go through the examples in this chapter, it should become obvious that good, thoughtful design of a conceptual model will result in databases that are either already normal-ized or can be easily normalnormal-ized with minor changes This truth illus-trates the beauty of the conceptual modeling approach to database design, in that the experienced relational database designer will develop
a natural gravitation toward a normalized model from the beginning For most database practitioners, Sections 6.1 through 6.4 cover the critical normalization needed for everyday use, through Boyce-Codd normal form (BCNF) Section 6.5 covers the higher normal forms of mostly theoretical interest; however, we do show the equivalency between higher normal forms and ternary relationships in the ER model and UML for the interested reader
6.1 Fundamentals of Normalization
Relational database tables, whether they are derived from ER or UML models, sometimes suffer from some rather serious problems in terms of
T
Trang 2performance, integrity and maintainability For example, when the entire database is defined as a single large table, it can result in a large amount of redundant data and lengthy searches for just a small number
of target rows It can also result in long and expensive updates, and dele-tions in particular can result in the elimination of useful data as an unwanted side effect
Such a situation is shown in Figure 6.1, where products, salespersons, customers, and orders are all stored in a single table called Sales In this table, we see that certain product and customer information is stored redundantly, wasting storage space Certain queries, such as “Which cus-tomers ordered vacuum cleaners last month?” would require a search of the entire table Also, updates such as changing the address of the cus-tomer Dave Bachmann would require changing many rows Finally, deleting an order by a valued customer such as Qiang Zhu (who bought
an expensive computer), if that is his only outstanding order, deletes the only copy of his address and credit rating as a side effect Such informa-tion may be difficult (or sometimes impossible) to recover These prob-lems also occur for situations in which the database has already been set
up as a collection of many tables, but some of the tables are still too large
If we had a method of breaking up such a large table into smaller tables so that these types of problems would be eliminated, the database would be much more efficient and reliable Classes of relational database schemes or table definitions, called normal forms, are commonly used to accomplish this goal The creation of a normal form database table is called normalization Normalization is accomplished by analyzing the
Figure 6.1 Single table database
DVD player
Plymouth product_name
Sales
order_no cust_name cust_addr credit date
1-3-03 4-15-05 9-12-04 12-5-04 5-10-05 8-3-02 10-1-04 4-19-99 1-4-04 3-15-04
sales_name Carl Bloch Ted Hanss Dick Phillips Fred Remley
R Metz Paul Basile Carl Bloch Carl Bloch Dick Phillips Fred Remley
6 10 8 3 7 8 7 8 6 6
Austin Ann Arbor Detroit Chicago Mumbai Chicago Detroit Austin East Lansing
1458 2730 2460 519 1986 1817 1865 1885 1943 2315
Dave Bachmann Qiang Zhu Mike Stolarchuck Peter Honeyman Charles Antonelli C.V Ravishankar Charles Antonelli Betsy Karmeisool Dave Bachmann Sakti Pramanik
vacuum cleaner
computer
refrigerator
radio
CD player
vacuum cleaner
vacuum cleaner
refrigerator
television
Trang 3interdependencies among individual attributes associated with those tables and taking projections (subsets of columns) of larger tables to form smaller ones
Let us first review the basic normal forms, which have been well established in the relational database literature and in practice
6.1.1 First Normal Form
Definition A table is in first normal form (1NF) if and only if all
col-umns contain only atomic values, that is, each column can have only one value for each row in the table
Relational database tables, such as the Sales table illustrated in Figure
6.1, have only atomic values for each row and for each column Such tables are considered to be in first normal form, the most basic level of normalized tables
To better understand the definition for 1NF, it helps to know the
dif-ference between a domain, an attribute, and a column A domain is the
set of all possible values for a particular type of attribute, but may be used for more than one attribute For example, the domain of people’s names is the underlying set of all possible names that could be used for either customer-name or salesperson-name in the database table in Fig-ure 6.1 Each column in a relational table represents a single attribute, but in some cases more than one column may refer to different attributes from the same domain When this occurs, the table is still in 1NF because the values in the table are still atomic In fact, standard SQL assumes only atomic values and a relational table is by default in 1NF A nice explanation of this is given in Muller [1999]
6.1.2 Superkeys, Candidate Keys, and Primary Keys
A table in 1NF often suffers from data duplication, update performance, and update integrity problems, as noted above To understand these issues better, however, we must define the concept of a key in the
con-text of normalized tables A superkey is a set of one or more attributes,
which, when taken collectively, allows us to identify uniquely an entity
or table Any subset of the attributes of a superkey that is also a superkey,
and not reducible to another superkey, is called a candidate key A primary
key is selected arbitrarily from the set of candidate keys to be used in an index for that table
Trang 4As an example, in Figure 6.2 a composite of all the attributes of the table forms a superkey because duplicate rows are not allowed in the relational model Thus, a trivial superkey is formed from the composite
of all attributes in a table Assuming that each department address (dept_addr) in this table is single valued, we can conclude that the com-posite of all attributes except dept_addr is also a superkey Looking at smaller and smaller composites of attributes and making realistic assumptions about which attributes are single valued, we find that the composite (report_no, author_id) uniquely determines all the other attributes in the table and is therefore a superkey However, neither report_no nor author_id alone can determine a row uniquely, and the composite of these two attributes cannot be reduced and still be a super-key Thus, the composite (report_no, author_id) becomes a candidate key Since it is the only candidate key in this table, it also becomes the primary key
A table can have more than one candidate key If, for example, in Figure 6.2, we had an additional column for author_ssn, and the com-posite of report_no and author_ssn uniquely determine all the other attributes of the table, then both (report_no, author_id) and (report_no, author_ssn) would be candidate keys The primary key would then be an arbitrary choice between these two candidate keys
Other examples of multiple candidate keys can be seen in Figure 5.5
(see Chapter 5) In Figure 5.5a the table uses_notebook has three
candi-date keys: (emp_id, project_name), (emp_id, notebook_no), and
(project_name, notebook_no); and in Figure 5.5b the table assigned_to
has two candidate keys: (emp_id, loc_name) and (emp_id, project_name) Figures 5.5c and 5.5d each have only a single candidate key
Figure 6.2 Report table
5789
design report_no
Report
editor dept_no dept_name dept_addr author_id author_name
mantei bolton koenig fry umar koenig
author_addr cs-tor mathrev mathrev folkstone prise mathrev
53 44 71 26 38 71
design design analysis analysis analysis
argus1 argus1 argus1 argus2 argus2 argus2
woolf woolf woolf koenig koenig koenig
15 15 15 27 27 27
4216
4216
4216
5789
5789
Trang 56.1.3 Second Normal Form
To explain the concept of second normal form (2NF) and higher, we introduce the concept of functional dependence, which was briefly described in Chapter 2 The property of one or more attributes that uniquely determine the value of one or more other attributes is called
functional dependence Given a table (R), a set of attributes (B) is
function-ally dependent on another set of attributes (A) if, at each instant of time, each A value is associated with only one B value Such a functional dependence is denoted by A -> B In the preceding example from Figure 6.2, let us assume we are given the following functional dependencies
for the table report:
report: report_no -> editor, dept_no
dept_no -> dept_name, dept_addr author_id -> author_name, author_addr
Definition A table is in second normal form (2NF) if and only if it is
in 1NF and every nonkey attribute is fully dependent on the primary key An attribute is fully dependent on the primary key if it is on the right side of an FD for which the left side is either the primary key itself or something that can be derived from the primary key using the transitivity of FDs
An example of a transitive FD in report is the following:
report_no -> dept_no dept_no -> dept_name Therefore we can derive the FD (report_no -> dept_name), since dept_name is transitively dependent on report_no
Continuing our example, the composite key in Figure 6.2, (report_no, author_id), is the only candidate key and is therefore the primary key However, there exists one FD (dept_no -> dept_name, dept_addr) that has no component of the primary key on the left side, and two FDs (report_no -> editor, dept_no and author_id -> author_name, author_addr) that contain one component of the primary key on the left
side, but not both components As such, report does not satisfy the
condition for 2NF for any of the FDs