Relational Database Design
IN THIS CHAPTER
Introducing entities, tuples, and attributes
Conceptual diagramming vs. SQL DDL
Avoiding normalization over-complexity
Choosing the right database design pattern
Ensuring data integrity
Exploring alternative patterns
Normal forms
I play jazz guitar — well, I used to play before life became so busy.
(You can listen to some of my MP3s on my ‘‘about’’ page on
www.sqlserverbible.com.) There are some musicians who can
hear a song and then play it; I’m not one of those. I can feel the rhythm, but
I have to work through the chords and figure them out almost mathematically
before I can play anything but a simple piece. To me, building chords and chord
progressions is like drawing geometric patterns on the guitar neck using the frets
and strings.
Music theory encompasses the scales, chords, and progressions used to make
music. Every melody, harmony, rhythm, and song draws from music theory. For
some musicians there’s just a feeling that the song sounds right. For those who
make music their profession, they understand the theory behind why a song feels
right. Great musicians have both the feel and the theory in their music.
Designing databases is similar to playing music. Databases are designed by
combining the right patterns to correctly model a specific solution to a problem.
Normalization is the theory that shapes the design. There’s both the mathematical
theory of relational algebra and the intuitive feel of an elegant database.
Designing databases is both science and art.
Database Basics
The purpose of a database is to store the information required by an organization.
Any means of collecting and organizing data is a database. Prior to the
Information Age, information was primarily stored on cards, in file folders, or in ledger
books. Before the adding machine, offices employed dozens of workers who spent
all day adding columns of numbers and double-checking the math of others. The
job title of those who had that exciting career was computer.
Author’s Note
Welcome to the second of five chapters that deal with database design. Although they’re spread out in
the table of contents, they weave a consistent theme that good design yields great performance:
■ Chapter 2, ‘‘Data Architecture,’’ provides an overview of data architecture
■ This chapter details relational database theory
■ Chapter 20, ‘‘Creating the Physical Database Schema,’’ discusses the DDL layer of
database design and development
■ Partitioning the physical layer is covered in Chapter 68, ‘‘Partitioning.’’
■ Designing data warehouses for business intelligence is covered in Chapter 70,
‘‘BI Design.’’
There’s more to this chapter than the standard ‘‘Intro to Normalization.’’ This chapter draws on the lessons
I’ve learned over the years and has a few original ideas.
This chapter covers a book’s worth of material (which is why I rewrote it three times), but I tried to concisely
summarize the main ideas. The chapter opens with an introduction to database design terms and concepts.
Then I present the same concepts from three perspectives: first with the common patterns, then with my
custom Layered Design concept, and lastly with the normal forms. I’ve tried to make the chapter flow, but
each of these ideas is easier to comprehend after you understand the other two, so if you have the time, read
the chapter twice to get the most out of it.
As the number crunching began to be handled by digital machines, human labor, rather than being
eliminated, shifted to other tasks. Analysts, programmers, managers, and IT staff have replaced the
human ‘‘computers’’ of days gone by.
Speaking of old computers, I collect abacuses, and I know how to use them too — it keeps me in touch with the roots of computing On my office wall is a very cool nineteenth-century Russian abacus.
Benefits of a digital database
The Information Age and the relational database brought several measurable benefits to organizations:
■ Increased data consistency and better enforcement of business rules
■ Improved sharing of data, especially across distances
■ Improved ability to search for and retrieve information
■ Improved generation of comprehensive reports
■ Improved ability to analyze data trends
The general theme is that a computer database originally didn’t save time in the entry of data, but rather
in the retrieval of data and in the quality of the data retrieved. However, with automated data collection
in manufacturing, bar codes in retailing, databases sharing more data, and consumers placing their own
orders on the Internet, the effort required to enter the data has also decreased.
The previous chapter’s sidebar titled ‘‘Planning Data Stores’’ discusses different types or
styles of databases. This chapter presents the relational database design principles and
patterns used to develop operational, or OLTP (online transaction processing), databases.
Some of the relational principles and patterns may apply to other types of databases, but databases that
are not used for first-generation data (such as most BI, reporting databases, data warehouses, or
reference data stores) do not necessarily benefit from normalization.
In this chapter, when I use the term ‘‘database,’’ I’m referring exclusively to a relational, OLTP-style
database.
Tables, rows, columns
A relational database collects related, or common, data in a single list. For example, all the product
information may be listed in one table and all the customers in another table.
A table appears similar to a spreadsheet and is constructed of columns and rows. The appeal (and the
curse) of the spreadsheet is its informal development style, which makes it easy to modify and add to
as the design matures. In fact, managers tend to store critical information in spreadsheets, and many
databases started as informal spreadsheets.
In both a spreadsheet and a database table, each row is an item in the list and each column is a specific
piece of data concerning that item, so each cell should contain a single piece of data about a single item.
Whereas a spreadsheet tends to be free-flowing and loose in its design, database tables should be very
consistent in terms of the meaning of the data in a column. Because row and column consistency is so
important to a database table, the design of the table is critical.
Over the years, different development styles have referred to these concepts with various terms,
listed in Table 3-1.
TABLE 3-1
Comparing Database Terms

Development Style                   The List of Common Items            An Item in the List          A Piece of Information in the List
Spreadsheet                         Spreadsheet/worksheet/named range   Row                          Column/cell
Relational algebra/logical design   Entity, or relation                 Tuple (rhymes with couple)   Attribute
Object-oriented design              Class                               Object instance              Property
SQL Server developers generally refer to database elements as tables, rows, and columns when discussing
the SQL DDL layer or physical schema, and sometimes use the terms entity, tuple, and attribute when
discussing the logical design. The rest of this book uses the SQL terms (table, row, column), but this
chapter is devoted to the theory behind the design, so I also use the relational algebra terms (entity,
tuple, and attribute).
Database design phases
Traditionally, data modeling has been split into two phases, the logical design and the physical design;
but Louis Davidson and I have been co-presenting at conferences on the topic of database design and
I’ve become convinced that Louis is right when he defines three phases to database design. To avoid
confusion with the traditional terms, I’m defining them as follows:
■ Conceptual Model: The first phase digests the organizational requirements and identifies the
entities, their attributes, and their relationships.
The conceptual model is great for understanding, communicating, and verifying the organization’s requirements. The diagramming method should be easily understood by all the stakeholders — the subject-matter experts, the development team, and management.
At this layer, the design is implementation independent: It could end up on Oracle, SQL Server, or even Access. Some designers refer to this as the ‘‘logical model.’’
■ SQL DDL Layer: This phase concentrates on performance without losing the fidelity of the
logical model as it applies the design to a specific version of a database engine — SQL Server
2008, for example — generating the DDL for the actual tables, keys, and attributes. Typically, the SQL DDL layer generalizes some entities and replaces some natural keys with surrogate computer-generated keys.
The SQL DDL layer might look very different from the conceptual model.
■ Physical Layer: The implementation phase considers how the data will be physically stored
on the disk subsystems using indexes, partitioning, and materialized views. Changes made to this layer won’t affect how the data is accessed, only how it’s stored on the disk.
The physical layer ranges from simple, for small databases (under 20GB), to complex, with multiple filegroups, indexed views, and data routing partitions.
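To make the phases concrete, here is a minimal sketch of how a conceptual Customer entity might land in the SQL DDL layer. The table and column names are hypothetical, not from a specific design in this book; the point is the pattern of adding a surrogate key while keeping the natural key as a unique constraint:

```sql
-- Hypothetical mapping of a conceptual entity to SQL Server DDL.
-- The SQL DDL layer adds a surrogate identity key; the natural key
-- from the conceptual model survives as a unique constraint, so the
-- fidelity of the logical model is preserved.
CREATE TABLE dbo.Customer (
    CustomerID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerCode CHAR(10)     NOT NULL UNIQUE,           -- natural key from the conceptual model
    CustomerName NVARCHAR(50) NOT NULL
);
```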
This chapter focuses on designing the conceptual model, with a brief look at normalization followed by
a repertoire of database patterns.
Implementing a database without working through the SQL DDL layer design phase is a certain path to a poorly performing database. I’ve seen far too many database purists who didn’t care to learn SQL Server implement conceptual designs only to blame SQL Server for the horrible
performance.
The SQL DDL layer is covered in Chapter 20, ‘‘Creating the Physical Database Schema.’’
Tuning the physical layer is discussed in Chapters 64, ‘‘Indexing Strategies,’’ and 68,
‘‘Partitioning.’’
In 1970, Dr. Edgar F. Codd published ‘‘A Relational Model of Data for Large Shared Data Banks’’ and
became the father of the relational database. During the 1970s Codd wrote a series of papers that defined
the concept of database normalization. He wrote his famous ‘‘Codd’s 12 Rules’’ in 1985 to define what
constitutes a relational database and to defend the relational database from software vendors who
were falsely claiming to be relational. Since that time, others have amended and refined the concept of
normalization.
The primary purpose of normalization is to improve the data integrity of the database by reducing or
eliminating modification anomalies that can occur when the same fact is stored in multiple locations
within the database.
Duplicate data raises all sorts of interesting problems for inserts, updates, and deletes. For example, if
the product name is stored in the order detail table, and the product name is edited, should every order
details row be updated? If so, is there a mechanism to ensure that the edit to the product name
propagates down to every duplicate entry of the product name? If data is stored in multiple locations, is it
safe to read just one of those locations without double-checking other locations? Normalization prevents
these kinds of modification anomalies.
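The order detail example above can be sketched in DDL. This is a minimal illustration with hypothetical table and column names: the product name is stored once, in the product table, and the order detail rows carry only a foreign key, so an edit touches a single location:

```sql
-- Hypothetical sketch: storing each fact in one place.
-- The product name lives only in Product; OrderDetail references it
-- by key, so renaming a product is a single-row update with no risk
-- of stale duplicate copies.
CREATE TABLE dbo.Product (
    ProductID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.OrderDetail (
    OrderDetailID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OrderID       INT NOT NULL,
    ProductID     INT NOT NULL
        REFERENCES dbo.Product(ProductID), -- the name is not duplicated here
    Quantity      INT NOT NULL
);
```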
Besides the primary goal of consistency and data integrity, there are several other very good reasons to
normalize an OLTP relational database:
■ Performance: Duplicate data requires extra code to perform extra writes, maintain
consistency, and manipulate data into a set when reading data. On my last large production contract
(several terabytes, OLTP, 35K transactions per second), I tested a normalized version of the
database vs. a denormalized version. The normalized version was 15% faster. I’ve found similar
results in other databases over the years.
Normalization also reduces locking contention and improves multiple-user concurrency.
■ Development costs: While it may take longer to design a normalized database, it’s easier to
work with a normalized database and it reduces development costs.
■ Usability: By placing columns in the correct table, it’s easier to understand the database and
easier to write correct queries.
■ Extensibility: A non-normalized database is often more complex and therefore more difficult
to modify.
The three ‘‘Rules of One’’
Normalization is well defined as normal forms — specific rules that address specific potential
errors in the design (there’s a whole section on normal forms later in this chapter). But I don’t design a
database with errors and then normalize the errors away; I follow normalization from the beginning to
the conclusion of the design process. That’s why I prefer to think of normalization as positively stated
principles.
When I teach normalization I open with the three ‘‘Rules of One,’’ which summarize normalization from
a positive point of view. One type of item is represented by one entity (table). The key to designing
a schema that avoids update anomalies is to ensure that each single fact in real life is modeled by a
single data point in the database. Three principles define a single data point:
■ One group of similar things is represented by one entity (table).
■ One thing is represented by one tuple (row).
■ One descriptive fact about the thing is represented by one attribute (column).
Grok these three simple rules and you’ll be a long way toward designing a properly normalized
database.
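The three rules can be read directly off a table definition. This hypothetical sketch is my own illustration, not a schema from the book:

```sql
-- Hypothetical illustration of the three Rules of One:
--   one group of similar things -> one entity (the Part table),
--   one thing                   -> one tuple (one row per part),
--   one descriptive fact        -> one attribute (one column per fact).
CREATE TABLE dbo.Part (
    PartID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- identifies one thing (one row)
    PartName NVARCHAR(50) NOT NULL,   -- one fact about the part
    WeightKg DECIMAL(8,2)     NULL    -- another single fact, in its own column
);
```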
Normalization As Story
The Time Traveler’s Wife, by Audrey Niffenegger, is one of my favorite books. Without giving away the
plot or any spoilers, it’s an amazing sci-fi romance story. She moves through time conventionally, while
he bounces uncontrollably through time and space. Even though the plot is more complex than the average
novel, I love how Ms. Niffenegger weaves every detail together into an intricate flow. Every detail fits and
builds the characters and the story.
In some ways, a database is like a good story. The plot of the story is in the data model, and the data
represents the characters and the details. Normalization is the grammar of the database.
When two writers tell the same story, each crafts the story differently. There’s no single correct way to tell a
story. Likewise, there may be multiple ways to model the database. There’s no single correct way to model
a database — as long as the database contains all the information needed to extract the story and it follows
the normalized grammar rules, the database will work. (Don’t take this to mean that any design might be
a correct design. While there may be multiple correct designs, there are many more incorrect designs.) A
corollary is that just as some books read better than others, so do some database schemas flow well, while
other database designs are difficult to query.
As with writing a novel, the foundation of data modeling is careful observation, an understanding of reality,
and clear thinking. Based on those insights, the data modeler constructs a logical system — a new virtual
world — that models a slice of reality. Therefore, how the designer views reality and identifies entities and
their interactions will influence the design of the virtual world. Like postmodernism, there’s no single
correct representation, only the viewpoint of the author/designer.
Identifying entities
The first step to designing a database conceptual diagram is to identify the entities (tables). Because any
entity represents only one type of thing, it takes several entities together to represent an entire process
or organization.
Entities are usually discovered from several sources:
■ Examining existing documents (order forms, registration forms, patient files, reports)
■ Interviews with subject-matter experts
■ Diagramming the process flow
At this early stage the goal is to simply collect a list of possible entities and their facts. Some of the
entities will be obvious nouns, such as customers, products, flights, materials, and machines.
Other entities will be verbs: shipping, processing, assembling parts to build a product. Verbs may be
entities, or they may indicate a relationship between two entities.
The goal is to simply collect all the possible entities and their attributes. At this early stage, it’s also
useful to document as many known relationships as possible, even if those relationships will be edited
several times.
Generalization
Normalization has a reputation of creating databases that are complex and unwieldy. It’s true that some
database schemas are far too complex, but I don’t believe normalization, by itself, is the root cause.
I’ve found that the difference between elegant databases that are a joy to query and overly complex
designs that make you want to polish your resume is the data modeler’s view of entities.
When identifying entities, there’s a continuum, illustrated in Figure 3-1, ranging from a broad
all-inclusive view to a very specific, narrow definition of the entity.
FIGURE 3-1
Entities can be identified along a continuum, from overly generalized with a single table, to overly
specific with too many tables.
[Figure: a continuum from an overly simple design with one table, through the data-driven sweet spot (fewer tables, easier to extend), to an overly complex design with many specific tables]
The overly simple view groups together entities that are in fact different types of things, e.g., storing
machines, products, and processes in a single entity. This approach might risk data integrity for two
reasons. First, it’s difficult to enforce referential integrity (foreign key constraints) because the primary
key attempts to represent multiple types of items. Second, these designs tend to merge entities with
different attributes, which means that many of the attributes (columns) won’t apply to various rows
and will simply be left null. Many nullable columns means the data will probably be sparsely filled and
inconsistent.
At the other extreme, the overly specific view segments entities that could be represented by a single
entity into multiple entities, e.g., splitting different types of subassemblies and finished products into
multiple different entities. This type of design risks flexibility and usability:
■ The additional tables create additional work at every layer of the software.
■ Database relationships become more complex because what could have been a single
relationship is now multiple relationships. For example, instead of relating an assembly process
between any part, the assembly relationship must now relate with multiple types of parts.
■ The database has now hard-coded the specific types of similar entities, making it very difficult
to add another similar type of entity. Using the manufacturing example again, if there’s an entity for every type of subassembly, then adding another type of subassembly means changes
at every level of the software.
The sweet spot in the middle generalizes, or combines, similar entities into single entities. This approach
creates a more flexible and elegant database design that is easier to query and extend:
■ Look for entities with similar attributes, or entities that share some attributes.
■ Look for types of entities that might have an additional similar entity added in the future.
■ Look for entities that might be summarized together in reports.
When designing a generalized entity, two techniques are essential:
■ Use a lookup entity to organize the types of entities. For the manufacturing example, a
SubassemblyType attribute would serve the purpose of organizing the parts by subassembly type. Typically, this would be a foreign key to a SubassemblyType entity.
■ Typically, the different entity types that could be generalized together do have some differences
(which is why a purist view would want to segment them). Employing the supertype/subtype pattern (discussed in the ‘‘Data Design Patterns’’ section) solves this dilemma perfectly.
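The two techniques above can be sketched together in DDL. The table and column names here are hypothetical stand-ins for the manufacturing example: a lookup entity classifies the generalized parts, and a subtype table holds the attributes that apply only to subassemblies:

```sql
-- Hypothetical sketch of a lookup entity plus the supertype/subtype
-- pattern. Common attributes live in the generalized Part supertype;
-- attributes unique to subassemblies live in a subtype table that
-- shares the supertype's primary key (a one-to-one relationship).
CREATE TABLE dbo.SubassemblyType (            -- lookup entity
    SubassemblyTypeID INT NOT NULL PRIMARY KEY,
    TypeName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Part (                       -- generalized supertype
    PartID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SubassemblyTypeID INT NOT NULL
        REFERENCES dbo.SubassemblyType(SubassemblyTypeID),
    PartName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.SubassemblyPart (            -- subtype: one row per subassembly part
    PartID INT NOT NULL PRIMARY KEY
        REFERENCES dbo.Part(PartID),          -- shares the supertype's key
    AssemblyMinutes INT NOT NULL              -- attribute specific to subassemblies
);
```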
I’ve heard from some that generalization sounds like denormalization — it’s not. When generalizing, it’s
critical that the entities comply with all the rules of normalization.
Generalized databases tend to be data-driven, have fewer tables, and are easier to extend. I was once
asked to optimize a database design that was modeled by a very specific-style data modeler. His design
had 78 entities; mine had 18 and covered more features. For which would you rather write stored
procedures?
On the other hand, be careful to merge entities only when they actually do share a root meaning in
the data. Don’t merge unlike entities just to save programming. The result will be more complex
programming.
Best Practice
Granted, knowing when to generalize and when to segment can be an art form and requires a repertoire of
database experience, but generalization is the buffer against database over-complexity, and consciously
working at understanding generalization is the key to becoming an excellent data modeler.
In my seminars I use an extreme example of specific vs. generalized design, asking groups of three to
four attendees to model the database in two ways: first using an overly specific data modeling technique,
and then modeling the database trying to hit the generalization sweet spot.
Assume your team has been contracted to develop a database for a cruise ship’s activity director — think
Julie McCoy, the cruise director on the Love Boat.
The cruise offers a lot of activities: tango dance lessons, tweetups, theater, scuba lessons, hang-gliding,
off-boat excursions, authentic Hawaiian luau, hula-dancing lessons, swimming lessons, Captain’s dinners,
aerobics, and the ever-popular shark-feeding scuba trips. These various activities have differing
requirements, are offered multiple times throughout the cruise, and some are held at different locations. A
passenger entity already exists; you’re expected to extend the database with new entities to handle activities
but still use the existing passenger entity.
In the seminars, the specialized designs often have an entity for every activity, every time an activity is
offered, activities at different locations, and even activity requirements. I believe the maximum number
of entities by a seminar group is 36. Admittedly, it’s an extreme example for illustration purposes, but
I’ve seen database designs in production using this style.
Each group’s generalized design tends to be similar to the one shown in Figure 3-2. A generalized
activity entity stores all activities and descriptions of their requirements organized by activity type. The
ActivityTime entity has one tuple (row) for every instance or offering of an activity, so if hula-dance
lessons are offered three times, there will be three tuples in this entity.
FIGURE 3-2
A generalized cruise activity design can easily accommodate new activities and locations.
[Figure: the generalized design, relating ActivityType, Activity, ActivityTime, and SignUp entities]
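One way the generalized cruise design might look in DDL is sketched below. The column names are my own hypothetical choices (the figure shows only the entities), and PassengerID is assumed to reference the pre-existing passenger entity:

```sql
-- Hypothetical DDL for the generalized cruise activity design.
CREATE TABLE dbo.ActivityType (       -- lookup entity organizing activities
    ActivityTypeID INT NOT NULL PRIMARY KEY,
    TypeName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Activity (           -- one row per kind of activity
    ActivityID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ActivityTypeID INT NOT NULL
        REFERENCES dbo.ActivityType(ActivityTypeID),
    ActivityName NVARCHAR(50) NOT NULL,
    Requirements NVARCHAR(200) NULL
);

CREATE TABLE dbo.ActivityTime (       -- one row per offering of an activity
    ActivityTimeID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ActivityID INT NOT NULL
        REFERENCES dbo.Activity(ActivityID),
    StartTime DATETIME NOT NULL,
    Location NVARCHAR(50) NULL
);

CREATE TABLE dbo.SignUp (             -- associates a passenger with an offering
    ActivityTimeID INT NOT NULL
        REFERENCES dbo.ActivityTime(ActivityTimeID),
    PassengerID INT NOT NULL,         -- FK to the existing passenger entity
    PRIMARY KEY (ActivityTimeID, PassengerID)
);
```

Adding a brand-new activity, or a new offering of an existing one, is then a data change (an insert), not a schema change — which is what makes the generalized design easy to extend.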
Primary keys
Perhaps the most important concept of an entity (table) is that it has a primary key — an attribute or
set of attributes that can be used to uniquely identify the tuple (row). Every entity must have a primary
key; without a primary key, it’s not a valid entity.
By definition, a primary key must be unique and must have a value (not null).
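Both single-column and multi-column (composite) primary keys satisfy that definition. This minimal sketch uses hypothetical airline tables for illustration:

```sql
-- Hypothetical examples of declaring a primary key. In both cases the
-- key columns must be unique across the table and must be NOT NULL.
CREATE TABLE dbo.Airport (
    AirportCode CHAR(3) NOT NULL PRIMARY KEY   -- natural single-column key
);

CREATE TABLE dbo.FlightLeg (
    FlightNumber INT NOT NULL,
    LegNumber    INT NOT NULL,
    CONSTRAINT PK_FlightLeg
        PRIMARY KEY (FlightNumber, LegNumber)  -- composite key: the pair is unique
);
```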