In the online simulation called “Volunteer Now!” you will help a group of college students who have been trying to match people who want to volunteer their time to organizations that need their skills, such as homeless shelters, hospitals, and animal rescue services. They have been keeping track of all their data using three-ring binders and sticky notes, but mistakes are common. One volunteer who signed up to work in a soup kitchen was given the wrong address and wound up wandering around lost in a deserted warehouse. What Volunteer Now! needs is a database that fits its mission, one that the staff and volunteers can access from their laptops and smartphones at any time of day. You will learn a lot about databases as you help them design it.
Introduction
An online, interactive decision-making simulation that reinforces chapter contents and uses key terms in context can be found in MyMISLab™.
Chapter 4: Databases and Data Warehouses
MyMISLab™
• Online Simulation: Volunteer Now! A Role-Playing Simulation on Designing the Database for a Volunteer-Matching Service
• Discussion Questions: #4-1, #4-2, #4-3
• Writing Assignments: #4-22, #4-23
Volunteer Now!
A Role-Playing Simulation on Designing the Database for a Volunteer-Matching Service
Information resources are central to any organization’s success. And these resources are growing at an astounding rate. Data stored in digital format are multiplying everywhere on a vast array of physical media, ranging from the organization’s own computers to hosts that might be located anywhere on the planet. Data also reside on DVDs, CD-ROMs, and tapes and inside people’s digital cameras, cell phones, iPods, and flash drives on a keychain. On your own workspace, for instance, objects that don’t store or display information of some kind are scarce, perhaps only the coffee cup or stapler (Figure 4-1).
People understand that some information is powerful and valuable, but far more is useless junk that should be tossed.
We need a strategy to manage information resources so that what is important is secure, organized, and easily accessible to managers, employees, customers, suppliers, and other stakeholders. This enormous challenge is the subject of this chapter.
mangostock/Shutterstock.
Figure 4-1
The modern workspace: An information storehouse.
Source: Kathy Burns-Millyard/Shutterstock.
The Nature of Information Resources
Structured, Unstructured, and Semi-Structured Information
Every organization relies on structured information, the kind that is usually considered to be facts and data (Figure 4-2). It is reasonably ordered, in that it can be broken down into component parts and organized into hierarchies. Your credit card company, for example, maintains your customer record in a structured format. It contains your last name, first name, street address, phone number, email address, and other data. It would also maintain your purchases, each with a transaction date, description, debit or credit amount, and reference numbers.
Straightforward relationships among the data elements are also relatively easy to identify. A customer’s order would be related to the customer record, and the items purchased as part of the order would be related to the order itself. This kind of information is the heart of an organization’s operational information systems, with electronically stored customer records, orders, invoices, transactions, employee records, shipping tables, and similar kinds of information. It is the kind that databases are designed to store and retrieve.
In contrast, unstructured information has no inherent structure or order, and the parts can’t be easily linked together, except perhaps by stuffing them in a manila folder or box. It is more difficult to break down, categorize, organize, and query. Consider a company involved in a touchy lawsuit. Relevant information might include letters, emails, Twitter feeds, sticky notes, text messages, meeting minutes, phone calls, videos, Facebook posts, resumes, or photos.
Drawing information out of unstructured collections also presents challenges. A catering business might have a back room stacked with boxes containing unstructured information on hundreds of contracts. If the owner wants to know which contracts went over budget and then see who handled those, every box would have to be opened. Because unstructured collections have no means to enforce rules about what types of information must be included, the owner may find little to go on.
Learning Objective 1: Explain the nature of information resources in terms of structure and quality, and show how metadata can be used to describe these resources.
Type of Information Resource | Example
Structured information | A sales transaction with clearly defined fields for date, customer number, item number, and amount
Unstructured information | Manila folder containing assorted items about a lawsuit, such as photos, handwritten notes, newspaper articles, or affidavits
Semi-structured information | A web page with a title, subtitle, content, and a few images
Figure 4-2
Types of information resources.
metadata
Data about data that clarifies the nature of the information.
semi-structured information
Information category that falls between structured and unstructured information. It includes facts and data that show at least some structure, such as web pages and documents, which bear creation dates, titles, and authors.
unstructured information
Information that has no inherent structure or order and whose parts can’t be easily linked together.
structured information
Facts and data that are reasonably ordered or that can be broken down into component parts and organized into hierarchies.
A vast gray area exists between the extremes of structured and unstructured information; this is the area within which semi-structured information falls. This type includes information that shows at least some structure, such as web pages that have dates, titles, and authors. Spreadsheets can also be semi-structured, especially when they are created by different people to keep track of the same kind of information. One salesperson, for instance, might put a contact’s work phone and mobile phone in different columns labeled “Work Phone” and “Mobile,” but another might keep them in the same column under the heading “Phones.” Resources like these don’t have the strong structure, enforced by advance planning, to clearly define entities and their relationships, and they lack controls about completeness and formatting. Nevertheless, such data are easier to query and combine than the unstructured variety.
Metadata
Metadata is data about data, and it clarifies the nature of the information. For structured information, metadata describes the definitions of each of the fields, tables, and their relationships. For semi-structured and unstructured information, metadata is used to describe properties of a document or other resource and is especially useful because it layers some structure on information that is less easily categorized and classified. YouTube’s database, for example, contains metadata about each of its videos that can be searched and sorted. A library’s card catalog provides metadata about the books, such as where they are physically shelved.
The photo-sharing website Flickr relies on metadata to search its enormous photo collection. A father’s beach scene photos, with filenames such as “image011.jpg,” become more accessible, meaningful, and sharable for friends and family when metadata are added to their properties, such as location, subject, date taken, and photographer (Figure 4-3).
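To see why metadata makes a photo collection searchable, consider the minimal sketch below. It is only an illustration: the dictionary fields and the `find_by_keyword` function are hypothetical names chosen for this example and do not reflect how Flickr actually stores or searches its photos.

```python
# Hypothetical in-memory photo collection. The field names are illustrative
# assumptions, not Flickr's actual schema.
photos = [
    {
        "filename": "image011.jpg",
        "title": "Ocean beach scene",
        "photographer": "Felipe DiMarco",
        "keywords": {"ocean", "waves", "beach", "vacation"},
    },
    {
        "filename": "image012.jpg",
        "title": "",
        "photographer": "",
        "keywords": set(),  # no metadata has been added to this photo yet
    },
]


def find_by_keyword(collection, keyword):
    """Return the filenames of photos whose keyword metadata includes `keyword`."""
    wanted = keyword.lower()
    return [p["filename"] for p in collection if wanted in p["keywords"]]


beach_photos = find_by_keyword(photos, "Beach")
print(beach_photos)
```

The second photo has the same kind of content but no metadata, so a keyword search cannot find it. The filename alone says nothing about what the picture shows.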
The Quality of Information
Not all information has high quality, as anyone who surfs the net knows. Here are the most important characteristics that affect quality:
Accuracy. Mistakes in birth dates, spelling, or price reduce the quality of the information.
Precision. Rounding to the nearest mile might not reduce quality much when you estimate the drive to the mall. However, for property surveys, “about 2 miles” is unacceptable.
Completeness. Omitting the zip code on the customer’s address record might not be a problem because the zip can be determined by the address. But leaving off the house number would delay the order.
Consistency. Reports that show “total sales by region” may conflict because the people generating the reports are using slightly different definitions. When results are inconsistent, the quality of both reports is in question.
Timeliness. Outdated information has less value than up-to-date information and thus is lower quality unless you are looking for historical trends. The actual definition for what is up-to-date varies. In stock trading, timeliness is measured in fractions of a second.
Productivity Tip
Adding metadata to the properties of your documents, photos, and videos makes them easier to search and locate later. Right-clicking on the filename usually brings up a menu that includes Properties. You can also remove information from a file’s properties so others will not see it.
Bias. Biased information lacks objectivity, and that reduces its value and quality. To make sales seem higher, a manager might choose to include canceled orders, though the CEO might not be pleased.
Duplication. Information can be redundant, resulting in misleading and exaggerated summaries. In customer records, people can easily appear more than once if their address changes.
The data collected by online surveys illustrate many of the problems surrounding information quality.1 The sample of people who actually respond is biased, and people may race through the questions or turn in more than one survey. Virtual Surveys Ltd., a company that specializes in web-based research, discovered that one person completed an online survey 750 times because a raffle ticket was offered as an incentive.2 To avoid relying on poor-quality data like that, managers must define what constitutes quality for the information they need.
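Two of the quality characteristics, completeness and duplication, lend themselves to simple automated checks. The sketch below assumes a made-up list of survey responses in which each response carries a `respondent_id`. Both the data and the function names are hypothetical, and real surveys often cannot identify repeat respondents this easily.

```python
from collections import Counter

# Hypothetical survey responses; respondent_id and the answers are made up.
responses = [
    {"respondent_id": "r1", "answers": {"q1": "yes", "q2": "no"}},
    {"respondent_id": "r2", "answers": {"q1": "no", "q2": None}},
    {"respondent_id": "r1", "answers": {"q1": "yes", "q2": "no"}},
]


def duplicate_respondents(rows):
    """Duplication check: IDs of respondents who submitted more than once."""
    counts = Counter(row["respondent_id"] for row in rows)
    return sorted(rid for rid, n in counts.items() if n > 1)


def incomplete_respondents(rows):
    """Completeness check: IDs of respondents who left any question unanswered."""
    return sorted(
        {
            row["respondent_id"]
            for row in rows
            if any(value is None for value in row["answers"].values())
        }
    )


print(duplicate_respondents(responses))
print(incomplete_respondents(responses))
```

Checks like these catch only what can be defined in advance. Bias and accuracy problems usually require human judgment about how the data were collected.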
Photo Metadata | Description
Photo title | Ocean beach scene
Date taken | 12/15/2011
License type | Royalty free
Photographer | Felipe DiMarco
Key words | Ocean, waves, outdoors, sunshine, beach, vacation, swimming, swimmers, fishing, surf
Figure 4-3
Metadata for a beach scene photo.
Source: Rigucci/Shutterstock.
Managing Information: From Filing Cabinets to the Database
Human ingenuity was applied to the challenges of information management long before the digital age. Before Edwin Seibels devised the vertical filing system in 1898, businesses often organized documents by putting them in envelopes, in rows of small pigeonholes that lined entire walls from top to bottom. The change to vertical manila folders, neatly arranged in cabinet drawers, was quite an improvement for record keeping and much appreciated by file clerks (Figure 4-4). The real revolution, however, occurred in the 1960s when computers entered the picture. These relied on an organizing strategy built around the concept of the record.
Learning Objective 2: Compare file processing systems to the database, explaining the database’s advantages.
table
A group of records for the same entity, such as employees. Each row is one record, and the fields of each record are arranged in the table’s columns.
field
An attribute of an entity. A field can contain numeric data or text, or a combination of the two.
record
A means to represent an entity, which might be a person, a product, a purchase order, an event, a building, a vendor, a book, a video, or some other “thing” that has meaning to people. The record is made up of attributes of that thing.
data definition
Specifies the characteristics of a field, such as the type of data it will hold or the maximum number of characters it can contain.
Tables, Records, and Fields
A table is a group of records for the same entity, such as employees, products, books, videos, or some other “thing” that has meaning to people (Figure 4-5). The record is a row in the table, and it represents an instance of the entity—a single person, for instance. The record is made up of attributes of that thing, and each of the attributes is called a field. The fields are the columns in the table. Fields typically contain numeric data, text, or a combination of the two.
Each field should have a data definition that specifies the field’s properties, such as the type of data it will hold (e.g., alphabetic, alphanumeric, or numeric) and the maximum number of characters it can contain. It also includes rules that might restrict what goes into the field or make the field required.
Figure 4-4
Early information management approaches.
Source: Edwin Verin/Shutterstock and deepspacedave/Shutterstock.
PetID Description Name Gender Breed Birthday
201447 Dog Champion M Mixed 12/1/2014
201448 Dog Keiko F Beagle 5/25/2016
201449 Cat Sunny F Persian 4/15/2015
201450 Cat Mister M Siamese 6/14/2016
Figure 4-5
A table for the entity “pets” for a veterinarian’s office, showing records (rows) and fields (columns).
Consider, for example, a table that will hold employee records, created using MS Access. The field names might include employee ID, last name, first name, birth date, gender, email, and phone, and the data type appears next to each field name. The properties for BirthDate appear in the bottom half of Figure 4-6. The designer decided to make the field required, to make sure users enter it as MM/DD/YYYY, and to allow only dates that are earlier than today’s date.
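The three rules chosen for BirthDate can be sketched as an ordinary validation function. This is not how MS Access implements field properties. It is a hedged illustration in Python, and the function name `validate_birth_date` and its `today` parameter are assumptions made for this example.

```python
from datetime import date, datetime


def validate_birth_date(text, today=None):
    """Apply the three BirthDate rules; return a date or raise ValueError."""
    today = today or date.today()

    # Rule 1: the field is required.
    if text is None or not text.strip():
        raise ValueError("BirthDate is required")

    # Rule 2: the value must follow the MM/DD/YYYY pattern.
    # Note: strptime also accepts single-digit months and days such as 5/25/2016.
    try:
        parsed = datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        raise ValueError("BirthDate must be entered as MM/DD/YYYY") from None

    # Rule 3: only dates earlier than today's date are allowed.
    if parsed >= today:
        raise ValueError("BirthDate must be earlier than today's date")
    return parsed


print(validate_birth_date("12/01/2014"))
```

Keeping rules like these in the data definition, rather than in each program that touches the field, means every application that uses the table is held to the same standard.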
The Rise and Fall of File Processing Systems
Initially, electronic records were created and stored as computer files, and programmers wrote computer programs to add, delete, or edit the records. Each department maintained its own records with its own computer files, each containing information that was required for operations. For example, the payroll office maintained personnel records and had its own computer programs to maintain and manage its set of files. At the end of the month when it was time to generate payroll checks, the payroll system’s computer programs would read each record in the file and print out checks and payroll stubs for each person, using the information contained in the files for that department. That kind of activity is called batch processing: the program performs the same operations sequentially on each record in a large batch.
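A batch payroll run of the kind just described can be sketched in a few lines. Everything in this example is hypothetical: the records, the field names, and especially the flat 20 percent deduction, which stands in for the real tax and deduction rules that an actual payroll program would apply.

```python
from decimal import Decimal

# Hypothetical contents of the payroll department's file.
payroll_file = [
    {"employee_id": 101, "name": "A. Rivera", "monthly_salary": Decimal("4200.00")},
    {"employee_id": 102, "name": "B. Chen", "monthly_salary": Decimal("3150.50")},
]


def run_payroll_batch(records, tax_rate=Decimal("0.20")):
    """Process every record in order and return one pay stub per employee."""
    stubs = []
    for record in records:  # sequential pass over the whole batch
        gross = record["monthly_salary"]
        tax = (gross * tax_rate).quantize(Decimal("0.01"))
        stubs.append(
            {
                "employee_id": record["employee_id"],
                "gross": gross,
                "tax": tax,
                "net": gross - tax,
            }
        )
    return stubs


stubs = run_payroll_batch(payroll_file)
for stub in stubs:
    print(stub["employee_id"], stub["net"])
```

Note that the program and the file layout are tightly coupled: the loop only works if every record has a `monthly_salary` field in the expected form.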
Accounts payable and receivable, personnel, payroll, and inventory were the first beneficiaries of the digital age. Compared to the manual method of generating a payroll, in which deductions and taxes were computed by hand and each check was individually typed, the monthly batch processing of computer-generated checks was revolutionary. However, it didn’t take long for problems to surface as other offices began to develop their own file processing systems. Understanding what went wrong is crucial to grasping why the database offers so many benefits.
Data Redundancy and Inconsistency
Because each set of computer programs operated on its own records, much information was redundant and inconsistent (Figure 4-7). The payroll office record might list your name as ANNAMARIE, while the personnel office that handles benefits shows you as ANNMARIE. Further, the extra work involved in resolving redundant records was not trivial, and it often never got done.
Figure 4-6
Data definition for the field “birthdate” in MS Access.
Source: Microsoft® Access, Microsoft Corporation. Reprinted with permission.
Lack of Data Integration
Integrating data from the separate systems was a struggle (Figure 4-8). For example, the payroll system might maintain information about name, address, and pay history, but gender and ethnicity are in personnel records. If a manager wanted to compare pay rates by ethnicity, new programs had to be written to match up the records. This clumsy integration also affects customers, who fume when they can’t resolve inconsistencies in their accounts (Figure 4-9).
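The sketch below shows why matching records across separate files was so fragile. It assumes two hypothetical files keyed by employee name, because separate systems often shared no common identifier. The surnames, pay rates, and gender values are invented for illustration.

```python
# Hypothetical files kept by two departments, each keyed by employee name.
payroll = {
    "ANNAMARIE SMITH": {"hourly_rate": 31.50},
    "DOUG GUARINO": {"hourly_rate": 27.00},
}
personnel = {
    "ANNMARIE SMITH": {"gender": "F"},  # same person, spelled differently
    "DOUG GUARINO": {"gender": "M"},
}


def integrate(payroll_records, personnel_records):
    """Combine the two files by name; report payroll names with no match."""
    matched, unmatched = {}, []
    for name, pay in payroll_records.items():
        person = personnel_records.get(name)
        if person is None:
            unmatched.append(name)
        else:
            matched[name] = {**pay, **person}
    return matched, unmatched


matched, unmatched = integrate(payroll, personnel)
print(sorted(matched))
print(unmatched)
```

One spelling difference is enough to drop an employee from the combined report, so a summary of pay rates by gender would silently omit that person.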
Inconsistent Data Definitions
When programmers write code to handle files, differences in format creep in. Phone numbers may include the dashes and be formatted as a text field in one system but be treated as numbers in another. A more subtle problem involves the way people actually choose to use the system. Data definitions may seem similar across systems, but they are used differently, and summaries become misleading. For example, employees in the personnel department at a retail chain categorize software purchases as “computers.” Their coworkers in sales prefer to lump software with pencils, staplers, and clocks as “supplies” because less paperwork is needed to justify the purchase. The CEO lamented that there was no way anyone could possibly know how much this chain was spending on technology because of inconsistent coding (Figure 4-10).
Data Dependence
These early systems became maintenance nightmares because the programs and their files were so interconnected and dependent on one another. The programs all defined the fields and their formats, and business rules were all hard-coded or embedded in the programs. Even a minor change to accommodate a new business strategy took a lot of work. IT staff were constantly busy but kept falling behind anyway.
The disadvantages to the file processing approach led to a better way of organizing structured data, one that relies on the database.
[Figures 4-7 and 4-8, diagram labels: Payroll, Accounting, Sales, and Human Resources (HR) files. In Figure 4-7, one employee’s EmployeeName appears as “Smith, Annmarie Vorgas,” “Smith, Annmarie V.,” and “Annamarie Vorgas-Smith.” In Figure 4-8, the Payroll file holds EmployeeName “T. Douglas Guarino” with Salary/hr, and the Human Resources file holds EmployeeName “Guarino, Theodore” (Doug) with Gender M.]
Figure 4-7
Data redundancy problems. Separate file processing systems often contain redundant and inconsistent data.
Figure 4-8
Information in separate file processing systems is difficult to integrate. For example, a report listing hourly rates by gender would need extra programming effort in this business.
batch processing
The process of sequentially executing operations on each record in a large batch.
Databases and Database Management Systems
Today’s information management is built on the database and the software that manages it. The database is an integrated collection of information that is logically related and stored in such a way as to minimize duplication and facilitate rapid retrieval. Its major advantages over file processing systems include:
• Reduced redundancy and inconsistency
• Improved information integrity and accuracy
• Improved ability to adapt to changes
• Improved performance and scalability
• Increased security
A database management system (DBMS) is used to create and manage the database. This software provides tools for ensuring security, replication, retrieval, and other administrative and housekeeping tasks. The DBMS serves as a kind of gateway to the database itself and as a manager for handling creation, performance tuning, transaction processing, general maintenance, access rights, deletion, and backups.
[Figure 4-9, diagram labels: Auto Insurance, Life Insurance, Billing, and Claims offices, each with its own CustomerName entry. The entries read “Jarrod, Roberto”; “Jarrod, Robert”; “Jarrod, Robert” with “Jarrod, Stephanie”; and “Jarrod, Robert.”]
Department | Object Code | Amount | Category | Description
Sales | 4211 | 1888.25 | Computers | Desktop Computers
Sales | 4300 | 249.95 | Computer supplies | Image editing software
Sales | 4100 | 29.99 | Office supplies | Flash drive
Personnel | 4211 | 59.00 | Computers | Statistical software
Personnel | 4300 | 14.95 | Computer supplies | Flash drive
Personnel | 4211 | 2500.21 | Computers | Laptop Computers
Warehouse | 4211 | 59500.00 | Computers | Web server
Warehouse | 4211 | 2500.00 | Computers | Printer/copier/scanner/fax
Figure 4-9
Separate file processing systems lead to a fragmented customer interface, frustrating customers who have to contact several offices to straighten out inconsistencies.
Source: Photo: William Casey/Shutterstock.
Figure 4-10
When data definitions are inconsistent, the meaning of different fields will vary across departments and summaries will be misleading. Note how the three departments use categories in different ways.
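The effect of inconsistent coding on summaries can be checked with a little arithmetic. The sketch below uses the purchase rows from Figure 4-10. The function names are hypothetical, and the keyword search on descriptions is only a rough stand-in for the manual review a real analyst would need to do.

```python
from collections import defaultdict
from decimal import Decimal

# Rows from Figure 4-10: (department, object code, amount, category, description).
purchases = [
    ("Sales", 4211, Decimal("1888.25"), "Computers", "Desktop computers"),
    ("Sales", 4300, Decimal("249.95"), "Computer supplies", "Image editing software"),
    ("Sales", 4100, Decimal("29.99"), "Office supplies", "Flash drive"),
    ("Personnel", 4211, Decimal("59.00"), "Computers", "Statistical software"),
    ("Personnel", 4300, Decimal("14.95"), "Computer supplies", "Flash drive"),
    ("Personnel", 4211, Decimal("2500.21"), "Computers", "Laptop computers"),
    ("Warehouse", 4211, Decimal("59500.00"), "Computers", "Web server"),
    ("Warehouse", 4211, Decimal("2500.00"), "Computers", "Printer/copier/scanner/fax"),
]


def totals_by_category(rows):
    """Sum the amounts under each category label, as a naive report would."""
    totals = defaultdict(Decimal)
    for _dept, _code, amount, category, _desc in rows:
        totals[category] += amount
    return dict(totals)


def categories_for(rows, keyword):
    """Return every category label used for items whose description has `keyword`."""
    return {cat for _d, _c, _a, cat, desc in rows if keyword.lower() in desc.lower()}


totals = totals_by_category(purchases)
software_categories = categories_for(purchases, "software")
print(totals)
print(software_categories)
```

Software shows up under two different categories, and so do flash drives. The category totals therefore add up correctly but do not answer the CEO’s question about technology spending.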
Database Architecture
To be most useful, a database must handle three types of relationships with a minimum of redundancy (Figure 4-11):
• One-to-one
• One-to-many
• Many-to-many
The one-to-one relationship is relatively easy to accommodate, and even file processing systems can handle it. For instance, each person has one and only one birth date. The one-to-many relationship between records is somewhat more challenging. A person might have one or more dependents, for example, or one or more employees reporting to him or her. The many-to-many relationship is also more complicated to support. This might involve a situation in which a person might be working on any number of projects, each of which can have any number of employees assigned to it.
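The three relationship types can be pictured with ordinary data structures. The sketch below is only an analogy, not a database design, and all of the IDs, names, and dates in it are made up.

```python
# One-to-one: each employee ID maps to exactly one birth date.
birth_dates = {1: "1990-04-12", 2: "1985-11-03"}

# One-to-many: each employee ID maps to a list of zero or more dependents.
dependents = {1: ["Aiden", "Ella"], 2: []}

# Many-to-many: a set of (employee ID, project ID) pairs. An employee can
# appear in many pairs, and so can a project.
assignments = {(1, "P1"), (1, "P2"), (2, "P1")}


def projects_for(employee_id):
    """All projects that one employee works on."""
    return sorted(p for e, p in assignments if e == employee_id)


def employees_on(project_id):
    """All employees assigned to one project."""
    return sorted(e for e, p in assignments if p == project_id)


print(projects_for(1))
print(employees_on("P1"))
```

The many-to-many case needs its own collection of pairs because neither side can simply hold a single reference to the other.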
Earlier database architectures offered different strategies to organize and link records (Figure 4-12). For example, one intuitive way to organize information is to follow the organizational chart, and the hierarchical database did just that (Figure 4-13). This approach worked well for one-to-many relationships but stumbled when many-to-many links complicated the chart, such as when a person worked for two bosses. The network database (Figure 4-14) had more flexibility to link entities that didn’t fall along a neat hierarchy and could handle many-to-many relationships. But another inventive approach, the relational model, soon won out.
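The two-bosses problem can be illustrated with a loose analogy in code. The sketch below is not how hierarchical or network database products actually stored records. It simply assumes that a hierarchy allows each person one parent link, while a network allows several. The names are hypothetical.

```python
# Hierarchical sketch: every person has at most one boss, forming a tree.
reports_to = {"Beth": "Donald", "Carey": "Donald", "Jerome": "Beth"}


def chain_of_command(person, boss_of):
    """Follow the single boss link upward until reaching the top."""
    chain = []
    while person in boss_of:
        person = boss_of[person]
        chain.append(person)
    return chain


# Trying to give Jerome a second boss overwrites the first link.
two_bosses_attempt = dict(reports_to)
two_bosses_attempt["Jerome"] = "Carey"

# Network-style sketch: each person may be linked to any number of bosses.
bosses = {"Beth": {"Donald"}, "Carey": {"Donald"}, "Jerome": {"Beth", "Carey"}}

print(chain_of_command("Jerome", reports_to))
print(chain_of_command("Jerome", two_bosses_attempt))
print(sorted(bosses["Jerome"]))
```

In the tree version, the link from Jerome to Beth is lost as soon as Carey is recorded as his boss. That is the kind of stumble the text describes.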
The Relational Database
E. F. Codd, a British mathematician working at IBM, invented the relational database, which organizes information into tables of records that are related to one another by linking a field
[Figure 4-11, diagram labels: One-to-one (EmployeeID, EmployeeBirthDate); One-to-many, 1:N (EmployeeID, Dependents’ names); Many-to-many, M:N (EmployeeID, ProjectID). Sample names shown: Donald, Aiden, Carey, Beth, Ella, Jerome.]
Figure 4-11
Relationship types.
database
An integrated collection of information that is logically related and stored in such a way as to minimize duplication and facilitate rapid retrieval.
relational database
The widely used database model that organizes information into tables of records that are related to one another by linking a field in one table to a field in another table with matching data.
database management system (DBMS)
Software used to create and manage a database; it also provides tools for ensuring security, replication, retrieval, and other administrative and housekeeping tasks.
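The definition of a relational database, tables linked by fields with matching data, can be tried out with SQLite, a small DBMS that ships with Python. This is a minimal sketch under stated assumptions: the pets come from Figure 4-5, but the `owners` table, the `OwnerID` field, and the owner names are hypothetical additions made only to show the link.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE owners (
        OwnerID   INTEGER PRIMARY KEY,
        OwnerName TEXT NOT NULL
    );
    CREATE TABLE pets (
        PetID       INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL,
        Description TEXT,
        OwnerID     INTEGER REFERENCES owners (OwnerID)  -- the linking field
    );
    INSERT INTO owners VALUES (1, 'R. Lee'), (2, 'M. Ortiz');
    INSERT INTO pets VALUES
        (201447, 'Champion', 'Dog', 1),
        (201448, 'Keiko', 'Dog', 1),
        (201449, 'Sunny', 'Cat', 2);
    """
)

# Relate the two tables by matching OwnerID values.
rows = conn.execute(
    """
    SELECT pets.Name, owners.OwnerName
    FROM pets
    JOIN owners ON pets.OwnerID = owners.OwnerID
    ORDER BY pets.PetID
    """
).fetchall()

for pet_name, owner_name in rows:
    print(pet_name, owner_name)
```

Each owner’s name is stored once, in the `owners` table, and every pet record points to it through `OwnerID`. Correcting a misspelled owner name therefore takes a single update rather than a hunt through every pet record.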