Database Systems: The Complete Book (Part 3)


CHAPTER 4: OTHER DATA MODELS

data structures that support efficient answering of queries, as we shall discuss beginning in Chapter 13.

Yet the flexibility of semistructured data has made it important in two applications. We shall discuss its use in documents in Section 4.7, but here we shall consider its use as a tool for information integration. As databases have proliferated, it has become a common requirement that data in two or more of them be accessible as if they were one database. For instance, companies may merge; each has its own personnel database, its own database of sales, inventory, product designs, and perhaps many other matters. If corresponding databases had the same schemas, then combining them would be simple; for instance, we could take the union of the tuples in two relations that had the same schema and played the same roles in the two databases.

However, life is rarely that simple. Independently developed databases are unlikely to share a schema, even if they talk about the same things, such as personnel. For instance, one employee database may record spouse-name, another not. One may have a way to represent several addresses, phones, or emails for an employee; another database may allow only one of each. One database might be relational, another object-oriented.

To make matters more complex, databases tend over time to be used in so many different applications that it is impossible to shut them down and copy or translate their data into another database, even if we could figure out an efficient way to transform the data from one schema to another. This situation is often referred to as the legacy-database problem; once a database has been in existence for a while, it becomes impossible to disentangle it from the applications that grow up around it, so the database can never be decommissioned.

A possible solution to the legacy-database problem is suggested in Fig. 4.20. We show two legacy databases with an interface; there could be many legacy systems involved. The legacy systems are each unchanged, so they can support their usual applications.

[Diagram for Fig. 4.20: User, Interface]

For flexibility in integration, the interface supports semistructured data, and the user is allowed to query the interface using a query language that is suitable for such data. The semistructured data may be constructed by translating the data at the sources, using components called wrappers (or "adapters") that are each designed for the purpose of translating one source to semistructured data. Alternatively, the semistructured data at the interface may not exist at all. Rather, the user queries the interface as if there were semistructured data, while the interface answers the query by posing queries to the sources, each referring to the schema found at that source.

Example 4.27: We can see in Fig. 4.19 a possible effect of information about stars being gathered from several sources. Notice that the address information for Carrie Fisher has an address concept, and the address is then broken into street and city. That situation corresponds roughly to data that had a nested-relation schema like Stars(name, address(street, city)).

On the other hand, the address information for Mark Hamill has no address concept at all, just street and city. This information may have come from a schema such as Stars(name, street, city) that only has the ability to represent one address for a star. Some of the other variations in schema that are not reflected in the tiny example of Fig. 4.19, but that could be present if movie information were obtained from several sources, include: optional film-type information, a director, a producer or producers, the owning studio, revenue, and information on where the movie is currently playing.
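The wrapper idea of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, the sample rows, and all function names are hypothetical, and nested dictionaries and lists stand in for semistructured data. One wrapper handles a source shaped like Stars(name, address(street, city)), the other a source shaped like Stars(name, street, city).

```python
# Hypothetical source A follows Stars(name, address(street, city)):
# a star may have several nested addresses.
SOURCE_A = [
    ("Carrie Fisher", [("5 Locust Ln.", "Malibu")]),
]

# Hypothetical source B follows Stars(name, street, city):
# exactly one address, stored flat.
SOURCE_B = [
    ("Mark Hamill", "456 Oak Rd.", "Brentwood"),
]


def wrap_nested(row):
    """Wrapper for source A: keep the address concept as a sub-object."""
    name, addresses = row
    return {
        "name": name,
        "address": [{"street": s, "city": c} for s, c in addresses],
    }


def wrap_flat(row):
    """Wrapper for source B: street and city hang directly off the star."""
    name, street, city = row
    return {"name": name, "street": street, "city": city}


def integrate():
    """Build one semistructured collection of stars from both sources."""
    stars = [wrap_nested(r) for r in SOURCE_A]
    stars += [wrap_flat(r) for r in SOURCE_B]
    return {"star": stars}


def cities_of(star):
    """A query that tolerates both shapes, as a semistructured query must."""
    if "address" in star:
        return [a["city"] for a in star["address"]]
    return [star["city"]] if "city" in star else []
```

Neither source is changed; only the wrappers know each source's schema, and a query such as `cities_of` has to cope with the fact that the integrated data has no single fixed shape.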

4.6.4 Exercises for Section 4.6

Exercise 4.6.1: Since there is no schema to design in the semistructured-data model, we cannot ask you to design schemas to describe different situations. Rather, in the following exercises we shall ask you to suggest how particular data might be organized to reflect certain facts.

* a) Add to Fig. 4.19 the facts that Star Wars was directed by George Lucas and produced by Gary Kurtz.

b) Add to Fig. 4.19 information about Empire Strikes Back and Return of the Jedi, including the facts that Carrie Fisher and Mark Hamill appeared in these movies.

c) Add to (b) information about the studio (Fox) for these movies and the address of the studio (Hollywood).

* Exercise 4.6.2: Suggest how typical data about banks and customers, as in Exercise 2.1.1, could be represented in the semistructured model.

Exercise 4.6.3: Suggest how typical data about players, teams, and fans, as was described in Exercise 2.1.3, could be represented in the semistructured model.

Figure 4.20: Integrating two legacy databases through an interface that supports semistructured data

Exercise 4.6.4: Suggest how typical data about a genealogy, as was described in Exercise 2.1.6, could be represented in the semistructured model.

*! Exercise 4.6.5: The E/R model and the semistructured-data model are both "graphical" in nature, in the sense that they use nodes, labels, and connections among nodes as the medium of expression. Yet there is an essential difference between the two models. What is it?

4.7 XML and Its Data Model

XML (Extensible Markup Language) is a tag-based notation for "marking" documents, much like the familiar HTML or less familiar SGML. A document is nothing more nor less than a file of characters. However, while HTML's tags talk about the presentation of the information contained in documents - for instance, which portion is to be displayed in italics or what the entries of a list are - XML tags talk about the meaning of substrings within the document.

In this section we shall introduce the rudiments of XML. We shall see that it captures, in a linear form, the same structure as do the graphs of semistructured data introduced in Section 4.6. In particular, tags play the same role as did the labels on the arcs of a semistructured-data graph. We then introduce the DTD ("document type definition"), which is a flexible form of schema for XML documents.

Tags in XML are text surrounded by triangular brackets, i.e., <...>, as in HTML. Also as in HTML, tags generally come in matching pairs, with a beginning tag like <FOO> and a matching ending tag that is the same word with a slash, like </FOO>. In HTML there is an option to have tags with no matching ender, like <P> for paragraphs, but such tags are not permitted in XML. When tags come in matching begin-end pairs, there is a requirement that the pairs be nested. That is, between a matching pair <FOO> and </FOO>, there can be any number of other matching pairs, but if the beginning of a pair is in this range, then the ending of the pair must also be in the range.

XML is designed to be used in two somewhat different modes:

1. Well-formed XML allows you to invent your own tags, much like the arc-labels in semistructured data. This mode corresponds quite closely to semistructured data, in that there is no schema, and each document is free to use whatever tags the author of the document wishes.

2. Valid XML involves a Document Type Definition that specifies the allowable tags and gives a grammar for how they may be nested. This form of XML is intermediate between the strict-schema models such as the relational or ODL models, and the completely schemaless world of semistructured data. As we shall see in Section 4.7.3, DTD's generally allow more flexibility in the data than does a conventional schema; DTD's often allow optional fields or missing fields, for instance.

4.7.2 Well-Formed XML

Well-formed XML requires that the document begin with a declaration that it is XML, and that it have a root tag surrounding the entire body of the document. Thus, a well-formed XML document begins with a line such as:

<? XML VERSION = "1.0" STANDALONE = "yes" ?>

This first line indicates that the file is an XML document. The parameter STANDALONE = "yes" indicates that there is no DTD for this document; i.e., it is well-formed XML. Notice that this initial declaration is delineated by special markers <? ... ?>.

<? XML VERSION = "1.0" STANDALONE = "yes" ?>
<STAR-MOVIE-DATA>
    <STAR><NAME>Carrie Fisher</NAME>
        <ADDRESS><STREET>...</STREET>
            <CITY>Hollywood</CITY></ADDRESS>
        <ADDRESS><STREET>5 Locust Ln.</STREET>
            <CITY>Malibu</CITY></ADDRESS>
    </STAR>
    <STAR><NAME>Mark Hamill</NAME>
        <STREET>456 Oak Rd.</STREET><CITY>Brentwood</CITY>
    </STAR>
    <MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR>
    </MOVIE>
</STAR-MOVIE-DATA>

Figure 4.21: An XML document about stars and movies

Example 4.28: In Fig. 4.21 is an XML document that corresponds roughly to the data in Fig. 4.19. The root tag is STAR-MOVIE-DATA. We see two sections surrounded by the tag <STAR> and its matching </STAR>. Within each section are subsections giving the name of the star. One, for Carrie Fisher, has two subsections, each giving the address of one of her homes. These sections are surrounded by an <ADDRESS> tag and its ender. The section for Mark Hamill

has only entries for one street and one city, and does not use an <ADDRESS> tag to group these. This distinction appeared as well in Fig. 4.19.

Notice that the document of Fig. 4.21 does not represent the relationship "stars-in" between stars and movies. We could store information about each movie of a star within the section devoted to that star, for instance:

<STAR><NAME>Mark Hamill</NAME>
    <STREET>Oak</STREET><CITY>Brentwood</CITY>
    <MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>
    <MOVIE><TITLE>Empire</TITLE><YEAR>1980</YEAR></MOVIE>
</STAR>

However, that approach leads to redundancy, since all information about the movie is repeated for each of its stars (we have shown no information except a movie's key - title and year - which does not actually represent an instance of redundancy). We shall see in Section 4.7.5 how XML handles this problem.

In order for a computer to process XML documents automatically, there needs to be something like a schema for the documents. That is, we need to be told what tags can appear in a collection of documents and how tags can be nested. The description of the schema is given by a grammar-like set of rules, called a document type definition, or DTD. It is intended that companies or communities wishing to share data will each create a DTD that describes the form(s) of the documents they share, establishing a shared view of the semantics of their tags. For instance, there could be a DTD for describing protein structures, a DTD for describing the purchase and sale of auto parts, and so on.

The gross structure of a DTD is:

<!DOCTYPE root-tag [
    <!ELEMENT element-name (components)>
    ...
]>

The root-tag is used (with its matching ender) to surround a document that conforms to the rules of this DTD. An element is described by its name, which is the tag used to surround portions of the document that represent that element, and a parenthesized list of components. The latter are tags that may or must appear within the tags for the element being described. The exact requirements on each component are indicated in a manner we shall see shortly.

There is, however, an important special case. (#PCDATA) after an element name means that element has a value that is text, and it has no tags nested within.

Example 4.29: In Fig. 4.22 we see a DTD for stars. The name and surrounding tag is STARS (the text treats tag names as case-insensitive, so STARS is clearly the root tag; the XML 1.0 standard itself is case-sensitive, so a real document would need Stars and STARS to match exactly). The first element definition says that inside the matching pair of tags <STARS>...</STARS> we will find zero or more STAR tags, each representing a single star. It is the * in (STAR*) that says "zero or more," i.e., "any number of."

<!DOCTYPE Stars [
    <!ELEMENT STARS (STAR*)>
    <!ELEMENT STAR (NAME, ADDRESS+, MOVIES)>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT ADDRESS (STREET, CITY)>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT MOVIES (MOVIE*)>
    <!ELEMENT MOVIE (TITLE, YEAR)>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT YEAR (#PCDATA)>
]>

Figure 4.22: A DTD for movie stars

The second element, STAR, is declared to consist of three kinds of subelements: NAME, ADDRESS, and MOVIES. They must appear in this order, and each must be present. However, the + following ADDRESS says "one or more"; that is, there can be any number of addresses listed for a star, but there must be at least one. The NAME element is then defined to be "PCDATA," i.e., simple text. An ADDRESS element consists of a street and a city, in that order.

Then, the MOVIES element is defined to have zero or more elements of type MOVIE. A MOVIE element consists of a title and a year, each of which is simple text. Figure 4.23 is an example of a document that conforms to the DTD of Fig. 4.22.

The components of an element E are generally other elements. They must appear between the tags <E> and </E> in the order listed. However, there are several operators that control the number of times elements appear.

1. A * following an element means that the element may occur any number of times, including zero times.

2. A + following an element means that the element may occur one or more times.

3. A ? following an element means that the element may occur either zero times or one time.

4. The symbol | may appear between elements, or between parenthesized groups of elements, to signify "or"; that is, either the element(s) on the left appear or the element(s) on the right appear, but not both. For example, the expression (#PCDATA | (STREET, CITY)) as components for element ADDRESS would mean that an address could be either simple text, or consist of tagged street and city components.

Figure 4.23: Example of a document following the DTD of Fig. 4.22

4.7.4 Using a DTD

If a document is intended to conform to a certain DTD, we can either:

a) Include the DTD itself as a preamble to the document, or

b) In the opening line, refer to the DTD, which must be stored separately in the file system accessible to the application that is processing the document.

Example 4.30: Here is how we might introduce the document of Fig. 4.23 to assert that it is intended to conform to the DTD of Fig. 4.22.

<?XML VERSION = "1.0" STANDALONE = "no"?>
<!DOCTYPE Stars SYSTEM "star.dtd">

The parameter STANDALONE = "no" says that a DTD is being used. Recall we set this parameter to "yes" when we did not wish to specify a DTD for the document. The location from which the DTD can be obtained is given in the !DOCTYPE clause, where the keyword SYSTEM followed by a file name gives this location.

4.7.5 Attribute Lists

There is a strong relationship between XML documents and semistructured data. Suppose that for some pair of matching tags <T> and </T> in a document we create a node n. Then, if <S> and </S> are matching tags nested directly within the pair <T> and </T> (i.e., there are no matched pairs surrounding the S-pair but surrounded by the T-pair), we draw an arc labeled S from node n to the node for the S-pair. Then the result will be an instance of semistructured data that has essentially the same structure as the document.

Unfortunately, the relationship doesn't go the other way, with the limited subset of XML we have described so far. We need a way to express in XML the idea that an instance of an element might have more than one arc leading to that element. Clearly, we cannot nest a tag-pair directly within more than one tag-pair, so nesting is not sufficient to represent multiple predecessors of a node. The additional features that allow us to represent all semistructured data in XML are attributes within tags, identifiers (ID's), and identifier references (IDREF's).

Opening tags can have attributes that appear within the tag, in analogy to constructs like <A HREF = ...> in HTML. Keyword !ATTLIST introduces a list of attributes and their types for a given element. One common use of attributes is to associate single, labeled values with a tag. This usage is an alternative to subtags that are simple text (i.e., declared as PCDATA).

Another important purpose of such attributes is to represent semistructured data that does not have a tree form. An attribute for elements of type E that is declared to be an ID will be given values that uniquely identify each portion of the document that is surrounded by an <E> and matching </E> tag. In terms of semistructured data, an ID provides a unique name for a node.

Other attributes may be declared to be IDREF's. Their values are the ID's associated with other tags. By giving one tag instance (i.e., a node in semistructured data) an ID with a value v and another tag instance an IDREF with value v, the latter is effectively given an arc or link to the former. The following example illustrates both the syntax for declaring ID's and IDREF's and the significance of using them in data.


<!DOCTYPE Stars-Movies [
    <!ELEMENT STARS-MOVIES (STAR*, MOVIE*)>
    <!ELEMENT STAR (NAME, ADDRESS+)>
    <!ATTLIST STAR starId ID starredIn IDREFS>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT ADDRESS (STREET, CITY)>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT MOVIE (TITLE, YEAR)>
    <!ATTLIST MOVIE movieId ID starsOf IDREFS>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT YEAR (#PCDATA)>
]>

Figure 4.24: A DTD for stars and movies, using ID's and IDREF's

Example 4.31: Figure 4.24 shows a revised DTD, in which stars and movies are given equal status, and ID-IDREF correspondence is used to describe the many-many relationship between movies and stars. Analogously, the arcs between nodes representing stars and movies describe the same many-many relationship in the semistructured data of Fig. 4.19. The name of the root tag for this DTD has been changed to STARS-MOVIES, and its elements are a sequence of stars followed by a sequence of movies.

A star no longer has a set of movies as subelements, as was the case for the DTD of Fig. 4.22. Rather, its only subelements are a name and address, and in the beginning <STAR> tag we shall find an attribute starredIn whose value is a list of ID's for the movies of the star. Note that the attribute starredIn is declared to be of type IDREFS, rather than IDREF. The additional "S" allows the value of starredIn to be a list of ID's for movies, rather than a single movie, as would be the case if the type IDREF were used.

A <STAR> tag also has an attribute starId. Since it is declared to be of type ID, the value of starId may be referenced by <MOVIE> tags to indicate the stars of the movie. That is, when we look at the attribute list for MOVIE in Fig. 4.24 we see that it has an attribute movieId of type ID; these are the ID's that will appear on lists that are the values of starredIn attributes. Symmetrically, the attribute starsOf of MOVIE is a list of ID's for stars.

Figure 4.25 is an example of a document that conforms to the DTD of Fig. 4.24. It is quite similar to the semistructured data of Fig. 4.19. It includes more data - three movies instead of only one. However, the only structural difference is that here, all stars have an ADDRESS subelement, even if they have only one address, while in Fig. 4.19 we went directly from the Mark-Hamill node to street and city nodes.
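The requirement that every STAR contain a NAME followed by one or more ADDRESS subelements, i.e., the content model (NAME, ADDRESS+) of Fig. 4.24, can be checked mechanically. The following Python sketch is one possible approach, not the way any particular XML processor works: it translates a flat content model into a regular expression over the sequence of child tags. The function names are hypothetical, and only comma-separated element names with an optional *, + or ? are handled; the | operator and nested groups are left out. Python's standard `xml.etree.ElementTree` is used only to parse the element; it does not itself validate against a DTD.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate a flat content model such as '(NAME, ADDRESS+)' into a regex.

    Only comma-separated element names with an optional *, + or ? suffix
    are handled; '|' and nested groups are outside this sketch.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "*+?" else ""
        name = item.rstrip("*+?")
        # Each child tag is encoded as 'NAME;' so names cannot run together.
        parts.append("(?:%s;)%s" % (re.escape(name), suffix))
    return re.compile("".join(parts))


def conforms(element, model):
    """True if the element's direct children match the content model."""
    children = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(children) is not None


good = ET.fromstring(
    "<STAR><NAME>Mark Hamill</NAME>"
    "<ADDRESS><STREET>456 Oak Rd.</STREET><CITY>Brentwood</CITY></ADDRESS>"
    "</STAR>"
)
bad = ET.fromstring("<STAR><NAME>Mark Hamill</NAME></STAR>")  # no ADDRESS
```

With these definitions, `conforms(good, "(NAME, ADDRESS+)")` holds, while the same check on `bad` fails because the + demands at least one ADDRESS.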

<MOVIE movieId = "esb" starsOf = "cf, mh">

<TITLE>Empire Strikes Back</TITLE>

<YEAR>1980</YEAR>

</MOVIE>

<MOVIE movieId = "rj" starsOf = "cf, mh">

<TITLE>Return of the Jedi</TITLE>

<YEAR>1983</YEAR>

</MOVIE>

</STARS-MOVIES>

Figure 4.25: Example of a document following the DTD of Fig 4.24
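To see how ID and IDREFS attributes act as arcs between nodes, here is a minimal Python sketch that follows the starsOf references of a movie back to the stars. The document below is a hypothetical one in the shape of Fig. 4.24: the two MOVIE elements are taken from Fig. 4.25, but the STAR entries and their starredIn lists are assumptions, and ADDRESS subelements are omitted for brevity. The helper names are also made up. Note that the text separates ID's with commas, whereas the XML 1.0 standard separates IDREFS values with white space; the sketch accepts both.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document in the shape of Fig. 4.24. The STAR entries and
# their starredIn lists are assumptions made so the example is complete.
DOC = """
<STARS-MOVIES>
  <STAR starId="cf" starredIn="esb, rj"><NAME>Carrie Fisher</NAME></STAR>
  <STAR starId="mh" starredIn="esb, rj"><NAME>Mark Hamill</NAME></STAR>
  <MOVIE movieId="esb" starsOf="cf, mh">
    <TITLE>Empire Strikes Back</TITLE><YEAR>1980</YEAR>
  </MOVIE>
  <MOVIE movieId="rj" starsOf="cf, mh">
    <TITLE>Return of the Jedi</TITLE><YEAR>1983</YEAR>
  </MOVIE>
</STARS-MOVIES>
"""


def split_refs(value):
    """Split an IDREFS value; accept the text's commas as well as spaces."""
    return [ref for ref in re.split(r"[,\s]+", value.strip()) if ref]


def build_index(root):
    """Map every ID value (starId or movieId) to its element."""
    index = {}
    for elem in root:
        ident = elem.get("starId") or elem.get("movieId")
        if ident is not None:
            index[ident] = elem
    return index


def stars_of(root, movie_id):
    """Follow the starsOf arcs of one movie and return the star names."""
    index = build_index(root)
    movie = index[movie_id]
    refs = split_refs(movie.get("starsOf", ""))
    return [index[ref].findtext("NAME") for ref in refs]


root = ET.fromstring(DOC)
```

The index plays the role of the unique node names that ID's provide, and each lookup `index[ref]` corresponds to following one arc of the semistructured-data graph.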

4.7.6 Exercises for Section 4.7

Exercise 4.7.1: Add to the document of Fig. 4.25 the following facts:

* a) Harrison Ford also starred in the three movies mentioned and the movie Witness (1985).

b) Carrie Fisher also starred in Hannah and Her Sisters (1985).

c) Liam Neeson starred in The Phantom Menace (1999).

Exercise 4.7.2: Suggest how typical data about banks and customers, as was described in Exercise 2.1.1, could be represented as a DTD.

Exercise 4.7.3: Suggest how typical data about players, teams, and fans, as was described in Exercise 2.1.3, could be represented as a DTD.

Exercise 4.7.4: Suggest how typical data about a genealogy, as was described in Exercise 2.1.6, could be represented as a DTD.

4.8 Summary of Chapter 4

+ Object Definition Language: This language is a notation for formally describing the schemas of databases in an object-oriented style. One defines classes, which may have three kinds of properties: attributes, methods, and relationships.

+ ODL Relationships: A relationship in ODL must be binary. It is represented, in the two classes it connects, by names that are declared to be inverses of one another. Relationships can be many-many, many-one, or one-one, depending on whether the types of the pair are declared to be a single object or a set of objects.

+ The ODL Type System: ODL allows types to be constructed, beginning with class names and atomic types such as integer, by applying type constructors such as structure formation, set-of, bag-of, and list-of.

+ Extents: A class of objects can have an extent, which is the set of objects of that class currently existing in the database. Thus, the extent corresponds to a relation instance in the relational model, while the class declaration is like the schema of a relation.

+ Keys in ODL: Keys are optional in ODL. One is allowed to declare one or more keys, but because objects have an object-ID that is not one of their properties, a system implementing ODL can tell the difference between objects, even if they have identical values for all properties.

+ Converting ODL Designs to Relations: If we treat ODL as only a design language, whose designs are then converted to relations, the simplest approach is to create a relation for the attributes of a class and a relation for each pair of inverse relationships. However, we can combine a many-one relationship with the relation intended for the attributes of the "many" class. It is also necessary to create new attributes to represent the key of a class that has no key.

+ The Object-Relational Model: An alternative to pure object-oriented database models like ODL is to extend the relational model to include the major features of object orientation. These extensions include nested relations, i.e., complex types for attributes of a relation, including relations as types. Other extensions include methods defined for these types, and the ability of one tuple to refer to another through a reference type.

+ Semistructured Data: In this model, data is represented by a graph. Nodes are like objects or values of their attributes, and labeled arcs connect an object to both the values of its attributes and to other objects to which it is connected by a relationship.

+ XML: The Extensible Markup Language is a World-Wide-Web Consortium standard that implements semistructured data in documents (text files). Nodes correspond to sections of the text, and (some) labeled arcs are represented in XML by pairs of beginning and ending tags. XML allows attributes of type ID and IDREF within the beginning tags. A tag (corresponding to a node of semistructured data) can thus be given an identifier, and that identifier can be referred to by other tags, from which we would like to establish a link (arc in semistructured data).

4.9 References for Chapter 4

The manual defining ODL is [6]. It is the ongoing work of ODMG, the Object Data Management Group. More about object-oriented database systems can be found in [4], [5], and [8].

Semistructured data as a model developed from the TSIMMIS and LORE projects at Stanford. The original description of the model is in [9]. LORE and its query language are described in [3]. Recent surveys of work on semistructured data exist as well, and further material is being compiled on the Web, at [7].

XML is a standard developed by the World-Wide-Web Consortium. The home page for information about XML is [11].

1. Abiteboul, S., "Querying semi-structured data," Proc. Intl. Conf. on Database Theory (1997), Lecture Notes in Computer Science 1187 (F. Afrati and P. Kolaitis, eds.), Springer-Verlag, Berlin, pp. 1-18.

2. Abiteboul, S., D. Suciu, and P. Buneman, Data on the Web: From Relations to Semistructured Data and XML, Morgan-Kaufmann, San Francisco.

3. Abiteboul, S., D. Quass, J. McHugh, J. Widom, and J. L. Wiener, "The LOREL query language for semistructured data," J. Digital Libraries.

4. Bancilhon, F., C. Delobel, and P. Kanellakis, Building an Object-Oriented Database System, Morgan-Kaufmann, San Francisco, 1992.

5. Cattell, R. G. G., Object Data Management, Addison-Wesley, Reading, MA.

8. Kim, W. (ed.), Modern Database Systems: The Object Model, Interoperability, and Beyond, ACM Press, New York, 1994.

9. Papakonstantinou, Y., H. Garcia-Molina, and J. Widom, "Object exchange across heterogeneous information sources," IEEE Intl. Conf. on Data Engineering, pp. 251-260, March 1995.

10. Suciu, D. (ed.), Special issue on management of semistructured data, SIGMOD Record.

11. World-Wide-Web Consortium, http://www.w3.org/XML/

CHAPTER 5: RELATIONAL ALGEBRA

This chapter begins a study of database programming, that is, how the user can ask queries of the database and can modify the contents of the database. Our focus is on the relational model, and in particular on a notation for describing queries: relational algebra.

While ODL uses methods that, in principle, can perform any operation on data, and the E/R model does not embrace a specific way of manipulating data, the relational model has a concrete set of "standard" operations on data. Surprisingly, these operations are not "Turing complete" the way ordinary programming languages are. Thus, there are operations we cannot express in relational algebra that could be expressed, for instance, in ODL methods written in C++. This situation is not a defect of the relational model or relational algebra, because the advantage of limiting the scope of operations is that it becomes possible to optimize queries written in a very high level language such as SQL, which we introduce in Chapter 6.

We begin by introducing the operations of relational algebra. This algebra formally applies to sets of tuples, i.e., relations. However, commercial DBMS's use a slightly different model of relations, which are bags, not sets. That is, relations in practice may contain duplicate tuples. While it is often useful to think of relational algebra as a set algebra, we also need to be conscious of the effects of duplicates on the results of the operations in relational algebra. In the final section of this chapter, we consider the matter of how constraints on relations can be expressed.

Later chapters let us see the languages and features that today's commercial DBMS's offer the user. The operations of relational algebra are all implemented by the SQL query language, which we study beginning in Chapter 6. These algebraic operations also appear in the OQL language, an object-oriented query language based on the ODL data model and introduced in Chapter 9.

As we begin our focus on database programming in the relational model, it is useful to have a specific schema on which to base our examples of queries. Our chosen database schema draws upon the running example of movies, stars, and studios, and it uses normalized relations similar to the ones that we developed in Section 3.6. However, it includes some attributes that we have not used previously in examples, and it includes one relation - MovieExec - that has not appeared before. The purpose of these changes is to give us some opportunities to study different data types and different ways of representing information. Figure 5.1 shows the schema.

Movie(
    TITLE: string,
    YEAR: integer,
    length: integer,
    incolor: boolean,
    studioName: string,
    producerC#: integer)

StarsIn(
    MOVIETITLE: string,
    MOVIEYEAR: integer,
    STARNAME: string)

MovieStar(
    NAME: string,
    address: string,
    gender: char,
    birthdate: date)

MovieExec(
    name: string,
    address: string,
    CERT#: integer,
    networth: integer)

Studio(
    NAME: string,
    address: string,
    presC#: integer)

Figure 5.1: Example database schema about movies

Our schema has five relations. The attributes of each relation are listed, along with the intended domain for that attribute. The key attributes for a relation are shown in capitals in Fig. 5.1, although when we refer to them in text, they will be lower-case as they have been heretofore. For instance, all three attributes together form the key for relation StarsIn. Relation Movie has six attributes; title and year together constitute the key for Movie, as they have previously. Attribute title is a string, and year is an integer.

The major modifications to the schema compared with what we have seen so far are:

• There is a notion of a certificate number for movie executives - studio presidents and movie producers. This certificate is a unique integer that we imagine is maintained by some external authority, perhaps a registry of executives or a "union."

• We use certificate numbers as the key for movie executives, although movie stars do not always have certificates and we shall continue to use name as the key for stars. That decision is probably unrealistic, since two stars could have the same name, but we take this road in order to illustrate some different options.

• We introduced the producer as another property of movies. This information is represented by a new attribute, producerC#, of relation Movie. This attribute is intended to be the certificate number of the producer. Producers are expected to be movie executives, as are studio presidents. There may also be other executives in the MovieExec relation.

• Attribute filmType of Movie has been changed from an enumerated type to a boolean-valued attribute called incolor: true if the movie is in color and false if it is in black and white.

• The attribute gender has been added for movie stars. Its type is "character," either M for male or F for female. Attribute birthdate, of type "date" (a special type supported by many commercial database systems, or just a character string if we prefer) has also been added.

• All addresses have been made strings, rather than pairs consisting of a street and city. The purpose is to make addresses in different relations easily comparable and to simplify operations on addresses.

5.2 An Algebra of Relational Operations

To begin our study of operations on relations, we shall learn about a special algebra, called relational algebra, that consists of some simple but powerful ways to construct new relations from given relations. When the given relations are stored data, then the constructed relations can be answers to queries about this data.


Why Bags Can Be More Efficient Than Sets

As a simple example of why bags can lead to implementation efficiency, if you take the union of two relations but do not eliminate duplicates, then you can just copy the relations to the output. If you insist that the result be a set, you have to sort the relations, or do something similar to detect identical tuples that come from the two relations.
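The difference in work can be seen in a small Python sketch. The function names and sample relations are hypothetical, and lists of tuples stand in for relations; sorting is used for the set version because the text mentions it, though hashing would be another way to detect duplicates.

```python
def bag_union(r, s):
    """Bag union: copy both inputs to the output, duplicates and all."""
    return list(r) + list(s)


def set_union(r, s):
    """Set union: identical tuples must be detected; here by sorting."""
    out = []
    for t in sorted(list(r) + list(s)):
        if not out or out[-1] != t:   # skip a tuple equal to the previous one
            out.append(t)
    return out


R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
```

`bag_union` touches each tuple once, while `set_union` pays for a sort before it can drop the second copy of (3, 4).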

The development of an algebra for relations has a history, which we shall follow roughly in our presentation Initially, relational algebra was proposed

by T Codd as an algebra on sets of tuples (i.e., relations) that could be used

to express typical queries about those relations It consisted of five operations

on sets: union, set difference, and Cartesian product, with which you might

already be familiar, and two unusual operations - selection and projection

To these, several operations that can be defined in terms of these were added:

varieties of "join" are the most important

When DBMS's that used the relational model were first developed, their query languages largely implemented the relational algebra. However, for efficiency purposes, these systems regarded relations as bags, not sets. That is, unless the user asked explicitly that duplicate tuples be condensed into one (i.e., that "duplicates be eliminated"), relations were allowed to contain duplicates. Thus, in Section 5.3, we shall study the same relational operations on bags and see the changes necessary.

Another change to the algebra that was necessitated by commercial implementations of the relational model is that several other operations are needed. Most important is a way of performing aggregation, e.g., finding the average value of some column of a relation. We shall study these additional operations in Section 5.4.

5.2.1 Basics of Relational Algebra

An algebra, in general, consists of operators and atomic operands. For instance, in the algebra of arithmetic, the atomic operands are variables like x and constants like 15. The operators are the usual arithmetic ones: addition, subtraction, multiplication, and division. Any algebra allows us to build expressions by applying operators to atomic operands and/or other expressions of the algebra. Usually, parentheses are needed to group operators and their operands. For instance, in arithmetic we have expressions such as (x + y) * z or ((x + 7)/(y - 3)) + x.

Relational algebra is another example of an algebra. Its atomic operands are:

1. Variables that stand for relations.

2. Constants, which are finite relations.

As we mentioned, in the classical relational algebra, all operands and the results of expressions are sets. The operations of the traditional relational algebra fall into four broad classes:

a) The usual set operations (union, intersection, and difference) applied to relations.

b) Operations that remove parts of a relation: selection eliminates some rows (tuples), and projection eliminates some columns.

c) Operations that combine the tuples of two relations, including Cartesian product and the various kinds of join.

d) An operation called "renaming" that does not affect the tuples of a relation, but changes the relation schema, i.e., the names of the attributes and/or the name of the relation itself.

We shall generally refer to expressions of relational algebra as queries. While we don't yet have the symbols needed to show many of the expressions of relational algebra, you should be familiar with the operations of group (a), and thus recognize $R \cup S$ as an example of an expression of relational algebra. R and S are atomic operands standing for relations, whose sets of tuples are unknown. This query asks for the union of whatever tuples are in the relations named R and S.

5.2.2 Set Operations on Relations

The three most common operations on sets are union, intersection, and difference. We assume the reader is familiar with these operations, which are defined as follows on arbitrary sets R and S:

$R \cup S$, the union of R and S, is the set of elements that are in R or S or both. An element appears only once in the union even if it is present in both R and S.

$R \cap S$, the intersection of R and S, is the set of elements that are in both R and S.

$R - S$, the difference of R and S, is the set of elements that are in R but not in S. Note that $R - S$ is different from $S - R$; the latter is the set of elements that are in S but not in R.

When we apply these operations to relations, we need to put some conditions on R and S:


1. R and S must have schemas with identical sets of attributes, and the types (domains) for each attribute must be the same in R and S.

2. Before we compute the set-theoretic union, intersection, or difference of sets of tuples, the columns of R and S must be ordered so that the order of attributes is the same for both relations.

Sometimes we would like to take the union, intersection, or difference of relations that have the same number of attributes, with corresponding domains, but that use different names for their attributes. If so, we may use the renaming operator to be discussed in Section 5.2.9 to change the schema of one or both relations and give them the same set of attributes.
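The two conditions above can be sketched in code. This is a minimal illustration, assuming a hypothetical representation in which a relation is a pair (schema, rows); the helper name `align` and the sample tuples (in particular the Harrison Ford row) are assumptions made for illustration, and the check on attribute types (domains) is omitted.

```python
# Assumed representation (not from the text): a relation is a pair
# (schema, rows); schema is a tuple of attribute names, rows a set of tuples.

def align(relation, target_schema):
    """Reorder the columns of `relation` so they follow `target_schema`."""
    schema, rows = relation
    if set(schema) != set(target_schema):
        raise ValueError("schemas must have identical sets of attributes")
    positions = [schema.index(attr) for attr in target_schema]
    return {tuple(row[i] for i in positions) for row in rows}

def union(r, s):
    return r[0], r[1] | align(s, r[0])

def intersection(r, s):
    return r[0], r[1] & align(s, r[0])

def difference(r, s):
    return r[0], r[1] - align(s, r[0])

# Illustrative data; S lists its columns in a different order than R.
R = (("name", "gender"), {("Carrie Fisher", "F"), ("Mark Hamill", "M")})
S = (("gender", "name"), {("F", "Carrie Fisher"), ("M", "Harrison Ford")})

print(sorted(union(R, S)[1]))
print(sorted(intersection(R, S)[1]))
print(sorted(difference(R, S)[1]))
```

Because rows are kept in a Python set, a tuple present in both operands appears only once in the union, which matches the set semantics described above.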

5.2 AN ALGEBRA OF RELATIONAL OPERATIONS

Let the relations R and S have the schema of relation MovieStar of Section 5.1. Current instances of R and S are shown in Fig. 5.2.

Figure 5.2: Two relations (R and S, each with columns name, address, gender, birthdate)

In the intersection $R \cap S$, only the Carrie Fisher tuple appears, because only it is in both relations. In the difference $R - S$, the Fisher and Hamill tuples appear in R and thus are candidates for $R - S$. However, the Fisher tuple also appears in S and so is not in $R - S$.

5.2.3 Projection

The projection operator is used to produce from a relation R a new relation that has only some of R's columns. The value of expression $\pi_{A_1, A_2, \ldots, A_n}(R)$ is a relation that has only the columns for attributes $A_1, A_2, \ldots, A_n$ of R. The schema for the resulting value is the set of attributes $\{A_1, A_2, \ldots, A_n\}$, which we conventionally show in the order listed.

title | year | length | inColor | studioName | producerC#

Figure 5.3: The relation Movie

Example 5.2: Consider the relation Movie with the relation schema described in Section 5.1. An instance of this relation is shown in Fig. 5.3. We can project this relation onto the first three attributes with the expression

$$\pi_{title,\,year,\,length}(\text{Movie})$$

As another example, we can project onto the attribute inColor with the expression $\pi_{inColor}(\text{Movie})$. Notice that there is only one tuple in the resulting relation, since all three tuples of Fig. 5.3 have the same value in their component for attribute inColor, and a set contains each tuple only once.
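The two projections of Example 5.2 can be sketched as follows. This is a minimal illustration under assumptions: a relation is a hypothetical (schema, rows) pair, the function name `project` is invented here, and the third row's title, year, studio, and producer number are placeholders (the text only tells us that its length is 95 and that it is in color).

```python
def project(relation, attributes):
    """Keep only the listed columns; the result is a set, so duplicates vanish."""
    schema, rows = relation
    positions = [schema.index(attr) for attr in attributes]
    return tuple(attributes), {tuple(row[i] for i in positions) for row in rows}

MOVIE = (
    ("title", "year", "length", "inColor", "studioName", "producerC#"),
    {
        ("Star Wars", 1977, 124, True, "Fox", 12345),
        ("Mighty Ducks", 1991, 104, True, "Disney", 67890),
        ("Placeholder Title", 1992, 95, True, "Placeholder Studio", 0),
    },
)

first_three = project(MOVIE, ["title", "year", "length"])
in_color = project(MOVIE, ["inColor"])

print(sorted(first_three[1]))
print(in_color[1])
```

Projecting onto inColor collapses three input tuples into one because the result is built as a set.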


5.2.4 Selection

The selection operator, applied to a relation R, produces a new relation with a subset of R's tuples. The tuples in the resulting relation are those that satisfy some condition C that involves the attributes of R. We denote this operation $\sigma_C(R)$. The schema for the resulting relation is the same as R's schema, and we conventionally show the attributes in the same order as we use for R.

C is a conditional expression of the type with which we are familiar from conventional programming languages; for example, conditional expressions follow the keyword if in programming languages such as C or Java. The only difference is that the operands in condition C are either constants or attributes of R. We apply C to each tuple t of R by substituting, for each attribute A appearing in condition C, the component of t for attribute A. If, after substituting for each attribute of C, the condition C is true, then t is one of the tuples that appear in the result of $\sigma_C(R)$; otherwise t is not in the result.

Example 5.3: Let the relation Movie be as in Fig. 5.3. Then the value of expression $\sigma_{length \geq 100}(\text{Movie})$ is

title | year | length | inColor | studioName | producerC#
Star Wars | 1977 | 124 | true | Fox | 12345
Mighty Ducks | 1991 | 104 | true | Disney | 67890

The first tuple satisfies the condition $length \geq 100$ because when we substitute for length the value 124 found in the component of the first tuple for attribute length, the condition becomes $124 \geq 100$. The latter condition is true, so we accept the first tuple. The same argument explains why the second tuple of Fig. 5.3 is in the result.

The third tuple has a length component 95. Thus, when we substitute for length we get the condition $95 \geq 100$, which is false. Hence the last tuple of Fig. 5.3 is not in the result. □

The tuple

title | year | length | inColor | studioName | producerC#
Star Wars | 1977 | 124 | true | Fox | 12345

is the only one in the resulting relation.

5.2.5 Cartesian Product

The Cartesian product (or cross-product, or just product) of two sets R and S is the set of pairs that can be formed by choosing the first element of the pair to be any element of R and the second any element of S. This product is denoted $R \times S$. When R and S are relations, the product is essentially the same. However, since the members of R and S are tuples, usually consisting of more than one component, the result of pairing a tuple from R with a tuple from S is a longer tuple, with one component for each of the components of the constituent tuples. By convention, the components from R precede the components from S in the attribute order for the result.

The relation schema for the resulting relation is the union of the schemas for R and S. However, if R and S should happen to have some attributes in common, then we need to invent new names for at least one of each pair of identical attributes. To disambiguate an attribute A that is in the schemas of both R and S, we use R.A for the attribute from R and S.A for the attribute from S.

Figure 5.4: Two relations and their Cartesian product
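Selection and Cartesian product as defined above can be sketched in a few lines. This is an illustration under assumptions, not the book's notation: relations are hypothetical (schema, rows) pairs, the function names are invented here, the Movie relation is cut down to three attributes, its third row's title and year are placeholders, and the values in R and S are made up (the text only fixes their schemas and sizes).

```python
def select(relation, condition):
    """Keep the tuples for which `condition` holds; the schema is unchanged."""
    schema, rows = relation
    return schema, {row for row in rows if condition(dict(zip(schema, row)))}

def product(r, s, r_name="R", s_name="S"):
    """Pair every tuple of r with every tuple of s, prefixing shared names."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = set(r_schema) & set(s_schema)
    schema = tuple(f"{r_name}.{a}" if a in common else a for a in r_schema)
    schema += tuple(f"{s_name}.{a}" if a in common else a for a in s_schema)
    return schema, {x + y for x in r_rows for y in s_rows}

MOVIE = (
    ("title", "year", "length"),
    {("Star Wars", 1977, 124), ("Mighty Ducks", 1991, 104),
     ("Placeholder Title", 1992, 95)},
)
long_movies = select(MOVIE, lambda t: t["length"] >= 100)

# Illustrative values: R has two tuples, S has three, and B is shared.
R = (("A", "B"), {(1, 2), (3, 4)})
S = (("B", "C", "D"), {(2, 5, 6), (4, 7, 8), (9, 10, 11)})
rs_schema, rs_rows = product(R, S)

print(sorted(long_movies[1]))
print(rs_schema, len(rs_rows))
```

Passing the condition as a function of a name-to-value dictionary mirrors the substitution step in the text: each attribute in C is replaced by the tuple's component for that attribute.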


Example 5.5: For conciseness, let us use an abstract example that illustrates the product operation. Let relations R and S have the schemas and tuples shown in Fig. 5.4. Then the product $R \times S$ consists of the six tuples shown in that figure. Note how we have paired each of the two tuples of R with each of the three tuples of S. Since B is an attribute of both schemas, we have used R.B and S.B in the schema for $R \times S$. The other attributes are unambiguous, and their names appear in the resulting schema unchanged.

5.2.6 Natural Joins

In the natural join of two relations R and S, denoted $R \bowtie S$, we pair only those tuples from R and S that agree in whatever attributes are common to the schemas of R and S. More precisely, let $A_1, A_2, \ldots, A_n$ be all the attributes that are in both the schema of R and the schema of S. Then a tuple r from R and a tuple s from S are successfully paired if and only if r and s agree on each of the attributes $A_1, A_2, \ldots, A_n$.

If the tuples r and s are successfully paired in the join $R \bowtie S$, then the result of the pairing is a tuple, called the joined tuple, with one component for each of the attributes in the union of the schemas of R and S. The joined tuple agrees with tuple r in each attribute in the schema of R, and it agrees with s in each attribute in the schema of S. Since r and s are successfully paired, the joined tuple is able to agree with both these tuples on the attributes they have in common. The construction of the joined tuple is suggested by Fig. 5.5.

Figure 5.5: Joining tuples

Note also that this join operation is the same one that we used in Section 3.6.5 to recombine relations that had been projected onto two subsets of their attributes. There the motivation was to explain why BCNF decomposition made sense. In Section 5.2.8 we shall see another use for the natural join: combining two relations so that we can write a query that relates attributes of each.

Example 5.6: Consider the natural join of the relations R and S from Fig. 5.4. The only attribute common to R and S is B. Thus, to pair successfully, tuples need only to agree in their B components. If so, the resulting tuple has components for attributes A (from R), B (from either R or S), C (from S), and D (from S). A tuple that fails to pair with any tuple of the other relation in a join is said to be a dangling tuple; it has no effect on the result of $R \bowtie S$. □

Example 5.7: The previous example does not illustrate all the possibilities inherent in the natural join operator. For example, no tuple paired successfully with more than one tuple, and there was only one attribute in common to the two relation schemas. In Fig. 5.6 we see two other relations, U and V, that share two attributes between their schemas: B and C. We also show an instance in which one tuple joins with several tuples.

For tuples to pair successfully, they must agree in both the B and C components. Thus, the first tuple of U joins with the first two tuples of V, while the second and third tuples of U join with the third tuple of V. The result of these pairings is shown in Fig. 5.6.

5.2.7 Theta-Joins

The natural join forces us to pair tuples using one specific condition. While this way, equating shared attributes, is the most common basis on which relations are joined, it is sometimes desirable to pair tuples from two relations on some other basis. For that purpose, we have a related notation called the theta-join. Historically, the "theta" refers to an arbitrary condition, which we shall represent by C rather than $\theta$.

The notation for a theta-join of relations R and S based on condition C is $R \bowtie_C S$. The result of this operation is constructed as follows:

1. Take the product of R and S.

2. Select from the product only those tuples that satisfy the condition C.

As with the product operation, the schema for the result is the union of the schemas of R and S, with "R." or "S." prefixed to attributes if necessary to indicate from which schema the attribute came.
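The natural join and the theta-join described above can be sketched as follows. This is a hedged illustration: relations are hypothetical (schema, rows) pairs, the function names are invented here, and the tuples of U and V are assumed values chosen to be consistent with the pairings the text describes (A components 1, 6, and 9 in U; two shared attributes B and C). For brevity the theta-join sketch returns only the rows and leaves out the prefixed schema.

```python
def natural_join(r, s):
    """Pair tuples that agree on all shared attributes; keep one copy of each."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    extra = [a for a in s_schema if a not in r_schema]
    joined = set()
    for x in r_rows:
        xd = dict(zip(r_schema, x))
        for y in s_rows:
            yd = dict(zip(s_schema, y))
            if all(xd[a] == yd[a] for a in common):
                joined.add(x + tuple(yd[a] for a in extra))
    return tuple(r_schema) + tuple(extra), joined

def theta_join(r, s, condition):
    """Product followed by selection; condition(x, y) sees both tuples as dicts."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    return {
        x + y
        for x in r_rows
        for y in s_rows
        if condition(dict(zip(r_schema, x)), dict(zip(s_schema, y)))
    }

# Assumed sample values for U(A, B, C) and V(B, C, D).
U = (("A", "B", "C"), {(1, 2, 3), (6, 7, 8), (9, 7, 8)})
V = (("B", "C", "D"), {(2, 3, 4), (2, 3, 5), (7, 8, 10)})

nj_schema, nj_rows = natural_join(U, V)
a_less_than_d = theta_join(U, V, lambda u, v: u["A"] < v["D"])
stricter = theta_join(U, V, lambda u, v: u["A"] < v["D"] and u["B"] != v["B"])

print(nj_schema, sorted(nj_rows))
print(len(a_less_than_d), sorted(stricter))
```

The contrast between the two results shows the schema difference noted in the text: the natural join keeps one copy of B and C, while each theta-join tuple carries both copies.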


Figure 5.6: Natural join of relations (panels include Relation U)

Example 5.8: Consider the operation $U \bowtie_{A<D} V$, where U and V are the relations from Fig. 5.6. We must consider all nine pairs of tuples, one from each relation, and see whether the A component from the U-tuple is less than the D component of the V-tuple. The first tuple of U, with an A component of 1, successfully pairs with each of the tuples from V. However, the second and third tuples from U, with A components of 6 and 9, respectively, pair successfully with only the last tuple of V. Thus, the result has only five tuples, constructed from the five successful pairings. This relation is shown in Fig. 5.7.

Figure 5.7: Result of $U \bowtie_{A<D} V$

Notice that the schema for the result in Fig. 5.7 consists of all six attributes, with U and V prefixed to their respective occurrences of attributes B and C to distinguish them. Thus, the theta-join contrasts with natural join, since in the latter common attributes are merged into one copy. Of course it makes sense to do so in the case of the natural join, since tuples don't pair unless they agree in their common attributes. In the case of a theta-join, there is no guarantee that compared attributes will agree in the result, since they may not be compared with =.

Example 5.9: Here is a theta-join on the same relations U and V that has a more complex condition:

$$U \bowtie_{A<D \text{ AND } U.B \neq V.B} V$$

5.2.8 Combining Operations to Form Queries

If all we could do was to write single operations on one or two relations as queries, then relational algebra would not be as useful as it is. However, relational algebra, like all algebras, allows us to form expressions of arbitrary complexity by applying operators either to given relations or to relations that are the result of applying one or more relational operators to relations.

One can construct expressions of relational algebra by applying operators to subexpressions, using parentheses when necessary to indicate grouping of operands. It is also possible to represent expressions as expression trees; the latter often are easier for us to read, although they are less convenient as a machine-readable notation.

Example 5.10: Let us reconsider the decomposed Movies relation of Example 3.24. Suppose we want to know "What are the titles and years of movies made by Fox that are at least 100 minutes long?" One way to compute the answer to this query is:

1. Select those Movies tuples that have $length \geq 100$.


2. Select those Movies tuples that have studioName = 'Fox'.

3. Compute the intersection of (1) and (2).

4. Project the relation from (3) onto attributes title and year.
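The four steps above can be traced in code. This is a minimal sketch under assumptions: the Movies relation is reduced to four attributes, its rows are illustrative (the third row's title, year, and studio are placeholders), and the helper `select` is invented here. As a cross-check, the last lines compute the same answer with one selection whose condition combines both tests with AND.

```python
SCHEMA = ("title", "year", "length", "studioName")
MOVIES = {
    ("Star Wars", 1977, 124, "Fox"),
    ("Mighty Ducks", 1991, 104, "Disney"),
    ("Placeholder Title", 1992, 95, "Placeholder Studio"),
}

def select(rows, condition):
    """Keep the tuples whose attribute values satisfy `condition`."""
    return {row for row in rows if condition(dict(zip(SCHEMA, row)))}

# Step 1: movies at least 100 minutes long.
long_movies = select(MOVIES, lambda t: t["length"] >= 100)
# Step 2: movies made by Fox.
fox_movies = select(MOVIES, lambda t: t["studioName"] == "Fox")
# Step 3: intersection of steps 1 and 2.
both = long_movies & fox_movies
# Step 4: projection onto title and year.
answer = {(row[0], row[1]) for row in both}

# Equivalent form: a single selection with both conditions joined by AND.
combined = {
    (row[0], row[1])
    for row in select(
        MOVIES, lambda t: t["length"] >= 100 and t["studioName"] == "Fox"
    )
}

print(answer, combined)
```

Each intermediate variable corresponds to one interior node of an expression tree for the query, with `answer` at the root.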

Figure 5.8: Expression tree for a relational algebra expression

In Fig. 5.8 we see the above steps represented as an expression tree. The two selection nodes correspond to steps (1) and (2). The intersection node corresponds to step (3), and the projection node is step (4).

Alternatively, we could represent the same expression in a conventional linear notation, with parentheses. The formula

$$\pi_{title,\,year}\bigl(\sigma_{length \geq 100}(\text{Movies}) \cap \sigma_{studioName = \text{'Fox'}}(\text{Movies})\bigr)$$

represents the same expression.

Incidentally, there is often more than one relational algebra expression that represents the same computation. For instance, the above query could also be written by replacing the intersection by logical AND within a single selection operation. That is,

$$\pi_{title,\,year}\bigl(\sigma_{length \geq 100 \text{ AND } studioName = \text{'Fox'}}(\text{Movies})\bigr)$$

is an equivalent form of the query.

Equivalent Expressions and Query Optimization

All database systems have a query-answering system, and many of them are based on a language that is similar in expressive power to relational algebra. Thus, the query asked by a user may have many equivalent expressions (expressions that produce the same answer, whenever they are given the same relations as operands), and some of these may be much more quickly evaluated. An important job of the query "optimizer" discussed briefly in Section 1.2.5 is to replace one expression of relational algebra by an equivalent expression that is more efficiently evaluated. Optimization of relational-algebra expressions is covered extensively in Section 16.2.

Example 5.11: One use of the natural join operation is to recombine relations that were decomposed to put them into BCNF. Recall the decomposed relations from Example 3.24:¹

Movies1 with schema {title, year, length, filmType, studioName}
Movies2 with schema {title, year, starName}

Let us write an expression to answer the query "Find the stars of movies that are at least 100 minutes long." This query relates the starName attribute of Movies2 with the length attribute of Movies1. We can connect these attributes by joining the two relations. The natural join successfully pairs only those tuples that agree on title and year; that is, pairs of tuples that refer to the same movie. Thus, $\text{Movies1} \bowtie \text{Movies2}$ is an expression of relational algebra that produces the relation we called Movies in Example 3.24. That relation is the non-BCNF relation whose schema is all six attributes and that contains several tuples for the same movie when that movie has several stars.

To the join of Movies1 and Movies2 we must apply a selection that enforces the condition that the length of the movie is at least 100 minutes. We then project onto the desired attribute: starName. The expression

$$\pi_{starName}\bigl(\sigma_{length \geq 100}(\text{Movies1} \bowtie \text{Movies2})\bigr)$$

implements the desired query in relational algebra.

¹ Remember that the relation Movies of that example has a somewhat different relation schema from the relation Movie of Section 5.1.

5.2.9 Renaming

In order to control the names of the attributes used for relations that are constructed by applying relational-algebra operations, it is often convenient to use an operator that explicitly renames relations. We shall use the operator $\rho_{S(A_1, A_2, \ldots, A_n)}(R)$ to rename a relation R. The resulting relation has exactly the same tuples as R, but the name of the relation is S. Moreover, the attributes of the result relation S are named $A_1, A_2, \ldots, A_n$, in order from the left.
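The renaming operator can be sketched as a function that leaves the tuples alone and replaces only the relation name and attribute names. This is an illustration under assumptions: a relation is a hypothetical (name, schema, rows) triple, the function name `rename` is invented here, and the tuples of S are made-up values.

```python
def rename(relation, new_name, new_attributes):
    """Return the same tuples under a new relation name and attribute names."""
    _old_name, schema, rows = relation
    if len(new_attributes) != len(schema):
        raise ValueError("need exactly one new name per attribute")
    return new_name, tuple(new_attributes), rows

# Illustrative relation S(B, C, D); the values are assumptions.
S = ("S", ("B", "C", "D"), {(2, 5, 6), (4, 7, 8), (9, 10, 11)})

# Rename the first attribute to X while keeping the relation name S.
renamed = rename(S, "S", ["X", "C", "D"])

print(renamed[0], renamed[1])
print(renamed[2] == S[2])
```

Attribute names are matched by position, which corresponds to naming the attributes "in order from the left."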


Example 5.12: In Example 5.5 we took the product of two relations R and S from Fig. 5.4 and used the convention that when an attribute appears in both operands, it is renamed by prefixing the relation name to it.

Suppose, however, that we do not wish to call the two versions of B by names R.B and S.B; rather we want to continue to use the name B for the attribute that comes from R, and we want to use X as the name of the attribute B coming from S. We can rename the attributes of S so the first is called X. The result of the expression $\rho_{S(X,C,D)}(S)$ is a relation named S that looks just like the relation S from Fig. 5.4, but its first column has attribute X instead of B.

Figure 5.9: Renaming before taking a product

When we take the product of R with this new relation, there is no conflict of names among the attributes, so no further renaming is done. That is, the result of the expression $R \times \rho_{S(X,C,D)}(S)$ is the relation $R \times S$ from Fig. 5.4, except that the five columns are labeled A, B, X, C, and D, from the left. This relation is shown in Fig. 5.9.

As an alternative, we could take the product without renaming, as we did in Example 5.5, and then rename the result. The expression $\rho_{RS(A,B,X,C,D)}(R \times S)$ yields the same relation as in Fig. 5.9, with the same set of attributes. But this relation has a name, RS, while the result relation in Fig. 5.9 has no name.

5.2.10 Dependent and Independent Operations

Some of the operations that we have described in Section 5.2 can be expressed in terms of other relational-algebra operations. For example, intersection can be expressed in terms of set difference:

$$R \cap S = R - (R - S)$$

That is, if R and S are any two relations with the same schema, the intersection of R and S can be computed by first subtracting S from R to form a relation T consisting of all those tuples in R but not S. We then subtract T from R, leaving only those tuples of R that are also in S.

Theta-join can be expressed by product and selection:

$$R \bowtie_C S = \sigma_C(R \times S)$$

The natural join of R and S can be expressed by starting with the product $R \times S$. We then apply the selection operator with a condition C of the form

$$R.A_1 = S.A_1 \text{ AND } R.A_2 = S.A_2 \text{ AND } \cdots \text{ AND } R.A_n = S.A_n$$

where $A_1, A_2, \ldots, A_n$ are all the attributes appearing in the schemas of both R and S. Finally, we must project out one copy of each of the equated attributes. Let L be the list of attributes in the schema of R followed by those attributes in the schema of S that are not also in the schema of R. Then

$$R \bowtie S = \pi_L\bigl(\sigma_C(R \times S)\bigr)$$

Example 5.13: The natural join of the relations U and V from Fig. 5.6 can be written in terms of product, selection, and projection as:

$$\pi_{A,\,U.B,\,U.C,\,D}\bigl(\sigma_{U.B = V.B \text{ AND } U.C = V.C}(U \times V)\bigr)$$

That is, we take the product $U \times V$. Then we select for equality between each pair of attributes with the same name, B and C in this example. Finally, we project onto all the attributes except one of the B's and one of the C's; we have chosen to eliminate the attributes of V whose names also appear in the schema of U.

For another example, the theta-join of Example 5.9 can be written

$$\sigma_{A<D \text{ AND } U.B \neq V.B}(U \times V)$$


That is, we take the product of the relations U and V and then apply the condition that appeared in the theta-join.

The rewriting rules mentioned in this section are the only "redundancies" among the operations that we have introduced. The six remaining operations (union, difference, selection, projection, product, and renaming) form an independent set, none of which can be written in terms of the other five.
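The three rewriting rules of this section can be checked on small examples. This is a sketch under assumptions: relations are plain Python sets of tuples with column positions fixed by hand, and all tuple values (including those of U and V) are illustrative choices that match the pairings described in the text.

```python
# Intersection via difference: R n S = R - (R - S).
R = {("a", 1), ("b", 2), ("c", 3)}
S = {("b", 2), ("c", 3), ("d", 4)}
intersection_via_difference = R - (R - S)

# Assumed sample values for U(A, B, C) and V(B, C, D).
U = {(1, 2, 3), (6, 7, 8), (9, 7, 8)}
V = {(2, 3, 4), (2, 3, 5), (7, 8, 10)}

# Product U x V; columns are (A, U.B, U.C, V.B, V.C, D).
product_rows = {u + v for u in U for v in V}

# Natural join as projection of a selection on the product:
# select U.B = V.B AND U.C = V.C, then project onto A, U.B, U.C, D.
selected = {t for t in product_rows if t[1] == t[3] and t[2] == t[4]}
natural_via_product = {(t[0], t[1], t[2], t[5]) for t in selected}

# The same join computed directly, for comparison.
natural_direct = {u + (v[2],) for u in U for v in V if u[1:] == v[:2]}

# Theta-join as a selection on the product: A < D AND U.B != V.B.
theta_via_product = {t for t in product_rows if t[0] < t[5] and t[1] != t[3]}

print(intersection_via_difference == (R & S))
print(natural_via_product == natural_direct)
print(sorted(theta_via_product))
```

Agreement on one instance does not prove the identities, but it is a quick way to catch a mistake in how a condition or a projection list was written.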

5.2.11 A Linear Notation for Algebraic Expressions

In Section 5.2.8 we used trees to represent complex expressions of relational algebra. Another alternative is to invent names for the temporary relations that correspond to the interior nodes of the tree and write a sequence of assignments that create a value for each. The order of the assignments is flexible, as long as the children of a node N have had their values created before we attempt to create the value for N itself.

The notation we shall use for assignment statements is:

1. A relation name and parenthesized list of attributes for that relation. The name Answer will be used conventionally for the result of the final step; i.e., the name of the relation at the root of the expression tree.

2. The assignment symbol :=.

3. Any algebraic expression on the right. We can choose to use only one operator per assignment, in which case each interior node of the tree gets its own assignment statement. However, it is also permissible to combine several algebraic operations in one right side, if it is convenient to do so.
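A sequence of assignments of this kind can be mimicked with a small evaluator. This is a hypothetical sketch, not the book's notation: tuples are Python dictionaries, each assignment is a (name, function) pair, the evaluator name `evaluate` is invented here, and the Movie rows are illustrative (the third row is a placeholder). The query is the one for the tree of Fig. 5.8: titles and years of Fox movies at least 100 minutes long.

```python
MOVIE = [
    {"title": "Star Wars", "year": 1977, "length": 124, "studioName": "Fox"},
    {"title": "Mighty Ducks", "year": 1991, "length": 104, "studioName": "Disney"},
    {"title": "Placeholder Title", "year": 1992, "length": 95,
     "studioName": "Placeholder Studio"},
]

# One assignment per interior node; each right side may use earlier names only.
ASSIGNMENTS = [
    ("R", lambda env: [t for t in env["Movie"] if t["length"] >= 100]),
    ("S", lambda env: [t for t in env["Movie"] if t["studioName"] == "Fox"]),
    ("T", lambda env: [t for t in env["R"] if t in env["S"]]),
    ("Answer", lambda env: sorted({(t["title"], t["year"]) for t in env["T"]})),
]

def evaluate(assignments, stored_relations):
    """Run the assignments in order, binding each name to its computed value."""
    env = dict(stored_relations)
    for name, expression in assignments:
        # A KeyError here would mean a child was used before it was assigned.
        env[name] = expression(env)
    return env

env = evaluate(ASSIGNMENTS, {"Movie": MOVIE})
print(env["Answer"])
```

Swapping the first two assignments would still work, since neither depends on the other, while moving the assignment for T ahead of R would fail; this reflects the rule that children must be created before their parent.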

Example 5.14: Consider the tree of Fig. 5.8. One possible sequence of assignments to evaluate this expression is:

$R(t, y, l, i, s, p) := \sigma_{length \geq 100}(\text{Movie})$
$S(t, y, l, i, s, p) := \sigma_{studioName = \text{'Fox'}}(\text{Movie})$
$T(t, y, l, i, s, p) := R \cap S$
$\text{Answer}(title, year) := \pi_{t,y}(T)$

The first step computes the relation of the interior node labeled $\sigma_{length \geq 100}$ in Fig. 5.8, and the second step computes the node labeled $\sigma_{studioName = \text{'Fox'}}$. Notice that we get renaming "for free," since we can use any attributes and relation name we wish for the left side of an assignment. The last two steps compute the intersection and the projection.

It is also permissible to combine some of the steps. For instance, we could combine the last two steps and write:

$R(t, y, l, i, s, p) := \sigma_{length \geq 100}(\text{Movie})$
$S(t, y, l, i, s, p) := \sigma_{studioName = \text{'Fox'}}(\text{Movie})$
$\text{Answer}(title, year) := \pi_{t,y}(R \cap S)$

5.2.12 Exercises for Section 5.2

Exercise 5.2.1: In this exercise we introduce one of our running examples of a relational database schema and some sample data.² The database schema consists of four relations, whose schemas are:

Product(maker, model, type)
PC(model, speed, ram, hd, rd, price)
Laptop(model, speed, ram, hd, screen, price)
Printer(model, color, type, price)

The Product relation gives the manufacturer, model number and type (PC, laptop, or printer) of various products. We assume for convenience that model numbers are unique over all manufacturers and product types; that assumption is not realistic, and a real database would include a code for the manufacturer as part of the model number. The PC relation gives for each model number that is a PC the speed (of the processor, in megahertz), the amount of RAM (in megabytes), the size of the hard disk (in gigabytes), the speed and type of the removable disk (CD or DVD), and the price. The Laptop relation is similar, except that the screen size (in inches) is recorded in place of information about the removable disk. The Printer relation records for each printer model whether the printer produces color output (true if so), the process type (laser, ink-jet or bubble), and the price.

Some sample data for the relation Product is shown in Fig. 5.10. Sample data for the other three relations is shown in Fig. 5.11. Manufacturers and model numbers have been "sanitized," but the data is typical of products on sale at the beginning of 2001.

Write expressions of relational algebra to answer the following queries. You may use the linear notation of Section 5.2.11 if you wish. For the data of Figs. 5.10 and 5.11, show the result of your query. However, your answer should work for arbitrary data, not just the data of these figures.

* a) What PC models have a speed of at least 1000?

b) Which manufacturers make laptops with a hard disk of at least one gigabyte?

c) Find the model number and price of all products (of any type) made by

e) Find those manufacturers that sell Laptops but not PC's.

*! f) Find those hard-disk sizes that occur in two or more PC's.

² Source: manufacturers' Web pages and Amazon.com.

Figure 5.10: Sample data for Product

model | speed | ram | hd | rd | price
1001 | 700 | 64 | 10 | 48xCD | 799

model | speed | ram | hd | screen | price

(b) Sample data for relation Laptop

model | color | type | price

! g) Find those pairs of PC models that have both the same speed and RAM. A pair should be listed only once; e.g., list (i, j) but not (j, i).

*!! h) Find those manufacturers of at least two different computers (PC's or laptops) with speeds of at least 700.

!! i) Find the manufacturer(s) of the computer (PC or laptop) with the highest available speed.

!! j) Find the manufacturers of PC's with at least three different speeds.

!! k) Find the manufacturers who sell exactly three different models of PC.

Exercise 5.2.2: Draw expression trees for each of your expressions of Exercise 5.2.1.

Exercise 5.2.3: Write each of your expressions from Exercise 5.2.1 in the linear notation of Section 5.2.11.

Exercise 5.2.4: This exercise introduces another running example, concerning World War II capital ships. It involves the following relations:

Classes(class, type, country, numGuns, bore, displacement)
Ships(name, class, launched)
Battles(name, date)
Outcomes(ship, battle, result)

Ships are built in "classes" from the same design, and the class is usually named for the first ship of that class. The relation Classes records the name of the class, the type (bb for battleship or bc for battlecruiser), the country that built the ship, the number of main guns, the bore (diameter of the gun barrel, in inches) of the main guns, and the displacement (weight, in tons). Relation Ships records the name of the ship, the name of its class, and the year in which the ship was launched. Relation Battles gives the name and date of battles involving these ships, and relation Outcomes gives the result (sunk, damaged, or ok) for each ship in each battle.

Figures 5.12 and 5.13 give some sample data for these four relations.³ Note that, unlike the data for Exercise 5.2.1, there are some "dangling tuples" in this data, e.g., ships mentioned in Outcomes that are not mentioned in Ships.

Write expressions of relational algebra to answer the following queries. For the data of Figs. 5.12 and 5.13, show the result of your query. However, your answer should work for arbitrary data, not just the data of these figures.

a) Give the class names and countries of the classes that carried guns of at least 16-inch bore.

Figure 5.12 (class column): Bismarck, Iowa, Kongo, North Carolina, Renown, Revenge, Tennessee, Yamato

(c) Sample data for relation Outcomes

³ Source: J. S. Westwood, Fighting Ships of World War II, Follett Publishing, Chicago

TS 1980


name | class | launched
California | Tennessee | 1921
Haruna | Kongo |
Hiei | Kongo |
Iowa | Iowa |
Kirishima | Kongo |
Kongo | Kongo |
Missouri | Iowa |
Musashi | Yamato |
New Jersey | |
North Carolina | |
Ramillies | |
Resolution | Revenge | 1916
Royal Sovereign | Revenge |
Washington | North Carolina |
Wisconsin | Iowa |
Yamato | Yamato |

Figure 5.13: Sample data for relation Ships

b) Find the ships launched prior to 1921.

c) Find the ships sunk in the battle of the North Atlantic.

d) The treaty of Washington in 1921 prohibited capital ships heavier than 35,000 tons. List the ships that violated the treaty of Washington.

e) List the name, displacement, and number of guns of the ships engaged in the battle of Guadalcanal.

f) List all the capital ships mentioned in the database. (Remember that all these ships may not appear in the Ships relation.)

! g) Find the classes that had only one ship as a member of that class.

! h) Find those countries that had both battleships and battlecruisers.

! i) Find those ships that "lived to fight another day"; they were damaged in one battle, but later fought in another.

Exercise 5.2.5: Draw expression trees for each of your expressions of Exercise 5.2.4.

Exercise 5.2.6: Write each of your expressions from Exercise 5.2.4 in the linear notation of Section 5.2.11.

Exercise 5.2.7: What is the difference between the natural join $R \bowtie S$ and the theta-join $R \bowtie_C S$ where the condition C is that R.A = S.A for each attribute A appearing in the schemas of both R and S?

Exercise 5.2.8: An operator on relations is said to be monotone if whenever we add a tuple to one of its arguments, the result contains all the tuples that it contained before adding the tuple, plus perhaps more tuples. Which of the operators described in this section are monotone? For each, either explain why it is monotone or give an example showing it is not.

Exercise 5.2.9: Suppose relations R and S have n tuples and m tuples, respectively. Give the minimum and maximum numbers of tuples that the results of the following expressions can have.

c) $\sigma_C(R) \times S$, for some condition C.

d) $\pi_L(R) - S$, for some list of attributes L.

Exercise 5.2.10: The semijoin of relations R and S, written $R \ltimes S$, is the bag of tuples t in R such that there is at least one tuple in S that agrees with t in all attributes that R and S have in common. Give three different expressions of relational algebra that are equivalent to $R \ltimes S$.

Exercise 5.2.11: The antisemijoin $R \,\overline{\ltimes}\, S$ is the bag of tuples t in R that do not agree with any tuple of S in the attributes common to R and S. Give an expression of relational algebra equivalent to $R \,\overline{\ltimes}\, S$.

Exercise 5.2.12: Let R be a relation with schema

$$(A_1, A_2, \ldots, A_n, B_1, B_2, \ldots, B_m)$$

and let S be a relation with schema $(B_1, B_2, \ldots, B_m)$; that is, the attributes of S are a subset of the attributes of R. The quotient of R and S, denoted $R \div S$, is the set of tuples t over attributes $A_1, A_2, \ldots, A_n$ (i.e., the attributes of R that are not attributes of S) such that for every tuple s in S, the tuple ts, consisting of the components of t for $A_1, A_2, \ldots, A_n$ and the components of s for $B_1, B_2, \ldots, B_m$, is a member of R. Give an expression of relational algebra, using the operators we have defined previously in this section, that is equivalent to $R \div S$.


While a set of tuples (i.e., a relation) is a simple, natural model of data as it might appear in a database, commercial database systems rarely, if ever, are based purely on sets. In some situations, relations as they appear in database systems are permitted to have duplicate tuples. Recall that if a "set" is allowed to have multiple occurrences of a member, then that set is called a bag or multiset. In this section, we shall consider relations that are bags rather than sets; that is, we shall allow the same tuple to appear more than once in a relation. When we refer to a "set," we mean a relation without duplicate tuples; a "bag" means a relation that may (or may not) have duplicate tuples.

Example 5.15: The relation in Fig. 5.14 is a bag of tuples. In it, the tuple (1,2) appears three times and the tuple (3,4) appears once. If Fig. 5.14 were a set-valued relation, we would have to eliminate two occurrences of the tuple (1,2). In a bag-valued relation, we do allow multiple occurrences of the same tuple, but, as with sets, the order of tuples does not matter.

Figure 5.14: A bag

5.3 Relational Operations on Bags

5.3.1 Why Bags?

When we think about implementing relations efficiently, we can see several ways that allowing relations to be bags rather than sets can speed up operations on relations. We mentioned at the beginning of Section 5.2 how allowing the result to be a bag could speed up the union of two relations. For another example, when we do a projection, allowing the resulting relation to be a bag (even when the original relation is a set) lets us work with each tuple independently. If we want a set as the result, we need to compare each projected tuple with all the other projected tuples, to make sure that each projection appears only once. However, if we can accept a bag as the result, then we simply project each tuple and add it to the result; no comparison with other projected tuples is necessary.

Example 5.16: The bag of Fig. 5.14 could be the result of projecting the relation shown in Fig. 5.15 onto attributes A and B, provided we allow the result to be a bag and do not eliminate the duplicate occurrences of (1,2). Had we used the ordinary projection operator of relational algebra, and therefore eliminated duplicates, the result would be only:

    A  B
    1  2
    3  4

Figure 5.15: Bag for Example 5.16

Note that the bag result, although larger, can be computed more quickly, since there is no need to compare each tuple (1,2) or (3,4) with previously generated tuples.

Moreover, if we are projecting a relation in order to take an aggregate (discussed in Section 5.4), such as "Find the average value of A in Fig. 5.15," we could not use the set model to think of the relation projected onto attribute A. As a set, the average value of A is 2, because there are only two values of A, namely 1 and 3, in Fig. 5.15, and their average is 2. However, if we treat the A-column in Fig. 5.15 as a bag {1,3,1,1}, we get the correct average of A, which is 1.5, among the four tuples of Fig. 5.15.

5.3.2 Union, Intersection, and Difference of Bags

When we take the union of two bags, we add the number of occurrences of each tuple. That is, if R is a bag in which the tuple t appears n times, and S is a bag in which the tuple t appears m times, then in the bag R ∪ S tuple t appears n + m times. Note that either n or m (or both) can be 0.

When we intersect two bags R and S, in which tuple t appears n and m times, respectively, in R ∩ S tuple t appears min(n, m) times. When we compute R - S, the difference of bags R and S, tuple t appears in R - S max(0, n - m) times. That is, if t appears in R more times than it appears in S, then in R - S tuple t appears the number of times it appears in R, minus the number of times it appears in S. However, if t appears at least as many times in S as it appears in R, then t does not appear at all in R - S. Intuitively, occurrences of t in S each "cancel" one occurrence in R.

Example 5.17: Let R be the relation of Fig. 5.14, that is, a bag in which tuple (1,2) appears three times and (3,4) appears once. Let S be the bag

    A  B
    1  2
    3  4
    3  4
    5  6


Bag Operations on Sets

Imagine we have two sets R and S. Every set may be thought of as a bag; the bag just happens to have at most one occurrence of any tuple. Suppose we intersect R ∩ S, but we think of R and S as bags and use the bag intersection rule. Then we get the same result as we would get if we thought of R and S as sets. That is, thinking of R and S as bags, a tuple t is in R ∩ S the minimum of the number of times it is in R and S. Since R and S are sets, t can be in each only 0 or 1 times. Whether we use the bag or set intersection rules, we find that t can appear at most once in R ∩ S, and it appears once exactly when it is in both R and S. Similarly, if we use the bag difference rule to compute R - S or S - R, we get exactly the same result as if we used the set rule.

However, union behaves differently, depending on whether we think of R and S as sets or bags. If we use the bag rule to compute R ∪ S, then the result may not be a set, even if R and S are sets. In particular, if tuple t appears in both R and S, then t appears twice in R ∪ S if we use the bag rule for union. But if we use the set rule, then t appears only once in R ∪ S. Thus, when taking unions, we must be especially careful to specify whether we are using the bag or set definition of union.
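The observation in the box can be checked with a small sketch. It assumes a bag is represented by Python's `collections.Counter` (tuple mapped to number of occurrences), whose `+`, `&`, and `-` operators happen to implement the n + m, min(n, m), and max(0, n - m) rules; the function names and the two sample sets are hypothetical.

```python
from collections import Counter

def bag_union(r, s):
    """A tuple appearing n times in r and m times in s appears n + m times."""
    return r + s

def bag_intersection(r, s):
    """A tuple appearing n and m times appears min(n, m) times."""
    return r & s

def bag_difference(r, s):
    """A tuple appearing n and m times appears max(0, n - m) times."""
    return r - s

# Two sets, viewed as bags in which every tuple occurs exactly once.
R = Counter({(1, 2): 1, (3, 4): 1})
S = Counter({(3, 4): 1, (5, 6): 1})

# Intersection and difference agree with the set rules ...
same_intersection = set(bag_intersection(R, S).elements()) == set(R) & set(S)
same_difference = set(bag_difference(R, S).elements()) == set(R) - set(S)

# ... but bag union keeps both copies of the shared tuple (3, 4).
union_as_bag = bag_union(R, S)
union_as_set = set(R) | set(S)
```

Here `union_as_bag` holds four tuples while `union_as_set` holds three, which is exactly the difference the box warns about.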

Then the bag union R ∪ S is the bag in which (1,2) appears four times (three times for its occurrences in R and once for its occurrence in S); (3,4) appears three times, and (5,6) appears once.

The bag intersection R ∩ S is the bag

    A  B
    1  2
    3  4

with one occurrence each of (1,2) and (3,4). That is, (1,2) appears three times in R and once in S, and min(3,1) = 1, so (1,2) appears once in R ∩ S. Similarly, (3,4) appears min(1,2) = 1 time in R ∩ S. Tuple (5,6), which appears once in S but zero times in R, appears min(0,1) = 0 times in R ∩ S.

The bag difference R - S is the bag

    A  B
    1  2
    1  2

To see why, notice that (1,2) appears three times in R and once in S, so in R - S it appears max(0, 3 - 1) = 2 times. Tuple (3,4) appears once in R and twice in S, so in R - S it appears max(0, 1 - 2) = 0 times. No other tuple appears in R, so there can be no other tuples in R - S.

As another example, the bag difference S - R is the bag

    A  B
    3  4
    5  6

Tuple (3,4) appears once because that is the difference in the number of times it appears in S minus the number of times it appears in R. Tuple (5,6) appears once in S - R for the same reason. The resulting bag happens to be a set in this case.

5.3.3 Projection of Bags

We have already illustrated the projection of bags. As we saw in Example 5.16, each tuple is processed independently during the projection. If R is the bag of Fig. 5.15 and we compute the bag-projection π_{A,B}(R), then we get the bag of Fig. 5.14.

If the elimination of one or more attributes during the projection causes the same tuple to be created from several tuples, these duplicate tuples are not eliminated from the result of a bag-projection. Thus, the three tuples (1,2,5), (1,2,7), and (1,2,8) of the relation R from Fig. 5.15 each gave rise to the same tuple (1,2) after projection onto attributes A and B. In the bag result, there are three occurrences of tuple (1,2), while in the set-projection, this tuple appears only once.

5.3.4 Selection on Bags

To apply a selection to a bag, we apply the selection condition to each tuple independently.
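Bag-projection and bag-selection both process each tuple on its own, which a short sketch makes concrete. The relation below follows what the text says about Fig. 5.15 (tuples (1,2,5), (1,2,7), (1,2,8), and one tuple with A = 3, B = 4); the C-value 6 of that last tuple, the helper names, and the selection condition are assumptions made only for illustration.

```python
from collections import Counter

# Bag over attributes (A, B, C); the C-value of (3, 4, 6) is assumed.
R = Counter({(1, 2, 5): 1, (3, 4, 6): 1, (1, 2, 7): 1, (1, 2, 8): 1})

def bag_project(bag, positions):
    """Project every tuple independently; equal results accumulate
    instead of being compared and merged."""
    out = Counter()
    for t, n in bag.items():
        out[tuple(t[i] for i in positions)] += n
    return out

def bag_select(bag, condition):
    """Keep each tuple, with all of its copies, when the condition holds."""
    return Counter({t: n for t, n in bag.items() if condition(t)})

ab = bag_project(R, [0, 1])                       # projection onto A, B
a_values = [t[0] for t in bag_project(R, [0]).elements()]
avg_as_bag = sum(a_values) / len(a_values)        # averages 1, 3, 1, 1
avg_as_set = sum(set(a_values)) / len(set(a_values))  # averages 1, 3
selected = bag_select(R, lambda t: t[2] >= 6)     # hypothetical condition on C
```

The bag projection keeps three copies of (1, 2), and only the bag view of column A yields the average over all four tuples.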


Algebraic Laws for Bags

An algebraic law is an equivalence between two expressions of relational algebra whose arguments are variables standing for relations. The equivalence asserts that no matter what relations we substitute for these variables, the two expressions define the same relation. An example of a well-known law is the commutative law for union: R ∪ S = S ∪ R. This law happens to hold whether we regard relation-variables R and S as standing for sets or bags. However, there are a number of other laws that hold when relational algebra is applied to sets but that do not hold when relations are interpreted as bags. A simple example of such a law is the distributive law of set difference over union, (R ∪ S) - T = (R - T) ∪ (S - T). This law holds for sets but not for bags. To see why it fails for bags, suppose R, S, and T each have one copy of tuple t. Then the expression on the left has one t, while the expression on the right has none. As sets, neither would have t. Some exploration of algebraic laws for bags appears in Exercises 5.3.4 and 5.3.5.
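The counterexample for the distributive law can be replayed mechanically. This sketch assumes bags are modeled with `collections.Counter` and sets with Python's built-in `set`; the tuple `t` is an arbitrary placeholder.

```python
from collections import Counter

t = ("t",)  # placeholder tuple; its contents do not matter

# As bags: R, S, and T each hold one copy of t.
R, S, T = Counter({t: 1}), Counter({t: 1}), Counter({t: 1})
left_bag = (R + S) - T            # two copies minus one copy
right_bag = (R - T) + (S - T)     # nothing plus nothing

# As sets: the same law evaluated with set union and set difference.
Rs, Ss, Ts = {t}, {t}, {t}
left_set = (Rs | Ss) - Ts
right_set = (Rs - Ts) | (Ss - Ts)
```

With bags the two sides differ (one copy of `t` versus none), while with sets both sides are empty.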

That is, all but the first tuple meet the selection condition. The last two tuples, which are duplicates in R, are each included in the result. □

Figure 5.16: Computing the product of bags: (a) the relation R; (b) the relation S; (c) the product R × S

5.3.5 Product of Bags

The rule for the Cartesian product of bags is the expected one. Each tuple of one relation is paired with each tuple of the other, regardless of whether it is a duplicate or not. As a result, if a tuple r appears in a relation R m times, and tuple s appears n times in relation S, then in the product R × S, the tuple rs will appear mn times.

Example 5.19: Let R and S be the bags shown in Fig. 5.16. Then the product R × S consists of six tuples, as shown in Fig. 5.16(c). Note that the usual convention regarding attribute names that we developed for set-relations applies equally well to bags. Thus, the attribute B, which belongs to both relations R and S, appears twice in the product, each time prefixed by one of the relation names.

5.3.6 Joins of Bags

Joining bags also presents no surprises. We compare each tuple of one relation with each tuple of the other, decide whether or not this pair of tuples joins successfully, and if so we put the resulting tuple in the answer. When constructing the answer, we do not eliminate duplicate tuples.

That is, tuple (1,2) of R joins with (2,3) of S. Since there are two copies of (1,2) in R and one copy of (2,3) in S, there are two pairs of tuples that join to give the tuple (1,2,3). No other tuples from R and S join successfully.
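The multiplicity rules for product and natural join can be sketched as follows. The contents of R(A, B) and S(B, C) are an assumption inferred from the surrounding text (two copies of (1,2) in R; one copy of (2,3) and two copies of (4,5) in S), since the figure itself is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter

# Assumed contents of the two bags, inferred from the prose.
R = Counter({(1, 2): 2})              # schema R(A, B)
S = Counter({(2, 3): 1, (4, 5): 2})   # schema S(B, C)

def bag_product(r, s):
    """Pair every tuple of r with every tuple of s; a pair whose parts
    occur m and n times occurs m * n times."""
    out = Counter()
    for rt, m in r.items():
        for st, n in s.items():
            out[rt + st] += m * n
    return out

def bag_natural_join_ab_bc(r, s):
    """Natural join of R(A, B) with S(B, C): pairs must agree on B,
    and B is kept only once in the result."""
    out = Counter()
    for (a, b), m in r.items():
        for (b2, c), n in s.items():
            if b == b2:
                out[(a, b, c)] += m * n
    return out

product = bag_product(R, S)            # columns A, R.B, S.B, C
joined = bag_natural_join_ab_bc(R, S)  # columns A, B, C
```

Under these assumed contents the product holds six tuples in total and the join holds two copies of (1, 2, 3), matching the counts given in the text.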



As another example on the same relations R and S, the theta-join

produces the bag

The computation of the join is as follows. Tuple (1,2) from R and (4,5) from S meet the join condition. Since each appears twice in its relation, the number of times the joined tuple appears in the result is 2 × 2 or 4. The other possible join of tuples, (1,2) from R with (2,3) from S, fails to meet the join condition, so this combination does not appear in the result.
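The 2 × 2 count can be reproduced with a small sketch of a theta-join on bags. Two things here are assumptions: the bag contents (inferred from the prose, as the figure is not reproduced) and the join condition R.B < S.B, which is merely one condition consistent with the pairs the text says succeed and fail.

```python
from collections import Counter

# Assumed contents: R(A, B) and S(B, C).
R = Counter({(1, 2): 2})
S = Counter({(2, 3): 1, (4, 5): 2})

def bag_theta_join(r, s, condition):
    """Pair tuples of r and s that satisfy the condition; multiplicities
    multiply, and no duplicates are eliminated."""
    out = Counter()
    for rt, m in r.items():
        for st, n in s.items():
            if condition(rt, st):
                out[rt + st] += m * n
    return out

# Assumed condition R.B < S.B: position 1 of an R-tuple, position 0 of an S-tuple.
result = bag_theta_join(R, S, lambda rt, st: rt[1] < st[0])
```

Only the pairing of (1, 2) with (4, 5) survives, and it appears four times.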

5.3.7 Exercises for Section 5.3

* Exercise 5.3.1: Let PC be the relation of Fig. 5.11(a), and suppose we compute the projection π_speed(PC). What is the value of this expression as a set? As a bag? What is the average value of tuples in this projection, when treated as a set? As a bag?

Exercise 5.3.2: Repeat Exercise 5.3.1 for the projection π_hd(PC).

Exercise 5.3.3: This exercise refers to the "battleship" relations of Exercise 5.2.4.

a) The expression π_bore(Classes) yields a single-column relation with the bores of the various classes. For the data of Exercise 5.2.4, what is this relation as a set? As a bag?

! b) Write an expression of relational algebra to give the bores of the ships (not the classes). Your expression must make sense for bags; that is, the number of times a value b appears must be the number of ships that have bore b.

! Exercise 5.3.4: Certain algebraic laws for relations as sets also hold for relations as bags. Explain why each of the laws below holds for bags as well as sets.

* a) The associative law for union: (R ∪ S) ∪ T = R ∪ (S ∪ T).

b) The associative law for intersection: (R ∩ S) ∩ T = R ∩ (S ∩ T).

c) The associative law for natural join: (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T).

d) The commutative law for union: (R ∪ S) = (S ∪ R).

e) The commutative law for intersection: (R ∩ S) = (S ∩ R).

f) The commutative law for natural join: (R ⋈ S) = (S ⋈ R).

g) π_L(R ∪ S) = π_L(R) ∪ π_L(S). Here, L is an arbitrary list of attributes.

* h) The distributive law of union over intersection: R ∪ (S ∩ T) = (R ∪ S) ∩ (R ∪ T).

i) σ_{C AND D}(R) = σ_C(R) ∩ σ_D(R). Here, C and D are arbitrary conditions about the tuples of R.

Exercise 5.3.5: The following algebraic laws hold for sets but not for bags. Explain why they hold for sets and give counterexamples to show that they do not hold for bags.

* a) (R ∩ S) - T = R ∩ (S - T).

b) The distributive law of intersection over union: R ∩ (S ∪ T) = (R ∩ S) ∪ (R ∩ T).

c) σ_{C OR D}(R) = σ_C(R) ∪ σ_D(R). Here, C and D are arbitrary conditions about the tuples of R.

5.4 Extended Operators of Relational Algebra

Section 5.2 presented the classical relational algebra, and Section 5.3 introduced the modifications necessary to treat relations as bags of tuples rather than sets. The ideas of these two sections serve as a foundation for most of modern query languages. However, languages such as SQL have several other operations that have proved quite important in applications. Thus, a full treatment of relational operations must include a number of other operators, which we introduce in this section. The additions:

1. The duplicate-elimination operator δ turns a bag into a set by eliminating all but one copy of each tuple.

2. Aggregation operators, such as sums or averages, are not operations of relational algebra, but are used by the grouping operator (described next). Aggregation operators apply to attributes (columns) of a relation, e.g., the sum of a column produces the one number that is the sum of all the values in that column.

3. Grouping of tuples according to their value in one or more attributes has the effect of partitioning the tuples of a relation into "groups." Aggregation can then be applied to columns within each group, giving us the ability to express a number of queries that are impossible to express in the classical relational algebra. The grouping operator γ is an operator that combines the effect of grouping and aggregation.

4. The sorting operator τ turns a relation into a list of tuples, sorted according to one or more attributes. This operator should be used judiciously, because other relational-algebra operators apply to sets or bags, but never to lists. Thus, τ only makes sense as the final step of a series of operations.

5. Extended projection gives additional power to the operator π. In addition to projecting out some columns, in its generalized form π can perform computations involving the columns of its argument relation to produce new columns.

6. The outerjoin operator is a variant of the join that avoids losing dangling tuples. In the result of the outerjoin, dangling tuples are "padded" with the null value, so the dangling tuples can be represented in the output.

5.4.1 Duplicate Elimination

Sometimes, we need an operator that converts a bag to a set. For that purpose, we use δ(R) to return the set consisting of one copy of every tuple that appears one or more times in relation R.

If R is the relation from Fig. 5.14, then δ(R) is

    A  B
    1  2
    3  4

Note that the tuple (1,2), which appeared three times in R, appears only once in δ(R).

5.4.2 Aggregation Operators

There are several operators that apply to sets or bags of atomic values. These operators are used to summarize or "aggregate" the values in one column of a relation, and thus are referred to as aggregation operators. The standard operators of this type are:

1. SUM produces the sum of a column with numerical values.

2. AVG produces the average of a column with numerical values.

3. MIN and MAX, applied to a column with numerical values, produce the smallest or largest value, respectively. When applied to a column with character-string values, they produce the lexicographically (alphabetically) first or last value, respectively.

4. COUNT produces the number of (not necessarily distinct) values in a column. Equivalently, COUNT applied to any attribute of a relation produces the number of tuples of that relation, including duplicates.

Some examples of aggregations on the attributes of this relation are:

1. SUM(B) = 2 + 4 + 2 + 2 = 10.

5.4.3 Grouping

Often we do not want simply the average or some other aggregation of an entire column. Rather, we need to consider the tuples of a relation in groups, corresponding to the value of one or more other columns, and we aggregate only within each group. As an example, suppose we wanted to compute the total number of minutes of movies produced by each studio, i.e., a relation such as:

Starting with the relation


Movie(title, year, length, inColor, studioName, producerC#)

from our example database schema of Section 5.1, we must group the tuples according to their value for attribute studioName. We must then sum the length column within each group. That is, we imagine that the tuples of Movie are grouped as suggested in Fig. 5.17, and we apply the aggregation SUM(length) to each group independently.

Figure 5.17: A relation with imaginary division into groups (studioName column: three Disney tuples, then two MGM tuples)

5.4.4 The Grouping Operator

We shall now introduce an operator that allows us to group a relation and/or aggregate some columns. If there is grouping, then the aggregation is within groups.

... be a grouping attribute.

b) An aggregation operator applied to an attribute of the relation. To provide a name for the attribute corresponding to this aggregation in the result, an arrow and new name are appended to the aggregation. The underlying attribute is said to be an aggregated attribute.

The relation returned by the expression γ_L(R) is constructed as follows:

1. Partition the tuples of R into groups. Each group consists of all tuples having one particular assignment of values to the grouping attributes in the list L. If there are no grouping attributes, the entire relation R is one group.

2. For each group, produce one tuple consisting of:

   i. The grouping attributes' values for that group and

   ii. The aggregations, over all tuples of that group, for the aggregated attributes on list L.

... R. Since the result of γ contains exactly one tuple from each group, the effect of this "grouping" is to eliminate duplicates. However, because δ is such a common and important operator, we shall continue to consider it separately when we study algebraic laws and algorithms for implementing the operators.

One can also see γ as an extension of the projection operator on sets. That is, γ_{A1,A2,...,An}(R) is also the same as π_{A1,A2,...,An}(R), if R is a set. However, if R is a bag, then γ eliminates duplicates while π does not. For this reason, γ is often referred to as generalized projection.

Example 5.23: Suppose we have the relation

... the MIN(year) aggregate. However, in order to decide which groups satisfy the condition that the star appears in at least three movies, we must also compute the COUNT(title) aggregate for each group.

We begin with the grouping expression

γ_{starName, MIN(year)→minYear, COUNT(title)→ctTitle}(StarsIn)

The first two columns of the result of this expression are needed for the query result. The third column is an auxiliary attribute, which we have named ctTitle; it is needed to determine whether a star has appeared in at least three movies. That is, we continue the algebraic expression for the query by selecting for ctTitle >= 3 and then projecting onto the first two columns. An expression tree for the query is shown in Fig. 5.18. □
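The two-step construction of the grouping operator (partition into groups, then emit one tuple per group) can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: relations are lists of dictionaries, the function name `group_and_aggregate` is hypothetical, and the StarsIn rows with attributes title, year, and starName are invented sample data.

```python
from collections import defaultdict

def group_and_aggregate(rows, group_by, aggregations):
    """Sketch of the grouping operator.
    rows: a bag of tuples, each a dict from attribute name to value.
    group_by: list of grouping attribute names.
    aggregations: list of (function, aggregated attribute, result name)."""
    # Step 1: partition the tuples by their grouping-attribute values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in group_by)].append(row)
    # Step 2: one output tuple per group, holding the grouping values
    # followed by each aggregation over the members of that group.
    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for func, attr, name in aggregations:
            out[name] = func([m[attr] for m in members])
        result.append(out)
    return result

# Invented sample data for StarsIn(title, year, starName).
stars_in = [
    {"title": "Movie A", "year": 1990, "starName": "Star X"},
    {"title": "Movie B", "year": 1985, "starName": "Star X"},
    {"title": "Movie C", "year": 1999, "starName": "Star X"},
    {"title": "Movie A", "year": 1990, "starName": "Star Y"},
]

# Group by starName, computing MIN(year) as minYear and COUNT(title) as ctTitle,
# then select ctTitle >= 3 and project onto the first two columns.
grouped = group_and_aggregate(
    stars_in, ["starName"], [(min, "year", "minYear"), (len, "title", "ctTitle")]
)
answer = [(g["starName"], g["minYear"]) for g in grouped if g["ctTitle"] >= 3]

# Grouping on all attributes with no aggregation removes duplicates.
bag = [{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 1, "B": 2}, {"A": 1, "B": 2}]
distinct = group_and_aggregate(bag, ["A", "B"], [])
```

The last two lines illustrate the remark that duplicate elimination is a special case of grouping: with every attribute as a grouping attribute and no aggregations, each distinct tuple yields exactly one output tuple.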
