An Introduction to Database Systems, 8th Edition - C. J. Date - Solutions Manual, Episode 2, Part 6


individual client, it looks like a regular DBMS.* However, the data is stored, mostly, not at the middleware site, but rather at any number of other sites behind the scenes, under the control of a variety of other DBMSs (or even file managers). In other words, the middleware product uses the combination of those other DBMSs and/or file managers as its own storage manager (and coordinates their operation, of course).

──────────

* In the case of DataJoiner, at least, it is a DBMS (among other things). Why would you buy DB2 when you can buy DataJoiner instead? (The question is hypothetical, or rhetorical, but the point is that not all technical questions have technical answers! The answer to this particular question probably has more to do with IBM's marketing and pricing strategies than it does with technical issues.)

──────────

21.7 SQL Facilities

Explain client/server capabilities──CONNECT, DISCONNECT, SET CONNECTION (not in too much detail). By the way, note the syntax: CONNECT TO but not DISCONNECT FROM (this point isn't mentioned in the book). You could elaborate on SQL/PSM's stored procedure support if you like, but it's complicated (see reference [4.20]).

Answers to Exercises

21.1 Location independence means users can behave (at least from a logical standpoint) as if the data were all stored at their own local site. Fragmentation independence means users can behave (at least from a logical standpoint) as if the data weren't fragmented. Replication independence means users can behave (at least from a logical standpoint) as if the data weren't replicated.

21.2 Here are some of the reasons:

• Ease of data fragmentation

• Ease of data reconstruction

• Set-level operations

• Optimizability


21.3 See Section 21.2.

21.6 No answer provided.


Chapter 22

Principal Sections

• Aspects of decision support

• DB design for decision support

• Data preparation

• Data warehouses and data marts

• OLAP

• Data mining

• SQL facilities

General Remarks

David McGoveran was the original author of this chapter.

The term decision support covers a multitude of sins! (After all, classical query processing could certainly be regarded as decision support, of a kind; so too could traditional transaction processing, perhaps with a bit of a stretch.) This chapter begins by giving some historical perspective, then concentrates on the currently fashionable notions of (a) "data warehouses," "data marts," and so forth, and (b) "online analytical processing" (OLAP). It also includes a brief look at the application of statistical techniques to discover patterns in very large volumes of data──data mining (a comparatively new field, made possible by the combined availability of cheap computer storage and fast computer processing). It concludes with a sketch of the pertinent features of SQL.

The chapter is, primarily, a high-level overview of what by now is a large subject in its own right. An important quote from Section 22.1: "We remark immediately that one thing [these areas] all have in common is that good logical design principles are rarely followed in any of them! The practice of decision support is, regrettably, not as scientific as it might be; often, in fact, it's quite ad hoc. In particular, it tends to be driven by physical considerations much more than by logical ones──indeed, it tends to blur the logical vs. physical distinction considerably." Caveat lector.

We use SQL, not Tutorial D, as the basis for examples; we use the "fuzzy" terminology of rows, columns, and tables in place of tuples, attributes, and relation values and variables (relvars); we use logical schema and physical schema in place of conceptual schema and internal schema.

The chapter can be skipped or skimmed if desired.

22.2 Aspects of Decision Support

Key point: The database is primarily read-only (except for periodic load or refresh operations). Also:

• Columns tend to be used in combination

• Integrity in general is not a concern; the data is assumed to be correct when first loaded and isn't subsequently updated. (These facts don't mean we don't have to declare integrity constraints, though!──see the next section.)

• Keys often include a temporal component

• The database tends to be large

• The database tends to be heavily indexed

• The database often involves various kinds of controlled redundancy (including "summary tables" as well as straight data replication)

Decision support queries tend to be quite complex. Here are some of the kinds of complexities that can arise:

• Boolean expression complexity

• Join complexity

• Function complexity

• Analytical complexity

All of the foregoing factors lead to a strong emphasis on designing for performance. Of course, this fact should affect only the physical design of the database, not the logical design, but (as previously noted) vendors and users both typically fail to distinguish properly between the two ... segue into the next section.

22.3 DB Design for Decision Support

Self-explanatory. Observe in particular:


• The treatment of composite columns

• The fact that integrity constraints need to be considered and stated, even though the database is read-only

• The issues concerning "temporal keys" (forward pointer to Chapter 23)

Note especially the remarks concerning physical design and the subsection on common design errors (especially with respect to "star schemas"──forward pointer to Section 22.5).

22.4 Data Preparation

Also self-explanatory. Note the discussion of extract in particular (if the section is covered at all──but it could easily be skipped).

22.5 Data Warehouses and Data Marts

Note first that these terms aren't very precisely defined! Loosely, however, a data mart is (a copy of) some "hot subset" of the data warehouse. Discuss the desirability of separating decision support and operational processing. There are arguments (in fact, they seem to be warming up a little these days) in favor of integrating them, too.

Describe dimensional schemas, star schemas, and fact and dimension tables. Explain "star join." What's the difference between a star schema and a normal schema? This question is hard to answer with simple examples, because a simple star schema can look very similar (even identical) to a good relational design.
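
As a rough illustration of these terms, the following Python sketch uses the standard sqlite3 module to build a tiny star schema (one fact table that references two dimension tables) and to run a star join over it. The table and column names (SALES, PRODUCT, STORE, and so on) and the data are hypothetical, invented for this sketch; they are not the book's example.

```python
import sqlite3

def star_join_demo():
    """Build a tiny, hypothetical star schema and run a star join over it."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimension tables: one row per product and one row per store.
        CREATE TABLE PRODUCT (PROD_ID  TEXT PRIMARY KEY, CATEGORY TEXT NOT NULL);
        CREATE TABLE STORE   (STORE_ID TEXT PRIMARY KEY, REGION   TEXT NOT NULL);
        -- Fact table: its key is the combination of the dimension keys.
        CREATE TABLE SALES (
            PROD_ID  TEXT NOT NULL REFERENCES PRODUCT (PROD_ID),
            STORE_ID TEXT NOT NULL REFERENCES STORE (STORE_ID),
            QTY      INTEGER NOT NULL,
            PRIMARY KEY (PROD_ID, STORE_ID)
        );
        INSERT INTO PRODUCT VALUES ('P1', 'Shoes'), ('P2', 'Socks');
        INSERT INTO STORE   VALUES ('T1', 'East'),  ('T2', 'West');
        INSERT INTO SALES   VALUES ('P1', 'T1', 10), ('P2', 'T1', 5), ('P1', 'T2', 7);
    """)
    # The star join: the fact table joined to each of its dimension tables,
    # followed here by an aggregation over dimension attributes.
    rows = con.execute("""
        SELECT ST.REGION, PR.CATEGORY, SUM(SA.QTY) AS TOTAL_QTY
        FROM   SALES SA
        JOIN   PRODUCT PR ON PR.PROD_ID  = SA.PROD_ID
        JOIN   STORE   ST ON ST.STORE_ID = SA.STORE_ID
        GROUP  BY ST.REGION, PR.CATEGORY
        ORDER  BY ST.REGION, PR.CATEGORY
    """).fetchall()
    con.close()
    return rows
```

Note that this particular toy schema is also a perfectly ordinary normalized design, which is exactly why simple examples don't separate the two approaches very well.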

In fact, however, there are several problems with the star schema approach in general:

• It's ad hoc (based on intuition, not principle)

• Star schemas tend to be physical, not logical

• Sometimes information is lost

• The fact table often contains several different types of facts

• The dimension tables can become nonuniform, too


• The dimension tables are often less than fully normalized

Note: One reviewer of the previous edition said: "[This section] is critical of the star schema [approach] but proposes no alternative." Actually, the section isn't so much critical of star schemas as such (how could it be, without a precise definition of the concept?); rather, it's critical of the fact that, very often, what people call a "star schema" is simply a bad logical design. And, of course, the section does implicitly propose an alternative: namely, good logical design (i.e., design done in accordance with well-established relational design principles, as described in Chapters 12 and 13).

22.6 OLAP

Analytical processing always implies data aggregation, usually according to many different groupings. In classical relational languages (and in SQL too, prior to SQL:1999), each individual query involves at most one grouping (perhaps implicit) and produces just one table as its result; hence, n distinct groupings require n distinct queries, producing n distinct results. It thus seems worth trying to find a way:

a. Of requesting several levels of aggregation in a single query, and thereby

b. Offering the implementation the opportunity to compute all of those aggregations more efficiently (i.e., in a single pass)

Such considerations are the motivation behind the GROUPING SETS, ROLLUP, and CUBE options on the GROUP BY clause found in certain SQL implementations and also (since SQL:1999) in the SQL standard.
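
The idea behind points a and b can be sketched outside SQL. The following Python function makes one pass over the rows and accumulates totals for every requested grouping at once. The function name, the dictionary-based row format, and the sample shipment data are assumptions made for this illustration; the sketch says nothing about how any particular DBMS implements GROUPING SETS.

```python
from collections import defaultdict

def aggregate_groupings(rows, groupings, measure):
    """Sum `measure` for each grouping in `groupings` in one pass over `rows`.

    rows      -- iterable of dicts, one per row
    groupings -- list of tuples of column names; () means the grand total
    Returns a dict mapping each grouping to its own {key-tuple: total} table.
    """
    results = {g: defaultdict(int) for g in groupings}
    for row in rows:                      # the single pass over the data
        for g in groupings:
            key = tuple(row[col] for col in g)
            results[g][key] += row[measure]
    return {g: dict(table) for g, table in results.items()}

# Hypothetical shipment rows (supplier, part, quantity).
shipments = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S2", "P#": "P1", "QTY": 300},
]
totals = aggregate_groupings(shipments, [("S#",), ("P#",), ()], "QTY")
```

The sketch deliberately returns one table per grouping rather than bundling all of the results into a single table.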

Bundling several queries into one statement might be a good idea, but bundling the results into one table isn't (basically because the result isn't a relation). What's the predicate? (Always a good question to ask!)

Explain crosstabs. Note that crosstabs aren't a very good way to display a result involving more than two dimensions──and the more dimensions there are, the worse it gets (see Exercise 22.9).

Describe multi-dimensional databases (relate to crosstabs). ROLAP vs. MOLAP. Sparse arrays (point out that these are an artifact of the representation, not a "feature"!).

Please criticize the position that "relations are two-dimensional." There's massive confusion out there in the marketplace on this extremely simple point. A couple of genuine (bad) quotes in this regard:

• "When you're well trained in relational modeling, you begin to believe the world is two-dimensional. You think you can get anything into the rows and columns of a table" [Douglas Barry, Executive Director, ODMG]

• "There is simply no way to mask the complexities involved in assembling two-dimensional data into a multi-dimensional form" [Richard Finkelstein]

22.7 Data Mining

Data mining is a huge subject in its own right (there are whole books devoted to the topic). The purpose of this section is only to scratch the surface of the subject, nothing more. Probably sufficient just to go through the simple SALES example. Explain the terms population, support level, confidence level.
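
When explaining these terms, a small computation can help. The Python sketch below uses the usual association-rule definitions, which are assumed here to match the book's: the population is the set of sales transactions, the support level of a rule "antecedent implies consequent" is the fraction of the population containing both items, and the confidence level is the fraction of the transactions containing the antecedent that also contain the consequent. The function name and the four sample transactions are hypothetical; they are not the book's SALES data.

```python
def support_and_confidence(population, antecedent, consequent):
    """Return (support, confidence) for the rule: antecedent => consequent.

    population -- non-empty list of sets, each the items in one transaction
    support    -- fraction of the population containing both items
    confidence -- fraction of the transactions containing the antecedent
                  that also contain the consequent
    """
    with_antecedent = [tx for tx in population if antecedent in tx]
    with_both = [tx for tx in with_antecedent if consequent in tx]
    support = len(with_both) / len(population)
    # If no transaction contains the antecedent, report a confidence of 0.0.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical population of four sales transactions.
sales = [
    {"Shoes", "Socks", "Tie"},
    {"Shoes", "Socks"},
    {"Shoes", "Tie"},
    {"Hat"},
]
support, confidence = support_and_confidence(sales, "Shoes", "Socks")
```

With this data, two of the four transactions contain both items (support 50%), and two of the three transactions containing Shoes also contain Socks (confidence about 67%).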

The purpose of the final paragraph in this section is simply to make the student aware of the names of certain techniques and (perhaps) to give the faintest of ideas of what each of those techniques can do. It's deliberately not meant to be fully understandable.

22.8 SQL Facilities

GROUPING SETS, ROLLUP, and CUBE were included in the SQL:1999 standard as originally published; other facilities were added the following year in the "OLAP amendment" [22.21]. But this stuff isn't database, it's statistics──and the details don't belong in a database book, in my opinion. (They might belong in an SQL book, of course.) Thus, the intent of this section is merely to give a sense of the scope of that "OLAP amendment," nothing more.

References and Bibliography

Note the introductory remark:

(Begin quote)

The "views" mentioned in the titles of references [22.3-22.5], [22.10], [22.12], [22.16], [22.25], [22.28], [22.30], and [22.35] are not views but snapshots. Annotation to those references talks in terms of snapshots, not views.

(End quote)

Answers to Exercises

22.1 To quote from Section 22.5: "Operational systems usually have strict performance requirements, predictable workloads, small units of work, and high utilization. By contrast, decision support systems typically have varying performance requirements, unpredictable workloads, large units of work, and erratic utilization. These differences can make it very difficult to combine operational and decision support processing within a single system──conflicts arise over capacity planning, resource management, and system performance tuning, among other things. For such reasons, operational system administrators are usually reluctant to allow decision support activities on their systems; hence the familiar dual-system approach."

22.2 To quote from Section 22.4: "The data must be extracted (from various sources), cleansed, transformed and consolidated, loaded into the decision support database, and then periodically refreshed."

22.3 Controlled redundancy is redundancy that's known to and managed by the DBMS (involving, in particular, automatic update propagation). Such redundancies might or might not be visible to the user. Uncontrolled redundancy is (of course) redundancy that isn't controlled in the foregoing sense and must therefore be managed by the user.

Indexes and the transaction log are both examples of controlled redundancy; so too is replication in the sense of Chapter 21. Maintaining separate detail and summary information "by hand" is an example of uncontrolled redundancy.

Redundancy is important for decision support because it can make query formulation simpler and query execution faster. Such redundancy is obviously better if it's controlled, however, because (as with declarative support for queries and the like) "controlled" means the system does the work, while "uncontrolled" means the user does the work.

22.7 In ROLAP, the user sees the data in relational form and issues relational-style queries. In MOLAP, the user sees the data as a multi-dimensional array and issues array-style queries (more or less).

22.8 There are eight (= 2^3) possible groupings for each hierarchy, so the total number of possibilities is 8^4 = 4,096. As a subsidiary exercise, you might like to consider what's involved in using SQL to obtain all of these summarizations. No further answer provided (the question is rhetorical, somewhat).
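
The arithmetic can be checked by brute force. The Python sketch below assumes four hierarchies of three levels each (an inference from the answer's figures of eight groupings per hierarchy and 4,096 in total) and treats a grouping as any subset of a hierarchy's levels, the empty subset included. The level names are placeholders, not taken from the exercise.

```python
from itertools import combinations, product

def subsets(levels):
    """Return every subset of `levels`, including the empty one."""
    return [c for r in range(len(levels) + 1) for c in combinations(levels, r)]

# Hypothetical: four hierarchies, three levels each (names are placeholders).
hierarchies = [[f"H{h}L{lvl}" for lvl in (1, 2, 3)] for h in (1, 2, 3, 4)]

# Eight possible groupings per hierarchy ...
per_hierarchy = [subsets(levels) for levels in hierarchies]
# ... and one choice per hierarchy gives every possible summarization.
all_summarizations = list(product(*per_hierarchy))
```

Here `len(per_hierarchy[0])` is 8 and `len(all_summarizations)` is 4,096, in agreement with the counts in the answer.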

22.9 With respect to the SQL queries, we show the GROUP BY clauses only:

a. GROUP BY GROUPING SETS ( (S#,P#), (P#,J#), (J#,S#) )

b. GROUP BY GROUPING SETS ( J#, (J#,P#), () )

c. The trap is that the query is ambiguous──the term (e.g.) "rolled up along the supplier dimension" has many possible meanings. However, one possible interpretation of the requirement will lead to a GROUP BY clause looking like this:

   GROUP BY ROLLUP (S#), ROLLUP (P#)

d. GROUP BY CUBE ( S#, P# )
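
To see what the ROLLUP and CUBE clauses in parts c and d amount to, it can help to expand them into explicit grouping sets. The following Python sketch does so, assuming the usual rules: ROLLUP yields every prefix of its column list, CUBE yields every subset, and several grouping elements in one GROUP BY clause combine as a cross product. The helper names rollup, cube, and combine are invented for this illustration.

```python
from itertools import combinations, product

def rollup(cols):
    """Grouping sets for ROLLUP(cols): every prefix of cols, down to ()."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

def cube(cols):
    """Grouping sets for CUBE(cols): every subset of cols."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]

def combine(*elements):
    """Cross product of several grouping elements in one GROUP BY clause."""
    return [sum(parts, ()) for parts in product(*elements)]

# Part c: GROUP BY ROLLUP (S#), ROLLUP (P#)
answer_c = combine(rollup(["S#"]), rollup(["P#"]))
# Part d: GROUP BY CUBE ( S#, P# )
answer_d = cube(["S#", "P#"])
```

Under these assumptions, both clauses expand to the same four grouping sets: (S#,P#), (S#), (P#), and ().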

We omit the SQL result tables. As for the crosstabs, it should be clear that crosstabs aren't a very good way to display a result that involves more than two dimensions (and the more dimensions there are, the worse it gets). For example, one such crosstab──corresponding to GROUP BY S#, P#, J#──might look like this (in part):

     ┌───────────────────────┬───────────────────────┬─────
     │          P1           │          P2           │
     ├─────┬─────┬─────┬─────┼─────┬─────┬─────┬─────┼─────
     │ J1  │ J2  │ J3  │     │ J1  │ J2  │ J3  │     │
┌────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────
│ S1 │ 200 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S2 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S3 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S4 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S5 │   0 │ 200 │   0 │     │   0 │   0 │   0 │     │
│    │     │     │     │     │     │     │     │     │

In a nutshell: The headings are clumsy, and the arrays are sparse.

22.11 Perhaps. Debate!
