individual client, it looks like a regular DBMS.* However, the data is stored, mostly, not at the middleware site, but rather at any number of other sites behind the scenes, under the control of
a variety of other DBMSs (or even file managers). In other words, the middleware product uses the combination of those other DBMSs
and/or file managers as its own storage manager (and coordinates
their operation, of course).
──────────
* In the case of DataJoiner, at least, it is a DBMS (among other
things). Why would you buy DB2 when you can buy DataJoiner
instead? (The question is hypothetical, or rhetorical, but the point is that not all technical questions have technical answers! The answer to this particular question probably has more to do
with IBM's marketing and pricing strategies than it does with
technical issues.)
──────────
21.7 SQL Facilities
Explain client/server capabilities──CONNECT, DISCONNECT, SET
CONNECTION (not in too much detail). By the way, note the syntax:
CONNECT TO but not DISCONNECT FROM (this point isn't mentioned in
the book). You could elaborate on SQL/PSM's stored procedure
support if you like, but it's complicated (see reference [4.20]).
Answers to Exercises
21.1 Location independence means users can behave (at least from a
logical standpoint) as if the data were all stored at their own
local site. Fragmentation independence means users can behave (at
least from a logical standpoint) as if the data weren't
fragmented. Replication independence means users can behave (at
least from a logical standpoint) as if the data weren't
replicated.
21.2 Here are some of the reasons:
• Ease of data fragmentation
• Ease of data reconstruction
• Set-level operations
• Optimizability
21.3 See Section 21.2.
21.4 See Section 21.4.
21.5 See Section 21.4.
21.6 No answer provided.
21.7 No answer provided.
21.8 No answer provided.
Chapter 22
Principal Sections
• Aspects of decision support
• DB design for decision support
• Data preparation
• Data warehouses and data marts
• OLAP
• Data mining
• SQL facilities
General Remarks
David McGoveran was the original author of this chapter.
The term decision support covers a multitude of sins! (After
all, classical query processing could certainly be regarded as decision support, of a kind; so too could traditional transaction processing, perhaps with a bit of a stretch.) This chapter begins
by giving some historical perspective, then concentrates on the currently fashionable notions of (a) "data warehouses," "data
marts," and so forth, and (b) "online analytical processing"
(OLAP). It also includes a brief look at the application of statistical techniques to discover patterns in very large volumes
of data──data mining (a comparatively new field, made possible by
the combined availability of cheap computer storage and fast
computer processing). It concludes with a sketch of the pertinent features of SQL.
The chapter is, primarily, a high-level overview of what by now is a large subject in its own right. An important quote from Section 22.1: "We remark immediately that one thing [these areas] all have in common is that good logical design principles are
rarely followed in any of them! The practice of decision support
is, regrettably, not as scientific as it might be; often, in fact,
it's quite ad hoc. In particular, it tends to be driven by
physical considerations much more than by logical ones──indeed, it
tends to blur the logical vs. physical distinction considerably." Caveat lector.
We use SQL, not Tutorial D, as the basis for examples; we use
the "fuzzy" terminology of rows, columns, and tables in place of tuples, attributes, and relation values and variables (relvars);
we use logical schema and physical schema in place of conceptual schema and internal schema.
The chapter can be skipped or skimmed if desired.
22.2 Aspects of Decision Support
Key point: The database is primarily read-only (except for
periodic load or refresh operations). Also:
• Columns tend to be used in combination
• Integrity in general is not a concern; the data is assumed to
be correct when first loaded and isn't subsequently updated. (These facts don't mean we don't have to declare integrity constraints, though!──see the next section.)
• Keys often include a temporal component
• The database tends to be large
• The database tends to be heavily indexed
• The database often involves various kinds of controlled
redundancy (including "summary tables" as well as straight data replication)
Decision support queries tend to be quite complex. Here are
some of the kinds of complexities that can arise:
• Boolean expression complexity
• Join complexity
• Function complexity
• Analytical complexity
All of the foregoing factors lead to a strong emphasis on
designing for performance. Of course, this fact should affect only the physical design of the database, not the logical design, but (as previously noted) vendors and users both typically fail to
distinguish properly between the two. (Use this point as a segue into the next
section.)
22.3 DB Design for Decision Support
Self-explanatory. Observe in particular:
• The treatment of composite columns
• The fact that integrity constraints need to be considered and stated, even though the database is read-only
• The issues concerning "temporal keys" (forward pointer to Chapter 23)
Note especially the remarks concerning physical design and the subsection on common design errors (especially with respect to
"star schemas"──forward pointer to Section 22.5).
22.4 Data Preparation
Also self-explanatory. Note the discussion of extract in
particular (if the section is covered at all──but it could easily
be skipped).
22.5 Data Warehouses and Data Marts
Note first that these terms aren't very precisely defined!
Loosely, however, a data mart is (a copy of) some "hot subset" of the data warehouse. Discuss the desirability of separating
decision support and operational processing. There are arguments (in fact, they seem to be warming up a little these days) in favor
of integrating them, too.
Describe dimensional schemas, star schemas, and fact and
dimension tables. Explain "star join." What's the difference between a star schema and a normal schema? This question is hard
to answer with simple examples, because a simple star schema can
look very similar (even identical) to a good relational design.
In fact, however, there are several problems with the star schema approach in general:
• It's ad hoc (based on intuition, not principle)
• Star schemas tend to be physical, not logical
• Sometimes information is lost
• The fact table often contains several different types of
facts
• The dimension tables can become nonuniform, too
• The dimension tables are often less than fully normalized
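When explaining "star join," a tiny runnable example may help. The following is only a sketch using Python's built-in sqlite3 module; the SALES fact table, the STORE and PRODUCT dimension tables, and all of the rows are invented for illustration and do not come from the book. Note that in a case this small the star schema is indistinguishable from an ordinary relational design, which is exactly why the differences are hard to show with simple examples.

```python
import sqlite3

# Hypothetical star schema: one fact table (SALES) and two dimension
# tables (STORE, PRODUCT). All names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STORE   (STORE_ID TEXT PRIMARY KEY, REGION   TEXT NOT NULL);
    CREATE TABLE PRODUCT (PROD_ID  TEXT PRIMARY KEY, CATEGORY TEXT NOT NULL);
    CREATE TABLE SALES (
        STORE_ID TEXT    NOT NULL REFERENCES STORE (STORE_ID),
        PROD_ID  TEXT    NOT NULL REFERENCES PRODUCT (PROD_ID),
        QTY      INTEGER NOT NULL,
        PRIMARY KEY (STORE_ID, PROD_ID)
    );
    INSERT INTO STORE   VALUES ('ST1', 'East'), ('ST2', 'West');
    INSERT INTO PRODUCT VALUES ('PR1', 'Tools'), ('PR2', 'Toys');
    INSERT INTO SALES   VALUES ('ST1', 'PR1', 10), ('ST1', 'PR2', 5),
                               ('ST2', 'PR1', 7);
""")

# The "star join": the fact table joined to each of its dimension tables,
# followed here by an aggregation over columns taken from the dimensions.
star_join = conn.execute("""
    SELECT STORE.REGION, PRODUCT.CATEGORY, SUM(SALES.QTY) AS TOTAL_QTY
    FROM   SALES
    JOIN   STORE   ON STORE.STORE_ID  = SALES.STORE_ID
    JOIN   PRODUCT ON PRODUCT.PROD_ID = SALES.PROD_ID
    GROUP  BY STORE.REGION, PRODUCT.CATEGORY
    ORDER  BY STORE.REGION, PRODUCT.CATEGORY
""").fetchall()

for region, category, total_qty in star_join:
    print(region, category, total_qty)
```

The query says nothing about how the join is executed; that is a physical matter, which is part of the point the section makes about star schemas.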
Note: One reviewer of the previous edition said: "[This
section] is critical of the star schema [approach] but proposes no alternative." Actually, the section isn't so much critical of
star schemas as such (how could it be, without a precise
definition of the concept?); rather, it's critical of the fact
that, very often, what people call a "star schema" is simply a bad logical design And, of course, the section does implicitly
propose an alternative: namely, good logical design (i.e., design
done in accordance with well-established relational design
principles, as described in Chapters 12 and 13).
22.6 OLAP
Analytical processing always implies data aggregation, usually
according to many different groupings. In classical relational languages (and in SQL too, prior to SQL:1999), each individual
query involves at most one grouping (perhaps implicit) and
produces just one table as its result; hence, n distinct groupings require n distinct queries, producing n distinct results. It thus
seems worth trying to find a way:
a. Of requesting several levels of aggregation in a single
query, and thereby
b. Offering the implementation the opportunity to compute all of those aggregations more efficiently (i.e., in a single pass).
Such considerations are the motivation behind the GROUPING SETS, ROLLUP, and CUBE options on the GROUP BY clause found in certain
SQL implementations and (since SQL:1999) in the SQL standard
as well.
Bundling several queries into one statement might be a good idea, but bundling the results into one table isn't (basically
because the result isn't a relation). What's the predicate?
(Always a good question to ask!)
Explain crosstabs. Note that crosstabs aren't a very good way
to display a result involving more than two dimensions──and the more dimensions there are, the worse it gets (see Exercise 22.9).
Describe multi-dimensional databases (relate to crosstabs). ROLAP vs. MOLAP. Sparse arrays (point out that these are an
artifact of the representation, not a "feature"!).
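To make the point about sparse arrays concrete, the following Python sketch converts a small relation into array (crosstab) form and counts the empty cells. The shipment data and all names in the code are hypothetical. The relation holds only the combinations that actually exist; the empty cells appear only once the data is laid out as an array.

```python
# Hypothetical shipments as a relation: one entry per (supplier, part)
# combination that actually has a shipment, with its quantity.
SHIPMENT_QTY = {
    ("S1", "P1"): 300,
    ("S1", "P2"): 200,
    ("S2", "P1"): 300,
    ("S3", "P2"): 200,
    ("S4", "P3"): 100,
}

suppliers = sorted({s for s, _ in SHIPMENT_QTY})
parts = sorted({p for _, p in SHIPMENT_QTY})

# Array (crosstab) form: one cell for every supplier/part combination,
# whether or not a shipment exists for it.
crosstab = [[SHIPMENT_QTY.get((s, p), 0) for p in parts] for s in suppliers]

total_cells = len(suppliers) * len(parts)
empty_cells = sum(cell == 0 for row in crosstab for cell in row)


def format_crosstab():
    """Render the crosstab as plain text, suppliers down and parts across."""
    header = "     " + " ".join(f"{p:>4}" for p in parts)
    body = [
        f"{s:<4} " + " ".join(f"{cell:>4}" for cell in row)
        for s, row in zip(suppliers, crosstab)
    ]
    return "\n".join([header, *body])


print(format_crosstab())
print(f"{empty_cells} of {total_cells} cells are empty; "
      f"the relation has only {len(SHIPMENT_QTY)} rows")
```

Adding a third dimension (projects, say) would multiply the number of cells again while the number of rows in the relation need not grow at all.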
Please criticize the position that "relations are
two-dimensional." There's massive confusion out there in the
marketplace on this extremely simple point. A couple of genuine (bad) quotes in this regard:
• "When you're well trained in relational modeling, you begin
to believe the world is two-dimensional. You think you can get anything into the rows and columns of a table" [Douglas Barry, Executive Director, ODMG]
• "There is simply no way to mask the complexities involved in assembling two-dimensional data into a multi-dimensional form" [Richard Finkelstein]
22.7 Data Mining
Data mining is a huge subject in its own right (there are whole books devoted to the topic). The purpose of this section is only
to scratch the surface of the subject, nothing more. Probably sufficient just to go through the simple SALES example. Explain
the terms population, support level, confidence level.
The purpose of the final paragraph in this section is simply
to make the student aware of the names of certain techniques and
(perhaps) to give the faintest of ideas of what each of those
techniques can do. It's deliberately not meant to be fully
understandable.
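A small computation can make the three terms concrete. The following Python sketch assumes the usual definitions: the population is the set of all sales transactions, the support level of a rule "if A then B" is the fraction of the population containing both A and B, and the confidence level is the fraction of the transactions containing A that also contain B. The transactions shown are illustrative data, not necessarily those of the book's SALES example, and the function names are invented.

```python
from fractions import Fraction

# Hypothetical population: each sales transaction is the set of products sold.
POPULATION = [
    {"Shoes", "Socks", "Tie"},
    {"Shoes", "Socks", "Tie", "Belt", "Shirt"},
    {"Shoes", "Tie"},
    {"Shoes", "Socks", "Belt"},
]


def support(population, antecedent, consequent):
    """Fraction of the whole population containing both products."""
    both = sum(antecedent in tx and consequent in tx for tx in population)
    return Fraction(both, len(population))


def confidence(population, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = [tx for tx in population if antecedent in tx]
    if not with_antecedent:
        return None  # the rule is never exercised, so confidence is undefined
    both = sum(consequent in tx for tx in with_antecedent)
    return Fraction(both, len(with_antecedent))


# Candidate rule: "if a transaction includes Socks, it also includes Tie".
rule_support = support(POPULATION, "Socks", "Tie")
rule_confidence = confidence(POPULATION, "Socks", "Tie")
print(f"support = {rule_support}, confidence = {rule_confidence}")
```

With this data, two of the four transactions contain both products (support 1/2), and two of the three transactions containing Socks also contain Tie (confidence 2/3).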
22.8 SQL Facilities
GROUPING SETS, ROLLUP, and CUBE were included in the SQL:1999
standard as originally published; other facilities were added the following year in the "OLAP amendment" [22.21]. But this stuff
isn't database, it's statistics──and the details don't belong in a database book, in my opinion. (They might belong in an SQL book,
of course.) Thus, the intent of this section is merely to give a sense of the scope of that "OLAP amendment," nothing more.
References and Bibliography
Note the introductory remark:
(Begin quote)
The "views" mentioned in the titles of references [22.3-22.5], [22.10], [22.12], [22.16], [22.25], [22.28], [22.30], and [22.35] are not views but snapshots. Annotation to those references talks
in terms of snapshots, not views.
(End quote)
Answers to Exercises
22.1 To quote from Section 22.5: "Operational systems usually have strict performance requirements, predictable workloads, small units of work, and high utilization. By contrast, decision
support systems typically have varying performance requirements, unpredictable workloads, large units of work, and erratic
utilization. These differences can make it very difficult to
combine operational and decision support processing within a
single system──conflicts arise over capacity planning, resource management, and system performance tuning, among other things. For such reasons, operational system administrators are usually reluctant to allow decision support activities on their systems; hence the familiar dual-system approach."
22.2 To quote from Section 22.4: "The data must be extracted
(from various sources), cleansed, transformed and consolidated, loaded into the decision support database, and then periodically
refreshed."
22.3 Controlled redundancy is redundancy that's known to and
managed by the DBMS (involving, in particular, automatic update propagation). Such redundancies might or might not be visible to
the user. Uncontrolled redundancy is (of course) redundancy that
isn't controlled in the foregoing sense and must therefore be
managed by the user.
Indexes and the transaction log are both examples of
controlled redundancy; so too is replication in the sense of
Chapter 21. Maintaining separate detail and summary information
"by hand" is an example of uncontrolled redundancy.
Redundancy is important for decision support because it can make query formulation simpler and query execution faster. Such
redundancy is obviously better if it's controlled, however,
because (as with declarative support for queries and the like)
"controlled" means the system does the work, while "uncontrolled" means the user does the work.
22.4 No answer provided.
22.5 No answer provided.
22.6 No answer provided.
22.7 In ROLAP, the user sees the data in relational form and
issues relational-style queries. In MOLAP, the user sees the data
as a multi-dimensional array and issues array-style queries (more
or less).
22.8 There are eight (= 2^3) possible groupings for each hierarchy,
so the total number of possibilities is 8^4 = 4,096. As a
subsidiary exercise, you might like to consider what's involved in
using SQL to obtain all of these summarizations. No further
answer provided (the question is rhetorical, somewhat).
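The count can be checked by brute-force enumeration. The following Python sketch assumes four hierarchies of three levels each, which is what eight groupings per hierarchy and 4,096 possibilities overall imply; the hierarchy and level names are invented and are not taken from the exercise.

```python
from itertools import combinations, product

# Assumption: four hierarchies with three levels each. The names below are
# hypothetical placeholders; only the counts matter.
HIERARCHIES = {
    "time":     ("year", "quarter", "month"),
    "location": ("country", "region", "city"),
    "product":  ("department", "category", "item"),
    "customer": ("segment", "account", "contact"),
}


def level_subsets(levels):
    """Every subset of a hierarchy's levels, including the empty subset."""
    return [
        subset
        for size in range(len(levels) + 1)
        for subset in combinations(levels, size)
    ]


per_hierarchy = {name: level_subsets(lv) for name, lv in HIERARCHIES.items()}

# One overall grouping = one choice of level subset from each hierarchy.
all_groupings = list(product(*per_hierarchy.values()))

print({name: len(subsets) for name, subsets in per_hierarchy.items()})
print(len(all_groupings))
```

Each hierarchy contributes eight subsets of its levels, and choosing one subset from each of the four hierarchies gives 8 x 8 x 8 x 8 = 4,096 combinations.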
22.9 With respect to the SQL queries, we show the GROUP BY clauses only:
a. GROUP BY GROUPING SETS ( (S#,P#), (P#,J#), (J#,S#) )
b. GROUP BY GROUPING SETS ( J#, (J#,P#), () )
c. The trap is that the query is ambiguous──the term (e.g.)
"rolled up along the supplier dimension" has many possible
meanings. However, one possible interpretation of the
requirement will lead to a GROUP BY clause looking like this:
GROUP BY ROLLUP (S#), ROLLUP (P#)
d. GROUP BY CUBE ( S#, P# )
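It can help to expand the ROLLUP and CUBE shorthands into the grouping sets they stand for. The following Python sketch does so under the usual reading of the standard: ROLLUP yields every prefix of its column list, CUBE yields every subset, and several specifications in one GROUP BY clause are combined by concatenating one grouping set from each. The function names rollup, cube, and combine are invented for illustration.

```python
from itertools import combinations, product


def rollup(*cols):
    """ROLLUP (c1, ..., cn): every prefix of the column list, longest first."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube(*cols):
    """CUBE (c1, ..., cn): every subset of the column list."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def combine(*specs):
    """Several specifications in one GROUP BY: concatenate one grouping set
    from each specification, in every possible combination."""
    return [sum(choice, ()) for choice in product(*specs)]


answer_c = combine(rollup("S#"), rollup("P#"))  # ROLLUP (S#), ROLLUP (P#)
answer_d = cube("S#", "P#")                     # CUBE ( S#, P# )

print("c:", answer_c)
print("d:", answer_d)
```

Under that reading, the clause shown for part c expands to the same four grouping sets as the CUBE in part d: (S#,P#), (S#), (P#), and the empty grouping. By contrast, ROLLUP (S#, P#) would expand to only three: (S#,P#), (S#), and the empty grouping.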
We omit the SQL result tables. As for the crosstabs, it
should be clear that crosstabs aren't a very good way to display a result that involves more than two dimensions (and the more
dimensions there are, the worse it gets). For example, one such crosstab──corresponding to GROUP BY S#, P#, J#──might look like this (in part):
     ┌───────────────────────┬───────────────────────┬─────
     │          P1           │          P2           │
     ├─────┬─────┬─────┬─────┼─────┬─────┬─────┬─────┼─────
     │ J1  │ J2  │ J3  │     │ J1  │ J2  │ J3  │     │
┌────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────
│ S1 │ 200 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S2 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S3 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S4 │   0 │   0 │   0 │     │   0 │   0 │   0 │     │
│ S5 │   0 │ 200 │   0 │     │   0 │   0 │   0 │     │
│    │     │     │     │     │     │     │     │     │
In a nutshell: The headings are clumsy, and the arrays are sparse.
22.10 No answer provided.
22.11 Perhaps. Debate!
22.12 No answer provided.