CHAPTER 20  INFORMATION INTEGRATION
Figure 20.17 suggests the process of adding a border to the cube in each dimension, to represent the * value and the aggregated values that it implies. In this figure we see three dimensions, with the lightest shading representing aggregates in one dimension, darker shading for aggregates over two dimensions, and the darkest cube in the corner for aggregation over all three dimensions. Notice that if the number of values along each dimension is reasonably large, but not so large that most points in the cube are unoccupied, then the "border" represents only a small addition to the volume of the cube (i.e., the number of tuples in the fact table). In that case, the size of the stored data CUBE(F) is not much greater than the size of F itself.
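A quick arithmetic check of this size claim: if the cube is fully dense, with n dimensions of d values each, then F has d^n tuples while CUBE(F) has (d+1)^n, since each dimension gains the single extra value *. The sketch below assumes that fully dense case:

```python
# Ratio of |CUBE(F)| to |F| for a fully dense cube: each of the n
# dimensions gains one extra value (*), so d^n tuples become (d+1)^n.
def cube_ratio(d: int, n: int) -> float:
    return ((d + 1) / d) ** n

# Many values per dimension: the border is a small fraction of the cube.
print(cube_ratio(100, 3))   # ~1.03, about a 3% increase
# Few values per dimension: the border dominates.
print(cube_ratio(2, 10))    # ~57.7, CUBE(F) far larger than F
```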
Figure 20.17: The cube operator augments a data cube with a border of aggregations in all combinations of dimensions
A tuple of the table CUBE(F) that has * in one or more dimensions will have, for each dependent attribute, the sum (or another aggregate function) of the values of that attribute in all the tuples that we can obtain by replacing the *'s by real values. In effect, we build into the data the result of aggregating along any set of dimensions. Notice, however, that the CUBE operator does not support aggregation at intermediate levels of granularity based on values in the dimension tables. For instance, we may either leave data broken down by day (or whatever the finest granularity for time is), or we may aggregate time completely, but we cannot, with the CUBE operator alone, aggregate by weeks, months, or years.
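For intuition, the effect of the CUBE operator on a tiny fact table can be sketched in memory; the few lines below implement the definition directly (with SUM as the aggregate). This is an illustration of the operator's definition, not how a real system would compute the cube:

```python
from itertools import combinations
from collections import defaultdict

def cube(facts, n_dims):
    """CUBE of tuples (d1, ..., dn, measure): for every way of
    replacing a subset of dimensions by '*', sum the measure."""
    totals = defaultdict(int)
    for *dims, measure in facts:
        for k in range(n_dims + 1):
            for starred in combinations(range(n_dims), k):
                key = tuple('*' if i in starred else dims[i]
                            for i in range(n_dims))
                totals[key] += measure
    return totals

# Two dimensions (model, color) and one dependent attribute (price).
facts = [('Gobi', 'red', 45000), ('Gobi', 'blue', 30000),
         ('Sahara', 'red', 20000)]
c = cube(facts, 2)
print(c[('Gobi', '*')])  # 75000: all Gobis, any color
print(c[('*', '*')])     # 95000: the grand total
```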
Example 20.17: Let us reconsider the Aardvark database from Example 20.12 in light of what the CUBE operator can give us. Recall that the fact table from that example is

Sales(serialNo, date, dealer, price)

However, the dimension represented by serialNo is not well suited for the cube, since the serial number is a key for Sales. Thus, summing the price over all dates, or over all dealers, but keeping the serial number fixed has no effect; we would still get the "sum" for the one auto with that serial number. A more useful data cube would replace the serial number by the two attributes, model and color, to which the serial number connects Sales via the dimension table Autos. Notice that if we replace serialNo by model and color, then the cube no longer has a key among its dimensions. Thus, an entry of the cube would have the total sales price for all automobiles of a given model, with a given color, by a given dealer, on a given date.
There is another change that is useful for the data-cube implementation of the Sales fact table. Since the CUBE operator normally sums dependent variables, and we might want to get average prices for sales in some category, we need both the sum of the prices for each category of automobiles (a given model of a given color sold on a given day by a given dealer) and the total number of sales in that category. Thus, the relation Sales to which we apply the CUBE operator is

Sales(model, color, date, dealer, val, cnt)

The attribute val is intended to be the total price of all automobiles for the given model, color, date, and dealer, while cnt is the total number of automobiles in that category. Notice that in this data cube, individual cars are not identified; they only affect the value and count for their category.
Now let us consider the relation CUBE(Sales). A hypothetical tuple that would be in both Sales and CUBE(Sales) is

('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2)

The interpretation is that on May 21, 2001, dealer Friendly Fred sold two red Gobis for a total of $45,000. The tuple

('Gobi', *, '2001-05-21', 'Friendly Fred', 152000, 7)

says that on May 21, 2001, Friendly Fred sold seven Gobis of all colors, for a total price of $152,000. Note that this tuple is in CUBE(Sales) but not in Sales. Relation CUBE(Sales) also contains tuples that represent the aggregation over more than one attribute. For instance,

('Gobi', *, '2001-05-21', *, 2348000, 100)

says that on May 21, 2001, there were 100 Gobis sold by all the dealers, and the total price of those Gobis was $2,348,000. The tuple

('Gobi', *, *, *, 1339800000, 58000)

says that over all time, dealers, and colors, 58,000 Gobis have been sold for a total price of $1,339,800,000. Lastly, the tuple

(*, *, *, *, 3521727000, 198000)

tells us that total sales of all Aardvark models, in all colors, over all time, at all dealers, is 198,000 cars for a total price of $3,521,727,000.
Consider how to answer a query in which we specify conditions on certain attributes of the Sales relation and group by some other attributes, while asking for the sum, count, or average price. In the relation CUBE(Sales), we look for those tuples t with the following properties:

1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.

2. If the query groups by an attribute a, then t has any non-* value in its component for a.

3. If the query neither groups by attribute a nor specifies a value for a, then t has * in its component for a.

Each tuple t has the sum and count for one of the desired groups. If we want the average price, a division is performed on the sum and count components of each tuple t.
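These three rules translate directly into a filter over cube tuples. The sketch below is an in-memory illustration; the attribute order and the sample tuples are invented for the example:

```python
def answers(cube_tuples, dims, specified, group_by):
    """Select cube tuples matching a query.
    specified: {attr: value} equality conditions (rule 1);
    group_by: attributes that must hold any non-* value (rule 2);
    every other dimension must hold '*' (rule 3)."""
    out = []
    for t in cube_tuples:
        ok = True
        for i, a in enumerate(dims):
            if a in specified:
                ok = ok and t[i] == specified[a]   # rule 1
            elif a in group_by:
                ok = ok and t[i] != '*'            # rule 2
            else:
                ok = ok and t[i] == '*'            # rule 3
        if ok:
            out.append(t)
    return out

dims = ['model', 'color', 'date', 'dealer']
cube_tuples = [
    ('Gobi', 'red',  '*', '*', 45000, 2),
    ('Gobi', 'blue', '*', '*', 30000, 1),
    ('Gobi', '*',    '*', '*', 75000, 3),
]
# "Average price of Gobis by color": model fixed, group by color.
rows = answers(cube_tuples, dims, {'model': 'Gobi'}, {'color'})
print([(t[1], t[4] / t[5]) for t in rows])  # [('red', 22500.0), ('blue', 30000.0)]
```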
Example 20.18: The query

SELECT color, AVG(price)
FROM Sales
WHERE model = 'Gobi'
GROUP BY color;

is answered by looking for all tuples of CUBE(Sales) with the form

('Gobi', c, *, *, v, n)

where c is any specific color. In this tuple, v will be the sum of sales of Gobis in that color, while n will be the number of sales of Gobis in that color. The average price, although not an attribute of Sales or CUBE(Sales) directly, is v/n. The answer to the query is the set of (c, v/n) pairs obtained from all ('Gobi', c, *, *, v, n) tuples.
We suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost much in terms of space and saves a lot in time when the common kinds of decision-support queries are asked. However, our analysis is based on the assumption that queries choose either to aggregate completely in a dimension or not to aggregate at all. For some dimensions, there are many degrees of granularity that could be chosen for a grouping on that dimension.
We have already mentioned the case of time, where numerous options such as aggregation by weeks, months, quarters, or years exist, in addition to the all-or-nothing choices of grouping by day or aggregating over all time. For another example based on our running automobile database, we could choose to aggregate dealers completely or not aggregate them at all. However, we could also choose to aggregate by city, by state, or perhaps by other regions, larger or smaller. Thus, there are at least six choices of grouping for time and at least four for dealers.
When the number of choices for grouping along each dimension grows, it becomes increasingly expensive to store the results of aggregating by every possible combination of groupings. Not only are there too many of them, but they are not as easily organized as the structure of Fig. 20.17 suggests for the all-or-nothing case. Thus, commercial data-cube systems may help the user to choose some materialized views of the data cube. A materialized view is the result of some query, which we choose to store in the database, rather than reconstructing (parts of) it as needed in response to queries. For the data cube, the views we would choose to materialize will typically be aggregations of the full data cube.
The coarser the partition implied by the grouping, the less space the materialized view takes. On the other hand, if we want to use a view to answer a certain query, then the view must not partition any dimension more coarsely than the query does. Thus, to maximize the utility of materialized views, we generally want some large views that group dimensions into a fairly fine partition. In addition, the choice of views to materialize is heavily influenced by the kinds of queries that the analysts are likely to ask. An example will suggest the tradeoffs involved.
INSERT INTO SalesV1
SELECT model, color, month, city,
       SUM(val) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers ON dealer = name
GROUP BY model, color, month, city;
Figure 20.18: The materialized view SalesV1
Example 20.19: Let us return to the data cube

Sales(model, color, date, dealer, val, cnt)

that we developed in Example 20.17. One possible materialized view groups dates by month and dealers by city. This view, which we call SalesV1, is constructed by the query in Fig. 20.18. This query is not strict SQL, since we imagine that dates and their grouping units, such as months, are understood by the data-cube system without being told to join Sales with the imaginary relation representing days that we discussed in Example 20.14.
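In an ordinary SQL system with no built-in time units, a view like SalesV1 can still be materialized by deriving the month from the date. The sqlite3 sketch below does this with strftime; the table contents are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE Sales(model, color, date, dealer, val, cnt)")
c.execute("CREATE TABLE Dealers(name, city, state)")
c.executemany("INSERT INTO Sales VALUES (?,?,?,?,?,?)", [
    ('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2),
    ('Gobi', 'blue', '2001-05-30', 'Friendly Fred', 23000, 1),
    ('Gobi', 'red', '2001-06-02', "Smilin' Sally", 21000, 1),
])
c.executemany("INSERT INTO Dealers VALUES (?,?,?)", [
    ('Friendly Fred', 'Gotham', 'NY'),
    ("Smilin' Sally", 'Metropolis', 'NY'),
])
# SalesV1: the month is derived with strftime, standing in for a
# data-cube system's built-in knowledge of time units.
c.execute("""
    CREATE TABLE SalesV1 AS
    SELECT model, color, strftime('%Y-%m', date) AS month, city,
           SUM(val) AS val, SUM(cnt) AS cnt
    FROM Sales JOIN Dealers ON dealer = name
    GROUP BY model, color, month, city""")
for row in c.execute("SELECT * FROM SalesV1 ORDER BY month, color"):
    print(row)
```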
INSERT INTO SalesV2
SELECT model, week, state,
       SUM(val) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers ON dealer = name
GROUP BY model, week, state;
Figure 20.19: Another materialized view, SalesV2
Another possible materialized view aggregates colors completely, aggregates time into weeks, and dealers by states. This view, SalesV2, is defined by the query in Fig. 20.19. Either view SalesV1 or SalesV2 can be used to answer a query that partitions no more finely than either in any dimension. Thus, the query
Q1: SELECT model, SUM(val)
    FROM Sales
    GROUP BY model;

can be answered either by

SELECT model, SUM(val)
FROM SalesV1
GROUP BY model;

or by

SELECT model, SUM(val)
FROM SalesV2
GROUP BY model;
On the other hand, the query

Q2: SELECT model, year, state, SUM(val)
    FROM Sales JOIN Dealers ON dealer = name
    GROUP BY model, year, state;

can only be answered from SalesV1, as

SELECT model, year, state, SUM(val)
FROM SalesV1
GROUP BY model, year, state;

Incidentally, the query immediately above, like the queries that aggregate time units, is not strict SQL. That is, state is not an attribute of SalesV1; only city is. We must assume that the data-cube system knows how to perform the aggregation of cities into states, probably by accessing the dimension table for dealers.
We cannot answer Q2 from SalesV2. Although we could roll up cities into states (i.e., aggregate the cities into their states) to use SalesV1, we cannot roll up weeks into years, since years are not evenly divided into weeks, and data from a week beginning, say, Dec. 29, 2001, contributes to years 2001 and 2002 in a way we cannot tell from the data aggregated by weeks.
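Python's datetime module confirms the Dec. 29, 2001 example: the seven days of that week fall in two different years, so a per-week total cannot be attributed to either year:

```python
from datetime import date, timedelta

# The seven days of the week beginning Saturday, Dec. 29, 2001.
week = [date(2001, 12, 29) + timedelta(days=i) for i in range(7)]
years = sorted({d.year for d in week})
print(years)  # [2001, 2002]: one week's sales span two years
```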
Finally, a query like

Q3: SELECT model, color, date, SUM(val)
    FROM Sales
    GROUP BY model, color, date;

can be answered from neither SalesV1 nor SalesV2. It cannot be answered from SalesV1 because its partition of days by months is too coarse to recover sales by day, and it cannot be answered from SalesV2 because that view does not group by color. We would have to answer this query directly from the full data cube.
20.5.3 The Lattice of Views
To formalize the observations of Example 20.19, it helps to think of a lattice of possible groupings for each dimension of the cube. The points of the lattice are the ways that we can partition the values of a dimension by grouping according to one or more attributes of its dimension table. We say that partition P1 is below partition P2, written P1 ≤ P2, if and only if each group of P1 is contained within some group of P2.
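This definition can be checked mechanically if a partition is represented as a collection of groups (here, frozensets of day numbers). The toy calendar below, a 56-day span cut into two invented "months" of 31 and 25 days and into 7-day weeks, shows days below both weeks and months while weeks are not below months:

```python
def below(p1, p2):
    """True iff partition p1 is below p2 (p1 <= p2): each group
    of p1 is contained within some group of p2."""
    return all(any(g1 <= g2 for g2 in p2) for g1 in p1)

days = [frozenset({d}) for d in range(1, 57)]                 # finest
weeks = [frozenset(range(d, d + 7)) for d in range(1, 57, 7)]
months = [frozenset(range(1, 32)), frozenset(range(32, 57))]
all_time = [frozenset(range(1, 57))]                          # coarsest

print(below(days, weeks), below(days, months))  # True True
print(below(weeks, months))   # False: the week 29..35 crosses months
print(below(weeks, all_time))  # True: everything is below "all"
```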
        All
       /   \
   Years  Weeks
     |      |
 Quarters   |
     |      |
  Months    |
      \    /
       Days

Figure 20.20: A lattice of partitions for time intervals
Example 20.20: For the lattice of time partitions, we might choose the diagram of Fig. 20.20. A path from some node P2 down to P1 means that P1 ≤ P2. These are not the only possible units of time, but they will serve as an example of what units a system might support. Notice that days lie below both weeks and months, but weeks do not lie below months. The reason is that while a group of events that took place in one day surely took place within one week and within one month, it is not true that a group of events taking place in one week necessarily took place in any one month. Similarly, a week's group need not be contained within the group corresponding to one quarter or to one year. At the top is a partition we call "all," meaning that events are grouped into a single group; i.e., we make no distinctions among different times.
  All
   |
 State
   |
  City
   |
 Dealer

Figure 20.21: A lattice of partitions for automobile dealers

Figure 20.21 shows another lattice, this time for the dealer dimension of our automobiles example. This lattice is simpler; it shows that partitioning sales by dealer gives a finer partition than partitioning by the city of the dealer, which is in turn finer than partitioning by the state of the dealer. The top of the lattice is the partition that places all dealers in one group.
Having a lattice for each dimension, we can now define a lattice for all the possible materialized views of a data cube that can be formed by grouping according to some partition in each dimension. If V1 and V2 are two views formed by choosing a partition (grouping) for each dimension, then V1 ≤ V2 means that in each dimension, the partition P1 that we use in V1 is at least as fine as the partition P2 that we use for that dimension in V2; that is, P1 ≤ P2.
Many OLAP queries can also be placed in the lattice of views. In fact, frequently an OLAP query has the same form as the views we have described: the query specifies some partitioning (possibly none or all) for each of the dimensions. Other OLAP queries involve this same sort of grouping, and then "slice" the cube to focus on a subset of the data, as was suggested by the diagram in Fig. 20.15. The general rule is:

• We can answer a query Q using view V if and only if V ≤ Q.
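Given each dimension's finer-to-coarser steps, this rule can be applied componentwise. The granularity names below are illustrative; the two views mirror SalesV1 and SalesV2, and the query mirrors Q2:

```python
# Each dimension's lattice, as steps from a granularity to an
# immediately coarser one.
coarser = {('day', 'week'), ('day', 'month'), ('month', 'quarter'),
           ('quarter', 'year'), ('dealer', 'city'), ('city', 'state')}

def at_most(g1, g2):
    """True iff granularity g1 is at least as fine as g2 (g1 <= g2)."""
    if g1 == g2 or g2 == 'all':
        return True
    return any(a == g1 and at_most(b, g2) for (a, b) in coarser)

def can_answer(view, query):
    """A view V answers query Q iff V <= Q in every dimension."""
    return all(at_most(view[d], query[d]) for d in query)

sales_v1 = {'time': 'month', 'dealer': 'city',  'color': 'color'}
sales_v2 = {'time': 'week',  'dealer': 'state', 'color': 'all'}
q2       = {'time': 'year',  'dealer': 'state', 'color': 'all'}
print(can_answer(sales_v1, q2))  # True: months roll up to years
print(can_answer(sales_v2, q2))  # False: weeks don't roll up to years
```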
Example 20.21: Figure 20.22 takes the views and queries of Example 20.19 and places them in a lattice. Notice that the Sales data cube itself is technically a view, corresponding to the finest possible partition along each dimension. As we observed in the original example, Q1 can be answered from either SalesV1 or SalesV2; of course it could also be answered from the full data cube Sales, but there is no reason to want to do so if one of the other views is materialized. Q2 can be answered from either SalesV1 or Sales, while Q3 can only be answered from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths downward from the queries to their supporting views.

Figure 20.22: The lattice of views and queries from Example 20.19

Placing queries in the lattice of views helps design data-cube databases. Some recently developed design tools for data-cube systems start with a set of queries that they regard as "typical" of the application at hand. They then select a set of views to materialize so that each of these queries is above at least one of the views, preferably identical to it or very close (i.e., the query and the view use the same grouping in most of the dimensions).
20.5.4 Exercises for Section 20.5
Exercise 20.5.1: What is the ratio of the size of CUBE(F) to the size of F if fact table F has the following characteristics?

* a) F has ten dimension attributes, each with ten different values.

b) F has ten dimension attributes, each with two different values.
Exercise 20.5.2: Let us use the cube CUBE(Sales) from Example 20.17, which was built from the relation

Sales(model, color, date, dealer, val, cnt)

Tell what tuples of the cube we would use to answer the following queries:

* a) Find the total sales of blue cars for each dealer.

b) Find the total number of green Gobis sold by dealer "Smilin' Sally."

c) Find the average number of Gobis sold on each day of March, 2002, by each dealer.
*! Exercise 20.5.3: In Exercise 20.4.1 we spoke of PC-order data organized as a cube. If we are to apply the CUBE operator, we might find it convenient to break several dimensions more finely. For example, instead of one processor dimension, we might have one dimension for the type (e.g., AMD Duron or Pentium-IV) and another dimension for the speed. Suggest a set of dimensions and dependent attributes that will allow us to obtain answers to a variety of useful aggregation queries. In particular, what role does the customer play? Also, the price in Exercise 20.4.1 referred to the price of one machine, while several identical machines could be ordered in a single tuple. What should the dependent attribute(s) be?
Exercise 20.5.4: What tuples of the cube from Exercise 20.5.3 would you use to answer the following queries?

a) Find, for each processor speed, the total number of computers ordered in each month of the year 2002.

b) List for each type of hard disk (e.g., SCSI or IDE) and each processor type the number of computers ordered.

c) Find the average price of computers with 1500-megahertz processors for each month from Jan., 2001.
! Exercise 20.5.5: The computers described in the cube of Exercise 20.5.3 do not include monitors. What dimensions would you suggest to represent monitors? You may assume that the price of the monitor is included in the price of the computer.
Exercise 20.5.6: Suppose that a cube has 10 dimensions, and each dimension has 5 options for granularity of aggregation, including "no aggregation" and "aggregate fully." How many different views can we construct by choosing a granularity in each dimension?
Exercise 20.5.7: Show how to add the following time units to the lattice of Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods), decades, and centuries.
Exercise 20.5.8: How would you change the dealer lattice of Fig. 20.21 to include "regions," if:

a) A region is a set of states.

* b) Regions are not commensurate with states, but each city is in only one region.

c) Regions are like area codes: each region is contained within a state, some cities are in two or more regions, and some regions have several cities.
! Exercise 20.5.9: In Exercise 20.5.3 we designed a cube suitable for use with the CUBE operator. However, some of the dimensions could also be given a nontrivial lattice structure. In particular, the processor type could be organized by manufacturer (e.g., Sun, Intel, AMD, Motorola), series (e.g., Sun UltraSparc, Intel Pentium or Celeron, AMD Athlon, or Motorola G-series), and model (e.g., Pentium-IV or G4).
a) Design the lattice of processor types, following the examples described above.

b) Define a view that groups processors by series, hard disks by type, and removable disks by speed, aggregating everything else.

c) Define a view that groups processors by manufacturer, hard disks by speed, and aggregates everything else except memory size.

d) Give examples of queries that can be answered from the view of (b) only, the view of (c) only, both, and neither.
*!! Exercise 20.5.10: If the fact table F to which we apply the CUBE operator is sparse (i.e., there are many fewer tuples in F than the product of the number of possible values along each dimension), then the ratio of the sizes of CUBE(F) and F can be very large. How large can it be?
20.6 Data Mining

A family of database applications called data mining or knowledge discovery in databases has captured considerable interest because of opportunities to learn surprising facts from existing databases. Data-mining queries can be thought of as an extended form of decision-support query, although the distinction is informal (see the box on "Data-Mining Queries and Decision-Support Queries"). Data mining stresses both the query-optimization and data-management components of a traditional database system, as well as suggesting some important extensions to database languages, such as language primitives that support efficient sampling of data. In this section, we shall examine the principal directions data-mining applications have taken. We then focus on the problem called "frequent itemsets," which has received the most attention from the database point of view.
20.6.1 Data-Mining Applications

Broadly, data-mining queries ask for a useful summary of data, often without suggesting the values of parameters that would best yield such a summary. This family of problems thus requires rethinking the way database systems are to be used to provide such insights about the data. Below are some of the applications and problems that are being addressed using very large amounts of data.
(stop words) such as "and" or "the," which tend to be present in all documents and tell us nothing about the content. A document is placed in this space according to the fraction of its word occurrences that are any particular word. For instance, if the document has 1000 word occurrences, two of which are "database," then the document would be placed at the .002 coordinate in the dimension corresponding to "database." By clustering documents in this space, we tend to get groups of documents that talk about the same thing. For instance, documents that talk about databases might have occurrences of words like "data," "query," "lock," and so on, while documents about baseball are unlikely to have occurrences of these words.
The data-mining problem here is to take the data and select the "means" or centers of the clusters. Often the number of clusters is given in advance, although that number may be selectable by the data-mining process as well. Either way, a naive algorithm for choosing the centers so that the average distance from a point to its nearest center is minimized involves many queries, each of which does a complex aggregation.
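The center-selection problem described here is essentially the k-means heuristic. A minimal one-dimensional sketch with invented data shows the repeated aggregation that each improvement round requires:

```python
def kmeans_1d(points, centers, rounds=10):
    """Naive k-means on 1-D points: repeatedly assign each point to its
    nearest center, then move each center to its group's mean. Each
    round scans and aggregates the entire data set."""
    for _ in range(rounds):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0, 10.5]
print(kmeans_1d(points, [0.0, 5.0]))  # two centers, near 1.0 and 9.75
```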
20.6.2 Finding Frequent Sets of Items

Now we shall see a data-mining problem for which algorithms using secondary storage effectively have been developed. The problem is most easily described in terms of its principal application: the analysis of market-basket data. Stores today often hold in a data warehouse a record of what customers have bought together. That is, a customer approaches the checkout with a "market basket" full of the items he or she has selected. The cash register records all of these items as part of a single transaction. Thus, even if we don't know anything about the customer, and we can't tell if the customer returns and buys additional items, we do know certain items that a single customer buys together.

If items appear together in market baskets more often than would be expected, then the store has an opportunity to learn something about how customers are likely to traverse the store. The items can be placed in the store so that customers will tend to take certain paths through the store, and attractive items can be placed along these paths.
Example 20.22: A famous example, which has been claimed by several people, is the discovery that people who buy diapers are unusually likely also to buy beer. Theories have been advanced for why that relationship is true, including the possibility that people who buy diapers, having a baby at home, are less likely to go out to a bar in the evening, and therefore tend to drink beer at home. Stores may use the fact that many customers will walk through the store from where the diapers are to where the beer is, or vice versa. Clever marketers place beer and diapers near each other, with potato chips in the middle. The claim is that sales of all three items then increase.
We can represent market-basket data by a fact table:

Baskets(basket, item)

where the first attribute is a "basket ID," or unique identifier for a market basket, and the second attribute is the ID of some item found in that basket. Note that it is not essential for the relation to come from true market-basket data; it could be any relation from which we want to find associated items. For instance, the "baskets" could be documents and the "items" could be words, in which case we are really looking for words that appear in many documents together.
The simplest form of market-basket analysis searches for sets of items that frequently appear together in market baskets. The support for a set of items is the number of baskets in which all those items appear. The problem of finding frequent sets of items is to find, given a support threshold s, all those sets of items that have support at least s.
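Support, so defined, is a simple count; representing each basket as a set of items makes the definition directly executable:

```python
def support(itemset, baskets):
    """Number of baskets containing every item of itemset."""
    return sum(1 for b in baskets if itemset <= b)

# Invented baskets for illustration.
baskets = [{'milk', 'coke', 'beer'}, {'milk', 'beer'},
           {'coke', 'juice'}, {'milk', 'pepsi', 'beer'}]
print(support({'milk', 'beer'}, baskets))  # 3
print(support({'coke'}, baskets))          # 2
```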
If the number of items in the database is large, then even if we restrict our attention to small sets, say pairs of items only, the time needed to count the support for all pairs of items is enormous. Thus, the straightforward way to solve even the frequent-pairs problem (compute the support for each pair of items i and j, as suggested by the SQL query in Fig. 20.24) will not work. This query involves joining Baskets with itself, grouping the resulting tuples by the two items found in that tuple, and throwing away groups where the number of baskets is below the support threshold s. Note that the condition I.item < J.item in the WHERE clause is there to prevent the same pair from being considered in both orders, or a "pair" consisting of the same item twice from being considered at all.
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;
Figure 20.24: Naive way to find all high-support pairs of items
20.6.3 The A-Priori Algorithm

There is an optimization that greatly reduces the running time of a query like Fig. 20.24 when the support threshold is sufficiently large that few pairs meet it. It is reasonable to set the threshold high, because a list of thousands or millions of pairs would not be very useful anyway; we want the data-mining query to focus our attention on a small number of the best candidates. The a-priori algorithm is based on the following observation:
Association Rules

A more complex type of market-basket mining searches for association rules of the form {i1, i2, ..., in} ⇒ j. Two possible properties that we might want in useful rules of this form are:

1. Confidence: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is above a certain threshold, e.g., 50%; e.g., "at least 50% of the people who buy diapers buy beer."

2. Interest: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is significantly higher or lower than the probability of finding j in a random basket. In statistical terms, j correlates with {i1, i2, ..., in}, either positively or negatively. The discovery in Example 20.22 was really that the rule {diapers} ⇒ beer has high interest.

Note that even if an association rule has high confidence or interest, it will tend not to be useful unless the set of items involved has high support. The reason is that if the support is low, then the number of instances of the rule is not large, which limits the benefit of a strategy that exploits the rule.
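Both measures in the box reduce to ratios of support counts. A small sketch, with invented baskets echoing the diapers-and-beer example:

```python
def support(items, baskets):
    """Number of baskets containing every item of items."""
    return sum(1 for b in baskets if items <= b)

def confidence(lhs, j, baskets):
    """P(j in basket | basket contains all of lhs)."""
    return support(lhs | {j}, baskets) / support(lhs, baskets)

def interest(lhs, j, baskets):
    """Confidence minus the background probability of j."""
    return confidence(lhs, j, baskets) - support({j}, baskets) / len(baskets)

baskets = [{'diapers', 'beer'}, {'diapers', 'beer', 'milk'},
           {'milk'}, {'beer'}, {'bread'}, {'diapers', 'beer'}]
print(confidence({'diapers'}, 'beer', baskets))  # 1.0
print(interest({'diapers'}, 'beer', baskets))    # 1.0 - 4/6 ≈ 0.33
```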
• If a set of items S has support s, then each subset of S must also have support at least s.

In particular, if a pair of items, say {i, j}, appears in, say, 1000 baskets, then we know there are at least 1000 baskets with item i, and we know there are at least 1000 baskets with item j.
The converse of the above rule is that if we are looking for pairs of items with support at least s, we may first eliminate from consideration any item that does not by itself appear in at least s baskets. The a-priori algorithm answers the same query as Fig. 20.24 by:

1. First finding the set of candidate items (those that appear in a sufficient number of baskets by themselves), and then

2. Running the query of Fig. 20.24 on only the candidate items.
The a-priori algorithm is thus summarized by the sequence of two SQL queries in Fig. 20.25. It first computes Candidates, the subset of the Baskets relation whose items have high support by themselves, then joins Candidates with itself as in the naive algorithm of Fig. 20.24.
INSERT INTO Candidates
SELECT *
FROM Baskets
WHERE item IN (
    SELECT item
    FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);

SELECT I.item, J.item, COUNT(I.basket)
FROM Candidates I, Candidates J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(*) >= s;
Figure 20.25: The a-priori algorithm first finds frequent items before finding frequent pairs
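The two statements of Fig. 20.25 run essentially as written in SQLite; below is a runnable sketch with an invented Baskets table and support threshold s = 3:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE Baskets(basket, item)")
cur.executemany("INSERT INTO Baskets VALUES (?, ?)",
                [(1, 'milk'), (1, 'beer'), (2, 'milk'), (2, 'beer'),
                 (2, 'juice'), (3, 'milk'), (3, 'beer'), (4, 'coke'),
                 (4, 'juice'), (5, 'milk')])
s = 3

# Step 1: keep only the tuples whose item is frequent by itself.
cur.execute("CREATE TABLE Candidates(basket, item)")
cur.execute("""
    INSERT INTO Candidates
    SELECT * FROM Baskets
    WHERE item IN (SELECT item FROM Baskets
                   GROUP BY item HAVING COUNT(*) >= ?)""", (s,))

# Step 2: the pair query of Fig. 20.24, but over Candidates only.
pairs = cur.execute("""
    SELECT I.item, J.item, COUNT(*)
    FROM Candidates I, Candidates J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item HAVING COUNT(*) >= ?""", (s,)).fetchall()
print(pairs)  # [('beer', 'milk', 3)]
```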
Example 20.23: To get a feel for how the a-priori algorithm helps, consider a supermarket that sells 10,000 different items. Suppose that the average market basket has 20 items in it. Also assume that the database keeps 1,000,000 baskets as data (a small number compared with what would be stored in practice). Then the Baskets relation has 20,000,000 tuples, and the join in Fig. 20.24 (the naive algorithm) has 190,000,000 pairs. This figure represents one million baskets times (20 choose 2) = 190 pairs of items per basket. These 190,000,000 tuples must all be grouped and counted.

However, suppose that s is 10,000, i.e., 1% of the baskets. It is impossible that more than 20,000,000/10,000 = 2000 items appear in at least 10,000 baskets, because there are only 20,000,000 tuples in Baskets, and any item appearing in 10,000 baskets appears in at least 10,000 of those tuples. Thus, if we use the a-priori algorithm of Fig. 20.25, the subquery that finds the candidate items cannot produce more than 2000 items, and will probably produce many fewer than 2000.
We cannot be sure how large Candidates is, since in the worst case all the items that appear in Baskets will appear in at least 1% of them. However, in practice, Candidates will be considerably smaller than Baskets if the threshold s is high. For the sake of argument, suppose Candidates has on the average 10 items per basket; i.e., it is half the size of Baskets. Then the join of Candidates with itself in step (2) has 1,000,000 times (10 choose 2) = 45,000,000 tuples, less than 1/4 of the number of tuples in the join of Baskets with itself. We would thus expect the a-priori algorithm to run in about 1/4 the time of the naive
algorithm. In common situations, where Candidates has much less than half the tuples of Baskets, the improvement is even greater, since running time shrinks quadratically with the reduction in the number of tuples involved in the join.
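The arithmetic in the example is easy to verify: each join produces (number of baskets) × (pairs per basket) tuples.

```python
from math import comb

baskets = 1_000_000
naive = baskets * comb(20, 2)    # 20 items per average basket
apriori = baskets * comb(10, 2)  # ~10 candidate items per basket
print(naive, apriori)            # 190000000 45000000
print(apriori / naive)           # ~0.237, a bit under 1/4
```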
20.6.4 Exercises for Section 20.6
Exercise 20.6.1: Suppose we are given the eight "market baskets" of Fig. 20.26.
B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}

Figure 20.26: Example market-basket data
* a) As a percentage of the baskets, what is the support of the set {beer, juice}?
b) What is the support of the set {coke, pepsi}?
* c) What is the confidence of milk given beer (i.e., of the association rule {beer} -> milk)?
d) What is the confidence of juice given milk?
e) What is the confidence of coke, given beer and juice?
* f) If the support threshold is 35% (i.e., 3 out of the eight baskets are needed), which pairs of items are frequent?
g) If the support threshold is 50%, which pairs of items are frequent?
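The support and confidence computations asked for above are mechanical, and one way to check answers is with a small program. The following helper (not part of the text; the function names are our own) works directly on the baskets of Fig. 20.26:

```python
baskets = [
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "beer", "juice", "pepsi"},
    {"coke", "beer", "juice"},
    {"beer", "pepsi"},
]

def support(items):
    """Number of baskets containing every item in `items`."""
    return sum(1 for b in baskets if set(items) <= b)

def confidence(antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    return support(set(antecedent) | {consequent}) / support(antecedent)

# For example, {milk, beer} appears in 4 of the 8 baskets, and beer in 6,
# so the rule {beer} -> milk has confidence 4/6.
print(support({"milk", "beer"}))    # 4
print(confidence({"beer"}, "milk"))
```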
! Exercise 20.6.2: The a-priori algorithm also may be used to find frequent sets of more than two items. Recall that a set S of k items cannot have support at least s unless every proper subset of S has support at least s. In particular, the subsets of S that are of size k-1 must all have support at least s. Thus, having found the frequent itemsets (those with support at least s) of size k-1, we can define the candidate sets of size k to be those sets of k items, all of whose subsets of size k-1 have support at least s. Write SQL queries that, given the frequent itemsets of size k-1, first compute the candidate sets of size k, and then compute the frequent sets of size k.
Exercise 20.6.3: Using the baskets of Exercise 20.6.1, answer the following:
a) If the support threshold is 35%, what is the set of candidate triples?
b) If the support threshold is 35%, what sets of triples are frequent?
20.7 Summary of Chapter 20

+ Integration of Information: Frequently, there exist a variety of databases or other information sources that contain related information. We have the opportunity to combine these sources into one. However, heterogeneities in the schemas often exist; these incompatibilities include differing types, codes or conventions for values, interpretations of concepts, and different sets of concepts represented in different schemas.
+ Approaches to Information Integration: Early approaches involved "federation," where each database would query the others in the terms understood by the second. More recent approaches involve warehousing, where data is translated to a global schema and copied to the warehouse. An alternative is mediation, where a virtual warehouse is created to allow queries to a global schema; the queries are then translated to the terms of the data sources.
+ Extractors and Wrappers: Warehousing and mediation require components at each source, called extractors and wrappers, respectively. A major function is to translate queries and results between the global schema and the local schema at the source.
+ Wrapper Generators: One approach to designing wrappers is to use templates, which describe how a query of a specific form is translated from the global schema to the local schema. These templates are tabulated and interpreted by a driver that tries to match queries to templates. The driver may also have the ability to combine templates in various ways, and/or perform additional work such as filtering, to answer more complex queries.
+ Capability-Based Optimization: The sources for a mediator often are able or willing to answer only limited forms of queries. Thus, the mediator must select a query plan based on the capabilities of its sources, before it can even think about optimizing the cost of query plans as conventional DBMS's do.
+ OLAP: An important application of data warehouses is the ability to ask complex queries that touch all or much of the data, at the same time that transaction processing is conducted at the data sources. These queries, which usually involve aggregation of data, are termed on-line analytic processing, or OLAP, queries.
+ ROLAP and MOLAP: It is frequently useful, when building a warehouse for OLAP, to think of the data as residing in a multidimensional space, with dimensions corresponding to independent aspects of the data represented. Systems that support such a view of data take either a relational point of view (ROLAP, or relational OLAP systems), or use the specialized data-cube model (MOLAP, or multidimensional OLAP systems).
+ Star Schemas: In a star schema, each data element (e.g., a sale of an item) is represented in one relation, called the fact table, while information helping to interpret the values along each dimension (e.g., what kind of product is item 1234?) is stored in a dimension table for each dimension.
+ The Cube Operator: A specialized operator called CUBE pre-aggregates the fact table along all subsets of dimensions. It may add little to the space needed by the fact table, and greatly increases the speed with which many OLAP queries can be answered.
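As a rough illustration of what pre-aggregating along all subsets of dimensions means, here is a minimal sketch in Python (not any particular OLAP system's implementation; the fact rows are hypothetical), using "*" for a dimension that has been aggregated out, as the CUBE operator does:

```python
from itertools import combinations

def cube(facts, dims, measure):
    """Sum `measure` over every subset of `dims`; a dimension not in the
    subset is replaced by '*' in the output key."""
    out = {}
    for row in facts:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                key = tuple(row[d] if d in kept else "*" for d in dims)
                out[key] = out.get(key, 0) + row[measure]
    return out

# Hypothetical fact table with two dimensions and one dependent attribute.
facts = [
    {"item": "bar", "color": "red",  "sales": 10},
    {"item": "bar", "color": "blue", "sales": 5},
    {"item": "rod", "color": "red",  "sales": 7},
]
agg = cube(facts, ["item", "color"], "sales")
print(agg[("bar", "*")])  # 15: bar sales with color aggregated out
print(agg[("*", "*")])    # 22: grand total over both dimensions
```

With d dimensions, each fact contributes to 2^d aggregate cells, which is why the border of *-tuples adds relatively little when the cube is densely populated.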
+ Dimension Lattices and Materialized Views: A more powerful approach than the CUBE operator, used by some data-cube implementations, is to establish a lattice of granularities for aggregation along each dimension (e.g., different time units like days, months, and years). The warehouse is then designed by materializing certain views that aggregate in different ways along the different dimensions, and the view with the closest fit is used to answer a given query.
+ Data Mining: Warehouses are also used to ask broad questions that involve not only aggregating on command, as in OLAP queries, but searching for the "right" aggregation. Common types of data mining include clustering data into similar groups, designing decision trees to predict one attribute based on the value of others, and finding sets of items that occur together frequently.
+ The A-Priori Algorithm: An efficient way to find frequent itemsets is to use the a-priori algorithm. This technique exploits the fact that if a set occurs frequently, then so do all of its subsets.
20.8 References for Chapter 20

Recent surveys of warehousing and related technologies are in [9], [3], and [7]. Federated systems are surveyed in [12]. The concept of the mediator comes from [14].

Implementation of mediators and wrappers, especially the wrapper-generator approach, is covered in [5]. Capabilities-based optimization for mediators was explored in [11, 13].

The cube operator was proposed in [6]. The implementation of cubes by materialized views appeared in [8].

[4] is a survey of data-mining techniques, and [13] is an on-line survey of data mining. The a-priori algorithm was developed in [1] and [2].
1. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1993), pp. 207-216.

2. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. Intl. Conf. on Very Large Databases (1994), pp. 487-499.

3. S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology," SIGMOD Record 26:1 (1997), pp. 65-74.

4. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, 1996.

5. H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, V. Vassalos, J. D. Ullman, and J. Widom, "The TSIMMIS approach to mediation: data models and languages," J. Intelligent Information Systems 8:2 (1997), pp. 117-132.

6. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals," Proc. Intl. Conf. on Data Engineering (1996), pp. 152-159.

7. A. Gupta and I. S. Mumick, Materialized Views: Techniques, Implementations, and Applications, MIT Press, Cambridge, MA, 1999.

8. V. Harinarayan, A. Rajaraman, and J. D. Ullman, "Implementing data cubes efficiently," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1996), pp. 205-216.

9. D. Lomet and J. Widom (eds.), Special issue on materialized views and data warehouses, IEEE Data Engineering Bulletin 18:2 (1995).

10. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, "Object exchange across heterogeneous information sources," Proc. Intl. Conf. on Data Engineering (1995), pp. 251-260.

11. Y. Papakonstantinou, A. Gupta, and L. Haas, "Capabilities-based query rewriting in mediator systems," Conference on Parallel and Distributed Information Systems (1996).

12. A. P. Sheth and J. A. Larson, "Federated databases for managing distributed, heterogeneous, and autonomous databases," Computing Surveys 22:3 (1990), pp. 183-236.