Step 2 collects statistics on the working relation. This requires scanning the relation at most once. The cost for computing the minimum desired level and determining the mapping pairs, (v, v′), for each attribute is dependent on the number of distinct values for each attribute and is smaller than N, the number of tuples in the initial relation.
Step 3 derives the prime relation, P. This is performed by inserting generalized tuples into P. There are a total of N tuples in W and p tuples in P. For each tuple, t, in W, we substitute its attribute values based on the derived mapping pairs. This results in a generalized tuple, t′. If variation (a) is adopted, each t′ takes O(log p) to find the location for count increment or tuple insertion. Thus the total time complexity is O(N × log p) for all of the generalized tuples. If variation (b) is adopted, each t′ takes O(1) to find the tuple for count increment. Thus the overall time complexity is O(N) for all of the generalized tuples.
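As a concrete illustration of variation (b), the following Python sketch aggregates generalized tuples in a hash table so that each count increment or insertion is O(1); the function name, attribute names, and the tiny mapping are hypothetical and not part of the original algorithm description.

```python
# Minimal sketch of Step 3 (deriving the prime relation P), assuming variation (b):
# generalized tuples are aggregated in a hash table, so each insertion or count
# increment is O(1) and the whole pass over W is O(N).
from collections import defaultdict

def generalize(working_relation, mapping):
    """working_relation: list of dicts (the tuples of W).
    mapping: {attribute: {low_level_value: generalized_value}} derived in Step 2."""
    prime = defaultdict(int)  # generalized tuple -> count
    for t in working_relation:
        # substitute each attribute value by its generalized value (mapping pair v -> v')
        t_prime = tuple(mapping.get(a, {}).get(t[a], t[a]) for a in sorted(t))
        prime[t_prime] += 1  # O(1) count increment or tuple insertion
    return dict(prime)

# Hypothetical toy data, purely for illustration
W = [{"city": "Vancouver", "item": "TV"}, {"city": "Burnaby", "item": "TV"}]
M = {"city": {"Vancouver": "British Columbia", "Burnaby": "British Columbia"},
     "item": {"TV": "home entertainment"}}
print(generalize(W, M))  # {('British Columbia', 'home entertainment'): 2}
```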
Many data analysis tasks need to examine a good number of dimensions or attributes. This may involve dynamically introducing and testing additional attributes rather than just those specified in the mining query. Moreover, a user with little knowledge of the truly relevant set of data may simply specify "in relevance to ∗" in the mining query, which includes all of the attributes into the analysis. Therefore, an advanced concept description mining process needs to perform attribute relevance analysis on large sets of attributes to select the most relevant ones. Such analysis may employ correlation or entropy measures, as described in Chapter 2 on data preprocessing.
"Attribute-oriented induction generates one or a set of generalized descriptions. How can these descriptions be visualized?" The descriptions can be presented to the user in a number of different ways. Generalized descriptions resulting from attribute-oriented induction are most commonly displayed in the form of a generalized relation (or table).

Example 4.22 Generalized relation (table). Suppose that attribute-oriented induction was performed on a sales relation of the AllElectronics database, resulting in the generalized description of Table 4.14 for sales in 2004. The description is shown in the form of a generalized relation. Table 4.13 of Example 4.21 is another example of a generalized relation.
Descriptions can also be visualized in the form of cross-tabulations, or crosstabs. In a two-dimensional crosstab, each row represents a value from an attribute, and each column represents a value from another attribute. In an n-dimensional crosstab (for n > 2), the columns may represent the values of more than one attribute, with subtotals shown for attribute-value groupings. This representation is similar to spreadsheets. It is easy to map directly from a data cube structure to a crosstab.
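As a rough illustration of this mapping, the sketch below lays out a tiny two-dimensional cube, stored as a dictionary of cells with "∗" denoting the "all" value used for subtotals, as a crosstab; the cube contents and names are invented for illustration only.

```python
# Illustrative sketch: mapping a tiny 2-D data cube (dict of (location, item) -> count,
# with '*' as the "all" value for subtotals) onto a crosstab layout. Numbers are made up.
cube = {
    ("Asia", "TV"): 15, ("Asia", "computer"): 30, ("Asia", "*"): 45,
    ("Europe", "TV"): 12, ("Europe", "computer"): 28, ("Europe", "*"): 40,
    ("*", "TV"): 27, ("*", "computer"): 58, ("*", "*"): 85,
}

locations = ["Asia", "Europe", "*"]
items = ["TV", "computer", "*"]

header = ["location"] + [i if i != "*" else "both_items" for i in items]
print("\t".join(header))
for loc in locations:
    row = [loc if loc != "*" else "all_regions"]
    row += [str(cube[(loc, item)]) for item in items]
    print("\t".join(row))  # each crosstab cell is just a cube cell lookup
```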
Example 4.23 Cross-tabulation. The generalized relation shown in Table 4.14 can be transformed into the 3-D cross-tabulation shown in Table 4.15.
Table 4.14 A generalized relation for the sales in 2004, with columns location, item, sales (in million dollars), and count (in thousands).
Table 4.15 A crosstab for the sales in 2004 (columns grouped by item).
Example 4.24 Bar chart and pie chart. The sales data of the crosstab shown in Table 4.15 can be transformed into the bar chart representation of Figure 4.20 and the pie chart representation of Figure 4.21.
Finally, a 3-D generalized relation or crosstab can be represented by a 3-D data cube, which is useful for browsing the data at different levels of generalization.
Example 4.25 Cube view. Consider the data cube shown in Figure 4.22 for the dimensions item, location, and cost. This is the same kind of data cube that we have seen so far, although it is presented in a slightly different way. Here, the size of a cell (displayed as a tiny cube) represents the count of the corresponding cell, while the brightness of the cell can be used to represent another measure of the cell, such as sum(sales). Pivoting, drilling, and slicing-and-dicing operations can be performed on the data cube browser by mouse clicking.
A generalized relation may also be represented in the form of logic rules. Typically, each generalized tuple represents a rule disjunct. Because data in a large database usually span a diverse range of distributions, a single generalized tuple is unlikely to cover, or represent, 100% of the initial working relation tuples, or cases.
Figure 4.20 Bar chart representation of the sales in 2004 (sales of TVs, computers, and TV + computers, shown for Asia, Europe, and North America).
Figure 4.21 Pie chart representation of the sales in 2004. TV sales: North America (50.91%), Asia (27.27%), Europe (21.82%). Computer sales: North America (42.56%), Europe (31.91%), Asia (25.53%). TV + computer sales: North America (43.43%), Europe (30.86%), Asia (25.71%).
Thus, quantitative information, such as the percentage of data tuples that satisfy the left- and right-hand side of the rule, should be associated with each rule. A logic rule that is associated with quantitative information is called a quantitative rule.
To define a quantitative characteristic rule, we introduce the t-weight as an interestingness measure that describes the typicality of each disjunct in the rule, or of each tuple in the corresponding generalized relation.
Figure 4.22 A 3-D cube view representation of the sales in 2004, with dimensions item (alarm system, CD player, compact disc, computer, cordless phone, mouse, printer, software, speakers, TV), location (North America, Europe, Australia, Asia), and cost (23.00–799.00, 799.00–3,916.00, 3,916.00–25,677.00, not specified).
The measure is defined as follows. Let the class of objects that is to be characterized (or described by the rule) be called the target class. Let q_a be a generalized tuple describing the target class. The t-weight for q_a is the percentage of tuples of the target class from the initial working relation that are covered by q_a. Formally, we have

t_weight = count(q_a) / Σ_{i=1}^{n} count(q_i),    (4.1)

where n is the number of tuples for the target class in the generalized relation; q_1, ..., q_n are the tuples for the target class in the generalized relation; and q_a is in q_1, ..., q_n. Obviously, the range for the t-weight is [0.0, 1.0] or [0%, 100%].
A quantitative characteristic rule can then be represented either (1) in logic form by associating the corresponding t-weight value with each disjunct covering the target class, or (2) in the relational table or crosstab form by changing the count values in these tables for tuples of the target class to the corresponding t-weight values.
Each disjunct of a quantitative characteristic rule represents a condition. In general, the disjunction of these conditions forms a necessary condition of the target class, since the condition is derived based on all of the cases of the target class; that is, all tuples of the target class must satisfy this condition. However, the rule may not be a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class. Therefore, the rule should be expressed in the form

∀X, target_class(X) ⇒ condition_1(X)[t : w_1] ∨ · · · ∨ condition_m(X)[t : w_m].    (4.2)
The rule indicates that if X is in the target class, there is a probability of w_i that X satisfies condition_i, where w_i is the t-weight value for condition or disjunct i, and i is in {1, ..., m}.
Example 4.26 Quantitative characteristic rule. The crosstab shown in Table 4.15 can be transformed into logic rule form. Let the target class be the set of computer items. The corresponding characteristic rule, in logic form, is

∀X, item(X) = "computer" ⇒
(location(X) = "Asia") [t : 25.00%] ∨ (location(X) = "Europe") [t : 30.00%] ∨ (location(X) = "North America") [t : 45.00%].

Notice that the first t-weight value of 25.00% is obtained by 1000, the value corresponding to the count slot for "(Asia, computer)", divided by 4000, the value corresponding to the count slot for "(all_regions, computer)". (That is, 4000 represents the total number of computer items sold.) The t-weights of the other two disjuncts were similarly derived. Quantitative characteristic rules for other target classes can be computed in a similar fashion.
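As a small, non-authoritative illustration, the following sketch computes the t-weights of Example 4.26 from per-region counts. The Asia count (1000) and the 4000 total come from the text; the Europe and North America counts are inferred from the stated 30% and 45% t-weights, and the helper name is ours.

```python
# Illustrative sketch: computing t-weights for a target class from per-region counts
# (counts in thousands), following Equation (4.1).
def t_weights(counts):
    """counts: {generalized tuple (here, a region): count for the target class}.
    Returns {tuple: count / sum of all counts}."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

computer_counts = {"Asia": 1000, "Europe": 1200, "North America": 1800}  # sums to 4000
for region, w in t_weights(computer_counts).items():
    print(f"(location = {region}) [t : {w:.2%}]")
# -> Asia 25.00%, Europe 30.00%, North America 45.00%
```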
"How can the t-weight and interestingness measures in general be used by the data mining system to display only the concept descriptions that it objectively evaluates as interesting?" A threshold can be set for this purpose. For example, if the t-weight of a generalized tuple is lower than the threshold, then the tuple is considered to represent only a negligible portion of the database and can therefore be ignored as uninteresting. Ignoring such negligible tuples does not mean that they should be removed from the intermediate results (i.e., the prime generalized relation, or the data cube, depending on the implementation), because they may contribute to subsequent further exploration of the data by the user via interactive rolling up or drilling down of other dimensions and levels of abstraction. Such a threshold may be referred to as a significance threshold or support threshold, where the latter term is commonly used in association rule mining.
4.3.4 Mining Class Comparisons: Discriminating between Different Classes
In many applications, users may not be interested in having a single class (or concept) described or characterized, but rather would prefer to mine a description that compares or distinguishes one class (or concept) from other comparable classes (or concepts). Class discrimination or comparison (hereafter referred to as class comparison) mines descriptions that distinguish a target class from its contrasting classes. Notice that the target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes. For example, the three classes, person, address, and item, are not comparable. However, the sales in the last three years are comparable classes, and so are computer science students versus physics students.
Our discussions on class characterization in the previous sections handle multilevel data summarization and characterization in a single class. The techniques developed can be extended to handle class comparison across several comparable classes. For example, the attribute generalization process described for class characterization can be modified so that the generalization is performed synchronously among all the classes compared. This allows the attributes in all of the classes to be generalized to the same levels of abstraction. Suppose, for instance, that we are given the AllElectronics data for sales in 2003 and sales in 2004 and would like to compare these two classes. Consider the dimension location with abstractions at the city, province or state, and country levels. Each class of data should be generalized to the same location level. That is, they are synchronously all generalized to either the city level, or the province or state level, or the country level. Ideally, this is more useful than comparing, say, the sales in Vancouver in 2003 with the sales in the United States in 2004 (i.e., where each set of sales data is generalized to a different level). The users, however, should have the option to overwrite such an automated, synchronous comparison with their own choices, when preferred.
“How is class comparison performed?” In general, the procedure is as follows:
1. Data collection: The set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting class(es).

2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis should be performed on these classes to select only the highly relevant dimensions for further analysis. Correlation or entropy-based measures can be used for this step (Chapter 2).

3. Synchronous generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation. The concepts in the contrasting class(es) are generalized to the same level as those in the prime target class relation, forming the prime contrasting class(es) relation.

4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs, and rules. This presentation usually includes a "contrasting" measure such as count% (percentage count) that reflects the comparison between the target and contrasting classes. The user can adjust the comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes, as desired.
The above discussion outlines a general algorithm for mining comparisons in databases. In comparison with characterization, the above algorithm involves synchronous generalization of the target class with the contrasting classes, so that classes are simultaneously compared at the same levels of abstraction.
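The sketch below illustrates the synchronous generalization step under simple assumptions: the location hierarchy, attribute names, and data are all hypothetical, and a single generalization level (chosen once) is applied to both the target and contrasting classes.

```python
# Minimal sketch of synchronous generalization, assuming the location dimension has a
# concept hierarchy represented as a "climb one level" dictionary. Data are invented.
from collections import Counter

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "USA"}

def generalize_location(tuples, level):
    """Generalize the location attribute of every tuple to the requested level
    ('city' keeps the value, 'country' climbs the hierarchy)."""
    out = Counter()
    for loc, gpa in tuples:
        if level == "country":
            loc = CITY_TO_COUNTRY.get(loc, loc)
        out[(loc, gpa)] += 1
    return out

target = [("Vancouver", "good"), ("Montreal", "good"), ("Seattle", "excellent")]
contrasting = [("Vancouver", "fair"), ("Seattle", "good")]

level = "country"  # chosen once (e.g., by a dimension threshold) ...
prime_target = generalize_location(target, level)            # ... and applied to both
prime_contrasting = generalize_location(contrasting, level)  # classes synchronously
print(prime_target, prime_contrasting, sep="\n")
```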
The following example mines a class comparison describing the graduate students and the undergraduate students at Big University.
Example 4.27 Mining a class comparison. Suppose that you would like to compare the general properties between the graduate students and the undergraduate students at Big University, given the attributes name, gender, major, birth place, birth date, residence, phone#, and gpa.

This data mining task can be expressed in DMQL as follows:
use Big University DB
mine comparison as "grad vs undergrad students"
in relevance to name, gender, major, birth place, birth date, residence, phone#, gpa
for "graduate students"
where status in "graduate"
versus "undergraduate students"
where status in "undergraduate"
analyze count%
from student

Let's see how this typical example of a data mining query for mining comparison descriptions can be processed.
First, the query is transformed into two relational queries that collect two sets of task-relevant data: one for the initial target class working relation, and the other for the initial contrasting class working relation, as shown in Tables 4.16 and 4.17. This can also be viewed as the construction of a data cube, where the status {graduate, undergraduate} serves as one dimension, and the other attributes form the remaining dimensions.
Table 4.16 Initial working relations: the target class (graduate students)
name | gender | major | birth place | birth date | residence | phone# | gpa
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Vancouver | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
Table 4.17 Initial working relations: the contrasting class (undergraduate students)
name | gender | major | birth place | birth date | residence | phone# | gpa
Bob Schumann | M | Chemistry | Calgary, Alt, Canada | 10-1-78 | 2642 Halifax St., Burnaby | 294-4291 | 2.96
Amy Eau | F | Biology | Golden, BC, Canada | 30-3-76 | 463 Sunset Cres., Vancouver | 681-5417 | 3.52
Second, dimension relevance analysis can be performed, when necessary, on the two classes of data. After this analysis, irrelevant or weakly relevant dimensions, such as name, gender, birth place, residence, and phone#, are removed from the resulting classes. Only the highly relevant attributes are included in the subsequent analysis.

Third, synchronous generalization is performed: Generalization is performed on the target class to the levels controlled by user- or expert-specified dimension thresholds, forming the prime target class relation. The contrasting class is generalized to the same levels as those in the prime target class relation, forming the prime contrasting class(es) relation, as presented in Tables 4.18 and 4.19. In comparison with undergraduate students, graduate students tend to be older and have a higher GPA, in general.

Finally, the resulting class comparison is presented in the form of tables, graphs, and/or rules. This visualization includes a contrasting measure (such as count%) that compares between the target class and the contrasting class. For example, 5.02% of the graduate students majoring in Science are between 26 and 30 years of age and have a "good" GPA, while only 2.32% of undergraduates have these same characteristics. Drilling and other OLAP operations may be performed on the target and contrasting classes as deemed necessary by the user in order to adjust the abstraction levels of the final description.
Table 4.18 Prime generalized relation for the target class (graduate students), with columns major, age range, gpa, and count%.

Table 4.19 Prime generalized relation for the contrasting class (undergraduate students), with the same columns.

"How can class comparison descriptions be presented?" As with class characterizations, class comparisons can be presented to the user in various forms, including
generalized relations, crosstabs, bar charts, pie charts, curves, cubes, and rules. With the exception of logic rules, these forms are used in the same way for characterization as for comparison. In this section, we discuss the visualization of class comparisons in the form of discriminant rules.
As is similar with characterization descriptions, the discriminative features of the target and contrasting classes of a comparison description can be described quantitatively by a quantitative discriminant rule, which associates a statistical interestingness measure, d-weight, with each generalized tuple in the description.

Let q_a be a generalized tuple, and C_j be the target class, where q_a covers some tuples of the target class. Note that it is possible that q_a also covers some tuples of the contrasting classes, particularly since we are dealing with a comparison description. The d-weight for q_a is the ratio of the number of tuples from the initial target class working relation that are covered by q_a to the total number of tuples in both the initial target class and contrasting class working relations that are covered by q_a. Formally, the d-weight of q_a for the class C_j is defined as

d_weight = count(q_a ∈ C_j) / Σ_{i=1}^{m} count(q_a ∈ C_i),    (4.3)

where m is the total number of the target and contrasting classes, C_j is in {C_1, ..., C_m}, and count(q_a ∈ C_i) is the number of tuples of class C_i that are covered by q_a. The range for the d-weight is [0.0, 1.0] (or [0%, 100%]).
A high d-weight in the target class indicates that the concept represented by the generalized tuple is primarily derived from the target class, whereas a low d-weight implies that the concept is primarily derived from the contrasting classes. A threshold can be set to control the display of interesting tuples based on the d-weight or other measures used, as described in Section 4.3.3.
Example 4.28 Computing the d-weight measure. In Example 4.27, suppose that the count distribution for the generalized tuple, major = "Science" AND age range = "21...25" AND gpa = "good", from Tables 4.18 and 4.19 is as shown in Table 4.20.
Table 4.20 Count distribution between graduate and undergraduate students for a generalized tuple
status | count
graduate | 90
undergraduate | 210
The d-weight for the given generalized tuple is 90/(90 + 210) = 30% with respect to the target class, and 210/(90 + 210) = 70% with respect to the contrasting class. That is, if a student majoring in Science is 21 to 25 years old and has a "good" gpa, then based on the data, there is a 30% probability that she is a graduate student, versus a 70% probability that she is an undergraduate student. Similarly, the d-weights for the other generalized tuples in Tables 4.18 and 4.19 can be derived.
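A minimal sketch of the d-weight computation of Equation (4.3), using the counts of Table 4.20; the function name is our own and not from the text.

```python
# Illustrative sketch: the d-weight of a generalized tuple with respect to each class,
# following Equation (4.3). The counts match Table 4.20.
def d_weights(covered_counts):
    """covered_counts: {class_name: number of tuples of that class covered by q_a}.
    Returns {class_name: d-weight of q_a for that class}."""
    total = sum(covered_counts.values())
    return {cls: count / total for cls, count in covered_counts.items()}

counts = {"graduate": 90, "undergraduate": 210}
print(d_weights(counts))  # {'graduate': 0.3, 'undergraduate': 0.7}
```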
A quantitative discriminant rule for the target class of a given comparison description is written in the form

∀X, target_class(X) ⇐ condition(X) [d : d_weight],    (4.4)

where the condition is formed by a generalized tuple of the description. This is different from rules obtained in class characterization, where the arrow of implication is from left to right.
Example 4.29 Quantitative discriminant rule. Based on the generalized tuple and count distribution in Example 4.28, a quantitative discriminant rule for the target class graduate student can be written as follows:

∀X, Status(X) = "graduate student" ⇐
major(X) = "Science" ∧ age range(X) = "21...25" ∧ gpa(X) = "good" [d : 30%].    (4.5)
Notice that a discriminant rule provides a sufficient condition, but not a necessary one, for an object (or tuple) to be in the target class. For example, Rule (4.5) implies that if X satisfies the condition, then the probability that X is a graduate student is 30%. However, it does not imply the probability that X meets the condition, given that X is a graduate student. This is because although the tuples that meet the condition are in the target class, other tuples that do not necessarily satisfy this condition may also be in the target class, because the rule may not cover all of the examples of the target class in the database. Therefore, the condition is sufficient, but not necessary.
Example 4.30 Crosstab for class characterization and class comparison. Let Table 4.21 be a crosstab showing the total number (in thousands) of TVs and computers sold at AllElectronics in 2004.
Table 4.21 A crosstab for the total number (count) of TVs and computers sold in thousands in 2004
location | TV | computer | both items
Europe | 80 | 240 | 320
North America | 120 | 560 | 680
both regions | 200 | 800 | 1000
Table 4.22 The same crosstab as in Table 4.21, but here the t-weight and d-weight values associated with each class are shown
location | TV (count, t-weight, d-weight) | computer (count, t-weight, d-weight) | both items (count, t-weight, d-weight)
Europe | 80, 25%, 40% | 240, 75%, 30% | 320, 100%, 32%
North America | 120, 17.65%, 60% | 560, 82.35%, 70% | 680, 100%, 68%
both regions | 200, 20%, 100% | 800, 80%, 100% | 1000, 100%, 100%
Let Europe be the target class and North America be the contrasting class. The t-weights and d-weights of the sales distribution between the two classes are presented in Table 4.22. According to the table, the t-weight of a generalized tuple or object (e.g., item = "TV") for a given class (e.g., the target class Europe) shows how typical the tuple is of the given class (e.g., what proportion of these sales in Europe are for TVs?). The d-weight of a tuple shows how distinctive the tuple is in the given (target or contrasting) class in comparison with its rival class (e.g., how do the TV sales in Europe compare with those in North America?).
For example, the t-weight for "(Europe, TV)" is 25% because the number of TVs sold in Europe (80,000) represents only 25% of the European sales for both items (320,000). The d-weight for "(Europe, TV)" is 40% because the number of TVs sold in Europe (80,000) represents 40% of the number of TVs sold in both the target and the contrasting classes of Europe and North America, respectively (which is 200,000).
Notice that the count measure in the crosstab of Table 4.22 obeys the general property of a crosstab (i.e., the count values per row and per column, when totaled, match the corresponding totals in the both items and both regions slots, respectively). However, this property is not observed by the t-weight and d-weight measures, because the semantic meaning of each of these measures is different from that of count, as we explained in Example 4.30.
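To make the construction of Table 4.22 concrete, the following sketch derives the t-weight and d-weight entries from the raw counts of Table 4.21; the variable names are ours, and Europe and North America play the target and contrasting roles, as in the example.

```python
# Illustrative sketch: deriving the t-weight and d-weight columns of Table 4.22 from
# the raw counts of Table 4.21 (counts in thousands).
counts = {  # location -> {item: count}
    "Europe":        {"TV": 80,  "computer": 240},
    "North America": {"TV": 120, "computer": 560},
}

for location, row in counts.items():
    both_items = sum(row.values())
    for item, count in row.items():
        t_weight = count / both_items                             # typicality within the class
        d_weight = count / sum(c[item] for c in counts.values())  # distinctiveness across classes
        print(f"({location}, {item}): count={count}, "
              f"t={t_weight:.2%}, d={d_weight:.2%}")
# e.g., (Europe, TV): count=80, t=25.00%, d=40.00%
```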
"Can a quantitative characteristic rule and a quantitative discriminant rule be expressed together in the form of one rule?" The answer is yes: a quantitative characteristic rule and a quantitative discriminant rule for the same class can be combined to form a quantitative description rule for the class, which displays the t-weights and d-weights associated with the corresponding characteristic and discriminant rules. To see how this is done, let's quickly review how quantitative characteristic and discriminant rules are expressed.
As discussed in Section 4.3.3, a quantitative characteristic rule provides a necessary condition for the given target class since it presents a probability measurement for each property that can occur in the target class. Such a rule is of the form

∀X, target_class(X) ⇒ condition_1(X)[t : w_1] ∨ · · · ∨ condition_m(X)[t : w_m],    (4.6)

where each condition represents a property of the target class. The rule indicates that if X is in the target class, the probability that X satisfies condition_i is the value of the t-weight, w_i, where i is in {1, ..., m}.
As previously discussed in Section 4.3.4, a quantitative discriminant rule provides a sufficient condition for the target class since it presents a quantitative measurement of the properties that occur in the target class versus those that occur in the contrasting classes. Such a rule is of the form

∀X, target_class(X) ⇐ condition_1(X)[d : w_1] ∧ · · · ∧ condition_m(X)[d : w_m].    (4.7)

The rule indicates that if X satisfies condition_i, there is a probability of w_i (the d-weight value) that X is in the target class, where i is in {1, ..., m}.
A quantitative characteristic rule and a quantitative discriminant rule for a given class can be combined as follows to form a quantitative description rule: (1) for each condition, show both the associated t-weight and d-weight, and (2) use a bidirectional arrow between the given class and the conditions. That is, a quantitative description rule is of the form

∀X, target_class(X) ⇔ condition_1(X)[t : w_1, d : w′_1] θ · · · θ condition_m(X)[t : w_m, d : w′_m],    (4.8)

where θ represents a logical disjunction/conjunction. (That is, if we consider the rule as a characteristic rule, the conditions are ORed to form a disjunct. Otherwise, if we consider the rule as a discriminant rule, the conditions are ANDed to form a conjunct.) The rule indicates that for i from 1 to m, if X is in the target class, there is a probability of w_i that X satisfies condition_i; and if X satisfies condition_i, there is a probability of w′_i that X is in the target class.
Example 4.31 Quantitative description rule. It is straightforward to transform the crosstab of Table 4.22 in Example 4.30 into a class description in the form of quantitative description rules. For example, the quantitative description rule for the target class, Europe, is
∀X, location(X) = "Europe" ⇔
(item(X) = "TV") [t : 25%, d : 40%] θ (item(X) = "computer") [t : 75%, d : 30%].    (4.9)
For the sales of TVs and computers at AllElectronics in 2004, the rule states that if the sale of one of these items occurred in Europe, then the probability of the item being a TV is 25%, while that of being a computer is 75%. On the other hand, if we compare the sales of these items in Europe and North America, then 40% of the TVs were sold in Europe (and therefore we can deduce that 60% of the TVs were sold in North America). Furthermore, regarding computer sales, 30% of these sales took place in Europe.
4.4 Summary
Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. Data generalization approaches include data cube–based data aggregation and attribute-oriented induction.
From a data analysis point of view, data generalization is a form of descriptive data mining. Descriptive data mining describes data in a concise and summarative manner and presents interesting general properties of the data. This is different from predictive data mining, which analyzes data in order to construct one or a set of models, and attempts to predict the behavior of new data sets. This chapter focused on methods for descriptive data mining.
A data cube consists of a lattice of cuboids. Each cuboid corresponds to a different degree of summarization of the given multidimensional data.
Full materialization refers to the computation of all of the cuboids in a data cube lattice. Partial materialization refers to the selective computation of a subset of the cuboid cells in the lattice. Iceberg cubes and shell fragments are examples of partial materialization. An iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold. For shell fragments of a data cube, only some cuboids involving a small number of dimensions are computed. Queries on additional combinations of the dimensions can be computed on the fly.
There are several efficient data cube computation methods. In this chapter, we discussed in depth four cube computation methods: (1) MultiWay array aggregation for materializing full data cubes in sparse-array-based, bottom-up, shared computation; (2) BUC for computing iceberg cubes by exploring ordering and sorting for efficient top-down computation; (3) Star-Cubing for integration of top-down and bottom-up computation using a star-tree structure; and (4) high-dimensional
OLAP by precomputing only the partitioned shell fragments (thus called minimal cubing).
There are several methods for effective and efficient exploration of data cubes, including discovery-driven cube exploration, multifeature data cubes, and constrained cube gradient analysis. Discovery-driven exploration of data cubes uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process. Multifeature cubes compute complex queries involving multiple dependent aggregates at multiple granularity. Constrained cube gradient analysis explores significant changes in measures in a multidimensional space, based on a given set of probe cells, where changes in sector characteristics are expressed in terms of dimensions of the cube and are limited to specialization (drill-down), generalization (roll-up), and mutation (a change in one of the cube's dimensions).
gener-Concept description is the most basic form of descriptive data mining It describes
a given set of task-relevant data in a concise and summarative manner, presentinginteresting general properties of the data Concept (or class) description consists of
characterization and comparison (or discrimination) The former summarizes and describes a collection of data, called the target class, whereas the latter summarizes and distinguishes one collection of data, called the target class, from other collec- tion(s) of data, collectively called the contrasting class(es).
Concept characterization can be implemented using data cube (OLAP-based) approaches and the attribute-oriented induction approach. These are attribute- or dimension-based generalization approaches. The attribute-oriented induction approach consists of the following techniques: data focusing, data generalization by attribute removal or attribute generalization, count and aggregate value accumulation, attribute generalization control, and generalization data visualization.
Concept comparison can be performed using the attribute-oriented induction or data cube approaches in a manner similar to concept characterization. Generalized tuples from the target and contrasting classes can be quantitatively compared and contrasted.
Characterization and comparison descriptions (which form a concept description) can both be presented in the same generalized relation, crosstab, or quantitative rule form, although they are displayed with different interestingness measures. These measures include the t-weight (for tuple typicality) and d-weight (for tuple discriminability).

Exercises
(a) How many nonempty cuboids will a full data cube contain?
(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is “count ≥ 2”?
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is obtained by replacing a ∗ in c by a non-∗ value) and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
4.2 There are several typical cube computation methods, such as MultiWay array computation (MultiWay) [ZDN97], BUC (bottom-up computation) [BR99], and Star-Cubing [XHLW03]. Briefly describe these three methods (i.e., use one or two lines to outline the key points), and compare their feasibility and performance under the following conditions:
(a) Computing a dense full cube of low dimensionality (e.g., less than 8 dimensions)
(b) Computing an iceberg cube of around 10 dimensions with a highly skewed data distribution
(c) Computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions)
4.3 [Contributed by Chen Chen] Suppose a data cube, C, has D dimensions, and the base
cuboid contains k distinct tuples.
(a) Present a formula to calculate the minimum number of cells that the cube, C, may contain.
(b) Present a formula to calculate the maximum number of cells that C may contain.
(c) Answer parts (a) and (b) above as if the count in each cube cell must be no less than
a threshold, v.
(d) Answer parts (a) and (b) above as if only closed cells are considered (with the minimum count threshold, v).
4.4 Suppose that a base cuboid has three dimensions, A, B, C, with the following number of cells: |A| = 1,000,000, |B| = 100, and |C| = 1000. Suppose that each dimension is evenly partitioned into 10 portions for chunking.
(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
(b) If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is dense?
(c) State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.
4.5 Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures.
(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new design.
4.6 When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: there exists a huge number of subsets of combinations of dimensions.
(a) Suppose that there are only two base cells, {(a1, a2, a3, ..., a100), (a1, a2, b3, ..., b100)}, in a 100-dimensional base cuboid. Compute the number of nonempty aggregate cells. Comment on the storage space and time required to compute these cells.
(b) Suppose we are to compute an iceberg cube from the above. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells.
(c) Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data cube. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that map to (or cover) the two following base cells in a 100-dimensional base cuboid, each with a cell count of 10: {(a1, a2, a3, ..., a100) : 10, (a1, a2, b3, ..., b100) : 10}.
i. Let the minimum support be 10. How many distinct aggregate cells will there be like the following: {(a1, a2, a3, a4, ..., a99, ∗) : 10, ..., (a1, a2, ∗, a4, ..., a99, a100) : 10, ..., (a1, a2, a3, ∗, ..., ∗, ∗) : 10}?
ii. If we ignore all the aggregate cells that can be obtained by replacing some constants with ∗'s while keeping the same measure value, how many distinct cells are left? What are the cells?
4.7 Propose an algorithm that computes closed iceberg cubes efficiently.
4.8 Suppose that we would like to compute an iceberg cube for the dimensions, A, B, C, D, where we wish to materialize all cells that satisfy a minimum support count of at least v, and where cardinality(A) < cardinality(B) < cardinality(C) < cardinality(D). Show the BUC processing tree (which shows the order in which the BUC algorithm explores the lattice of a data cube, starting from all) for the construction of the above iceberg cube.
4.9 Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes
where the iceberg condition tests for an avg that is no bigger than some value, v.
4.10 A flight data warehouse for a travel agent consists of six dimensions: traveler, departure (city), departure time, arrival, arrival time, and flight; and two measures: count and avg fare, where avg fare stores the concrete fare at the lowest level but the average fare at other levels.
(a) Suppose the cube is fully materialized. Starting with the base cuboid [traveler, departure, departure time, arrival, arrival time, flight], what specific OLAP operations
(e.g., roll-up flight to airline) should one perform in order to list the average fare per month for each business traveler who flies American Airlines (AA) from L.A. in the year 2004?
(b) Suppose we want to compute a data cube where the condition is that the minimum number of records is 10 and the average fare is over $500. Outline an efficient cube computation method (based on common sense about flight data distribution).
4.11 (Implementation project) There are four typical data cube computation methods:
MultiWay [ZDN97], BUC [BR99], H-cubing [HPDW01], and Star-Cubing [XHLW03].
(a) Implement any one of these cube computation algorithms and describe your implementation, experimentation, and performance. Find another student who has implemented a different algorithm on the same platform (e.g., C++ on Linux) and compare your algorithm performance with his/hers. Report the total number of nonempty cells (this is used to quickly check the correctness of your results).
(b) Based on your implementation, discuss the following:
i. What challenging computation problems are encountered as the number of dimensions grows large?
ii. How can iceberg cubing solve the problems of part (a) for some data sets (and characterize such data sets)?
iii. Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
(c) Instead of computing a data cube of high dimensionality, we may choose to materialize the cuboids that have only a small number of dimension combinations. For example, for a 30-dimensional data cube, we may only compute the 5-dimensional cuboids for every possible 5-dimensional combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.
4.12 Consider the following multifeature cube query: Grouping by all subsets of {item, region,
month}, find the minimum shelf life in 2004 for each group and the fraction of the total
sales due to tuples whose price is less than $100 and whose shelf life is between 1.25 and 1.5 of the minimum shelf life.
(a) Draw the multifeature cube graph for the query.
(b) Express the query in extended SQL.
(c) Is this a distributive multifeature cube? Why or why not?
4.13 For class characterization, what are the major differences between a data cube–based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so.
4.14 Suppose that the following table is derived by attribute-oriented induction.
class birth place count
(a) Transform the table into a crosstab showing the associated t-weights and d-weights.
(b) Map the class Programmer into a (bidirectional) quantitative descriptive rule, for example,
∀X, Programmer(X) ⇔ (birth place(X) = "USA" ∧ ...) [t : x%, d : y%] θ (...) [t : w%, d : z%].
4.15 Discuss why relevance analysis is beneficial and how it can be performed and integrated into the characterization process. Compare the result of two induction methods: (1) with relevance analysis and (2) without relevance analysis.
4.16 Given a generalized relation, R, derived from a database, DB, suppose that a set, ∆DB, of tuples needs to be deleted from DB. Outline an incremental updating procedure for applying the necessary deletions to R.
4.17 Outline a data cube–based incremental algorithm for mining class comparisons.
Bibliographic Notes
Gray, Chauduri, Bosworth, et al. [GCB+97] proposed the data cube as a relational aggregation operator generalizing group-by, crosstabs, and subtotals. Harinarayan, Rajaraman, and Ullman [HRU96] proposed a greedy algorithm for the partial materialization of cuboids in the computation of a data cube. Sarawagi and Stonebraker [SS94] developed a chunk-based computation technique for the efficient organization of large multidimensional arrays. Agarwal, Agrawal, Deshpande, et al. [AAD+96] proposed several methods for the efficient computation of multidimensional aggregates for ROLAP servers. The chunk-based MultiWay array aggregation method for data
cube computation in MOLAP was proposed in Zhao, Deshpande, and Naughton [ZDN97]. Ross and Srivastava [RS97] developed a method for computing sparse data cubes. Iceberg queries were first described in Fang, Shivakumar, Garcia-Molina, et al. [FSGM+98]. BUC, a scalable method that computes iceberg cubes from the apex cuboid, downward, was introduced by Beyer and Ramakrishnan [BR99]. Han, Pei, Dong, and Wang [HPDW01] introduced an H-cubing method for computing iceberg cubes with complex measures using an H-tree structure. The Star-Cubing method for computing iceberg cubes with a dynamic star-tree structure was introduced by Xin, Han, Li, and Wah [XHLW03]. MMCubing, an efficient iceberg cube computation method that factorizes the lattice space, was developed by Shao, Han, and Xin [SHX04]. The shell-fragment-based minimal cubing approach for efficient high-dimensional OLAP introduced in this chapter was proposed by Li, Han, and Gonzalez [LHG04].
Aside from computing iceberg cubes, another way to reduce data cube computation is to materialize condensed, dwarf, or quotient cubes, which are variants of closed cubes. Wang, Feng, Lu, and Yu proposed computing a reduced data cube, called a condensed cube [WLFY02]. Sismanis, Deligiannakis, Roussopoulos, and Kotids proposed computing a compressed data cube, called a dwarf cube. Lakshmanan, Pei, and Han proposed a quotient cube structure to summarize the semantics of a data cube [LPH02], which was further extended to a qc-tree structure by Lakshmanan, Pei, and Zhao [LPZ03]. Xin, Han, Shao, and Liu [Xin+06] developed C-Cubing (i.e., Closed-Cubing), an aggregation-based approach that performs efficient closed-cube computation using a new algebraic measure called closedness.
There are also various studies on the computation of compressed data cubes by approximation, such as quasi-cubes by Barbara and Sullivan [BS97a], wavelet cubes by Vitter, Wang, and Iyer [VWI98], compressed cubes for query approximation on continuous dimensions by Shanmugasundaram, Fayyad, and Bradley [SFB99], and using log-linear models to compress data cubes by Barbara and Wu [BW00]. Computation of stream data "cubes" for multidimensional regression analysis has been studied by Chen, Dong, Han, et al. [CDH+02].
For works regarding the selection of materialized cuboids for efficient OLAP query processing, see Chaudhuri and Dayal [CD97], Harinarayan, Rajaraman, and Ullman [HRU96], Srivastava, Dar, Jagadish, and Levy [SDJL96], Gupta [Gup97], Baralis, Paraboschi, and Teniente [BPT97], and Shukla, Deshpande, and Naughton [SDN98]. Methods for cube size estimation can be found in Deshpande, Naughton, Ramasamy, et al. [DNR+97], Ross and Srivastava [RS97], and Beyer and Ramakrishnan [BR99]. Agrawal, Gupta, and Sarawagi [AGS97] proposed operations for modeling multidimensional databases.
The discovery-driven exploration of OLAP data cubes was proposed by Sarawagi, Agrawal, and Megiddo [SAM98]. Further studies on the integration of OLAP with data mining capabilities include the proposal of DIFF and RELAX operators for intelligent exploration of multidimensional OLAP data by Sarawagi and Sathe [SS00, SS01]. The construction of multifeature data cubes is described in Ross, Srivastava, and Chatziantoniou [RSC98]. Methods for answering queries quickly by on-line aggregation are
described in Hellerstein, Haas, and Wang [HHW97] and Hellerstein, Avnur, Chou, et al. [HAC+99]. A cube-gradient analysis problem, called cubegrade, was first proposed by Imielinski, Khachiyan, and Abdulghani [IKA02]. An efficient method for multidimensional constrained gradient analysis in data cubes was studied by Dong, Han, Lam, et al. [DHL+01].
Generalization and concept description methods have been studied in the statistics literature long before the onset of computers. Good summaries of statistical descriptive data mining methods include Cleveland [Cle93] and Devore [Dev95]. Generalization-based induction techniques, such as learning from examples, were proposed and studied in the machine learning literature before data mining became active. A theory and methodology of inductive learning was proposed by Michalski [Mic83]. The learning-from-examples method was proposed by Michalski [Mic83]. Version space was proposed by Mitchell [Mit77, Mit82]. The method of factoring the version space was presented by Subramanian and Feigenbaum [SF86b]. Overviews of machine learning techniques can be found in Dietterich and Michalski [DM83], Michalski, Carbonell, and Mitchell [MCM86], and Mitchell [Mit97].
Database-oriented methods for concept description explore scalable and efficient techniques for describing large sets of data. The attribute-oriented induction method described in this chapter was first proposed by Cai, Cercone, and Han [CCH91] and further extended by Han, Cai, and Cercone [HCC93], Han and Fu [HF96], Carter and Hamilton [CH98], and Han, Nishio, Kawano, and Wang [HNKW98].
5 Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks as well. Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research.
In this chapter, we introduce the concepts of frequent patterns, associations, and correlations, and study how they can be mined efficiently. The topic of frequent pattern mining is indeed rich. This chapter is dedicated to methods of frequent itemset mining. We delve into the following questions: How can we find frequent itemsets from large amounts of data, where the data are either transactional or relational? How can we mine association rules in multilevel and multidimensional space? Which association rules are the most interesting? How can we help or guide the mining procedure to discover interesting associations or correlations? How can we take advantage of user preferences or constraints to speed up the mining process? The techniques learned in this chapter may also be extended for more advanced forms of frequent pattern mining, such as from sequential and structured data sets, as we will study in later chapters.
5.1 Basic Concepts and a Road Map
Frequent pattern mining searches for recurring relationships in a given data set. This section introduces the basic concepts of frequent pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases.
We begin in Section 5.1.1 by presenting an example of market basket analysis, the earliest form of frequent pattern mining for association rules. The basic concepts of mining frequent patterns and associations are given in Section 5.1.2. Section 5.1.3 presents a road map to the different kinds of frequent patterns, association rules, and correlation rules that can be mined.
5.1.1 Market Basket Analysis: A Motivating Example

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets" (Figure 5.1). The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying
Figure 5.1 Market basket analysis: a market analyst asks, "Which items are frequently purchased together by my customers?" while customers' shopping baskets contain items such as milk, bread, cereal, butter, sugar, and eggs.
milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Let's look at an example of how market basket analysis can be useful.
Example 5.1 Market basket analysis. Suppose, as manager of an AllElectronics branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, "Which groups or sets of items are customers likely to purchase on a given trip to the store?" To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or in the design of a new catalog. For instance, market basket analysis may help you design different store layouts. In one strategy, items that are frequently purchased together can be placed in proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way. For instance, after deciding on an expensive computer, a customer may observe security systems for sale while heading toward the software display to purchase antivirus software and may decide to purchase a home security system as well. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules. For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in Association Rule (5.1) below:

computer ⇒ antivirus software [support = 2%, confidence = 60%]    (5.1)

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% for Association Rule (5.1) means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts. Additional analysis can be performed to uncover interesting statistical correlations between associated items.
5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules

Let I = {I1, I2, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = φ. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or say, both A and B). This is taken to be the probability, P(A ∪ B).¹ The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability, P(B|A). That is,

support(A ⇒ B) = P(A ∪ B)    (5.2)
confidence(A ⇒ B) = P(B|A).    (5.3)

Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called strong. By convention, we write support and confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.
confi-dence threshold (min conf) are called strong By convention, we write support and
con-fidence values so as to occur between 0% and 100%, rather than 0 to 1.0
A set of items is referred to as an itemset.2 An itemset that contains k items is a k-itemset The set {computer, antivirus software} is a 2-itemset The occurrence
frequency of an itemset is the number of transactions that contain the itemset This is also known, simply, as the frequency, support count, or count of the itemset Note that
the itemset support defined in Equation (5.2) is sometimes referred to as relative support,
whereas the occurrence frequency is called the absolute support If the relative support
of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a
frequent itemset.3The set of frequent k-itemsets is commonly denoted by L k.4From Equation (5.3), we have
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support count(A ∪ B) / support count(A).    (5.4)

Equation (5.4) shows that the confidence of rule A ⇒ B can be easily derived from the support counts of A and A ∪ B. That is, once the support counts of A, B, and A ∪ B are
¹ Notice that the notation P(A ∪ B) indicates the probability that a transaction contains the union of set A and set B (i.e., it contains every item in A and in B). This should not be confused with P(A or B), which indicates the probability that a transaction contains either A or B.
² In the data mining research literature, "itemset" is more commonly used than "item set."
³ In early work, itemsets satisfying minimum support were referred to as large. This term, however, is somewhat confusing as it has connotations to the number of items in an itemset rather than the frequency of occurrence of the set. Hence, we use the more recent term frequent.
⁴ Although the term frequent is preferred over large, for historical reasons frequent k-itemsets are still denoted as L_k.
found, it is straightforward to derive the corresponding association rules A ⇒ B and B ⇒ A and check whether they are strong. Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Additional interestingness measures can be applied for the discovery of correlation relationships between associated items, as will be discussed in Section 5.4. Because the second step is much less costly than the first, the overall performance of mining association rules is determined by the first step.
A major challenge in mining frequent itemsets from a large data set is the fact that such mining often generates a huge number of itemsets satisfying the minimum support (min sup) threshold, especially when min sup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains $\binom{100}{1} = 100$ frequent 1-itemsets a1, a2, ..., a100; $\binom{100}{2}$ frequent 2-itemsets (a1, a2), (a1, a3), ..., (a99, a100); and so on. The total number of frequent itemsets that it contains is thus

$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$.    (5.5)
This is too huge a number of itemsets for any computer to compute or store. To overcome this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent itemset.
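As a quick arithmetic check on Equation (5.5), the sum of binomial coefficients can be verified in a couple of lines (a throwaway Python snippet, not part of the text):

import math

total = sum(math.comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")   # prints about 1.268e+30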
An itemset X is closed in a data set S if there exists no proper super-itemset5 Y such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S. An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.
Let C be the set of closed frequent itemsets for a data set S satisfying a minimum support threshold, min sup. Let M be the set of maximal frequent itemsets for S satisfying min sup. Suppose that we have the support count of each itemset in C and M. Notice that C and its count information can be used to derive the whole set of frequent itemsets. Thus we say that C contains complete information regarding its corresponding frequent itemsets. On the other hand, M registers only the support of the maximal itemsets.

5 Y is a proper super-itemset of X if X is a proper sub-itemset of Y, that is, if X ⊂ Y. In other words, every item of X is contained in Y but there is at least one item of Y that is not in X.
It usually does not contain the complete support information regarding its corresponding frequent itemsets. We illustrate these concepts with the following example.

Example 5.2 Closed and maximal frequent itemsets. Suppose that a transaction database has only two transactions: {⟨a1, a2, ..., a100⟩; ⟨a1, a2, ..., a50⟩}. Let the minimum support count threshold be min sup = 1. We find two closed frequent itemsets and their support counts, that is, C = {{a1, a2, ..., a100} : 1; {a1, a2, ..., a50} : 2}. There is one maximal frequent itemset: M = {{a1, a2, ..., a100} : 1}. (We cannot include {a1, a2, ..., a50} as a maximal frequent itemset because it has a frequent super-set, {a1, a2, ..., a100}.) Compare this to the above, where we determined that there are 2^100 − 1 frequent itemsets, which is too huge a set to be enumerated!
The set of closed frequent itemsets contains complete information regarding the frequent itemsets. For example, from C, we can derive, say, (1) {a2, a45 : 2}, since {a2, a45} is a sub-itemset of the itemset {a1, a2, ..., a50 : 2}; and (2) {a8, a55 : 1}, since {a8, a55} is not a sub-itemset of the previous itemset but of the itemset {a1, a2, ..., a100 : 1}. However, from the maximal frequent itemsets, we can only assert that both itemsets ({a2, a45} and {a8, a55}) are frequent, but we cannot assert their actual support counts.
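The derivation in this example is mechanical enough to sketch in code. Assuming we already have the closed frequent itemsets and their support counts (the helper name and data layout below are our own, not from the text), the support count of any frequent itemset is the maximum count over the closed itemsets that contain it:

def support_from_closed(itemset, closed_counts):
    # closed_counts maps each closed frequent itemset (a frozenset) to its count.
    # An itemset is frequent iff it is contained in some closed frequent itemset;
    # its support count is the largest count among those closed supersets.
    return max((count for closed, count in closed_counts.items() if itemset <= closed),
               default=0)

# The two closed frequent itemsets of Example 5.2:
closed_counts = {frozenset(f"a{i}" for i in range(1, 101)): 1,
                 frozenset(f"a{i}" for i in range(1, 51)): 2}
print(support_from_closed(frozenset({"a2", "a45"}), closed_counts))   # 2
print(support_from_closed(frozenset({"a8", "a55"}), closed_counts))   # 1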
Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. Frequent pattern mining can be classified in various ways, based on the following criteria:
Based on the completeness of patterns to be mined: As we discussed in the previous subsection, we can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold. We can also mine constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints), approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent itemsets), near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets), top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k), and so on.
Different applications may have different requirements regarding the completeness of the patterns to be mined, which in turn can lead to different evaluation and optimization methods. In this chapter, our study of mining methods focuses on mining the complete set of frequent itemsets, closed frequent itemsets, and constrained frequent itemsets. We leave the mining of frequent itemsets under other completeness requirements as an exercise.
Based on the levels of abstraction involved in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rules, where X is a variable representing a customer:

buys(X, “computer”) ⇒ buys(X, “HP printer”)    (5.6)
buys(X, “laptop computer”) ⇒ buys(X, “HP printer”) (5.7)
In Rules (5.6) and (5.7), the items bought are referenced at different levels of abstraction (e.g., “computer” is a higher-level abstraction of “laptop computer”). We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.
Based on the number of data dimensions involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. Note that Rule (5.1), for example, could be rewritten as Rule (5.8):

buys(X, “computer”) ⇒ buys(X, “antivirus software”)    (5.8)

Rules (5.6), (5.7), and (5.8) are single-dimensional association rules because they each refer to only one dimension, buys.6
If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:

age(X, “30...39”) ∧ income(X, “42K...48K”) ⇒ buys(X, “high resolution TV”)    (5.9)
Based on the types of values handled in the rule: If a rule involves associations between the presence or absence of items, it is a Boolean association rule. For example, Rules (5.1), (5.6), and (5.7) are Boolean association rules obtained from market basket analysis.
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals. Rule (5.9) is also considered a quantitative association rule. Note that the quantitative attributes, age and income, have been discretized.
Based on the kinds of rules to be mined: Frequent pattern analysis can generate various kinds of rules and other interesting relationships. Association rules are the most popular kind of rules generated from frequent patterns. Typically, such mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets. Thus, the discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.
We can also mine strong gradient relationships among itemsets, where a gradient is the ratio of the measure of an item when compared with that of its parent (a generalized itemset), its child (a specialized itemset), or its sibling (a comparable itemset). One such example is: “The average sales from Sony Digital Camera increase over 16% when sold together with Sony Laptop Computer”; both Sony Digital Camera and Sony Laptop Computer are siblings, where the parent itemset is Sony.
6 Following the terminology used in multidimensional databases, we refer to each distinct predicate in a
rule as a dimension.
Based on the kinds of patterns to be mined: Many kinds of frequent patterns can be mined from different kinds of data sets. For this chapter, our focus is on frequent itemset mining, that is, the mining of frequent itemsets (sets of items) from transactional or relational data sets. However, other kinds of frequent patterns can be found from other kinds of data sets. Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events. For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card. Structured pattern mining searches for frequent substructures in a structured data set. Notice that structure is a general concept that covers many different kinds of structural forms, such as graphs, lattices, trees, sequences, sets, single items, or combinations of such structures. Single items are the simplest form of structure. Each element of an itemset may contain a subsequence, a subtree, and so on, and such containment relationships can be defined recursively. Therefore, structured pattern mining can be considered the most general form of frequent pattern mining.
In the next section, we will study efficient methods for mining the basic (i.e., single-level, single-dimensional, Boolean) frequent itemsets from transactional databases, and show how to generate association rules from such itemsets. The extension of this scope of mining to multilevel, multidimensional, and quantitative rules is discussed in Section 5.3. The mining of strong correlation relationships is studied in Section 5.4. Constraint-based mining is studied in Section 5.5. We address the more advanced topic of mining sequence and structured patterns in later chapters. Nevertheless, most of the methods studied here can be easily extended for mining more complex kinds of patterns.
5.2 Efficient and Scalable Frequent Itemset Mining Methods
In this section, you will learn methods for mining the simplest form of frequent patterns: single-dimensional, single-level, Boolean frequent itemsets, such as those discussed for market basket analysis in Section 5.1.1. We begin by presenting Apriori, the basic algorithm for finding frequent itemsets (Section 5.2.1). In Section 5.2.2, we look at how to generate strong association rules from frequent itemsets. Section 5.2.3 describes several variations of the Apriori algorithm for improved efficiency and scalability. Section 5.2.4 presents methods for mining frequent itemsets that, unlike Apriori, do not involve the generation of “candidate” frequent itemsets. Section 5.2.5 presents methods for mining frequent itemsets that take advantage of a vertical data format. Methods for mining closed frequent itemsets are discussed in Section 5.2.6.
5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see in the following. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space. We will first describe this property and then show an example illustrating its use.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min sup, then I is not frequent; that is, P(I) < min sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min sup.
This property belongs to a special category of properties called antimonotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotone because the property is monotonic in the context of failing a test.7
“How is the Apriori property used in the algorithm?” To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk−1. The notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second-to-last item in l1). By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k − 1)-itemset li, this means that the items are sorted such that li[1] < li[2] < · · · < li[k − 1]. The join, Lk−1 ⋈ Lk−1, is performed, where members of Lk−1 are joinable if their first (k − 2) items are in common. That is, members l1 and l2 of Lk−1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ · · · ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]). The condition l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k − 2], l1[k − 1], l2[k − 1]}.
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

7 The Apriori property has many applications. It can also be used to prune search during data cube computation (Chapter 4).

Example 5.3 Apriori. Let's look at a concrete example, based on the AllElectronics transaction database, D, of Table 5.1. There are nine transactions in this database, that is, |D| = 9.

Table 5.1 Transactional data for an AllElectronics branch
TID | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

We use Figure 5.2 to illustrate the Apriori algorithm for finding frequent itemsets in D.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we are referring to absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.8 C2 consists of $\binom{|L_1|}{2}$ 2-itemsets. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.

8 L1 ⋈ L1 is equivalent to L1 × L1, since the definition of Lk ⋈ Lk requires the two joining itemsets to share k − 1 = 0 items.
Figure 5.2 Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in Figure 5.2.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 5.3. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check if its (k − 1)-subsets are frequent since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of Figure 5.2.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support (Figure 5.2).
(a) Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
(b) Prune using the Apriori property: All nonempty subsets of a frequent itemset must also be frequent. Do any of the candidates have a subset that is not frequent?
The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of {I1, I2, I3} are members of L2. Therefore, keep {I1, I2, I3} in C3.
The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of {I1, I2, I5} are members of L2. Therefore, keep {I1, I2, I5} in C3.
The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}. {I3, I5} is not a member of L2, and so it is not frequent. Therefore, remove {I1, I3, I5} from C3.
The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}. {I3, I4} is not a member of L2, and so it is not frequent. Therefore, remove {I2, I3, I4} from C3.
The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}. {I3, I5} is not a member of L2, and so it is not frequent. Therefore, remove {I2, I3, I5} from C3.
The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}. {I4, I5} is not a member of L2, and so it is not frequent. Therefore, remove {I2, I4, I5} from C3.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
Figure 5.3 Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
Figure 5.4 shows pseudocode for the Apriori algorithm and its related procedures. Step 1 of Apriori finds the frequent 1-itemsets, L1. In steps 2 to 10, Lk−1 is used to generate candidates Ck in order to find Lk for k ≥ 2. The apriori gen procedure generates the candidates and then uses the Apriori property to eliminate those having a subset that is not frequent (step 3). This procedure is described below. Once all of the candidates have been generated, the database is scanned (step 4). For each transaction, a subset function is used to find all subsets of the transaction that are candidates (step 5), and the count for each of these candidates is accumulated (steps 6 and 7). Finally, all of those candidates satisfying minimum support (step 9) form the set of frequent itemsets, L (step 11). A procedure can then be called to generate association rules from the frequent itemsets. Such a procedure is described in Section 5.2.2.
The apriori gen procedure performs two kinds of actions, namely, join and prune, as described above. In the join component, Lk−1 is joined with Lk−1 to generate potential candidates (steps 1 to 4). The prune component (steps 5 to 7) employs the Apriori property to remove candidates that have a subset that is not frequent. The test for infrequent subsets is shown in procedure has infrequent subset.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
    D, a database of transactions;
    min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
(1)  L1 = find frequent 1-itemsets(D);
(2)  for (k = 2; Lk−1 ≠ ∅; k++) {
(3)      Ck = apriori gen(Lk−1);
(4)      for each transaction t ∈ D { // scan D for counts
(5)          Ct = subset(Ck, t); // get the subsets of t that are candidates
(6)          for each candidate c ∈ Ct
(7)              c.count++;
(8)      }
(9)      Lk = {c ∈ Ck | c.count ≥ min sup}
(10) }
(11) return L = ∪k Lk;

procedure apriori gen(Lk−1: frequent (k − 1)-itemsets)
(1)  for each itemset l1 ∈ Lk−1
(2)      for each itemset l2 ∈ Lk−1
(3)          if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ · · · ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]) then {
(4)              c = l1 ⋈ l2; // join step: generate candidates
(5)              if has infrequent subset(c, Lk−1) then
(6)                  delete c; // prune step: remove unfruitful candidate
(7)              else add c to Ck;
(8)          }
(9)  return Ck;

procedure has infrequent subset(c: candidate k-itemset; Lk−1: frequent (k − 1)-itemsets); // use prior knowledge
(1)  for each (k − 1)-subset s of c
(2)      if s ∉ Lk−1 then
(3)          return TRUE;
(4)  return FALSE;
Figure 5.4 The Apriori algorithm for discovering frequent itemsets for mining Boolean association rules.
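For readers who prefer running code, the following is a compact Python sketch of the same level-wise procedure (our own simplification, not the pseudocode above: the join is done by unioning pairs of frequent (k − 1)-itemsets rather than by the lexicographic join, and no hash tree or other optimization is used):

from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: minimum support count.
    # Returns a dict mapping each frequent itemset (frozenset) to its support count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {iset: c for iset, c in counts.items() if c >= min_sup}   # L1
    frequent = dict(L)
    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets whose union has exactly k items.
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(frozenset(s) in L
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)        # prune step already applied
        # Scan the database once to count the surviving candidates.
        cand_counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        L = {c: n for c, n in cand_counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

Run on the nine transactions of Table 5.1 with min_sup = 2, this sketch should reproduce the itemsets of Example 5.3, ending with the two frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5}.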
5.2.2 Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using Equation (5.4) for confidence, which we show again here for completeness:

confidence(A ⇒ B) = P(B|A) = support count(A ∪ B) / support count(A).
The conditional probability is expressed in terms of itemset support count, where support count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support count(l) / support count(s) ≥ min conf, where min conf is the minimum confidence threshold.
Because the rules are generated from frequent itemsets, each one automatically satisfies minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
Example 5.4 Generating association rules. Let's try an example based on the transactional data for AllElectronics shown in Table 5.1. Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:

I1 ∧ I2 ⇒ I5,    confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2,    confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1,    confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5,    confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5,    confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2,    confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones that are strong.
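The rule-generation step itself is easy to mechanize. The sketch below (our own helper names) assumes a dictionary of frequent itemsets and their support counts that, by the Apriori property, contains every nonempty subset of each frequent itemset, such as the one returned by the Apriori sketch earlier; it enumerates every nonempty proper subset of each frequent itemset and applies Equation (5.4):

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset -> support count; min_conf is in [0, 1].
    # Yields (antecedent, consequent, confidence) for every strong rule.
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]        # Equation (5.4)
                if conf >= min_conf:
                    yield antecedent, itemset - antecedent, conf

# For l = {I1, I2, I5} with the support counts of Example 5.4 and min_conf = 0.7,
# the rules kept are I1 ∧ I5 ⇒ I2, I2 ∧ I5 ⇒ I1, and I5 ⇒ I1 ∧ I2 (all 100%).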
5.2.3 Improving the Efficiency of Apriori

“How can we further improve the efficiency of Apriori-based mining?” Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm. Several of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts (Figure 5.5). A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2). (The first sketch at the end of this subsection illustrates this bucket-counting idea in code.)

Create hash table H2 using hash function h(x, y) = ((order of x) × 10 + (order of y)) mod 7:

bucket address | bucket count | bucket contents
0 | 2 | {I1, I4}, {I3, I5}
1 | 2 | {I1, I5}, {I1, I5}
2 | 4 | {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
3 | 2 | {I2, I4}, {I2, I4}
4 | 2 | {I2, I5}, {I2, I5}
5 | 4 | {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
6 | 4 | {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}

Figure 5.5 Hash table, H2, for candidate 2-itemsets: This hash table was generated by scanning the transactions of Table 5.1 while determining L1 from C1. If the minimum support count is, say, 3, then the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and so they should not be included in C2.
Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration, because subsequent scans of the database for j-itemsets, where j > k, will not require it. (The first sketch at the end of this subsection also includes this filter.)
Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets (Figure 5.6). It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min sup, then the minimum support count for a partition is min sup × the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. The procedure employs a special data structure that, for each itemset, records the TIDs of the transactions containing the items in the itemset. This allows it to find all of the local frequent k-itemsets, for k = 1, 2, ..., in just one scan of the database.
A local frequent itemset may or may not be frequent with respect to the entire database, D. However, any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D. The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase. (A two-phase sketch of this technique appears at the end of this subsection.)
[Figure 5.6 diagram: the transactions in D are divided into n partitions; Phase I finds the frequent itemsets local to each partition (one scan) and combines all local frequent itemsets to form the candidate itemsets; Phase II finds the global frequent itemsets among the candidates (one scan), yielding the frequent itemsets in D.]
Figure 5.6 Mining by partitioning the data.

Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall. Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets. To lessen this possibility, we use a lower support threshold than minimum support to find the frequent itemsets local to S (denoted LS). The rest of the database is then used to compute the actual frequencies of each itemset in LS. A mechanism is used to determine whether all of the global frequent itemsets are included in LS. If LS actually contains all of the frequent itemsets in D, then only one scan of D is required. Otherwise, a second pass can be done in order to find the frequent itemsets that were missed in the first pass. The sampling approach is especially beneficial when efficiency is of utmost importance, such as in computationally intensive applications that must be run frequently. (A sketch of this sampling strategy appears at the end of this subsection.)
Dynamic itemset counting (adding candidate itemsets at different points during a scan):
A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent. The resulting algorithm requires fewer database scans than Apriori.
fre-Other variations involving the mining of multilevel and multidimensional associationrules are discussed in the rest of this chapter The mining of associations related to spatialdata and multimedia data are discussed in Chapter 10
5.2.4 Mining Frequent Itemsets without Candidate Generation

As we have seen, in many cases the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it can suffer from two nontrivial costs:
It may need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets. Moreover, to discover a frequent pattern of size 100, such as {a1, ..., a100}, it has to generate at least 2^100 − 1 ≈ 10^30 candidates in total.
It may need to repeatedly scan the database and check a large set of candidates by pattern matching. It is costly to go over each transaction in the database to determine the support of the candidate itemsets.
“Can we design a method that mines the complete set of frequent itemsets without candidate generation?” An interesting method in this attempt is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each such database separately. You'll see how it works with the following example.
Example 5.5 FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of transaction database D of Table 5.1 in Example 5.3 using the frequent-pattern growth approach.
The first scan of the database is the same as in Apriori, which derives the set of frequent items (1-itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of frequent items is sorted in order of descending support count. This resulting set or list is denoted L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1, and create a new node, ⟨I4: 1⟩, which is linked as a child of ⟨I2: 2⟩. In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links. The tree obtained after scanning all of the transactions is shown in Figure 5.7 with the associated node-links. In this way, the problem of mining frequent patterns in databases is transformed into that of mining the FP-tree.
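A minimal FP-tree construction along these lines might look as follows in Python (the class and helper names are ours, not from the text). It builds the tree and the item header table with node-links, and includes a helper that collects an item's prefix paths, which serve as its conditional pattern base in the mining step described next:

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item ID, or None for the root
        self.count = 1
        self.parent = parent
        self.children = {}        # item ID -> FPNode
        self.link = None          # node-link to the next node holding the same item

def build_fptree(transactions, min_sup):
    # First scan: support count of every item; keep only the frequent ones.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))   # L order (descending count)
    root = FPNode(None, None)
    header = {i: None for i in order}                   # item header table
    # Second scan: insert each transaction's frequent items in L order.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Append the new node to the item's node-link chain.
                if header[item] is None:
                    header[item] = child
                else:
                    cur = header[item]
                    while cur.link is not None:
                        cur = cur.link
                    cur.link = child
            else:
                child.count += 1                        # shared prefix: bump the count
            node = child
    return root, header, freq

def prefix_paths(item, header):
    # Conditional pattern base of `item`: each prefix path paired with the
    # count of the item's node at the end of that path.
    paths = []
    node = header[item]
    while node is not None:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            paths.append((list(reversed(path)), node.count))
        node = node.link
    return paths

On the nine transactions of Table 5.1 with a minimum support count of 2 (ties in the ordering are broken here by item ID, which happens to reproduce L for this data), prefix_paths("I5", header) returns [(["I2", "I1"], 1), (["I2", "I1", "I3"], 1)], i.e., exactly the conditional pattern base listed for I5 in Table 5.2.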
[Figure 5.7 diagram: the FP-tree built from Table 5.1. An item header table lists each item ID with its support count and node-link (I2: 7, I1: 6, I3: 6, I4: 2, I5: 2); the node-links point to the item's occurrences in the tree, which is rooted at null{}.]
Figure 5.7 An FP-tree registers compressed, frequent pattern information.
Table 5.2 Mining the FP-tree by creating conditional (sub-)pattern bases

Item | Conditional Pattern Base | Conditional FP-tree | Frequent Patterns Generated
I5 | {{I2, I1: 1}, {I2, I1, I3: 1}} | ⟨I2: 2, I1: 2⟩ | {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I3 | {{I2, I1: 2}, {I2: 2}, {I1: 2}} | ⟨I2: 4, I1: 2⟩, ⟨I1: 2⟩ | {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base (a “subdatabase,” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
Mining of the FP-tree is summarized in Table 5.2 and detailed as follows. We first consider I5, which is the last item in L, rather than the first. The reason for starting at the end of the list will become apparent as we explain the FP-tree mining process. I5 occurs in two branches of the FP-tree of Figure 5.7. (The occurrences of I5 can easily be found by following its chain of node-links.) The paths formed by these branches are ⟨I2, I1, I5: 1⟩ and ⟨I2, I1, I3, I5: 1⟩. Therefore, considering I5 as a suffix, its corresponding two prefix paths are ⟨I2, I1: 1⟩ and ⟨I2, I1, I3: 1⟩, which form its conditional pattern base. Its conditional FP-tree contains only a single path, ⟨I2: 2, I1: 2⟩; I3 is not included because its support count of 1 is less than the minimum support count. The single path generates all the combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.
For I4, its two prefix paths form the conditional pattern base, {{I2, I1: 1}, {I2: 1}}, which generates a single-node conditional FP-tree, ⟨I2: 2⟩, and derives one frequent