Market Basket Analysis and Association Rules
For instance, in the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:
■■ Orange juice and milk, orange juice and detergent, orange juice and soda, orange juice and cleaner
■■ Milk and detergent, milk and soda, milk and cleaner
■■ Detergent and soda, detergent and cleaner
■■ Soda and cleaner
This is a total of 10 pairs of items. The third pass takes all combinations of three items, and so on. Of course, each of these stages may require a separate pass through the data, or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.
Although it is not obvious when there are just five items, increasing the number of items in the combinations requires exponentially more computation. This results in exponentially growing run times, and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning is a technique for reducing the number of items and combinations of items being considered at each step. At each stage, the algorithm throws out a certain number of combinations that do not meet some threshold criterion.
The most common pruning threshold is called minimum support pruning. Support refers to the number of transactions in the database where the rule holds. Minimum support pruning requires that a rule hold on a minimum number of transactions. For instance, if there are one million transactions and the minimum support is 1 percent, then only rules supported by at least 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action (such as striking a deal with Mattel, the makers of Barbie dolls, to make a candy-bar-eating doll), and the action must affect enough transactions to be worthwhile.
The minimum support constraint has a cascading effect. Consider a rule with four items in it, here called A, B, C, and D. If the rule must hold on at least 10,000 transactions, then each of the four items must itself appear in at least 10,000 transactions. In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that:
A and B must appear together in at least 10,000 transactions, and,
A and C must appear together in at least 10,000 transactions, and,
A and D must appear together in at least 10,000 transactions, and so on.
Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.
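The pass-by-pass counting with minimum support pruning described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name `frequent_itemsets`, the `max_size` parameter, and the five sample baskets are all assumptions made for the example, and the pruning is the simple form described in the text (an item that appears in no surviving combination is dropped before the next pass).

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Count item combinations pass by pass, pruning those below min_support.

    transactions: list of sets of item names (hypothetical sample data).
    min_support: minimum fraction of transactions a combination must appear in.
    """
    total = len(transactions)
    # Before the first pass every item is still a candidate.
    survivors = {item for t in transactions for item in t}
    result = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            # Only items that survived earlier passes can form candidates.
            kept = sorted(t & survivors)
            for combo in combinations(kept, size):
                counts[combo] += 1
        level = {c: n for c, n in counts.items() if n / total >= min_support}
        if not level:
            break
        result.update(level)
        # Items that appear in no surviving combination are dropped.
        survivors = {item for combo in level for item in combo}
    return result


# Hypothetical baskets from a small grocery store.
baskets = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]
frequent = frequent_itemsets(baskets, min_support=0.4)
for combo, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(combo, count)
```

With a 40 percent threshold, milk is eliminated in the first pass, so no pair or triple containing milk is ever counted. Production algorithms apply stricter candidate checks, but the cascading effect is the same.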
Figure 9.11 is an example of how the calculation takes place. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items (and their associated rules) from consideration. This is an example where pruning does not have an effect on the best rule, since the best rule has only two items. In the case of pizza, these toppings are all fairly common, so they are not pruned individually. If anchovies were included in the analysis, and there are only 15 pizzas containing them out of the 2,000, then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.
The best choice for minimum support depends on the data and the situation. It is also possible to vary the minimum support as the algorithm progresses. For instance, using different levels at different stages you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).
The Problem of Big Data
A typical fast food restaurant offers several dozen items on its menu, say 100. To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size tends to grow exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! This calculation is based on the binomial formula. On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.
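The binomial counts mentioned above are easy to check. The short sketch below uses Python's standard `math.comb`; the variable names are illustrative only.

```python
from math import comb

# Number of ways to choose k different items out of n, ignoring order.
menu_triples = comb(100, 3)        # three-item combinations on a 100-item menu
grocery_pairs = comb(10_000, 2)    # two-item combinations among 10,000 items
grocery_triples = comb(10_000, 3)  # three-item combinations among 10,000 items

print(f"{menu_triples:,}")      # 161,700
print(f"{grocery_pairs:,}")     # 49,995,000
print(f"{grocery_triples:,}")   # 166,616,670,000
```

Moving from a 100-item menu to a 10,000-item store multiplies the number of three-item combinations by roughly a million, which is why counting every combination quickly becomes impractical.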
Figure 9.11 This example shows how to count up the frequencies on pizza sales for market basket analysis.
Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. There are almost 50 million possible combinations of two items in the grocery store and over 100 billion combinations of three items. Although computers are getting more powerful and
A pizza restaurant has sold 2000 pizzas, of which:
100 are mushroom only, 150 are pepperoni only, 200 are extra cheese only
400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese
100 are mushroom, pepperoni, and extra cheese.
550 have no extra toppings.
We need to calculate the probabilities for all possible combinations of items. The figure's diagram labels the combinations: just mushroom, mushroom and pepperoni, mushroom and extra cheese, the works.
There are three rules with all three items:
Support = 5%, Confidence = 5% divided by 25% = 0.2, Lift = 20% (100/500) divided by 40% (800/2000) = 0.5 (mushroom and pepperoni as the condition, extra cheese as the result)
Support = 5%, Confidence = 5% divided by 15% = 0.333, Lift = 33.3% (100/300) divided by 45% (900/2000) = 0.74 (pepperoni and extra cheese as the condition, mushroom as the result)
The best rule has only two items:
Support = 25%, Confidence = 25% divided by 42.5% = 0.588, Lift = 55.6% (500/900) divided by 42.5% (850/2000) = 1.31 (mushroom and pepperoni)
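The arithmetic in Figure 9.11 can be reproduced with a few lines of code. The sketch below is only an illustration: the counts come from the figure, while the function names `prob` and `rule_metrics` and the short topping labels are assumptions made for the example.

```python
# Pizza counts from Figure 9.11, keyed by the exact set of toppings on the pizza.
pizzas = {
    frozenset(): 550,
    frozenset({"mushroom"}): 100,
    frozenset({"pepperoni"}): 150,
    frozenset({"cheese"}): 200,
    frozenset({"mushroom", "pepperoni"}): 400,
    frozenset({"mushroom", "cheese"}): 300,
    frozenset({"pepperoni", "cheese"}): 200,
    frozenset({"mushroom", "pepperoni", "cheese"}): 100,
}
TOTAL = sum(pizzas.values())  # 2000 pizzas


def prob(items):
    """Fraction of pizzas that contain at least the given toppings."""
    items = frozenset(items)
    return sum(n for toppings, n in pizzas.items() if items <= toppings) / TOTAL


def rule_metrics(condition, result):
    """Return (support, confidence, lift) for the rule: if condition then result."""
    support = prob(set(condition) | set(result))
    confidence = support / prob(condition)
    lift = confidence / prob(result)
    return support, confidence, lift


three_item = rule_metrics({"mushroom", "pepperoni"}, {"cheese"})
two_item = rule_metrics({"mushroom"}, {"pepperoni"})
print([round(x, 3) for x in three_item])
print([round(x, 3) for x in two_item])
```

A lift below 1 means the rule does worse than simply guessing the result at its overall rate. The two-item rule is the only one here whose lift exceeds 1.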
expensive. The use of product hierarchies reduces the number of items to a manageable size.
The number of transactions is also very large. In the course of a year, a decent-size chain of supermarkets will generate tens or hundreds of millions of transactions. Each of these transactions consists of one or more items, often several dozen at a time. So, determining if a particular combination of items is present in a particular transaction may require a bit of effort, multiplied a million-fold for all the transactions.
Extending the Ideas
The basic ideas of association rules can be applied to different areas, such as comparing different stores and making some enhancements to the definition of the rules. These are discussed in this section.
Using Association Rules to Compare Stores
Market basket analysis is commonly used to make comparisons between locations within a single chain. The rule about toilet bowl cleaner sales in hardware stores is an example where sales at new stores are compared to sales at existing stores. Different stores exhibit different selling patterns for many reasons: regional trends, the effectiveness of management, dissimilar advertising, and varying demographic patterns in the catchment area, for example. Air conditioners and fans are often purchased during heat waves, but heat waves affect only a limited region. Within smaller areas, demographics of the catchment area can have a large impact; we would expect stores in wealthy areas to exhibit different sales patterns from those in poorer neighborhoods. These are examples where market basket analysis can help to describe the differences and serve as an example of using market basket analysis for directed data mining.
How can association rules be used to make these comparisons? The first step is augmenting the transactions with virtual items that specify which group, such as an existing location or a new location, the transaction comes from. Virtual items help describe the transaction, although the virtual item is not a product or service. For instance, a sale at an existing hardware store might include the following products:
■■ A hammer
■■ A box of nails
■■ Extra-fine sandpaper
TIP Adding virtual items to the market basket data makes it possible to find rules that include store characteristics and customer characteristics.
After augmenting the data to specify where it came from, the transaction contains the same three products plus a virtual item marking it as a sale at an existing hardware store.
To compare sales at store openings versus existing stores, the process is:
1. Gather data for a specific period (such as 2 weeks) from store openings. Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.
2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.
3. Apply market basket analysis to find association rules in each set.
4. Pay particular attention to association rules containing the virtual items.
Because association rules are undirected data mining, the rules act as starting points for further hypothesis testing. Why does one pattern exist at existing stores and another at new stores? The rule about toilet bowl cleaners and store openings, for instance, suggests looking more closely at toilet bowl cleaner sales in existing stores at different times during the year.
Using this technique, market basket analysis can be used for many other types of comparisons:
■■ Sales during promotions versus sales at other times
■■ Sales in various geographic areas, by county, standard statistical metro
Adding virtual items to each basket of goods enables the standard association rule techniques to make these comparisons.
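Augmenting baskets with virtual items takes very little code. The sketch below is a hedged illustration: the helper names `add_virtual_item` and `confidence`, the "@" prefix used to mark virtual items, and the sample baskets are all assumptions made for this example.

```python
def add_virtual_item(transactions, label):
    """Return copies of the transactions with a virtual item added to each."""
    return [set(t) | {label} for t in transactions]


def confidence(transactions, condition, result):
    """Share of transactions containing `condition` that also contain `result`."""
    with_condition = [t for t in transactions if condition in t]
    if not with_condition:
        return 0.0
    return sum(result in t for t in with_condition) / len(with_condition)


# Hypothetical baskets; the "@" prefix marks virtual items (a naming assumption).
opening_sales = [{"hammer", "toilet bowl cleaner"}, {"toilet bowl cleaner", "nails"}]
existing_sales = [{"hammer", "nails", "sandpaper"}, {"nails", "sandpaper"}]

combined = (add_virtual_item(opening_sales, "@store opening")
            + add_virtual_item(existing_sales, "@existing store"))

print(confidence(combined, "@store opening", "toilet bowl cleaner"))
print(confidence(combined, "@existing store", "toilet bowl cleaner"))
```

Because the virtual item is just another member of the basket, any standard association rule procedure can report rules that have it in the condition or the result.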
Dissociation Rules
A dissociation rule is similar to an association rule except that it can have the connector “and not” in the condition in addition to “and.” A typical dissociation rule looks like:
if A and not B, then C
Dissociation rules can be generated by a simple adaptation of the basic market basket analysis algorithm. The adaptation is to introduce a new set of items that are the inverses of each of the original items. Then, modify each transaction so it includes an inverse item if, and only if, it does not contain the original item. For example, Table 9.8 shows the transformation of a few transactions. The ¬ before the item denotes the inverse item.
There are three downsides to including these new items. First, the total number of items used in the analysis doubles. Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance. Second, the size of a typical transaction grows because it now includes inverted items. The third issue is that the frequency of the inverse items tends to be much larger than the frequency of the original items. So, minimum support constraints tend to produce rules in which all items are inverted, such as:
if NOT A and NOT B then NOT C
These rules are less likely to be actionable.
Sometimes it is useful to invert only the most frequent items in the set used for analysis. This is particularly valuable when the frequency of some of the original items is close to 50 percent, so the frequencies of their inverses are also close to 50 percent.
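The transformation that introduces inverse items can be sketched in a few lines. The function name `add_inverse_items` and its optional `items` argument (used to invert only a chosen subset, such as the most frequent items) are assumptions made for this illustration.

```python
def add_inverse_items(transactions, items=None):
    """Add an inverse item "¬X" to every transaction that lacks item X.

    items: the items to invert; by default every item seen in the data.
    """
    to_invert = set().union(*transactions) if items is None else set(items)
    return [set(t) | {"¬" + item for item in to_invert - set(t)}
            for t in transactions]


# Hypothetical transactions over three items.
original = [{"A", "B"}, {"A", "C"}, {"B"}]

print(add_inverse_items(original))
# Inverting only item A, for instance because it is the most frequent one:
print(add_inverse_items(original, items={"A"}))
```

After the transformation every transaction mentions each inverted item exactly once, either directly or as its inverse. This is why both the number of distinct items and the size of each transaction grow.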
Table 9.8 Transformation of Transactions to Generate Dissociation Rules
Association rules find things that happen at the same time: what items are purchased at a given time. The next natural question concerns sequences of events and what they mean. Examples of results in this area are:
■■ New homeowners purchase shower curtains before purchasing furniture.
■■ Customers who purchase new lawnmowers are very likely to purchase a new garden hose in the following 6 weeks.
■■ When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all his or her accounts.
Time-series data usually requires some way of identifying the customer over time. Anonymous transactions cannot reveal that new homeowners buy shower curtains before they buy furniture. This requires tracking each customer, as well as knowing which customers recently purchased a home. Since larger purchases are often made with credit cards or debit cards, this is less of a problem. For problems in other domains, such as investigating the effects of medical treatments or customer behavior inside a bank, all transactions typically include identity information.
WARNING In order to consider time-series analyses on your customers, there has to be some way of identifying customers. Without a way of tracking individual customers, there is no way to analyze their behavior over time.
For the purposes of this section, a time series is an ordered sequence of items. It differs from a transaction only in being ordered. In general, the time series contains identifying information about the customer, since this information is used to tie the different transactions together into a series. Although there are many techniques for analyzing time series, such as ARIMA (a statistical technique) and neural networks, this section discusses only how to manipulate the time-series data to apply the market basket analysis.
In order to use time series, the transaction data must have two additional features:
■■ A timestamp or sequencing information to determine when transactions occurred relative to each other
■■ Identifying information, such as account number, household ID, or customer ID that identifies different transactions as belonging to the same customer or household (sometimes called an economic marketing unit)
Building sequential rules is similar to the process of building association rules:
1. All items purchased by a customer are treated as a single order, and each item retains the timestamp indicating when it was purchased.
3. To develop the rules, only rules where the items on the left-hand side were purchased before items on the right-hand side are considered.
The result is a set of association rules that can reveal sequential patterns.
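One way to picture the ordering constraint is to count, per customer, the item pairs in which the left item was bought before the right item. The sketch below is a simplified illustration rather than the book's procedure. The function name `sequential_pair_counts`, the use of each item's first purchase time, and the sample purchases are all assumptions.

```python
from collections import Counter, defaultdict


def sequential_pair_counts(purchases):
    """Count customers who first bought `left` before they first bought `right`.

    purchases: iterable of (customer_id, timestamp, item) tuples.
    Returns a Counter keyed by (left, right).
    """
    first_seen = defaultdict(dict)  # customer -> item -> earliest timestamp
    for customer, when, item in purchases:
        seen = first_seen[customer]
        if item not in seen or when < seen[item]:
            seen[item] = when
    counts = Counter()
    for seen in first_seen.values():
        for left, t_left in seen.items():
            for right, t_right in seen.items():
                if t_left < t_right:
                    counts[(left, right)] += 1
    return counts


# Hypothetical purchase history; timestamps are week numbers.
history = [
    ("c1", 1, "lawnmower"), ("c1", 5, "garden hose"),
    ("c2", 2, "lawnmower"), ("c2", 3, "garden hose"),
    ("c3", 4, "garden hose"),
]
pair_counts = sequential_pair_counts(history)
print(pair_counts[("lawnmower", "garden hose")])   # 2
print(pair_counts[("garden hose", "lawnmower")])   # 0
```

These ordered counts play the role that co-occurrence counts play for ordinary association rules. Support and confidence can then be computed from them in the usual way.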
Lessons Learned
Market basket data describes what customers purchase. Analyzing this data is complex, and no single technique is powerful enough to provide all the answers. The data itself typically describes the market basket at three different levels. The order is the event of the purchase; the line-items are the items in the purchase; and the customer connects orders together over time.
Many important questions about customer behavior can be answered by looking at product sales over time. Which are the best selling items? Which items that sold well last year are no longer selling well this year? Inventory curves do not require transaction level data. Perhaps the most important insight they provide is the effect of marketing interventions: did sales go up or down after a particular event?
However, inventory curves are not sufficient for understanding relationships among items in a single basket. One technique that is quite powerful is association rules. This technique finds products that tend to sell together in groups. Sometimes the groups are sufficient for insight. Other times, the groups are turned into explicit rules: when certain items are present, then we expect to find certain other items in the basket.
There are three measures of association rules. Support tells how often the rule is found in the transaction data. Confidence says how often the “then” part is also true when the “if” part is true. And lift tells how much better the rule is at predicting the “then” part as compared to having no rule at all.
The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.
Market basket analysis and association rules provide ways to analyze item-level detail, where the relationships between items are determined by the baskets they fall into. In the next chapter, we’ll turn to link analysis, which generalizes the ideas of “items” linked by “relationships,” using the background of an area of mathematics called graph theory.
Which Web sites link to which other ones? Who calls whom on the telephone? Which physicians prescribe which drugs to which patients? These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques are not able to take direct advantage of. In our ever-more-connected world (where, it has been claimed, there are no more than six degrees of separation between any two people on the planet), understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need.
Link analysis is based on a branch of mathematics called graph theory. This chapter reviews the key notions of graphs, then shows how link analysis has been applied to solve real problems. Link analysis is not applicable to all types of data, nor can it solve all types of problems. However, when it can be used, it
Basic Graph Theory
Graphs are an abstraction developed specifically to represent relationships. They have proven very useful in both mathematics and computer science for developing algorithms that exploit these relationships. Fortunately, graphs are quite intuitive, and there is a wealth of examples that illustrate how to take advantage of them.
A graph consists of two distinct parts:
■■ Nodes (sometimes called vertices) are the things in the graph that have relationships. These have names and often have additional useful properties.
■■ Edges are pairs of nodes connected by a relationship. An edge is represented by the two nodes that it connects, so (A, B) or AB represents the edge that connects A and B. An edge might also have a weight in a weighted graph.
Figure 10.1 illustrates two graphs. The graph on the left has four nodes connected by six edges and has the property that there is an edge between every pair of nodes. Such a graph is said to be fully connected. It could be representing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on an airline where these four cities serve as regional hubs. It could also represent
tions. This is the power of abstraction.
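The node and edge definitions above map directly onto simple data structures. The sketch below is one possible representation, not a prescribed one: the function name `is_fully_connected`, the use of `frozenset` for undirected edges, and the sample weight are assumptions made for illustration.

```python
from itertools import combinations

# Nodes: the four hub cities mentioned in the text.
hubs = ["Atlanta", "New York", "Cincinnati", "Salt Lake City"]

# An edge is an unordered pair of nodes; frozenset makes (A, B) equal to (B, A).
edges = {frozenset(pair) for pair in combinations(hubs, 2)}

# In a weighted graph each edge carries a number. This value is hypothetical.
weights = {frozenset({"Atlanta", "New York"}): 12}  # e.g. daily flights


def is_fully_connected(nodes, edge_set):
    """True when there is an edge between every pair of nodes."""
    return all(frozenset(pair) in edge_set for pair in combinations(nodes, 2))


print(len(edges))                       # 6
print(is_fully_connected(hubs, edges))  # True
```

A fully connected graph with n nodes always has n(n - 1)/2 edges, which gives the six edges of the four-node graph.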
A few points of terminology about graphs. Because graphs are so useful for visualizing relationships, it is nice when the nodes and edges can be drawn with no intersecting edges. The graphs in Figure 10.1 have this property. They are planar graphs, since they can be drawn on a sheet of paper (what mathematicians call a plane) without having any edges intersect. Figure 10.2 shows two graphs that cannot be drawn without having at least two edges cross. There is, in fact, a theorem in graph theory that says that if a graph is nonplanar, then lurking inside it is one of the two graphs shown in Figure 10.2.
When a path exists between any two nodes in a graph, the graph is said to be connected. For the rest of this chapter, we assume that all graphs are connected, unless otherwise specified. A path, as its name implies, is an ordered sequence of nodes connected by edges. Consider a graph where each node represents a city, and the edges are flights between pairs of cities. On such a graph, a node is a city and an edge is a flight segment, that is, two cities that are connected by a nonstop flight. A path is an itinerary of flight segments that go from one city to another, such as from Greenville, South Carolina to Atlanta, from Atlanta to Chicago, and from Chicago to Peoria.
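Paths and connectedness can be checked mechanically. The sketch below uses a breadth-first search over an adjacency mapping. The function names `find_path` and `is_connected` and the dictionary layout are assumptions for this illustration, while the flight segments are the ones named in the text.

```python
from collections import deque


def find_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def is_connected(adjacency):
    """True when a path exists from the first node to every other node."""
    nodes = list(adjacency)
    return all(find_path(adjacency, nodes[0], n) is not None for n in nodes[1:])


# Nonstop flight segments from the text, stored in both directions.
flights = {
    "Greenville": ["Atlanta"],
    "Atlanta": ["Greenville", "Chicago"],
    "Chicago": ["Atlanta", "Peoria"],
    "Peoria": ["Chicago"],
}
itinerary = find_path(flights, "Greenville", "Peoria")
print(itinerary)
print(is_connected(flights))
```

In an undirected graph, reaching every node from one starting node is enough to establish that a path exists between any two nodes.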
Figure 10.1 Two examples of graphs. Panel labels: a fully connected graph with four nodes (in a fully connected graph, there is an edge between every pair of nodes), and a graph with five nodes.
Figure 10.2 Not all graphs can be drawn without having some edges cross over each other. Panel labels: three nodes cannot connect to three other nodes without two edges crossing over each other; a fully-connected graph with five nodes must also have edges that intersect ("Oops! These edges intersect").
Figure 10.3 is an example of a weighted graph, one in which the edges have weights associated with them. In this case, the nodes represent products purchased by customers. The weights on the edges represent the support for the association, the percentage of market baskets containing both products. Such graphs provide an approach for solving problems in market basket analysis and are also a useful means of visualizing market basket data. This product association graph is an example of an undirected graph. The graph shows that 22.12 percent of market baskets at this health food grocery contain both yellow peppers and bananas. By itself, this does not explain whether yellow pepper sales drive banana sales or vice versa, or whether something else drives the purchase of all yellow fruits and vegetables.
One very common problem in link analysis is finding the shortest path between two nodes. Which is shortest, though, depends on the weights assigned to the edges. Consider the graph of flights between cities. Does shortest refer to distance? To the fewest number of flight segments? To the shortest flight time? Or to the least expensive? All these questions are answered the same way using graphs; the only difference is the weights on the edges.
The following two sections describe two classic problems in graph theory that illustrate the power of graphs to represent and solve problems. Few data mining problems are exactly like these two problems, but the problems give a flavor of how the simple construction of graphs leads to some interesting solutions. They are presented to familiarize the reader with graphs by providing examples of key concepts in graph theory and to provide a stronger basis for discussing link analysis.
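The point that only the edge weights change can be illustrated with Dijkstra's algorithm, a standard shortest-path method. The text does not name an algorithm, so choosing it is an assumption. The function names `build_graph` and `shortest_path`, the extra Greenville to Chicago segment, and all mileage and fare values below are hypothetical.

```python
import heapq


def build_graph(edge_list):
    """Turn (a, b, weight) triples into an undirected adjacency mapping."""
    graph = {}
    for a, b, weight in edge_list:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm for non-negative weights; returns (cost, path) or None."""
    best = {start: 0}
    previous = {}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None


# The same flight segments with two different (hypothetical) sets of weights.
miles = build_graph([("Greenville", "Atlanta", 150), ("Atlanta", "Chicago", 600),
                     ("Chicago", "Peoria", 130), ("Greenville", "Chicago", 700)])
fares = build_graph([("Greenville", "Atlanta", 100), ("Atlanta", "Chicago", 150),
                     ("Chicago", "Peoria", 90), ("Greenville", "Chicago", 400)])

print(shortest_path(miles, "Greenville", "Peoria"))
print(shortest_path(fares, "Greenville", "Peoria"))
```

With these made-up numbers the shortest route by distance skips Atlanta, while the cheapest route goes through it. The code is identical in both cases and only the weights differ.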
Figure 10.3 This is an example of a weighted graph where the edge weights are the number of transactions containing the items represented by the nodes at either end. (Labels surviving from the figure: Red Peppers, Vine Tomatoes, Organic Peaches, Floral, 3.68.)
Seven Bridges of Königsberg
One of the earliest problems in graph theory originated with a simple challenge posed in the eighteenth century by the Swiss mathematician Leonhard Euler. As shown in the simple map in Figure 10.4, Königsberg had two islands in the Pregel River connected to each other and to the rest of the city by a total of seven bridges. On either side of the river or on the islands, it is possible to get to any of the bridges. Figure 10.4 shows one path through the town that crosses over five bridges exactly once. Euler posed the question: Is it possible to walk over all seven bridges exactly once, starting from anywhere in the city, without getting wet or using a boat? As an historical note, the problem has survived longer than the name of the city. In the eighteenth century, Königsberg was a prominent Prussian city on the Baltic Sea nestled between Lithuania and Poland. Now, it is known as Kaliningrad, the westernmost Russian enclave, separated from the rest of Russia by Lithuania and Belarus.
In order to solve this problem, Euler invented the notation of graphs. He represented the map of Königsberg as the simple graph with four vertices and seven edges in Figure 10.5. Some pairs of nodes are connected by more than one edge, indicating that there is more than one bridge between them. Finding a route that traverses all the bridges in Königsberg exactly one time is equivalent to finding a path in the graph that visits every edge exactly once. Such a path is called an Eulerian path in honor of the mathematician who posed and solved this problem.
Figure 10.5 This graph represents the layout of Königsberg. The edges are bridges and the nodes are the riverbanks and islands.
WHY DO THE DEGREES HAVE TO BE EVEN?
The claim that an Eulerian path exists only when the degrees of all nodes are even (except at most two) rests on a simple observation about the edges a path uses. The edges connecting the intermediate nodes in the path come in pairs. That is, there is an outgoing edge for every incoming edge, so each intermediate node has an even number of edges in the path. Since an Eulerian path uses every edge in the graph, every node other than the two end nodes must serve as an intermediate node for the path. This is another way of saying that the degree of those nodes is even.
Euler also showed that the opposite is true. When all the nodes in a graph (save at most two) have an even degree, then an Eulerian path exists. This proof is a bit more complicated, but the idea is rather simple. To construct an Eulerian path, start at any node (even one with an odd degree) and move to any other connected node which has an even degree. Remove the edge just traversed from the graph and make it the first edge in the Eulerian path. Now, the problem is to find an Eulerian path starting at the second node in the graph. By keeping track of the degrees of the nodes, it is possible to construct such a path when there are at most two nodes whose degree is odd.
Euler devised a solution based on the number of edges going into or out of each node in the graph. The number of such edges is called the degree of a node. For instance, in the graph representing the seven bridges of Königsberg, the nodes representing the shores both have a degree of three, corresponding to the fact that there are three bridges connecting each shore to the islands. The other two nodes, representing the islands, have degrees of 5 and 3. Euler showed that an Eulerian path exists only when the degrees of all the nodes in a graph are even, except at most two (see technical aside). So, there is no way to walk over the seven bridges of Königsberg without traversing a bridge more than once, since there are four nodes whose degrees are odd.
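Euler's degree test is short enough to run directly. In the sketch below the node names are an assumption (A and B for the two shores, C and D for the two islands), chosen so that the degrees match those given in the text. The function names are likewise illustrative, and the test presumes the graph is connected.

```python
from collections import Counter

# The seven bridges as edges of a multigraph. Node names are an assumption:
# A and B are the two shores, C and D are the two islands.
bridges = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edge_list):
    """Number of edge ends touching each node."""
    degree = Counter()
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return degree


def passes_degree_test(edge_list):
    """True when at most two nodes have odd degree (graph assumed connected)."""
    odd = [node for node, d in degrees(edge_list).items() if d % 2 == 1]
    return len(odd) <= 2


print(dict(degrees(bridges)))       # {'A': 3, 'C': 5, 'B': 3, 'D': 3}
print(passes_degree_test(bridges))  # False
```

All four nodes have odd degree, so no walk can cross each bridge exactly once. Removing the bridge between the two islands would leave only the two shores with odd degree, and the test would then pass.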
Traveling Salesman Problem
A more recent problem in graph theory is the “Traveling Salesman Problem.” In this problem, a salesman needs to visit customers in a set of cities. He plans on flying to one of the cities, renting a car, visiting the customer there, then driving to each of the other cities to visit each of the rest of his customers. He leaves the car in the last city and flies home. There are many possible routes that the salesman can take. What route minimizes the total distance that he travels while still allowing him to visit each city exactly one time?
The Traveling Salesman Problem is easily reformulated using graphs, since graphs are a natural representation of cities connected by roads. In the graph representing this problem, the nodes are cities and each edge has a weight corresponding to the distance between the two cities connected by the edge. The Traveling Salesman Problem therefore is asking: “What is the shortest path that visits all the nodes in a graph exactly one time?” Notice that this problem is different from the Seven Bridges of Königsberg. We are not interested in simply finding a path that visits all nodes exactly once, but of all possible paths we want the shortest one. Notice that all Eulerian paths have exactly the same length, since they contain exactly the same edges. Asking for the shortest Eulerian path does not make sense.
Solving the Traveling Salesman Problem for three or four cities is not difficult. The most complicated graph with four nodes is a completely connected graph where every node in the graph is connected to every other node. In this graph, 24 different paths visit each node exactly once. To count the number of paths, start at any of the nodes (there are four possibilities), then go to any of the three remaining ones, then to either of the other two, and finally to the last node (4 * 3 * 2 * 1 = 4! = 24). A completely connected graph with n nodes has n! (n factorial) distinct paths that contain all nodes. Each path has a slightly different collection of edges, so their lengths are usually different. Since listing the 24 possible paths is not that hard, finding the shortest path is not particularly difficult for this simple case.
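Listing every path, as described above, is straightforward for four cities. The sketch below is a brute-force illustration only: the function name `shortest_tour`, the city labels, and the distances are hypothetical. The path is open, since the salesman does not drive back to the first city.

```python
from itertools import permutations


def shortest_tour(cities, distance):
    """Try every ordering of the cities; return (length, path, paths_tried)."""
    def length(path):
        # Sum the weights of the edges between consecutive cities on the path.
        return sum(distance[frozenset(pair)] for pair in zip(path, path[1:]))

    paths = list(permutations(cities))
    best = min(paths, key=length)
    return length(best), best, len(paths)


# Hypothetical distances for a completely connected graph on four cities.
distance = {
    frozenset({"A", "B"}): 10, frozenset({"A", "C"}): 15,
    frozenset({"A", "D"}): 20, frozenset({"B", "C"}): 35,
    frozenset({"B", "D"}): 25, frozenset({"C", "D"}): 30,
}
best_length, best_path, tried = shortest_tour(["A", "B", "C", "D"], distance)
print(tried)        # 24
print(best_length)  # 50
print(best_path)
```

Each path appears twice among the 24 orderings, once in each direction, so there are 12 distinct routes. With 10 cities the same loop would examine 3,628,800 orderings, which is why exhaustive listing stops being practical very quickly.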
The problem of finding the shortest path connecting nodes was first investigated by the Irish mathematician Sir William Rowan Hamilton. His study of minimizing energy in physical systems led him to investigate minimizing energy in certain discrete systems that he represented as graphs. In honor of him, a path that visits all nodes in a graph exactly once is called a Hamiltonian path.
The Traveling Salesman Problem is difficult to solve. Any solution must consider all of the possible paths through the graph in order to determine which one is the shortest. The number of paths in a completely connected graph grows very fast, as a factorial. What is true for completely connected graphs is true for graphs in general: The number of possible paths visiting all the nodes grows like an exponential function of the number of nodes (although there are a few simple graphs where this is not true). So, as the number of cities increases, the effort required to find the shortest path grows exponentially. Adding just one more city (with associated roads) can result in a solution that takes twice as long, or more, to find.