Description Data Mining Techniques For Marketing_6 pdf



312 Chapter 9

For instance, in the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:

■■ Orange juice and milk, orange juice and detergent, orange juice and soda, orange juice and cleaner

■■ Milk and detergent, milk and soda, milk and cleaner

■■ Detergent and soda, detergent and cleaner

■■ Soda and cleaner

This is a total of 10 pairs of items. The third pass takes all combinations of three items, and so on. Of course, each of these stages may require a separate pass through the data, or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.

Although it is not obvious when there are just five items, increasing the number of items in the combinations requires exponentially more computation. This results in exponentially growing run times, and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning is a technique for reducing the number of items and combinations of items being considered at each step. At each stage, the algorithm throws out a certain number of combinations that do not meet some threshold criterion.

The most common pruning threshold is called minimum support pruning. Support refers to the number of transactions in the database where the rule holds. Minimum support pruning requires that a rule hold on a minimum number of transactions. For instance, if there are one million transactions and the minimum support is 1 percent, then only rules supported by at least 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action (such as striking a deal with Mattel, the makers of Barbie dolls, to make a candy-bar-eating doll), and the action must affect enough transactions to be worthwhile.

The minimum support constraint has a cascading effect. Consider a rule with four items in it:


Market Basket Analysis and Association Rules 313

In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that:

A and B must appear together in at least 10,000 transactions, and,

A and C must appear together in at least 10,000 transactions, and,

A and D must appear together in at least 10,000 transactions, and so on.

Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.
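The cascading effect of minimum support can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the function name `frequent_itemsets`, the `max_size` parameter, and the toy baskets are assumptions made for the example, and support is expressed as a fraction of all transactions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Count item combinations pass by pass, pruning those below min_support.

    min_support is a fraction of all transactions (0.01 means 1 percent).
    Returns a dict mapping each surviving combination to its transaction count.
    """
    min_count = min_support * len(transactions)
    # Pass 1: count single items and prune the rare ones.
    counts = Counter(item for t in transactions for item in set(t))
    kept_items = {item for item, n in counts.items() if n >= min_count}
    survivors = {frozenset([item]): counts[item] for item in kept_items}
    previous = set(survivors)
    for size in range(2, max_size + 1):
        counts = Counter()
        for t in transactions:
            candidates = sorted(set(t) & kept_items)
            for combo in combinations(candidates, size):
                # Cascading effect: count a combination only if every
                # smaller sub-combination survived the previous pass.
                subs = combinations(combo, size - 1)
                if all(frozenset(sub) in previous for sub in subs):
                    counts[frozenset(combo)] += 1
        previous = {c for c, n in counts.items() if n >= min_count}
        if not previous:
            break  # nothing survived, so larger combinations cannot either
        survivors.update({c: counts[c] for c in previous})
    return survivors


# Hypothetical toy data: five baskets, minimum support of 40 percent.
baskets = [
    {"milk", "soda"},
    {"milk", "soda", "detergent"},
    {"milk", "soda"},
    {"milk", "detergent"},
    {"orange juice"},
]
result = frequent_itemsets(baskets, min_support=0.4)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy data, orange juice is pruned in the first pass, so no pair containing it is ever counted. The three-item combination is skipped because one of its pairs fell below the threshold.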

Figure 9.11 is an example of how the calculation takes place. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items, and their associated rules, from consideration. This is an example where pruning does not have an effect on the best rule, since the best rule has only two items. In the case of pizza, these toppings are all fairly common, so they are not pruned individually. If anchovies were included in the analysis, and there are only 15 pizzas containing them out of the 2,000, then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.

The best choice for minimum support depends on the data and the situation. It is also possible to vary the minimum support as the algorithm progresses. For instance, using different levels at different stages, you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).

The Problem of Big Data

A typical fast food restaurant offers several dozen items on its menu, say 100. To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size tends to grow exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! This calculation is based on the binomial formula. On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.
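The binomial counts are easy to check. The short sketch below uses `math.comb` from the Python standard library (available since Python 3.8); the variable names are chosen for this example only.

```python
from math import comb

# Number of ways to choose k different items from n, ignoring order.
menu_triples = comb(100, 3)         # three items from a 100-item menu
grocery_pairs = comb(10_000, 2)     # pairs from 10,000 supermarket items
grocery_triples = comb(10_000, 3)   # triples from 10,000 supermarket items

print(f"{menu_triples:,}")
print(f"{grocery_pairs:,}")
print(f"{grocery_triples:,}")
```

Going from 100 items to 10,000 items multiplies the number of three-item combinations by roughly a million, which is why the supermarket case is so much harder than the menu case.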


Figure 9.11 This example shows how to count up the frequencies on pizza sales for market basket analysis.

Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. There are almost 50 million possible combinations of two items in the grocery store and over 100 billion combinations of three items. Although computers are getting more powerful and

A pizza restaurant has sold 2,000 pizzas, of which:

■■ 100 are mushroom only, 150 are pepperoni only, and 200 are extra cheese only

■■ 400 are mushroom and pepperoni, 300 are mushroom and extra cheese, and 200 are pepperoni and extra cheese

■■ 100 are mushroom, pepperoni, and extra cheese (the works)

■■ 550 have no extra toppings

We need to calculate the probabilities for all possible combinations of items. Counting every pizza that contains a topping, mushroom appears on 900 pizzas (45%), pepperoni on 850 (42.5%), and extra cheese on 800 (40%).

There are three rules with all three items:

If mushroom and pepperoni, then extra cheese: Support = 5%; Confidence = 5% divided by 25% = 0.2; Lift = 20% (100/500) divided by 40% (800/2000) = 0.5

If pepperoni and extra cheese, then mushroom: Support = 5%; Confidence = 5% divided by 15% = 0.333; Lift = 33.3% (100/300) divided by 45% (900/2000) = 0.74

The best rule has only two items:

If pepperoni, then mushroom: Support = 25%; Confidence = 25% divided by 42.5% = 0.588; Lift = 58.8% (500/850) divided by 45% (900/2000) = 1.31
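The pizza figures can be recomputed directly from the counts. The sketch below is illustrative only: the dictionary layout and the helper names `count` and `rule_stats` are assumptions made for this example, and the topping label "cheese" stands for extra cheese.

```python
# Pizza counts from Figure 9.11, keyed by the exact set of toppings on each pizza.
pizzas = {
    frozenset(): 550,
    frozenset({"mushroom"}): 100,
    frozenset({"pepperoni"}): 150,
    frozenset({"cheese"}): 200,
    frozenset({"mushroom", "pepperoni"}): 400,
    frozenset({"mushroom", "cheese"}): 300,
    frozenset({"pepperoni", "cheese"}): 200,
    frozenset({"mushroom", "pepperoni", "cheese"}): 100,
}
TOTAL = sum(pizzas.values())


def count(items):
    """Number of pizzas whose toppings include every item in `items`."""
    items = frozenset(items)
    return sum(n for toppings, n in pizzas.items() if items <= toppings)


def rule_stats(condition, result):
    """Return (support, confidence, lift) for the rule: if condition, then result."""
    both = count(set(condition) | set(result))
    support = both / TOTAL
    confidence = both / count(condition)
    lift = confidence / (count(result) / TOTAL)
    return support, confidence, lift


three_item = rule_stats({"mushroom", "pepperoni"}, {"cheese"})
two_item = rule_stats({"pepperoni"}, {"mushroom"})
print([round(x, 3) for x in three_item])
print([round(x, 3) for x in two_item])
```

A lift below 1 for the three-item rule means the rule predicts extra cheese worse than simply guessing at the overall rate, while the two-item rule has a lift above 1.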


expensive. The use of product hierarchies reduces the number of items to a manageable size.

The number of transactions is also very large. In the course of a year, a decent-size chain of supermarkets will generate tens or hundreds of millions of transactions. Each of these transactions consists of one or more items, often several dozen at a time. So, determining if a particular combination of items is present in a particular transaction may require a bit of effort, multiplied a million-fold for all the transactions.

Extending the Ideas

The basic ideas of association rules can be applied to different areas, such as comparing different stores and making some enhancements to the definition of the rules. These are discussed in this section.

Using Association Rules to Compare Stores

Market basket analysis is commonly used to make comparisons between locations within a single chain. The rule about toilet bowl cleaner sales in hardware stores is an example where sales at new stores are compared to sales at existing stores. Different stores exhibit different selling patterns for many reasons: regional trends, the effectiveness of management, dissimilar advertising, and varying demographic patterns in the catchment area, for example. Air conditioners and fans are often purchased during heat waves, but heat waves affect only a limited region. Within smaller areas, demographics of the catchment area can have a large impact; we would expect stores in wealthy areas to exhibit different sales patterns from those in poorer neighborhoods. These are examples where market basket analysis can help to describe the differences and serve as an example of using market basket analysis for directed data mining.

How can association rules be used to make these comparisons? The first step is augmenting the transactions with virtual items that specify which group, such as an existing location or a new location, the transaction comes from. Virtual items help describe the transaction, although the virtual item is not a product or service. For instance, a sale at an existing hardware store might include the following products:

■■ A hammer

■■ A box of nails

■■ Extra-fine sandpaper


TIP Adding virtual items to the market basket data makes it possible to find rules that include store characteristics and customer characteristics.

After augmenting the data to specify where it came from, the transaction looks like:

To compare sales at store openings versus existing stores, the process is:

1. Gather data for a specific period (such as 2 weeks) from store openings. Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.

2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.

3. Apply market basket analysis to find association rules in each set.

4. Pay particular attention to association rules containing the virtual items. Because association rules are undirected data mining, the rules act as starting points for further hypothesis testing. Why does one pattern exist at existing stores and another at new stores? The rule about toilet bowl cleaners and store openings, for instance, suggests looking more closely at toilet bowl cleaner sales in existing stores at different times during the year.
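The augmentation in steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions: the helper `add_virtual_item`, the `@` prefix used to mark virtual items, and the sample baskets are all hypothetical and not taken from the text.

```python
def add_virtual_item(transactions, label):
    """Return copies of the transactions, each with the virtual item `label` added."""
    return [set(t) | {label} for t in transactions]


def has_virtual_item(itemset):
    """True if any item in the set is a virtual item (marked here with '@')."""
    return any(item.startswith("@") for item in itemset)


# Hypothetical baskets from a store opening and from an existing store.
opening_sales = [{"hammer", "toilet bowl cleaner"}]
existing_sales = [{"hammer", "box of nails", "extra-fine sandpaper"}]

opening = add_virtual_item(opening_sales, "@store opening")
existing = add_virtual_item(existing_sales, "@existing store")

for basket in opening + existing:
    print(sorted(basket))
```

Marking virtual items with a reserved prefix is one simple way to pick out the rules that step 4 singles out for attention, since any rule whose items pass `has_virtual_item` involves a store characteristic.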

Using this technique, market basket analysis can be used for many other types of comparisons:

■■ Sales during promotions versus sales at other times

■■ Sales in various geographic areas, by county, standard statistical metro

Adding virtual items to each basket of goods enables the standard association rule techniques to make these comparisons.


Dissociation Rules

A dissociation rule is similar to an association rule except that it can have the connector “and not” in the condition in addition to “and.” A typical dissociation rule looks like:

if A and not B, then C

Dissociation rules can be generated by a simple adaptation of the basic market basket analysis algorithm. The adaptation is to introduce a new set of items that are the inverses of each of the original items. Then, modify each transaction so it includes an inverse item if, and only if, it does not contain the original item. For example, Table 9.8 shows the transformation of a few transactions. The ¬ before the item denotes the inverse item.

There are three downsides to including these new items. First, the total number of items used in the analysis doubles. Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance. Second, the size of a typical transaction grows because it now includes inverted items. The third issue is that the frequency of the inverse items tends to be much larger than the frequency of the original items. So, minimum support constraints tend to produce rules in which all items are inverted, such as:

if NOT A and NOT B then NOT C

These rules are less likely to be actionable.

Sometimes it is useful to invert only the most frequent items in the set used for analysis. This is particularly valuable when the frequency of some of the original items is close to 50 percent, so the frequencies of their inverses are also close to 50 percent.
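The transformation into inverse items might look like the sketch below. The function name `add_inverse_items` and the sample transactions are assumptions for illustration; the `items_to_invert` argument reflects the suggestion to invert only selected items.

```python
def add_inverse_items(transactions, items_to_invert):
    """Add the inverse item '¬X' to every transaction that lacks X.

    Only the items listed in items_to_invert receive inverses.
    """
    augmented = []
    for t in transactions:
        basket = set(t)
        inverses = {f"¬{item}" for item in items_to_invert if item not in basket}
        augmented.append(basket | inverses)
    return augmented


# Hypothetical transactions over the items A, B, and C.
original = [{"A", "B"}, {"C"}, {"A", "B", "C"}]
transformed = add_inverse_items(original, items_to_invert={"A", "B", "C"})
for before, after in zip(original, transformed):
    print(sorted(before), "->", sorted(after))
```

Every transformed transaction ends up with one entry per inverted item, either the item or its inverse, which is why the typical transaction grows and the inverse items become so frequent.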

Table 9.8 Transformation of Transactions to Generate Dissociation Rules


Association rules find things that happen at the same time: what items are purchased at a given time. The next natural question concerns sequences of events and what they mean. Examples of results in this area are:

■■ New homeowners purchase shower curtains before purchasing furniture.

■■ Customers who purchase new lawnmowers are very likely to purchase a new garden hose in the following 6 weeks.

■■ When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all his or her accounts.

Time-series data usually requires some way of identifying the customer over time. Anonymous transactions cannot reveal that new homeowners buy shower curtains before they buy furniture. This requires tracking each customer, as well as knowing which customers recently purchased a home. Since larger purchases are often made with credit cards or debit cards, this is less of a problem. For problems in other domains, such as investigating the effects of medical treatments or customer behavior inside a bank, all transactions typically include identity information.

WARNING In order to consider time-series analyses on your customers, there has to be some way of identifying customers. Without a way of tracking individual customers, there is no way to analyze their behavior over time.

For the purposes of this section, a time series is an ordered sequence of items. It differs from a transaction only in being ordered. In general, the time series contains identifying information about the customer, since this information is used to tie the different transactions together into a series. Although there are many techniques for analyzing time series, such as ARIMA (a statistical technique) and neural networks, this section discusses only how to manipulate the time-series data to apply the market basket analysis.

In order to use time series, the transaction data must have two additional features:

■■ A timestamp or sequencing information to determine when transactions occurred relative to each other

■■ Identifying information, such as account number, household ID, or customer ID that identifies different transactions as belonging to the same customer or household (sometimes called an economic marketing unit)


Building sequential rules is similar to the process of building association rules:

1. All items purchased by a customer are treated as a single order, and each item retains the timestamp indicating when it was purchased.

3. To develop the rules, only rules where the items on the left-hand side were purchased before items on the right-hand side are considered.

The result is a set of association rules that can reveal sequential patterns.
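A minimal sketch of the ordering constraint is shown below. It makes a simplifying assumption that is not in the text: an item counts as purchased "before" another when its earliest purchase precedes the other item's earliest purchase. The function name, the tuple layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def sequential_pair_counts(purchases):
    """Count customers who bought `left` before `right`, for each ordered pair.

    purchases: iterable of (customer_id, timestamp, item) tuples.
    Simplification: only the earliest purchase of each item is compared.
    """
    earliest = defaultdict(dict)  # customer -> item -> earliest timestamp
    for customer, timestamp, item in purchases:
        seen = earliest[customer]
        if item not in seen or timestamp < seen[item]:
            seen[item] = timestamp
    counts = defaultdict(int)
    for seen in earliest.values():
        for left, t_left in seen.items():
            for right, t_right in seen.items():
                if left != right and t_left < t_right:
                    counts[(left, right)] += 1
    return dict(counts)


# Hypothetical purchases; timestamps are week numbers.
purchases = [
    ("c1", 1, "lawnmower"), ("c1", 5, "garden hose"),
    ("c2", 2, "lawnmower"), ("c2", 3, "garden hose"),
    ("c3", 4, "garden hose"), ("c3", 9, "lawnmower"),
]
pair_counts = sequential_pair_counts(purchases)
print(pair_counts)
```

The counts play the role of support for candidate rules of the form "if left, then right", with the left-hand side restricted to earlier purchases.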

Lessons Learned

Market basket data describes what customers purchase. Analyzing this data is complex, and no single technique is powerful enough to provide all the answers. The data itself typically describes the market basket at three different levels. The order is the event of the purchase; the line-items are the items in the purchase; and the customer connects orders together over time.

Many important questions about customer behavior can be answered by looking at product sales over time. Which are the best selling items? Which items that sold well last year are no longer selling well this year? Inventory curves do not require transaction level data. Perhaps the most important insight they provide is the effect of marketing interventions: did sales go up or down after a particular event?

However, inventory curves are not sufficient for understanding relationships among items in a single basket. One technique that is quite powerful is association rules. This technique finds products that tend to sell together in groups. Sometimes the groups are sufficient for insight. Other times, the groups are turned into explicit rules: when certain items are present, then we expect to find certain other items in the basket.

There are three measures of association rules. Support tells how often the rule is found in the transaction data. Confidence says how often, when the “if” part is true, the “then” part is also true. And lift tells how much better the rule is at predicting the “then” part as compared to having no rule at all.

The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.


Market basket analysis and association rules provide ways to analyze item-level detail, where the relationships between items are determined by the baskets they fall into. In the next chapter, we’ll turn to link analysis, which generalizes the ideas of “items” linked by “relationships,” using the background of an area of mathematics called graph theory.


Which Web sites link to which other ones? Who calls whom on the telephone? Which physicians prescribe which drugs to which patients? These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques are not able to take direct advantage of. In our ever-more-connected world (where, it has been claimed, there are no more than six degrees of separation between any two people on the planet), understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need.

Link analysis is based on a branch of mathematics called graph theory. This chapter reviews the key notions of graphs, then shows how link analysis has been applied to solve real problems. Link analysis is not applicable to all types of data, nor can it solve all types of problems. However, when it can be used, it


Basic Graph Theory

Graphs are an abstraction developed specifically to represent relationships. They have proven very useful in both mathematics and computer science for developing algorithms that exploit these relationships. Fortunately, graphs are quite intuitive, and there is a wealth of examples that illustrate how to take advantage of them.

A graph consists of two distinct parts:

■■ Nodes (sometimes called vertices) are the things in the graph that have relationships. These have names and often have additional useful properties.

■■ Edges are pairs of nodes connected by a relationship. An edge is represented by the two nodes that it connects, so (A, B) or AB represents the edge that connects A and B. An edge might also have a weight in a weighted graph.

Figure 10.1 illustrates two graphs. The graph on the left has four nodes connected by six edges and has the property that there is an edge between every pair of nodes. Such a graph is said to be fully connected. It could be representing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on an airline where these four cities serve as regional hubs. It could also represent


tions. This is the power of abstraction.

A few points of terminology about graphs. Because graphs are so useful for visualizing relationships, it is nice when the nodes and edges can be drawn with no intersecting edges. The graphs in Figure 10.1 have this property. They are planar graphs, since they can be drawn on a sheet of paper (what mathematicians call a plane) without having any edges intersect. Figure 10.2 shows two graphs that cannot be drawn without having at least two edges cross. There is, in fact, a theorem in graph theory that says that if a graph is nonplanar, then lurking inside it is one of the two previously described graphs.

When a path exists between any two nodes in a graph, the graph is said to be connected. For the rest of this chapter, we assume that all graphs are connected, unless otherwise specified. A path, as its name implies, is an ordered sequence of nodes connected by edges. Consider a graph where each node represents a city, and the edges are flights between pairs of cities. On such a graph, a node is a city and an edge is a flight segment: two cities that are connected by a nonstop flight. A path is an itinerary of flight segments that go from one city to another, such as from Greenville, South Carolina to Atlanta, from Atlanta to Chicago, and from Chicago to Peoria.
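The definitions of nodes, edges, connected, and fully connected translate naturally into code. The sketch below stores a graph as an adjacency dictionary; the function names are chosen for this example, and the breadth-first search used to test connectivity is a standard technique that the text itself does not name.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph as a dict: node -> set of neighboring nodes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def is_connected(adjacency):
    """True if a path exists between every pair of nodes (breadth-first search)."""
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(adjacency)


def is_fully_connected(adjacency):
    """True if there is an edge between every pair of nodes."""
    n = len(adjacency)
    return all(len(neighbors) == n - 1 for neighbors in adjacency.values())


# The four hub cities, with an edge between every pair.
hubs = ["Atlanta", "New York", "Cincinnati", "Salt Lake City"]
hub_edges = list(combinations(hubs, 2))
hub_graph = build_graph(hub_edges)

# The itinerary from the text, treated as a small graph of flight segments.
itinerary = build_graph([
    ("Greenville", "Atlanta"), ("Atlanta", "Chicago"), ("Chicago", "Peoria"),
])

print(len(hub_edges), is_fully_connected(hub_graph))
print(is_connected(itinerary), is_fully_connected(itinerary))
```

The itinerary graph is connected because every city can be reached from every other, but it is not fully connected because, for example, there is no nonstop segment between Greenville and Peoria.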

Figure 10.1 Two examples of graphs: a fully connected graph with four nodes (in a fully connected graph, there is an edge between every pair of nodes) and a graph with five nodes.


324 Chapter 10

Figure 10.2 Not all graphs can be drawn without having some edges cross over each other. Three nodes cannot connect to three other nodes without two edges crossing over each other. A fully connected graph with five nodes must also have edges that intersect.

Figure 10.3 is an example of a weighted graph, one in which the edges have weights associated with them. In this case, the nodes represent products purchased by customers. The weights on the edges represent the support for the association, the percentage of market baskets containing both products. Such graphs provide an approach for solving problems in market basket analysis and are also a useful means of visualizing market basket data. This product association graph is an example of an undirected graph. The graph shows that 22.12 percent of market baskets at this health food grocery contain both yellow peppers and bananas. By itself, this does not explain whether yellow pepper sales drive banana sales or vice versa, or whether something else drives the purchase of all yellow fruits and vegetables.

One very common problem in link analysis is finding the shortest path between two nodes. Which is shortest, though, depends on the weights assigned to the edges. Consider the graph of flights between cities. Does shortest refer to distance? To the fewest number of flight segments? To the shortest flight time? Or to the least expensive? All these questions are answered the same way using graphs; the only difference is the weights on the edges.

The following two sections describe two classic problems in graph theory that illustrate the power of graphs to represent and solve problems. Few data mining problems are exactly like these two problems, but the problems give a flavor of how the simple construction of graphs leads to some interesting solutions. They are presented to familiarize the reader with graphs by providing examples of key concepts in graph theory and to provide a stronger basis for discussing link analysis.
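The point that only the edge weights change can be illustrated with one shortest-path routine run on two weightings of the same graph. The sketch uses Dijkstra's algorithm, a standard choice that the text does not name. The mileage figures are made-up illustrative values, not real distances, and the direct Greenville to Chicago segment is hypothetical.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm. graph: node -> {neighbor: non-negative weight}.

    Returns (total_weight, path) or None if the goal cannot be reached.
    """
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Illustrative (not real) distances in miles between cities.
miles = {
    "Greenville": {"Atlanta": 150, "Chicago": 900},
    "Atlanta": {"Greenville": 150, "Chicago": 600},
    "Chicago": {"Greenville": 900, "Atlanta": 600, "Peoria": 130},
    "Peoria": {"Chicago": 130},
}
# Same graph, but every flight segment has weight 1.
segments = {city: {n: 1 for n in nbrs} for city, nbrs in miles.items()}

by_miles = shortest_path(miles, "Greenville", "Peoria")
by_segments = shortest_path(segments, "Greenville", "Peoria")
print(by_miles)
print(by_segments)
```

With these made-up weights, the shortest route by distance goes through Atlanta, while the route with the fewest segments skips it. The routine is identical in both cases.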


Figure 10.3 This is an example of a weighted graph where the edge weights are the number of transactions containing the items represented by the nodes at either end. Nodes shown include Red Peppers, Vine Tomatoes, Organic Peaches, and Floral.

Seven Bridges of Königsberg

One of the earliest problems in graph theory originated with a simple challenge posed in the eighteenth century by the Swiss mathematician Leonhard Euler. As shown in the simple map in Figure 10.4, Königsberg had two islands in the Pregel River connected to each other and to the rest of the city by a total of seven bridges. On either side of the river or on the islands, it is possible to get to any of the bridges. Figure 10.4 shows one path through the town that crosses over five bridges exactly once. Euler posed the question: Is it possible to walk over all seven bridges exactly once, starting from anywhere in the city, without getting wet or using a boat? As a historical note, the problem has survived longer than the name of the city. In the eighteenth century, Königsberg was a prominent Prussian city on the Baltic Sea nestled between Lithuania and Poland. Now, it is known as Kaliningrad, the westernmost Russian enclave, separated from the rest of Russia by Lithuania and Belarus.

In order to solve this problem, Euler invented the notation of graphs. He represented the map of Königsberg as the simple graph with four vertices and seven edges in Figure 10.5. Some pairs of nodes are connected by more than one edge, indicating that there is more than one bridge between them. Finding a route that traverses all the bridges in Königsberg exactly one time is equivalent to finding a path in the graph that visits every edge exactly once. Such a path is called an Eulerian path in honor of the mathematician who posed and solved this problem.


Figure 10.5 This graph represents the layout of Königsberg. The edges are bridges and the nodes are the riverbanks and islands.


WHY DO THE DEGREES HAVE TO BE EVEN?

The requirement that the degrees of all nodes be even (except at most two) rests on a simple observation about paths in the graph. The edges connecting the intermediate nodes in a path come in pairs. That is, there is an outgoing edge for every incoming edge, so each intermediate node has an even number of edges in the path. Since an Eulerian path contains every edge in the graph, every node other than the two end nodes must serve as an intermediate node for the path. This is another way of saying that the degree of those nodes is even.

Euler also showed that the opposite is true. When all the nodes in a graph (save at most two) have an even degree, then an Eulerian path exists. This proof is a bit more complicated, but the idea is rather simple. To construct an Eulerian path, start at any node (even one with an odd degree) and move to any other connected node which has an even degree. Remove the edge just traversed from the graph and make it the first edge in the Eulerian path. Now, the problem is to find an Eulerian path starting at the second node in the graph. By keeping track of the degrees of the nodes, it is possible to construct such a path when there are at most two nodes whose degree is odd.

Euler devised a solution based on the number of edges going into or out of each node in the graph. The number of such edges is called the degree of a node. For instance, in the graph representing the seven bridges of Königsberg, the nodes representing the shores both have a degree of three, corresponding to the fact that there are three bridges connecting each shore to the islands. The other two nodes, representing the islands, have degrees of 5 and 3. Euler showed that an Eulerian path exists only when the degrees of all the nodes in a graph are even, except at most two (see technical aside). So, there is no way to walk over the seven bridges of Königsberg without traversing a bridge more than once, since there are four nodes whose degrees are odd.
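Euler's degree test is short enough to write out. In the sketch below, the node names (`north_shore`, `south_shore`, `large_island`, `small_island`) are hypothetical labels chosen for this example, and the edge list is one assignment of the seven bridges that reproduces the degrees stated in the text (3, 3, 5, and 3). The test assumes the graph is connected, as the chapter does.

```python
from collections import Counter

# One entry per bridge; parallel bridges appear as repeated pairs.
BRIDGES = [
    ("north_shore", "large_island"), ("north_shore", "large_island"),
    ("south_shore", "large_island"), ("south_shore", "large_island"),
    ("north_shore", "small_island"), ("south_shore", "small_island"),
    ("large_island", "small_island"),
]


def degrees(edges):
    """Return the degree of each node: the number of edge ends touching it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def has_eulerian_path(edges):
    """Degree test for a connected graph: at most two nodes have odd degree."""
    odd_nodes = [node for node, d in degrees(edges).items() if d % 2 == 1]
    return len(odd_nodes) <= 2


print(dict(degrees(BRIDGES)))
print(has_eulerian_path(BRIDGES))
```

Dropping any single bridge from the list leaves exactly two nodes of odd degree, so the reduced graph passes the test, which suggests how close the original layout is to admitting such a walk.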

Traveling Salesman Problem

A more recent problem in graph theory is the “Traveling Salesman Problem.” In this problem, a salesman needs to visit customers in a set of cities. He plans on flying to one of the cities, renting a car, visiting the customer there, then driving to each of the other cities to visit each of the rest of his customers. He leaves the car in the last city and flies home. There are many possible routes that the salesman can take. What route minimizes the total distance that he travels while still allowing him to visit each city exactly one time?

The Traveling Salesman Problem is easily reformulated using graphs, since graphs are a natural representation of cities connected by roads. In the graph representing this problem, the nodes are cities and each edge has a weight corresponding to the distance between the two cities connected by the edge. The Traveling Salesman Problem therefore is asking: “What is the shortest path that visits all the nodes in a graph exactly one time?” Notice that this problem is different from the Seven Bridges of Königsberg. We are not interested in simply finding a path that visits all nodes exactly once, but of all possible paths we want the shortest one. Notice that all Eulerian paths have exactly the same length, since they contain exactly the same edges. Asking for the shortest Eulerian path does not make sense.

Solving the Traveling Salesman Problem for three or four cities is not difficult. The most complicated graph with four nodes is a completely connected graph where every node in the graph is connected to every other node. In this graph, 24 different paths visit each node exactly once. To count the number of paths, start at any of the nodes (there are four possibilities), then go to any of the other three remaining ones, then to any of the other two, and finally to the last node (4 * 3 * 2 * 1 = 4! = 24). A completely connected graph with n nodes has n! (n factorial) distinct paths that contain all nodes. Each path has a slightly different collection of edges, so their lengths are usually different. Since listing the 24 possible paths is not that hard, finding the shortest path is not particularly difficult for this simple case.
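For four cities, listing all 24 paths takes only a few lines. The sketch below is a brute-force illustration: the city names A through D and the distances are hypothetical, and the function name `shortest_route` is chosen for this example.

```python
from itertools import permutations


def shortest_route(distance):
    """Try every ordering of the cities; return (length, route) for the shortest.

    distance: dict mapping frozenset({city1, city2}) -> distance between them.
    The graph is assumed to be completely connected.
    """
    cities = sorted({city for pair in distance for city in pair})
    best = None
    for route in permutations(cities):
        # Sum the distances of consecutive legs along this ordering.
        length = sum(distance[frozenset(leg)] for leg in zip(route, route[1:]))
        if best is None or length < best[0]:
            best = (length, route)
    return best


# Hypothetical distances between four cities.
distance = {
    frozenset({"A", "B"}): 10, frozenset({"A", "C"}): 15,
    frozenset({"A", "D"}): 20, frozenset({"B", "C"}): 35,
    frozenset({"B", "D"}): 25, frozenset({"C", "D"}): 30,
}
number_of_paths = len(list(permutations("ABCD")))
best_length, best_route = shortest_route(distance)
print(number_of_paths, best_length, best_route)
```

Each route and its reverse have the same length, so only half of the orderings are really distinct journeys. The approach stops being practical quickly because the number of orderings grows as n factorial.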

The problem of finding the shortest path connecting nodes was first investigated by the Irish mathematician Sir William Rowan Hamilton. His study of minimizing energy in physical systems led him to investigate minimizing energy in certain discrete systems that he represented as graphs. In honor of him, a path that visits all nodes in a graph exactly once is called a Hamiltonian path.

The Traveling Salesman Problem is difficult to solve. Any solution must consider all of the possible paths through the graph in order to determine which one is the shortest. The number of paths in a completely connected graph grows very fast, as a factorial. What is true for completely connected graphs is true for graphs in general: The number of possible paths visiting all the nodes grows like an exponential function of the number of nodes (although there are a few simple graphs where this is not true). So, as the number of cities increases, the effort required to find the shortest path grows exponentially. Adding just one more city (with associated roads) can result in a solution that takes twice as long, or more, to find.
