Second, link analysis can apply the concepts generated by visualization to larger sets of customers. For instance, a churn reduction program might avoid targeting customers who have high inertia or be sure to target customers with high influence. This requires traversing the call graph to calculate the inertia or influence for all customers. Such derived characteristics can play an important role in marketing efforts.
Different marketing programs might suggest looking for other features in the call graph. For instance, perhaps the ability to place a conference call would be desirable, but who would be the best prospects? One idea would be to look for groups of customers that all call each other. Stated as a graph problem, this group is a fully connected subgraph. In the telephone industry, these subgraphs are called “communities of interest.” A community of interest may represent a group of customers who would be interested in the ability to place conference calls.
Lessons Learned
Link analysis is an application of the mathematical field of graph theory. As a data mining technique, link analysis has several strengths:
■■ It capitalizes on relationships
■■ It is useful for visualization
■■ It creates derived characteristics that can be used for further mining
Some data and data mining problems naturally involve links. As the two case studies about telephone data show, link analysis is very useful for telecommunications—a telephone call is a link between two people. Opportunities for link analysis are most obvious in fields where the links are obvious, such as telephony, transportation, and the World Wide Web. Link analysis is also appropriate in other areas where the connections do not have such a clear manifestation, such as physician referral patterns, retail sales data, and forensic analysis for crimes.
Links are a very natural way to visualize some types of data. Direct visualization of the links can be a big aid to knowledge discovery. Even when automated patterns are found, visualization of the links helps to better understand what is happening. Link analysis offers an alternative way of looking at data, different from the formats of relational databases and OLAP tools. Links may suggest important patterns in the data, but the significance of the patterns requires a person for interpretation.
Link analysis can lead to new and useful data attributes. Examples include calculating an authority score for a page on the World Wide Web and calculating the sphere of influence for a telephone user.
Although link analysis is very powerful when applicable, it is not appropriate for all types of problems. It is not a prediction tool or classification tool like a neural network that takes data in and produces an answer. Many types of data are simply not appropriate for link analysis. Its strongest use is probably in finding specific patterns, such as the types of outgoing calls, which can then be applied to data. These patterns can be turned into new features of the data, for use in conjunction with other directed data mining techniques.
Automatic Cluster Detection
The data mining techniques described in this book are used to find meaningful patterns in data. These patterns are not always immediately forthcoming. Sometimes this is because there are no patterns to be found. Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise. Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components.
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than, say, age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules.
Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. “In winter, deciduous trees have no leaves so the trees tend to be brown” or “The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in—to help see the forest without getting lost in the trees.
This chapter begins with two examples of the usefulness of clustering—one drawn from astronomy, another from clothing design. It then introduces the K-means clustering algorithm which, like the nearest neighbor techniques discussed in Chapter 8, depends on a geometric interpretation of data. The geometric ideas used in K-means bring up the more general topic of measures of similarity, association, and distance. These distance measures are quite sensitive to variations in how data is represented, so the next topic addressed is data preparation for clustering, with special attention being paid to scaling and weighting. K-means is not the only algorithm in common use for automatic cluster detection. This chapter contains brief discussions of several others: Gaussian mixture models, agglomerative clustering, and divisive clustering. (Another clustering technique, self-organizing maps, is covered in Chapter 7 because self-organizing maps are a form of neural network.) The chapter concludes with a case study in which automatic cluster detection is used to evaluate editorial zones for a major daily newspaper.
Searching for Islands of Simplicity
In Chapter 1, where data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true because the automatic cluster detection algorithms themselves are simply finding structure that exists in the data without regard to any particular target variable. Most data mining tasks start out with a preclassified training set, which is used to develop a model capable of scoring or classifying previously unseen records.
In clustering, there is no preclassified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records—the clusters—composed of records similar to each other. The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business—or something inexplicable and perhaps unimportant.
In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose. In marketing, clusters formed for a business purpose are usually called “segments,” and customer segmentation is a popular application of clustering.
Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: One famous early application of cluster detection led to our current understanding of stellar evolution.
Star Light, Star Bright
Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in Figure 11.1. The vertical scale measures luminosity in multiples of the brightness of our own sun. The horizontal scale measures surface temperature in degrees Kelvin (degrees centigrade above absolute 0, the theoretical coldest possible temperature).
Figure 11.1 The Hertzsprung-Russell diagram clusters stars by temperature and luminosity
Two different astronomers, Ejnar Hertzsprung in Denmark and Henry Norris Russell in the United States, thought of doing this at about the same time. They both observed that in the resulting scatter plot, the stars fall into three clusters. This observation led to further work and the understanding that these three clusters represent stars in very different phases of the stellar life cycle. The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different between the clusters because fundamentally different processes are generating the heat and light.
The 80 percent of stars that fall on the main sequence are generating energy by converting hydrogen to helium through nuclear fusion. This is how all stars spend most of their active life. After some number of billions of years, the hydrogen is used up. Depending on its mass, the star then begins fusing helium or the fusion stops. In the latter case, the core of the star collapses, generating a great deal of heat in the process. At the same time, the outer layer of gasses expands away from the core, and a red giant is formed. Eventually, the outer layer of gasses is stripped away, and the remaining core begins to cool. The star is now a white dwarf.
A recent search on Google using the phrase “Hertzsprung-Russell Diagram” returned thousands of pages of links to current astronomical research based on cluster detection of this kind. Even today, clusters based on the HR diagram are being used to hunt for brown dwarfs (starlike objects that lack sufficient mass to initiate nuclear fusion) and to understand pre–main sequence stellar evolution.
Fitting the Troops
The Hertzsprung-Russell diagram is a good introductory example of clustering because with only two variables, it is easy to spot the clusters visually (and, incidentally, it is a good example of the importance of good data visualizations). Even in three dimensions, picking out clusters by eye from a scatter plot cube is not too difficult. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, it becomes increasingly difficult to visualize clusters. Our intuition about how close things are to each other also quickly breaks down with more dimensions.
Saying that a problem has many dimensions is an invitation to analyze it geometrically. A dimension is each of the things that must be measured independently in order to describe something. In other words, if there are N variables, imagine a space in which the value of each variable represents a distance along the corresponding axis in an N-dimensional space. A single record containing a value for each of the N variables can be thought of as the vector that defines a particular point in that space. When there are two dimensions, this is easily plotted. The HR diagram was one such example. Figure 11.2 is another example that plots the height and weight of a group of teenagers as points on a graph. Notice the clustering of boys and girls.
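As a tiny sketch of this geometric representation (the numbers below are invented, not the data behind Figure 11.2), each record becomes one point in a two-dimensional space:

```python
import numpy as np

# One record per teenager: (height in inches, weight in pounds). Invented values.
records = np.array([
    [62.0, 105.0],
    [64.5, 118.0],
    [70.0, 150.0],
    [72.0, 168.0],
])

# Each row is a vector from the origin to the point that describes one person;
# with N variables, the same idea gives a point in N-dimensional space.
for height, weight in records:
    print(f"point at ({height}, {weight}) in a 2-dimensional space")
```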
The chart in Figure 11.2 begins to give a rough idea of people’s shapes. But if the goal is to fit them for clothes, a few more measurements are needed!
In the 1990s, the U.S. Army commissioned a study on how to redesign the uniforms of female soldiers. The army’s goal was to reduce the number of different uniform sizes.
Figure 11.2 Heights and weights of a group of teenagers
1. Ashdown, Susan P. 1998. “An Investigation of the Structure of Sizing Systems: A Comparison of Three Multidimensional Optimized Sizing Systems Generated from Anthropometric Data,” International Journal of Clothing Science and Technology, Vol. 10, No. 5, pp. 324–341.
Unlike the traditional clothing size systems, the one Ashdown and Paal came up with is not an ordered set of graduated sizes where all dimensions increase together. Instead, they came up with sizes that fit particular body types. Each body type corresponds to a cluster of records in a database of body measurements. One cluster might consist of short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks, while other clusters capture other constellations of measurements.
The database contained more than 100 measurements for each of nearly 3,000 women. The clustering technique employed was the K-means algorithm, described in the next section. In the end, only a handful of the more than 100 measurements were needed to characterize the clusters. Finding this smaller number of variables was another benefit of the clustering process.
K-Means Clustering
The K-means algorithm is one of the most commonly used clustering algorithms. The “K” in its name refers to the fact that the algorithm looks for a fixed number of clusters, which are defined in terms of proximity of data points to each other. The version described here was first published by J. B. MacQueen in 1967. For ease of explaining, the technique is illustrated using two-dimensional diagrams. Bear in mind that in practice the algorithm is usually handling many more than two independent variables. This means that instead of points corresponding to two-element vectors (x1, x2), the points correspond to n-element vectors (x1, x2, . . ., xn). The procedure itself is unchanged.
Three Steps of the K-Means Algorithm
In the first step, the algorithm randomly selects K data points to be the seeds. MacQueen’s algorithm simply takes the first K records. In cases where the records have some meaningful order, it may be desirable to choose widely spaced records, or a random selection of records. Each of the seeds is an embryonic cluster with only one element. This example sets the number of clusters to 3.
The second step assigns each record to the closest seed. One way to do this is by finding the boundaries between the clusters, as shown geometrically in Figure 11.3. The boundaries between two clusters are the points that are equally close to each cluster. Recalling a lesson from high-school geometry makes this less difficult than it sounds: given any two points, A and B, all points that are equidistant from A and B fall along a line (called the perpendicular bisector) that is perpendicular to the one connecting A and B and halfway between them. In Figure 11.3, dashed lines connect the initial seeds; the resulting cluster boundaries shown with solid lines are at right angles to the dashed lines. Using these lines as guides, it is obvious which records are closest to which seeds. In three dimensions, these boundaries would be planes, and in N dimensions they would be hyperplanes of dimension N – 1. Fortunately, computer algorithms easily handle these situations. Finding the actual boundaries between clusters is useful for showing the process geometrically. In practice, though, the algorithm usually measures the distance of each record to each seed and chooses the minimum distance for this step.
For example, consider the record with the box drawn around it. On the basis of the initial seeds, this record is assigned to the cluster controlled by seed number 2 because it is closer to that seed than to either of the other two.
At this point, every point has been assigned to exactly one of the three clusters centered around the original seeds. The third step is to calculate the centroids of the clusters; these now do a better job of characterizing the clusters than the initial seeds. Finding the centroids is simply a matter of taking the average value of each dimension for all the records in the cluster.
In Figure 11.4, the new centroids are marked with a cross. The arrows show the motion from the position of the original seeds to the new centroids of the clusters formed from those seeds.
Figure 11.3 The initial seeds determine the initial cluster boundaries
Figure 11.4 The centroids are calculated from the points that are assigned to each cluster
The centroids become the seeds for the next iteration of the algorithm. Step 2 is repeated, and each point is once again assigned to the cluster with the closest centroid. Figure 11.5 shows the new cluster boundaries—formed, as before, by drawing lines equidistant between each pair of centroids. Notice that the point with the box around it, which was originally assigned to cluster number 2, has now been assigned to cluster number 1. The process of assigning points to clusters and then recalculating centroids continues until the cluster boundaries stop changing. In practice, the K-means algorithm usually finds a set of stable clusters after a few dozen iterations.
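Pulling the three steps together, here is a minimal illustrative sketch in Python with NumPy (not the code of any particular data mining product): it takes the first K records as seeds, as MacQueen’s version does, assigns each record to its nearest seed, recalculates the centroids, and repeats until the assignments stop changing.

```python
import numpy as np

def k_means(records, k, max_iterations=100):
    """Cluster records (an array of shape [n_records, n_dimensions]) into k clusters."""
    records = np.asarray(records, dtype=float)

    # Step 1: pick K seeds. MacQueen's version simply takes the first K records.
    centroids = records[:k].copy()
    assignments = np.full(len(records), -1)

    for _ in range(max_iterations):
        # Step 2: assign each record to the closest seed (smallest Euclidean distance).
        distances = np.linalg.norm(records[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)

        # Stop when the assignments (and hence the cluster boundaries) stop changing.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Step 3: recalculate each centroid as the average of the records assigned to it.
        for j in range(k):
            members = records[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return assignments, centroids

# A toy run on made-up two-dimensional data (the first three records serve as seeds):
points = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0],
                   [1.2, 0.8], [5.3, 4.7], [8.8, 1.3]])
labels, centers = k_means(points, k=3)
print(labels)
print(centers)
```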
What K Means
Clusters describe underlying structure in data. However, there is no one right description of that structure. For instance, someone not from New York City may think that the whole city is “downtown.” Someone from Brooklyn or Queens might apply this nomenclature to Manhattan. Within Manhattan, it might only be neighborhoods south of 23rd Street. And even there, “downtown” might still be reserved only for the taller buildings at the southern tip of the island. There is a similar problem with clustering; structures in data exist at many different levels.
Figure 11.5 At each iteration, all cluster assignments are reevaluated
Descriptions of K-means and related algorithms gloss over the selection of K. But since, in many cases, there is no a priori reason to select a particular value, there is really an outermost loop to these algorithms that occurs during analysis rather than in the computer program. This outer loop consists of performing automatic cluster detection using one value of K, evaluating the results, then trying again with another value of K or perhaps modifying the data. After each trial, the strength of the resulting clusters can be evaluated by comparing the average distance between records in a cluster with the average distance between clusters, and by other procedures described later in this chapter. These tests can be automated, but the clusters must also be evaluated on a more subjective basis to determine their usefulness for a given application. As shown in Figure 11.6, different values of K may lead to very different clusterings that are equally valid. The figure shows clusterings of a deck of playing cards for K = 2 and K = 4. Is one better than the other? It depends on the use to which the clusters will be put.
Figure 11.6 These examples of clusters of size 2 and 4 in a deck of playing cards illustrate that there is no one correct clustering.
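Here is a minimal sketch of that outer loop, reusing the k_means function and points array from the earlier sketch. The strength test shown is a simplified version of the comparison described above: the average distance from each record to its own cluster’s centroid, divided by the average distance between centroids (smaller ratios suggest tighter, better-separated clusters).

```python
from itertools import combinations
import numpy as np

def cluster_strength(records, assignments, centroids):
    """Smaller is better: within-cluster spread relative to between-centroid spread."""
    # Average distance from each record to the centroid of its own cluster.
    within = np.mean(np.linalg.norm(records - centroids[assignments], axis=1))
    # Average distance between pairs of cluster centroids.
    between = np.mean([np.linalg.norm(a - b) for a, b in combinations(centroids, 2)])
    return within / between

# Outer loop: try several values of K and compare the resulting clusterings.
for k in range(2, 5):
    labels, centers = k_means(points, k)          # from the earlier sketch
    print(k, round(cluster_strength(points, labels, centers), 3))
```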
Often the first time K-means clustering is run on a given set of data, most of the data points fall in one giant central cluster and there are a number of smaller clusters outside it. This is often because most records describe “normal” variations in the data, but there are enough outliers to confuse the clustering algorithm. This type of clustering may be valuable for applications such as identifying fraud or manufacturing defects. In other applications, it may be desirable to filter outliers from the data; more often, the solution is to massage the data values. Later in this chapter there is a section on data preparation for clustering, which describes how to work with variables to make it easier to find meaningful clusters.
Similarity and Distance
Once records in a database have been mapped to points in space, automatic cluster detection is really quite simple—a little geometry, some vector means, et voilà! The problem, of course, is that the databases encountered in marketing, sales, and customer support are not about points in space. They are about purchases, phone calls, airplane trips, car registrations, and a thousand other things that have no obvious connection to the dots in a cluster diagram.
Clustering records of this sort requires some notion of natural association; that is, records in a given cluster are more similar to each other than to records in another cluster. Since it is difficult to convey intuitive notions to a computer, this intuition must be translated into a numeric measure of similarity. The simplest approach is to treat the records as points in space and let geometric distance stand in for similarity, but there are two main problems with this approach:
■■ Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to properly be treated as components of a position vector.
■■ In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field.
The following section introduces several alternative measures of similarity.
Similarity Measures and Variable Type
Geometric distance works well as a similarity measure for well-behaved numeric variables. A well-behaved numeric variable is one whose value indicates its placement along the axis that corresponds to it in our geometric model. Not all variables fall into this category. For this purpose, variables fall into four classes, listed here in increasing order of suitability for the geometric model:
■■ Categorical variables
■■ Ranks
■■ Intervals
■■ True measures
Categorical variables only describe which of several unordered categories a thing belongs to. For instance, it is possible to label one ice cream pistachio and another butter pecan, but it is not possible to say that one is greater than the other or judge which one is closer to black cherry. In mathematical terms, it is possible to tell that X ≠ Y, but not whether X > Y or X < Y.
Ranks put things in order, but don’t say how much bigger one thing is than another. The valedictorian has better grades than the salutatorian, but we don’t know by how much. If X, Y, and Z are ranked A, B, and C, we know that X > Y > Z, but we cannot define X – Y or Y – Z.
Intervals measure the distance between two observations. If it is 56° in San Francisco and 78° in San Jose, then it is 22 degrees warmer at one end of the bay than the other.
Trang 15True measures are interval variables that measure from a meaningful zero
point This trait is important because it means that the ratio of two values of the variable is meaningful The Fahrenheit temperature scale used in the United States and the Celsius scale used in most of the rest of the world do not have this property In neither system does it make sense to say that a 30° day is twice as warm as a 15° day Similarly, a size 12 dress is not twice as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the hardness scale It does make perfect sense, however, to say that a 50-year-old
is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one Age, weight, length, customer tenure, and volume are examples of true measures
Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart.
These and other data transformation and preparation issues are discussed extensively in Chapter 17.
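As a small illustration (with made-up flavor codes, not from any real dataset), the snippet below shows the spurious closeness created by arbitrary numeric codes, along with one common workaround: a set of 0/1 indicator fields, one per category, so that any two distinct flavors end up the same distance apart.

```python
import numpy as np

# Arbitrary numeric codes for categorical values create spurious distances.
flavor_code = {"vanilla": 1, "pistachio": 5, "butter pecan": 6, "peppermint": 28}
print(abs(flavor_code["pistachio"] - flavor_code["butter pecan"]))   # 1  -> looks "similar"
print(abs(flavor_code["vanilla"] - flavor_code["peppermint"]))       # 27 -> looks "far apart"

# One common fix: a 0/1 indicator field per category, so every pair of
# distinct flavors is the same distance apart.
flavors = list(flavor_code)
def indicator(flavor):
    return np.array([1.0 if f == flavor else 0.0 for f in flavors])

d1 = np.linalg.norm(indicator("pistachio") - indicator("butter pecan"))
d2 = np.linalg.norm(indicator("vanilla") - indicator("peppermint"))
print(d1, d2)   # both sqrt(2): no spurious ordering
```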
Formal Measures of Similarity
There are dozens if not hundreds of published techniques for measuring the similarity of two records. Some have been developed for specialized applications such as comparing passages of text. Others are designed especially for use with certain types of data such as binary variables or categorical variables. Of the three presented here, the first two are suitable for use with interval variables and true measures, while the third is suitable for categorical variables.
Geometric Distance between Two Points
When the fields in a record are numeric, the record represents a point in n-dimensional space. The distance between the points represented by two records is used as the measure of similarity between them. If two points are close in distance, the corresponding records are similar.
There are many ways to measure the distance between two points, as discussed in the sidebar “Distance Metrics.” The most common one is the Euclidean distance familiar from high-school geometry. To find the Euclidean distance between X and Y, first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.
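For readers who want to see the calculation spelled out, here is a minimal sketch using two made-up records (the field meanings are hypothetical):

```python
import numpy as np

def euclidean_distance(x, y):
    """Square the differences along each axis, sum them, and take the square root."""
    differences = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(np.sum(differences ** 2))

# Two hypothetical records: (age, tenure in months, monthly spend).
x = [34, 12, 55.0]
y = [36, 10, 80.0]
print(euclidean_distance(x, y))   # same result as np.linalg.norm(np.subtract(x, y))
```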
DISTANCE METRICS
Any function that takes two points and produces a single number describing a relationship between them is a candidate measure of similarity, but to be a true distance metric, it must meet the following criteria:
◆ Distance(X,Y) = 0 if and only if X = Y
◆ Distance(X,Y) ≥ 0 for all X and all Y
◆ Distance(X,Y) = Distance(Y,X)
◆ Distance(X,Y) ≤ Distance(X,Z) + Distance(Z,Y)
These are the formal definition of a distance metric in geometry.
A true distance is a good metric for clustering, but some of these conditions can be relaxed. The most important conditions are the second and third (called identity and commutativity by mathematicians)—that the measure is 0 or positive and is well-defined for any two points. If two records have a distance of 0, that is okay, as long as they are very, very similar, since they will always fall into the same cluster.
The last condition, the Triangle Inequality, is perhaps the most interesting mathematically. In terms of clustering, it basically means that adding a new cluster center will not make two distant points suddenly seem close together. Fortunately, most metrics we could devise satisfy this condition.
Angle between Two Vectors
Sometimes it makes more sense to consider two records closely associated because of similarities in the way the fields within each record are related. Minnows should cluster with sardines, cod, and tuna, while kittens cluster with cougars, lions, and tigers, even though in a database of body-part lengths, the sardine is closer to a kitten than it is to a catfish.
The solution is to use a different geometric interpretation of the same data. Instead of thinking of X and Y as points in space and measuring the distance between them, think of them as vectors and measure the angle between them. In this context, a vector is the line segment connecting the origin of a coordinate system to the point described by the vector values. A vector has both magnitude (the distance from the origin to the point) and direction. For this similarity measure, it is the direction that matters.
Take the values for length of whiskers, length of tail, overall body length, length of teeth, and length of claws for a lion and a house cat and plot them as single points, and they will be very far apart. But if the ratios of lengths of these body parts to one another are similar in the two species, then the vectors will be nearly collinear.
The angle between vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared (see Figure 11.7). Actually, the sine of the angle is a better measure since it will range from 0 when the vectors are closest (most nearly parallel) to 1 when they are perpendicular. Using the sine ensures that an angle of 0 degrees is treated the same as an angle of 180 degrees, which is as it should be since, for this measure, any two vectors that differ only by a constant factor are considered similar, even if the constant factor is negative. Note that the cosine of the angle measures correlation; it is 1 when the vectors are parallel (perfectly correlated) and 0 when they are orthogonal.
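A short sketch of this measure, using hypothetical body-part measurements: the cosine comes from the dot product of the two vectors, and the sine derived from it ranges from 0 (parallel or antiparallel, hence similar proportions) to 1 (perpendicular).

```python
import numpy as np

def angle_similarity(x, y):
    """Return (cosine, sine) of the angle between two record vectors.

    The cosine is 1 for parallel vectors and 0 for orthogonal ones; the sine
    is 0 for (anti)parallel vectors and 1 for orthogonal ones, so it can serve
    as a dissimilarity that ignores overall magnitude.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cosine = np.clip(cosine, -1.0, 1.0)   # guard against rounding error
    sine = np.sqrt(1.0 - cosine ** 2)     # same value for 0 and 180 degrees
    return cosine, sine

# Hypothetical (whiskers, tail, body, teeth, claws) lengths in centimeters:
house_cat = [7, 25, 46, 1.0, 1.0]
lion      = [21, 90, 170, 6.0, 3.8]   # roughly the cat's proportions, scaled up
sardine   = [0.0, 5, 15, 0.1, 0.0]

print(angle_similarity(house_cat, lion))      # sine near 0: similar proportions
print(angle_similarity(house_cat, sardine))   # sine farther from 0
```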