Figure 5.46 shows that males who have a relatively long tenure (≥ 1.125 years) and who come from relatively small firms (≤ 0.25) or, at the other extreme, from relatively large firms (> 1.75) are most likely to attend: 7.6 percent. This places this group at about the same level as the overall attendance rate of 10 percent and indicates that these people can be targeted as a means of increasing loyalty and lifetime value.
As shown in Figure 5.47, females who come from firms with relatively low annual sales and from a midrange size of firm (> 1.75 and ≤ 3.25) are also good targets. This group had an attendance rate of 14.85 percent. Notice that there are only 14 "positive" occurrences of attendance in this node.
Figure 5.46 Increased response among males with high tenure
Figure 5.47 Example of response by selected female attributes
This is a relatively small number on which to base results, even though the results are statistically valid.
There are no attendances in the Annual Sales > 1.25 node and in the Size of Firm ≤ 1.75 or > 3.25 node, shown in Figure 5.48. We see that there are only 6 of 724 "positive" cases (less than 1 percent). Six cases is a very small number on which to base marketing results and, while it may be possible to demonstrate that the results are statistically valid from a theoretical point of view, it is definitely recommended to verify them against a holdout sample or validation database to see whether they could be expected to generalize to a new marketing target population.
5.11 Clustering (creating segments) with cluster analysis
Cluster analysis allows us to segment the target population reflected in our database on the basis of shared similarities among a number of attributes. So, unlike decision trees, it is not necessary to specify a particular outcome to be used to determine various classes, discriminators, and predictors. Rather, we just need to specify which fields we want the data mining clustering algorithm to use when assessing the similarity or dissimilarity of the cases being considered for assignment to the various clusters.
To begin the data mining modeling task it is necessary to specify the source data. As with the decision tree developed in the previous section, we will point the Data Mining wizard at the Conferences.mdb data source and pick up the customer table as the analysis target. As shown in Figure 5.49, in this case we will be clustering on customers and will use their shared similarities according to various characteristics or attributes to determine to which cluster they belong.

Figure 5.48 A small number of "positive" cases
Once the target data table has been identified, the Modeling wizard will request us to specify the data mining technique. As shown in Figure 5.50, select clustering as the data mining method.
As in all data mining models, we are asked to indicate the level of analysis. This is contained in the case key selected for the analysis. As shown in Figure 5.51, at this point we want the level of analysis to be the customer level, so we specify the customer as the key field.

The Analysis wizard then asks us to specify the fields that will be used to form the clusters. These are the fields that will be used to collectively gauge the similarities and dissimilarities between the cases to form the customer clusters. We select the fields shown in Figure 5.52.
Once the fields have been selected, we can continue to run the cluster model. After processing, we get the results presented in Figure 5.53.
Figure 5.51 Selecting the case key to define the level of analysis
In Figure 5.53 we see that, by default, the cluster procedure has identified ten clusters. The content detail and content navigator areas use color to represent the density of the number of observations.
We can browse the attribute results to look at the characteristics of the various clusters. Although we can be confident that the algorithm has forced the clusters into ten homogeneous but optimally distinct groups, if we want to understand the characteristics of the groups, it may be preferable to tune the clustering engine to produce a smaller number of clusters. Three clusters accomplish this. There are many different quantitative tests to determine the appropriate number of clusters in an analysis. In many cases, as illustrated here, the choice is made on the basis of business knowledge and hunches about how many distinct customer groupings actually exist. Having determined that three clusters are appropriate, we can select the properties dialog and change the number of clusters from ten to three. This is shown in Figure 5.54.

Figure 5.53 Default display produced by the cluster analysis modeling procedure
This instructs Analysis Server to recalculate the cluster attributes and members by trying to identify three clusters rather than the default ten. To complete this recalculation, you need to go back to the data mining model, reprocess the model, and then browse the model to see the new results. The new results are displayed in Figure 5.55.
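For readers who prefer to work outside the properties dialog, the same setting can be expressed in the OLE DB for DM language that underlies Analysis Services. The following is a minimal sketch, not taken from the book's screens: the column names and types are assumptions based on the fields used in this example, and it relies on the Microsoft_Clustering algorithm's CLUSTER_COUNT parameter to request three clusters.

    CREATE MINING MODEL [CustomerSegments]
    (
        [Customer Id]  LONG   KEY,
        [Gender]       TEXT   DISCRETE,
        [Tenure]       DOUBLE CONTINUOUS,
        [Size Of Firm] DOUBLE CONTINUOUS,
        [Annual Sales] DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering (CLUSTER_COUNT = 3)

Reprocessing a model created this way has the same effect as changing the cluster count in the dialog and reprocessing, as described above.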
As shown in Figure 5.55, the attributes pane shows which decision rules can be used to characterize the cluster membership. Each decision rule will result in classifying a case into a unique cluster. The cluster that is found will depend upon how the preconditions of the cluster decision rule match up to the specific attributes of the case being classified.
Here are the decision rules for classifying cases (or records) into the three clusters. Note that the fields used as preconditions of the decision rules are the same fields we indicated should be used to calculate similarity in the Mining Model wizard:
Cluster 1: Size Of Firm = 0, Annual Sales = 0, 0.100000001490116 ≤ Tenure ≤ 1.31569222413047, Gender = M

Cluster 2: 6.65469534945513 ≤ Size Of Firm ≤ 9, 1.06155892122041 ≤ Annual Sales ≤ 9, 0.100000001490116 ≤ Tenure ≤ 3.00482080240072, Gender = F

Cluster 3: Size Of Firm ≤ 0, Tenure ≤ 0.100000001490116, 0 ≤ Annual Sales ≤ 5.18296067118255, Gender = F
These decision rules provide a statistical summary of the cases in the data set once they have been classified into the various clusters. Here we can see that Cluster 1 characterizes customers from generally small, generally low sales volume firms; Cluster 1 members also have generally short tenure. Cluster 3 is primarily a female cluster and has the very short tenure members, while Cluster 2 draws on customers from the larger, high sales volume firms.
This tends to suggest that small, low sales volume customers are mostly male. Female customers are either longer-term customers from generally larger, higher sales companies or very short-term customers from small, medium sales companies.
We can see here that clustering techniques and decision tree techniques produce different kinds of results: the decision tree was produced purely with respect to probability of response. The clusters, on the other hand, are produced with respect to Tenure, Gender, Size of Firm, and Annual Sales. In fact, in clustering, probability of response was specifically excluded.
As indicated in Chapter 3, mining models are stored as Decision Support Objects in the database. The models contain all the information necessary to recreate themselves, but they need to be refreshed in order to respond to new data or new settings. To retrieve a previously grown mining model, go to Analysis Services and select the mining model you want to look at. For example, as shown in Figure 5.56, open the Analysis Server file tree and highlight the previously produced mining model entitled "PromoResults."

Go to the Action menu, or right-click the mouse, and execute Refresh. This will bring the mining results back. Once the model is refreshed, go to the Action menu and select Browse to look at the model results.
5.12 Confirming the model through validation
It is important to test the results of modeling activities to ensure that the relationships that have been uncovered will bear up over time and will hold true in a variety of circumstances. This matters in a target marketing application, for example, where considerable sums will be invested in the targeting campaign. This investment is based on the model results, so they had better be right! The best way to determine whether a relationship is right is to see whether it holds up in a new set of data drawn from the modeled population. In essence, in a target marketing campaign, we would like to apply the results of the analysis to a new set of data, where we already know the answer (whether people responded or not), to see how well our model performs.

This is done by creating a "test" data set (sometimes called a "holdback" sample), which is typically drawn from the database to be analyzed before the analytical model is developed. This way we can create a test data set that hasn't been used to develop the model. This test data set is independent of the training (or learning) data set, so it can serve as a proxy for how a new data set would perform in a model deployment situation. Of course, since the test data set was extracted from the original database, it contains the answer; therefore, it can be used to calculate the validity of the model results. Validity consists of accuracy and reliability: how accurately do we reproduce the results in a test data set, and how reliable is this finding? Reliability is best tested with numerous data sets, drawn in different sets of circumstances over time; it accumulates as we continue our modeling and validation efforts. Accuracy can be calculated on the basis of the test data set results.
Qualitative questions, such as respond/did not respond, result in decision trees where the nodes on the branches of the tree show a frequency distribution (e.g., 20 percent respond; 80 percent do not respond). In this case the decision tree indicates that the majority of cases will not respond. To validate this predicted outcome, a test or holdback sample data set is used. Each data record in the test sample is checked against the prediction suggested by the decision tree. If the prediction is correct, then the valid score indicator is incremented; if the prediction is incorrect, then the invalid score indicator is incremented. At the end of the validation procedure the percentage of valid scores to invalid scores is calculated. This is then displayed as the percentage accuracy of the validated decision tree model.
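The counting procedure just described is easy to reproduce once the holdout records have been scored. The following T-SQL is a minimal sketch, assuming a hypothetical table HoldbackScores with one row per test record and hypothetical columns Actual and Predicted holding the observed outcome and the outcome predicted by the model:

    SELECT
        -- count of records the model classified correctly
        SUM(CASE WHEN Actual = Predicted THEN 1 ELSE 0 END)  AS ValidScores,
        -- count of records the model classified incorrectly
        SUM(CASE WHEN Actual <> Predicted THEN 1 ELSE 0 END) AS InvalidScores,
        -- percentage accuracy of the validated model
        100.0 * SUM(CASE WHEN Actual = Predicted THEN 1 ELSE 0 END) / COUNT(*)
            AS PercentAccuracy
    FROM HoldbackScores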
In the case of a quantitative outcome, such as dollars spent, accuracy can be calculated using variance explained according to a linear regression model calculated in a standard statistical manner. In this case, some of the superior statistical properties of regression are used in calculating the accuracy of the decision tree. This is possible because a decision tree with quantitative data summarized in each of its nodes is actually a special type of regression model. So the statistical test of variance explained, normally used in regression modeling, can be used with decision trees.
Thus, the value of a quantitative field in any given node is computed as the values of the predictors multiplied by the coefficient for each predictor that is derived in calculating the regression equation. In a perfect regression model this calculation will equal the observed value in the node and the prediction will be perfect. When there is less than a perfect prediction, the observed value deviates from the predicted value. The deviations between these scores, or residuals, represent the unexplained variance of the regression model.
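In standard statistical notation (a formalization not spelled out in the text itself), variance explained is the familiar R-squared of the regression, computed from the residuals just described:

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]

where \(y_i\) is the observed value for case \(i\), \(\hat{y}_i\) is the value predicted by the node's regression equation, and \(\bar{y}\) is the overall mean. A perfect prediction gives \(R^2 = 1\); the closer \(R^2\) is to 1, the more accurate the decision tree's quantitative estimates.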
The accuracy that you find acceptable depends upon the circumstances. One way to determine how well your model performs is to compare its performance with chance. In our example, there were about 67 percent, or two-thirds, responders and about one-third nonresponders. So, by chance alone, we expect to be able to correctly determine whether someone responds two-thirds of the time. Clearly, then, we would like to have a model that provides, say, an 80 percent accuracy rate. The difference between the model accuracy rate of 80 percent and the accuracy rate given by chance (67 percent) represents the gain from using the model; in this case the gain is about 13 percent. In general, this 13 percent gain means that we will have lowered targeting costs and increased profitability from a given targeting initiative.
5.13 Summary
Enterprise data can be harnessed, profitably and constructively, in a number of ways to support decision making in a wide variety of problem areas. The "trick" is to deploy the best pattern-searching tools available to look through the enterprise data store in order to find all relevant data points to support a decision and, more importantly, to determine how these data points interact with one another to affect the question under examination.
This is where decision tree products show their value as an enterprise data and knowledge discovery tool. Decision trees search through all relevant data patterns, and combinations of patterns, and present the best combinations to the user in support of decision making. The decision tree algorithm quickly rejects apparent (spurious) relationships and presents those combinations of patterns that, together, produce the effect under examination. It is both multidimensional, in an advanced numerical processing manner, and easy to use in its ability to support various user models of the question under examination.
In summary, the SQL 2000 decision tree presents critical benefits in support of the enterprise knowledge discovery mission, as follows:
 It is easy to use and supports the user's view of the problem domain.
 It works well with real-world enterprise data stores, including data that are simple, such as "male" and "female," and data that are complex, such as rate of investment.
 It is sensitive to all relevant relationships, including complex, multiple relationships, yet quickly rejects weak relationships or relationships that are spurious (i.e., relationships that are more apparent than real).
 It effectively summarizes data by forming groups of data values that belong together in clusters, or branches, on the decision tree display.
 It employs advanced statistical hypothesis testing procedures and validation procedures to ensure that the results are accurate and reproducible (in a simple manner, behind the scenes, so that users do not need a degree in statistics to use these procedures).
 The resulting display can be quickly and easily translated into decision rules, predictive rules, and even knowledge-based rules for deployment throughout the enterprise.
In a similar manner, the SQL 2000 clustering procedure provides critical benefits when a particular field or outcome does not form the focus of the study. This is frequently the case when, for example, you are interested in placing customers into various classes or segments that will be used to describe their behavior in a variety of circumstances. As with the decision tree, the cluster procedure provides a range of benefits, including the following:
 The Data Mining wizard makes it easy to use.
 It is possible to define, ahead of time, how many clusters you feel are appropriate to describe the phenomenon (e.g., customer segments) that you are viewing.
 The attributes of the clusters can be easily examined.
 As with decision tree models, cluster models can be described and deployed in a variety of ways throughout the enterprise.
Knowledge discovery is similar to an archeological expedition: you need to sift through a lot of "dirt" in order to find the treasure buried beneath the dig. There is no shortage of dirt in enterprise operational data, but there are real treasures buried in the vaults of the enterprise data store. The SQL Server 2000 decision tree is an indispensable tool for sifting through multitudes of potential data relationships in order to find the critical patterns in data that demonstrate and explain the mission-critical principles necessary for success in the knowledge-based enterprise of the twenty-first century.
6 Deploying the Results
The customer generates nothing. No customer asked for electric lights.
—W. Edwards Deming
W. Edwards Deming, in his tireless efforts to apply statistical concepts in the pursuit of ever-higher-quality outputs, was instrumental in changing the way products are conceived, designed, and developed. In the Deming model, novelty emerged from the determined application of quality principles throughout the product life cycle. The Deming quality cycle is similar to the data mining closed-loop "virtuous cycle" outlined in Chapter 2. The application of data mining results, and the lessons to be learned from applying these results to the product/market life cycle, begins with the deployment of the results. In the same way that the tireless application of quality principles throughout the product life cycle has been shown to lead to revolutionary new ways of conceiving, designing, and delivering products, so too can the tireless deployment of data mining results to market interventions lead to similar revolutionary developments in the product and marketing life cycle.
If information is data organized for decision making, then knowledge could be termed data organized for deployable action. As seen in the example developed in Chapter 5, data, once analyzed, yield up their secrets in the form of a decision rule, a probability to purchase, or perhaps a cluster assignment. The deployment of the decision rule, probability to purchase, or cluster assignment could be embodied in a specific point solution (e.g., a customer call center). Another deployment could be a multistage, multichannel delivery system (e.g., in what is normally referred to as campaign management, sending offers to prospects through several channels with various frequencies and periodicities while measuring the results in order to determine not only which characteristics of prospects are most likely to lead to responses but also which approach methods, frequencies, and timing provide incremental response "lift"). In assessing the potential value of a deployable result, it is common to talk about lift. Lift is a measure of the incremental value obtained by using the data mining results, compared with the results that would normally be achieved without the knowledge derived from data mining.
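One common way to make this definition concrete, though the formula itself is not given in the text, is to express lift as a ratio of response rates:

\[
\text{lift} = \frac{\text{response rate among model-selected targets}}{\text{response rate among targets selected at random}}
\]

A lift of 2, for example, means the model-selected group responds at twice the baseline rate.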
In this chapter, we will show how to implement a target marketing model by scoring the customer database with a predictive query. We also show how to estimate return on investment with a lift chart.
6.1 Deployments for predictive tasks (classification)
Data Transformation Services (DTS), which are located in the Enterprise Services of SQL Server 2000, can be used to build a prediction query to score a data set using a predictive model developed in Analysis Manager. The prediction query is used to score unseen cases and is stored as a DTS package. This package can be scheduled for execution using DTS to trigger the package at any time in the future, under a variety of conditions. This is a very powerful way to create knowledge about unclassified cases, customers who have responded to a promotional offer, or customers who have visited your Web site.
Figure 6.1 illustrates the procedure necessary to create a prediction query. First, start up DTS Designer by going to the desktop Start menu, selecting Programs, pointing to Microsoft SQL Server, and then selecting Enterprise Manager. In the console tree, expand the server that will hold the package containing the Data Mining Prediction Query task. Right-click the Data Transformation Services folder, and then click New Package.

In DTS Designer, from the Task tool palette, drag the icon for the Data Mining Prediction Query task onto the workspace. This icon appears in the DTS Package (New Package) dialog shown in Figure 6.1.
As an option, as shown in Figure 6.1, in the Data Mining Prediction Query Task dialog box, type a new name in the Name box to replace the default name for the task. As another option, type a task description in the Description box; this description is used to identify the task in DTS Designer. In the Server box, type the name of the Analysis Server containing the data mining model to be used as the source for the prediction query. The server name (MetaGuide in the example in Figure 6.1) is the same as the computer name on the network.
From the Database list, select the database that contains the mining model to be queried. Here we select ConferenceResults, which provides either the CustomerSegments data mining model (DMM) for the cluster results or the PromoResults DMM for the response analysis.
If the mining model you want to use for the prediction query is not already highlighted in the Mining Models box, select a mining model from the box by clicking its name or icon. As shown in Figure 6.1, you can view some of the properties of the mining model in the Details box.

Click the Query tab and then, in the Data source box, either type a valid ActiveX Data Objects (ADO) connection string to the case table containing the input and predictable columns for the query or, to build the connection string, click the edit (...) button to launch the Data Link Properties dialog box.
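For reference, a connection string for a case table stored in an Access database has the following form; the file path here is hypothetical and would be replaced by the actual location of Conferences.mdb:

    Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Conferences.mdb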
Figure 6.1 Building a prediction query task in DTS Designer
In the Prediction query box, type the syntax, or click New Query to launch the Prediction Query Builder, as shown in Figure 6.2. If you choose to build the query yourself, note that the prediction query syntax must conform to the OLE DB for DM specification. For more information about the OLE DB for DM specification, see the list of links on the SQL Server Documentation Web site at http://www.microsoft.com/sql/support/docs.htm.
As shown in Figure 6.2, once the Prediction Query Builder has launched, you can build a query. The query builder asks you to associate the prediction with a new table.
Once the associations are made, the query builder is complete, and you can click OK to finish creating the task.
When the query is executed, it produces the query code shown in the query output box, displayed in Figure 6.3.
As can be seen by examining the code produced by the Prediction Query Builder, prediction queries are run by means of the SELECT statement. The prediction query syntax is as follows:
    SELECT [FLATTENED] <SELECT-expressions>
    FROM <mining model name>
    PREDICTION JOIN <source data query>
    ON <join condition>
    [WHERE <WHERE-expression>]
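To make the syntax concrete, here is a minimal sketch of a prediction query against the PromoResults model from Chapter 5. The column names (Customer Id, Attend, and the input fields) and the file path are assumptions for illustration; the builder generates the equivalent statement using the actual columns of your case table:

    SELECT FLATTENED
        t.[Customer Id],
        [PromoResults].[Attend] AS [Predicted Attend]
    FROM [PromoResults]
    PREDICTION JOIN
        OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                   'Data Source=C:\Data\Conferences.mdb',
                   'SELECT * FROM Customer') AS t
    ON  [PromoResults].[Gender]       = t.[Gender]       AND
        [PromoResults].[Tenure]       = t.[Tenure]       AND
        [PromoResults].[Size Of Firm] = t.[Size Of Firm] AND
        [PromoResults].[Annual Sales] = t.[Annual Sales]

The ON clause maps each input column of the mining model to the corresponding column of the case table, and the SELECT list returns each case key alongside the model's predicted outcome.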
Figure 6.2 DTS Prediction Query Builder