Cleaning the Data
I now have the complete data set for modeling. The next step is to examine the data for errors, outliers, and missing values. This is the most time-consuming, least exciting, and most important step in the data preparation process. Luckily there are some effective techniques for managing this process.
First, I describe some techniques for cleaning and repairing data for continuous variables. Then I repeat the process for categorical variables.
Continuous Variables
To perform data hygiene on continuous variables, PROC UNIVARIATE is a useful procedure. It provides a great deal of information about the distribution of the variable, including measures of central tendency, measures of spread, and the skewness or degree of imbalance of the data. For example, the following code produces the output for examining the variable estimated income (inc_est).
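The call itself is a minimal sketch, assuming the data set acqmod.model and the variable name inc_est used throughout this chapter; the plot option requests the graphical output discussed below.

proc univariate data=acqmod.model plot;
var inc_est;
run;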
There is a lot of information in this univariate analysis. I just look for a few key things. Notice the measures in bold. In the moments section, the mean seems reasonable at $61.39224. But looking a little further, I detect some data issues. Notice that the highest value in the extreme values is 660. In Figure 3.5, the histogram and box plot provide a good visual analysis of the overall distribution and the extreme value, so I get another view of this one value. In the histogram, the bulk of the observations are near the bottom of the graph with the single high value near the top. The box plot also shows the limited range for the bulk of the data. The box area represents the central 50% of the data. The distance to the extreme value is very apparent. This point may be considered an outlier.
Outliers and Data Errors
An outlier is a single or low-frequency occurrence of the value of a variable that is far from the mean as well as from the majority of the other values for that variable. Determining whether a value is an outlier or a data error is an art as well as a science. Having an intimate knowledge of your data is your best strength.
Figure 3.4 Initial univariate analysis of estimated income.
Figure 3.5 Histogram and box plot of estimated income.
Common sense and good logic will lead you to most of the problems. In our example, the one value that seems questionable is the maximum value (660). It could have an extra zero. One way to see if it is a data error is to look at some other values in the record. The variable estimated income group (inc_grp) serves as a check for the value. The following code prints the record:
proc print data=acqmod.model(where=(inc_est=660));
run;
Based on the information provided with the data, I know the range of incomes in group K to be between $65,000 and $69,000. This leads us to believe that the value 660 should be 66. I can verify this by running a PROC MEANS for the remaining records in group K:
proc means data=acqmod.model maxdec = 2;
where inc_grp = 'K' and inc_est ^= 660;
var inc_est;
run;
The following SAS output validates our suspicion. All the other prospects with estimated income group = K have estimated income values between 65 and 69.
Analysis Variable : INC_EST (K)
N Mean Std Dev Minimum Maximum
To correct the error, I create a new variable, inc_est2, and leave the original variable, inc_est, untouched:

data acqmod.model;
set acqmod.model;
if inc_est = 660 then inc_est2 = 66;
else inc_est2 = inc_est;
run;
Figure 3.6 Histogram and box plot of estimated income with corrections.
If you have hundreds of variables, you may not want to spend a lot of time on each variable with missing or incorrect values. Time-consuming techniques for correction should be used sparingly. If you find an error and the fix is not obvious, you can treat it as a missing value.
Outliers are common in numeric data, especially when dealing with monetary variables. Another method for dealing with outliers is to develop a capping rule. This can be accomplished easily using some features in PROC UNIVARIATE. The following code produces an output data set with the standard deviation (incstd) and the 99th percentile value (inc99) for estimated income (inc_est).
proc univariate data=acqmod.model noprint;
var inc_est;
output out=incdata std=incstd pctlpts=99 pctlpre=inc; /* creates incstd and inc99 */
run;

data acqmod.model;
set acqmod.model;
if (_n_ eq 1) then set incdata(keep= incstd inc99);
if incstd > 2*inc99 then inc_est2 = min(inc_est,(4*inc99));
else inc_est2 = inc_est;
run;
The capping rule in the DATA step is just one example of a rule for capping the values of a variable. It looks at the spread by seeing if the standard deviation is greater than twice the value at the 99th percentile. If it is, it caps the value at four times the 99th percentile. This still allows for generous spread without allowing in obvious outliers. This particular rule only works for variables with positive values. Depending on your data, you can vary the rules to suit your goals.
Missing Values
As information is gathered and combined, missing values are present in almost every data set. Many software packages ignore records with missing values, which makes them a nuisance. The fact that a value is missing, however, can be predictive. It is important to capture that information.

Consider the direct mail company that had its customer file appended with data from an outside list. Almost a third of its customers didn't match to the outside list. At first this was perceived as negative. But it turned out that these customers were much more responsive to offers for additional products. After further analysis, it was discovered that these customers were not on many outside lists. This made them more responsive because they were not receiving many direct mail offers from other companies. Capturing the fact that they had missing values improved the targeting model.
In our case study, we saw in the univariate analysis that we have 84 missing values for income. The first step is to create an indicator variable to capture the fact that the value is missing for certain records. The following code creates a variable to capture the information:
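A minimal sketch of that step; the indicator name inc_miss and the use of the corrected variable inc_est2 are assumptions, not fixed by the text.

data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_miss = 1; /* flag records with missing income */
else inc_miss = 0;
run;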
Single Value Substitution
Single value substitution is the simplest method for replacing missing values. There are three common choices: mean, median, and mode. The mean value is based on the statistical least-squares-error calculation. This introduces the least variance into the distribution. If the distribution is highly skewed, the median may be a better choice. The following code substitutes the mean value for estimated income (inc_est2):
data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_est3 = 61;
else inc_est3 = inc_est2;
run;
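As an aside, the same replacement can be done without hard-coding the mean. This is a sketch of one alternative, not the approach used in the case study; it assumes PROC STDIZE, whose REPONLY option replaces only the missing values (here with the mean) and overwrites inc_est2 in place rather than creating inc_est3.

proc stdize data=acqmod.model out=acqmod.model method=mean reponly;
var inc_est2;
run;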
Class Mean Substitution
Class mean substitution uses the mean values within subgroups of other variables or combinations of variables. This method maintains more of the original distribution. The first step is to select one or two variables that may be highly correlated with income. Two values that would be highly correlated with income are home equity (hom_equ) and inferred age (infd_ag). The goal is to get the average estimated income for cross-sections of home equity ranges and age ranges for observations where estimated income is not missing. Because both variables are continuous, a data step is used to create the group variables, age_grp and homeq_r. PROC TABULATE is used to derive and display the values.
data acqmod.model;
set acqmod.model;
length age_grp $ 5 homeq_r $ 10; /* explicit lengths so the longer range labels are not truncated */
if 25 <= infd_ag <= 34 then age_grp = '25-34'; else
if 35 <= infd_ag <= 44 then age_grp = '35-44'; else
if 45 <= infd_ag <= 54 then age_grp = '45-54'; else
if 55 <= infd_ag <= 65 then age_grp = '55-65';
if 0 <= hom_equ<=100000 then homeq_r = '$0 -$100K'; else
if 100000<hom_equ<=200000 then homeq_r = '$100-$200K'; else
if 200000<hom_equ<=300000 then homeq_r = '$200-$300K'; else
if 300000<hom_equ<=400000 then homeq_r = '$300-$400K'; else
if 400000<hom_equ<=500000 then homeq_r = '$400-$500K'; else
if 500000<hom_equ<=600000 then homeq_r = '$500-$600K'; else
if 600000<hom_equ<=700000 then homeq_r = '$600-$700K'; else
if 700000<hom_equ then homeq_r = '$700K+';
run;

proc tabulate data=acqmod.model;
class homeq_r age_grp;
var inc_est2;
table homeq_r='Home Equity',age_grp='Age Group'*
inc_est2=' '*mean=' '*f=dollar6.
/rts=13;
run;
The output in Figure 3.7 shows a strong variation in average income among the different combinations of home equity and age group. Using these values for missing value substitution will help to maintain the distribution of the data.
Figure 3.7 Values for class mean substitution.
The final step is to develop an algorithm that will create a new estimated income variable (inc_est3) that has no missing values.
data acqmod.model;
set acqmod.model;
if inc_est2 = . then do;
if 25 <= infd_ag <= 34 then do;
if 0 <= hom_equ<=100000 then inc_est3= 47; else
if 100000<hom_equ<=200000 then inc_est3= 70; else
if 200000<hom_equ<=300000 then inc_est3= 66; else
if 300000<hom_equ<=400000 then inc_est3= 70; else
if 400000<hom_equ<=500000 then inc_est3= 89; else
if 500000<hom_equ<=600000 then inc_est3= 98; else
if 600000<hom_equ<=700000 then inc_est3= 91; else
if 700000<hom_equ then inc_est3= 71;
end; else
if 35 <= infd_ag <= 44 then do;
if 0 <= hom_equ<=100000 then inc_est3= 55; else
if 100000<hom_equ<=200000 then inc_est3= 73; else
" " " "
" " " "
if 700000<hom_equ then inc_est3= 101;
end; else
if 45 <= infd_ag <= 54 then do;
if 0 <= hom_equ<=100000 then inc_est3= 57; else
if 100000<hom_equ<=200000 then inc_est3= 72; else
" " " "
" " " "
if 700000<hom_equ then inc_est3= 110;
end; else
if 55 <= infd_ag <= 65 then do;
if 0 <= hom_equ<=100000 then inc_est3= 55; else
if 100000<hom_equ<=200000 then inc_est3= 68; else
" " " "
" " " "
end;
end;
else inc_est3 = inc_est2;
run;
Regression Substitution

Another way to replace missing values is to use a regression model, built on the records where the value is present, to predict it. In our case study, I derive values for estimated income (inc_est2) using the continuous form of age (infd_ag), the mean for each category of home equity (hom_equ), total line of credit (credlin), and total credit balances (tot_bal). The following code performs a regression analysis and creates an output data set (reg_out) with the predictive coefficients.
proc reg data=acqmod.model outest=reg_out;
inc_reg: model inc_est2 = infd_ag hom_equ credlin tot_bal; /* the label inc_reg names the scored value */
run;
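The coefficients in reg_out must then be applied to every record to produce the predicted value inc_reg that the next step uses. That scoring step is not shown above; the following is a minimal sketch of one way to do it, assuming PROC SCORE with the same predictor list.

proc score data=acqmod.model score=reg_out out=acqmod.model type=parms predict;
var infd_ag hom_equ credlin tot_bal;
run;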
data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_est3 = inc_reg;
else inc_est3 = inc_est2;
run;
Figure 3.8 Output for regression substitution.
One of the benefits of regression substitution is its ability to sustain the overall distribution of the data. To measure the effect on the spread of the data, I look at a PROC MEANS for the variable before (inc_est2) and after (inc_est3) the substitution.
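A minimal sketch of that comparison; the particular statistics requested here are assumptions.

proc means data=acqmod.model n nmiss mean std min max maxdec=2;
var inc_est2 inc_est3;
run;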
Figure 3.9 Means comparison of missing replacement.
Categorical Variables

Next I repeat the process for the categorical variables, using a simple frequency with the missing option to examine the values of each one:

proc freq data=acqmod.model;
table pop_den trav_cd bankcrd apt_ind clustr1 inc_grp sgle_in opd_bcd
occu_cd finl_id hh_ind gender ssn_ind driv_in mob_ind mortin1 mortin2
autoin1 autoin2 infd_ag age_ind dob_yr homeq_r childin homevlr clustr2
/ missing;
run;
In Figure 3.10, I see that population density (pop_den) has four values: A, B, C, and P. I requested the missing option in our frequency, so I can see the number of missing values. The data dictionary states that the correct values for pop_den are A, B, and C. I presume that the value P is an error. I have a couple of choices to remedy the situation: I can delete it or replace it. For the purposes of this case study, I give it the value of the mode, which is C.
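A minimal sketch of that recode; whether the erroneous value is overwritten in place, as here, or written to a new variable is an assumption.

data acqmod.model;
set acqmod.model;
if pop_den = 'P' then pop_den = 'C'; /* recode the erroneous value to the mode */
run;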
Figure 3.10 Frequency of population density.
Summary
In this chapter I demonstrated the process of getting data from its raw form into useful pieces of information. The process uses techniques ranging from simple graphs to complex univariate outputs. But, in the end, it becomes obvious that with the typically large sample sizes used in marketing, it is necessary to use these techniques effectively. Why? Because the quality of the work from this point on is dependent on the accuracy and validity of the data.

Now that I have the ingredients, that is, the data, I am ready to start preparing it for modeling. In chapter 4, I use some interesting techniques to select the final candidate variables. I also find the form or forms of each variable that maximize the predictive power of the model.
Chapter 4—
Selecting and Transforming the Variables
At this point in the process, the data has been carefully examined and refined for analysis. The next step is to define the goal in technical terms. For our case study, the objective is to build a net present value (NPV) model for a direct mail life insurance campaign. In this chapter, I will describe the components of NPV and detail the plan for developing the model.

Once the goal has been defined, the next step is to find a group of candidate variables that show potential for having strong predictive power. This is accomplished through variable reduction. To select the final candidate variables, I use a combination of segmentation, transformation, and interaction detection.
Defining the Objective Function
In chapter 1, I stressed the importance of having a clear objective. In this chapter, I assign a technical definition — called the objective function — to the goal. Recall that the objective function is the technical definition of your business goal.

In our case study, the first goal is to predict net present value. I define the objective function as the value in today's dollars of future profits for a life insurance product. The NPV for this particular product consists of four major components: the probability of activation, the risk index, the product profitability, and the marketing expense. They are each defined as follows:
Probability of activation. A probability calculated by a model. The individual must respond, be approved by risk, and pay his or her first premium.

Risk index. Indices in a matrix of gender by marital status by age group, based on actuarial analysis. This value can also be calculated using a predictive model.

Product profitability. Present value of a product-specific, three-year profit measure that is provided by the product manager.

Marketing expense. Cost of package, mailing, and processing (approval, fulfillment).
The final model is a combination of these four components:
Net Present Value = P(Activation) × Risk Index × Product Profitability – Marketing Expense
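As a purely hypothetical illustration (these are not the case study's values): with P(Activation) = .02, a risk index of 1.10, product profitability of $800, and a marketing expense of $.75 per piece, the NPV per prospect mailed would be .02 × 1.10 × $800 – $.75 = $16.85.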
For our case study, I have specific methods and values for these measures.
Probability of Activation
To calculate the probability of activation, I have two options: build one model that predicts activation, or build two models, one for response and one for activation given response (a model built on just the responders, to target actives). To determine which method works better, I will develop the model both ways.
Method 1: One model. When using one model, the goal is to target active accounts, that is, those responders who paid their first premium. To use the value "activate" in the analysis, it must be in a numeric form, preferably 0 or 1. We know from the frequency in chapter 3 that the values for the variable, activate, are as follows: activated = 1, responded but not activated = 0, no response = . (missing). To model activation from the original offer, I must give nonresponders the value of 0. I create a new variable, active, and leave the original variable, activate, untouched.
data acqmod.model2;
set acqmod.model2;
if activate = . then active = 0;
else active = activate;
run;
We will now use the variable active as our dependent variable.
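A quick frequency verifies the recode; this check is a sketch and is not shown in the text.

proc freq data=acqmod.model2;
table active /missing;
run;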
Method 2: Two models. When using two models, the goal is to build one model to target response and a second model to target activation, given response. The probability of activation, P(A), is the product of the probability of response, P(R), times the probability of activation given response, P(A|R). For this method, I do not have to recode any of the dependent variables.