Cleaning the Data
I now have the complete data set for modeling. The next step is to examine the data for errors, outliers, and missing values. This is the most time-consuming, least exciting, and most important step in the data preparation process. Luckily there are some effective techniques for managing this process.
First, I describe some techniques for cleaning and repairing data for continuous variables. Then I repeat the process for categorical variables.
Continuous Variables
To perform data hygiene on continuous variables, PROC UNIVARIATE is a useful procedure. It provides a great deal of information about the distribution of the variable, including measures of central tendency, measures of spread, and the skewness or degree of imbalance of the data. For example, the following code produces the output for examining the variable estimated income (inc_est).
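The call itself is a minimal sketch, assuming the data set acqmod.model and the variable name inc_est used throughout this chapter; the plot option requests the graphical output discussed below.

proc univariate data=acqmod.model plot;
var inc_est;
run;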
There is a lot of information in this univariate analysis. I just look for a few key things. Notice the measures in bold. In the moments section, the mean seems reasonable at $61.39224. But looking a little further, I detect some data issues. Notice that the highest value in the extreme values is 660. In Figure 3.5, the histogram and box plot provide a good visual analysis of the overall distribution and the extreme value, so I get another view of this one value. In the histogram, the bulk of the observations are near the bottom of the graph with the single high value near the top. The box plot also shows the limited range for the bulk of the data. The box area represents the central 50% of the data. The distance to the extreme value is very apparent. This point may be considered an outlier.
Outliers and Data Errors
An outlier is a single or low-frequency occurrence of the value of a variable that is far from the mean as well as from the majority of the other values for that variable. Determining whether a value is an outlier or a data error is an art as well as a science. Having an intimate knowledge of your data is your best strength.
Figure 3.4 Initial univariate analysis of estimated income.
Figure 3.5 Histogram and box plot of estimated income.
Common sense and good logic will lead you to most of the problems. In our example, the one value that seems questionable is the maximum value (660). It could have an extra zero. One way to see if it is a data error is to look at some other values in the record. The variable estimated income group (inc_grp) serves as a check for the value. The following code prints the record:
proc print data=acqmod.model(where=(inc_est=660));
run;
Based on the information provided with the data, I know the range of incomes in group K to be between $65,000 and $69,000. This leads us to believe that the value 660 should be 66. I can verify this by running a PROC MEANS for the remaining records in group K:
proc means data=acqmod.model maxdec = 2;
where inc_grp = 'K' and inc_est ^= 660;
var inc_est;
run;
The following SAS output validates our suspicion. All the other prospects with estimated income group = K have estimated income values between 65 and 69.
Analysis Variable : INC_EST (K)
N Mean Std Dev Minimum Maximum
To correct the error, I create a new variable, inc_est2, and leave the original variable, inc_est, untouched:

data acqmod.model;
set acqmod.model;
if inc_est = 660 then inc_est2 = 66;
else inc_est2 = inc_est;
run;
Figure 3.6 Histogram and box plot of estimated income with corrections.
If you have hundreds of variables, you may not want to spend a lot of time on each variable with missing or incorrect values. Time-consuming techniques for correction should be used sparingly. If you find an error and the fix is not obvious, you can treat it as a missing value.
Outliers are common in numeric data, especially when dealing with monetary variables. Another method for dealing with outliers is to develop a capping rule. This can be accomplished easily using some features in PROC UNIVARIATE. The following code produces an output data set with the standard deviation (incstd) and the 99th percentile value (inc99) for estimated income (inc_est).
proc univariate data=acqmod.model noprint;
var inc_est;
output out=incdata std=incstd pctlpts=99 pctlpre=inc; /* creates incstd and inc99 */
run;

data acqmod.model;
set acqmod.model;
if (_n_ eq 1) then set incdata(keep= incstd inc99);
if incstd > 2*inc99 then inc_est2 = min(inc_est,(4*inc99));
else inc_est2 = inc_est;
run;
The capping rule in the DATA step is just one example of a rule for capping the values of a variable. It looks at the spread by seeing if the standard deviation is greater than twice the value at the 99th percentile. If it is, it caps the value at four times the 99th percentile. This still allows for generous spread without allowing in obvious outliers. This particular rule only works for variables with positive values. Depending on your data, you can vary the rules to suit your goals.
Missing Values
As information is gathered and combined, missing values are present in almost every data set. Many software packages ignore records with missing values, which makes them a nuisance. The fact that a value is missing, however, can be predictive. It is important to capture that information.

Consider the direct mail company that had its customer file appended with data from an outside list. Almost a third of its customers didn't match to the outside list. At first this was perceived as negative. But it turned out that these customers were much more responsive to offers for additional products. After further analysis, it was discovered that these customers were not on many outside lists. This made them more responsive because they were not receiving many direct mail offers from other companies. Capturing the fact that they had missing values improved the targeting model.
In our case study, we saw in the univariate analysis that we have 84 missing values for income. The first step is to create an indicator variable to capture the fact that the value is missing for certain records. The following code creates a variable to capture the information:
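A minimal sketch of that step; the indicator name inc_miss and the use of the corrected variable inc_est2 are assumptions, not fixed by the text.

data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_miss = 1; /* flag records with missing income */
else inc_miss = 0;
run;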
Single Value Substitution
Single value substitution is the simplest method for replacing missing values. There are three common choices: mean, median, and mode. The mean value is based on the statistical least-squares-error calculation. This introduces the least variance into the distribution. If the distribution is highly skewed, the median may be a better choice. The following code substitutes the mean value for estimated income (inc_est2):
data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_est3 = 61;
else inc_est3 = inc_est2;
run;
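As an aside, the same replacement can be done without hard-coding the mean. This is a sketch of one alternative, not the approach used in the case study; it assumes PROC STDIZE, whose REPONLY option replaces only the missing values (here with the mean) and overwrites inc_est2 in place rather than creating inc_est3.

proc stdize data=acqmod.model out=acqmod.model method=mean reponly;
var inc_est2;
run;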
Class Mean Substitution
Class mean substitution uses the mean values within subgroups of other variables or combinations of variables. This method maintains more of the original distribution. The first step is to select one or two variables that may be highly correlated with income. Two values that would be highly correlated with income are home equity (hom_equ) and inferred age (infd_ag). The goal is to get the average estimated income for cross-sections of home equity ranges and age ranges for observations where estimated income is not missing. Because both variables are continuous, a data step is used to create the group variables, age_grp and homeq_r. PROC TABULATE is used to derive and display the values.
data acqmod.model;
set acqmod.model;
length age_grp $ 5 homeq_r $ 10; /* explicit lengths so the longer range labels are not truncated */
if 25 <= infd_ag <= 34 then age_grp = '25-34'; else
if 35 <= infd_ag <= 44 then age_grp = '35-44'; else
if 45 <= infd_ag <= 54 then age_grp = '45-54'; else
if 55 <= infd_ag <= 65 then age_grp = '55-65';
if 0 <= hom_equ<=100000 then homeq_r = '$0 -$100K'; else
if 100000<hom_equ<=200000 then homeq_r = '$100-$200K'; else
if 200000<hom_equ<=300000 then homeq_r = '$200-$300K'; else
if 300000<hom_equ<=400000 then homeq_r = '$300-$400K'; else
if 400000<hom_equ<=500000 then homeq_r = '$400-$500K'; else
if 500000<hom_equ<=600000 then homeq_r = '$500-$600K'; else
if 600000<hom_equ<=700000 then homeq_r = '$600-$700K'; else
if 700000<hom_equ then homeq_r = '$700K+';
run;

proc tabulate data=acqmod.model;
class homeq_r age_grp;
var inc_est2;
table homeq_r='Home Equity',age_grp='Age Group'*
inc_est2=' '*mean=' '*f=dollar6.
/rts=13;
run;
The output in Figure 3.7 shows a strong variation in average income among the different combinations of home equity and age group. Using these values for missing value substitution will help to maintain the distribution of the data.
Figure 3.7 Values for class mean substitution.
The final step is to develop an algorithm that will create a new estimated income variable (inc_est3) that has no missing values.
data acqmod.model;
set acqmod.model;
if inc_est2 = . then do;
if 25 <= infd_ag <= 34 then do;
if 0 <= hom_equ<=100000 then inc_est3= 47; else
if 100000<hom_equ<=200000 then inc_est3= 70; else
if 200000<hom_equ<=300000 then inc_est3= 66; else
if 300000<hom_equ<=400000 then inc_est3= 70; else
if 400000<hom_equ<=500000 then inc_est3= 89; else
if 500000<hom_equ<=600000 then inc_est3= 98; else
if 600000<hom_equ<=700000 then inc_est3= 91; else
if 700000<hom_equ then inc_est3= 71;
end; else
if 35 <= infd_ag <= 44 then do;
if 0 <= hom_equ<=100000 then inc_est3= 55; else
if 100000<hom_equ<=200000 then inc_est3= 73; else
" " " "
" " " "
if 700000<hom_equ then inc_est3= 101;
end; else
if 45 <= infd_ag <= 54 then do;
if 0 <= hom_equ<=100000 then inc_est3= 57; else
if 100000<hom_equ<=200000 then inc_est3= 72; else
" " " "
" " " "
if 700000<hom_equ then inc_est3= 110;
end; else
if 55 <= infd_ag <= 65 then do;
if 0 <= hom_equ<=100000 then inc_est3= 55; else
if 100000<hom_equ<=200000 then inc_est3= 68; else
" " " "
" " " "
end;
end;
else inc_est3 = inc_est2;
run;
Regression Substitution

Another way to replace missing values is to use a regression model, built on the records where the value is present, to predict it. In our case study, I derive values for estimated income (inc_est2) using the continuous form of age (infd_ag), the mean for each category of home equity (hom_equ), total line of credit (credlin), and total credit balances (tot_bal). The following code performs a regression analysis and creates an output data set (reg_out) with the predictive coefficients.
proc reg data=acqmod.model outest=reg_out;
inc_reg: model inc_est2 = infd_ag hom_equ credlin tot_bal; /* the label inc_reg names the scored value */
run;
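The coefficients in reg_out must then be applied to every record to produce the predicted value inc_reg that the next step uses. That scoring step is not shown above; the following is a minimal sketch of one way to do it, assuming PROC SCORE with the same predictor list.

proc score data=acqmod.model score=reg_out out=acqmod.model type=parms predict;
var infd_ag hom_equ credlin tot_bal;
run;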
data acqmod.model;
set acqmod.model;
if inc_est2 = . then inc_est3 = inc_reg;
else inc_est3 = inc_est2;
run;
Figure 3.8 Output for regression substitution.
One of the benefits of regression substitution is its ability to sustain the overall distribution of the data. To measure the effect on the spread of the data, I look at a PROC MEANS for the variable before (inc_est2) and after (inc_est3) the substitution.
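A minimal sketch of that comparison; the particular statistics requested here are assumptions.

proc means data=acqmod.model n nmiss mean std min max maxdec=2;
var inc_est2 inc_est3;
run;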
Figure 3.9 Means comparison of missing replacement.
Categorical Variables

Next I repeat the process for the categorical variables, using a simple frequency with the missing option to examine the values of each one:

proc freq data=acqmod.model;
table pop_den trav_cd bankcrd apt_ind clustr1 inc_grp sgle_in opd_bcd
occu_cd finl_id hh_ind gender ssn_ind driv_in mob_ind mortin1 mortin2
autoin1 autoin2 infd_ag age_ind dob_yr homeq_r childin homevlr clustr2
/ missing;
run;
In Figure 3.10, I see that population density (pop_den) has four values: A, B, C, and P. I requested the missing option in our frequency, so I can see the number of missing values. The data dictionary states that the correct values for pop_den are A, B, and C. I presume that the value P is an error. I have a couple of choices to remedy the situation: I can delete it or replace it. For the purposes of this case study, I give it the value of the mode, which is C.
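A minimal sketch of that recode; whether the erroneous value is overwritten in place, as here, or written to a new variable is an assumption.

data acqmod.model;
set acqmod.model;
if pop_den = 'P' then pop_den = 'C'; /* recode the erroneous value to the mode */
run;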
Figure 3.10 Frequency of population density.
Summary
In this chapter I demonstrated the process of getting data from its raw form into useful pieces of information. The process uses techniques ranging from simple graphs to complex univariate outputs. But, in the end, it becomes obvious that with the typically large sample sizes used in marketing, it is necessary to use these techniques effectively. Why? Because the quality of the work from this point on is dependent on the accuracy and validity of the data.

Now that I have the ingredients, that is, the data, I am ready to start preparing it for modeling. In chapter 4, I use some interesting techniques to select the final candidate variables. I also find the form or forms of each variable that maximize the predictive power of the model.
Chapter 4—
Selecting and Transforming the Variables
At this point in the process, the data has been carefully examined and refined for analysis. The next step is to define the goal in technical terms. For our case study, the objective is to build a net present value (NPV) model for a direct mail life insurance campaign. In this chapter, I will describe the components of NPV and detail the plan for developing the model.

Once the goal has been defined, the next step is to find a group of candidate variables that show potential for having strong predictive power. This is accomplished through variable reduction. To select the final candidate variables, I use a combination of segmentation, transformation, and interaction detection.
Defining the Objective Function
In chapter 1, I stressed the importance of having a clear objective. In this chapter, I assign a technical definition — called the objective function — to the goal. Recall that the objective function is the technical definition of your business goal.

In our case study, the first goal is to predict net present value. I define the objective function as the value in today's dollars of future profits for a life insurance product. The NPV for this particular product consists of four major components: the probability of activation, the risk index, the product profitability, and the marketing expense. They are each defined as follows:
Probability of activation. A probability calculated by a model. The individual must respond, be approved by risk, and pay his or her first premium.

Risk index. Indices in a matrix of gender by marital status by age group, based on actuarial analysis. This value can also be calculated using a predictive model.

Product profitability. Present value of a product-specific, three-year profit measure that is provided by the product manager.

Marketing expense. Cost of package, mailing, and processing (approval, fulfillment).
The final model is a combination of these four components:
Net Present Value = P(Activation) × Risk Index × Product Profitability – Marketing Expense
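As a purely hypothetical illustration (these are not the case study's values): with P(Activation) = .02, a risk index of 1.10, product profitability of $800, and a marketing expense of $.75 per piece, the NPV per prospect mailed would be .02 × 1.10 × $800 – $.75 = $16.85.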
For our case study, I have specific methods and values for these measures.
Probability of Activation
To calculate the probability of activation, I have two options: build one model that predicts activation, or build two models, one for response and one for activation given response (a model built on just the responders, to target actives). To determine which method works better, I will develop the model both ways.
Method 1: One model. When using one model, the goal is to target active accounts, that is, those responders who paid their first premium. To use the value "activate" in the analysis, it must be in a numeric form, preferably 0 or 1. We know from the frequency in chapter 3 that the values for the variable, activate, are as follows: activated = 1, responded but not activated = 0, no response = . (missing). To model activation from the original offer, I must give nonresponders the value of 0. I create a new variable, active, and leave the original variable, activate, untouched.
data acqmod.model2;
set acqmod.model2;
if activate = . then active = 0;
else active = activate;
run;
We will now use the variable active as our dependent variable.
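A quick frequency verifies the recode; this check is a sketch and is not shown in the text.

proc freq data=acqmod.model2;
table active /missing;
run;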
Method 2: Two models. When using two models, the goal is to build one model to target response and a second model to target activation, given response. The probability of activation, P(A), is the product of the probability of response, P(R), times the probability of activation given response, P(A|R). For this method, I do not have to recode any of the dependent variables.