Figure 9.9
Figure 9.10 Validation gains table with lift.
Figure 9.11 Validation gains chart.
The following data step appends the overall mean values to every record:
data ch09.bs_all;
set ch09.bs_all;
if (_n_ eq 1) then set preddata;
retain sumwgt rspmean salmean;
run;
PROC SUMMARY creates mean values of respond (rspmnf) and 12-month sales (salmnf) for each decile (val_dec):
proc summary data=ch09.bs_all;
var respond sale12mo;
class val_dec;
output out=ch09.fullmean mean= rspmnf salmnf ;
run;
The next data step uses the output from PROC SUMMARY to create a separate data set (salfmean) with the two overall mean values renamed. The overall mean values are stored in the observation where val_dec has a missing value (val_dec = .). These will be used in the final bootstrap calculation:
data salfmean( rename=(salmnf=salomn_g rspmnf=rspomn_g ) drop=val_dec);
set ch09.fullmean( where=(val_dec=.) keep=salmnf rspmnf val_dec);
smp_wgt=1;
run;
In the next data step, the means are appended to every observation of the data set ch09.fullmean. These values will be accessed in the final calculations following the macro:
data ch09.fullmean;
set ch09.fullmean;
if (_n_ eq 1) then set salfmean;
retain salomn_g rspomn_g;
run;
The bootstrapping program is identical to the one in chapter 6 up to the point where the estimates are calculated. The following data step merges all the bootstrap samples and calculates the bootstrap estimates:
data ch09.bs_sum(keep=liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf
lci_s uci_s bsest_l lftmbs lci_l uci_l val_dec salomn_g);
rspmbs = mean(of rspmn1-rspmn25);   /* mean of response         */
rspsdbs = std(of rspmn1-rspmn25);   /* st dev of response       */
salmbs = mean(of salmn1-salmn25);   /* mean of sales            */
salsdbs = std(of salmn1-salmn25);   /* st dev of sales          */
lftmbs = mean(of liftd1-liftd25);   /* mean of lift             */
lftsdbs = std(of liftd1-liftd25);   /* st dev of lift           */
liftf = 100*salmnf/salomn_g;        /* overall lift for sales   */
bsest_r = 2*rspmnf - rspmbs;        /* bootstrap est - response */
lci_r = bsest_r - 1.96*rspsdbs;     /* lower conf interval      */
uci_r = bsest_r + 1.96*rspsdbs;     /* upper conf interval      */
bsest_s = 2*salmnf - salmbs;        /* bootstrap est - sales    */
lci_s = bsest_s - 1.96*salsdbs;     /* lower conf interval      */
uci_s = bsest_s + 1.96*salsdbs;     /* upper conf interval      */
bsest_l = 2*liftf - lftmbs;         /* bootstrap est - lift     */
lci_l = bsest_l - 1.96*lftsdbs;     /* lower conf interval      */
uci_l = bsest_l + 1.96*lftsdbs;     /* upper conf interval      */
run;
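For clarity, the calculations in this step amount to the following, applied separately to the response rate, the 12-month sales, and the lift:

bootstrap estimate = 2 x (full validation value) - (mean of the 25 bootstrap sample values)
95% confidence interval = bootstrap estimate +/- 1.96 x (standard deviation of the 25 bootstrap sample values)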
Finally, I use PROC TABULATE to display the bootstrap and confidence interval values by decile:
proc tabulate data=ch09.bs_sum;
var liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf
lci_s uci_s bsest_l lftmbs lci_l uci_l;
class val_dec;
table (val_dec='Decile' all='Total'),
(rspmnf='Actual Resp'*mean=' '*f=percent6
bsest_r='BS Est Resp'*mean=' '*f=percent6
lci_r ='BS Lower CI Resp'*mean=' '*f=percent6
uci_r ='BS Upper CI Resp'*mean=' '*f=percent6
salmnf ='12-Month Sales'*mean=' '*f=dollar8
bsest_s='BS Est Sales'*mean=' '*f=dollar8
lci_s ='BS Lower CI Sales'*mean=' '*f=dollar8
uci_s ='BS Upper CI Sales'*mean=' '*f=dollar8
liftf ='Sales Lift'*mean=' '*f=6
bsest_l='BS Est Lift'*mean=' '*f=6
lci_l ='BS Lower CI Lift'*mean=' '*f=6
uci_l ='BS Upper CI Lift'*mean=' '*f=6.)
/rts=10 row=float;
run;
Figure 9.12 Bootstrap analysis
The results of the bootstrap analysis give me confidence that the model is stable. Notice how the confidence intervals are fairly tight, even in the best decile. And the bootstrap estimates are very close to the actual values, providing additional security. Keep in mind that these estimates are not based on actual behavior but rather on a propensity toward a type of behavior. They will, however, provide a substantial improvement over random selection.
Implementing the Model
In this case, the same file containing the score will be used for marketing. The marketing manager at Downing Office Products now has a robust model that can be used to solicit businesses that have the highest propensity to buy the company's products.
The ability to rank the entire business list also creates other opportunities for Downing. It is now prepared to prioritize sales efforts to maximize its marketing dollar. The top-scoring businesses (deciles 7–9) are targeted to receive a personal sales call. The middle group (4–8) is targeted to receive several telemarketing solicitations. And the lowest group (deciles 0–3) will receive a postcard directing potential customers to the company's Web site. This is expected to provide a substantial improvement in yearly sales.
Response models are the most widely used and work for almost any industry. From banks and insurance companies selling their products to phone companies and resorts selling their services, the simplest response model can improve targeting and cut costs. Whether you're targeting individuals, families, or businesses, the rules are the same: a clear objective, proper data preparation, linear predictors, rigorous processing, and thorough validation. In the next chapter, we try another recipe: we're going to predict which prospects are more likely to be financially risky.
In this chapter, I start off with a description of credit scoring, its origin, and how it has evolved into risk modeling. Then I begin the case study, in which I build a model that predicts risk by targeting failure to pay on a credit-based purchase for the telecommunications, or telco, industry. (This is also known as an approval model.) As in chapter 9, I define the objective, prepare the variables, and process and validate the model. You will see some similarities in the processes, but there are also some notable differences due to the nature of the data. Finally, I wrap up the chapter with a brief discussion of fraud modeling and how it's being used to reduce losses in many industries.
Credit Scoring and Risk Modeling
If you've ever applied for a loan, I'm sure you're familiar with questions like, "Do you own or rent?" "How long have you lived at your current address?" and "How many years have you been with your current employer?" The answers to these questions, and more, are used to calculate your credit score. Based on your answers, each of which is assigned a value, your score is summed and evaluated. Historically, this method has been very effective in helping companies determine creditworthiness.
Credit scoring began in the early sixties, when Fair, Isaac and Company developed the first simple scoring algorithm based on a few key factors. Until that time, decisions to grant credit were based primarily on judgment. Some companies were reluctant to embrace a score to determine creditworthiness. As the scores proved to be predictive, however, more and more companies began to use them.
As a result of increased computer power, more available data, and advances in technology, tools for predicting credit risk have become much more sophisticated. This has led to complex credit-scoring algorithms that can consider and utilize many different factors. Through these advances, risk scoring has evolved from a simple algorithm based on a few factors to the sophisticated scoring algorithms we see today.
Over the years, Fair, Isaac scores have become a standard in the industry. While its methodology has been closely guarded, the company recently published the components of its credit-scoring algorithm. Its score is based on the following elements:
Past payment history
• Account payment information on specific types of accounts (e.g., credit cards, retail accounts, installment loans, finance company accounts, mortgage)
• Presence of adverse public records (e.g., bankruptcy, judgments, suits, liens, wage attachments), collection items, and/or delinquency (past due items)
• Severity of delinquency (how long past due)
• Amount past due on delinquent accounts or collection items
• Time since (recency of) past due items (delinquency), adverse public records (if any), or collection items (if any)
• Number of past due items on file
• Number of accounts paid as agreed
Amount of credit owing
• Amount owing on accounts
• Amount owing on specific types of accounts
• Lack of a specific type of balance, in some cases
• Number of accounts with balances
• Proportion of credit lines used (proportion of balances to total credit limits on certain types of revolving accounts)
• Proportion of installment loan amounts still owing (proportion of balance to original loan amount on certain types of installment loans)
Length of time credit established
• Time since accounts opened
• Time since accounts opened, by specific type of account
• Time since account activity
Search for and acquisition of new credit
• Number of recently opened accounts, and proportion of accounts that are recently opened, by type of account
• Number of recent credit inquiries
• Time since recent account opening(s), by type of account
• Time since credit inquiry(s)
• Reestablishment of positive credit history following past payment problems
Types of credit established
• Number of (presence, prevalence, and recent information on) various types of accounts (credit cards, retail accounts, installment loans, mortgage, consumer finance accounts, etc.)
Over the past decade, numerous companies have begun developing their own risk scores, either to sell or for their own use. In this case study, I will develop a risk score that is very similar to those available on the market. I will test the final scoring algorithm against a generic risk score that I obtained from the credit bureau.
Defining the Objective
Eastern Telecom has just formed an alliance with First Reserve Bank to sell products and services. Initially, Eastern wishes to offer cellular phones and phone services to First Reserve's customer base. Eastern plans to use statement inserts to promote its products and services, so marketing costs are relatively small. Its main concern at this point is managing risk.
Since payment behavior for a loan product is highly correlated with payment behavior for a product or service, Eastern plans to use the bank's data to predict financial risk over a three-year period. To determine the level of risk for each customer, Eastern Telecom has decided to develop a model that predicts the probability of a customer becoming 90+ days past due or defaulting on a loan within a three-year period.
To develop a modeling data set, Eastern took a sample of First Reserve's loan customers. From the customers that were current 36 months ago, Eastern selected all the customers now considered high risk or in default and a sample of those customers who were still current and considered low risk. A high-risk customer was defined as any customer who was 90 days or more behind on a loan with First Reserve Bank. This included all bankruptcies and charge-offs. Eastern created three data fields to define a high-risk customer: bkruptcy to denote if they were bankrupt, chargoff to denote if they were charged off, and dayspdue, a numeric field detailing the days past due.
A file containing name, address, social security number, and a match key (idnum) was sent to the credit bureau for a data overlay. Eastern requested that the bureau pull 300+ variables from an archive of 36 months ago and append the information to the customer file. It also purchased a generic risk score that was developed by an outside source.
The file was returned and matched to the original extract to combine the 300+ predictive variables with the three data fields. The following code takes the combined file and creates the modeling data set. The first step defines the dependent variable, highrisk. The second step samples and defines the weight, smp_wgt; this step creates two temporary data sets, hr and lr, that are brought together in the final step to create the data set ch10.telco.
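A minimal sketch of those three steps follows. The field names (bkruptcy, chargoff, dayspdue, highrisk, smp_wgt) and data set names (hr, lr, ch10.telco) come from the text; the name of the combined input file, the sampling rate, and the weight values are illustrative assumptions (the actual frequencies and weights appear in Table 10.1):

data combined_risk;    /* assumed name for the matched customer/bureau file */
  set ch10.combined;   /* assumed input data set */
  /* Step 1: define the dependent variable highrisk from the three fields */
  if bkruptcy = 1 or chargoff = 1 or dayspdue >= 90 then highrisk = 1;
  else highrisk = 0;
run;

data hr lr;
  set combined_risk;
  /* Step 2: keep every high-risk customer; sample the low-risk customers
     and assign the sampling weight smp_wgt (rate and weights illustrative) */
  if highrisk = 1 then do;
    smp_wgt = 1;
    output hr;
  end;
  else if ranuni(5555) < .10 then do;   /* illustrative 10% sample of low risk */
    smp_wgt = 10;                       /* weight restores the full population */
    output lr;
  end;
run;

data ch10.telco;
  /* Step 3: combine the two temporary data sets into the modeling data set */
  set hr lr;
run;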
Table 10.1 Population and Sample Frequencies and Weights
Preparing the Variables
This data set is unusual in that all the predictive variables are continuous except one, gender. Because there are so many (300+) variables, I have decided to do things a little differently. I know that when data is sent to an outside source for a data overlay, there will be missing values. The first thing I want to do is determine which variables have a high number of missing values. I begin by running PROC MEANS with an option, nmiss, to calculate the number of missing values. To avoid having to type in the entire list of variable names, I run PROC CONTENTS with the short option. This creates a variable list (in all caps) without any numbers or extraneous information, which can be cut and pasted into open code:
proc contents data=ch10.telco short;
run;
proc means data=ch10.telco n nmiss mean min max maxdec=1;
var AFADBM AFMAXB AFMAXH AFMINB AFMINH AFOPEN AFPDBAL AFR29 AFR39
With 61 variables, I am going to streamline my techniques for faster processing. The first step is to check the quality of the data: I look for outliers and handle the missing values. Rather than look at each variable individually, I run another PROC MEANS on the 61 variables that I have chosen for consideration. Figure 10.2 shows part of the output.
Figure 10.1 Means of continuous variables
Trang 13At this point, the only problem I see is with the variable age The minimum age does not seem correct because we have no customers under the age of
18 In Figure 10.3, the univariate analysis of age shows that less than 1% of the values for age are below 18, so I will treat any value of
missing
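As a sketch of that treatment (the variable name age comes from the text; applying it directly to ch10.telco is an assumption):

data ch10.telco;
  set ch10.telco;
  if age < 18 then age = .;   /* treat implausible ages as missing */
run;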
As I said, I am going to do things a bit differently this time to speed the processing. Because the number of missing values for each variable is relatively small, I have decided to use mean substitution to replace them. Before I replace the missing values with the mean, I want to create a set of duplicate variables. This allows me to keep the original values intact.
To streamline the code, I use an array to create the set of duplicate variables. An array is a handy SAS option that allows you to assign a name to a group of variables. Each of the 61 variables is duplicated with the variable names rvar1, rvar2, rvar3, through rvar61. This allows me to perform the same calculations on every variable just by naming the array. In fact, I am going to use several arrays.
Figure 10.2 Means of selected variables
Figure 10.3 Univariate analysis of age.
This will make it much easier to follow the code because I won't get bogged down in variable names.
In the following data step, I create an array called riskvars. This represents the 61 variables that I've selected as preliminary candidates. I also create an array called rvar, which represents the group of renamed variables, rvar1–rvar61. The "do loop" following the array statements copies each of the 61 variables into the corresponding rvar1–rvar61 variable:
data riskmean;
set ch10.telco;
array riskvars (61)
COLLS LOCINQS INQAGE TADB25 TUTRADES TLTRADES ;
array rvar (61) rvar1-rvar61;
do count = 1 to 61;
rvar(count) = riskvars(count);
end;
run;
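The mean substitution itself is not shown above. One way to carry it out, consistent with the array approach in this chapter (the data set name meanvals and the exact output step are assumptions, not the author's code), is to compute the means with PROC MEANS and then replace the missing values in a second data step:

proc means data=riskmean noprint;
  var rvar1-rvar61;
  output out=meanvals mean=mvar1-mvar61;   /* one mean per duplicated variable */
run;

data riskmean;
  set riskmean;
  if _n_ eq 1 then set meanvals(drop=_type_ _freq_);   /* means carry to every observation */
  array rvar (61) rvar1-rvar61;
  array mvar (61) mvar1-mvar61;
  do count = 1 to 61;
    if rvar(count) = . then rvar(count) = mvar(count); /* mean substitution */
  end;
  drop count mvar1-mvar61;
run;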