Figure 9.9
Figure 9.10 Validation gains table with lift.
Figure 9.11 Validation gains chart.
The following data step appends the overall mean values to every record:
data ch09.bs_all;
set ch09.bs_all;
if (_n_ eq 1) then set preddata;
retain sumwgt rspmean salmean;
run;
PROC SUMMARY creates mean values of respond (rspmnf) and 12-month sales (salmnf) for each decile (val_dec):
proc summary data=ch09.bs_all;
var respond sale12mo;
class val_dec;
output out=ch09.fullmean mean= rspmnf salmnf ;
run;
The next data step uses the output from PROC SUMMARY to create a separate data set (salfmean) with the two overall mean values renamed. The overall mean values are stored in the observation where val_dec has a missing value (val_dec = .). These will be used in the final bootstrap calculation:
data salfmean( rename=(salmnf=salomn_g rspmnf=rspomn_g ) drop=val_dec);
set ch09.fullmean( where=(val_dec=.) keep=salmnf rspmnf val_dec);
smp_wgt=1;
run;
In the next data step, the means are appended to every observation of the data set ch09.fullmean. These values will be accessed in the final calculations following the macro:
data ch09.fullmean;
set ch09.fullmean;
if (_n_ eq 1) then set salfmean;
retain salomn_g rspomn_g;
run;
The bootstrapping program is identical to the one in chapter 6 up to the point where the estimates are calculated. The following data step merges all the bootstrap samples and calculates the bootstrap estimates:
data ch09.bs_sum(keep=liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf
lci_s uci_s bsest_l lftmbs lci_l uci_l val_dec salomn_g);
rspmbs = mean(of rspmn1-rspmn25);   /* mean of response         */
rspsdbs = std(of rspmn1-rspmn25);   /* st dev of response       */
salmbs = mean(of salmn1-salmn25);   /* mean of sales            */
salsdbs = std(of salmn1-salmn25);   /* st dev of sales          */
lftmbs = mean(of liftd1-liftd25);   /* mean of lift             */
lftsdbs = std(of liftd1-liftd25);   /* st dev of lift           */
liftf = 100*salmnf/salomn_g;        /* overall lift for sales   */
bsest_r = 2*rspmnf - rspmbs;        /* bootstrap est - response */
lci_r = bsest_r - 1.96*rspsdbs;     /* lower conf interval      */
uci_r = bsest_r + 1.96*rspsdbs;     /* upper conf interval      */
bsest_s = 2*salmnf - salmbs;        /* bootstrap est - sales    */
lci_s = bsest_s - 1.96*salsdbs;     /* lower conf interval      */
uci_s = bsest_s + 1.96*salsdbs;     /* upper conf interval      */
bsest_l = 2*liftf - lftmbs;         /* bootstrap est - lift     */
lci_l = bsest_l - 1.96*lftsdbs;     /* lower conf interval      */
uci_l = bsest_l + 1.96*lftsdbs;     /* upper conf interval      */
run;
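For clarity, the calculations in this step amount to the following, applied separately to the response rate, the 12-month sales, and the lift:

bootstrap estimate = 2 x (full validation value) - (mean of the 25 bootstrap sample values)
95% confidence interval = bootstrap estimate +/- 1.96 x (standard deviation of the 25 bootstrap sample values)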
Finally, I use PROC TABULATE to display the bootstrap and confidence interval values by decile:
proc tabulate data=ch09.bs_sum;
var liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf
lci_s uci_s bsest_l lftmbs lci_l uci_l;
class val_dec;
table (val_dec='Decile' all='Total'),
(rspmnf='Actual Resp'*mean=' '*f=percent6
bsest_r='BS Est Resp'*mean=' '*f=percent6
lci_r ='BS Lower CI Resp'*mean=' '*f=percent6
uci_r ='BS Upper CI Resp'*mean=' '*f=percent6
salmnf ='12-Month Sales'*mean=' '*f=dollar8
bsest_s='BS Est Sales'*mean=' '*f=dollar8
lci_s ='BS Lower CI Sales'*mean=' '*f=dollar8
uci_s ='BS Upper CI Sales'*mean=' '*f=dollar8
liftf ='Sales Lift'*mean=' '*f=6
bsest_l='BS Est Lift'*mean=' '*f=6
lci_l ='BS Lower CI Lift'*mean=' '*f=6
uci_l ='BS Upper CI Lift'*mean=' '*f=6.)
/rts=10 row=float;
run;
Figure 9.12 Bootstrap analysis
The results of the bootstrap analysis give me confidence that the model is stable. Notice how the confidence intervals are fairly tight, even in the best decile. And the bootstrap estimates are very close to the actual values, providing additional security. Keep in mind that these estimates are not based on actual behavior but rather on a propensity toward a type of behavior. They will, however, provide a substantial improvement over random selection.
Implementing the Model
In this case, the same file containing the score will be used for marketing. The marketing manager at Downing Office Products now has a robust model that can be used to solicit businesses that have the highest propensity to buy the company's products.
The ability to rank the entire business list also creates other opportunities for Downing. It is now prepared to prioritize sales efforts to maximize its marketing dollar. The top-scoring businesses (deciles 7–9) are targeted to receive a personal sales call. The middle group (4–8) is targeted to receive several telemarketing solicitations. And the lowest group (deciles 0–3) will receive a postcard directing potential customers to the company's Web site. This is expected to provide a substantial improvement in yearly sales.
Response models are the most widely used and work for almost any industry. From banks and insurance companies selling their products to phone companies and resorts selling their services, the simplest response model can improve targeting and cut costs. Whether you're targeting individuals, families, or businesses, the rules are the same: a clear objective, proper data preparation, linear predictors, rigorous processing, and thorough validation. In the next chapter, we try another recipe: we're going to predict which prospects are more likely to be financially risky.
In this chapter, I start off with a description of credit scoring, its origin, and how it has evolved into risk modeling. Then I begin the case study, in which I build a model that predicts risk by targeting failure to pay on a credit-based purchase for the telecommunications, or telco, industry. (This is also known as an approval model.) As in chapter 9, I define the objective, prepare the variables, and process and validate the model. You will see some similarities in the processes, but there are also some notable differences due to the nature of the data. Finally, I wrap up the chapter with a brief discussion of fraud modeling and how it's being used to reduce losses in many industries.
Credit Scoring and Risk Modeling
If you've ever applied for a loan, I'm sure you're familiar with questions like, "Do you own or rent?" "How long have you lived at your current address?" and "How many years have you been with your current employer?" The answers to these questions, and more, are used to calculate your credit score. Based on your answers, each of which is assigned a value, your score is summed and evaluated. Historically, this method has been very effective in helping companies determine creditworthiness.
Credit scoring began in the early sixties, when Fair, Isaac and Company developed the first simple scoring algorithm based on a few key factors. Until that time, decisions to grant credit were based primarily on judgment. Some companies were reluctant to embrace a score to determine creditworthiness. As the scores proved to be predictive, however, more and more companies began to use them.
As a result of increased computer power, more available data, and advances in technology, tools for predicting credit risk have become much more sophisticated. This has led to complex credit-scoring algorithms that can consider and utilize many different factors. Through these advances, risk scoring has evolved from a simple algorithm based on a few factors to the sophisticated scoring algorithms we see today.
Over the years, Fair, Isaac scores have become a standard in the industry. While its methodology has been closely guarded, the company recently published the components of its credit-scoring algorithm. Its score is based on the following elements:
Past payment history
• Account payment information on specific types of accounts (e.g., credit cards, retail accounts, installment loans, finance company accounts, mortgage)
• Presence of adverse public records (e.g., bankruptcy, judgments, suits, liens, wage attachments), collection items, and/or delinquency (past due items)
• Severity of delinquency (how long past due)
• Amount past due on delinquent accounts or collection items
• Time since (recency of) past due items (delinquency), adverse public records (if any), or collection items (if any)
• Number of past due items on file
• Number of accounts paid as agreed
Amount of credit owing
• Amount owing on accounts
• Amount owing on specific types of accounts
• Lack of a specific type of balance, in some cases
• Number of accounts with balances
• Proportion of credit lines used (proportion of balances to total credit limits on certain types of revolving accounts)
• Proportion of installment loan amounts still owing (proportion of balance to original loan amount on certain types of installment loans)
Length of time credit established
• Time since accounts opened
• Time since accounts opened, by specific type of account
• Time since account activity
Search for and acquisition of new credit
• Number of recently opened accounts, and proportion of accounts that are recently opened, by type of account
• Number of recent credit inquiries
• Time since recent account opening(s), by type of account
• Time since credit inquiry(s)
• Reestablishment of positive credit history following past payment problems
Types of credit established
• Number of (presence, prevalence, and recent information on) various types of accounts (credit cards, retail accounts, installment loans, mortgage, consumer finance accounts, etc.)
Over the past decade, numerous companies have begun developing their own risk scores, either to sell or for their own use. In this case study, I will develop a risk score that is very similar to those available on the market. I will test the final scoring algorithm against a generic risk score that I obtained from the credit bureau.
Defining the Objective
Eastern Telecom has just formed an alliance with First Reserve Bank to sell products and services. Initially, Eastern wishes to offer cellular phones and phone services to First Reserve's customer base. Eastern plans to use statement inserts to promote its products and services, so marketing costs are relatively small. Its main concern at this point is managing risk.
Since payment behavior for a loan product is highly correlated with payment behavior for a product or service, Eastern plans to use the bank's data to predict financial risk over a three-year period. To determine the level of risk for each customer, Eastern Telecom has decided to develop a model that predicts the probability of a customer becoming 90+ days past due or defaulting on a loan within a three-year period.
To develop a modeling data set, Eastern took a sample of First Reserve's loan customers. From the customers that were current 36 months ago, Eastern selected all the customers now considered high risk or in default and a sample of those customers who were still current and considered low risk. A high-risk customer was defined as any customer who was 90 days or more behind on a loan with First Reserve Bank. This included all bankruptcies and charge-offs. Eastern created three data fields to define a high-risk customer: bkruptcy to denote if they were bankrupt, chargoff to denote if they were charged off, and dayspdue, a numeric field detailing the days past due.
A file containing name, address, social security number, and a match key (idnum) was sent to the credit bureau for a data overlay. Eastern requested that the bureau pull 300+ variables from an archive of 36 months ago and append the information to the customer file. It also purchased a generic risk score that was developed by an outside source.
The file was returned and matched to the original extract to combine the 300+ predictive variables with the three data fields. The following code takes the combined file and creates the modeling data set. The first step defines the dependent variable, highrisk. The second step samples and defines the weight, smp_wgt; this step creates two temporary data sets, hr and lr, that are brought together in the final step to create the data set ch10.telco.
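A minimal sketch of those three steps follows. The field names (bkruptcy, chargoff, dayspdue, highrisk, smp_wgt) and data set names (hr, lr, ch10.telco) come from the text; the name of the combined input file, the sampling rate, and the weight values are illustrative assumptions (the actual frequencies and weights appear in Table 10.1):

data combined_risk;    /* assumed name for the matched customer/bureau file */
  set ch10.combined;   /* assumed input data set */
  /* Step 1: define the dependent variable highrisk from the three fields */
  if bkruptcy = 1 or chargoff = 1 or dayspdue >= 90 then highrisk = 1;
  else highrisk = 0;
run;

data hr lr;
  set combined_risk;
  /* Step 2: keep every high-risk customer; sample the low-risk customers
     and assign the sampling weight smp_wgt (rate and weights illustrative) */
  if highrisk = 1 then do;
    smp_wgt = 1;
    output hr;
  end;
  else if ranuni(5555) < .10 then do;   /* illustrative 10% sample of low risk */
    smp_wgt = 10;                       /* weight restores the full population */
    output lr;
  end;
run;

data ch10.telco;
  /* Step 3: combine the two temporary data sets into the modeling data set */
  set hr lr;
run;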
Table 10.1 Population and Sample Frequencies and Weights
Preparing the Variables
This data set is unusual in that all the predictive variables are continuous except one, gender. Because there are so many (300+) variables, I have decided to do things a little differently. I know that when data is sent to an outside source for a data overlay, there will be missing values. The first thing I want to do is determine which variables have a high number of missing values. I begin by running PROC MEANS with an option, nmiss, to calculate the number of missing values. To avoid having to type in the entire list of variable names, I run PROC CONTENTS with the short option. This creates a variable list (in all caps) without any numbers or extraneous information, which can be cut and pasted into open code:
proc contents data=ch10.telco short;
run;
proc means data=ch10.telco n nmiss mean min max maxdec=1;
var AFADBM AFMAXB AFMAXH AFMINB AFMINH AFOPEN AFPDBAL AFR29 AFR39
With 61 variables, I am going to streamline my techniques for faster processing. The first step is to check the quality of the data: I look for outliers and handle the missing values. Rather than look at each variable individually, I run another PROC MEANS on the 61 variables that I have chosen for consideration. Figure 10.2 shows part of the output.
Figure 10.1 Means of continuous variables
Trang 13At this point, the only problem I see is with the variable age The minimum age does not seem correct because we have no customers under the age of
18 In Figure 10.3, the univariate analysis of age shows that less than 1% of the values for age are below 18, so I will treat any value of
missing
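As a sketch of that treatment (the variable name age comes from the text; applying it directly to ch10.telco is an assumption):

data ch10.telco;
  set ch10.telco;
  if age < 18 then age = .;   /* treat implausible ages as missing */
run;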
As I said, I am going to do things a bit differently this time to speed the processing. Because the number of missing values for each variable is relatively small, I have decided to use mean substitution to replace them. Before I replace the missing values with the mean, I want to create a set of duplicate variables. This allows me to keep the original values intact.
To streamline the code, I use an array to create the set of duplicate variables. An array is a handy SAS option that allows you to assign a name to a group of variables. Each of the 61 variables is duplicated with the variable names rvar1, rvar2, rvar3, through rvar61. This allows me to perform the same calculations on every variable just by naming the array. In fact, I am going to use several arrays.
Figure 10.2 Means of selected variables
Figure 10.3 Univariate analysis of age.
This will make it much easier to follow the code because I won't get bogged down in variable names.
In the following data step, I create an array called riskvars. This represents the 61 variables that I've selected as preliminary candidates. I also create an array called rvar, which represents the group of renamed variables, rvar1–rvar61. The "do loop" following the array statements copies each of the 61 variables into the corresponding rvar1–rvar61 variable:
data riskmean;
set ch10.telco;
array riskvars (61)
COLLS LOCINQS INQAGE TADB25 TUTRADES TLTRADES ;
array rvar (61) rvar1-rvar61;
do count = 1 to 61;
rvar(count) = riskvars(count);
end;
run;
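The mean substitution itself is not shown above. One way to carry it out, consistent with the array approach in this chapter (the data set name meanvals and the exact output step are assumptions, not the author's code), is to compute the means with PROC MEANS and then replace the missing values in a second data step:

proc means data=riskmean noprint;
  var rvar1-rvar61;
  output out=meanvals mean=mvar1-mvar61;   /* one mean per duplicated variable */
run;

data riskmean;
  set riskmean;
  if _n_ eq 1 then set meanvals(drop=_type_ _freq_);   /* means carry to every observation */
  array rvar (61) rvar1-rvar61;
  array mvar (61) mvar1-mvar61;
  do count = 1 to 61;
    if rvar(count) = . then rvar(count) = mvar(count); /* mean substitution */
  end;
  drop count mvar1-mvar61;
run;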