Data Mining Model Comparison
Paolo Giudici
University of Pavia
Summary. The aim of this contribution is to illustrate the role of statistical models and, more generally, of statistics, in choosing a Data Mining model. After a preliminary introduction on the distinction between Data Mining and statistics, we focus on the issue of how to choose a Data Mining methodology. This illustrates well how statistical thinking can bring real added value to a Data Mining analysis, without which it becomes rather difficult to make a reasoned choice. In the third part of the paper we present, by means of a case study in credit risk management, how Data Mining and statistics can profitably interact.
Key words: Model choice, statistical hypotheses testing, cross-validation, loss functions, credit risk management, logistic regression models
32.1 Data Mining and Statistics
Statistics has always been involved with creating methods to analyse data. The main difference compared to the methods developed in Data Mining is that statistical methods are usually developed in relation to the data being analyzed, but also according to a conceptual reference paradigm. Although this has made the various statistical methods available coherent and rigorous at the same time, it has also limited their ability to adapt quickly to the methodological requests put forward by the developments in the field of information technology.
There are at least four aspects that distinguish the statistical analysis of data from Data Mining.
First, while statistical analysis traditionally concerns itself with analyzing primary data that has been collected to check specific research hypotheses, Data Mining can also concern itself with secondary data collected for other reasons. This is the norm, for example, when analyzing company data that comes from a data warehouse. Furthermore, while in the statistical field the data can be of an experimental nature (the data could be the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), in Data Mining the data is typically of an observational nature.
Second, Data Mining is concerned with analyzing great masses of data. This implies new considerations for statistical analysis. For example, for many applications it is impossible to analyze or even access the whole database, for reasons of computer efficiency. Therefore
it becomes necessary to have a sample of the data from the database being examined. This sampling must be carried out bearing in mind the Data Mining aims and, therefore, it cannot be analyzed with the traditional statistical sampling theory tools.
Third, many databases do not lead to the classic forms of statistical data organization. This is true, for example, of data that comes from the Internet. This creates the need for appropriate analytical methods to be developed, which are not available in the statistics field.
One last, but very important, difference that we have already mentioned is that Data Mining results must be of some consequence. This means that constant attention must be given to business results achieved with the data analysis models.
32.2 Data Mining Model Comparison
Several classes of computational and statistical methods for Data Mining are available. Once a class of models has been established, the problem is to choose the "best" model from it. In this chapter, summarized from Chapter 6 of (Giudici, 2003), we present a systematic comparison of such models.
Comparison criteria for Data Mining models can be classified schematically into: criteria based on statistical tests, criteria based on scoring functions, Bayesian criteria, computational criteria, and business criteria.
The first are based on the theory of statistical hypothesis testing and, therefore, there is a lot of detailed literature related to this topic; see, for example, a text about statistical inference, such as (Mood et al., 1991) or (Bickel and Doksum, 1977). A statistical model can be specified by a discrete probability function or by a probability density function, f(x). Such a model is usually left unspecified, up to unknown quantities that have to be estimated on the basis of the data at hand. Typically, the observed sample is not sufficient to reconstruct each detail of f(x), but it can indeed be used to approximate f(x) with a certain accuracy. Often a density function is parametric, so that it is defined by a vector of parameters Θ = (θ_1, ..., θ_I), such that each value θ of Θ corresponds to a particular density function, p_θ(x). In order to measure the accuracy of a parametric model, one can resort to the notion of distance between a model f, which underlies the data, and an approximating model g (see, for instance, (Zucchini, 2000)).
Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates.
The entropic distance of a distribution g from a target distribution f is:

\[ E_d = \sum_i f_i \log \frac{f_i}{g_i} \qquad (32.1) \]

The chi-squared distance of a distribution g from a target distribution f is instead:

\[ \chi^2_d = \sum_i \frac{(f_i - g_i)^2}{g_i} \qquad (32.2) \]
The 0-1 distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is:

\[ 0\text{-}1_d = \sum_{r=1}^{n} 1\left(X_{fr}, X_{gr}\right) \qquad (32.3) \]
where 1(w, z) = 1 if w = z and 0 otherwise.
For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in the Cartesian plane. Another possible choice is the uniform distance, applied when nonparametric models are being used.
The Euclidean distance between a distribution g and a target f is expressed by the equation:

\[ {}_2d\left(X_f, X_g\right) = \sqrt{\sum_{r=1}^{n} \left(X_{fr} - X_{gr}\right)^2} \qquad (32.4) \]
Given two distribution functions F and G, with values in [0, 1], the uniform distance is defined as the quantity:

\[ \sup_{t} \left| F(t) - G(t) \right| \qquad (32.5) \]
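To make the preceding definitions concrete, the following sketch (in Python, with hypothetical toy data; the function names are ours, not part of any library) computes the distances above for discrete distributions and prediction vectors.

```python
import numpy as np

def entropic_distance(f, g):
    """Entropic distance of g from the target f (eq. 32.1)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0                        # terms with f_i = 0 contribute nothing
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

def chi_squared_distance(f, g):
    """Chi-squared distance of g from the target f (eq. 32.2)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.sum((f - g) ** 2 / g))

def zero_one_distance(x_f, x_g):
    """0-1 distance between observed and predicted labels (eq. 32.3)."""
    return int(np.sum(np.asarray(x_f) == np.asarray(x_g)))   # 1(w, z) = 1 when w = z

def euclidean_distance(x_f, x_g):
    """Euclidean distance between two quantitative vectors (eq. 32.4)."""
    x_f, x_g = np.asarray(x_f, float), np.asarray(x_g, float)
    return float(np.sqrt(np.sum((x_f - x_g) ** 2)))

def uniform_distance(F_values, G_values):
    """Uniform (sup) distance between two distribution functions on a grid (eq. 32.5)."""
    return float(np.max(np.abs(np.asarray(F_values) - np.asarray(G_values))))

# Toy example: target distribution f and approximating distribution g.
f = [0.2, 0.5, 0.3]
g = [0.25, 0.45, 0.30]
print(entropic_distance(f, g), chi_squared_distance(f, g))
```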
Any of the previous distances can be employed to define the notion of discrepancy of a statistical model. The discrepancy of a model, g, can be obtained as the discrepancy between the unknown probabilistic model, f, and the best (closest) parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f.
Assume that f represents the unknown density of the population, and let g = p_θ be a family of density functions (indexed by a vector of I parameters, θ) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g with respect to a target model f is:

\[ \Delta(f, p_\vartheta) = \sum_{i=1}^{n} \left( f(x_i) - p_\vartheta(x_i) \right)^2 \qquad (32.6) \]
A common choice of discrepancy function is the Kullback-Leibler divergence, which derives from the entropic distance and can be applied to any type of observations. In this context, the best model can be interpreted as the one with a minimal loss of information from the true unknown distribution.
The Kullback-Leibler divergence of a parametric model p_θ with respect to an unknown density f is defined by:

\[ \Delta_{K\text{-}L}(f, p_\vartheta) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\vartheta}}(x_i)} \qquad (32.7) \]

where the parametric density in the denominator has been evaluated at the values of the parameters which minimize the distance with respect to f.
It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy. The most used of such estimators is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons between pairs of alternative models. The idea is to compare the log-likelihood scores of two alternative models.
The log-likelihood score is then defined by:

\[ -2 \sum_{i=1}^{n} \log p_{\hat{\vartheta}}(x_i) \qquad (32.8) \]

Hypothesis testing theory allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen. To summarize,
using statistical tests it is possible to make an accurate choice among the models, based on the observed data. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs; therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be a valid assumption.
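As a concrete sketch of such a pairwise comparison (using scipy; the model log-likelihoods below are hypothetical numbers, not results from the case study), two nested models can be compared through the difference of their log-likelihood scores, referred to a chi-squared threshold:

```python
from scipy.stats import chi2

def compare_nested_models(loglik_simple, loglik_complex, df_diff, alpha=0.05):
    """Compare two nested models via the difference of their -2 log-likelihood scores."""
    statistic = (-2.0 * loglik_simple) - (-2.0 * loglik_complex)   # deviance difference
    threshold = chi2.ppf(1.0 - alpha, df_diff)                     # critical value
    p_value = chi2.sf(statistic, df_diff)
    return statistic, p_value, statistic > threshold               # True: keep the complex model

# Hypothetical maximised log-likelihoods of two fitted logistic regressions.
stat, p, keep_complex = compare_nested_models(-512.4, -505.1, df_diff=3)
print(f"deviance difference = {stat:.2f}, p-value = {p:.4f}, keep complex model: {keep_complex}")
```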
A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as described, for instance, by the number of parameters. It is thus necessary to employ score functions that penalise model complexity.
The most important of such functions is the AIC (Akaike Information Criterion; see (Akaike, 1974)). The AIC criterion is defined by the following equation:

\[ \mathrm{AIC} = -2 \log L(\hat{\vartheta}; x_1, \dots, x_n) + 2q \qquad (32.9) \]

where the first term is minus twice the logarithm of the likelihood function, calculated at the maximum likelihood parameter estimate, and q is the number of parameters of the model. From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity. The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid, and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated in (Schwarz, 1978). The BIC criterion is defined by the following expression:
\[ \mathrm{BIC} = -2 \log L(\hat{\vartheta}; x_1, \dots, x_n) + q \log n \qquad (32.10) \]
As can be seen from its definition, the BIC differs from the AIC only in the second part, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework.
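A minimal sketch of the two scores, computed directly from a model's maximised log-likelihood, its number of parameters q and the sample size n (the numerical values below are hypothetical, for illustration only):

```python
import numpy as np

def aic(loglik, q):
    """Akaike Information Criterion (eq. 32.9)."""
    return -2.0 * loglik + 2.0 * q

def bic(loglik, q, n):
    """Bayesian Information Criterion (eq. 32.10)."""
    return -2.0 * loglik + q * np.log(n)

# Hypothetical maximised log-likelihoods and parameter counts of three candidate models.
candidates = {"model_1": (-498.7, 40), "model_2": (-505.1, 12), "model_3": (-503.8, 15)}
n = 3712
for name, (loglik, q) in candidates.items():
    print(name, round(aic(loglik, q), 1), round(bic(loglik, q, n), 1))
```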
To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can be used also to compare non-nested models and, more generally, models that do not belong to the same class (for instance a probabilistic neural network and a linear regression model). However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine whether the difference between two models is significant or not, and how it compares to another difference. These criteria are indeed useful in a preliminary exploration phase. To examine these criteria and to compare them with the previous ones see, for instance, (Zucchini, 2000) or (Hand et al., 2001).
A possible "compromise" between the previous two criteria is given by Bayesian criteria, which can be developed in a rather coherent way (see, e.g., (Bernardo and Smith, 1994)). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general purpose software. For Data Mining works using Bayesian criteria the reader can see, for instance, (Giudici, 2003) and (Giudici and Castelo, 2001).
The intensive, widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analyzed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non-probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general purpose software has made this task easier. The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples, a "training" sample, with n − m observations, and a "validation" sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion, the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function.
One problem regarding the cross-validation criterion is in deciding how to select m, that is, the number of observations contained in the "validation sample". For example, if we select m = n/2, then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations for the validation sampling group and therefore reducing the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.
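A minimal sketch of this criterion, assuming a generic model object with fit/predict methods and numeric arrays X and y: the data are split 75%/25%, the model is fitted on the training part, and a discrepancy (here the misclassification rate) is evaluated on the validation part.

```python
import numpy as np

def validation_discrepancy(model, X, y, validation_fraction=0.25, seed=0):
    """Fit on n - m observations and measure the discrepancy on the m held-out ones."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(round(validation_fraction * n))        # size of the validation sample
    idx = rng.permutation(n)
    valid_idx, train_idx = idx[:m], idx[m:]
    model.fit(X[train_idx], y[train_idx])          # fit on the training sample
    predictions = model.predict(X[valid_idx])      # predict on the validation sample
    return float(np.mean(predictions != y[valid_idx]))   # 0-1 discrepancy (misclassification rate)
```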
To summarize, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see (Hastie et al., 2001).
One last group of criteria seems specifically tailored for the Data Mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting Data Mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see for instance (Bernardo and Smith, 1994)). They are of great interest and have great application potential, although at present they are mainly concerned with solving classification problems. For a more detailed examination of these criteria the reader can see, for example, (Hand, 1997, Hand et al., 2001) or the reference manuals of Data Mining software, such as that of SAS Enterprise Miner.
The idea behind these methods is that, when choosing among alternative models, it is important to focus attention on comparing the utility of the results obtained from the models, and not just on the statistical comparison between the models themselves. Since the main problem dealt with by data analysis is to reduce uncertainties about the risk factors, or "loss" factors, reference is often made to developing criteria that minimize the loss connected to the problem being examined. In other words, the best model is the one that leads to the least loss.
Most of the loss function based criteria apply to predictive classification problems, where the concept of a confusion matrix arises. The confusion matrix is used as an indication of the properties of a classification (discriminant) rule. It contains the number of elements that have been correctly or incorrectly classified for each class. On its main diagonal we can see the number of observations that have been correctly classified for each class, while the off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is (explicitly or implicitly) assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called rate of error, or misclassification error, and it is the quantity which must be minimized. Of course, the assumption of equal costs can be replaced by weighting errors with their relative costs.
The confusion matrix gives rise to a number of graphs that can be used to assess the relative utility of a model, such as the lift chart and the ROC curve. For a detailed illustration
of these graphs we refer to (Hand, 1997) or (Giudici, 2003). The lift chart puts the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. Subsequently, it subdivides such scores into deciles. It then calculates and graphs the observed probability of success for each of the decile classes in the validation set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, taking the mean of the observed success probabilities.
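The construction just described can be sketched as follows, assuming arrays of validation-set scores (estimated success probabilities) and of observed binary outcomes; the function name is ours, for illustration only.

```python
import numpy as np

def lift_by_decile(scores, outcomes, n_groups=10):
    """Observed success rate and lift for each decile of decreasing estimated score."""
    scores = np.asarray(scores, float)
    outcomes = np.asarray(outcomes, float)
    order = np.argsort(-scores)                    # sort by decreasing estimated probability
    sorted_outcomes = outcomes[order]
    baseline = outcomes.mean()                     # overall observed success rate (baseline model)
    rows = []
    for i, group in enumerate(np.array_split(sorted_outcomes, n_groups), start=1):
        observed = group.mean()                    # observed success rate in this decile
        rows.append((100 * i // n_groups, observed, observed / baseline))
    return rows                                    # (percentile, observed rate, lift)
```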
The ROC (Receiver Operating Characteristic) curve is a graph that also measures the predictive accuracy of a model. It is based on four conditional frequencies that can be derived from a model and the choice of a cut-off point for its scores:
• the observations predicted as events and effectively such (sensitivity);
• the observations predicted as events and effectively non-events;
• the observations predicted as non-events and effectively events;
• the observations predicted as non-events and effectively such (specificity).
The ROC curve is obtained by representing, for any fixed cut-off value, a point in the Cartesian plane having as x-value the false positive rate (1 − specificity) and as y-value the sensitivity value. Each point on the curve therefore corresponds to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the y-axis. To summarize, criteria based on loss functions have the advantage of being easy to interpret and, therefore, well suited for Data Mining applications; on the other hand, they still need formal improvements and mathematical refinements. In the next section we give an example of how this can be done, and show that statistics and Data Mining applications can fruitfully interact.
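Before moving to the case study, here is a sketch of the ROC construction, assuming scikit-learn is available; the outcome and score vectors below are hypothetical toy values, used only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical validation outcomes (1 = event) and model scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.80, 0.20, 0.60, 0.90, 0.40, 0.70])

# Each point of the curve corresponds to one cut-off on the score:
# x = 1 - specificity (false positive rate), y = sensitivity (true positive rate).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("Area under the ROC curve:", auc(fpr, tpr))
```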
32.3 Application to Credit Risk Management
We now apply the previous considerations to a case study that concerns credit risk management. The objective of the analysis is the evaluation of the credit reliability of small and medium enterprises (SMEs) that demand financing for their development.
In order to assess credit reliability, each applicant for credit is associated with a score, usually expressed in terms of probability of repayment (default probability). Data Mining methods are used to estimate such a score and, on the basis of it, to classify applicants as being reliable (worthy of credit) or not.
Data Mining models for credit scoring are of the predictive (or supervised) kind: they use explanatory variables obtained from information available on the applicant in order to get an estimate of the probability of repayment (target or response variable). The methods most used in practical credit scoring applications are: linear and logistic regression models, neural networks and classification trees. Often, in banking practice, the resulting scores are called "statistical" and supplemented with subjective, judgemental evaluations.
In this section we consider the analysis of a database that includes 7134 SMEs belonging to the retail segment of an important Italian bank. The retail segment contains companies with total annual sales of less than 2.5 million per year. On each of these companies the bank has calculated a score, in order to evaluate their financing (or refinancing) in the period from April 1st, 1999 to April 30th, 2000. After data cleaning, 13 variables are included in the analysis database, of which one binary variable that expresses credit reliability (BAD = 0 for the reliable companies, BAD = 1 for the non-reliable ones) can be considered as the response or target variable. The sample contains about 361 companies with BAD = 1 (about 5%) and 6773 with BAD = 0 (about 95%). The objective of the analysis is to build a statistical rule that explains the target variable as a function of the explanatory variables. Once built on the observed data, such a rule will be extrapolated to assess and predict future applicants for credit. Notice the imbalance of the distribution of the target response: this situation, typical in predictive Data Mining problems, poses serious challenges to the performance of a model.
The remaining 12 available variables are assumed to influence reliability and can be considered as explanatory predictors. Among them we have: the age of the company, its legal status, the number of employees, the total sales and the variation of the sales in the last period, the region of residence, the specific business, and the duration of the relationship of the managers of the company with the bank. Most of them can be considered as "demographic" information on the company, stable in time but indeed not very powerful for building a statistical model. However, it must be said that, since the companies considered are all SMEs, it is rather difficult to rely on other information, such as balance sheet data.
A preliminary exploratory analysis can give indications on how to code the explanatory variables, in order to maximize their predictive power. In order to reach this objective we have employed statistical measures of association between pairs of variables, such as chi-squared based measures, and statistical measures of dependence, such as Goodman and Kruskal's (see (Giudici, 2003) for a systematic comparison of such measures). We remark that the use of such tools is very beneficial for the analysis and can considerably improve the final performance results. As a result of our analysis, all explanatory variables have been discretised, with a number of levels ranging from 2 to 26.
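As a sketch of this kind of preliminary screening (assuming pandas and scipy, with a hypothetical dataframe df containing the discretised predictors and the BAD target), the association between each candidate coding and the target can be measured from the corresponding contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_squared_association(df, predictor, target="BAD"):
    """Chi-squared statistic of association between a discretised predictor and the binary target."""
    table = pd.crosstab(df[predictor], df[target])          # contingency table
    statistic, p_value, dof, _ = chi2_contingency(table)
    return statistic, p_value, dof
```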
In order to focus on the issue of model comparison, we now concentrate on the comparison of three different logistic regression models on the data. This model is the most used in credit scoring applications; other models that are employed are classification trees, linear discriminant analysis and neural networks. Here we prefer to compare models belonging to the same class, to better illustrate our issue; for a detailed comparison of credit scoring methods, on a different data set, see (Giudici, 2003). Our analysis has been conducted using the SAS and SAS Enterprise Miner software, available at the bank subject of the analysis.
We have chosen, in agreement with the bank's experts, three logistic regression models: a saturated model, which contains all explanatory variables, with the levels obtained from the exploratory analysis; a statistically selected model, using pairwise statistical hypothesis testing; and a model that minimizes the loss function. In the following, the saturated model will be named "RegA (model A)"; the model chosen according to a statistical selection strategy, "RegB (model B)"; and the model chosen minimizing the loss function, "RegC (model C)". Statistical model comparison has been carried out using a stepwise model selection approach,
with a reference value of 0.05 to compare p-values with. On the other hand, the loss function has been expressed by the bank's experts as a function of the classification errors. Table 32.1 below describes such a loss function.
Table 32.1 The chosen loss function

                      Predicted
  Actual         BAD       GOOD
  BAD              0         20
  GOOD            -1          0
The table contains the estimated losses (in scale-free values) corresponding to the combinations of actual and predicted values of the target variable. The specified loss function means that it is assumed that giving credit to a non-reliable (bad) enterprise is 20 times more costly than not giving credit to a reliable (good) enterprise. In statistical terms, the type I error costs 20 times the type II error. As each of the four scenarios in Table 32.1 has an occurrence probability, it is possible to calculate the expected loss of each considered statistical model. The best one will be the one minimizing such expected loss.
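A sketch of the expected-loss computation under the loss function of Table 32.1, assuming vectors of actual and predicted classes on the validation set, coded as 1 = BAD and 0 = GOOD:

```python
import numpy as np

# Loss for each (actual, predicted) combination, as specified in Table 32.1.
LOSS = {(1, 1): 0, (1, 0): 20,    # actual BAD: predicted BAD = 0, predicted GOOD = 20
        (0, 1): -1, (0, 0): 0}    # actual GOOD: predicted BAD = -1, predicted GOOD = 0

def expected_loss(actual, predicted):
    """Average loss of a classification rule under the chosen loss function."""
    actual = np.asarray(actual, int)
    predicted = np.asarray(predicted, int)
    return float(np.mean([LOSS[(a, p)] for a, p in zip(actual, predicted)]))
```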
In the SAS Enterprise Miner tool, the Assessment node provides a common framework to compare models in terms of their predictions. This requires that the data has been partitioned into two or more datasets, according to the computational criteria of model comparison. The Assessment node produces a table view of the model results that lists relevant statistics and measures of model adequacy, as well as several different charts/reports, depending on whether the target variable is continuous or categorical and whether a profit/loss function has been specified.
In the case under examination, the initial dataset (5351 observations) has been split in two, using a sampling mechanism stratified with respect to the target variable. The training dataset contains about 70% of the observations (about 3712) and the validation dataset the remaining 30% (about 1639 observations). As the samples are stratified, in both the resulting datasets the percentages of "bad" and "good" enterprises remain the same as those in the combined dataset (5% and 95%).
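A sketch of such a stratified partition, assuming scikit-learn and arrays X (discretised explanatory variables) and y (the binary BAD target); the stratify argument preserves the 5%/95% class proportions in both subsets.

```python
from sklearn.model_selection import train_test_split

def stratified_partition(X, y, validation_fraction=0.30, seed=1):
    """70/30 split that keeps the proportions of bad and good enterprises in both datasets."""
    return train_test_split(X, y, test_size=validation_fraction,
                            stratify=y, random_state=seed)
```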
The first model comparison tool we consider is the lift chart. For a binary target, the lift chart (also called gains chart) is built as follows. The scored data set is sorted by the probabilities of the target event in descending order; observations are then grouped into deciles. For each decile, a lift chart can display either the percentage of target responses (bad repayers here) or the ratio between this percentage and the corresponding one for the baseline (random) model, called the lift. Lift charts show the percent of positive responses or the lift value on the vertical axis. Table 32.2 shows the calculations that give rise to the lift chart for the credit scoring problem considered here. Figure 32.1 shows the corresponding curves.
Table 32.2 Calculations for the lift chart

Observations   Percentile   % captured responses   % captured responses   % captured responses   % captured responses
per group                   (BASELINE)             (REG A)                (REG B)                (REG C)
163.90             10          5.064                  20.134                 22.575                 22.679
163.90             20          5.064                  12.813                 12.813                 14.033
163.90             30          5.064                   9.762                 10.103                 10.293
163.90             40          5.064                   8.237                  8.237                  8.542
163.90             50          5.064                   7.322                  7.383                  7.445
163.90             60          5.064                   6.508                  6.913                  6.624
163.90             70          5.064                   5.753                  6.237                  6.096
163.90             80          5.064                   5.567                  5.567                  5.644
163.90             90          5.064                   5.288                  5.220                  5.185
163.90            100          5.064                   5.064                  5.064                  5.064
Fig. 32.1 Lift charts for the best model
Comparing the results in Table 32.2 and Figure 32.1, it emerges that the performances of the three models being compared are rather similar; however, the best model seems to be model C (the model that minimises the losses), as it is the model that, in the first deciles, is able to effectively capture more bad enterprises, a difficult task in the given problem. Recalling that the actual percentage of bad enterprises observed is equal to 5%, the previous graph can be normalized by dividing the percentage of bads in each decile by the overall 5% percentage. The result is the actual lift of a model, that is, the actual improvement with respect to the baseline situation of absence of a model (as if each company were estimated good/bad according to a purely random mechanism). In terms of model C, in the first decile (with about 164 enterprises) the lift is equal to 4.46 (i.e. 22.7% / 5.1%); this means that, using model C, it is expected to obtain, in the first decile, a number of bad enterprises about 4.5 times higher than in a random sample of the considered enterprises.
The second Assessment tool we consider is the threshold chart. Threshold-based charts enable one to display the agreement between the predicted and actual target values across a range of threshold levels. The threshold level is the cut-off that is used to classify an observation on the basis of the event-level posterior probabilities. The default threshold level is 0.50. For the credit scoring case, the calculations leading to the threshold chart are in Table 32.3 and the corresponding chart in Figure 32.3 below.
In order to interpret the previous table and figure correctly, let us consider some numerical examples. First, we remark that the results refer to the validation dataset, with 1629 enterprises.
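A sketch of the quantities behind such a threshold chart, assuming arrays of validation outcomes and model scores: for a grid of cut-off values, the agreement measures (sensitivity, specificity and overall accuracy) are computed.

```python
import numpy as np

def threshold_chart(y_true, y_score, cutoffs=np.linspace(0.05, 0.95, 19)):
    """Sensitivity, specificity and accuracy for a grid of classification cut-offs."""
    y_true = np.asarray(y_true, int)
    y_score = np.asarray(y_score, float)
    rows = []
    for c in cutoffs:
        y_pred = (y_score >= c).astype(int)                 # classify as event above the cut-off
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
        accuracy = (tp + tn) / len(y_true)
        rows.append((float(c), sensitivity, specificity, accuracy))
    return rows
```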