Data Mining Model Comparison
Paolo Giudici
University of Pavia
Summary. The aim of this contribution is to illustrate the role of statistical models and, more generally, of statistics, in choosing a Data Mining model. After a preliminary introduction on the distinction between Data Mining and statistics, we focus on the issue of how to choose a Data Mining methodology. This illustrates well how statistical thinking can bring real added value to a Data Mining analysis, without which it becomes rather difficult to make a reasoned choice. In the third part of the paper we present, by means of a case study in credit risk management, how Data Mining and statistics can profitably interact.
Key words: Model choice, statistical hypotheses testing, cross-validation, loss functions, credit risk management, logistic regression models
32.1 Data Mining and Statistics
Statistics has always been involved with creating methods to analyse data. The main difference compared to the methods developed in Data Mining is that statistical methods are usually developed in relation to the data being analyzed, but also according to a conceptual reference paradigm. Although this has made the various statistical methods available coherent and rigorous at the same time, it has also limited their ability to adapt quickly to the methodological requests put forward by the developments in the field of information technology.
There are at least four aspects that distinguish the statistical analysis of data from Data Mining.
First, while statistical analysis traditionally concerns itself with analyzing primary data that has been collected to check specific research hypotheses, Data Mining can also concern itself with secondary data collected for other reasons. This is the norm, for example, when analyzing company data that comes from a data warehouse. Furthermore, while in the statistical field the data can be of an experimental nature (the data could be the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), in Data Mining the data is typically of an observational nature.
Second, Data Mining is concerned with analyzing great masses of data. This implies new considerations for statistical analysis. For example, for many applications it is impossible to analyze or even access the whole database, for reasons of computer efficiency. Therefore
it becomes necessary to have a sample of the data from the database being examined. This sampling must be carried out bearing in mind the Data Mining aims and, therefore, it cannot be analyzed with the traditional statistical sampling theory tools.
Third, many databases do not lead to the classic forms of statistical data organization. This is true, for example, of data that comes from the Internet. This creates the need for appropriate analytical methods to be developed, which are not available in the statistics field.
One last, but very important, difference that we have already mentioned is that Data Mining results must be of some consequence. This means that constant attention must be given to business results achieved with the data analysis models.
32.2 Data Mining Model Comparison
Several classes of computational and statistical methods for Data Mining are available. Once a class of models has been established, the problem is to choose the "best" model from it. In this chapter, summarized from Chapter 6 of (Giudici, 2003), we present a systematic comparison of such models.
Comparison criteria for Data Mining models can be classified schematically into: criteria based on statistical tests, criteria based on scoring functions, Bayesian criteria, computational criteria, and business criteria.
The first are based on the theory of statistical hypothesis testing and, therefore, there is a lot of detailed literature related to this topic; see, for example, a text about statistical inference, such as (Mood et al., 1991) or (Bickel and Doksum, 1977). A statistical model can be specified by a discrete probability function or by a probability density function, f(x). Such a model is usually left unspecified, up to unknown quantities that have to be estimated on the basis of the data at hand. Typically, the observed sample is not sufficient to reconstruct each detail of f(x), but it can indeed be used to approximate f(x) with a certain accuracy. Often a density function is parametric, so that it is defined by a vector of parameters Θ = (θ_1, ..., θ_I), such that each value θ of Θ corresponds to a particular density function, p_θ(x). In order to measure the accuracy of a parametric model, one can resort to the notion of distance between a model f, which underlies the data, and an approximating model g (see, for instance, (Zucchini, 2000)).
Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates.
The entropic distance of a distribution g from a target distribution f is:

\[ E_d = \sum_i f_i \log \frac{f_i}{g_i} \qquad (32.1) \]

The chi-squared distance of a distribution g from a target distribution f is instead:

\[ \chi^2_d = \sum_i \frac{(f_i - g_i)^2}{g_i} \qquad (32.2) \]
The 0-1 distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is:

\[ 0\text{-}1_d = \sum_{r=1}^{n} 1\left(X_{fr}, X_{gr}\right) \qquad (32.3) \]
where 1(w, z) = 1 if w = z and 0 otherwise.
For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in the Cartesian plane. Another possible choice is the uniform distance, applied when nonparametric models are being used.
The Euclidean distance between a distribution g and a target f is expressed by the equation:

\[ {}_2d\left(X_f, X_g\right) = \sqrt{\sum_{r=1}^{n} \left(X_{fr} - X_{gr}\right)^2} \qquad (32.4) \]
Given two distribution functions F and G, with values in [0, 1], the uniform distance is defined as the quantity:

\[ \sup_{t} \left| F(t) - G(t) \right| \qquad (32.5) \]
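To make the preceding definitions concrete, the following sketch (in Python, with hypothetical toy data; the function names are ours, not part of any library) computes the distances above for discrete distributions and prediction vectors.

```python
import numpy as np

def entropic_distance(f, g):
    """Entropic distance of g from the target f (eq. 32.1)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0                        # terms with f_i = 0 contribute nothing
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

def chi_squared_distance(f, g):
    """Chi-squared distance of g from the target f (eq. 32.2)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.sum((f - g) ** 2 / g))

def zero_one_distance(x_f, x_g):
    """0-1 distance between observed and predicted labels (eq. 32.3)."""
    return int(np.sum(np.asarray(x_f) == np.asarray(x_g)))   # 1(w, z) = 1 when w = z

def euclidean_distance(x_f, x_g):
    """Euclidean distance between two quantitative vectors (eq. 32.4)."""
    x_f, x_g = np.asarray(x_f, float), np.asarray(x_g, float)
    return float(np.sqrt(np.sum((x_f - x_g) ** 2)))

def uniform_distance(F_values, G_values):
    """Uniform (sup) distance between two distribution functions on a grid (eq. 32.5)."""
    return float(np.max(np.abs(np.asarray(F_values) - np.asarray(G_values))))

# Toy example: target distribution f and approximating distribution g.
f = [0.2, 0.5, 0.3]
g = [0.25, 0.45, 0.30]
print(entropic_distance(f, g), chi_squared_distance(f, g))
```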
Any of the previous distances can be employed to define the notion of discrepancy of a statistical model. The discrepancy of a model, g, can be obtained as the discrepancy between the unknown probabilistic model, f, and the best (closest) parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f.
Assume that f represents the unknown density of the population, and let g = p_θ be a family of density functions (indexed by a vector of I parameters, θ) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g with respect to a target model f is:

\[ \Delta(f, p_\vartheta) = \sum_{i=1}^{n} \left( f(x_i) - p_\vartheta(x_i) \right)^2 \qquad (32.6) \]
A common choice of discrepancy function is the Kullback-Leibler divergence, which derives from the entropic distance and can be applied to any type of observations. In this context, the best model can be interpreted as the one with a minimal loss of information from the true unknown distribution.
The Kullback-Leibler divergence of a parametric model p_θ with respect to an unknown density f is defined by:

\[ \Delta_{K\text{-}L}(f, p_\vartheta) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\vartheta}}(x_i)} \qquad (32.7) \]

where the parametric density in the denominator has been evaluated at the values of the parameters which minimize the distance with respect to f.
It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy. The most used of such estimators is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons between pairs of alternative models. The idea is to compare the log-likelihood scores of two alternative models.
The log-likelihood score is then defined by:

\[ -2 \sum_{i=1}^{n} \log p_{\hat{\vartheta}}(x_i) \qquad (32.8) \]

Hypothesis testing theory allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen. To summarize,
using statistical tests it is possible to make an accurate choice among the models, based on the observed data. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs; therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be a valid assumption.
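As a concrete sketch of such a pairwise comparison (using scipy; the model log-likelihoods below are hypothetical numbers, not results from the case study), two nested models can be compared through the difference of their log-likelihood scores, referred to a chi-squared threshold:

```python
from scipy.stats import chi2

def compare_nested_models(loglik_simple, loglik_complex, df_diff, alpha=0.05):
    """Compare two nested models via the difference of their -2 log-likelihood scores."""
    statistic = (-2.0 * loglik_simple) - (-2.0 * loglik_complex)   # deviance difference
    threshold = chi2.ppf(1.0 - alpha, df_diff)                     # critical value
    p_value = chi2.sf(statistic, df_diff)
    return statistic, p_value, statistic > threshold               # True: keep the complex model

# Hypothetical maximised log-likelihoods of two fitted logistic regressions.
stat, p, keep_complex = compare_nested_models(-512.4, -505.1, df_diff=3)
print(f"deviance difference = {stat:.2f}, p-value = {p:.4f}, keep complex model: {keep_complex}")
```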
A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as described, for instance, by the number of parameters. It is thus necessary to employ score functions that penalise model complexity.
The most important of such functions is the AIC (Akaike Information Criterion; see (Akaike, 1974)). The AIC criterion is defined by the following equation:

\[ \mathrm{AIC} = -2 \log L(\hat{\vartheta}; x_1, \dots, x_n) + 2q \qquad (32.9) \]

where the first term is minus twice the logarithm of the likelihood function, calculated at the maximum likelihood parameter estimate, and q is the number of parameters of the model. From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity. The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid, and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated in (Schwarz, 1978). The BIC criterion is defined by the following expression:
\[ \mathrm{BIC} = -2 \log L(\hat{\vartheta}; x_1, \dots, x_n) + q \log n \qquad (32.10) \]
As can be seen from its definition, the BIC differs from the AIC only in the second part, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework.
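A minimal sketch of the two scores, computed directly from a model's maximised log-likelihood, its number of parameters q and the sample size n (the numerical values below are hypothetical, for illustration only):

```python
import numpy as np

def aic(loglik, q):
    """Akaike Information Criterion (eq. 32.9)."""
    return -2.0 * loglik + 2.0 * q

def bic(loglik, q, n):
    """Bayesian Information Criterion (eq. 32.10)."""
    return -2.0 * loglik + q * np.log(n)

# Hypothetical maximised log-likelihoods and parameter counts of three candidate models.
candidates = {"model_1": (-498.7, 40), "model_2": (-505.1, 12), "model_3": (-503.8, 15)}
n = 3712
for name, (loglik, q) in candidates.items():
    print(name, round(aic(loglik, q), 1), round(bic(loglik, q, n), 1))
```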
To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can be used also to compare non-nested models and, more generally, models that do not belong to the same class (for instance a probabilistic neural network and a linear regression model). However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine whether the difference between two models is significant or not, and how it compares to another difference. These criteria are indeed useful in a preliminary exploration phase. To examine these criteria and to compare them with the previous ones see, for instance, (Zucchini, 2000) or (Hand et al., 2001).
A possible "compromise" between the previous two criteria is given by Bayesian criteria, which can be developed in a rather coherent way (see, e.g., (Bernardo and Smith, 1994)). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general purpose software. For Data Mining works using Bayesian criteria the reader can see, for instance, (Giudici, 2003) and (Giudici and Castelo, 2001).
The intensive, widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analyzed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non-probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general purpose software has made this task easier. The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples, a "training" sample, with n − m observations, and a "validation" sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion, the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function.
One problem regarding the cross-validation criterion is in deciding how to select m, that is, the number of observations contained in the "validation sample". For example, if we select m = n/2, then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations for the validation sampling group and therefore reducing the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.
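A minimal sketch of this criterion, assuming a generic model object with fit/predict methods and numeric arrays X and y: the data are split 75%/25%, the model is fitted on the training part, and a discrepancy (here the misclassification rate) is evaluated on the validation part.

```python
import numpy as np

def validation_discrepancy(model, X, y, validation_fraction=0.25, seed=0):
    """Fit on n - m observations and measure the discrepancy on the m held-out ones."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(round(validation_fraction * n))        # size of the validation sample
    idx = rng.permutation(n)
    valid_idx, train_idx = idx[:m], idx[m:]
    model.fit(X[train_idx], y[train_idx])          # fit on the training sample
    predictions = model.predict(X[valid_idx])      # predict on the validation sample
    return float(np.mean(predictions != y[valid_idx]))   # 0-1 discrepancy (misclassification rate)
```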
To summarize, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see (Hastie et al., 2001).
One last group of criteria seems specifically tailored for the Data Mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting Data Mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see for instance (Bernardo and Smith, 1994)). They are of great interest and have great application potential, although at present they are mainly concerned with solving classification problems. For a more detailed examination of these criteria the reader can see, for example, (Hand, 1997, Hand et al., 2001) or the reference manuals of Data Mining software, such as that of SAS Enterprise Miner.
The idea behind these methods is that, when choosing among alternative models, it is important to focus attention on comparing the utility of the results obtained from the models, and not just on the statistical comparison between the models themselves. Since the main problem dealt with by data analysis is to reduce uncertainties about the risk factors, or "loss" factors, reference is often made to developing criteria that minimize the loss connected to the problem being examined. In other words, the best model is the one that leads to the least loss.
Most of the loss function based criteria apply to predictive classification problems, where the concept of a confusion matrix arises. The confusion matrix is used as an indication of the properties of a classification (discriminant) rule. It contains the number of elements that have been correctly or incorrectly classified for each class. On its main diagonal we can see the number of observations that have been correctly classified for each class, while the off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is (explicitly or implicitly) assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called rate of error, or misclassification error, and it is the quantity which must be minimized. Of course, the assumption of equal costs can be replaced by weighting errors with their relative costs.
The confusion matrix gives rise to a number of graphs that can be used to assess the relative utility of a model, such as the lift chart and the ROC curve. For a detailed illustration
of these graphs we refer to (Hand, 1997) or (Giudici, 2003). The lift chart puts the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. Subsequently, it subdivides such scores into deciles. It then calculates and graphs the observed probability of success for each of the decile classes in the validation set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, taking the mean of the observed success probabilities.
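The construction just described can be sketched as follows, assuming arrays of validation-set scores (estimated success probabilities) and of observed binary outcomes; the function name is ours, for illustration only.

```python
import numpy as np

def lift_by_decile(scores, outcomes, n_groups=10):
    """Observed success rate and lift for each decile of decreasing estimated score."""
    scores = np.asarray(scores, float)
    outcomes = np.asarray(outcomes, float)
    order = np.argsort(-scores)                    # sort by decreasing estimated probability
    sorted_outcomes = outcomes[order]
    baseline = outcomes.mean()                     # overall observed success rate (baseline model)
    rows = []
    for i, group in enumerate(np.array_split(sorted_outcomes, n_groups), start=1):
        observed = group.mean()                    # observed success rate in this decile
        rows.append((100 * i // n_groups, observed, observed / baseline))
    return rows                                    # (percentile, observed rate, lift)
```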
The ROC (Receiver Operating Characteristic) curve is a graph that also measures the predictive accuracy of a model. It is based on four conditional frequencies that can be derived from a model and the choice of a cut-off point for its scores:
• the observations predicted as events and effectively such (sensitivity);
• the observations predicted as events and effectively non-events;
• the observations predicted as non-events and effectively events;
• the observations predicted as non-events and effectively such (specificity).
The ROC curve is obtained by representing, for any fixed cut-off value, a point in the Cartesian plane having as x-value the false positive rate (1 − specificity) and as y-value the sensitivity value. Each point on the curve therefore corresponds to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the y-axis. To summarize, criteria based on loss functions have the advantage of being easy to interpret and, therefore, well suited for Data Mining applications; on the other hand, they still need formal improvements and mathematical refinements. In the next section we give an example of how this can be done, and show that statistics and Data Mining applications can fruitfully interact.
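Before moving to the case study, here is a sketch of the ROC construction, assuming scikit-learn is available; the outcome and score vectors below are hypothetical toy values, used only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical validation outcomes (1 = event) and model scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.80, 0.20, 0.60, 0.90, 0.40, 0.70])

# Each point of the curve corresponds to one cut-off on the score:
# x = 1 - specificity (false positive rate), y = sensitivity (true positive rate).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("Area under the ROC curve:", auc(fpr, tpr))
```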
32.3 Application to Credit Risk Management
We now apply the previous considerations to a case study that concerns credit risk management. The objective of the analysis is the evaluation of the credit reliability of small and medium enterprises (SMEs) that demand financing for their development.
In order to assess credit reliability, each applicant for credit is associated with a score, usually expressed in terms of probability of repayment (default probability). Data Mining methods are used to estimate such a score and, on the basis of it, to classify applicants as being reliable (worthy of credit) or not.
Data Mining models for credit scoring are of the predictive (or supervised) kind: they use explanatory variables obtained from information available on the applicant in order to get an estimate of the probability of repayment (target or response variable). The methods most used in practical credit scoring applications are: linear and logistic regression models, neural networks and classification trees. Often, in banking practice, the resulting scores are called "statistical" and supplemented with subjective, judgemental evaluations.
In this section we consider the analysis of a database that includes 7134 SMEs belonging to the retail segment of an important Italian bank. The retail segment contains companies with total annual sales of less than 2.5 million per year. On each of these companies the bank has calculated a score, in order to evaluate their financing (or refinancing) in the period from April 1st, 1999 to April 30th, 2000. After data cleaning, 13 variables are included in the analysis database, of which one binary variable that expresses credit reliability (BAD = 0 for the reliable companies, BAD = 1 for the non-reliable ones) can be considered as the response or target variable. The sample contains about 361 companies with BAD = 1 (about 5%) and 6773 with BAD = 0 (about 95%). The objective of the analysis is to build a statistical rule that explains the target variable as a function of the explanatory variables. Once built on the observed data, such a rule will be extrapolated to assess and predict future applicants for credit. Notice the imbalance of the distribution of the target response: this situation, typical in predictive Data Mining problems, poses serious challenges to the performance of a model.
The remaining 12 available variables are assumed to influence reliability and can be considered as explanatory predictors. Among them we have: the age of the company, its legal status, the number of employees, the total sales and the variation of the sales in the last period, the region of residence, the specific business, and the duration of the relationship of the managers of the company with the bank. Most of them can be considered as "demographic" information on the company, stable in time but indeed not very powerful for building a statistical model. However, it must be said that, since the companies considered are all SMEs, it is rather difficult to rely on other information, such as balance sheet data.
A preliminary exploratory analysis can give indications on how to code the explanatory variables, in order to maximize their predictive power. In order to reach this objective we have employed statistical measures of association between pairs of variables, such as chi-squared based measures, and statistical measures of dependence, such as Goodman and Kruskal's (see (Giudici, 2003) for a systematic comparison of such measures). We remark that the use of such tools is very beneficial for the analysis and can considerably improve the final performance results. As a result of our analysis, all explanatory variables have been discretised, with a number of levels ranging from 2 to 26.
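As a sketch of this kind of preliminary screening (assuming pandas and scipy, with a hypothetical dataframe df containing the discretised predictors and the BAD target), the association between each candidate coding and the target can be measured from the corresponding contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_squared_association(df, predictor, target="BAD"):
    """Chi-squared statistic of association between a discretised predictor and the binary target."""
    table = pd.crosstab(df[predictor], df[target])          # contingency table
    statistic, p_value, dof, _ = chi2_contingency(table)
    return statistic, p_value, dof
```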
In order to focus on the issue of model comparison, we now concentrate on the comparison of three different logistic regression models on the data. This model is the most used in credit scoring applications; other models that are employed are classification trees, linear discriminant analysis and neural networks. Here we prefer to compare models belonging to the same class, to better illustrate our issue; for a detailed comparison of credit scoring methods, on a different data set, see (Giudici, 2003). Our analysis has been conducted using the SAS and SAS Enterprise Miner software, available at the bank subject of the analysis.
We have chosen, in agreement with the bank's experts, three logistic regression models: a saturated model, which contains all explanatory variables, with the levels obtained from the exploratory analysis; a statistically selected model, using pairwise statistical hypothesis testing; and a model that minimizes the loss function. In the following, the saturated model will be named "RegA (model A)"; the model chosen according to a statistical selection strategy, "RegB (model B)"; and the model chosen minimizing the loss function, "RegC (model C)". Statistical model comparison has been carried out using a stepwise model selection approach,
with a reference value of 0.05 to compare p-values with. On the other hand, the loss function has been expressed by the bank's experts as a function of the classification errors. Table 32.1 below describes such a loss function.
Table 32.1 The chosen loss function

                      Predicted
  Actual         BAD       GOOD
  BAD              0         20
  GOOD            -1          0
The table contains the estimated losses (in scale-free values) corresponding to the combinations of actual and predicted values of the target variable. The specified loss function means that it is assumed that giving credit to a non-reliable (bad) enterprise is 20 times more costly than not giving credit to a reliable (good) enterprise. In statistical terms, the type I error costs 20 times the type II error. As each of the four scenarios in Table 32.1 has an occurrence probability, it is possible to calculate the expected loss of each considered statistical model. The best one will be the one minimizing such expected loss.
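A sketch of the expected-loss computation under the loss function of Table 32.1, assuming vectors of actual and predicted classes on the validation set, coded as 1 = BAD and 0 = GOOD:

```python
import numpy as np

# Loss for each (actual, predicted) combination, as specified in Table 32.1.
LOSS = {(1, 1): 0, (1, 0): 20,    # actual BAD: predicted BAD = 0, predicted GOOD = 20
        (0, 1): -1, (0, 0): 0}    # actual GOOD: predicted BAD = -1, predicted GOOD = 0

def expected_loss(actual, predicted):
    """Average loss of a classification rule under the chosen loss function."""
    actual = np.asarray(actual, int)
    predicted = np.asarray(predicted, int)
    return float(np.mean([LOSS[(a, p)] for a, p in zip(actual, predicted)]))
```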
In the SAS Enterprise Miner tool, the Assessment node provides a common framework to compare models in terms of their predictions. This requires that the data has been partitioned into two or more datasets, according to the computational criteria of model comparison. The Assessment node produces a table view of the model results that lists relevant statistics and measures of model adequacy, as well as several different charts/reports, depending on whether the target variable is continuous or categorical and whether a profit/loss function has been specified.
In the case under examination, the initial dataset (5351 observations) has been split in two, using a sampling mechanism stratified with respect to the target variable. The training dataset contains about 70% of the observations (about 3712) and the validation dataset the remaining 30% (about 1639 observations). As the samples are stratified, in both the resulting datasets the percentages of "bad" and "good" enterprises remain the same as those in the combined dataset (5% and 95%).
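A sketch of such a stratified partition, assuming scikit-learn and arrays X (discretised explanatory variables) and y (the binary BAD target); the stratify argument preserves the 5%/95% class proportions in both subsets.

```python
from sklearn.model_selection import train_test_split

def stratified_partition(X, y, validation_fraction=0.30, seed=1):
    """70/30 split that keeps the proportions of bad and good enterprises in both datasets."""
    return train_test_split(X, y, test_size=validation_fraction,
                            stratify=y, random_state=seed)
```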
The first model comparison tool we consider is the lift chart. For a binary target, the lift chart (also called gains chart) is built as follows. The scored data set is sorted by the probabilities of the target event in descending order; observations are then grouped into deciles. For each decile, a lift chart can display either the percentage of target responses (bad repayers here) or the ratio between this percentage and the corresponding one for the baseline (random) model, called the lift. Lift charts show the percent of positive responses or the lift value on the vertical axis. Table 32.2 shows the calculations that give rise to the lift chart for the credit scoring problem considered here. Figure 32.1 shows the corresponding curves.
Table 32.2 Calculations for the lift chart

Observations   Percentile   % captured responses   % captured responses   % captured responses   % captured responses
per group                   (BASELINE)             (REG A)                (REG B)                (REG C)
163.90             10          5.064                  20.134                 22.575                 22.679
163.90             20          5.064                  12.813                 12.813                 14.033
163.90             30          5.064                   9.762                 10.103                 10.293
163.90             40          5.064                   8.237                  8.237                  8.542
163.90             50          5.064                   7.322                  7.383                  7.445
163.90             60          5.064                   6.508                  6.913                  6.624
163.90             70          5.064                   5.753                  6.237                  6.096
163.90             80          5.064                   5.567                  5.567                  5.644
163.90             90          5.064                   5.288                  5.220                  5.185
163.90            100          5.064                   5.064                  5.064                  5.064
Fig. 32.1 Lift charts for the best model
Comparing the results in Table 32.2 and Figure 32.1, it emerges that the performances of the three models being compared are rather similar; however, the best model seems to be model C (the model that minimises the losses), as it is the model that, in the first deciles, is able to effectively capture more bad enterprises, a difficult task in the given problem. Recalling that the actual percentage of bad enterprises observed is equal to 5%, the previous graph can be normalized by dividing the percentage of bads in each decile by the overall 5% percentage. The result is the actual lift of a model, that is, the actual improvement with respect to the baseline situation of absence of a model (as if each company were estimated good/bad according to a purely random mechanism). In terms of model C, in the first decile (with about 164 enterprises) the lift is equal to 4.46 (i.e. 22.7% / 5.1%); this means that, using model C, it is expected to obtain, in the first decile, a number of bad enterprises about 4.5 times higher than in a random sample of the considered enterprises.
The second Assessment tool we consider is the threshold chart. Threshold-based charts enable one to display the agreement between the predicted and actual target values across a range of threshold levels. The threshold level is the cut-off that is used to classify an observation on the basis of the event-level posterior probabilities. The default threshold level is 0.50. For the credit scoring case, the calculations leading to the threshold chart are in Table 32.3 and the corresponding chart in Figure 32.3 below.
In order to interpret the previous table and figure correctly, let us consider some numerical examples. First, we remark that the results refer to the validation dataset, with 1629 enterprises.
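A sketch of the quantities behind such a threshold chart, assuming arrays of validation outcomes and model scores: for a grid of cut-off values, the agreement measures (sensitivity, specificity and overall accuracy) are computed.

```python
import numpy as np

def threshold_chart(y_true, y_score, cutoffs=np.linspace(0.05, 0.95, 19)):
    """Sensitivity, specificity and accuracy for a grid of classification cut-offs."""
    y_true = np.asarray(y_true, int)
    y_score = np.asarray(y_score, float)
    rows = []
    for c in cutoffs:
        y_pred = (y_score >= c).astype(int)                 # classify as event above the cut-off
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
        accuracy = (tp + tn) / len(y_true)
        rows.append((float(c), sensitivity, specificity, accuracy))
    return rows
```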