37 Bias vs Variance Decomposition for Regression and Classification
Pierre Geurts
Department of Electrical Engineering and Computer Science, University of Liège, Belgium; Postdoctoral Researcher, F.N.R.S., Belgium
Summary. In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean square error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight into how the parameters of a learning algorithm influence bias and variance.
Key words: bias, variance, supervised learning, overfitting
37.1 Introduction
The general problem of supervised learning is often formulated as an optimization problem. An error measure is defined that evaluates the quality of a model, and the goal of learning is to find, in a family of models (the hypothesis space), a model that minimizes this error estimated on the learning sample (or dataset) S. So, at first sight, if no good enough model is found in this family, it should be sufficient to extend the family or to exchange it for a more powerful one in terms of model flexibility. However, we are often interested in a model that generalizes well on unseen data rather than in a model that perfectly predicts the output for the learning sample cases. And, unfortunately, in practice, good results on the learning set do not necessarily imply good generalization performance on unseen data, especially if the "size" of the hypothesis space is large in comparison to the sample size.
Let us use a simple one-dimensional regression problem to explain intuitively why larger hypothesis spaces do not necessarily lead to better models. In this synthetic problem, learning outputs are generated according to $y = f_b(x) + \varepsilon$, where $f_b$ is represented by the dashed curves in Figure 37.1 and $\varepsilon$ is distributed according to a Gaussian $N(0,\sigma)$ distribution. With squared error loss, we will see below that the best possible model for this problem is $f_b$ and its average squared error is $\sigma^2$. Let us consider two extreme situations of a bad model structure choice:
• A too simple model: using a linear model $y = wx + b$ and minimizing the squared error on the learning set, we obtain the estimates given in the left part of Figure 37.1 for two different learning set choices.
Fig. 37.1. Left: a linear model fitted to two learning samples. Right: a neural network fitted to the same samples.
These models are not very good, neither on their learning sets nor in generalization. Whatever the learning set, there will always remain an error due to the fact that the model is too simple with respect to the complexity of $f_b$.
• A too complex model: by using a very complex model like a neural network with two hidden layers of ten neurons each, we get the functions shown in the right part of Figure 37.1 for the same learning sets. This time, the models receive an almost perfect score on the learning set. However, their generalization errors are still not very good, because of two phenomena. First, the learning algorithm is able to match the learning set perfectly, and hence also the noise term. We say in this case that the learning algorithm "overfits" the data. Second, even if there were no noise, there would still remain some error due to the high complexity of the model. Indeed, the learning algorithm has many different models at its disposal and, if the learning set size is relatively small, several of them will realize a perfect match of the learning set. As at most one of them is a perfect image of the best model, any other choice by the learning algorithm will result in suboptimality. (A small simulation sketch illustrating both situations is given after this list.)
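The following minimal sketch reproduces both situations numerically. It is only an illustration under assumed settings: the true function $f_b$, the noise level, the sample size, and the use of a high-degree polynomial as a stand-in for the over-complex neural network are all choices made here, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_b(x):
    # Assumed Bayes model; the chapter only shows it as a dashed curve.
    return np.sin(2 * np.pi * x)

sigma, m = 0.3, 20          # assumed noise level and learning sample size

def draw_sample():
    x = rng.uniform(0, 1, m)
    return x, f_b(x) + rng.normal(0, sigma, m)

x_grid = np.linspace(0, 1, 500)

for name, degree in [("too simple (linear)", 1), ("too complex (degree 10)", 10)]:
    for s in (1, 2):                           # two independent learning samples, as in Fig. 37.1
        x, y = draw_sample()
        coeffs = np.polyfit(x, y, degree)      # least-squares fit of a polynomial model
        train_mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        # Generalization error = residual noise + squared distance to f_b over a dense grid.
        gen_mse = sigma ** 2 + np.mean((f_b(x_grid) - np.polyval(coeffs, x_grid)) ** 2)
        print(f"{name}, sample {s}: train MSE = {train_mse:.3f}, generalization MSE = {gen_mse:.3f}")
```

Typically, the linear fits show a similar, sizeable error on both samples, while the high-degree fits achieve a much smaller training error but a larger and strongly sample-dependent generalization error.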
The main source of error is very different in the two cases. In the first case, the error is essentially independent of the particular learning set and must be attributed to the lack of complexity of the model. This source of error is called bias. In the second case, on the other hand, the error may be attributed to the variability of the model from one learning set to another (which is due on the one hand to overfitting and on the other hand to the sparse nature of the learning set with respect to the complexity of the model). This source of error is called variance. Note that in the first case there is also a dependence of the model on the learning set and thus some variability of the predictions; however, the resulting variance is negligible with respect to the bias. In general, bias and variance both depend on the complexity of the model, but in opposite directions, and thus there must exist an optimal tradeoff between these two sources of error. As a matter of fact, this optimal tradeoff also depends on the smoothness of the best model and on the sample size. An important consequence of this is that, because of variance, we should always be careful not to increase the complexity of the model structure too much with respect to the complexity of the problem and the size of the learning sample.
In the next section, we give a formal additive decomposition of the mean (over all learning set choices) squared error into two terms that represent the bias and the variance effect. Some similar decompositions proposed in the context of 0-1 loss functions are also discussed; they reveal some fundamental differences between the two types of problems, although the concepts of bias and variance remain useful in both. Section 3 discusses procedures to estimate the bias and variance terms for practical problems. In Section 4, we give some experiments and applications of bias/variance decompositions.
37.2 Bias/Variance Decompositions
Let us introduce some notation. A learning sample $S$ is a collection of $m$ input/output pairs $(\langle x_1,y_1\rangle, \ldots, \langle x_m,y_m\rangle)$, each one randomly and independently drawn from a probability distribution $P_D(x,y)$. A learning algorithm $I$ produces a model $I(S)$ from $S$, i.e., a function from the inputs $x$ to the domain of $y$. The error of this model is computed as the expectation:

$$\mathrm{Error}(I(S)) = E_{x,y}[L(y, I(S)(x))],$$

where $L$ is some loss function that measures the discrepancy between its two arguments. Since the learning sample $S$ is randomly drawn from the distribution $D$, the model $I(S)$ and its prediction $I(S)(x)$ at $x$ are also random. Hence, $\mathrm{Error}(I(S))$ is again a random variable, and we are interested in studying the expected value of this error over the set of all learning sets of size $m$, $E_S[\mathrm{Error}(I(S))]$. This error can be decomposed into:

$$E_S[\mathrm{Error}(I(S))] = E_x[E_S[E_{y|x}[L(y, I(S)(x))]]] = E_x[E_S[\mathrm{Error}(I(S)(x))]],$$

where $\mathrm{Error}(I(S)(x))$ denotes the local error at point $x$.
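For intuition, $E_S[\mathrm{Error}(I(S))]$ can be approximated by simulation whenever learning samples can be drawn at will. The following is a minimal sketch only; the data-generating distribution and the choice of a least-squares linear model as the learning algorithm $I$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m = 0.3, 20                         # assumed noise level and learning sample size

def draw_sample(n):
    # Assumed data-generating distribution P_D(x, y).
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

def learn(x, y):
    # I(S): an example learning algorithm (least-squares linear model).
    w, b = np.polyfit(x, y, 1)
    return lambda x_new: w * x_new + b

x_test, y_test = draw_sample(10_000)       # large test set approximates E_{x,y}[.]
errors = [np.mean((y_test - learn(*draw_sample(m))(x_test)) ** 2)
          for _ in range(200)]             # 200 learning samples approximate E_S[.]

print(f"estimated E_S[Error(I(S))] ~= {np.mean(errors):.3f}")
```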
Bias/variance decompositions usually try to decompose this error into three terms: the residual or minimal attainable error, the systematic error, and the effect of the variance. The exact decomposition depends on the loss function $L$. The next two subsections are devoted to the most common loss functions, i.e., the squared loss for regression problems and the 0-1 loss for classification problems. Notice, however, that these are not the only plausible loss functions, and several authors have studied bias/variance decompositions for other loss functions (Wolpert, 1997; Hansen, 2000). Actually, several of the decompositions for 0-1 loss presented below are derived as special cases of more general bias/variance decompositions (Tibshirani, 1996; Wolpert, 1997; Heskes, 1998; Domingos, 1996; James, 2003). The interested reader may refer to these references for more details.
37.2.1 Bias/Variance Decomposition of the Squared Loss
When the output $y$ is numerical, the usual loss function is the squared loss $L_2(y_1,y_2) = (y_1 - y_2)^2$. With this loss function, it is easy to show that the best possible model is $f_b(x) = E_{y|x}[y]$, which takes the expectation of the target $y$ at each point $x$. The best model according to a given loss function is often called the Bayes model in statistical pattern recognition. Introducing this model in the mean local error, we get with some elementary calculations:

$$E_S[\mathrm{Error}(I(S)(x))] = E_{y|x}[(y - f_b(x))^2] + E_S[(f_b(x) - I(S)(x))^2]. \quad (37.1)$$
Symmetrically to the Bayes model, let us define the average model, $f_{avg}(x) = E_S[I(S)(x)]$, which outputs the average prediction over all learning sets. Introducing this model in the second term of Equation (37.1), we obtain:
$$E_S[(f_b(x) - I(S)(x))^2] = (f_b(x) - f_{avg}(x))^2 + E_S[(I(S)(x) - f_{avg}(x))^2].$$
In summary, we have the following well-known decomposition of the mean square error at a point $x$:

$$E_S[\mathrm{Error}(I(S)(x))] = \sigma_R^2(x) + \mathrm{bias}_R^2(x) + \mathrm{var}_R(x)$$

by defining:

$$\sigma_R^2(x) = E_{y|x}[(y - f_b(x))^2], \quad (37.2)$$
$$\mathrm{bias}_R^2(x) = (f_b(x) - f_{avg}(x))^2, \quad (37.3)$$
$$\mathrm{var}_R(x) = E_S[(I(S)(x) - f_{avg}(x))^2]. \quad (37.4)$$

This error decomposition is well known in estimation theory and was introduced into the machine learning community by Geman et al. (1995).
The residual squared error, $\sigma_R^2(x)$, is the error obtained by the best possible model. It provides a theoretical lower bound that is independent of the learning algorithm. The suboptimality of a particular learning algorithm is thus composed of two terms: the (squared) bias measures the discrepancy between the best and the average model, i.e., how good the estimate is on average; the variance measures the variability of the predictions with respect to the randomness of the learning set.
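On a synthetic problem, where arbitrarily many learning samples can be drawn, the three terms of this decomposition can be estimated at a given point by straightforward simulation. Here is a minimal sketch; the true function, noise level, sample size, and the linear learner are assumptions chosen for illustration, not part of the chapter's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_b(x):                        # assumed Bayes model
    return np.sin(2 * np.pi * x)

sigma, m, n_repeats = 0.3, 20, 2000
x0 = 0.25                          # point at which the local error is decomposed

preds = np.empty(n_repeats)
for i in range(n_repeats):
    x = rng.uniform(0, 1, m)
    y = f_b(x) + rng.normal(0, sigma, m)
    w, b = np.polyfit(x, y, 1)     # I(S): least-squares linear model
    preds[i] = w * x0 + b          # I(S)(x0)

f_avg = preds.mean()                          # f_avg(x0), the average model at x0
residual = sigma ** 2                         # sigma_R^2(x0), known here by construction
bias2 = (f_b(x0) - f_avg) ** 2                # bias_R^2(x0), Eq. (37.3)
variance = preds.var()                        # var_R(x0), Eq. (37.4)

print(f"residual = {residual:.4f}, bias^2 = {bias2:.4f}, variance = {variance:.4f}")
print(f"their sum = {residual + bias2 + variance:.4f}  (approximates E_S[Error(I(S)(x0))])")
```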
Fig. 37.2. Top: the average models; bottom: residual error, bias, and variance.
To explain why these two terms are indeed the consequence of the two phenomena discussed in the introduction of this chapter, let us come back to our simple regression problem. The average model is depicted in the top of Figure 37.2 for the two cases of bad model choice; residual error, bias, and variance at each position $x$ are drawn in the bottom of the same figure. The residual error is entirely specified by the problem and the loss criterion, and hence is independent of the algorithm and of the learning set used. When the model is too simple, the average model is far from the Bayes model almost everywhere, and thus the bias is large. On the other hand, the variance is small, as the model does not match the learning set very strongly and thus the prediction at each point does not vary much from one learning set to another. Bias is thus the dominant term of the error. When the model is too complex, the distribution of predictions matches very strongly the distribution of outputs at each point. The average prediction is thus close to the Bayes model and the bias is small. However, because of the noise and the small learning set size, predictions are highly variable at each point. In this case, variance is the dominant term of the error.
37.2.2 Bias/Variance Decompositions of the 0-1 Loss
The usual loss function for classification problems (i.e., for a discrete target variable) is the 0-1 loss function, $L_c(y_1,y_2) = 1$ if $y_1 \neq y_2$ and $0$ otherwise, which yields the mean misclassification error at $x$:

$$E_S[\mathrm{Error}(I(S)(x))] = E_S[E_{y|x}[L_c(y, I(S)(x))]] = P_{D,S}(y \neq I(S)(x)).$$

The Bayes model in this case is the model that outputs the most probable class at $x$, i.e., $f_b(x) = \arg\max_c P_D(y = c|x)$. The corresponding residual error is:

$$\sigma_C(x) = 1 - P_D(y = f_b(x)|x).$$
By analogy with the decomposition of the squared error, it is possible to define what we call "natural" bias and variance terms for the 0-1 loss function. First, by symmetry with the Bayes model and by analogy with the squared loss decomposition, the equivalent in classification of the average model is the majority vote classifier defined by:

$$f_{maj}(x) = \arg\max_c P_S(I(S)(x) = c),$$

which outputs at each point the class receiving the majority of votes over the distribution of classifiers induced from the distribution of learning sets. In the squared-loss case, the (squared) bias is the error of the average model with respect to the best possible model; the analogous definition yields here:

$$\mathrm{bias}_C(x) = L_c(f_b(x), f_{maj}(x)).$$

So, biased points are those for which the majority vote classifier disagrees with the Bayes classifier. On the other hand, variance can naturally be defined as:

$$\mathrm{var}_C(x) = E_S[L_c(I(S)(x), f_{maj}(x))] = P_S(I(S)(x) \neq f_{maj}(x)),$$
which is the average error of the models induced from random learning samples $S$ with respect to the majority vote classifier. This definition is indeed a measure of the variability of the predictions at $x$: when $\mathrm{var}_C(x) = 0$, every model outputs the same class whatever the learning set from which it is induced, and $\mathrm{var}_C(x)$ is maximal when the probability of the class given by the majority vote classifier is equal to $1/z$ (with $z$ the number of classes), which corresponds to the most uncertain distribution of predictions.
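At a single point, these natural terms only require the empirical distribution of predictions across models trained on independent learning samples. A minimal sketch (the class labels, the prediction counts, and the function name are all hypothetical):

```python
from collections import Counter

def natural_bias_variance(predictions, bayes_class):
    """Natural 0-1 bias and variance at one point x.

    predictions: classes predicted at x by models trained on independently
                 drawn learning samples S.
    bayes_class: the Bayes (most probable) class at x.
    """
    counts = Counter(predictions)
    f_maj = counts.most_common(1)[0][0]           # majority vote class f_maj(x)
    bias_c = 0 if f_maj == bayes_class else 1     # bias_C(x) = L_c(f_b(x), f_maj(x))
    var_c = 1 - counts[f_maj] / len(predictions)  # var_C(x) = P_S(I(S)(x) != f_maj(x))
    return bias_c, var_c

# Hypothetical predictions of models trained on 10 different learning samples:
preds = ["c2"] * 6 + ["c1"] * 3 + ["c3"]
print(natural_bias_variance(preds, bayes_class="c1"))   # -> (1, 0.4)
```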
Unfortunately, these natural bias and variance terms do not sum up with the residual error to give the local misclassification error. In other words, in general,

$$E_S[\mathrm{Error}(I(S)(x))] \neq \sigma_C(x) + \mathrm{bias}_C(x) + \mathrm{var}_C(x).$$
Let us illustrate on a simple example how increased variance may decrease the average classification error in some situations. Let us suppose that we have a three-class problem such that the
true class probability distribution is given by $(P_D(y = c_1|x), P_D(y = c_2|x), P_D(y = c_3|x)) = (0.7, 0.2, 0.1)$. The best possible prediction at $x$ is thus the class $c_1$, and the corresponding minimal error is 0.3. Let us suppose that we have two learning algorithms $I_1$ and $I_2$ and that the distributions of predictions of the models built by these algorithms are given by:

$$(P_S(I_1(S)(x) = c_1), P_S(I_1(S)(x) = c_2), P_S(I_1(S)(x) = c_3)) = (0.1, 0.8, 0.1),$$
$$(P_S(I_2(S)(x) = c_1), P_S(I_2(S)(x) = c_2), P_S(I_2(S)(x) = c_3)) = (0.4, 0.5, 0.1).$$
So, we observe that both algorithms produce models that will most probably decide class $c_2$ (with probability 0.8 and 0.5, respectively). Thus, the two methods are biased ($\mathrm{bias}_C(x) = 1$). On the other hand, the variances of the two methods are obtained in the following way:

$$\mathrm{var}_C^1(x) = 1 - 0.8 = 0.2 \quad \text{and} \quad \mathrm{var}_C^2(x) = 1 - 0.5 = 0.5,$$
and their mean misclassification errors are found to be

$$E_S[\mathrm{Error}(I_1(S)(x))] = 0.76 \quad \text{and} \quad E_S[\mathrm{Error}(I_2(S)(x))] = 0.61.$$
Thus, between these two methods with identical bias, it is the one having the larger variance that has the smaller average error rate.
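These two numbers follow from averaging the 0-1 loss over both the class distribution and the distribution of predictions (which are independent given $x$), i.e., $E_S[\mathrm{Error}(I(S)(x))] = \sum_c P_S(I(S)(x) = c)\,(1 - P_D(y = c|x))$. A quick check in code of the arithmetic above:

```python
import numpy as np

p_true = np.array([0.7, 0.2, 0.1])      # P_D(y = c_k | x)
p_pred_1 = np.array([0.1, 0.8, 0.1])    # P_S(I_1(S)(x) = c_k)
p_pred_2 = np.array([0.4, 0.5, 0.1])    # P_S(I_2(S)(x) = c_k)

def mean_error(p_pred, p_true):
    # sum_k P_S(pred = c_k) * (1 - P_D(y = c_k | x))
    return float(np.sum(p_pred * (1 - p_true)))

print(f"{mean_error(p_pred_1, p_true):.2f}")   # 0.76
print(f"{mean_error(p_pred_2, p_true):.2f}")   # 0.61
```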
It is easy to see that this happens here because of the existence of a bias. Indeed, with 0-1 loss, an algorithm that has small variance and high bias is an algorithm that systematically (i.e., whatever the learning sample) produces a wrong answer, whereas an algorithm that has a high bias but also a high variance is wrong only for a majority of learning samples, but not necessarily systematically. So, the latter may be better than the former. In other words, with 0-1 loss, more variance can be beneficial because it can lead the system closer to the Bayes classifier.
As a result of this counter-intuitive interaction between the bias and variance terms under 0-1 loss, several authors have proposed their own decompositions. We briefly describe below the most representative of them. For a more detailed discussion of these decompositions, see for example (Geurts, 2002) or (James, 2003). In the following sections, we also present a very different approach to studying bias and variance for the 0-1 loss, due to Friedman (1997), which relates the mean error to the squared bias and variance terms of the class probability estimates.
Some decompositions
Tibshirani (1996) defines the bias as the difference between the probability of the Bayes class and the probability of the majority vote class:

$$\mathrm{bias}_T(x) = P_D(y = f_b(x)|x) - P_D(y = f_{maj}(x)|x). \quad (37.6)$$

Thus, the sum of this bias and the residual error is actually the misclassification error of the majority vote classifier:

$$\sigma_C(x) + \mathrm{bias}_T(x) = 1 - P_D(y = f_{maj}(x)|x) = \mathrm{Error}(f_{maj}(x)).$$
This is exactly the part of the error that would remain if we could completely cancel the variability of the predictions. The variance is then defined as the difference between the mean misclassification error and the error of the majority vote classifier:

$$\mathrm{var}_T(x) = E_S[\mathrm{Error}(I(S)(x))] - \mathrm{Error}(f_{maj}(x)). \quad (37.7)$$

Tibshirani (1996) calls this variance term the aggregation effect: it is the variation of error that results from the aggregation of the predictions over all learning sets. Note that this variance term is not necessarily positive. From different considerations, James (2003) has proposed exactly the same decomposition; to distinguish (37.6) and (37.7) from the natural bias and variance terms, he calls them the systematic effect and the variance effect, respectively. Dietterich and Kong (1995) have proposed a decomposition that applies only to the noise-free case, but that reduces exactly to Tibshirani's decomposition in this case.
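Continuing the hypothetical three-class example above, Tibshirani's terms can be computed directly; under those assumed distributions the aggregation effect comes out negative for both algorithms, while the decomposition still sums to the mean error. A small sketch:

```python
import numpy as np

p_true = np.array([0.7, 0.2, 0.1])                 # P_D(y = c_k | x)
pred_dists = {"I1": np.array([0.1, 0.8, 0.1]),     # P_S(I_1(S)(x) = c_k)
              "I2": np.array([0.4, 0.5, 0.1])}     # P_S(I_2(S)(x) = c_k)

bayes = int(np.argmax(p_true))                     # Bayes class at x
residual = 1 - p_true[bayes]                       # sigma_C(x) = 0.3

for name, p_pred in pred_dists.items():
    maj = int(np.argmax(p_pred))                   # majority vote class f_maj(x)
    mean_err = float(np.sum(p_pred * (1 - p_true)))
    bias_t = p_true[bayes] - p_true[maj]           # bias_T(x), Eq. (37.6)
    var_t = mean_err - (1 - p_true[maj])           # var_T(x), Eq. (37.7): aggregation effect
    print(f"{name}: bias_T = {bias_t:.2f}, var_T = {var_t:+.2f}, "
          f"residual + bias_T + var_T = {residual + bias_t + var_t:.2f}")
```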
Domingos (2000) adopts the natural definitions of bias and variance given in the introduction of this section and combines them into a non-additive expression of the form:

$$E_S[\mathrm{Error}(I(S)(x))] = b_1(x)\,\sigma_C(x) + \mathrm{bias}_C(x) + b_2(x)\,\mathrm{var}_C(x),$$

where $b_1$ and $b_2$ are two factors that are functions of the true class distribution and of the distribution of predictions.
Kohavi and Wolpert (1996) have proposed a very different decomposition, closer in spirit to the decomposition of the squared loss. Their decomposition makes use of quadratic functions of the probabilities $P_S(I(S)(x)|x)$ and $P(y|x)$.
Heskes (1998) adopts the natural variance term $\mathrm{var}_C$ and, ignoring the residual error, defines bias as the difference between the mean misclassification error and this variance. As a consequence, his bias can be smaller than the residual error. Breiman (1996a, 2000) has successively proposed two decompositions. In the first one, bias and variance are defined globally instead of locally: bias is the part of the error due to biased points (i.e., points such that $\mathrm{bias}_C(x) = 1$) and variance is defined as the part of the error due to unbiased points.
This multitude of decompositions reflects the complexity of the interaction between bias and variance in classification. Each decomposition has its pros and cons. Notably, we may observe in some cases counterintuitive behavior with respect to what would be observed with the classical decomposition of the squared error (e.g., a negative variance). This makes the choice of a particular decomposition difficult, both in theoretical and in empirical studies. Nevertheless, all decompositions have proven useful for analyzing classification algorithms, each one at least in the context of its introduction.
Bias and variance of class probability estimates
Many classification algorithms work by first computing an estimate $I_c(S)(x)$ of the conditional probability of each class $c$ at $x$ and then deriving their classification model by:

$$I(S)(x) = \arg\max_c I_c(S)(x).$$