37 Bias vs Variance Decomposition for Regression and Classification
Pierre Geurts
Department of Electrical Engineering and Computer Science, University of Liège, Belgium; Postdoctoral Researcher, F.N.R.S., Belgium
Summary. In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean square error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight into how the parameters of a learning algorithm influence bias and variance.
Key words: bias, variance, supervised learning, overfitting
37.1 Introduction
The general problem of supervised learning is often formulated as an optimization problem. An error measure is defined that evaluates the quality of a model, and the goal of learning is to find, in a family of models (the hypothesis space), a model that minimizes this error estimated on the learning sample (or dataset) S. So, at first sight, if no good enough model is found in this family, it should be sufficient to extend the family or to exchange it for a more powerful one in terms of model flexibility. However, we are often interested in a model that generalizes well on unseen data rather than in a model that perfectly predicts the output for the learning sample cases. And, unfortunately, in practice, good results on the learning set do not necessarily imply good generalization performance on unseen data, especially if the "size" of the hypothesis space is large in comparison to the sample size.
Let us use a simple one-dimensional regression problem to explain intuitively why larger hypothesis spaces do not necessarily lead to better models. In this synthetic problem, learning outputs are generated according to $y = f_b(x) + \varepsilon$, where $f_b$ is represented by the dashed curves in Figure 37.1 and $\varepsilon$ is distributed according to a Gaussian $N(0,\sigma)$ distribution. With squared error loss, we will see below that the best possible model for this problem is $f_b$ and its average squared error is $\sigma^2$. Let us consider two extreme situations of a bad model structure choice:
• A too simple model: using a linear model $y = wx + b$ and minimizing the squared error on the learning set, we obtain the estimates given in the left part of Figure 37.1 for two different learning set choices.
Fig. 37.1. Left: a linear model fitted to two learning samples. Right: a neural network fitted to the same samples.
These models are not very good, neither on their learning sets nor in generalization. Whatever the learning set, there will always remain an error due to the fact that the model is too simple with respect to the complexity of $f_b$.
• A too complex model: by using a very complex model like a neural network with two hidden layers of ten neurons each, we get the functions shown in the right part of Figure 37.1 for the same learning sets. This time, the models receive an almost perfect score on the learning set. However, their generalization errors are still not very good, because of two phenomena. First, the learning algorithm is able to match the learning set perfectly, and hence also the noise term. We say in this case that the learning algorithm "overfits" the data. Second, even if there were no noise, there would still remain some error due to the high complexity of the model. Indeed, the learning algorithm has many different models at its disposal and, if the learning set size is relatively small, several of them will realize a perfect match of the learning set. As at most one of them is a perfect image of the best model, any other choice by the learning algorithm will result in suboptimality. (A small simulation sketch illustrating both situations is given after this list.)
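The following minimal sketch reproduces both situations numerically. It is only an illustration under assumed settings: the true function $f_b$, the noise level, the sample size, and the use of a high-degree polynomial as a stand-in for the over-complex neural network are all choices made here, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_b(x):
    # Assumed Bayes model; the chapter only shows it as a dashed curve.
    return np.sin(2 * np.pi * x)

sigma, m = 0.3, 20          # assumed noise level and learning sample size

def draw_sample():
    x = rng.uniform(0, 1, m)
    return x, f_b(x) + rng.normal(0, sigma, m)

x_grid = np.linspace(0, 1, 500)

for name, degree in [("too simple (linear)", 1), ("too complex (degree 10)", 10)]:
    for s in (1, 2):                           # two independent learning samples, as in Fig. 37.1
        x, y = draw_sample()
        coeffs = np.polyfit(x, y, degree)      # least-squares fit of a polynomial model
        train_mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        # Generalization error = residual noise + squared distance to f_b over a dense grid.
        gen_mse = sigma ** 2 + np.mean((f_b(x_grid) - np.polyval(coeffs, x_grid)) ** 2)
        print(f"{name}, sample {s}: train MSE = {train_mse:.3f}, generalization MSE = {gen_mse:.3f}")
```

Typically, the linear fits show a similar, sizeable error on both samples, while the high-degree fits achieve a much smaller training error but a larger and strongly sample-dependent generalization error.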
The main source of error is very different in the two cases. In the first case, the error is essentially independent of the particular learning set and must be attributed to the lack of complexity of the model. This source of error is called bias. In the second case, on the other hand, the error may be attributed to the variability of the model from one learning set to another (which is due on the one hand to overfitting and on the other hand to the sparse nature of the learning set with respect to the complexity of the model). This source of error is called variance. Note that in the first case there is also a dependence of the model on the learning set and thus some variability of the predictions; however, the resulting variance is negligible with respect to the bias. In general, bias and variance both depend on the complexity of the model, but in opposite directions, and thus there must exist an optimal tradeoff between these two sources of error. As a matter of fact, this optimal tradeoff also depends on the smoothness of the best model and on the sample size. An important consequence of this is that, because of variance, we should always be careful not to increase the complexity of the model structure too much with respect to the complexity of the problem and the size of the learning sample.
In the next section, we give a formal additive decomposition of the mean (over all learning set choices) squared error into two terms that represent the bias and the variance effect. Some similar decompositions proposed in the context of 0-1 loss functions are also discussed; they reveal some fundamental differences between the two types of problems, although the concepts of bias and variance remain useful in both. Section 3 discusses procedures to estimate the bias and variance terms for practical problems. In Section 4, we give some experiments and applications of bias/variance decompositions.
37.2 Bias/Variance Decompositions
Let us introduce some notation. A learning sample $S$ is a collection of $m$ input/output pairs $(\langle x_1,y_1\rangle, \ldots, \langle x_m,y_m\rangle)$, each one randomly and independently drawn from a probability distribution $P_D(x,y)$. A learning algorithm $I$ produces a model $I(S)$ from $S$, i.e., a function from the inputs $x$ to the domain of $y$. The error of this model is computed as the expectation:

$$\mathrm{Error}(I(S)) = E_{x,y}[L(y, I(S)(x))],$$

where $L$ is some loss function that measures the discrepancy between its two arguments. Since the learning sample $S$ is randomly drawn from the distribution $D$, the model $I(S)$ and its prediction $I(S)(x)$ at $x$ are also random. Hence, $\mathrm{Error}(I(S))$ is again a random variable, and we are interested in studying the expected value of this error over the set of all learning sets of size $m$, $E_S[\mathrm{Error}(I(S))]$. This error can be decomposed into:

$$E_S[\mathrm{Error}(I(S))] = E_x[E_S[E_{y|x}[L(y, I(S)(x))]]] = E_x[E_S[\mathrm{Error}(I(S)(x))]],$$

where $\mathrm{Error}(I(S)(x))$ denotes the local error at point $x$.
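For intuition, $E_S[\mathrm{Error}(I(S))]$ can be approximated by simulation whenever learning samples can be drawn at will. The following is a minimal sketch only; the data-generating distribution and the choice of a least-squares linear model as the learning algorithm $I$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m = 0.3, 20                         # assumed noise level and learning sample size

def draw_sample(n):
    # Assumed data-generating distribution P_D(x, y).
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

def learn(x, y):
    # I(S): an example learning algorithm (least-squares linear model).
    w, b = np.polyfit(x, y, 1)
    return lambda x_new: w * x_new + b

x_test, y_test = draw_sample(10_000)       # large test set approximates E_{x,y}[.]
errors = [np.mean((y_test - learn(*draw_sample(m))(x_test)) ** 2)
          for _ in range(200)]             # 200 learning samples approximate E_S[.]

print(f"estimated E_S[Error(I(S))] ~= {np.mean(errors):.3f}")
```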
Bias/variance decompositions usually try to decompose this error into three terms: the residual or minimal attainable error, the systematic error, and the effect of the variance. The exact decomposition depends on the loss function $L$. The next two subsections are devoted to the most common loss functions, i.e., the squared loss for regression problems and the 0-1 loss for classification problems. Notice, however, that these are not the only plausible loss functions, and several authors have studied bias/variance decompositions for other loss functions (Wolpert, 1997; Hansen, 2000). Actually, several of the decompositions for 0-1 loss presented below are derived as special cases of more general bias/variance decompositions (Tibshirani, 1996; Wolpert, 1997; Heskes, 1998; Domingos, 1996; James, 2003). The interested reader may refer to these references for more details.
37.2.1 Bias/Variance Decomposition of the Squared Loss
When the output $y$ is numerical, the usual loss function is the squared loss $L_2(y_1,y_2) = (y_1 - y_2)^2$. With this loss function, it is easy to show that the best possible model is $f_b(x) = E_{y|x}[y]$, which takes the expectation of the target $y$ at each point $x$. The best model according to a given loss function is often called the Bayes model in statistical pattern recognition. Introducing this model in the mean local error, we get with some elementary calculations:

$$E_S[\mathrm{Error}(I(S)(x))] = E_{y|x}[(y - f_b(x))^2] + E_S[(f_b(x) - I(S)(x))^2]. \quad (37.1)$$
Symmetrically to the Bayes model, let us define the average model, $f_{avg}(x) = E_S[I(S)(x)]$, which outputs the average prediction over all learning sets. Introducing this model in the second term of Equation (37.1), we obtain:
$$E_S[(f_b(x) - I(S)(x))^2] = (f_b(x) - f_{avg}(x))^2 + E_S[(I(S)(x) - f_{avg}(x))^2].$$
In summary, we have the following well-known decomposition of the mean square error at a point $x$:

$$E_S[\mathrm{Error}(I(S)(x))] = \sigma_R^2(x) + \mathrm{bias}_R^2(x) + \mathrm{var}_R(x)$$

by defining:

$$\sigma_R^2(x) = E_{y|x}[(y - f_b(x))^2], \quad (37.2)$$
$$\mathrm{bias}_R^2(x) = (f_b(x) - f_{avg}(x))^2, \quad (37.3)$$
$$\mathrm{var}_R(x) = E_S[(I(S)(x) - f_{avg}(x))^2]. \quad (37.4)$$

This error decomposition is well known in estimation theory and was introduced into the machine learning community by Geman et al. (1995).
The residual squared error, $\sigma_R^2(x)$, is the error obtained by the best possible model. It provides a theoretical lower bound that is independent of the learning algorithm. The suboptimality of a particular learning algorithm is thus composed of two terms: the (squared) bias measures the discrepancy between the best and the average model, i.e., how good the estimate is on average; the variance measures the variability of the predictions with respect to the randomness of the learning set.
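On a synthetic problem, where arbitrarily many learning samples can be drawn, the three terms of this decomposition can be estimated at a given point by straightforward simulation. Here is a minimal sketch; the true function, noise level, sample size, and the linear learner are assumptions chosen for illustration, not part of the chapter's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_b(x):                        # assumed Bayes model
    return np.sin(2 * np.pi * x)

sigma, m, n_repeats = 0.3, 20, 2000
x0 = 0.25                          # point at which the local error is decomposed

preds = np.empty(n_repeats)
for i in range(n_repeats):
    x = rng.uniform(0, 1, m)
    y = f_b(x) + rng.normal(0, sigma, m)
    w, b = np.polyfit(x, y, 1)     # I(S): least-squares linear model
    preds[i] = w * x0 + b          # I(S)(x0)

f_avg = preds.mean()                          # f_avg(x0), the average model at x0
residual = sigma ** 2                         # sigma_R^2(x0), known here by construction
bias2 = (f_b(x0) - f_avg) ** 2                # bias_R^2(x0), Eq. (37.3)
variance = preds.var()                        # var_R(x0), Eq. (37.4)

print(f"residual = {residual:.4f}, bias^2 = {bias2:.4f}, variance = {variance:.4f}")
print(f"their sum = {residual + bias2 + variance:.4f}  (approximates E_S[Error(I(S)(x0))])")
```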
Fig. 37.2. Top: the average models; bottom: residual error, bias, and variance.
To explain why these two terms are indeed the consequence of the two phenomena discussed in the introduction of this chapter, let us come back to our simple regression problem. The average model is depicted in the top of Figure 37.2 for the two cases of bad model choice; residual error, bias, and variance at each position $x$ are drawn in the bottom of the same figure. The residual error is entirely specified by the problem and the loss criterion, and hence is independent of the algorithm and of the learning set used. When the model is too simple, the average model is far from the Bayes model almost everywhere, and thus the bias is large. On the other hand, the variance is small, as the model does not match the learning set very strongly and thus the prediction at each point does not vary much from one learning set to another. Bias is thus the dominant term of the error. When the model is too complex, the distribution of predictions matches very strongly the distribution of outputs at each point. The average prediction is thus close to the Bayes model and the bias is small. However, because of the noise and the small learning set size, predictions are highly variable at each point. In this case, variance is the dominant term of the error.
37.2.2 Bias/Variance Decompositions of the 0-1 Loss
The usual loss function for classification problems (i.e., for a discrete target variable) is the 0-1 loss function, $L_c(y_1,y_2) = 1$ if $y_1 \neq y_2$ and $0$ otherwise, which yields the mean misclassification error at $x$:

$$E_S[\mathrm{Error}(I(S)(x))] = E_S[E_{y|x}[L_c(y, I(S)(x))]] = P_{D,S}(y \neq I(S)(x)).$$

The Bayes model in this case is the model that outputs the most probable class at $x$, i.e., $f_b(x) = \arg\max_c P_D(y = c|x)$. The corresponding residual error is:

$$\sigma_C(x) = 1 - P_D(y = f_b(x)|x).$$
By analogy with the decomposition of the squared error, it is possible to define what we call "natural" bias and variance terms for the 0-1 loss function. First, by symmetry with the Bayes model and by analogy with the squared loss decomposition, the equivalent in classification of the average model is the majority vote classifier defined by:

$$f_{maj}(x) = \arg\max_c P_S(I(S)(x) = c),$$

which outputs at each point the class receiving the majority of votes over the distribution of classifiers induced from the distribution of learning sets. In the squared-loss case, the (squared) bias is the error of the average model with respect to the best possible model; the analogous definition yields here:

$$\mathrm{bias}_C(x) = L_c(f_b(x), f_{maj}(x)).$$

So, biased points are those for which the majority vote classifier disagrees with the Bayes classifier. On the other hand, variance can naturally be defined as:

$$\mathrm{var}_C(x) = E_S[L_c(I(S)(x), f_{maj}(x))] = P_S(I(S)(x) \neq f_{maj}(x)),$$
which is the average error of the models induced from random learning samples $S$ with respect to the majority vote classifier. This definition is indeed a measure of the variability of the predictions at $x$: when $\mathrm{var}_C(x) = 0$, every model outputs the same class whatever the learning set from which it is induced, and $\mathrm{var}_C(x)$ is maximal when the probability of the class given by the majority vote classifier is equal to $1/z$ (with $z$ the number of classes), which corresponds to the most uncertain distribution of predictions.
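At a single point, these natural terms only require the empirical distribution of predictions across models trained on independent learning samples. A minimal sketch (the class labels, the prediction counts, and the function name are all hypothetical):

```python
from collections import Counter

def natural_bias_variance(predictions, bayes_class):
    """Natural 0-1 bias and variance at one point x.

    predictions: classes predicted at x by models trained on independently
                 drawn learning samples S.
    bayes_class: the Bayes (most probable) class at x.
    """
    counts = Counter(predictions)
    f_maj = counts.most_common(1)[0][0]           # majority vote class f_maj(x)
    bias_c = 0 if f_maj == bayes_class else 1     # bias_C(x) = L_c(f_b(x), f_maj(x))
    var_c = 1 - counts[f_maj] / len(predictions)  # var_C(x) = P_S(I(S)(x) != f_maj(x))
    return bias_c, var_c

# Hypothetical predictions of models trained on 10 different learning samples:
preds = ["c2"] * 6 + ["c1"] * 3 + ["c3"]
print(natural_bias_variance(preds, bayes_class="c1"))   # -> (1, 0.4)
```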
Unfortunately, these natural bias and variance terms do not sum up with the residual error to give the local misclassification error. In other words, in general,

$$E_S[\mathrm{Error}(I(S)(x))] \neq \sigma_C(x) + \mathrm{bias}_C(x) + \mathrm{var}_C(x).$$
Let us illustrate on a simple example how increased variance may decrease the average classification error in some situations. Let us suppose that we have a three-class problem such that the
true class probability distribution is given by $(P_D(y = c_1|x), P_D(y = c_2|x), P_D(y = c_3|x)) = (0.7, 0.2, 0.1)$. The best possible prediction at $x$ is thus the class $c_1$, and the corresponding minimal error is 0.3. Let us suppose that we have two learning algorithms $I_1$ and $I_2$ and that the distributions of predictions of the models built by these algorithms are given by:

$$(P_S(I_1(S)(x) = c_1), P_S(I_1(S)(x) = c_2), P_S(I_1(S)(x) = c_3)) = (0.1, 0.8, 0.1),$$
$$(P_S(I_2(S)(x) = c_1), P_S(I_2(S)(x) = c_2), P_S(I_2(S)(x) = c_3)) = (0.4, 0.5, 0.1).$$
So, we observe that both algorithms produce models that will most probably decide class $c_2$ (with probability 0.8 and 0.5, respectively). Thus, the two methods are biased ($\mathrm{bias}_C(x) = 1$). On the other hand, the variances of the two methods are obtained in the following way:

$$\mathrm{var}_C^1(x) = 1 - 0.8 = 0.2 \quad \text{and} \quad \mathrm{var}_C^2(x) = 1 - 0.5 = 0.5,$$
and their mean misclassification errors are found to be

$$E_S[\mathrm{Error}(I_1(S)(x))] = 0.76 \quad \text{and} \quad E_S[\mathrm{Error}(I_2(S)(x))] = 0.61.$$
Thus, between these two methods with identical bias, it is the one having the larger variance that has the smaller average error rate.
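These two numbers follow from averaging the 0-1 loss over both the class distribution and the distribution of predictions (which are independent given $x$), i.e., $E_S[\mathrm{Error}(I(S)(x))] = \sum_c P_S(I(S)(x) = c)\,(1 - P_D(y = c|x))$. A quick check in code of the arithmetic above:

```python
import numpy as np

p_true = np.array([0.7, 0.2, 0.1])      # P_D(y = c_k | x)
p_pred_1 = np.array([0.1, 0.8, 0.1])    # P_S(I_1(S)(x) = c_k)
p_pred_2 = np.array([0.4, 0.5, 0.1])    # P_S(I_2(S)(x) = c_k)

def mean_error(p_pred, p_true):
    # sum_k P_S(pred = c_k) * (1 - P_D(y = c_k | x))
    return float(np.sum(p_pred * (1 - p_true)))

print(f"{mean_error(p_pred_1, p_true):.2f}")   # 0.76
print(f"{mean_error(p_pred_2, p_true):.2f}")   # 0.61
```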
It is easy to see that this happens here because of the existence of a bias. Indeed, with 0-1 loss, an algorithm that has small variance and high bias is an algorithm that systematically (i.e., whatever the learning sample) produces a wrong answer, whereas an algorithm that has a high bias but also a high variance is wrong only for a majority of learning samples, but not necessarily systematically. So, the latter may be better than the former. In other words, with 0-1 loss, more variance can be beneficial because it can lead the system closer to the Bayes classifier.
As a result of this counter-intuitive interaction between the bias and variance terms under 0-1 loss, several authors have proposed their own decompositions. We briefly describe below the most representative of them. For a more detailed discussion of these decompositions, see for example (Geurts, 2002) or (James, 2003). In the following sections, we also present a very different approach to studying bias and variance for the 0-1 loss, due to Friedman (1997), which relates the mean error to the squared bias and variance terms of the class probability estimates.
Some decompositions
Tibshirani (1996) defines the bias as the difference between the probability of the Bayes class and the probability of the majority vote class:

$$\mathrm{bias}_T(x) = P_D(y = f_b(x)|x) - P_D(y = f_{maj}(x)|x). \quad (37.6)$$

Thus, the sum of this bias and the residual error is actually the misclassification error of the majority vote classifier:

$$\sigma_C(x) + \mathrm{bias}_T(x) = 1 - P_D(y = f_{maj}(x)|x) = \mathrm{Error}(f_{maj}(x)).$$
This is exactly the part of the error that would remain if we could completely cancel the variability of the predictions. The variance is then defined as the difference between the mean misclassification error and the error of the majority vote classifier:

$$\mathrm{var}_T(x) = E_S[\mathrm{Error}(I(S)(x))] - \mathrm{Error}(f_{maj}(x)). \quad (37.7)$$

Tibshirani (1996) calls this variance term the aggregation effect: it is the variation of error that results from the aggregation of the predictions over all learning sets. Note that this variance term is not necessarily positive. From different considerations, James (2003) has proposed exactly the same decomposition; to distinguish (37.6) and (37.7) from the natural bias and variance terms, he calls them the systematic effect and the variance effect, respectively. Dietterich and Kong (1995) have proposed a decomposition that applies only to the noise-free case, but that reduces exactly to Tibshirani's decomposition in this case.
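Continuing the hypothetical three-class example above, Tibshirani's terms can be computed directly; under those assumed distributions the aggregation effect comes out negative for both algorithms, while the decomposition still sums to the mean error. A small sketch:

```python
import numpy as np

p_true = np.array([0.7, 0.2, 0.1])                 # P_D(y = c_k | x)
pred_dists = {"I1": np.array([0.1, 0.8, 0.1]),     # P_S(I_1(S)(x) = c_k)
              "I2": np.array([0.4, 0.5, 0.1])}     # P_S(I_2(S)(x) = c_k)

bayes = int(np.argmax(p_true))                     # Bayes class at x
residual = 1 - p_true[bayes]                       # sigma_C(x) = 0.3

for name, p_pred in pred_dists.items():
    maj = int(np.argmax(p_pred))                   # majority vote class f_maj(x)
    mean_err = float(np.sum(p_pred * (1 - p_true)))
    bias_t = p_true[bayes] - p_true[maj]           # bias_T(x), Eq. (37.6)
    var_t = mean_err - (1 - p_true[maj])           # var_T(x), Eq. (37.7): aggregation effect
    print(f"{name}: bias_T = {bias_t:.2f}, var_T = {var_t:+.2f}, "
          f"residual + bias_T + var_T = {residual + bias_t + var_t:.2f}")
```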
Domingos (2000) adopts the natural definitions of bias and variance given in the introduction of this section and combines them into a non-additive expression of the form:

$$E_S[\mathrm{Error}(I(S)(x))] = b_1(x)\,\sigma_C(x) + \mathrm{bias}_C(x) + b_2(x)\,\mathrm{var}_C(x),$$

where $b_1$ and $b_2$ are two factors that are functions of the true class distribution and of the distribution of predictions.
Kohavi and Wolpert (1996) have proposed a very different decomposition, closer in spirit to the decomposition of the squared loss. Their decomposition makes use of quadratic functions of the probabilities $P_S(I(S)(x)|x)$ and $P(y|x)$.
Heskes (1998) adopts the natural variance term $\mathrm{var}_C$ and, ignoring the residual error, defines bias as the difference between the mean misclassification error and this variance. As a consequence, his bias can be smaller than the residual error. Breiman (1996a, 2000) has successively proposed two decompositions. In the first one, bias and variance are defined globally instead of locally: bias is the part of the error due to biased points (i.e., points such that $\mathrm{bias}_C(x) = 1$) and variance is defined as the part of the error due to unbiased points.
This multitude of decompositions reflects the complexity of the interaction between bias and variance in classification. Each decomposition has its pros and cons. Notably, we may observe in some cases counterintuitive behavior with respect to what would be observed with the classical decomposition of the squared error (e.g., a negative variance). This makes the choice of a particular decomposition difficult, both in theoretical and in empirical studies. Nevertheless, all decompositions have proven useful for analyzing classification algorithms, each one at least in the context of its introduction.
Bias and variance of class probability estimates
Many classification algorithms work by first computing an estimate $I_c(S)(x)$ of the conditional probability of each class $c$ at $x$ and then deriving their classification model by:

$$I(S)(x) = \arg\max_c I_c(S)(x).$$