E_new = E_train + generalization error

Part of the document Easy and quick guide to statistics of machine learning (pages 59–62)

We have already discussed the fact that E_train cannot be used in estimating E_new. In fact, it usually holds that

Ē_train < Ē_new. (5.9)

Put in words, this means that on average, a method usually performs worse on new, unseen data than on training data. A method's ability to perform well on unseen data after being trained on training data can be understood as the method's ability to generalize from training data. The difference between E_new and E_train is accordingly called the generalization error², defined as

generalization error ≜ E_new − E_train. (5.10)

The generalization error thereby gives a connection between the performance on training data and the performance 'in production' on new, previously unseen data. It is therefore interesting to understand how big (or small) the generalization error is.

Generalization error and model complexity

The size of the generalization error depends on the method and the problem. Concerning the method, one can typically say that the more the model has adapted to the training data, the larger the generalization error. A theoretical study of how much a model adapts to training data can be done using the so-called VC dimension, eventually leading to probabilistic bounds on the generalization error. Unfortunately those bounds are usually rather conservative, and we will not pursue that formal approach any further.³ Instead, we only use the vague term model complexity, by which we mean the ability of a method to adapt to complicated patterns in the training data, and reason about what we see in practice. A model with high complexity (such as a neural network) can describe very complicated relationships, whereas a model with low complexity (such as LDA) is less flexible in what functions it can describe. For parametric methods, the model complexity is related to the number of parameters that are trained. Flexible non-parametric methods (such as trees with many leaf nodes or k-NN with small k) have higher model complexity than parametric methods with few parameters, etc. Techniques such as regularization, early stopping and dropout (for neural networks) effectively decrease the model complexity.
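The effect of model complexity on training and new-data error can be seen in a few lines of simulation. The sketch below (our own illustration, not from the book; NumPy is assumed available) uses polynomial regression on a toy problem, with the polynomial degree as the complexity measure — following the rough proxy above that complexity relates to the number of trained parameters (degree + 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Toy regression data: y = sin(2*pi*x) plus irreducible noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_tr, y_tr = simulate(15)        # a small training set
x_te, y_te = simulate(10_000)    # a large test set, numerically approximating E_new

degrees = [1, 3, 9, 14]          # complexity proxy: number of parameters = degree + 1
e_train, e_new = [], []
for d in degrees:
    coef = np.polyfit(x_tr, y_tr, d)   # least-squares polynomial fit
    e_train.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    e_new.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    print(f"degree {d:2d}: E_train = {e_train[-1]:.3f}, E_new = {e_new[-1]:.3f}")
```

As complexity grows, E_train keeps shrinking (degree 14 nearly interpolates the 15 training points), while E_new first improves and then deteriorates sharply — the U-shape discussed next.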

²Sometimes E_new itself is called the generalization error; not in this text. In our terminology we do not distinguish between the generalization error for a model trained on a certain training data set and its training-data-averaged counterpart.

³If you are interested, a good book is Abu-Mostafa, Magdon-Ismail, and Lin (2012).


Figure 5.3: Behavior of Ē_train and Ē_new for many supervised machine learning methods, as a function of model complexity. We have not made a formal definition of complexity, but a rough proxy is the number of parameters that are learned from the data. The difference between the two curves is the generalization error. In general, one can expect Ē_train to decrease as the model complexity increases, whereas Ē_new typically has a U-shape. If the model is so complex that Ē_new is larger than it had been with a less complex model, the term overfit is commonly used.

Somewhat less common is the term underfit, used for the opposite situation. The level of model complexity which gives the minimum Ē_new (at the dotted line) would in a consistent terminology perhaps be called a balanced fit. A method with a balanced fit is usually desirable, but often hard to find, since we know neither Ē_new nor E_new in practice.

Typically, higher model complexity implies larger generalization error. Furthermore, Ē_train usually decreases as the model complexity increases, whereas Ē_new attains a minimum for some intermediate model complexity value: too small and too high model complexity both raise Ē_new. This is illustrated in Figure 5.3. The region where Ē_new is larger than its minimum due to too high model complexity is commonly referred to as overfit. The other region (where Ē_new is larger than its minimum due to too small model complexity) is sometimes referred to as underfit. In a consistent terminology, the point where Ē_new attains its minimum could be referred to as a balanced fit. Since the goal is to minimize Ē_new, we are interested in finding this point. We also illustrate this in Example 5.1.

Remark 5.3 This and the next section discuss the usual behavior of Ē_new, Ē_train and the generalization error. We use the term 'usually' because there are so many supervised machine learning methods and problems that it is almost impossible to make any claim that is always true for all possible situations.

Pathological counter-examples may exist. One should also keep in mind that claims about Ē_train and Ē_new are about the average behavior, which hopefully is made clear in Example 5.1.

5.3 Understanding E_new

Example 5.1: E_train and E_new in a simulated example

We consider a simulated binary classification example with two-dimensional inputs x. In contrast to all real-world machine learning problems, in a simulated problem like this we can actually compute E_new, since we know p(x, y) (otherwise we could not make the simulation).

In this example, p(x) is a uniform distribution on the square [−1, 1]², and p(y | x) is defined as follows: all points above the dotted curve in the figure below are green with probability 0.9, and points below the curve are red with probability 0.9. (The dotted curve is also the decision boundary for Bayes' classifier. Why? And what would E_new be for Bayes' classifier?)

[Figure: the square [−1, 1]² in the (x1, x2) plane, with the dotted curve separating the predominantly green region from the predominantly red region.]
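The question about Bayes' classifier can be checked numerically. The sketch below (our own illustration; the exact shape of the dotted curve is not specified in the text, so an arbitrary stand-in curve is assumed — the answer does not depend on its shape) simulates p(x, y) and evaluates the classifier that predicts the most probable color at each x:

```python
import numpy as np

rng = np.random.default_rng(1)

def above_curve(x):
    # Hypothetical stand-in for the dotted curve; any curve gives the same Bayes error.
    return x[:, 1] > 0.3 * np.sin(3 * x[:, 0])

n = 200_000
X = rng.uniform(-1, 1, size=(n, 2))               # p(x): uniform on [-1, 1]^2
green = above_curve(X)
# p(y | x): the 'likely' color occurs with probability 0.9, the other with 0.1.
y = np.where(rng.random(n) < 0.9, green, ~green)  # True = green, False = red

# Bayes' classifier predicts the most probable color, i.e. it follows the curve.
y_hat = above_curve(X)
print(f"estimated E_new for Bayes' classifier: {np.mean(y_hat != y):.3f}")
```

The estimate comes out close to 0.1: since every point is mislabeled with probability 0.1 regardless of what the curve looks like, no classifier can do better on average.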

We generate n = 100 samples as training data, and learn three classifiers: a logistic regression classifier, a QDA classifier and a k-NN classifier with k = 2. If we rank these methods in order of model complexity, logistic regression is simpler than QDA (logistic regression is a linear classifier, whereas QDA is more general), and QDA is simpler than k-NN (since k-NN is non-parametric and can have rather complicated decision boundaries). We plot their decision boundaries, together with the training data:

[Figure: decision boundaries and training data for the three classifiers, plotted on the (x1, x2) square [−1, 1]². Left to right: logistic regression, QDA, k-NN with k = 2.]

For each of these three classifiers, we can compute E_train by simply counting the fraction of training data points that are on the wrong side of the decision boundary. From left to right, we get E_train = 0.17, 0.16, 0.11. Since we are in a simulated example, we can also access E_new (or rather estimate it numerically by simulating a lot of test data), and from left to right we get E_new = 0.22, 0.15, 0.24. This pattern resembles Figure 5.3, except for the fact that E_new is smaller than E_train for QDA. Is this unexpected? Not really: what we have discussed in the main text is the average Ē_new and Ē_train, not the situation with E_new and E_train for one particular set of training data. We therefore repeat this experiment 100 times, and compute the average Ē_new and Ē_train over those 100 experiments:

            Logistic regression    QDA     k-NN with k = 2
Ē_train     0.17                   0.14    0.10
Ē_new       0.18                   0.15    0.19

This follows Figure 5.3 well: the generalization error (the difference between Ē_new and Ē_train) is positive and increases with model complexity, Ē_train decreases with model complexity, and Ē_new has its minimum for QDA. This suggests that k-NN with k = 2 suffers from overfitting for this problem, whereas logistic regression is a case of underfitting.
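The averaging procedure behind the table above can be sketched for the k-NN classifier, which is easy to implement from scratch. This is our own minimal reproduction (NumPy assumed available, and again with an assumed stand-in for the unspecified dotted curve), repeating the experiment with fresh training data and approximating E_new with a large fresh test set each time:

```python
import numpy as np

rng = np.random.default_rng(0)

def above_curve(x):
    # Hypothetical stand-in for the book's dotted curve (its shape is not given).
    return x[:, 1] > 0.3 * np.sin(3 * x[:, 0])

def simulate(n):
    """Draw n samples from p(x, y): uniform x, correct color with probability 0.9."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = np.where(rng.random(n) < 0.9, above_curve(X), ~above_curve(X))
    return X, y.astype(int)

def knn_predict(X_tr, y_tr, X_q, k):
    """Majority-vote k-NN; ties (possible for even k) are resolved as class 1."""
    d2 = ((X_q[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest neighbors
    return (y_tr[nn].mean(axis=1) >= 0.5).astype(int)

errs_train, errs_new = [], []
for _ in range(100):                            # 100 independent experiments
    X_tr, y_tr = simulate(100)                  # n = 100 training points, as in the book
    X_te, y_te = simulate(2_000)                # fresh test data, approximating E_new
    errs_train.append(np.mean(knn_predict(X_tr, y_tr, X_tr, 2) != y_tr))
    errs_new.append(np.mean(knn_predict(X_tr, y_tr, X_te, 2) != y_te))

print(f"E_bar_train = {np.mean(errs_train):.2f}")
print(f"E_bar_new   = {np.mean(errs_new):.2f}")
```

With this setup the averages land near the book's 0.10 and 0.19 for k = 2: the training error is deceptively low (each training point is its own nearest neighbor), while the error on fresh data is roughly twice as large — a positive generalization error.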


Figure 5.4: Typical relationship between Ē_new, Ē_train and the number of data points n in the training data set. The generalization error (the difference between Ē_new and Ē_train) decreases as n grows, at the same time as Ē_train increases. Typically, a more complex model (right panel) will for large enough n attain a smaller Ē_new than a simpler model (left panel) would on the same problem (the axes of the figures are comparable). However, the generalization error is typically larger for a more complex model, in particular when there is little training data.

Generalization error and size n of training data

The previous section and Figure 5.3 are concerned with the relationship between Ē_new, Ē_train, the generalization error (their difference) and the model complexity. Yet another important aspect is the size of the training data set, n. Intuitively, one may expect that the more training data we have, the better the possibilities to learn how to generalize. Once again we do not make a formal derivation, but in general we can expect that the more training data, the smaller the generalization error. On the other hand, Ē_train typically increases as n increases, since most models are not able to fit all training data points perfectly when there are many of them. A typical behavior of Ē_train and Ē_new is sketched in Figure 5.4.
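The curves of Figure 5.4 can be traced numerically. The sketch below (our own illustration on a toy regression problem, NumPy assumed available) fixes one model — a cubic polynomial — and computes Ē_train and Ē_new for growing n, averaging over 200 independent training sets to approximate the bars:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Toy regression data: y = sin(2*pi*x) plus irreducible noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_te, y_te = simulate(10_000)    # large test set, numerically approximating E_new

sizes = [5, 20, 100, 1000]
bar_train, bar_new = [], []
for n in sizes:
    tr, te = [], []
    for _ in range(200):          # average over fresh training sets -> the bars
        x_tr, y_tr = simulate(n)
        coef = np.polyfit(x_tr, y_tr, 3)   # fixed model: cubic polynomial
        tr.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
        te.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    bar_train.append(float(np.mean(tr)))
    bar_new.append(float(np.mean(te)))
    print(f"n = {n:4d}: E_bar_train = {bar_train[-1]:.3f}, E_bar_new = {bar_new[-1]:.3f}")
```

As n grows, Ē_train rises toward the irreducible error (the model can no longer chase every noisy point) while Ē_new falls toward it from above, so their gap — the generalization error — shrinks, just as Figure 5.4 depicts.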
