3.3 Linear and quadratic discriminant analysis (LDA & QDA)
3.3.2 Using LDA and QDA in practice
We have derived LDA and QDA by studying Bayes' theorem (3.17). We are ultimately interested in the left hand side of (3.17), and we got there by making an assumption about the right hand side, namely that $p(x \mid y)$ has a Gaussian distribution. In most practical cases that assumption does not hold in reality (or, at least, it is hard for us to verify whether it holds or not), but LDA as well as QDA turn out to be useful classifiers even when that assumption does not hold.
How do we go about it in practice, if we want to learn an LDA or QDA classifier from training data $\{x_i, y_i\}_{i=1}^{n}$ (without knowing anything about the real distribution $p(x \mid y)$) and use it to make a prediction?
Learning the parameters
First, the parameters $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}$ (for LDA) or $\hat{\Sigma}_k$ (for QDA) have, for each $k = 1, \dots, K$, to be learned, or estimated, from the training data. Perhaps the most straightforward parameter to learn is $\hat{\pi}_k$, the relative occurrence of class $k$ in the training data,
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad (3.20\text{a})$$
where $n_k$ is the number of training data samples in class $k$. Consequently, all $n_k$ must sum to $n$, and thereby $\sum_k \hat{\pi}_k = 1$. Further, the mean vector $\mu_k$ of each class is learned as
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i, \qquad (3.20\text{b})$$
the empirical mean among all training samples of class $k$. For LDA, the common covariance matrix $\Sigma$ for all classes is usually learned as
$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\mathsf{T}}, \qquad (3.20\text{c})$$
which can be shown to be an unbiased estimate of the covariance matrix$^5$. For QDA, one covariance matrix $\Sigma_k$ has to be learned for each class $k = 1, \dots, K$, usually as
$$\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\mathsf{T}}, \qquad (3.20\text{d})$$
which similarly can be shown to be an unbiased estimate.
Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood idea, in contrast to linear and logistic regression. Furthermore, learning LDA and QDA amounts to inserting the training data into the closed-form expressions (3.20), similar to linear regression (the normal equations), but different from logistic regression (which requires numerical optimization).
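To make the closed-form learning step concrete, here is a minimal sketch of how the estimates (3.20a)–(3.20d) could be computed with NumPy. The function name `learn_lda_qda` and the array arguments `X` and `y` are our own illustrative choices, not notation from the text.

```python
import numpy as np

def learn_lda_qda(X, y):
    """Estimate the LDA/QDA parameters (3.20a)-(3.20d) from training data.

    X is an (n, p) array of inputs and y an (n,) array of class labels.
    Returns the class labels, class priors, class means, the shared LDA
    covariance and the per-class QDA covariances.
    """
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)

    pi_hat = np.empty(K)              # (3.20a): relative class frequencies
    mu_hat = np.empty((K, p))         # (3.20b): class means
    Sigma_lda = np.zeros((p, p))      # (3.20c): shared covariance (LDA)
    Sigma_qda = np.empty((K, p, p))   # (3.20d): one covariance per class (QDA)

    for j, k in enumerate(classes):
        Xk = X[y == k]                          # training samples with y_i = k
        nk = Xk.shape[0]
        pi_hat[j] = nk / n
        mu_hat[j] = Xk.mean(axis=0)
        centered = Xk - mu_hat[j]
        Sigma_lda += centered.T @ centered      # accumulate scatter over classes
        Sigma_qda[j] = centered.T @ centered / (nk - 1)

    Sigma_lda /= n - K
    return classes, pi_hat, mu_hat, Sigma_lda, Sigma_qda
```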
Making predictions
Once we have learned the parameters $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}$ or $\hat{\Sigma}_k$ for all classes $k = 1, \dots, K$, we have a model ((3.18) and (3.19)) for $p(y \mid x)$ that we can use to make predictions for a test input $x_\star$. As for logistic regression, we turn $p(y \mid x_\star)$ into an actual prediction $\hat{y}_\star$ by taking the most probable class as the prediction,
$$\hat{y}_\star = \arg\max_k \; p(y = k \mid x_\star). \qquad (3.21)$$
We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.
Algorithm 2: Linear Discriminant Analysis, LDA
Data: Training data $\{x_i, y_i\}_{i=1}^{n}$ (with output classes $k = 1, \dots, K$) and test input $x_\star$
Result: Predicted test output $\hat{y}_\star$
Learn
1 for $k = 1, \dots, K$ do
2   Compute $\hat{\pi}_k$ (3.20a) and $\hat{\mu}_k$ (3.20b)
3 end
4 Compute $\hat{\Sigma}$ (3.20c)
Predict
5 for $k = 1, \dots, K$ do
6   Compute $p(y = k \mid x_\star)$ (3.18)
7 end
8 Find the largest $p(y = k \mid x_\star)$ and set $\hat{y}_\star$ to that $k$
$^5$This means that if we estimate $\hat{\Sigma}$ like this for new training data over and over again, the average would be the true covariance matrix of $p(x)$.
Algorithm 3: Quadratic Discriminant Analysis, QDA
Data: Training data $\{x_i, y_i\}_{i=1}^{n}$ (with output classes $k = 1, \dots, K$) and test input $x_\star$
Result: Predicted test output $\hat{y}_\star$
Learn
1 for $k = 1, \dots, K$ do
2   Compute $\hat{\pi}_k$ (3.20a), $\hat{\mu}_k$ (3.20b) and $\hat{\Sigma}_k$ (3.20d)
3 end
Predict
4 for $k = 1, \dots, K$ do
5   Compute $p(y = k \mid x_\star)$ (3.19)
6 end
7 Find the largest $p(y = k \mid x_\star)$ and set $\hat{y}_\star$ to that $k$
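As a complement to the pseudocode, the following is a small Python sketch of the predict steps in Algorithms 2 and 3. It assumes parameters in the format returned by the hypothetical `learn_lda_qda` helper above, evaluates the Gaussian densities with `scipy.stats.multivariate_normal`, and applies Bayes' theorem as in (3.18) and (3.19) before taking the arg max (3.21).

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(x_star, classes, pi_hat, mu_hat, Sigma_lda, Sigma_qda, use_qda=False):
    """Predict the class of a test input x_star as in (3.21).

    Computes pi_k * N(x_star | mu_k, Sigma) for each class, normalizes with
    Bayes' theorem ((3.18) for LDA, (3.19) for QDA) and returns the most
    probable class together with the posterior probabilities.
    """
    K = len(classes)
    unnormalized = np.empty(K)
    for j in range(K):
        Sigma = Sigma_qda[j] if use_qda else Sigma_lda   # QDA: class-specific covariance
        unnormalized[j] = pi_hat[j] * multivariate_normal.pdf(
            x_star, mean=mu_hat[j], cov=Sigma)
    p_y_given_x = unnormalized / unnormalized.sum()      # posterior p(y = k | x_star)
    return classes[np.argmax(p_y_given_x)], p_y_given_x
```

In a careful implementation one would work with log densities instead, which is exactly what the discriminant functions derived below amount to.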
[Figure 3.5 (three panels): the Gaussian models $p(x \mid y)$ for $y = 1, 2, 3$ with means $\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3$ and standard deviation $\hat{\sigma}$ (upper left), the class priors $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3$ (upper right), and the posterior $p(y \mid x)$ obtained via Bayes' theorem, $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{k=1}^{K} p(x \mid k)\, p(k)}$ (bottom).]
Figure 3.5: An illustration of LDA for $K = 3$ classes, with dimension $p = 1$ of the input $x$. The upper left panel shows the Gaussian model of $p(x \mid k)$, parameterized by $\hat{\mu}_k$ and $\hat{\Sigma}$. The parameters $\hat{\mu}_k$ and $\hat{\Sigma}$, as well as $\hat{\pi}_k$, are learned from training data, not shown in the figure. (Since $p = 1$, we only have a scalar variance $\hat{\sigma}^2$, instead of a covariance matrix $\hat{\Sigma}$.) The upper right panel shows $\hat{\pi}_k$, an approximation of $p(k)$. These are used in Bayes' theorem to compute $p(k \mid x)$, shown in the bottom panel. We take the final prediction as the class which is modeled to have the highest probability, which corresponds to the topmost solid colored line in the bottom plot (e.g., the prediction for $x_\star = 0.7$ would be $\hat{y} = 2$ (green)). The decision boundaries (vertical dotted lines in the bottom plot) are hence found where the solid colored lines intersect.
[Figure 3.6 (three panels): the Gaussian models $p(x \mid y)$ for $y = 1, 2, 3$ with means $\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3$ and class-specific standard deviations $\hat{\sigma}_1, \hat{\sigma}_2, \hat{\sigma}_3$ (upper left), the class priors $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3$ (upper right), and the posterior $p(y \mid x)$ from Bayes' theorem (bottom).]
Figure 3.6: An illustration of QDA for $K = 3$ classes, in the same fashion as Figure 3.5. However, in contrast to LDA in Figure 3.5, the learned variance $\hat{\Sigma}_k$ of $p(x \mid k)$ is different for different $k$ (upper left panel). For this reason the resulting decision boundaries (bottom panel) can be more complicated than for LDA; note for instance the small slice of $\hat{y} = 3$ (blue) in between $\hat{y} = 1$ (red) and $\hat{y} = 2$ (green) around $-0.5$.
Decision boundaries for LDA and QDA
Once we have learned the parameters from training data, we can compute $\hat{y}_\star$ for a test input $x_\star$ by inserting everything into (3.18) for each class $k$, and take the prediction as the class which is predicted to have the highest probability $p(y \mid x)$. As it turns out, the equations (3.18) and (3.19) are simple enough that we can, using only pen and paper, say something about the decision boundary, i.e., the boundary (in the input space) where the predictions shift between different classes.
If we note that neither the logarithm nor terms independent of $k$ change the location of the maximizing argument ($\arg\max_k$), we can for LDA write
$$
\begin{aligned}
\hat{y}_{\text{LDA}} &= \arg\max_k \; p(y = k \mid x) \\
&= \arg\max_k \; \log p(y = k \mid x) \\
&= \arg\max_k \; \Big[ \log \hat{\pi}_k + \log \mathcal{N}\big(x \mid \hat{\mu}_k, \hat{\Sigma}\big) - \log \sum_{j=1}^{K} \hat{\pi}_j \, \mathcal{N}\big(x \mid \hat{\mu}_j, \hat{\Sigma}\big) \Big] \\
&= \arg\max_k \; \Big[ \log \hat{\pi}_k + \log \mathcal{N}\big(x \mid \hat{\mu}_k, \hat{\Sigma}\big) \Big] \\
&= \arg\max_k \; \Big[ \log \hat{\pi}_k - \tfrac{1}{2} \log \det 2\pi\hat{\Sigma} - \tfrac{1}{2}(x - \hat{\mu}_k)^{\mathsf{T}} \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \Big] \\
&= \arg\max_k \; \underbrace{\Big[ \log \hat{\pi}_k - \tfrac{1}{2} \hat{\mu}_k^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_k + x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_k \Big]}_{\triangleq\, \delta_k^{\text{LDA}}(x)}. \qquad (3.22)
\end{aligned}
$$
The function $\delta_k^{\text{LDA}}(x)$ on the last row is sometimes referred to as the discriminant function. The points $x$ on the boundary between two class predictions, say $k = 0$ and $k = 1$, are characterized by $\delta_0^{\text{LDA}}(x) = \delta_1^{\text{LDA}}(x)$, i.e., the decision boundary between the two classes $0$ and $1$ can be written as the set of points $x$ which fulfill
$$
\delta_0^{\text{LDA}}(x) = \delta_1^{\text{LDA}}(x) \quad\Leftrightarrow \qquad (3.23)
$$
$$
\log \hat{\pi}_0 - \tfrac{1}{2} \hat{\mu}_0^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_0 + x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_0 = \log \hat{\pi}_1 - \tfrac{1}{2} \hat{\mu}_1^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_1 + x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_1
$$
$$
\Leftrightarrow\quad x^{\mathsf{T}} \hat{\Sigma}^{-1} (\hat{\mu}_0 - \hat{\mu}_1) = \underbrace{\log \hat{\pi}_1 - \log \hat{\pi}_0 - \tfrac{1}{2} \big( \hat{\mu}_1^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_1 - \hat{\mu}_0^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_0 \big)}_{\text{constant (independent of } x)}. \qquad (3.24)
$$
From linear algebra, we know that $\{x : x^{\mathsf{T}} A = c\}$, for some vector $A$ and some constant $c$, defines a hyperplane in the $x$-space. Thus, the decision boundary for LDA is always linear, and hence its name, linear discriminant analysis.
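To connect (3.22) and (3.24) to code, here is a brief sketch (our own illustration, with hypothetical function names) of the LDA discriminant function and the hyperplane coefficients of the boundary between two classes.

```python
import numpy as np

def lda_discriminant(x, pi_k, mu_k, Sigma_inv):
    """delta_k^LDA(x) = log pi_k - 1/2 mu_k^T Sigma^-1 mu_k + x^T Sigma^-1 mu_k, cf. (3.22)."""
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

def lda_boundary(pi_0, mu_0, pi_1, mu_1, Sigma_inv):
    """Coefficients (A, c) of the hyperplane {x : x^T A = c} in (3.24)."""
    A = Sigma_inv @ (mu_0 - mu_1)
    c = (np.log(pi_1) - np.log(pi_0)
         - 0.5 * (mu_1 @ Sigma_inv @ mu_1 - mu_0 @ Sigma_inv @ mu_0))
    return A, c
```

Taking $\arg\max_k \delta_k^{\text{LDA}}(x_\star)$ over these discriminant values gives exactly the same prediction as (3.21).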
For QDA we can do a similar derivation,
$$
\hat{y}_{\text{QDA}} = \arg\max_k \; \underbrace{\Big[ \log \hat{\pi}_k - \tfrac{1}{2} \log \det \hat{\Sigma}_k - \tfrac{1}{2} \hat{\mu}_k^{\mathsf{T}} \hat{\Sigma}_k^{-1} \hat{\mu}_k + x^{\mathsf{T}} \hat{\Sigma}_k^{-1} \hat{\mu}_k - \tfrac{1}{2} x^{\mathsf{T}} \hat{\Sigma}_k^{-1} x \Big]}_{\triangleq\, \delta_k^{\text{QDA}}(x)}, \qquad (3.25)
$$
and set $\delta_0^{\text{QDA}}(x) = \delta_1^{\text{QDA}}(x)$ to find the decision boundary as the set of points $x$ for which
$$
\log \hat{\pi}_0 - \tfrac{1}{2} \log \det \hat{\Sigma}_0 - \tfrac{1}{2} \hat{\mu}_0^{\mathsf{T}} \hat{\Sigma}_0^{-1} \hat{\mu}_0 + x^{\mathsf{T}} \hat{\Sigma}_0^{-1} \hat{\mu}_0 - \tfrac{1}{2} x^{\mathsf{T}} \hat{\Sigma}_0^{-1} x
= \log \hat{\pi}_1 - \tfrac{1}{2} \log \det \hat{\Sigma}_1 - \tfrac{1}{2} \hat{\mu}_1^{\mathsf{T}} \hat{\Sigma}_1^{-1} \hat{\mu}_1 + x^{\mathsf{T}} \hat{\Sigma}_1^{-1} \hat{\mu}_1 - \tfrac{1}{2} x^{\mathsf{T}} \hat{\Sigma}_1^{-1} x
$$
$$
\Leftrightarrow\quad x^{\mathsf{T}} \big( \hat{\Sigma}_0^{-1} \hat{\mu}_0 - \hat{\Sigma}_1^{-1} \hat{\mu}_1 \big) - \tfrac{1}{2} x^{\mathsf{T}} \big( \hat{\Sigma}_0^{-1} - \hat{\Sigma}_1^{-1} \big) x
= \underbrace{\log \hat{\pi}_1 - \log \hat{\pi}_0 - \tfrac{1}{2} \log \det \hat{\Sigma}_1 + \tfrac{1}{2} \log \det \hat{\Sigma}_0 - \tfrac{1}{2} \big( \hat{\mu}_1^{\mathsf{T}} \hat{\Sigma}_1^{-1} \hat{\mu}_1 - \hat{\mu}_0^{\mathsf{T}} \hat{\Sigma}_0^{-1} \hat{\mu}_0 \big)}_{\text{constant (independent of } x)}. \qquad (3.26)
$$
This is now of the form $\{x : x^{\mathsf{T}} A + x^{\mathsf{T}} B x = c\}$, a quadratic form in $x$, and the decision boundary for QDA is thus always quadratic (and thereby also nonlinear!), which is the reason for its name, quadratic discriminant analysis.
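Analogously, a minimal sketch of the QDA discriminant function in (3.25) could look as follows (again our own illustration; `np.linalg.slogdet` is used for the log-determinant).

```python
import numpy as np

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """delta_k^QDA(x) as in (3.25), with the class-specific covariance Sigma_k."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)   # log det Sigma_k
    return (np.log(pi_k)
            - 0.5 * logdet
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x)
```

The extra term $-\tfrac{1}{2} x^{\mathsf{T}} \hat{\Sigma}_k^{-1} x$, which does not cancel between classes, is what makes the boundary quadratic rather than linear.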
(a) LDA for $K = 2$ classes always gives a linear decision boundary. The red dots and green circles are training data from different classes, and the intersection between the red and green fields is the decision boundary obtained for an LDA classifier learned from the training data.
(b) LDA for $K = 3$ classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of classes is still linear.
(c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data.
(d) With $K = 3$ classes, the decision boundaries for QDA are possibly more complex than with LDA, as in this case (cf. (b)).
Figure 3.7: Examples of decision boundaries for LDA and QDA, respectively (axes $x_1$ and $x_2$). This can be compared to Figure 3.3, where the decision boundary for logistic regression (with the same training data) is shown. LDA and logistic regression both have linear decision boundaries, but they are not identical.
(a) With $K = 2$ classes, Bayes' classifier tells us to take the class which has probability $> 0.5$ as the prediction $\hat{y}$. Here, the prediction would therefore be $\hat{y} = 1$.
(b) For $K = 4$ classes, Bayes' classifier tells us to take the prediction $\hat{y}$ as the highest bar, which means $\hat{y} = 4$ here. (In contrast to $K = 2$ classes in (a), it can happen that no class has probability $> 0.5$.)
Figure 3.8: Bayes' classifier: the probabilities $p(y \mid x)$ are shown as the heights of the bars. Bayes' classifier says that if we want to make as few misclassifications as possible, on average, we should predict $\hat{y}$ as the class which has the highest probability.
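As a tiny numerical illustration of Figure 3.8(b), with made-up posterior probabilities, the arg max in (3.21) picks the highest bar even though no single class reaches probability 0.5.

```python
import numpy as np

# Hypothetical posterior probabilities p(y = k | x) for K = 4 classes (made up).
p_y_given_x = np.array([0.15, 0.20, 0.25, 0.40])   # sums to 1, no class above 0.5
y_hat = np.argmax(p_y_given_x) + 1                 # classes labeled 1, ..., K
print(y_hat)                                       # prints 4, as in Figure 3.8(b)
```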