For now, assume that p = 1; that is, we have only one predictor. We would like to obtain an estimate for fk(x) that we can plug into (4.10) in order to estimate pk(x). We will then classify an observation to the class for which pk(x) is greatest. In order to estimate fk(x), we will first make some assumptions about its form.
Suppose we assume that fk(x) is normal or Gaussian. In the one-dimensional setting, the normal density takes the form

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{1}{2\sigma_k^2} (x - \mu_k)^2 \right),    (4.11)
where μk and σk² are the mean and variance parameters for the kth class. For now, let us further assume that σ1² = . . . = σK²: that is, there is a shared variance term across all K classes, which for simplicity we can denote by σ². Plugging (4.11) into (4.10), we find that
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu_l)^2 \right)}.    (4.12)

(Note that in (4.12), πk denotes the prior probability that an observation belongs to the kth class, not to be confused with π ≈ 3.14159, the mathematical constant.)
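As a concrete illustration of the computation in (4.12), the following Python sketch evaluates the posterior probabilities pk(x) at a single value of x. The parameter values (two classes with means −1.25 and 1.25, shared variance 1, and equal priors) are chosen to match the example discussed below, and the function name normal_density is our own.

```python
import numpy as np

def normal_density(x, mu, sigma2):
    """One-dimensional normal density f_k(x), as in (4.11)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mus = np.array([-1.25, 1.25])    # class means mu_1, mu_2
sigma2 = 1.0                     # shared variance sigma^2
priors = np.array([0.5, 0.5])    # prior probabilities pi_1, pi_2

x = 0.5
numerators = priors * normal_density(x, mus, sigma2)
posteriors = numerators / numerators.sum()   # p_k(x), the two terms of (4.12)
print(posteriors)   # the Bayes classifier picks the class with the larger value
```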
FIGURE 4.4. Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.
The Bayes classifier involves assigning an observation X = x to the class for which (4.12) is largest. Taking the log of (4.12) and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which

\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)    (4.13)

is largest. For instance, if K = 2 and π1 = π2, then the Bayes classifier assigns an observation to class 1 if 2x(μ1 − μ2) > μ1² − μ2², and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}.    (4.14)
An example is shown in the left-hand panel of Figure 4.4. The two normal density functions that are displayed, f1(x) and f2(x), represent two distinct classes. The mean and variance parameters for the two density functions are μ1 = −1.25, μ2 = 1.25, and σ1² = σ2² = 1. The two densities overlap, and so given that X = x, there is some uncertainty about the class to which the observation belongs. If we assume that an observation is equally likely to come from either class (that is, π1 = π2 = 0.5), then by inspection of (4.14), we see that the Bayes classifier assigns the observation to class 1 if x < 0 and to class 2 otherwise. Note that in this case, we can compute the Bayes classifier because we know that X is drawn from a Gaussian distribution within each class, and we know all of the parameters involved.
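The same example can be worked through with the discriminant form (4.13). The short sketch below (our own illustration, with hypothetical function and variable names) confirms that with equal priors the Bayes rule reduces to comparing x with the midpoint (μ1 + μ2)/2 = 0 from (4.14).

```python
import numpy as np

def delta(x, mu, sigma2, prior):
    """Linear discriminant delta_k(x) from (4.13)."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(prior)

mu1, mu2, sigma2 = -1.25, 1.25, 1.0
pi1 = pi2 = 0.5

x = np.array([-2.0, -0.5, 0.5, 2.0])
assign_to_class_1 = delta(x, mu1, sigma2, pi1) > delta(x, mu2, sigma2, pi2)
print(assign_to_class_1)     # True exactly when x < 0
print((mu1 + mu2) / 2)       # Bayes decision boundary from (4.14): 0.0
```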
In a real-life situation, we are not able to calculate the Bayes classifier. In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters μ1, . . . , μK, π1, . . . , πK, and σ². The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for πk, μk, and σ² into (4.13). In particular, the following estimates are used:
\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i

\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2    (4.15)

where n is the total number of training observations, and nk is the number of training observations in the kth class. The estimate for μk is simply the average of all the training observations from the kth class, while σ̂² can be seen as a weighted average of the sample variances for each of the K classes. Sometimes we have knowledge of the class membership probabilities π1, . . . , πK, which can be used directly. In the absence of any additional information, LDA estimates πk using the proportion of the training observations that belong to the kth class. In other words,
\hat{\pi}_k = n_k / n.    (4.16)
The LDA classifier plugs the estimates given in (4.15) and (4.16) into (4.13), and assigns an observation X = x to the class for which

\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)    (4.17)

is largest. The word linear in the classifier's name stems from the fact that the discriminant functions δ̂k(x) in (4.17) are linear functions of x (as opposed to a more complex function of x).
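To show how the plug-in estimates (4.15)–(4.17) might be computed in practice, here is a short, self-contained Python sketch for a single predictor; the function names lda_fit and lda_predict are our own, not a standard API.

```python
import numpy as np

def lda_fit(x, y):
    """Compute the LDA estimates (4.15) and (4.16) for one predictor."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])          # (4.16)
    mu_hat = np.array([x[y == k].mean() for k in classes])         # class means, (4.15)
    sigma2_hat = sum(((x[y == k] - m) ** 2).sum()
                     for k, m in zip(classes, mu_hat)) / (n - K)   # pooled variance, (4.15)
    return classes, pi_hat, mu_hat, sigma2_hat

def lda_predict(x_new, classes, pi_hat, mu_hat, sigma2_hat):
    """Assign each observation to the class maximizing the discriminant (4.17)."""
    x_new = np.asarray(x_new, dtype=float).reshape(-1, 1)
    delta = x_new * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta, axis=1)]
```

Calling lda_fit on the training observations and then lda_predict on new values of x reproduces the classification rule described above.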
The right-hand panel of Figure 4.4 displays a histogram of a random sample of 20 observations from each class. To implement LDA, we began by estimating πk, μk, and σ² using (4.15) and (4.16). We then computed the decision boundary, shown as a black solid line, that results from assigning an observation to the class for which (4.17) is largest. All points to the left of this line will be assigned to the green class, while points to the right of this line are assigned to the purple class. In this case, since n1 = n2 = 20, we have π̂1 = π̂2. As a result, the decision boundary corresponds to the midpoint between the sample means for the two classes, (μ̂1 + μ̂2)/2. The figure indicates that the LDA decision boundary is slightly to the left of the optimal Bayes decision boundary, which instead equals (μ1 + μ2)/2 = 0. How well does the LDA classifier perform on this data? Since this is simulated data, we can generate a large number of test observations in order to compute the Bayes error rate and the LDA test error rate. These are 10.6 % and 11.1 %, respectively. In other words, the LDA classifier's error rate is only 0.5 % above the smallest possible error rate! This indicates that LDA is performing pretty well on this data set.
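For readers who want to reproduce an experiment of this kind, the sketch below is our own; the exact error rates depend on the random seed, so they will only roughly match the 10.6 % and 11.1 % reported above. It draws 20 training observations per class, forms the LDA boundary as the midpoint of the sample means, and estimates both error rates on a large simulated test set.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = -1.25, 1.25, 1.0

# Training sample: 20 observations from each class, as in Figure 4.4 (right panel).
x1 = rng.normal(mu1, sigma, 20)
x2 = rng.normal(mu2, sigma, 20)

# With n1 = n2 the estimated priors are equal, so the LDA boundary is the midpoint
# of the sample means; the Bayes boundary uses the true means and equals 0.
lda_boundary = (x1.mean() + x2.mean()) / 2
bayes_boundary = (mu1 + mu2) / 2

# Large test set; class 1 is predicted whenever x falls below the boundary
# (valid here because the class 1 sample mean is the smaller of the two).
t1 = rng.normal(mu1, sigma, 100_000)
t2 = rng.normal(mu2, sigma, 100_000)
bayes_error = (np.mean(t1 >= bayes_boundary) + np.mean(t2 < bayes_boundary)) / 2
lda_error = (np.mean(t1 >= lda_boundary) + np.mean(t2 < lda_boundary)) / 2
print(lda_boundary, bayes_error, lda_error)
```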
FIGURE 4.5. Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.
To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance σ², and plugging estimates for these parameters into the Bayes classifier. In Section 4.4.4, we will consider a less stringent set of assumptions, by allowing the observations in the kth class to have a class-specific variance, σk².