In the previous section we saw the cross-validation, jackknife, and bootstrap estimates of expected excess error for Gregory's rule. These estimates give bias corrections to the apparent error. Do these corrections offer real improvements? Introduce the "estimators" $\hat r_{\mathrm{app}} \equiv 0$, the zero-correction estimate corresponding to the apparent error, and $\hat r_{\mathrm{ideal}} \equiv E(\hat R)$, the best constant estimate if we knew the expected excess error $E(\hat R)$. To compare $\hat r_{\mathrm{cross}}$, $\hat r_{\mathrm{jack}}$, and $\hat r_{\mathrm{boot}}$ against these worst and best cases $\hat r_{\mathrm{app}}$ and $\hat r_{\mathrm{ideal}}$, we perform some simulations.
To judge the performance of estimators in the simulations, we use two criteria: the root mean squared error (RMSE) about the excess error,

$$\mathrm{RMSE}_1(\hat r) = \left(E[\hat r - \hat R]^2\right)^{1/2},$$

and the root mean squared error about the expected excess error,

$$\mathrm{RMSE}_2(\hat r) = \left(E[\hat r - E(\hat R)]^2\right)^{1/2}.$$

Notice that since

$$E[\hat r - \hat R]^2 = E\left[(\hat\theta_{\mathrm{app}} + \hat r) - (\hat\theta_{\mathrm{app}} + \hat R)\right]^2,$$

$\mathrm{RMSE}_1(\hat r)$ also measures the performance of the bias-corrected estimate $\hat\theta_{\mathrm{app}} + \hat r$ as an estimator of the true error $\hat\theta_{\mathrm{app}} + \hat R$.
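As a small illustration of these two criteria (my sketch, not part of the original study; the function names and the toy inputs are invented), the sample versions of $\mathrm{RMSE}_1$ and $\mathrm{RMSE}_2$ can be computed from paired draws of an estimate $\hat r$ and the true excess error $\hat R$:

```python
import numpy as np

def rmse1(r_hat, R_hat):
    """Sample version of RMSE1: root mean squared error about the excess error."""
    r_hat, R_hat = np.asarray(r_hat), np.asarray(R_hat)
    return np.sqrt(np.mean((r_hat - R_hat) ** 2))

def rmse2(r_hat, R_hat):
    """Sample version of RMSE2: root mean squared error about the expected
    excess error, here approximated by the average of the true excess errors."""
    r_hat, R_hat = np.asarray(r_hat), np.asarray(R_hat)
    return np.sqrt(np.mean((r_hat - R_hat.mean()) ** 2))

# Toy inputs (invented): excess errors across hypothetical experiments,
# together with the two reference "estimators" of the text.
rng = np.random.default_rng(0)
R_hat = rng.normal(0.10, 0.09, size=400)      # hypothetical true excess errors
r_app = np.zeros_like(R_hat)                  # zero correction: r_app == 0
r_ideal = np.full_like(R_hat, R_hat.mean())   # best constant: estimated E(R-hat)
print(rmse1(r_app, R_hat), rmse2(r_app, R_hat))
print(rmse1(r_ideal, R_hat), rmse2(r_ideal, R_hat))  # second value is 0
```

By construction, $\mathrm{RMSE}_2(\hat r_{\mathrm{ideal}}) = 0$ and $\mathrm{RMSE}_2(\hat r_{\mathrm{app}})$ equals the (estimated) expected excess error, which is why these two serve as best and worst reference cases.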
I pause to clarify the distinction between excess error and expected excess error. In the chronic hepatitis problem, the training sample that Gregory observed led to a particular realization (3.2). The excess error is the difference between the true and apparent error of the realized rule based on this training sample. The expected excess error averages the
excess error over the many training samples that Gregory might have observed and, therefore, over many realizations of this prediction rule.
Because $\hat r_{\mathrm{cross}}$, $\hat r_{\mathrm{jack}}$, and $\hat r_{\mathrm{boot}}$ average over many realizations, they are, strictly speaking, estimates of the expected excess error. Gregory, however, would much rather know the excess error of his particular realization.
It is perhaps unfair to think of $\hat r_{\mathrm{cross}}$, $\hat r_{\mathrm{jack}}$, and $\hat r_{\mathrm{boot}}$ as estimators of the excess error. A simple analogy may be helpful. Suppose $X$ is an observation from the distribution $F_\zeta$, and $T(X)$ estimates $\zeta$. The bias is the expected difference $E[T(X) - \zeta]$ and is analogous to the expected excess error. The difference $T(X) - \zeta$ is analogous to the excess error. Getting a good estimate of the bias is sometimes possible, but getting a good estimate of the difference $T(X) - \zeta$ would be equivalent to knowing $\zeta$.
In the simulations, the underlying model was the logistic model that assumes $x_1 = (t_1, y_1), \ldots, x_n = (t_n, y_n)$ are independent and identically distributed such that $y_i$ conditional on $t_i$ is Bernoulli with probability of success $\theta(t_i)$, where

$$\operatorname{logit} \theta(t_i) = \beta_0 + t_i\beta, \tag{4.1}$$

and $t_i = (t_{i1}, \ldots, t_{ip})$ is $p$-variate normal with zero mean and a specified covariance structure $\Sigma$.

I performed two sets of simulations. In the first set (simulations 1.1, 1.2, 1.3) I let the sample sizes be, respectively, $n = 20, 40, 60$; the dimension of $t_i$ be $p = 4$; and

$$\Sigma = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & \tau & 0 \\ 0 & \tau & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad \beta_0 = 0, \qquad \beta = (1,\, 2,\, 0,\, 0)^{\top}, \tag{4.2}$$

where $\tau = 0.80$. We would expect a good prediction rule to choose variables $t_1$ and $t_2$, and due to the correlation between variables $t_2$ and $t_3$, a prediction rule choosing $t_1$ and $t_3$ would probably not be too bad. In the second set of simulations (simulations 2.1, 2.2, 2.3), the sample sizes were again $n = 20, 40, 60$; the dimension of $t_i$ was increased to $p = 6$; and

$$\Sigma = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \tau & 0 \\ 0 & 0 & 0 & \tau & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad \beta_0 = 0, \qquad \beta = (1,\, 1,\, 1,\, 2,\, 0,\, 0)^{\top}. \tag{4.3}$$
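To make the simulation design concrete, here is a minimal sketch (my code, not the original; `sigma_42` and `draw_sample` are invented names) that generates one training sample from the model (4.1) with the parameters (4.2):

```python
import numpy as np

def sigma_42(tau=0.80):
    """Covariance (4.2): the identity except for correlation tau
    between variables t2 and t3 (0-based indices 1 and 2)."""
    S = np.eye(4)
    S[1, 2] = S[2, 1] = tau
    return S

def draw_sample(n, Sigma, beta0, beta, rng):
    """Draw (t_i, y_i), i = 1, ..., n, from the logistic model (4.1):
    t_i is p-variate normal(0, Sigma); y_i | t_i is Bernoulli(theta(t_i))."""
    t = rng.multivariate_normal(np.zeros(len(beta)), Sigma, size=n)
    theta = 1.0 / (1.0 + np.exp(-(beta0 + t @ beta)))  # inverse logit
    y = rng.binomial(1, theta)
    return t, y

# Simulation 1.1 sizes: n = 20, p = 4, beta0 = 0, beta = (1, 2, 0, 0)'.
rng = np.random.default_rng(1)
t, y = draw_sample(20, sigma_42(), 0.0, np.array([1.0, 2.0, 0.0, 0.0]), rng)
```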
Each of the six simulations consisted of 400 experiments. The results of all 400 experiments of simulation 1.1 are summarized in Table 1. In each experiment, we estimate the excess error $\hat R$ by evaluating the realized prediction rule on a large number (5,000) of new observations. We estimate the expected excess error by the sample average of the excess errors in the 400 experiments. To compare the three estimators, I first remark that in the 400 experiments, the bootstrap estimate was closest to the true excess error 210 times. From Table 1 we see that since

$$E(\hat r_{\mathrm{cross}}) = 0.1039, \qquad E(\hat r_{\mathrm{jack}}) = 0.0951, \qquad E(\hat R) = 0.1006$$

are all close, $\hat r_{\mathrm{cross}}$ and $\hat r_{\mathrm{jack}}$ are nearly unbiased estimates of the expected excess error $E(\hat R)$, whereas $\hat r_{\mathrm{boot}}$, with expectation $E(\hat r_{\mathrm{boot}}) = 0.0786$, is biased downwards. [Actually, since we are using the sample averages of the excess errors in 400 experiments as estimates of the expected excess errors, we are more correct in saying that a 95% confidence interval for $E(\hat r_{\mathrm{cross}})$ is (0.0935, 0.1143), which contains $E(\hat R)$, and a 95% confidence interval for $E(\hat r_{\mathrm{jack}})$ is (0.0866, 0.1036), which also contains $E(\hat R)$. On the other hand, a 95% confidence interval for $E(\hat r_{\mathrm{boot}})$ is (0.0761, 0.0811), which does not contain $E(\hat R)$.] However, $\hat r_{\mathrm{cross}}$ and $\hat r_{\mathrm{jack}}$ have enormous standard deviations, 0.1060 and 0.0864, respectively, compared with 0.0252, the standard deviation of $\hat r_{\mathrm{boot}}$. From the column for $\mathrm{RMSE}_1$,

$$\mathrm{RMSE}_1(\hat r_{\mathrm{ideal}}) < \mathrm{RMSE}_1(\hat r_{\mathrm{boot}}) < \mathrm{RMSE}_1(\hat r_{\mathrm{app}}) \approx \mathrm{RMSE}_1(\hat r_{\mathrm{cross}}) \approx \mathrm{RMSE}_1(\hat r_{\mathrm{jack}}),$$

with $\mathrm{RMSE}_1(\hat r_{\mathrm{boot}})$ being about one-third of the distance between $\mathrm{RMSE}_1(\hat r_{\mathrm{ideal}})$ and $\mathrm{RMSE}_1(\hat r_{\mathrm{app}})$. The same ordering holds for $\mathrm{RMSE}_2$.
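A single experiment of the kind just described can be sketched as follows, reusing `draw_sample` and `sigma_42` from the earlier sketch. This is my illustration under a simplifying assumption: a plain logistic fit (no forward variable selection) stands in for the prediction rule actually studied, and `fit_logistic`, `error_rate`, and `one_experiment` are invented names.

```python
import numpy as np

def fit_logistic(t, y, iters=25):
    """Logistic regression by Newton-Raphson, intercept included. A plain fit
    with all variables stands in here for the forward-selection rule."""
    X = np.column_stack([np.ones(len(y)), t])
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))  # clip for stability
        w = p * (1.0 - p)
        H = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])  # ridged Hessian
        b = b + np.linalg.solve(H, X.T @ (y - p))
    return b

def error_rate(b, t, y):
    """Misclassification rate of the rule 'predict 1 when fitted p > 1/2'."""
    X = np.column_stack([np.ones(len(y)), t])
    return np.mean((X @ b > 0).astype(int) != y)

def one_experiment(n, Sigma, beta0, beta, rng, n_new=5000):
    """Excess error R-hat: true error evaluated on 5,000 fresh observations
    minus the apparent error on the training sample."""
    t, y = draw_sample(n, Sigma, beta0, beta, rng)
    b = fit_logistic(t, y)
    t_new, y_new = draw_sample(n_new, Sigma, beta0, beta, rng)
    return error_rate(b, t_new, y_new) - error_rate(b, t, y)

rng = np.random.default_rng(2)
beta = np.array([1.0, 2.0, 0.0, 0.0])          # simulation 1.1 parameters
R_hat = [one_experiment(20, sigma_42(), 0.0, beta, rng) for _ in range(400)]
print(np.mean(R_hat))  # sample average: estimate of the expected excess error
```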
TABLE 1  The Results of 400 Experiments of Simulation 1.1

  Estimate    E(r̂)      SD(r̂)     RMSE1(r̂)   RMSE2(r̂)
  apparent    0.0000    0.0000    0.1354      0.1006
  cross       0.1039    0.1060    0.1381      0.1060
  jack        0.0951    0.0864    0.1274      0.0865
  boot        0.0786    0.0252    0.1078      0.0334
  ideal       0.1006    0.0000    0.0906      0.0000

Note: RMSE1 is the root mean squared error about the true excess error, and RMSE2 is that about the expected excess error. The entry under E(r̂) for ideal is the expected excess error E(R̂).

Recall that simulations 1.1, 1.2, and 1.3 had the same underlying distribution but differing sample sizes, $n = 20, 40,$ and $60$. As sample size increased, the expected excess error decreased, as did the mean squared error of the apparent error. We observed a similar pattern in simulations 2.1, 2.2, and 2.3, where the sample sizes were again $n = 20, 40,$ and $60$,
and the dimension of $t_i$ was increased to $p = 6$, with $\Sigma$, $\beta_0$, and $\beta$ given in (4.3). For larger sample sizes, bias corrections to the apparent error became less important. It is still interesting, however, to compare mean squared errors. For all six simulations, I plot the $\mathrm{RMSE}_1$'s in Figure 2 and the $\mathrm{RMSE}_2$'s in Figure 3. It is interesting to note that the ordering noticed in simulation 1.1 of the root mean squared errors of the five estimates also held in the other five simulations. That is,

$$\mathrm{RMSE}_1(\hat r_{\mathrm{app}}) \approx \mathrm{RMSE}_1(\hat r_{\mathrm{cross}}) \approx \mathrm{RMSE}_1(\hat r_{\mathrm{jack}}),$$

and $\mathrm{RMSE}_1(\hat r_{\mathrm{boot}})$ is about one-third of the distance between $\mathrm{RMSE}_1(\hat r_{\mathrm{ideal}})$ and $\mathrm{RMSE}_1(\hat r_{\mathrm{app}})$. Similar remarks hold for $\mathrm{RMSE}_2$. Cross-validation and the jackknife offer no improvement over the apparent error, whereas the improvement given by the bootstrap is substantial.

[FIGURE 2. 95% (nonsimultaneous) confidence intervals for $\mathrm{RMSE}_1$, one panel per simulation (1.1, 1.2, 1.3, 2.1, 2.2, 2.3); horizontal axis 0.00 to 0.20. In each set of simulations, there are five confidence intervals for, respectively, the apparent (A), cross-validation (C), jackknife (J), bootstrap (B), and ideal (I) estimates of the excess error. The middle vertical bar in each confidence interval represents the value of the estimate.]
The superiority of the bootstrap over cross-validation has been observed in other problems. Efron (1983) discussed estimates of excess error and
performed several simulations with a flavor similar to mine. I report on only one of his simulations here. When the prediction rule is the usual Fisher discriminant and the training sample consists of 14 observations that are equally likely from $N((-\tfrac{1}{2}, 0), I)$ or $N((+\tfrac{1}{2}, 0), I)$, the $\mathrm{RMSE}_1$'s of the apparent, cross-validation, bootstrap, and ideal estimates are, respectively, 0.149, 0.144, 0.134, and 0.114. Notice that the $\mathrm{RMSE}_1$'s of the cross-validation and apparent estimates are close, whereas the $\mathrm{RMSE}_1$ of the bootstrap estimate is about halfway between those of the ideal and apparent estimates.
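For concreteness, Efron's setup could be sketched as follows (my reconstruction for illustration only; `fisher_rule` is an invented name, the code assumes both classes appear in the sample, and no attempt is made to reproduce his reported numbers):

```python
import numpy as np

def fisher_rule(X, y):
    """Sample Fisher linear discriminant with pooled covariance; returns a
    classifier assigning class 1 when the discriminant score exceeds its midpoint."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
    S = centered.T @ centered / max(len(y) - 2, 1)   # pooled covariance
    w = np.linalg.solve(S, m1 - m0)
    c = w @ (m0 + m1) / 2.0
    return lambda Z: (Z @ w > c).astype(int)

# One training sample: 14 observations, each equally likely from
# N((-1/2, 0), I) (class 0) or N((+1/2, 0), I) (class 1).
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=14)
X = rng.standard_normal((14, 2))
X[:, 0] += np.where(y == 1, 0.5, -0.5)
rule = fisher_rule(X, y)
apparent_error = np.mean(rule(X) != y)   # the excess error subtracts this
```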
In the remainder of this section, I discuss the sufficiency of the number of bootstrap replications and the number of experiments.
Throughout the simulations, I used $B = 100$ bootstrap replications for each experiment. Denote by $\hat r_{\mathrm{boot}(B)}$ the bootstrap estimate based on $B$ bootstrap replications.
[FIGURE 3. 95% (nonsimultaneous) confidence intervals for $\mathrm{RMSE}_2$, one panel per simulation (1.1, 1.2, 1.3, 2.1, 2.2, 2.3); horizontal axis 0.00 to 0.20. In each set of simulations, there are four confidence intervals for, respectively, the apparent (A), cross-validation (C), jackknife (J), and bootstrap (B) estimates of the expected excess error. Notice that $\hat r_{\mathrm{app}} \equiv 0$, so $\mathrm{RMSE}_2(\hat r_{\mathrm{app}})$ is the expected excess error, a constant; the "confidence interval" for $\mathrm{RMSE}_2(\hat r_{\mathrm{app}})$ is a single value, indicated by a single bar. In addition, $\mathrm{RMSE}_2(\hat r_{\mathrm{ideal}}) = 0$, and its confidence intervals are not shown. Some of the bootstrap confidence intervals are so small that they are indistinguishable from single bars.]
Using a component-of-variance calculation (Gong 1982), for simulation 1.1 the root mean squared error of $\hat r_{\mathrm{boot}(100)}$ about the excess error is already close to its limiting value as $B \to \infty$, so if we are interested in comparing root mean squared errors about the excess error, we need not perform more than $B = 100$ bootstrap replications.
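For reference, the bootstrap estimate $\hat r_{\mathrm{boot}(B)}$ can be computed along the following lines, reusing `fit_logistic` and `error_rate` from the earlier sketch. This is my sketch of the standard optimism-style bootstrap estimate of expected excess error, not the original code, and `boot_excess` is an invented name.

```python
import numpy as np

def boot_excess(t, y, B=100, rng=None):
    """Bootstrap estimate of expected excess error: for each of B bootstrap
    samples, refit the rule and record (its error on the original sample)
    minus (its apparent error on the bootstrap sample); average over B."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    diffs = np.empty(B)
    for j in range(B):
        idx = rng.integers(0, n, size=n)     # resample pairs with replacement
        b = fit_logistic(t[idx], y[idx])     # rule realized on bootstrap sample
        diffs[j] = error_rate(b, t, y) - error_rate(b, t[idx], y[idx])
    return diffs.mean()
```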
In each simulation, I included 400 experiments and therefore used the approximation

$$\mathrm{RMSE}_1(\hat r) \approx \left(\frac{1}{400}\sum_{e=1}^{400}(\hat r_e - \hat R_e)^2\right)^{1/2},$$

where $\hat r_e$ and $\hat R_e$ are the estimate and true excess of the $e$th experiment, with the analogous sample version for $\mathrm{RMSE}_2$.
Figures 2 and 3 show 95% nonsimultaneous confidence intervals for the $\mathrm{RMSE}_1$'s and $\mathrm{RMSE}_2$'s. Shorter intervals for the $\mathrm{RMSE}_1$'s would be preferable, but obtaining them would be time-consuming. Four hundred experiments of simulation 1.1 with $p = 4$, $n = 20$, and $B = 100$ took 16 computer hours on the PDP-11/34 minicomputer, whereas 400 experiments of simulation 2.3 with $p = 6$, $n = 60$, and $B = 100$ took 72 hours. Halving the length of the confidence intervals in Figures 2 and 3 would require four times the number of experiments and four times the computer time. On the other hand, for each simulation in Figure 3, the confidence interval for $\mathrm{RMSE}_2(\hat r_{\mathrm{ideal}})$ is disjoint from that for $\mathrm{RMSE}_2(\hat r_{\mathrm{boot}})$, and both are disjoint from the confidence intervals for $\mathrm{RMSE}_2(\hat r_{\mathrm{jack}})$, $\mathrm{RMSE}_2(\hat r_{\mathrm{cross}})$, and $\mathrm{RMSE}_2(\hat r_{\mathrm{app}})$. Thus, for $\mathrm{RMSE}_2$, we can convincingly argue that the number of experiments is sufficient.
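One way such nonsimultaneous intervals can be formed (my construction for illustration; the text does not spell out its interval formula, and `rmse_interval` is an invented name) is to treat the per-experiment squared deviations as approximately independent, form a normal-theory interval for the mean squared error, and take square roots. Because the standard error of the mean shrinks like one over the square root of the number of experiments, halving the interval length requires roughly four times as many experiments.

```python
import numpy as np

def rmse_interval(dev, z=1.96):
    """Normal-theory 95% interval for an RMSE, given the per-experiment
    deviations (r_hat_e - R_hat_e for RMSE1; r_hat_e - mean(R_hat) for RMSE2)."""
    sq = np.asarray(dev) ** 2
    m = sq.mean()
    se = sq.std(ddof=1) / np.sqrt(len(sq))   # shrinks like 1/sqrt(#experiments)
    return np.sqrt(max(m - z * se, 0.0)), np.sqrt(m + z * se)
```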