Similar to this, the bootstrap standard confidence interval is given by

$$ \hat{\theta} \pm z^{(1-\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}} \, . $$

Bootstrap-t Confidence Interval

The second type of confidence interval using the bootstrap is called the bootstrap-t. We first generate B bootstrap samples, and for each bootstrap sample the following quantity is computed:

$$ z^{*b} = \frac{\hat{\theta}^{*b} - \hat{\theta}}{\widehat{SE}^{*b}} \, . \qquad (6.22) $$
As before, $\hat{\theta}^{*b}$ is the bootstrap replicate of $\hat{\theta}$, but $\widehat{SE}^{*b}$ is the estimated standard error of $\hat{\theta}^{*b}$ for that bootstrap sample. If a formula exists for the standard error of $\hat{\theta}^{*b}$, then we can use that to determine the denominator of Equation 6.22. For instance, if $\hat{\theta}$ is the mean, then we can calculate the standard error as explained in Chapter 3. However, in most situations where we have to resort to using the bootstrap, these formulas are not available. One option is to use the bootstrap method of finding the standard error, keeping in mind that you are estimating the standard error of $\hat{\theta}^{*b}$ using the bootstrap sample $x^{*b}$. In other words, one resamples with replacement from the bootstrap sample $x^{*b}$ to get an estimate of $\widehat{SE}^{*b}$.
Once we have the B bootstrapped $z^{*b}$ values from Equation 6.22, the next step is to estimate the quantiles needed for the endpoints of the interval. The $\alpha/2$-th quantile, denoted by $\hat{t}^{(\alpha/2)}$, of the $z^{*b}$ is estimated by

$$ \frac{\alpha}{2} = \frac{\#\left\{ z^{*b} \le \hat{t}^{(\alpha/2)} \right\}}{B} \, . \qquad (6.23) $$

This says that the estimated quantile is the $\hat{t}^{(\alpha/2)}$ such that $100 \cdot \alpha/2$% of the points $z^{*b}$ are less than this number. For example, if $\alpha/2 = 0.05$ and $B = 100$, then $\hat{t}^{(0.05)}$ could be estimated as the fifth smallest value of the $z^{*b}$, $b = 1, \dots, B$. One could also use the quantile estimates discussed previously in Chapter 3 or some other suitable estimate.
We are now ready to calculate the bootstrap-t confidence interval. This is given by

$$ \left( \hat{\theta} - \hat{t}^{(1-\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}} \, ,\; \hat{\theta} - \hat{t}^{(\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}} \right), \qquad (6.24) $$

where $\widehat{SE}_{\hat{\theta}}$ is an estimate of the standard error of $\hat{\theta}$. The bootstrap-t interval is suitable for location statistics such as the mean or quantiles. However, its accuracy for more general situations is questionable [Efron and Tibshirani, 1993]. The next method, based on the bootstrap percentiles, is more reliable.
PROCEDURE - BOOTSTRAP-T CONFIDENCE INTERVAL
1. Given a random sample, $x = (x_1, \dots, x_n)$, calculate $\hat{\theta}$.
2. Sample with replacement from the original sample to get $x^{*b} = (x_1^{*b}, \dots, x_n^{*b})$.
3. Calculate the same statistic using the sample in step 2 to get $\hat{\theta}^{*b}$.
4. Use the bootstrap sample $x^{*b}$ to get the standard error of $\hat{\theta}^{*b}$. This can be calculated using a formula or estimated by the bootstrap.
5. Calculate $z^{*b}$ using the information found in steps 3 and 4.
6. Repeat steps 2 through 5, B times, where $B \ge 1000$.
7. Order the $z^{*b}$ from smallest to largest. Find the quantiles $\hat{t}^{(1-\alpha/2)}$ and $\hat{t}^{(\alpha/2)}$.
8. Estimate the standard error $\widehat{SE}_{\hat{\theta}}$ of $\hat{\theta}$ using the B bootstrap replicates of $\hat{\theta}^{*b}$ (from step 3).
9. Use Equation 6.24 to get the confidence interval.
The number of bootstrap replicates that are needed is quite large for confidence intervals. It is recommended that B should be 1000 or more. If no formula exists for calculating the standard error of $\hat{\theta}^{*b}$, then the bootstrap method can be used. This means that there are two levels of bootstrapping: one for finding the $\widehat{SE}^{*b}$ and one for finding the $z^{*b}$, which can greatly increase the computational burden. For example, say that $B = 1000$ and we use 50 bootstrap replicates to find $\widehat{SE}^{*b}$; then this results in a total of 50,000 resamples.
Example 6.11
Say we are interested in estimating the variance of the forearm data, and we decide to use the following statistic,

$$ \hat{\theta} = \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 , $$

which is the sample second central moment. We write our own simple function called mom (included in the Computational Statistics Toolbox) to estimate this.
% This function will calculate the sample 2nd
% central moment for a given sample vector x
function mr = mom(x)
n = length(x);
mu = mean(x);
mr = (1/n)*sum((x-mu).^2);
We use this function as an input argument to bootstrp to get the bootstrap-t confidence interval. The MATLAB code given below also shows how to get the bootstrap estimate of standard error for each bootstrap sample. First we load the data and get the observed value of the statistic.
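The loading and setup commands are not shown in this excerpt; a minimal version consistent with the variables used below is given here, where the values of B and alpha are assumptions (B of at least 1000 is recommended above).

load forearm
n = length(forearm);
B = 1000;
alpha = 0.10; % For a 90% confidence interval.
% Get the observed value of the statistic.
thetahat = mom(forearm);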
Now we get the bootstrap replicates using the function bootstrp. One of the optional output arguments from this function is a matrix of indices for the resamples. As shown below, each column of the output bootsam contains the indices to a bootstrap sample. We loop through all of the bootstrap samples to estimate the standard error of the bootstrap replicate using that resample.
% Get the bootstrap replicates and samples.
[bootreps, bootsam] = bootstrp(B,'mom',forearm);
% Set up some storage space for the SE's.
sehats = zeros(size(bootreps));
% Each column of bootsam contains indices
% to a bootstrap sample.
for i = 1:B
   % Extract the i-th bootstrap sample.
   xstar = forearm(bootsam(:,i));
   % Bootstrap that sample (50 resamples) to
   % estimate the standard error of its replicate.
   sehats(i) = std(bootstrp(50,'mom',xstar));
end
zvals = (bootreps - thetahat)./sehats;
Then we get the estimate of the standard error that we need for the endpoints of the interval.
% Estimate the SE using the bootstrap.
SE = std(bootreps);
Now we get the quantiles that we need for the interval given in Equation 6.24 and calculate the interval.
% Get the quantiles.
k = B*alpha/2;
szval = sort(zvals);
tlo = szval(k);
thi = szval(B-k);
% Get the endpoints of the interval.
blo = thetahat - thi*SE;
bhi = thetahat - tlo*SE;
The bootstrap-t interval for the variance of the forearm data is (1.00, 1.57).
Bootstrap Percentile Interval
An improved bootstrap confidence interval is based on the quantiles of the distribution of the bootstrap replicates. This technique has the benefit of being more stable than the bootstrap-t, and it also enjoys better theoretical coverage properties [Efron and Tibshirani, 1993]. The bootstrap percentile confidence interval is given by

$$ \left( \hat{\theta}_B^{*(\alpha/2)} \, ,\; \hat{\theta}_B^{*(1-\alpha/2)} \right), $$

where $\hat{\theta}_B^{*(\alpha/2)}$ is the $\alpha/2$ quantile in the bootstrap distribution of $\hat{\theta}^*$. For example, if $\alpha/2 = 0.025$ and $B = 1000$, then $\hat{\theta}_B^{*(0.025)}$ is the bootstrap replicate in the 25th position of the ordered bootstrap replicates. Similarly, $\hat{\theta}_B^{*(0.975)}$ is the replicate in position 975. As discussed previously, some other suitable estimate for the quantile can be used.
The procedure is the same as the general bootstrap method, making it easy to understand and to implement. We outline the steps below.
PROCEDURE - BOOTSTRAP PERCENTILE INTERVAL
1. Given a random sample, $x = (x_1, \dots, x_n)$, calculate $\hat{\theta}$.
2. Sample with replacement from the original sample to get $x^{*b}$.
3. Calculate the same statistic using the sample in step 2 to get the bootstrap replicates, $\hat{\theta}^{*b}$.
4. Repeat steps 2 through 3, B times, where $B \ge 1000$.
5. Order the $\hat{\theta}^{*b}$ from smallest to largest.
6. Calculate $B \cdot \alpha/2$ and $B \cdot (1 - \alpha/2)$.
7. The lower endpoint of the interval is given by the bootstrap replicate that is in the $B \cdot \alpha/2$-th position of the ordered $\hat{\theta}^{*b}$, and the upper endpoint is given by the bootstrap replicate that is in the $B \cdot (1-\alpha/2)$-th position of the same ordered list. Alternatively, using quantile notation, the lower endpoint is the estimated quantile $\hat{q}_{\alpha/2}$ and the upper endpoint is the estimated quantile $\hat{q}_{1-\alpha/2}$, where the estimates are taken from the bootstrap replicates.
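If the MATLAB Statistics Toolbox is available, the two quantile estimates in step 7 can also be obtained directly with the prctile function; here we assume the bootstrap replicates are stored in the vector bvals.

% Endpoints of the percentile interval.
endpts = prctile(bvals, [100*alpha/2, 100*(1 - alpha/2)]);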
Example 6.12
Let's find the bootstrap percentile interval for the same forearm data. The confidence interval is easily found from the bootstrap replicates, as shown below.

% Use Statistics Toolbox function
% to get the bootstrap replicates.
bvals = bootstrp(B,'mom',forearm);
% Find the quantiles for the endpoints.
k = B*alpha/2;
sbvals = sort(bvals);
blo = sbvals(k);
bhi = sbvals(B-k);

This interval is slightly narrower than the bootstrap-t interval from Example 6.11.
So far, we discussed three types of bootstrap confidence intervals. The standard interval is the easiest and assumes that $\hat{\theta}$ is normally distributed. The bootstrap-t interval estimates the standardized version of $\hat{\theta}$ from the data, avoiding the normality assumptions used in the standard interval. The percentile interval is simple to calculate and obtains the endpoints directly from the bootstrap estimate of the distribution for $\hat{\theta}$. It has another advantage in that it is range-preserving. This means that if the parameter $\theta$ can take on values only in a certain range, then the confidence interval will reflect that. This is not always the case with the other intervals.
According to Efron and Tibshirani [1993], the bootstrap-t interval has good coverage probabilities, but does not perform well in practice. The bootstrap percentile interval is more dependable in most situations, but does not enjoy the good coverage property of the bootstrap-t interval. There is another bootstrap confidence interval, called the $BC_a$ interval, that has both good coverage and is dependable. This interval is described in the next chapter.

The bootstrap estimates of bias and standard error are also random variables, and they have their own error associated with them. So, how accurate are they? In the next chapter, we discuss how one can use the jackknife method to evaluate the error in the bootstrap estimates.
As with any method, the bootstrap is not appropriate in every situation. When analytical methods are available to understand the uncertainty associated with an estimate, then those are more efficient than the bootstrap. In what situations should the analyst use caution in applying the bootstrap? One important assumption that underlies the theory of the bootstrap is the notion that the empirical distribution function is representative of the true population distribution. If this is not the case, then the bootstrap will not yield reliable results. For example, this can happen when the sample size is small or the sample was not gathered using appropriate random sampling techniques. Chernick [1999] describes other examples from the literature where the bootstrap should not be used. We also address a situation in Chapter 7 where the bootstrap fails. This can happen when the statistic is not smooth, such as the median.
6.5 MATLAB Code

We include several functions with the Computational Statistics Toolbox that implement some of the bootstrap techniques discussed in this chapter. These are listed in Table 6.2. Like bootstrp, these functions have an input argument that specifies a MATLAB function that calculates the statistic.

As we saw in the examples, the MATLAB Statistics Toolbox has a function called bootstrp that will return the bootstrap replicates from the input argument bootfun (e.g., mean, std, var, etc.). It takes an input data set, finds the bootstrap resamples, applies bootfun to the resamples, and stores the replicates in the first output argument. The user can get two outputs from the function: the bootstrap replicates and the indices that correspond to the points selected in each resample.
There is a Bootstrap MATLAB Toolbox written by Zoubir and Iskander at the Curtin University of Technology. It is available for download at www.atri.curtin.edu.au/csp. It requires the MATLAB Statistics Toolbox and has a postscript version of the reference manual.
TABLE 6.2
List of MATLAB Functions for Chapter 6

Purpose                                        MATLAB Function
General bootstrap: resampling,                 csboot
  estimates of standard error and bias         bootstrp
Constructing bootstrap confidence              csbootint
  intervals                                    csbooperint
                                               csbootbca
Other software exists for Monte Carlo simulation as applied to statistics. The Efron and Tibshirani [1993] book has a description of S code for implementing the bootstrap. This code, written by the authors, can be downloaded from the statistics archive at Carnegie-Mellon University that was mentioned in Chapter 1. Another software package that has some of these capabilities is called Resampling Stats® [Simon, 1999], and information on this can be found at www.resample.com. Routines are available from Resampling Stats for MATLAB [Kaplan, 1999] and Excel.
6.6 Further Reading
Mooney [1997] describes Monte Carlo simulation for inferential statistics in a way that is accessible to most data analysts. It has some excellent examples of using Monte Carlo simulation for hypothesis testing using multiple experiments, assessing the behavior of an estimator, and exploring the distribution of a statistic using graphical techniques. The text by Gentle [1998] has a chapter on performing Monte Carlo studies in statistics. He discusses how simulation can be considered as a scientific experiment and should be held to the same high standards. Hoaglin and Andrews [1975] provide guidelines and standards for reporting the results from computations. Efron and Tibshirani [1991] explain several computational techniques, written at a level accessible to most readers. Other articles describing Monte Carlo inferential methods can be found in Joeckel [1991], Hope [1968], Besag and Diggle [1977], Diggle and Gratton [1984], Efron [1979], Efron and Gong [1983], and Teichroew [1965].

There has been a lot of work in the literature on bootstrap methods. Perhaps the most comprehensive and easy to understand treatment of the topic can be found in Efron and Tibshirani [1993]. Efron's [1982] earlier monograph on resampling techniques describes the jackknife, the bootstrap and cross-validation. A more recent book by Chernick [1999] gives an updated description of results in this area, and it also has an extensive bibliography (over 1,600 references!) on the bootstrap. Hall [1992] describes the connection between Edgeworth expansions and the bootstrap. A volume of papers on the bootstrap was edited by LePage and Billard [1992], where many applications of the bootstrap are explored. Politis, Romano, and Wolf [1999] present subsampling as an alternative to the bootstrap. A subset of articles that present the theoretical justification for the bootstrap are Efron [1981, 1985, 1987]. The paper by Boos and Zhang [2000] looks at a way to ease the computational burden of Monte Carlo estimation of the power of tests that uses resampling methods. For a nice discussion on the coverage of the bootstrap percentile confidence interval, see Polansky [1999].
6.7 Exercises

6.1 Repeat Example 6.1, where the population standard deviation for the travel times to work is … minutes. Is … minutes still consistent with the null hypothesis?

6.2 Using the information in Example 6.3, plot the probability of Type II error as a function of the true mean. How does this compare with Figure 6.2?

6.3 Would you reject the null hypothesis in Example 6.4 if … ?

6.4 Using the same value for the sample mean, repeat Example 6.3 for different sample sizes of … . What happens to the curve showing the power as a function of the true mean as the sample size changes?

6.5 Repeat Example 6.6 using a two-tail test. In other words, test for the alternative hypothesis that the mean is not equal to 454.

6.6 Repeat Example 6.8 for larger M. Does the estimated Type I error get closer to the true value?

6.7 Write MATLAB code that implements the parametric bootstrap. Test it using the forearm data. Assume that the normal distribution is a reasonable model for the data. Use your code to get a bootstrap estimate of the standard error and the bias of the coefficient of skewness and the coefficient of kurtosis. Get a bootstrap percentile interval for the sample central second moment using your parametric bootstrap approach.

6.8 Write MATLAB code that will get the bootstrap standard confidence interval. Use it with the forearm data to get a confidence interval for the sample central second moment. Compare this interval with the ones obtained in the examples and in the previous problem.

6.9 Use your program from problem 6.8 and the forearm data to get a bootstrap confidence interval for the mean. Compare this to the theoretical one.

6.10 The remiss data set contains the remission times for 42 leukemia patients. Some of the patients were treated with the drug called 6-mercaptopurine (mp), and the rest were part of the control group (control). Use the techniques from Chapter 5 to help determine a suitable model (e.g., Weibull, exponential, etc.) for each group. Devise a Monte Carlo hypothesis test to test for the equality of means between the two groups [Hand, et al., 1994; Gehan, 1965]. Use the p-value approach.
6.11 The lawpop data set contains the average scores on a national law test (lsat) and the average undergraduate grade point average (gpa) for the 1973 freshman class at 82 law schools. Note that these data constitute the entire population. The data contained in law comprise a random sample of 15 of these classes. Obtain the true population variances for the lsat and the gpa. Use the sample in law to estimate the population variance using the sample central second moment. Get bootstrap estimates of the standard error and the bias in your estimate of the variance. Make some comparisons between the known population variance and the estimated variance.

6.12 Using the lawpop data, devise a test statistic to test for the significance of the correlation between the LSAT scores and the corresponding grade point averages. Get a random sample from the population, and use that sample to test your hypothesis. Do a Monte Carlo simulation of the Type I and Type II error of the test you devise.

6.13 In 1961, 16 states owned the retail liquor stores. In 26 others, the stores were owned by private citizens. The data contained in whisky reflect the price (in dollars) of a fifth of whisky from these 42 states. Note that this represents the population, not a sample. Use the whisky data to get an appropriate bootstrap confidence interval for the median price of whisky at the state owned stores and the median price of whisky at the privately owned stores. First get the random sample from each of the populations, and then use the bootstrap with that sample to get the confidence intervals. Do a Monte Carlo study where you compare the confidence intervals for different sample sizes. Compare the intervals with the known population medians [Hand, et al., 1994].

6.14 The quakes data [Hand, et al., 1994] give the time in days between successive earthquakes. Use the bootstrap to get an appropriate confidence interval for the average time between earthquakes.
7.1 Introduction

Data partitioning methods, such as those covered in this chapter, can be used for the following purposes:

• To evaluate the accuracy of the model or classification scheme;
• To decide what is a reasonable model for the data;
• To find a smoothing parameter in density estimation;
• To estimate the bias and error in parameter estimation;
• And many others.
We start off with an example to motivate the reader. We have a sample where we measured the average atmospheric temperature and the corresponding amount of steam used per month [Draper and Smith, 1981]. Our goal in the analysis is to model the relationship between these variables. Once we have a model, we can use it to predict how much steam is needed for a given average monthly temperature. The model can also be used to gain understanding about the structure of the relationship between the two variables.

The problem then is deciding what model to use. To start off, one should always look at a scatterplot (or scatterplot matrix) of the data, as discussed in Chapter 5. The scatterplot for these data is shown in Figure 7.1 and is examined in Example 7.3. We see from the plot that as the temperature increases, the amount of steam used per month decreases. It appears that using a line (i.e., a first degree polynomial) to model the relationship between the variables is not unreasonable. However, other models might provide a better fit. For example, a cubic or some higher degree polynomial might be a better model for the relationship between average temperature and steam usage.

So, how can we decide which model is better? To make that decision, we need to assess the accuracy of the various models. We could then choose the model that has the best accuracy or lowest error. In this chapter, we use the prediction error (see Equation 7.5) to measure the accuracy. One way to assess the error would be to observe new data (average temperature and corresponding monthly steam usage) and then determine the predicted monthly steam usage for the new observed average temperatures. We can compare this prediction with the true steam used and calculate the error. We do this for all of the proposed models and pick the model with the smallest error. The problem with this approach is that it is sometimes impossible to obtain new data, so all we have available to evaluate our models (or our statistics) is the original data set. In this chapter, we consider two methods that allow us to use the data already in hand for the evaluation of the models. These are cross-validation and the jackknife.

Cross-validation is typically used to determine the classification error rate for pattern recognition applications or the prediction error when building models. In Chapter 9, we will see two applications of cross-validation, where it is used to select the best classification tree and to estimate the misclassification rate. In this chapter, we show how cross-validation can be used to assess the prediction accuracy in a regression problem.

In the previous chapter, we covered the bootstrap method for estimating the bias and standard error of statistics. The jackknife procedure has a similar purpose and was developed prior to the bootstrap [Quenouille, 1949]. The connection between the methods is well known and is discussed in the literature [Efron and Tibshirani, 1993; Efron, 1982; Hall, 1992]. We include the jackknife procedure here because it is more a data partitioning method than a simulation method such as the bootstrap. We return to the bootstrap at the end of this chapter, where we present another method of constructing bootstrap confidence intervals using the jackknife. In the last section, we show how the jackknife method can be used to assess the error in our bootstrap estimates.
7.2 Cross-Validation
Often, one of the jobs of a statistician or engineer is to create models using sample data, usually for the purpose of making predictions. For example, given a data set that contains the drying time and the tensile strength of batches of cement, can we model the relationship between these two variables? We would like to be able to predict the tensile strength of the cement for a given drying time that we will observe in the future. We must then decide what model best describes the relationship between the variables and estimate its accuracy.

Unfortunately, in many cases the naive researcher will build a model based on the data set and then use that same data to assess the performance of the model. The problem with this is that the model is being evaluated or tested with data it has already seen. Therefore, that procedure will yield an overly optimistic (i.e., low) prediction error (see Equation 7.5). Cross-validation is a technique that can be used to address this problem by iteratively partitioning the sample into two sets of data. One is used for building the model, and the other is used to test it.

We introduce cross-validation in a linear regression application, where we are interested in estimating the expected prediction error. We use linear regression to illustrate the cross-validation concept, because it is a topic that most engineers and data analysts should be familiar with. However, before we describe the details of cross-validation, we briefly review the concepts in linear regression. We will return to this topic in Chapter 10, where we discuss methods of nonlinear regression.
Say we have a set of data, $(X_i, Y_i)$, where $X_i$ denotes a predictor variable and $Y_i$ represents the corresponding response variable. We are interested in modeling the dependency of Y on X. The easiest example of linear regression is in situations where we can fit a straight line between X and Y. In Figure 7.1, we show a scatterplot of 25 observed $(X_i, Y_i)$ pairs [Draper and Smith, 1981]. The X variable represents the average atmospheric temperature measured in degrees Fahrenheit, and the Y variable corresponds to the pounds of steam used per month. The scatterplot indicates that a straight line is a reasonable model for the relationship between these variables. We will use these data to illustrate linear regression.
The linear, first-order model is given by

$$ Y = \beta_0 + \beta_1 X + \epsilon , \qquad (7.1) $$

where $\beta_0$ and $\beta_1$ are parameters that must be estimated from the data, and $\epsilon$ represents the error in the measurements. It should be noted that the order of the model refers to the highest power of the predictor variable X. We know from elementary algebra that $\beta_1$ is the slope and $\beta_0$ is the y-intercept. As another example, we represent the linear, second-order model by

$$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon . \qquad (7.2) $$

To get the model, we need to estimate the parameters $\beta_0$ and $\beta_1$. Thus, the estimate of our model given by Equation 7.1 is

$$ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X , \qquad (7.3) $$

where $\hat{Y}$ denotes the predicted value of Y for some value of X, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimated parameters. We do not go into the derivation of the estimators, since it can be found in most introductory statistics textbooks.
Assume that we have a sample of observed predictor variables with corresponding responses. We denote these by $(x_i, y_i)$, $i = 1, \dots, n$. The least squares fit is obtained by finding the values of the parameters that minimize the sum of the squared errors

$$ RSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2 , \qquad (7.4) $$

where RSE denotes the residual squared error.
Estimates of the parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ are easily obtained in MATLAB using the function polyfit, and other methods available in MATLAB will be explored in Chapter 10. We use the function polyfit in Example 7.1 to model the linear relationship between the atmospheric temperature and the amount of steam used per month (see Figure 7.1).

Example 7.1
The function polyfit takes three input arguments: the observed x values, the observed y values, and the degree of the polynomial that we want to fit to the data. The following commands fit a polynomial of degree one to the steam data.
% Loads the vectors x and y.
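load steam
% Fit a first degree polynomial to the data; p1 holds
% the estimated coefficients (slope and intercept).
[p1,s] = polyfit(x,y,1);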
The prediction error is defined as

$$ PE = E\left[ \left( Y - \hat{Y} \right)^2 \right] , \qquad (7.5) $$

where the expectation is with respect to the true population. To estimate the error given by Equation 7.5, we need to test our model (obtained from polyfit) using an independent set of data that we denote by $(x_i', y_i')$. This means that we would take an observed $(x_i', y_i')$ and obtain the estimate of $y_i'$ using our model:

$$ \hat{y}_i' = \hat{\beta}_0 + \hat{\beta}_1 x_i' . \qquad (7.6) $$

We then compare $\hat{y}_i'$ with the true value of $y_i'$. Obtaining the outputs $\hat{y}_i'$ from the model is easily done in MATLAB using the polyval function, as shown in Example 7.2. The estimated prediction error is then the average of the squared errors over the m points in the independent test set:

$$ \widehat{PE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i' - \hat{y}_i' \right)^2 . \qquad (7.7) $$
Example 7.2
We now show how to estimate the prediction error using Equation 7.7. We first choose some points from the steam data set and put them aside to use as an independent test sample. The rest of the observations are then used to obtain the model.
load steam
% Get the set that will be used to
% estimate the line.
indtest = 2:2:20; % Just pick some points.
xtest = x(indtest);
ytest = y(indtest);
% Now get the observations that will be
% used to fit the model.
xtrain = x;
ytrain = y;
xtrain(indtest) = [];
ytrain(indtest) = [];
The next step is to fit a first degree polynomial:
% Fit a first degree polynomial (the model)
% to the data.
[p,s] = polyfit(xtrain,ytrain,1);
We can use the MATLAB function polyval to get the predictions at the x values in the testing set and compare these to the observed y values in the testing set.
% Now get the predictions using the model and the
% testing data that was set aside.
yhat = polyval(p,xtest);
% The residuals are the difference between the true
% and the predicted values.
r = ytest - yhat;
% Estimate the prediction error (Equation 7.7).
pe = mean(r.^2);

This estimate of the prediction error depends on which observations we set aside for the testing set; a different partition would yield a different estimate. A better estimate can be obtained by repeating the above procedure, repeatedly partitioning the data into many training and testing sets. This is the fundamental idea underlying cross-validation.
The most general form of this procedure is called K-fold cross-validation. The basic concept is to split the data into K partitions of approximately equal size. One partition is reserved for testing, and the rest of the data are used for fitting the model. The test set is used to calculate the squared error between each observed response and its prediction, $(y_i - \hat{y}_i)^2$. Note that the prediction $\hat{y}_i$ is from the model obtained using the current training set (one without the i-th observation in it). This procedure is repeated until all K partitions have been used as a test set. Note that we have n squared errors, because each observation will be a member of one testing set. The average of these errors is the estimated expected prediction error.
In most situations, where the size of the data set is relatively small, the analyst can set $K = n$, so the size of the testing set is one. Since this requires fitting the model n times, this can be computationally expensive if n is large. We note, however, that there are efficient ways of doing this [Gentle, 1998; Hjorth, 1994]. We outline the steps for cross-validation below and demonstrate this approach in Example 7.3.
PROCEDURE - CROSS-VALIDATION

1. Partition the data set into K partitions. For simplicity, we assume that $n = r \cdot K$, so there are r observations in each set.
2. Leave out one of the partitions for testing purposes.
3. Use the remaining data points for training (e.g., fit the model, build the classifier, estimate the probability density function).
4. Use the test set with the model and determine the squared error between the observed and predicted response: $(y_i - \hat{y}_i)^2$.
5. Repeat steps 2 through 4 until all K partitions have been used as a test set.
6. Determine the average of the n errors.
Note that the error mentioned in step 4 depends on the application and the goal of the analysis [Hjorth, 1994]. For example, in pattern recognition applications, this might be the cost of misclassifying a case. In the following example, we apply the cross-validation technique to help decide what type of model should be used for the steam data.
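Before turning to the example, here is a compact sketch of the general K-fold procedure applied to the steam data and the linear model. The choice K = 5 and the random partitioning scheme are illustrative assumptions, not from the original text; the steam data have n = 25 points, so each partition holds five.

% Sketch of K-fold cross-validation for the linear model.
load steam
K = 5;
n = length(x);
% Randomly split the indices into K partitions.
indp = reshape(randperm(n), K, n/K);
err = zeros(1,n);
for k = 1:K
   % The k-th partition is the test set.
   test = indp(k,:);
   train = setdiff(1:n, test);
   % Fit the model to the training set.
   p = polyfit(x(train), y(train), 1);
   % Squared errors on the test set.
   err(test) = (y(test) - polyval(p, x(test))).^2;
end
% Average the n errors to estimate the prediction error.
pe = mean(err);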
Example 7.3
In this example, we apply cross-validation to the modeling problem of Example 7.1. We fit linear, quadratic (degree 2) and cubic (degree 3) models to the data and compare their accuracy using the estimates of prediction error obtained from cross-validation.

% Set up the array to store the prediction errors.
n = length(x);
r1 = zeros(1,n);% store error - linear fit
r2 = zeros(1,n);% store error - quadratic fit
r3 = zeros(1,n);% store error - cubic fit
% Loop through all of the data. Remove one point at a
% time as the test point.
for i = 1:n
   % Get the test point and the training set.
   xtest = x(i);
   ytest = y(i);
   xtrain = x;
   ytrain = y;
   xtrain(i) = [];
   ytrain(i) = [];
   % Fit a linear model to the data.
   [p1,s] = polyfit(xtrain,ytrain,1);
   % Fit a quadratic to the data.
   [p2,s] = polyfit(xtrain,ytrain,2);
[p2,s] = polyfit(xtrain,ytrain,2);
% Fit a cubic to the data
[p3,s] = polyfit(xtrain,ytrain,3);
% Get the errors
r1(i) = (ytest - polyval(p1,xtest)).^2;
r2(i) = (ytest - polyval(p2,xtest)).^2;
r3(i) = (ytest - polyval(p3,xtest)).^2;
end
We obtain the estimated prediction error for each of the three models as follows.

% Get the prediction error for each one.
pe1 = mean(r1);
pe2 = mean(r2);
pe3 = mean(r3);
7.3 Jackknife
The jackknife is a data partitioning method like cross-validation, but the goal of the jackknife is more in keeping with that of the bootstrap. The jackknife method is used to estimate the bias and the standard error of statistics.

Let's say that we have a random sample of size n, and we denote our estimator of a parameter $\theta$ as

$$ \hat{\theta} = T = t(x_1, x_2, \dots, x_n) . $$

So, $\hat{\theta}$ might be the mean, the variance, the correlation coefficient or some other statistic of interest. Recall from Chapters 3 and 6 that T is also a random variable, and it has some error associated with it. We would like to get an estimate of the bias and the standard error of the estimate T, so we can assess the accuracy of the results.

When we cannot determine the bias and the standard error using analytical techniques, then methods such as the bootstrap or the jackknife may be used. The jackknife is similar to the bootstrap in that no parametric assumptions are made about the underlying population that generated the data, and the variation in the estimate is investigated by looking at the sample data.
The jackknife method is similar to cross-validation in that we leave out one observation $x_i$ from our sample to form a jackknife sample as follows:

$$ x^{(-i)} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) . $$

This says that the i-th jackknife sample is the original sample with the i-th data point removed. We calculate the value of the estimate using this reduced jackknife sample to obtain the i-th jackknife replicate. This is given by

$$ T^{(-i)} = t(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) . $$

This means that we leave out one point at a time and use the rest of the sample to calculate our statistic. We continue to do this for the entire sample, leaving out one observation at a time, and the end result is a sequence of n jackknife replications of the statistic.
The estimate of the bias of T obtained from the jackknife technique is given by [Efron and Tibshirani, 1993]

$$ \widehat{Bias}_{Jack}(T) = (n-1)\left( \bar{T}^{(\cdot)} - T \right), \qquad (7.9) $$

where

$$ \bar{T}^{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} T^{(-i)} \qquad (7.10) $$

denotes the average of the jackknife replicates. The jackknife estimate of the standard error of T is

$$ \widehat{SE}_{Jack}(T) = \left[ \frac{n-1}{n} \sum_{i=1}^{n} \left( T^{(-i)} - \bar{T}^{(\cdot)} \right)^2 \right]^{1/2} . \qquad (7.11) $$
PROCEDURE - JACKKNIFE
1. Leave out an observation $x_i$.
2. Calculate the value of the statistic using the remaining sample points to obtain $T^{(-i)}$.
3. Repeat steps 1 and 2, leaving out one point at a time, until all n replicates $T^{(-i)}$ are recorded.
4. Calculate the jackknife estimate of the bias of T using Equation 7.9.
5. Calculate the jackknife estimate of the standard error of T using Equation 7.11.
The following two examples show how this is used to obtain jackknife estimates of the bias and standard error for an estimate of the correlation coefficient.

Example 7.4
In this example, we use a data set that has been examined in Efron and Tibshirani [1993]. Note that these data are also discussed in the exercises for Chapter 6. The data contain information for the entering class of 82 law schools in 1973. The average score for the entering class on a national law test (lsat) and the average undergraduate grade point average (gpa) were recorded. A random sample of size $n = 15$ was taken from the population. We would like to use these sample data to estimate the correlation coefficient $\rho$ between the test scores (lsat) and the grade point average (gpa). We start off by finding the statistic of interest.
% Loads up a matrix - law.
load law
% The corrcoef function
% returns a matrix of correlation coefficients. We
% want the one in the off-diagonal position.
tmp = corrcoef(law);
T = tmp(1,2);
We get an estimated correlation coefficient of $\hat{\rho} = 0.78$, and we would like to get an estimate of the bias and the standard error of this statistic. The following MATLAB code implements the jackknife procedure for estimating these quantities.
% Set up memory for jackknife replicates.
n = 15;
reps = zeros(1,n);
% Get the two variables as separate vectors.
lsat = law(:,1);
gpa = law(:,2);
for i = 1:n
   % Store as temporary vector:
   gpat = gpa;
   lsatt = lsat;
   % Leave out the i-th observation.
   gpat(i) = [];
   lsatt(i) = [];
   % Get correlation coefficient:
   % In this example, we want off-diagonal element.
   tmp = corrcoef(gpat,lsatt);
   reps(i) = tmp(1,2);
end
mureps = mean(reps);
% Get the jackknife estimates of the standard
% error (Equation 7.11) and the bias (Equation 7.9).
sehat = sqrt((n-1)/n*sum((reps - mureps).^2));
biashat = (n-1)*(mureps - T);

From this, we get an estimated standard error of $\widehat{SE}_{Jack}(\hat{\rho}) = 0.14$ and an estimated bias of $\widehat{Bias}_{Jack}(\hat{\rho}) = -0.0065$. This data set will be explored further in the exercises.
Example 7.5
We provide a MATLAB function called csjack that implements the jackknife procedure. This will work with any MATLAB function that takes the random sample as the argument and returns a statistic. This function can be one that comes with MATLAB, such as mean or var, or it can be one written by the user. We illustrate its use with a user-written function called corr that returns the single correlation coefficient between two univariate random variables.
function r = corr(data)
% This function returns the single correlation
% coefficient between two variables.
tmp = corrcoef(data);
r = tmp(1,2);
The data used in this example are taken from Hand, et al. [1994]. They were originally from Anscombe [1973], where they were created to illustrate the point that even though an observed value of a statistic is the same for data sets ($\hat{\rho} = 0.82$ for each of the four sets), that does not tell the entire story. He also used them to show the importance of looking at scatterplots, because it is obvious from the plots that the relationships between the variables are not similar. The scatterplots are shown in Figure 7.3.

FIGURE 7.3
This shows the scatterplots of the four data sets discussed in Example 7.5. These data were created to show the importance of looking at scatterplots [Anscombe, 1973]. All data sets have essentially the same correlation coefficient, but the relationship between the variables is very different.
% Here is another example.
% We have 4 data sets with essentially the same
% correlation coefficient.
% The scatterplots look very different.
% When this file is loaded, you get four sets
% of x and y variables.
load anscombe
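The variable names created by loading anscombe are not shown in this excerpt; assuming the four data sets load as column vectors x1, ..., x4 and y1, ..., y4, the claim about the statistic is easy to check with the corr function defined above.

% Each pair should return essentially the same
% correlation coefficient (about 0.82).
rhat = [corr([x1 y1]), corr([x2 y2]), ...
        corr([x3 y3]), corr([x4 y4])]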
The jackknife method is also described in the literature using pseudo-values. The jackknife pseudo-values are given by

$$ \tilde{T}_i = nT - (n-1) T^{(-i)} , \qquad i = 1, \dots, n , \qquad (7.12) $$

where $T^{(-i)}$ is the value of the statistic computed on the sample with the i-th data point removed.

We take the average of the pseudo-values given by

$$ J(T) = \frac{1}{n} \sum_{i=1}^{n} \tilde{T}_i , \qquad (7.13) $$

and use this to get the jackknife estimate of the standard error, as follows:

$$ \widehat{SE}_{Jack}(T) = \left[ \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \tilde{T}_i - J(T) \right)^2 \right]^{1/2} . \qquad (7.14) $$
PROCEDURE - PSEUDO-VALUE JACKKNIFE
1. Leave out an observation $x_i$.
2. Calculate the value of the statistic using the remaining sample points to obtain $T^{(-i)}$.
3. Calculate the pseudo-value $\tilde{T}_i$ using Equation 7.12.
4. Repeat steps 2 and 3 for the remaining data points, yielding n values of $\tilde{T}_i$.
5. Determine the average of the pseudo-values and the jackknife estimate of the standard error using Equations 7.13 and 7.14.

Example 7.6
We return to the law data of Example 7.4 and use the pseudo-values to get the jackknife estimate of the standard error of the correlation coefficient.
% Set up memory for the pseudo-values.
n = 15;
reps = zeros(1,n);
for i = 1:n
   % Store as temporary vectors.
   gpat = gpa;
   lsatt = lsat;
   % Leave the i-th point out.
   gpat(i) = [];
   lsatt(i) = [];
   % In this example, we want the off-diagonal element.
   tmp = corrcoef(gpat,lsatt);
   % Get the jackknife pseudo-value for the i-th point.
   reps(i) = n*T-(n-1)*tmp(1,2);
end
% Get the average of the pseudo-values (Equation 7.13)
% and the standard error estimate (Equation 7.14).
jt = mean(reps);
sehatpv = sqrt(1/(n*(n-1))*sum((reps - jt).^2));
There is a problem with the jackknife procedure when the statistic is not smooth. An example of such a statistic is the median. Here, smoothness refers to statistics where small changes in the data set produce small changes in the value of the statistic. We illustrate this situation in the next example.
Example 7.7
Researchers collected data on the weight gain of rats that were fed four different diets based on the amount of protein (high and low) and the source of the protein (beef and cereal) [Snedecor and Cochran, 1967; Hand, et al., 1994]. We will use the data collected on the rats who were fed a low protein diet of cereal. The sorted data are
x = [58, 67, 74, 74, 80, 89, 95, 97, 98, 107];
The median of this data set is 84.5. To see how the median changes with small changes of x, we increment the fourth observation by one. The change in the median is zero, because it is still at 84.5. In fact, the median does not change until we increment the fourth observation by 7, at which time the median becomes 85. Let's see what happens when we use the jackknife approach to get an estimate of the standard error in the median.
% Set up memory for jackknife replicates.
n = length(x);
reps = zeros(1,n);
for i = 1:n
   % Use a temporary vector with the
   % i-th observation left out.
   xt = x;
   xt(i) = [];
   reps(i) = median(xt);
end
mureps = mean(reps);
% Now get the estimate of standard error using
% Equation 7.11.
sehat = sqrt((n-1)/n*sum((reps - mureps).^2));
This yields an estimate of the standard error of the median of $\widehat{SE}_{Jack}(\hat{q}_{0.5}) = 13.5$, which follows from the fact that the jackknife replicates take on only the two values 80 and 89. For comparison, a bootstrap estimate of the same quantity for these data is $\widehat{SE}_{Boot}(\hat{q}_{0.5}) = 7.1$. In the exercises, the reader is asked to see what happens when the statistic is the mean and should find that the jackknife and bootstrap estimates of the standard error of the mean are similar.

It can be shown [Efron & Tibshirani, 1993] that the jackknife estimate of the standard error of the median does not converge to the true standard error as $n \to \infty$. For the data set of Example 7.7, we had only two distinct values of the median in the jackknife replicates. This gives a poor estimate of the standard error of the median. On the other hand, the bootstrap produces data sets that are not as similar to the original data, so it yields reasonable results. The delete-d jackknife [Efron and Tibshirani, 1993; Shao and Tu, 1995] deletes d observations at a time instead of only one. This method addresses the problem of inconsistency with non-smooth statistics.
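As a sketch of the idea, the delete-d replicates for the median of the rat data can be generated from all subsets of size d. The choice d = 2 and the scaling of the standard error estimate below (taken from the delete-d estimator in Shao and Tu [1995]) are our own illustrative additions.

% Delete-d jackknife sketch for the median, with d = 2.
x = [58, 67, 74, 74, 80, 89, 95, 97, 98, 107];
n = length(x);
d = 2;
% All subsets of d indices to delete.
subsets = nchoosek(1:n, d);
m = size(subsets, 1);
reps = zeros(1, m);
for i = 1:m
   xt = x;
   xt(subsets(i,:)) = [];
   reps(i) = median(xt);
end
% Scaled estimate of the standard error.
sehat = sqrt((n-d)/(d*m)*sum((reps - mean(reps)).^2));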
7.4 Better Bootstrap Confidence Intervals
In Chapter 6, we discussed three types of confidence intervals based on the bootstrap: the bootstrap standard interval, the bootstrap-t interval and the bootstrap percentile interval. Each of them is applicable under more general assumptions and is superior in some sense (e.g., coverage performance, range-preserving, etc.) to the previous one. The bootstrap confidence interval that we present in this section is an improvement on the bootstrap percentile interval. This is called the $BC_a$ interval, which stands for bias-corrected and accelerated.
Recall that the upper and lower endpoints of the $(1-\alpha) \cdot 100\%$ bootstrap percentile confidence interval are given by

$$ \left( \hat{\theta}_B^{*(\alpha/2)} \, ,\; \hat{\theta}_B^{*(1-\alpha/2)} \right) . $$

Say we have $B = 100$ bootstrap replications of our statistic, which we denote as $\hat{\theta}^{*b}$, $b = 1, \dots, 100$. To find the percentile interval, we sort the bootstrap replicates in ascending order. If we want a 90% confidence interval, then one way to obtain $\hat{\theta}_B^{*(0.05)}$ is to use the bootstrap replicate in the 5th position of the ordered list. Similarly, $\hat{\theta}_B^{*(0.95)}$ is the bootstrap replicate in the 95th position. As discussed in Chapter 6, the endpoints could also be obtained using other quantile estimates.
The $BC_a$ interval adjusts the endpoints of the interval based on two parameters, $\hat{a}$ and $\hat{z}_0$. The $(1-\alpha) \cdot 100\%$ confidence interval using the $BC_a$ method is

$$ BC_a \text{ Interval:} \quad \left( \hat{\theta}_B^{*(\alpha_1)} \, ,\; \hat{\theta}_B^{*(\alpha_2)} \right), \qquad (7.16) $$

where

$$ \alpha_1 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha/2)}}{1 - \hat{a}\left( \hat{z}_0 + z^{(\alpha/2)} \right)} \right), \qquad \alpha_2 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z^{(1-\alpha/2)}}{1 - \hat{a}\left( \hat{z}_0 + z^{(1-\alpha/2)} \right)} \right). \qquad (7.17) $$
Trang 27Interval: , (7.16)where
(7.17)
Let’s look a little closer at and given in Equation 7.17 Since denotes the standard normal cumulative distribution function, we know that
and So we see from Equation 7.16 and 7.17 that instead
of basing the endpoints of the interval on the confidence level of , theyare adjusted using information from the distribution of bootstrap replicates
We discuss, shortly, how to obtain the acceleration and the bias ever, before we do, we want to remind the reader of the definition of This denotes the -th quantile of the standard normal distribution It is the
How-value of z that has an area to the left of size As an example, for
We now turn our attention to how we determine the parameters and The bias-correction is given by , and it is based on the proportion of boot-strap replicates that are less than the statistic calculated from the orig-inal sample It is given by
Trang 284 Repeat steps 2 through 3, B times, where
5 Calculate the bias correction (Equation 7.18) and the accelerationfactor (Equation 7.19)
6 Determine the adjustments for the interval endpoints using tion 7.17
Equa-7 The lower endpoint of the confidence interval is the quantile
of the bootstrap replicates, and the upper endpoint of theconfidence interval is the quantile of the bootstrap repli-cates
θˆ
BCa
x = (x1, ,… xn)θˆ
Trang 29Example 7.8
We use an example from Efron and Tibshirani [1993] to illustrate the interval Here we have a set of measurements of 26 neurologically impairedchildren who took a test of spatial perception called test A We are interested
in finding a 90% confidence interval for the variance of a random score on test
A We use the following estimate for the variance
,
where represents one of the test scores This is a biased estimator of thevariance, and when we calculate this statistic from the sample we get a value
of We provide a function called csbootbca that will determine
the interval Because it is somewhat lengthy, we do not include theMATLAB code here, but the reader can view it in Appendix D However,
before we can use the function csbootbca, we have to write an M-file
func-tion that will return the estimate of the second sample central moment usingonly the sample as an input It should be noted that MATLAB Statistics Tool-
box has a function (moment) that will return the sample central moments of any order We do not use this with the csbootbca function, because the function specified as an input argument to csbootbca can only use the sam-
ple as an input Note that the function mom is the same function used in
% First load the data.
load spatial
% Now find the BC-a bootstrap interval.
alpha = 0.10;
B = 2000;
% Use the function we wrote to get the
% 2nd sample central moment - 'mom'.
[blo,bhi,bvals,z0,ahat] =
csbootbca(spatial','mom',B,alpha);
From this function, we get a bias correction of and an accelerationfactor of The endpoints of the interval from csbootbca are
In the exercises, the reader is asked to compare this to the
bootstrap-t interval and the bootstrap percentile interval.