Suppose again that $X_1, X_2, \ldots, X_n$ are independent random variables from a common distribution having mean $\theta$ and variance $\sigma^2$. Although the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ is an effective estimator of $\theta$, we do not really expect that $\bar{X}$ will be equal to $\theta$ but rather that it will be "close." As a result, it is sometimes more valuable to be able to specify an interval for which we have a certain degree of confidence that $\theta$ lies within.
To obtain such an interval we need the (approximate) distribution of the estimator $\bar{X}$. To determine this, first recall, from Equations (8.1) and (8.2), that
$$E[\bar{X}] = \theta, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
and thus, from the central limit theorem, it follows that for large $n$
$$\frac{\sqrt{n}\,(\bar{X} - \theta)}{\sigma} \;\dot\sim\; N(0, 1)$$
where $\dot\sim\, N(0,1)$ means "is approximately distributed as a standard normal." In addition, if we replace the unknown standard deviation $\sigma$ by its estimator $S$, the sample standard deviation, then it still remains the case (by a result known as Slutsky's theorem) that the resulting quantity is approximately a standard normal.
That is, when $n$ is large
$$\frac{\sqrt{n}\,(\bar{X} - \theta)}{S} \;\dot\sim\; N(0, 1) \qquad (8.8)$$
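As an illustration of (8.8), the following short simulation is a minimal sketch, not from the text; the exponential population, the sample size, and the number of replications are arbitrary choices made only for the example. It repeatedly draws a sample, forms $\sqrt{n}(\bar{X} - \theta)/S$, and checks that the resulting values behave like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200          # sample size (arbitrary, but "large")
theta = 1.0      # mean of the exponential(1) population used here
reps = 10_000    # number of simulated samples

z = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=theta, size=n)
    xbar = x.mean()
    s = x.std(ddof=1)                     # sample standard deviation S
    z[r] = np.sqrt(n) * (xbar - theta) / s

# For a standard normal: mean near 0, variance near 1, and about
# 95 percent of the values should fall in (-1.96, 1.96).
print(z.mean(), z.var(ddof=1), np.mean(np.abs(z) < 1.96))
```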
Now for any $\alpha$, $0 < \alpha < 1$, let $z_\alpha$ be such that
$$P\{Z > z_\alpha\} = \alpha$$
[Figure 8.1. Standard normal density, with tail areas $P\{Z < -x\}$ and $P\{Z > x\}$.]
where $Z$ is a standard normal random variable. (For example, $z_{.025} = 1.96$.) It follows from the symmetry of the standard normal density function about the origin that $z_{1-\alpha}$, the point at which the area under the density to its right is equal to $1 - \alpha$, is such that (see Figure 8.1)
$$z_{1-\alpha} = -z_\alpha$$
Therefore (see Figure 8.1)
$$P\{-z_{\alpha/2} < Z < z_{\alpha/2}\} = 1 - \alpha$$
It thus follows from (8.8) that
$$P\left\{-z_{\alpha/2} < \frac{\sqrt{n}\,(\bar{X} - \theta)}{S} < z_{\alpha/2}\right\} \approx 1 - \alpha$$
or, equivalently, upon multiplying by $-1$,
$$P\left\{-z_{\alpha/2} < \frac{\sqrt{n}\,(\theta - \bar{X})}{S} < z_{\alpha/2}\right\} \approx 1 - \alpha$$
which is equivalent to
$$P\left\{\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} < \theta < \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right\} \approx 1 - \alpha \qquad (8.9)$$
In other words, with probability $1 - \alpha$ the population mean $\theta$ will lie within the region $\bar{X} \pm z_{\alpha/2} S/\sqrt{n}$.
Definition If the observed values of the sample mean and the sample standard deviation are $\bar{X} = x$ and $S = s$, call the interval
$$x \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
an (approximate) $100(1 - \alpha)$ percent confidence interval estimate of $\theta$.
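The interval of the definition is simple to compute. The sketch below (Python with NumPy and SciPy; the data and the confidence level are placeholders, not values from the text) obtains $z_{\alpha/2}$ from the standard normal quantile function and forms $x \pm z_{\alpha/2}\, s/\sqrt{n}$.

```python
import numpy as np
from scipy.stats import norm

def confidence_interval(data, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for the mean."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    xbar = data.mean()
    s = data.std(ddof=1)            # sample standard deviation
    z = norm.ppf(1 - alpha / 2)     # z_{alpha/2}, e.g. 1.96 when alpha = 0.05
    half_width = z * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width

# Example with simulated placeholder data:
rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=500)
print(confidence_interval(sample, alpha=0.05))
```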
Remarks
1. To clarify the meaning of a "$100(1 - \alpha)$ percent confidence interval," consider, for example, the case where $\alpha = 0.05$, and so $z_{\alpha/2} = 1.96$. Now before the data are observed, it will be true, with probability (approximately) equal to 0.95, that the sample mean $\bar{X}$ and the sample standard deviation $S$ will be such that $\theta$ will lie between $\bar{X} \pm 1.96\, S/\sqrt{n}$. After $\bar{X}$ and $S$ are observed to equal, respectively, $x$ and $s$, there is no longer any probability concerning whether $\theta$ lies in the interval $x \pm 1.96\, s/\sqrt{n}$, for either it does or it does not. However, we are "95% confident" that in this situation it does lie in this interval (because we know that over the long run such intervals will indeed contain the mean 95 percent of the time).
2. (A technical remark.) The above analysis is based on Equation (8.8), which states that $\sqrt{n}(\bar{X} - \theta)/S$ is approximately a standard normal random variable when $n$ is large. Now if the original data values $X_i$ were themselves normally distributed, then it is known that this quantity has (exactly) a $t$-distribution with $n - 1$ degrees of freedom. For this reason, many authors have proposed using this approximate distribution in the general case, where the original distribution need not be normal. However, since it is not clear that the $t$-distribution with $n - 1$ degrees of freedom results in a better approximation than the normal in the general case, and because these two distributions are approximately equal for large $n$, we have used the normal approximation rather than introducing the $t$-random variable.
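As a quick numerical check of the last point in the remark, one can compare the $t_{n-1}$ quantile with $z_{\alpha/2}$; the sample sizes below are arbitrary illustrations, and the comparison merely shows how close the two quantiles become as $n$ grows.

```python
from scipy.stats import norm, t

alpha = 0.05
z = norm.ppf(1 - alpha / 2)          # z_{alpha/2} = 1.959...
for n in (10, 30, 100, 1000):
    t_quantile = t.ppf(1 - alpha / 2, df=n - 1)
    print(n, round(t_quantile, 4), round(z, 4))
```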
Consider now the case, as in a simulation study, where additional data values can be generated and the question is to determine when to stop generating new data values. One solution is to initially choose values $\alpha$ and $l$ and to continue generating data until the approximate $100(1 - \alpha)$ percent confidence interval estimate of $\theta$ has length less than $l$. Since the length of this interval is $2 z_{\alpha/2} S/\sqrt{n}$, we can accomplish this by the following technique (a code sketch of the procedure is given after the list).
1. Generate at least 100 data values.
2. Continue to generate additional data values, stopping when the number of values you have generated, call it $k$, is such that $2 z_{\alpha/2} S/\sqrt{k} < l$, where $S$ is the sample standard deviation based on those $k$ values. [The value of $S$ should be constantly updated, using the recursions given by (8.6) and (8.7), as new data are generated.]
3. If $x$ and $s$ are the observed values of $\bar{X}$ and $S$, then the $100(1 - \alpha)$ percent confidence interval estimate of $\theta$, whose length is less than $l$, is $x \pm z_{\alpha/2}\, s/\sqrt{k}$.
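The following is a minimal sketch of steps 1 to 3. It updates the running mean and sample variance with a standard one-pass recursion (in the spirit of the recursions (8.6) and (8.7) cited above, though written here in the common Welford form); the data-generating function and the exponential example at the end are placeholders standing in for one run of a simulation.

```python
import numpy as np
from scipy.stats import norm

def run_until_ci_width(generate, alpha=0.05, l=0.1, min_n=100):
    """Generate data until the 100(1 - alpha)% CI for the mean has length < l.

    `generate()` returns one new data value (one simulation run).
    Returns the number of values generated and the confidence interval.
    """
    z = norm.ppf(1 - alpha / 2)
    mean = 0.0
    m2 = 0.0          # running sum of squared deviations; S^2 = m2 / (k - 1)
    k = 0
    while True:
        x = generate()
        k += 1
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
        if k >= min_n:                        # step 1: at least 100 values
            s = np.sqrt(m2 / (k - 1))         # current sample std deviation
            if 2 * z * s / np.sqrt(k) < l:    # step 2: interval length < l
                half = z * s / np.sqrt(k)
                return k, (mean - half, mean + half)   # step 3

# Example: placeholder "simulation output" from an exponential(1) population.
rng = np.random.default_rng(2)
k, ci = run_until_ci_width(lambda: rng.exponential(1.0), alpha=0.05, l=0.1)
print(k, ci)
```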
A Technical Remark The more statistically sophisticated reader might wonder about our use of an approximate confidence interval whose theory was based on the assumption that the sample size was fixed when in the above situation
the sample size is clearly a random variable depending on the data values generated.
This, however, can be justified when the sample size is large, and so from the viewpoint of simulation we can safely ignore this subtlety.
As noted in the previous section, the analysis is modified when $X_1, \ldots, X_n$ are Bernoulli random variables such that
$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$
Since in this case $\mathrm{Var}(X_i)$ can be estimated by $\bar{X}(1 - \bar{X})$, it follows that the equivalent statement to Equation (8.8) is that when $n$ is large
$$\frac{\sqrt{n}\,(\bar{X} - p)}{\sqrt{\bar{X}(1 - \bar{X})}} \;\dot\sim\; N(0, 1) \qquad (8.10)$$
Hence, for any $\alpha$,
$$P\left\{-z_{\alpha/2} < \frac{\sqrt{n}\,(\bar{X} - p)}{\sqrt{\bar{X}(1 - \bar{X})}} < z_{\alpha/2}\right\} \approx 1 - \alpha$$
or, equivalently,
$$P\left\{\bar{X} - z_{\alpha/2}\sqrt{\bar{X}(1 - \bar{X})/n} < p < \bar{X} + z_{\alpha/2}\sqrt{\bar{X}(1 - \bar{X})/n}\right\} \approx 1 - \alpha$$
Hence, if the observed value of $\bar{X}$ is $p_n$, we say that the "$100(1 - \alpha)$ percent confidence interval estimate" of $p$ is
$$p_n \pm z_{\alpha/2}\sqrt{p_n(1 - p_n)/n}$$
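In the Bernoulli case the interval depends only on the observed proportion and the sample size. The sketch below follows the formula above; the counts used in the example are placeholders.

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for p."""
    p_hat = successes / n                      # observed proportion p_n
    z = norm.ppf(1 - alpha / 2)                # z_{alpha/2}
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Example: 430 "successes" in 1000 Bernoulli trials (placeholder numbers).
print(proportion_ci(430, 1000, alpha=0.05))
```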