CHAPTER 15
Hypothesis Testing
Imagine you are a business person considering a major investment in order to launch a new product. The sales prospects of this product are not known with certainty. You have to rely on the outcome of n marketing surveys that measure the demand for the product once it is offered. If µ is the actual (unknown) rate of return on the investment, each of these surveys will be modeled as a random variable which has a Normal distribution with this mean µ and known variance 1. Let y1, y2, . . . , yn be the observed survey results. How would you decide whether to build the plant?
The intuitively reasonable thing to do is to go ahead with the investment if the sample mean of the observations is greater than a given value c, and not to do it otherwise. This is indeed an optimal decision rule, and we will discuss in what respect it is, and how c should be picked.
Your decision can be wrong in two different ways: either you decide to go ahead with the investment although there will be no demand for the product, or you fail to invest although there would have been demand. There is no decision rule which eliminates both errors at once; the first error would be minimized by the rule never to produce, and the second by the rule always to produce. In order to determine the right tradeoff between these errors, it is important to be aware of their asymmetry. The error of going ahead with production although there is no demand has potentially disastrous consequences (loss of a lot of money), while the other error may cause you to miss a profit opportunity, but there is no actual loss involved, and presumably you can find other opportunities to invest your money.
To express this asymmetry, the error with the potentially disastrous consequences is called the "error of type one," and the other the "error of type two." The distinction between type one and type two errors can also be made in other cases. Locking up an innocent person is an error of type one, while letting a criminal go unpunished is an error of type two; publishing a paper with false results is an error of type one, while foregoing an opportunity to publish is an error of type two (at least this is what it ought to be).
Such an asymmetric situation calls for an asymmetric decision rule. One needs strict safeguards against committing an error of type one, and if there are several decision rules which are equally safe with respect to errors of type one, then one will select among those that decision rule which minimizes the error of type two.
Let us look here at decision rules of the form: make the investment if ȳ > c. An error of type one occurs if the decision rule advises you to make the investment while there is no demand for the product. This will be the case if ȳ > c but µ ≤ 0. The probability of this error depends on the unknown parameter µ, but it is at most α = Pr[ȳ > c | µ = 0]. This maximum value of the type one error probability is called the significance level, and you, as the director of the firm, will have to decide on α depending on how tolerable it is to lose money on this venture, which presumably depends on the chances to lose money on alternative investments. It is a serious shortcoming of the classical theory of hypothesis testing that it does not provide good guidelines how α should be chosen, and how it should change with sample size. Instead, there is the tradition to choose α to be either 5%, 1%, or 0.1%. Given α, a table of the cumulative standard normal distribution function allows you to find that c for which Pr[ȳ > c | µ = 0] = α.
Problem 213. 2 points. Assume each yi ∼ N(µ, 1), n = 400 and α = 0.05, and different yi are independent. Compute the value c which satisfies Pr[ȳ > c | µ = 0] = α. You should either look it up in a table and include a xerox copy of the table with the entry circled and the complete bibliographic reference written on the xerox copy, or do it on a computer, writing exactly which commands you used. In R, the function qnorm does what you need; find out about it by typing help(qnorm).
Answer. In the case n = 400, ȳ has variance 1/400 and therefore standard deviation 1/20 = 0.05. Therefore 20ȳ is a standard normal: from Pr[ȳ > c | µ = 0] = 0.05 follows Pr[20ȳ > 20c | µ = 0] = 0.05. Therefore 20c = 1.645 can be looked up in a table, perhaps use [JHG+88, p. 986], the row for ∞ d.f.
Let us do this in R. The p-"quantile" of the distribution of the random variable y is defined as that value q for which Pr[y ≤ q] = p. If y is normally distributed, this quantile is computed by the R function qnorm(p, mean=0, sd=1, lower.tail=TRUE). In the present case we need either qnorm(p=1-0.05, mean=0, sd=0.05) or qnorm(p=0.05, mean=0, sd=0.05, lower.tail=FALSE), which gives the value 0.08224268.
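The text does this computation in R; as a minimal cross-check, the same quantile can be computed with Python's standard library, where statistics.NormalDist plays the role of qnorm (an illustrative sketch, not part of the original problem):

```python
# Compute c with Pr[ybar > c | mu = 0] = alpha for Problem 213,
# using Python's stdlib normal distribution instead of R's qnorm.
from statistics import NormalDist

alpha = 0.05
n = 400
sd_ybar = (1 / n) ** 0.5          # each y_i has variance 1, so ybar has sd 1/20 = 0.05

# c is the (1 - alpha)-quantile of N(0, 0.05)
c = NormalDist(mu=0, sigma=sd_ybar).inv_cdf(1 - alpha)
print(round(c, 8))                # about 0.08224268, matching the qnorm value above
```

The call inv_cdf(1 - alpha) corresponds to qnorm(p=1-0.05, ...) in the R version.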
Choosing a decision which makes a loss unlikely is not enough; your decision must also give you a chance of success. E.g., the decision rule to build the plant if −0.06 ≤ ȳ ≤ −0.05 and not to build it otherwise is completely perverse, although the significance level of this decision rule is approximately 4% (if n = 100). In other words, the significance level is not enough information for evaluating the performance of the test. You also need the "power function," which gives you the probability with which the test advises you to make the "critical" decision, as a function of the true parameter values. (Here the "critical" decision is that decision which might
Figure 1. Eventually this Figure will show the power function of a one-sided normal test, i.e., the probability of error of type one as a function of µ; right now this is simply the cdf of a Standard Normal.
potentially lead to an error of type one.) By the definition of the significance level, the power function does not exceed the significance level for those parameter values for which going ahead would lead to a type one error. But only those tests are "powerful" whose power function is high for those parameter values for which it would be correct to go ahead. In our case, the power function must be below 0.05 when µ ≤ 0, and we want it as high as possible when µ > 0. Figure 1 shows the power function for the decision rule to go ahead whenever ȳ ≥ c, where c is chosen in such a way that the significance level is 5%, for n = 100.
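The power function of this test has a closed form: power(µ) = 1 − Φ((c − µ)/(1/√n)). Here is an illustrative sketch (Python used purely for illustration) evaluating it for the n = 100, 5% setup just described:

```python
# Power function of the one-sided test "go ahead if ybar >= c",
# with n = 100 observations of variance 1 and significance level 5%.
from statistics import NormalDist

n = 100
sd_ybar = (1 / n) ** 0.5                       # = 0.1
c = NormalDist().inv_cdf(0.95) * sd_ybar       # critical value, about 0.1645

def power(mu):
    """Probability that the test advises the critical decision when the true mean is mu."""
    return 1 - NormalDist(mu=mu, sigma=sd_ybar).cdf(c)

print(round(power(0.0), 4))   # 0.05 -- equals the significance level at mu = 0
print(round(power(0.3), 4))   # high power for mu well above 0
print(round(power(-0.5), 6))  # essentially zero for mu well below 0
```

As the text requires, the function stays below 0.05 for µ ≤ 0 and rises toward 1 for µ > 0.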
The hypothesis whose rejection, although it is true, constitutes an error of type one is called the null hypothesis, and its alternative the alternative hypothesis. (In the examples the null hypotheses were: the return on the investment is zero or negative, the defendant is innocent, or the results about which one wants to publish a research paper are wrong.) The null hypothesis is therefore the hypothesis that nothing is the case. The test tests whether this hypothesis should be rejected: it safeguards the hypothesis which one wants to reject but is afraid to reject erroneously. If you reject the null hypothesis, you don't want to regret it.
Mathematically, every test can be identified with its null hypothesis, which is a region in parameter space (often consisting of one point only), and its "critical region," which is the event that the test comes out in favor of the "critical decision," i.e., rejects the null hypothesis. The critical region is usually an event of the form that the value of a certain random variable, the "test statistic," is within a given range, usually that it is too high. The power function of the test is the probability of the critical region as a function of the unknown parameters, and the significance level is the maximum (or, if this maximum depends on unknown parameters, any upper bound) of the power function over the null hypothesis.
Problem 214. Mr Jones is on trial for counterfeiting Picasso paintings, and you are an expert witness who has developed fool-proof statistical significance tests for identifying the painter of a given painting.

• a. 2 points. There are two ways you can set up your test.

a: You can either say: The null hypothesis is that the painting was done by Picasso, and the alternative hypothesis that it was done by Mr Jones.

b: Alternatively, you might say: The null hypothesis is that the painting was done by Mr Jones, and the alternative hypothesis that it was done by Picasso.

Does it matter which way you do the test, and if so, which way is the correct one? Give a reason for your answer, i.e., say what would be the consequences of testing in the incorrect way.
Answer. The determination of what the null and what the alternative hypothesis is depends on what is considered to be the catastrophic error which is to be guarded against. On a trial, Mr Jones is considered innocent until proven guilty; he should not be convicted unless he can be proven guilty beyond "reasonable doubt." Therefore the test must be set up in such a way that the hypothesis that the painting is by Picasso will only be rejected if the chance that it is actually by Picasso is very small. The error of type one is that the painting is considered counterfeited although it is really by Picasso. Since the error of type one is always the error of rejecting the null hypothesis although it is true, solution a is the correct one. You are not proving, you are testing.
• b. 2 points. After the trial a customer calls you who is in the process of acquiring a very expensive alleged Picasso painting, and who wants to be sure that this painting is not one of Jones's falsifications. Would you now set up your test in the same way as in the trial or in the opposite way?
Answer. It is worse to spend money on a counterfeit painting than to forego purchasing a true Picasso. Therefore the null hypothesis would be that the painting was done by Mr Jones, i.e., the test would be set up in the opposite way as in the trial.
Problem 215. 7 points. Someone makes an extended experiment throwing a coin 10,000 times. The relative frequency of heads in these 10,000 throws is a random variable. Given that the probability of getting a head is p, what are the mean and standard deviation of the relative frequency? Design a test, at 1% significance level, of the null hypothesis that the coin is fair, against the alternative hypothesis that p < 0.5. For this you should use the central limit theorem. If the head showed 4,900 times, would you reject the null hypothesis?
Answer. Let xi be the random variable that equals one when the i-th throw is a head, and zero otherwise. The expected value of x is p, the probability of throwing a head. Since x² = x, var[x] = E[x] − (E[x])² = p(1 − p). The relative frequency of heads is simply the average of all xi, call it x̄. It has mean p and variance σ²_x̄ = p(1 − p)/10,000. Given that it is a fair coin, its mean is 0.5 and its standard deviation is 0.005. Reject if the actual frequency < 0.5 − 2.326σ_x̄ = 0.48837; since the observed frequency 0.49 lies above this cutoff, do not reject the null hypothesis. Another approach:

(15.0.33) Pr[x̄ ≤ 0.49] = Pr[(x̄ − 0.5)/0.005 ≤ −2] = 0.0227,

since the fraction is, by the central limit theorem, approximately a standard normal random variable. Since 0.0227 > 0.01, again the null hypothesis is not rejected.
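The arithmetic in this answer can be checked with a short sketch (Python used here purely for illustration):

```python
# Check of Problem 215: n = 10,000 coin throws, H0: p = 0.5 against p < 0.5,
# at the 1% significance level, with 4,900 heads observed.
from statistics import NormalDist

n = 10_000
p0 = 0.5
sd_xbar = (p0 * (1 - p0) / n) ** 0.5          # 0.005 under the null

# One-sided test: reject if the relative frequency falls below the cutoff.
z_crit = NormalDist().inv_cdf(0.01)           # about -2.326
cutoff = p0 + z_crit * sd_xbar                # about 0.48837

observed = 4_900 / n                          # relative frequency 0.49
z = (observed - p0) / sd_xbar                 # = -2
p_value = NormalDist().cdf(z)                 # about 0.0228

print(round(cutoff, 5), observed > cutoff, round(p_value, 4))
```

Since 0.49 exceeds the cutoff 0.48837 (equivalently, the p-value 0.0228 exceeds 0.01), the null hypothesis is not rejected.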
15.1 Duality between Significance Tests and Confidence Regions

There is a duality between confidence regions with confidence level 1 − α and certain significance tests. Let us look at a family of significance tests which all have a significance level ≤ α, and which define for every possible value of the parameter φ0 ∈ Ω a critical region C(φ0) for rejecting the simple null hypothesis that the true parameter is equal to φ0. The condition that all significance levels are ≤ α means mathematically

(15.1.1) Pr[C(φ0) | φ = φ0] ≤ α for all φ0 ∈ Ω.
Mathematically, confidence regions and such families of tests are one and the same thing: if one has a confidence region R(y), one can define a test of the null hypothesis φ = φ0 as follows: for an observed outcome y, reject the null hypothesis if and only if φ0 is not contained in R(y). On the other hand, given a family of tests, one can build a confidence region by the prescription: R(y) is the set of all those parameter values which would not be rejected by a test based on observation y.
Problem 216. Show that with these definitions, equations (14.0.5) and (15.1.1) are equivalent.
Answer. Since φ0 ∈ R(y) iff y ∈ C′(φ0) (the complement of the critical region rejecting that the parameter value is φ0), it follows Pr[φ0 ∈ R(y) | φ = φ0] = 1 − Pr[C(φ0) | φ = φ0] ≥ 1 − α.
This duality is discussed in [BD77, pp. 177–182].
15.2 The Neyman Pearson Lemma and Likelihood Ratio Tests

Look one more time at the investment example. Why are we considering only regions of the form ȳ ≥ µ0, why not one of the form µ1 ≤ ȳ ≤ µ2, or maybe not use the mean but decide to build if y1 ≥ µ3? Here the µ1, µ2, and µ3 can be chosen such that the probability of committing an error of type one is still α.
It seems intuitively clear that these alternative decision rules are not reasonable. The Neyman Pearson lemma proves this intuition right. It says that the critical regions of the form ȳ ≥ µ0 are uniformly most powerful, in the sense that every other critical region with the same probability of type one error has equal or higher probability of committing an error of type two, regardless of the true value of µ.
Here are formulation and proof of the Neyman Pearson lemma, first for the case that both null hypothesis and alternative hypothesis are simple: H0 : θ = θ0, HA : θ = θ1. In other words, we want to determine on the basis of the observations of the random variables y1, . . . , yn whether the true θ was θ0 or θ1, and a determination θ = θ1 when in fact θ = θ0 is an error of type one. The critical region C is the set of all outcomes that lead us to conclude that the parameter has value θ1.
The Neyman Pearson lemma says that a uniformly most powerful test exists in this situation. It is a so-called likelihood-ratio test, which has the following critical region:

(15.2.1) C = {y1, . . . , yn : L(y1, . . . , yn; θ1) ≥ kL(y1, . . . , yn; θ0)}.

C consists of those outcomes for which θ1 is at least k times as likely as θ0 (where k is chosen such that Pr[C|θ0] = α).
To prove that this decision rule is uniformly most powerful, assume D is the critical region of a different test with the same significance level α, i.e., if the null hypothesis is correct, then C and D reject (and therefore commit an error of type one) with equally low probabilities α. In formulas, Pr[C|θ0] = Pr[D|θ0] = α. Look at figure 2 with C = U ∪ V and D = V ∪ W. Since C and D have the same significance level,
(15.2.2) Pr[U|θ0] = Pr[W|θ0].

Furthermore,

(15.2.3) Pr[U|θ1] ≥ k Pr[U|θ0],

since U ⊂ C and C was chosen such that the likelihood (density) function of the alternative hypothesis is high relative to that of the null hypothesis. Since W lies outside C, the same argument gives

(15.2.4) Pr[W|θ1] ≤ k Pr[W|θ0].
Linking those two inequalities and the equality gives
(15.2.5) Pr[W|θ1] ≤ k Pr[W|θ0] = k Pr[U|θ0] ≤ Pr[U|θ1],
hence Pr[D|θ1] ≤ Pr[C|θ1]. In other words, if θ1 is the correct parameter value, then C will discover this and reject at least as often as D. Therefore C is at least as powerful as D, or the type two error probability of C is at least as small as that of D.
Back to our investment example. To make both null and alternative hypotheses simple, assume that either µ = 0 (the investment does not pay off) or µ = t for some fixed
Figure 2. Venn Diagram for Proof of Neyman Pearson Lemma.
t > 0. Then the likelihood ratio critical region has the form

(15.2.6) C = {y1, . . . , yn : (2π)^{−n/2} exp(−(1/2)((y1 − t)² + · · · + (yn − t)²)) ≥ k (2π)^{−n/2} exp(−(1/2)(y1² + · · · + yn²))},
i.e., C has the form ȳ ≥ some constant: taking logarithms, the condition becomes t(y1 + · · · + yn) − nt²/2 ≥ ln k, or ȳ ≥ (ln k)/(nt) + t/2. The dependence of this constant on k is not relevant, since this constant is usually chosen such that the maximum probability of error of type one is equal to the given significance level.
Problem 217. 8 points. You have four independent observations y1, . . . , y4 from an N(µ, 1), and you are testing the null hypothesis µ = 0 against the alternative hypothesis µ = 1. For your test you are using the likelihood ratio test with critical region

(15.2.10) C = {y1, . . . , y4 : L(y1, . . . , y4; µ = 1) ≥ 3.633 · L(y1, . . . , y4; µ = 0)}.

Compute the significance level of this test. (According to the Neyman-Pearson lemma, this is the uniformly most powerful test for this significance level.) Hints: In order to show this you need to know that ln 3.633 = 1.29; everything else can be done without a calculator. Along the way you may want to show that C can also be written in the form C = {y1, . . . , y4 : y1 + · · · + y4 ≥ 3.290}.
Answer. Here is the equation which determines when y1, . . . , y4 lie in C:

(2π)^{−2} exp(−(1/2)((y1 − 1)² + · · · + (y4 − 1)²)) ≥ 3.633 · (2π)^{−2} exp(−(1/2)(y1² + · · · + y4²)).

Taking logarithms,

−(1/2)((y1 − 1)² + · · · + (y4 − 1)²) ≥ ln(3.633) − (1/2)(y1² + · · · + y4²).

Expanding the squares and canceling the quadratic terms gives y1 + · · · + y4 − 2 ≥ 1.29, i.e., y1 + · · · + y4 ≥ 3.290. Since Pr[y1 + · · · + y4 ≥ 3.290] = Pr[z = (y1 + · · · + y4)/2 ≥ 1.645] and z is a standard normal, one obtains the significance level of 5% from the standard normal table or the t-table. Note that due to the properties of the Normal distribution, this critical region, for a given significance level, does not depend at all on the value of t. Therefore this test is uniformly most powerful against the composite hypothesis µ > 0.
One can also write the null hypothesis as the composite hypothesis µ ≤ 0, because the highest probability of type one error will still be attained when µ = 0. This completes the proof that the test given in the original investment example is uniformly most powerful.
Most other distributions discussed here are equally well behaved; therefore uniformly most powerful one-sided tests exist not only for the mean of a normal with known variance, but also for the variance of a normal with known mean, or the parameters of a Bernoulli and Poisson distribution.

However, the given one-sided hypothesis is the only situation in which a uniformly most powerful test exists. In other situations, the generalized likelihood ratio test has good properties even though it is no longer uniformly most powerful. Many known tests (e.g., the F test) are generalized likelihood ratio tests.
Assume you want to test the composite null hypothesis H0 : θ ∈ ω, where ω is a subset of the parameter space, against the alternative HA : θ ∈ Ω, where Ω ⊃ ω is a more comprehensive subset of the parameter space. ω and Ω are defined by functions with continuous first-order derivatives. The generalized likelihood ratio critical region has the form
(15.2.14) C = {x1, . . . , xn : sup_{θ∈Ω} L(x1, . . . , xn; θ) / sup_{θ∈ω} L(x1, . . . , xn; θ) ≥ k},
where k is chosen such that the probability of the critical region when the null hypothesis is true has as its maximum the desired significance level. It can be shown that twice the log of this quotient is asymptotically distributed as a χ²_{q−s}, where q is the dimension of Ω and s the dimension of ω. (Sometimes the likelihood ratio is defined as the inverse of this ratio, but whenever possible we will define our test statistics so that the null hypothesis is rejected if the value of the test statistic is too large.)
In order to perform a likelihood ratio test, the following steps are necessary: First construct the MLE's for θ ∈ Ω and θ ∈ ω, then take twice the difference of the attained levels of the log likelihood functions, and compare with the χ² tables.
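These steps can be sketched for a toy case (an assumed setup for illustration): yi ∼ N(µ, 1), null ω = {µ = 0}, alternative Ω = all of R. Here the statistic 2 log(sup_Ω L / sup_ω L) simplifies algebraically to n·ȳ², which is asymptotically χ² with q − s = 1 − 0 = 1 degree of freedom; the data values below are hypothetical:

```python
# Generalized likelihood ratio recipe for y_i ~ N(mu, 1), testing mu = 0:
# (1) MLE over Omega is ybar, over omega the only point is 0;
# (2) twice the difference of attained log likelihoods is the statistic.
y = [0.2, -0.1, 0.3, 0.6]                      # hypothetical observations
n = len(y)
ybar = sum(y) / n

def loglik(mu):
    # log of the N(mu, 1) likelihood, up to the same constant in both sup's
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

lr_stat = 2 * (loglik(ybar) - loglik(0.0))
print(round(lr_stat, 6), round(n * ybar ** 2, 6))   # the two expressions agree
```

The agreement of the two printed numbers is the algebraic simplification mentioned above; the statistic would then be compared with a χ²_1 table.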
15.3 The Runs Test

[Spr98, pp. 171–175] is a good introductory treatment, similar to the one given here. More detail in [GC92, Chapter 3] (not in University of Utah Main Library) and even more in [Bra68, Chapters 11 and 23] (which is in the Library).
Each of your three research assistants has to repeat a certain experiment 9 times, and record whether each experiment was a success (1) or a failure (0). In all cases, the experiments happen to have been successful 4 times. Assistant A has the following sequence of successes and failures: 0, 1, 0, 0, 1, 0, 1, 1, 0, B has 0, 1, 0, 1, 0, 1, 0, 1, 0, and C has 1, 1, 1, 1, 0, 0, 0, 0, 0.
On the basis of these results, you suspect that the experimental setup used by B and C is faulty: for C, it seems that something changed over time so that the first experiments were successful and the later experiments were not. Or perhaps the fact that a given experiment was a success (failure) made it more likely that also the next experiment would be a success (failure). For B, the opposite effect seems to have taken place.
From the pattern of successes and failures you made inferences about whether the outcomes were independent or followed some regularity. A mathematical formalization of this inference counts "runs" in each sequence of outcomes. A run is a succession of several ones or zeros. The first sequence had 7 runs, the second 9, and the third only 2. Given that the number of successes is 4 and the number of failures is 5, 9 runs seem too many and 2 runs too few.
The "runs test" (sometimes also called "run test") exploits this in the following way: it counts the number of runs, and then asks if this is a reasonable number of runs to expect given the total number of successes and failures. It rejects whenever the number of runs is either too large or too low.
The choice of the number of runs as test statistic cannot be derived from a likelihood ratio principle, since we did not specify the joint distribution of the outcome of the experiment. But the above argument says that it will probably detect at least some of the cases we are interested in.
In order to compute the error of type one, we will first derive the probability distribution of the number of runs conditionally on the outcome that the number of successes is 4. This conditional distribution can be computed, even if we do not know the probability of success of each experiment, as long as their joint distribution has the following property (which holds under the null hypothesis of statistical independence): the probability of a given sequence of failures and successes only depends on the number of failures and successes, not on the order in which they occur. Then the conditional distribution of the number of runs can be obtained by simple counting.
How many arrangements of 5 zeros and 4 ones are there? The answer is C(9, 4) = (9 · 8 · 7 · 6)/(1 · 2 · 3 · 4) = 126. How many of these arrangements have 9 runs? Only one, i.e., the probability of having 9 runs (conditionally on observing 4 successes) is 1/126. The probability of having 2 runs is 2/126, since one can either have the zeros first, or the ones first.
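The counting argument can be verified by brute force (an illustrative sketch): enumerate all arrangements of 4 ones and 5 zeros and tally the number of runs in each.

```python
# Enumerate every arrangement of 4 ones among 9 positions (the rest zeros)
# and count runs in each, reproducing the 1/126 and 2/126 probabilities.
from itertools import combinations
from collections import Counter

def count_runs(seq):
    """A run ends wherever two adjacent symbols differ."""
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

tally = Counter()
for ones in combinations(range(9), 4):          # choose positions of the ones
    seq = [1 if i in ones else 0 for i in range(9)]
    tally[count_runs(seq)] += 1

total = sum(tally.values())
print(total)                                     # 126 arrangements in all
print(tally[9], tally[2], tally[7])              # 1, 2, 18 of them respectively
```

The tallies 1/126 for 9 runs and 2/126 for 2 runs match the text, and the 18 arrangements with 7 runs anticipate the computation that follows.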
In order to compute the probability of 7 runs, let's first ask: what is the probability of having 4 runs of ones and 3 runs of zeros? Since there are only 4 ones, each run of ones must have exactly one element. So the distribution of ones and zeros must be:

1 − one or more zeros − 1 − one or more zeros − 1 − one or more zeros − 1

In order to specify the distribution of ones and zeros completely, we must therefore count how many ways there are to split the sequence of 5 zeros into 3 nonempty batches. Here are the possibilities:

0|0|000, 0|00|00, 0|000|0, 00|0|00, 00|00|0, 000|0|0
Generally, the number of possibilities is C(4, 2) = 6, because there are 4 spaces between those 5 zeros, and we have to put in two dividers.
We have therefore 6 possibilities to make 4 runs of ones and 3 runs of zeros. Now how many possibilities are there to make 4 runs of zeros and 3 runs of ones? There are C(4, 3) = 4 ways to split the 5 zeros into 4 batches, and there are C(3, 2) = 3 ways to split the 4 ones into 3 batches, giving 4 · 3 = 12 possibilities. The probability of 7 runs is therefore (6 + 12)/126 = 18/126.
In general, if there are m zeros and n ones, the conditional probabilities of observing r runs are, writing C(a, b) for the binomial coefficient,

(15.3.3) Pr[r = 2s + 1] = [C(m−1, s) C(n−1, s−1) + C(m−1, s−1) C(n−1, s)] / C(m+n, m)

(15.3.4) Pr[r = 2s] = 2 C(m−1, s−1) C(n−1, s−1) / C(m+n, m)
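These run-count probabilities, labeled (15.3.3) and (15.3.4), can be implemented directly and checked against the 5-zeros, 4-ones example (an illustrative sketch):

```python
# Conditional probability of exactly r runs given m zeros and n ones,
# following the odd-r and even-r formulas (15.3.3)-(15.3.4).
from math import comb

def prob_runs(r, m, n):
    """Conditional probability of r runs given m zeros and n ones."""
    denom = comb(m + n, m)
    s = r // 2
    if r % 2 == 0:
        return 2 * comb(m - 1, s - 1) * comb(n - 1, s - 1) / denom
    return (comb(m - 1, s) * comb(n - 1, s - 1)
            + comb(m - 1, s - 1) * comb(n - 1, s)) / denom

print(prob_runs(9, 5, 4) * 126)   # 1.0  -- one arrangement with 9 runs
print(prob_runs(2, 5, 4) * 126)   # 2.0  -- two arrangements with 2 runs
print(prob_runs(7, 5, 4) * 126)   # 18.0 -- eighteen arrangements with 7 runs
```

As a sanity check, the probabilities over all feasible run counts r = 2, . . . , 9 sum to one.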
Some computer programs (StatXact, www.cytel.com) compute these probabilities exactly or by Monte Carlo simulation; but there is also an asymptotic test based on the facts that
(15.3.5) E[r] = 1 + 2mn/(m + n), var[r] = 2mn(2mn − m − n) / ((m + n)²(m + n − 1)),

and that the standardized number of runs is asymptotically normally distributed (see [GC92, section 3.2]).
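The asymptotic version based on (15.3.5) is easy to sketch; here it is applied to Assistant B's record (9 runs from 5 failures and 4 successes), with the caveat that the normal approximation is rough at such a small sample size:

```python
# Asymptotic runs test: standardize the observed run count using the
# mean and variance in (15.3.5), then compare with a standard normal.
from statistics import NormalDist

def runs_test_z(r_observed, m, n):
    mean = 1 + 2 * m * n / (m + n)
    var = (2 * m * n * (2 * m * n - m - n)
           / ((m + n) ** 2 * (m + n - 1)))
    return (r_observed - mean) / var ** 0.5

# Assistant B: 9 runs with m = 5 zeros and n = 4 ones -- far in the upper tail.
z = runs_test_z(9, 5, 4)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 3), round(p_two_sided, 3))
```

The standardized count comes out near 2.57, so even the crude normal approximation flags B's perfectly alternating sequence as suspicious; the exact conditional probability 1/126 computed above is of the same order.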
We would therefore reject when the observed number of runs is in the tails of this distribution. Since the exact test statistic is discrete, we cannot make tests for every arbitrary significance level. In the given example, if the critical region is {r = 9}, then the significance level is 1/126. If the critical region is {r = 2 or 9}, the significance level is 3/126.
We said before that we could not make precise statements about the power of the test, i.e., the error of type two. But we will show that it is possible to make precise statements about the error of type one.
Right now we only have the conditional probability of errors of type one, given that there are exactly 4 successes in our 9 trials. And we have no information about the probability of having indeed four successes; it might be 1 in a million. However, in certain situations, the conditional significance level is exactly what is needed. And even if the unconditional significance level is needed, there is one way out. If we were to specify a decision rule for every number of successes in such a way that the conditional probability of rejecting is the same in all of them, then this conditional
Figure 3. Distribution of runs in 7 trials, if there are 4 successes and 3 failures.
probability is also equal to the unconditional probability. The only problem here is that, due to discreteness, we can make the probability of type one errors only approximately equal; but with increasing sample size this problem disappears.
Problem 218. Write approximately 200 x's and o's on a piece of paper, trying to do it in a random manner. Then make a runs test whether these x's and o's were indeed random. Would you want to run a two-sided or one-sided test?
The law of rare events literature can be considered a generalization of the runs test. For epidemiology compare [Cha96], [DH94], [Gri79], and [JL97].
15.4 Pearson's Goodness of Fit Test

Given an experiment with r outcomes, which have probabilities p1, . . . , pr, where Σpi = 1. You make n independent trials, and the ith outcome occurred xi times. The x1, . . . , xr have the multinomial distribution with parameters n and p1, . . . , pr. Their mean and covariance matrix are given in equation (8.4.2) above. How do you test H0 : p1 = p1^0, . . . , pr = pr^0?
Pearson's Goodness of Fit test uses as test statistic a weighted sum of the squared deviations of the observed values from their expected values:

Σ_{i=1}^{r} (xi − n pi^0)² / (n pi^0).

This test statistic is often called the Chi-Square statistic. It is asymptotically distributed as a χ²_{r−1}; reject the null hypothesis when the observed value of this statistic is too big; the critical region can be read off a table of the χ²_{r−1}.
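Here is a minimal sketch of the statistic for a hypothetical die experiment (r = 6 outcomes, null probabilities pi^0 = 1/6, and invented counts from n = 60 rolls):

```python
# Pearson's goodness-of-fit statistic for hypothetical die-roll counts:
# weighted sum of squared deviations of observed from expected counts.
observed = [10, 8, 12, 11, 9, 10]     # invented counts for faces 1..6
n = sum(observed)                     # 60 trials
expected = [n / 6] * 6                # 10 of each face under the null

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                # r - 1 = 5 degrees of freedom
print(chi_sq, df)
```

The statistic comes out to 1.0 on 5 degrees of freedom, far below the usual χ²_5 cutoffs (the 5% critical value is about 11.07), so these counts would give no reason to reject fairness.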
Why does one get a χ² distribution in the limiting case? Because the xi themselves are asymptotically normal, and certain quadratic forms of normal distributions are χ². The matter is made a little complicated by the fact that the xi are linearly dependent, since Σxj = n, and therefore their covariance matrix is singular. There are two ways to deal with such a situation. One is to drop one observation; one will not lose any information by this, and the remaining r − 1 observations are well behaved. (This explains, by the way, why one has a χ²_{r−1} instead of a χ²_r.)
We will take an alternative route, namely, use theorems which are valid even if the covariance matrix is singular. This is preferable because it leads to more unified theories. In equation (10.4.9), we characterized all the quadratic forms of multivariate normal variables that are χ²'s. Here it is again: Assume y is a jointly normal vector random variable with mean vector µ and covariance matrix σ²Ψ, and Ω is a symmetric nonnegative definite matrix. Then (y − µ)ᵀΩ(y − µ) ∼ σ²χ²_k iff ΨΩΨΩΨ = ΨΩΨ, and k is the rank of Ω. If Ψ is singular, i.e., does not have an inverse, and Ω is a g-inverse of Ψ, then condition (10.4.9) holds. A matrix Ω is a g-inverse of Ψ iff ΨΩΨ = Ψ. Every matrix has at least one g-inverse, but may have more than one.
Now back to our multinomial distribution. By the central limit theorem, the xi are asymptotically jointly normal; their mean and covariance matrix are given by equation (8.4.2). This covariance matrix is singular (has rank r − 1), and a g-inverse