Conflict between P-values and lower bounds to Bayes factors and posterior probabilities: Case of a sharp null

Part of Handbook of Statistics Vol 25 Supp 1 (pp. 168–173)

Consider the problem of testing

H0: θ = θ0 versus H1: θ ≠ θ0

on the basis of a random observable X having a density f(x|θ), θ ∈ Θ. Let T(X) be a relevant statistic, large values of which indicate deviation from H0. The P-value for observed data x, denoted p, is given by

(4) p = P_θ0(T(X) ⩾ T(x)).

In the following we denote the prior π(θ|H1) by g(θ), as in Berger and Delampady (1987) and Berger and Sellke (1987). For a continuous prior density g(θ) on {θ: θ ≠ θ0}, the Bayes factor of H0 to H1 is

(5) BF01 = f(x|θ0) / ∫ f(x|θ) g(θ) dθ,

and, for a specified prior probability π0 of H0, the posterior probability of H0 is given by (3) above.
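Equation (3), referred to above, combines BF01 with the prior probability π0 through the standard identity P(H0|x) = [1 + ((1 − π0)/π0) · BF01⁻¹]⁻¹ (cf. (9) below). A one-line sketch of ours:

```python
def posterior_prob_H0(bf01, pi0=0.5):
    # P(H0|x) = [1 + ((1 - pi0)/pi0) * (1/BF01)]^{-1}, the form of eq. (3)
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / bf01)
```

For example, an even prior (π0 = 1/2) together with BF01 = 1 gives P(H0|x) = 1/2, as expected.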

By way of illustration we consider the problem of testing a normal mean. Suppose that X = (X1, . . . , Xn) is observed, where the Xi's are i.i.d. N(θ, σ²), σ² being known.

The usual test statistic is T(X) = √n |X̄ − θ0|/σ, where X̄ is the sample mean. For observed data x, with T(x) = t, the P-value is given by

(6) p = 2[1 − Φ(t)],

where Φ is the cdf of the standard normal distribution.

Assume now that the prior p.d.f. g for θ, conditional on H1, is given by the N(μ, τ²) density. It follows that the Bayes factor is given by

(7) BF01 = √(1 + ρ⁻²) exp{−(1/2)[(t − ρη)²/(1 + ρ²) − η²]},

where ρ := σ/(√n τ) and η := (θ0 − μ)/τ.

We consider the special case with μ = θ0 (ensuring symmetry of g about θ0), τ = σ and π0 = 1/2. The corresponding prior is close to the one recommended by Jeffreys, who suggested taking a Cauchy form for the prior under H1. Berger and Delampady (1987) and Berger and Sellke (1987) have calculated (6), (7) and P(H0|x) given by (3) for the above choice of (μ, τ, π0) and for several choices of the observed value t of T and sample size n. This enables one to see the conflict between the P-value and the Bayesian measures of evidence without much difficulty. In particular, it is demonstrated that P(H0|x) is 5 to 50 times larger than the P-value. For example, if n = 50 and t = 1.960, then p = 0.05, BF01 = 1.08 and P(H0|x) = 0.52. Thus, in a situation like this, a classical statistician will reject H0 at significance level 0.05, whereas a Bayesian will conclude that H0 has probability 0.52 of being true. In fact, following the recommendation of Jeffreys (1961, p. 432), a Bayesian will conclude that the null hypothesis is supported by the data but the evidence is not strong (see also Kass and Raftery, 1995). This simple calculation in an important problem reveals a practical disagreement between classical and Bayesian statistics, with special reference to testing a point null hypothesis – a conflict pointed out independently by Lindley (1957), Jeffreys (1961), and Edwards et al. (1963). The latter authors initiated calculations of this kind, with further substantial refinements, extensions and clarifications made by Berger and Sellke (1987), Berger and Delampady (1987), and Dickey (1977). In particular, it appears that the conflict is not so much between the Bayesian and frequentist paradigms as between Bayesian measures of evidence and a popular classical measure of evidence. This point is further argued in Sections 3 and 5.
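The figures in this example can be checked directly from (6), (7) and (3). The following sketch is ours (function names assumed, not the authors'), with μ = θ0 and τ = σ so that η = 0 and ρ² = 1/n:

```python
from math import erf, exp, sqrt

def Phi(z):
    # Standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(t):
    # Eq. (6): two-sided P-value for the normal test
    return 2.0 * (1.0 - Phi(t))

def bf01(t, n):
    # Eq. (7) with mu = theta0 (so eta = 0) and tau = sigma (so rho^2 = 1/n)
    rho2 = 1.0 / n
    return sqrt(1.0 + 1.0 / rho2) * exp(-0.5 * t * t / (1.0 + rho2))

def posterior(bf, pi0=0.5):
    # Eq. (3): posterior probability of H0 from the Bayes factor
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / bf)
```

For n = 50 and t = 1.96 this reproduces p ≈ 0.05, BF01 ≈ 1.09 and P(H0|x) ≈ 0.52, matching the quoted figures up to rounding.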

Edwards et al. (1963), Berger and Delampady (1987) and Berger and Sellke (1987) investigate to what extent the above conflict between the P-value and the Bayesian measures BF01 and P(H0|x) depends on the choice of the prior g. Below we present results from Berger and Sellke (1987), who calculate

(8) BF01(G) := inf_{g∈G} BF01

and

(9) P(H0|x, G) := inf_{g∈G} P(H0|x) = [1 + ((1 − π0)/π0) × 1/BF01(G)]⁻¹

for suitable classes G of prior densities.

Interesting choices for G include

(10)
GA = {all densities},
GS = {densities symmetric about θ0},
GUS = {unimodal, symmetric densities with mode at θ0},
GNOR = {normal (conjugate) densities with mean θ0}.

However, G should not be too large or too small. One should allow all reasonable g in G, but too large a G may lead to a "uselessly small" bound. The choice G = GA seems to be highly biased toward H1. The class GUS is recommended as the most sensible choice for G. In what follows we record some of the relevant findings of Berger and Sellke (1987).

• Let G = GA. In this case, BF01(GA) is equal to the likelihood ratio test statistic λ. Moreover,

(11) t > 1.68, π0 = 1/2 ⇒ P(H0|x, GA)/(pt) > √(π/2) ≈ 1.253,

and

(12) lim_{t→∞} P(H0|x, GA)/(pt) = √(π/2).

Thus, for large t, P(H0|x) is at least 1.25pt, for any prior. In particular, when π0 = 1/2 and t = 1.96,

(13) p = 0.05, P(H0|x, GA) = 0.128.

• Suppose now that G = GS. Then

(14) t > 2.28, π0 = 1/2 ⇒ P(H0|x, GS)/(pt) > √(2π) ≈ 2.507,

and

(15) lim_{t→∞} P(H0|x, GS)/(pt) = √(2π).

In particular, when π0 = 1/2 and t = 1.96,

(16) p = 0.05, P(H0|x, GS) = 0.227.

• Consider now the case G = GUS. Here

(17) t > 0, π0 = 1/2 ⇒ P(H0|x, GUS)/(pt²) > 1

and

(18) lim_{t→∞} P(H0|x, GUS)/(pt²) = 1.

In particular, when π0 = 1/2 and t = 1.96,

(19) p = 0.05, P(H0|x, GUS) = 0.290.

• Consider finally the case G = GNOR (Edwards et al., 1963). When π0 = 1/2 and t = 1.96,

(20) p = 0.05, P(H0|x, GNOR) = 0.321.

The facts stated in (11)–(20) demonstrate convincingly the conflict between the P-value and its Bayesian counterparts in the context of testing a point null hypothesis.
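The bounds in (13), (16), (19) and (20) can be reproduced numerically. The sketch below is ours, stated in units of σ/√n: for GA the infimum puts all prior mass at the MLE; for GNOR the minimizing τ satisfies 1 + nτ²/σ² = t² (valid for t > 1); and for GS and GUS we search over symmetric two-point and symmetric uniform priors, the extreme points of those classes, over which the infima are attained.

```python
from math import erf, exp, sqrt, pi

def Phi(z):
    # Standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    # Standard normal pdf
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def post(bf, pi0=0.5):
    # Eq. (9): posterior lower bound from a Bayes-factor lower bound
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / bf)

t = 1.96
grid = [0.01 * i for i in range(1, 2001)]  # crude grid for the searches

# G_A: all priors -- the infimum concentrates the prior at the MLE
bf_A = exp(-0.5 * t * t)

# G_S: symmetric two-point priors at theta0 +/- k (standardized units)
bf_S = min(2.0 * phi(t) / (phi(t - k) + phi(t + k)) for k in grid)

# G_US: symmetric uniform priors on (theta0 - K, theta0 + K)
bf_US = min(2.0 * K * phi(t) / (Phi(K - t) - Phi(-K - t)) for K in grid)

# G_NOR: N(theta0, tau^2) priors; the minimizer has 1 + n*tau^2/sigma^2 = t^2
bf_NOR = t * exp(0.5 * (1.0 - t * t))

bounds = {g: round(post(bf), 3)
          for g, bf in [("G_A", bf_A), ("G_S", bf_S),
                        ("G_US", bf_US), ("G_NOR", bf_NOR)]}
print(bounds)  # close to 0.128, 0.227, 0.290 and 0.321 respectively
```

The crude grid suffices here because the minimized Bayes factors are flat near their optima.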

Berger and Delampady (1987) consider the problem of testing a k-variate normal mean, where the covariance is a multiple of the identity matrix, and calculate, for every dimension k, the lower bounds BF01(G) and P(H0|x, G) with G = GUS. They also address similar questions for nonsymmetric and discrete situations such as binomial testing. All these investigations yield results similar to those in the univariate normal testing problem. In particular, for the multivariate normal problem, when k = 2 and p = 0.05, P(H0|x, GUS) is obtained as 0.2582. For a fixed p, as k increases, P decreases at a slow rate, but only to a value much larger than p. Corresponding to p = 0.05, P converges to 0.2180 as k → ∞.

One possible reaction to the above findings is to dismiss point null hypotheses altogether on the ground that they are unrealistic. To respond to this argument, and also to answer when one can model a hypothesis testing problem as a test of a point null hypothesis, the above authors have explored when an interval null hypothesis can be approximated by a point null. More precisely, they have tried to answer when the problem of testing

(21) H0: |θ − θ0| ⩽ ε against H1: |θ − θ0| > ε

can be approximated by that of testing

(22) H0*: θ = θ0 against H1*: θ ≠ θ0.

To answer this question within the framework of classical (non-Bayesian) statistics, the authors have introduced

(23) α_ε := sup_{|θ−θ0| ⩽ ε} P_θ(T(X) ⩾ t)

and, in the special case of the normal testing problem, have moreover investigated for what values of ε, α0 is close to α_ε. They have tabulated upper bounds to ε* := ε√n/σ that achieve this. In particular, they have shown that for a moderate value of t, "as long as ε is no more than 1/4 to 1/6 of a sample standard deviation, the use of a point null will cause at most 10% error in the calculated P-value". The authors have addressed this issue also within the Bayesian framework. In this situation, one considers a prior density π(θ) which is continuous but sharply spiked near θ0. Let

Ω = {θ: |θ − θ0| ⩽ ε},  Ω^c = {θ: |θ − θ0| > ε},

and

(24) π0 = ∫_Ω π(θ) dθ,  g0(θ) = (1/π0) π(θ) I_Ω(θ),  g1(θ) = (1/(1 − π0)) π(θ) I_{Ω^c}(θ).

Here, π0 is the prior probability assigned to H0, g0 is the conditional density of θ given that H0 is true, and g1 is the conditional density of θ given that H1 is true. With all these, the authors have considered the normal testing problem. In this case, one observes X̄ ∼ N(θ, σ²/n) and, letting f(x̄|θ) denote its density, computes the exact Bayes factor for testing H0 against H1 in (21) by

(25) B := ∫ f(x̄|θ) g0(θ) dθ / ∫ f(x̄|θ) g1(θ) dθ.

On the other hand, the Bayes factor for testing H0* against H1* is given by

(26) B̂ := f(x̄|θ0) / ∫ f(x̄|θ) g(θ) dθ,

where g(θ) is the conditional prior density of θ given that H1* is true, and satisfies

(27) (i) g(θ) ≈ g1(θ) for |θ − θ0| > ε,
(ii) ∫_{|θ−θ0| ⩽ ε} g(θ) dθ is suitably small.

Finally, the authors have obtained a result that answers when B̂ provides a good approximation to B. In particular, they have shown that when g is taken to be the p.d.f. of N(θ0, σ²), and ε = σ/(4√n), n ⩾ 4, the error in approximating B by B̂ is no more than 10%. This result improves one due to Dickey (1976).
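The quality of this approximation is easy to probe numerically. The sketch below is ours, not the authors' calculation, and uses an assumed toy setting (θ0 = 0, σ = 1, n = 4, t = 1.96) with π(θ) taken to be the N(θ0, σ²) density, so that the choice g = N(θ0, σ²) is consistent with (27); it compares the interval-null Bayes factor of (25) with the point-null Bayes factor of (26) by midpoint-rule integration.

```python
from math import exp, sqrt, pi

def norm_pdf(x, m, s):
    return exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2.0 * pi))

def integrate(f, a, b, m=20000):
    # Simple midpoint rule
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))

theta0, sigma, n = 0.0, 1.0, 4               # assumed toy setting (n >= 4)
eps = sigma / (4.0 * sqrt(n))                # eps = sigma/(4*sqrt(n))
xbar = theta0 + 1.96 * sigma / sqrt(n)       # observed mean, t = 1.96

lik = lambda th: norm_pdf(xbar, th, sigma / sqrt(n))  # f(xbar | theta)
prior = lambda th: norm_pdf(th, theta0, sigma)        # assumed pi(theta)
joint = lambda th: lik(th) * prior(th)

pi0 = integrate(prior, theta0 - eps, theta0 + eps)    # prior mass of H0
num = integrate(joint, theta0 - eps, theta0 + eps)
total = integrate(joint, theta0 - 10.0, theta0 + 10.0)

B = (num / pi0) / ((total - num) / (1.0 - pi0))       # eq. (25)
B_point = lik(theta0) / total                         # eq. (26) with g = prior
```

In this setting B̂/B comes out around 1.03, comfortably within the stated 10% bound.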

We conclude this section by discussing briefly another interesting measure, proposed by Efron and Gous (2001). They call it a 'frequentist Bayes factor' and define it as

λ⁻¹(x)/λ⁻¹(y),

where λ is the likelihood ratio test statistic as defined in (1) and y is a threshold. One would reject H0 if this is large. Efron and Gous (2001) define what they call Fisher's scale of evidence and choose their y as the point such that λ⁻¹(y) equals the 90th percentile of the likelihood ratio test statistic λ(X) under H0. They show that, at least for k = 1, there is a prior, given by U(−4.85, 4.85), such that the Bayes factor corresponding to this prior is very close to the new measure (see Figure 2 of Efron and Gous, 2001).

For higher dimensions, it appears that one would have to use percentiles of order higher than 90% (see Figure 3 of Efron and Gous, 2001). In the following, α0 refers to the area to the left of the chosen percentile. Our calculation for k = 1 indicates that α0 need not be fixed at 0.90 but can be varied considerably without much affecting the agreement between the Bayesian and frequentist Bayes factors. However, the corresponding priors are quite different. For example, for α0 = 0.99, the prior for the matching Bayes factor is U(−34.58, 34.58).

In addition to the above, there is an interesting discussion in Efron and Gous (2001) of the different scales of evidence of Fisher and Jeffreys, which has some points in common with the main conclusions of Section 5 of this chapter.
