In this section, the theory behind Bayesian sample size calculations is considered. Previous reviews of this theory, and comparisons with the frequentist framework, have been given by Adcock [172] and more recently by Inoue et al. [173]. Note that throughout this chapter the traditional notation for the Type I and Type II error rates, $\alpha$ and $\beta$ respectively, is used. These are not to be confused with the intensity function and log hazard rate parameters used in previous and forthcoming chapters.
Sample size calculations for specific data types have been explored previously: see M’Lan et al. [174] for binomial responses and Joseph and Belisle [175] for normal responses. Both Gould [176] and Gubbiotti and De Santis [177] consider the design of equivalence studies. Typically, however, due to the added complexity of the Bayesian framework, there are few analytical solutions to the Bayesian sample size problem. As a result, simulation techniques are generally required; a general approach is set out by Wang and Gelfand [178] and Rubin and Stern [179]. These make use of utility functions defined on the posterior distribution of interest, such as those described by Lindley [180] and Pham-Gia [181].
Under the typical design of a Phase II/III randomised clinical trial, the aim in a frequentist framework is to control the Type I and Type II error rates. These design parameters are defined conditional on some fixed minimum clinically relevant difference, denoted $\delta$, that must be observed for a trial to be declared a success.
Using the notation of Chow et al. [182], denote the trial outcome as positive (negative) by $C = +$ ($C = -$) and the ‘true’ outcome by $T = +$ ($T = -$). In a frequentist framework the definitions of $\alpha$ and $\beta$ are given by
$$\alpha = \Pr(C = + \mid T = -),$$
$$\beta = \Pr(C = - \mid T = +).$$
A frequentist trial therefore controls its design parameters conditional on the ‘true’ trial outcome, which can never be known. Further criticisms of the frequentist approach are:
• Typically, frequentist sample size calculations are based on some estimate of the standard deviation of the key parameter of interest; this estimate is then treated as known when it rarely is.
• Prior information regarding the treatment effect, or the behaviour of a particular treatment arm, may often be available at the design stage but must be disregarded in a frequentist framework.
• Setting a minimum clinically relevant difference, $\delta$, can be difficult in practice.
Further difficulties have been noted regarding the effect that a minimum clinically relevant difference can have on a trial. In many areas, such as oncology, any improvement, however small, may be considered clinically relevant; designing trials on this basis, however, results in unfeasibly large trials. This can lead, as has previously been noted, to values of $\delta$ being set to satisfy a sample size calculation, thereby undermining the rigorous philosophy of a frequentist design. Furthermore, it allows a trial to observe a statistically significant difference whose point estimate does not meet the criterion of being clinically relevant.
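To make explicit how these quantities enter a frequentist calculation, recall the standard per-arm sample size formula for a two-arm comparison of normally distributed outcomes with common standard deviation $\sigma$ (quoted here as a familiar reference point rather than as a result from the works cited above):
$$n = \frac{2\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2}.$$
Both an over-optimistic $\delta$ and an understated estimate of $\sigma$ therefore feed directly into $n$; halving $\delta$, for example, quadruples the required sample size.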
Bayesian sample size calculations are advantageous as they not only account for uncertainty in key or nuisance parameters but can also incorporate prior information at the design stage. Furthermore, as Bayesian designs typically rely on simulation approaches, any layer of complexity or variability required can be built into the modelling approach, providing more informed sample size calculations than may otherwise be available.
To describe Bayesian sample size calculations, first consider that a trial is to be conducted which will collect data $x$, to be modelled in terms of some parameters $\theta$. Assume the final analysis will be based on a model which provides a likelihood of the form $L(\theta \mid x)$. Given that $\Pr(x \mid \theta)$ is proportional to $L(\theta \mid x)$, and given the prior distribution for the model parameters, $\Pr(\theta)$, the marginal distribution for the data $x$ is
$$\Pr(x) = \int \Pr(x \mid \theta)\Pr(\theta)\,\mathrm{d}\theta.$$
The full posterior distribution in a Bayesian analysis depends on this marginal distribution. The Bayesian approach to sample size calculation is not based on controlling the Type I and Type II error rates; rather, it concentrates on controlling aspects of the posterior distribution. Denote the statistic on which the posterior distribution is evaluated as $T(x)$. Many forms of $T(x)$ have been proposed previously. Three popular approaches given by Joseph and Belisle [175] are the Average Coverage Criterion (ACC), the Average Length Criterion (ALC) and the Worst Outcome Criterion (WOC), with Wang and Gelfand suggesting two further approaches, the Average Posterior Variance Criterion (APVC) and the probability of detecting a treatment effect of size $\delta^*$. These criteria are described in further detail in Sections 6.3.1 to 6.3.5.
Typically, analytical solutions are not available and so simulation approaches are required. The approach taken is set out by Wang and Gelfand [178, 183]. The general formulation is:
1. Sample a value $\tilde{\theta}$ from the prior distribution $\Pr(\theta)$.
2. Sample data $\tilde{x}$ from the sampling distribution $\Pr(x \mid \tilde{\theta})$.
3. Calculate $T(\tilde{x})$.
4. Repeat the process for a total of $N$ simulations.
From the $N$ simulations we can directly calculate $E[T(x)]$ as the arithmetic mean of $T(\tilde{x})$ over all simulations, and $\Pr[T(x) \in A]$ as the proportion of the $N$ simulations for which $T(\tilde{x})$ belongs to $A$.
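As a minimal sketch of this scheme, consider (purely for illustration; the model, prior and sample size below are assumptions of the example rather than choices made in this thesis) a single-arm trial with a binary response, a conjugate Beta prior on the response probability $\psi$, and $T(x)$ taken to be the posterior mean:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 2.0, 2.0   # assumed Beta(2, 2) prior on the response probability psi

def simulate_T(n, n_sims=10_000):
    """Wang-Gelfand style simulation with T(x) chosen as the posterior mean."""
    T = np.empty(n_sims)
    for i in range(n_sims):
        psi = rng.beta(A, B)              # 1. sample psi from the prior
        x = rng.binomial(n, psi)          # 2. sample data given psi
        T[i] = (A + x) / (A + B + n)      # 3. T(x): mean of the Beta(A + x, B + n - x) posterior
    return T                              # 4. repeat for N simulations

T = simulate_T(n=50)
print("Monte Carlo estimate of E[T(x)]:", T.mean())
print("Monte Carlo estimate of Pr[T(x) > 0.5]:", (T > 0.5).mean())
```

In this conjugate setting the posterior is available in closed form; in more complex models step 3 would itself require MCMC or another posterior approximation, but the outer simulation structure is unchanged.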
6.3.1 Average coverage criterion
Here some fixed interval length for the posterior distribution is set and the aim is to estimate the coverage of the posterior distribution provided by a given sample size. Denote by $\hat{\psi}$ the point estimate (mean or median) of a symmetric posterior distribution for $\psi$ and set some fixed length $l$ such that an interval $A(y^{(n)}) = (\hat{\psi} - l/2, \hat{\psi} + l/2)$ can be formed. Given some $\kappa > 0$, define the ACC as
$$E\left[\Pr\left(\psi \in A(y^{(n)}) \mid y^{(n)}\right)\right] \geq 1 - \kappa.$$
For non-symmetric distributions, the length $l$ can instead be used to define a highest posterior density (HPD) region, $A(y^{(n)}) = \{\psi : \Pr(\psi \mid y^{(n)}) \geq c_n(l)\}$, where $c_n(l)$ is chosen such that the Lebesgue measure of $A(y^{(n)})$ equals $l$. Typical values of $\kappa$ are 0.05 or 0.1, which are equivalent to 95% and 90% credible intervals respectively.
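Continuing the illustrative Beta-binomial example above (the values of $l$ and $\kappa$ are again assumptions of the sketch), the ACC can be estimated by averaging, over simulated datasets, the posterior probability assigned to a length-$l$ interval centred on the posterior mean, increasing $n$ until the bound is met:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
A, B = 2.0, 2.0   # assumed Beta(2, 2) prior on the response probability psi

def average_coverage(n, l, n_sims=4_000):
    """Mean posterior probability of the interval (mean - l/2, mean + l/2)."""
    cover = np.empty(n_sims)
    for i in range(n_sims):
        psi = rng.beta(A, B)
        x = rng.binomial(n, psi)
        post = stats.beta(A + x, B + n - x)
        centre = post.mean()
        cover[i] = post.cdf(centre + l / 2) - post.cdf(centre - l / 2)
    return cover.mean()

kappa, l, n = 0.05, 0.2, 10
while average_coverage(n, l) < 1 - kappa:   # smallest multiple of 10 meeting the ACC
    n += 10
print("ACC sample size (approx.):", n)
```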
6.3.2 Average length criterion
Similarly to the ACC, here the coverage of the posterior density is fixed and the length of the resulting credible interval is calculated for a given sample size. As with the ACC, the length $l$ can be defined either assuming a symmetric distribution or via an HPD region for non-symmetric distributions. First define the interval $A(y^{(n)}) = \left(F^{-1}_{\psi \mid y^{(n)}}(\kappa/2),\, F^{-1}_{\psi \mid y^{(n)}}(1-\kappa/2)\right)$, where $F^{-1}_{\psi \mid y^{(n)}}(\kappa/2)$ denotes the $\kappa/2$ quantile of the posterior distribution. The ALC is attained by a given $n$ which satisfies
$$E\left[F^{-1}_{\psi \mid y^{(n)}}(1-\kappa/2) - F^{-1}_{\psi \mid y^{(n)}}(\kappa/2)\right] \leq l.$$
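In the same illustrative setting, the ALC instead averages the length of the equal-tailed $(1-\kappa)$ credible interval and searches for the smallest $n$ for which this average does not exceed the assumed target length $l$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
A, B = 2.0, 2.0   # assumed Beta(2, 2) prior on the response probability psi

def average_length(n, kappa, n_sims=4_000):
    """Mean length of the equal-tailed (1 - kappa) posterior credible interval."""
    length = np.empty(n_sims)
    for i in range(n_sims):
        psi = rng.beta(A, B)
        x = rng.binomial(n, psi)
        lo, hi = stats.beta.ppf([kappa / 2, 1 - kappa / 2], A + x, B + n - x)
        length[i] = hi - lo
    return length.mean()

kappa, l, n = 0.05, 0.2, 10
while average_length(n, kappa) > l:          # smallest multiple of 10 meeting the ALC
    n += 10
print("ALC sample size (approx.):", n)
```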
6.3.3 Worst outcome criterion
For each of the ACC and the ALC there is roughly a 50% chance that the realised coverage or length will be worse than desired, as each criterion controls the required quantity only on average.
When this is of concern, the worst outcome criterion (WOC) is a viable alternative.
Here, instead of using the expectation, some appropriate subset of the sample space, $S_0$, is defined such that
$$\inf_{y^{(n)} \in S_0}\left[\Pr\left(\psi \in A(y^{(n)}) \mid y^{(n)}\right)\right] \geq 1 - \kappa.$$
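A sketch of the WOC in the same illustrative setting is given below. The subset $S_0$ is taken here, purely as an assumed choice, to be the 95% of simulated datasets with the highest posterior coverage, so the infimum over $S_0$ corresponds to the 5% quantile of the simulated coverage values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
A, B = 2.0, 2.0   # assumed Beta(2, 2) prior on the response probability psi

def worst_case_coverage(n, l, p=0.95, n_sims=4_000):
    """Infimum of the length-l interval coverage over S0, where S0 is chosen
    (as an assumption of this sketch) to contain the proportion p of simulated
    datasets with the highest coverage, i.e. the (1 - p) coverage quantile."""
    cover = np.empty(n_sims)
    for i in range(n_sims):
        psi = rng.beta(A, B)
        x = rng.binomial(n, psi)
        post = stats.beta(A + x, B + n - x)
        centre = post.mean()
        cover[i] = post.cdf(centre + l / 2) - post.cdf(centre - l / 2)
    return np.quantile(cover, 1 - p)

kappa, l, n = 0.05, 0.2, 10
while worst_case_coverage(n, l) < 1 - kappa:   # smallest multiple of 10 meeting the WOC
    n += 10
print("WOC sample size (approx.):", n)
```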
6.3.4 Average posterior variance criterion
The average posterior variance approach aims to control the variance of the posterior distribution, $\mathrm{Var}(\psi \mid y^{(n)})$, and seeks an $n$ for some $\varepsilon > 0$ such that
$$E\left[\mathrm{Var}(\psi \mid y^{(n)})\right] \leq \varepsilon.$$
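The same simulation skeleton applies to the APVC, with the interval summary replaced by the posterior variance; the Beta-binomial model and the value of $\varepsilon$ are again illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
A, B = 2.0, 2.0   # assumed Beta(2, 2) prior on the response probability psi

def average_posterior_variance(n, n_sims=4_000):
    """Mean of Var(psi | y) over simulated datasets."""
    var = np.empty(n_sims)
    for i in range(n_sims):
        psi = rng.beta(A, B)
        x = rng.binomial(n, psi)
        var[i] = stats.beta.var(A + x, B + n - x)
    return var.mean()

eps, n = 0.0025, 10
while average_posterior_variance(n) > eps:   # smallest multiple of 10 meeting the APVC
    n += 10
print("APVC sample size (approx.):", n)
```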
6.3.5 Effect size criterion
In the clinical trials setting, interest often lies in whether or not a parameter is greater than (or less than) some pre-specified effect size. This may or may not be analogous to a minimum clinically relevant difference under a frequentist framework. Setting some threshold value $\psi^*$, a sample size $n$ is obtained which satisfies
$$E\left[\Pr\left(\psi > \psi^* \mid y^{(n)}\right)\right] \geq \eta.$$
Here $\eta$ is some given threshold for the expected posterior probability. In many situations it is simply set that $\psi^* = 0$, which would represent evidence of some positive effect.
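A sketch in the illustrative Beta-binomial setting is given below. So that the criterion varies with $n$, the data are generated under an assumed fixed ‘true’ response rate, in the spirit of the probability of detecting a treatment effect of size $\delta^*$ mentioned earlier; the true rate of 0.7, the threshold $\psi^* = 0.5$ and the required expected probability $\eta = 0.9$ are all assumptions of the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
A, B = 2.0, 2.0              # assumed Beta(2, 2) analysis prior on psi
PSI_TRUE = 0.7               # assumed 'true' response rate used to generate the data
PSI_STAR, ETA = 0.5, 0.9     # assumed effect threshold and required expected probability

def expected_prob_effect(n, n_sims=4_000):
    """Mean of Pr(psi > PSI_STAR | y) when data are generated at PSI_TRUE."""
    prob = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.binomial(n, PSI_TRUE)
        prob[i] = stats.beta.sf(PSI_STAR, A + x, B + n - x)   # posterior tail probability
    return prob.mean()

n = 10
while expected_prob_effect(n) < ETA:   # smallest multiple of 10 meeting the criterion
    n += 10
print("Effect size criterion sample size (approx.):", n)
```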
6.3.6 Successful trial criterion
The effect size criterion can be extended by recognising that, at the outset of a trial, there are conditions under which the trial would be declared a success. The Successful Trial Criterion (STC) is then set up to estimate the probability that the posterior distribution of the key parameter of interest meets some pre-defined criterion.
Taking, for example, the situation where a trial would be defined a success if a sufficient proportion of the posterior density of $\psi$ is greater than (or less than) some threshold value $\psi^*$, we estimate, for a given $n$,
$$P\left\{\psi_{(\kappa)} \geq \psi^* \mid y^{(n)}\right\},$$
where $\psi_{(\kappa)}$ is the $\kappa$ quantile of the posterior distribution, with $n$ chosen so that this probability of a successful trial is sufficiently high. Here, for example, setting $\kappa = 0.1$ and $\psi^* = 0$ would consider the trial a success only if the 0.1 quantile of the posterior distribution is greater than zero. Note that this criterion can be used to provide Bayesian equivalents of the frequentist Type I and Type II error rates, although these are somewhat dependent on the prior distributions that are set for the parameters in the model.
Treating all design parameters as fixed and setting zero-variance prior distributions for the parameter of interest at the null and alternative hypothesis values will give results analogous to the frequentist Type I and Type II error rates.
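The sketch below illustrates this in the Beta-binomial setting: data are generated under zero-variance ‘null’ and ‘alternative’ response rates (0.5 and 0.7, assumed here), a trial is declared a success when the $\kappa = 0.1$ posterior quantile exceeds the assumed threshold $\psi^* = 0.5$, and the resulting success probabilities play the roles of the frequentist operating characteristics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
A, B = 2.0, 2.0                # assumed Beta(2, 2) analysis prior on psi
PSI_STAR, KAPPA = 0.5, 0.1     # success: the 0.1 posterior quantile must exceed 0.5
PSI_NULL, PSI_ALT = 0.5, 0.7   # assumed zero-variance 'null' and 'alternative' truths

def prob_success(n, psi_true, n_sims=4_000):
    """Probability that the kappa posterior quantile exceeds PSI_STAR when the
    data are generated at the fixed value psi_true."""
    success = np.empty(n_sims, dtype=bool)
    for i in range(n_sims):
        x = rng.binomial(n, psi_true)
        q_kappa = stats.beta.ppf(KAPPA, A + x, B + n - x)   # kappa posterior quantile
        success[i] = q_kappa >= PSI_STAR
    return success.mean()

n = 60
alpha_b = prob_success(n, PSI_NULL)       # analogue of the Type I error rate
beta_b = 1 - prob_success(n, PSI_ALT)     # analogue of the Type II error rate
print(f"n = {n}: Type I analogue = {alpha_b:.3f}, Type II analogue = {beta_b:.3f}")
```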