In clinical trials a hypothesis is a postulation, assumption, or statement made about the population regarding the efficacy, safety, or other pharmacoeconomic outcomes (e.g., quality of life) of a drug product under study. This statement or hypothesis is usually a scientific question that needs to be investigated. A clinical trial is often designed to address the question by translating it into specific study objective(s). Once the study objective(s) has been carefully selected and defined, a random sample can be drawn through an appropriate study design to evaluate the hypothesis about the drug product. For example, a scientific question regarding a drug product, say drug A, could be either (1) Is the mortality reduced by drug A? or (2) Is drug A superior to drug B in treating hypertension? The hypothesis to be questioned is usually referred to as the null hypothesis, denoted by H0. The hypothesis that the investigator wishes to establish is called the alternative hypothesis, denoted by Ha. In practice, we attempt to gain support for the alternative hypothesis by producing evidence to show that the null hypothesis is false. For the questions regarding drug A described above, the null hypotheses are that (1) there is no difference between drug A and the placebo in the reduction of mortality and (2) there is no difference between drug A and drug B in treating hypertension, respectively. The alternative hypotheses are that (1) drug A reduces the mortality and (2) drug A is superior to drug B in treating hypertension, respectively. These scientific questions or hypotheses to be tested can then be translated into specific study objectives: to compare (1) the efficacy of drug A with no therapy in the prevention of reinfarction or (2) the efficacy of drug A with that of drug B in reducing blood pressure in elderly patients, respectively.
Chow and Liu (2000) recommended that the following steps be taken to perform hypothesis testing (a worked sketch follows the list):
1. Choose the null hypothesis that is to be questioned.
2. Choose an alternative hypothesis that is of particular interest to the investigators.
3. Select a test statistic, and define the rejection region (or a rule) for decision making about when to reject the null hypothesis and when not to reject it.
4. Draw a random sample by conducting a clinical trial.
5. Calculate the test statistic and its corresponding p-value.
6. Draw a conclusion according to the predetermined rule specified in step 3.
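To make these steps concrete, the following is a minimal sketch in Python of steps 3 through 6, assuming a hypothetical two-arm trial with normally distributed responses. The data, the function name, and the use of a simple z-test (with samples this small a t-test would normally be preferred) are illustrative assumptions, not part of the original discussion.

```python
import math

def two_sided_z_test(x, y, alpha=0.05):
    """Steps 3 and 5: a two-sample z-test (normal approximation) of
    H0: no difference vs. Ha: there is a difference."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Sample variances (n - 1 denominator)
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)   # standard error of the mean difference
    z = (mx - my) / se                  # observed test statistic
    # Two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p, p < alpha              # step 6: reject H0 iff p < alpha

# Step 4: hypothetical responses (e.g., blood pressure reduction, mmHg)
drug_a  = [12.1, 9.8, 11.4, 13.0, 10.6, 12.7, 9.9, 11.8]
placebo = [8.2, 7.9, 10.1, 8.8, 9.4, 7.6, 8.9, 9.0]
z, p, reject = two_sided_z_test(drug_a, placebo)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```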
When performing hypothesis testing, basically two kinds of errors can occur. If the null hypothesis is rejected when it is true, then a type I error has occurred. For example, a type I error has occurred if we claim that drug A reduces the mortality when in fact there is no difference between drug A and the placebo in the reduction of mortality. The probability of
committing a type I error is known as the level of significance. It is usually denoted by α. In practice, α represents the consumer's risk, which is often chosen to be 5%. On the other hand, if the null hypothesis is not rejected when it is false, then a type II error has been made. For example, we have made a type II error if we claim that there is no difference between drug A and the placebo in the reduction of mortality when in fact drug A does reduce the mortality. The probability of committing a type II error, denoted by β, is sometimes referred to as the producer's risk. In practice, 1 − β is known as the power of the test, which represents the probability of correctly rejecting the null hypothesis when it is false.
Table 2.6.1 summarizes the relationship between type I and type II errors when testing hypotheses. Furthermore, a graph based on the null hypothesis of no difference is presented in Figure 2.6.1 to illustrate the relationship between α and β (or power) under H0 for various alternatives at α = 5% and 10%. It can be seen that α decreases as β increases and α increases as β decreases. The only way to decrease both α and β is to increase the sample size. In clinical trials a typical approach is to first choose a significance level α and then select a sample size to achieve a desired power. In other words, a sample size is chosen to reduce the type II error such that β is within an acceptable range at a prespecified significance level α. From Table 2.6.1 and Figure 2.6.1 it can be seen that α and β depend on the selected null and alternative hypotheses. As indicated earlier, the hypothesis to be questioned is usually chosen as the null hypothesis. The alternative hypothesis is usually of particular interest to the investigators.
Table 2.6.1 Relationship Between Type I and Type II Errors

                            If H0 Is
When                  True              False
Fail to reject        No error          Type II error
Reject                Type I error      No error
Figure 2.6.1 Relationship between probabilities of type I and type II errors.
In practice, the choice of the null hypothesis and the alternative hypothesis has an impact on the parameter to be tested.
Chow and Liu (2000) indicate that the null hypothesis may be selected based on the importance of the type I error. In either case, however, it should be noted that we will never be able to prove that H0 is true even though the data fail to reject it.
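Before turning to p-values, the interplay among α, β, and sample size described above can be made concrete with the standard normal-approximation sample-size formula for comparing two means. The effect size, standard deviation, and function name below are hypothetical choices for illustration.

```python
from statistics import NormalDist
import math

def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided,
    two-sample comparison of means:
        n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
    Decreasing alpha or beta (i.e., raising power) increases n."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g., 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g., 0.84 for power = 0.80
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Hypothetical: detect a 5 mmHg mean difference with SD 12 mmHg
print(sample_size_per_arm(delta=5, sigma=12))   # about 91 patients per arm
```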
p-Values
In medical literature p-values are often used to summarize results of clinical trials in a probabilistic way. For example, in a study of 10 patients with congestive heart failure, Davis et al. (1979) report that at single daily doses of captopril of 25 to 150 mg, the cardiac index rose from 1.75 ± 0.18 to 2.77 ± 0.39 (mean ± SD) liters per minute per square meter (p < 0.001). Powderly et al. (1995) confirmed that fluconazole was effective in preventing esophageal candidiasis (adjusted relative hazard, 5.8; 95% confidence interval, 1.7 to 20.0; p = 0.004) in patients with advanced human immunodeficiency virus (HIV) infection. In a multicenter trial Coniff et al. (1995) indicate that all active treatments (acarbose, tolbutamide, and acarbose plus tolbutamide) were superior (p < 0.05) to placebo in reducing postprandial hyperglycemia and HbA1C levels in noninsulin-dependent diabetes mellitus (NIDDM) patients. In a study evaluating the rate of bacteriologic failure of amoxicillin-clavulanate in the treatment of acute otitis media, Patel et al. (1995) reveal that the bacteriologic failure rate was higher in nonwhite boys (p = 0.026) and in subjects with a history of three or more previous episodes of acute otitis media (p = 0.008). These statements indicate that a difference at least as great as that observed would occur in less than 1 in 100 trials if a 1% level of significance were chosen, or in less than 1 in 20 trials if a 5% level of significance were selected, provided that the null hypothesis of no difference between treatments is true and the assumed statistical model is correct.
In practice, the smaller the p-value, the stronger the result is considered to be. However, the meaning of a p-value may not be well understood. The p-value is a measure of the chance that a difference at least as great as the observed difference would occur if the null hypothesis is true. Therefore, if the p-value is small, the null hypothesis is unlikely to be true, and the observed difference is unlikely to have occurred due to chance alone. The p-value is usually derived from a statistical test that depends on the size and direction of the effect (a null hypothesis and an alternative hypothesis). To show this, consider testing the following hypotheses at the 5% level of significance:
H0: There is no difference;
vs. Ha: There is a difference.    (2.6.1)

The statistical test for the above hypotheses is usually referred to as a two-sided test. If the null hypothesis (i.e., H0) of no difference is rejected at the 5% level of significance, then we conclude that there is a significant difference between the drug product and the placebo. In this case we may further evaluate whether the trial size is large enough to effectively detect a clinically important difference (i.e., a difference that will lead the investigators to believe the drug is of clinical benefit and hence effective) when such a difference exists. Typically, the FDA requires at least 80% power for detecting such a difference. In other words, the FDA requires that there be at least an 80% chance of correctly detecting such a difference when the difference indeed exists.
Figure 2.6.2 displays the sampling distribution of a two-sided test under the null hypothesis in (2.6.1). It can be seen from Figure 2.6.2 that a two-sided test has an equal chance to show
that the drug is either effective on one side or ineffective on the other side. In Figure 2.6.2, −C and C are critical values. The area under the probability curve between −C and C constitutes the so-called acceptance region for the null hypothesis. In other words, any observed difference in means in this region is a piece of supportive information for the null hypothesis. The area under the probability curve below −C and beyond C is known as the rejection region. An observed difference in means in this region casts doubt on the null hypothesis. Based on this concept, we can statistically evaluate whether the null hypothesis is a true statement. Let μD and μP be the population means of the primary efficacy variable of the drug product and the placebo, respectively. Under the null hypothesis of no difference (i.e., μD = μP), a statistical test, say T, can be derived. Suppose that t, the observed difference in means of the drug product and the placebo, is a realization of T. Under the null hypothesis we can expect that the majority of t will fall around the center, μD − μP = 0. There is a 2.5% chance that t will fall in each tail. That is, there is a 2.5% chance that t will be either below the critical value −C or beyond the critical value C. If t falls below −C, then the drug is worse than the placebo. On the other hand, if t falls beyond C, then the drug is superior to the placebo. In both cases we would suspect the validity of the statement under the null hypothesis. Therefore, we would reject the null hypothesis of no difference if

t < −C or t > C.
Furthermore, we may want to evaluate how strong the evidence is. In this case, we calculate the area under the probability curve beyond the point t. This area is known as the observed p-value.
Therefore, the p-value is the probability that a result at least as extreme as that observed would occur by chance if the null hypothesis is true. It can be seen from Figure 2.6.2 that

p-value < 0.05 if and only if t < −C or t > C.
A smaller p-value indicates that t is further away from the center (i.e., μD − μP = 0) and consequently provides stronger evidence in support of the alternative hypothesis of a difference.
Figure 2.6.2 Sampling distribution of two-sided test.
In practice, we can construct a confidence interval for μD − μP. If the constructed confidence interval does not contain 0, then we reject the null hypothesis of no difference at the 5% level of significance. It should be noted that the above evaluations of the null hypothesis reach the same conclusion regarding its rejection. However, a typical approach is to present the observed p-value. If the observed p-value is less than the level of significance, then the investigators would reject the null hypothesis in favor of the alternative hypothesis.
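The agreement between the confidence-interval approach and the p-value approach can be checked numerically. In the sketch below the observed difference, its standard error, and the helper name are hypothetical, and a normal sampling distribution is assumed.

```python
from statistics import NormalDist

def ci_and_p(diff, se, alpha=0.05):
    """95% confidence interval for muD - muP and the two-sided p-value.
    The interval excludes 0 exactly when p < alpha."""
    nd = NormalDist()
    c = nd.inv_cdf(1 - alpha / 2)       # critical value C (about 1.96)
    lower, upper = diff - c * se, diff + c * se
    p = 2 * (1 - nd.cdf(abs(diff) / se))
    return (lower, upper), p

# Hypothetical observed mean difference and its standard error
(ci, p) = ci_and_p(diff=2.9, se=1.2)
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p:.4f}")
# The interval excludes 0 and p < 0.05, so both approaches reject H0.
```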
Although p-values measure the strength of evidence by indicating the probability that a result at least as extreme as that observed would occur due to random variation alone under the null hypothesis, they do not reflect the sample size or the direction of the treatment effect.
Ware et al. (1992) indicate that p-values are a way of reporting the results of statistical analyses. It may be misleading to equate p-values with decisions. Therefore, in addition to p-values, they recommend that the investigators also report summary statistics, confidence intervals, and the power of the tests used. Furthermore, the effects of selection or multiplicity should also be reported.
Note that when a p-value is between 0.01 and 0.05, the result is usually called statistically significant; when it is less than 0.01, the result is often called highly statistically significant.
One-Sided versus Two-Sided Hypotheses
For marketing approval of a drug product, current FDA regulations require that substantial evidence of effectiveness and safety of the drug product be provided. Substantial evidence can be obtained through the conduct of two adequate well-controlled clinical trials. The evidence is considered substantial if the results from the two adequate well-controlled studies are consistent in the positive direction. In other words, both trials show that the drug product is significantly different from the placebo in the positive direction. If the primary objective of a clinical trial is to establish that the test drug under investigation is superior to an active control agent, it is referred to as a superiority trial (ICH E9, 1998).
However, the hypotheses given in (2.6.1) do not specify the direction once the null hypothesis is rejected. As an alternative, the following hypotheses are proposed:
H0: There is no difference;
vs. Ha: The drug is better than the placebo.    (2.6.2)

The statistical test for the above hypotheses is known as a one-sided test. If the null hypothesis of no difference is rejected at the 5% level of significance, then we conclude that the drug product is better than the placebo and hence is effective. Figure 2.6.3 gives the rejection region of a one-sided test. To further compare a one-sided and a two-sided test, let us consider the level of proof required for marketing approval of a drug product at the 5%
level of significance. For a given clinical trial, if a two-sided test is employed, the level of proof required is one out of 40. In other words, at the 5% level of significance, there is a 2.5% chance (or one out of 40) that we may reject the null hypothesis of no difference in the positive direction and conclude that the drug is effective on one side. On the other hand, if a one-sided test is used, the level of proof required is one out of 20. It turns out that the one-sided test allows more ineffective drugs to be approved by chance as compared to the two-sided test. As indicated earlier, to demonstrate the effectiveness and safety of a drug product, the FDA requires that two adequate well-controlled clinical trials be conducted.
Then the level of proof required should be squared regardless of which test is used. Table 2.6.2 summarizes the levels of proof required for the marketing approval of a drug product.
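The squaring simply multiplies the per-trial error chances of two independent trials; a trivial check (illustrative only):

```python
# Per-trial chance of falsely finding the drug effective at the 5% level
one_sided, two_sided = 1 / 20, 1 / 40
# Two independent positive trials: the per-trial chances multiply (square)
print(one_sided ** 2)    # 0.0025   = 1/400  (one-sided)
print(two_sided ** 2)    # 0.000625 = 1/1600 (two-sided)
```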
As Table 2.6.2 indicates, the levels of proof required for one-sided and two-sided tests based on two trials are one out of 400 and one out of 1600, respectively. Fisher (1991) argues that the level of proof of one out of 400 is a strong proof and is sufficient to be considered substantial evidence for marketing approval, so the one-sided test is appropriate. However, there is no universal agreement among the regulatory agency (e.g., the FDA), academia, and the pharmaceutical industry as to whether a one-sided test or a two-sided test should be used. The concern raised is based on the following two reasons:
1. Investigators would not run a trial if they thought the drug would be worse than the placebo. They would study the drug only if they believed that it might be of benefit.
2. When testing at the 0.05 significance level with 80% power, the required sample size is increased by 27% for the two-sided test as opposed to the one-sided test (see the sketch following this list). As a result, there is a substantial impact on cost when a one-sided test is used.
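The 27% figure in reason 2 follows from the required sample size being proportional to (z_alpha + z_beta)^2 under the normal approximation; the following sketch verifies it using only the Python standard library (variable names are chosen here for illustration).

```python
from statistics import NormalDist

nd = NormalDist()
z_beta = nd.inv_cdf(0.80)            # 80% power
z_two = nd.inv_cdf(1 - 0.05 / 2)     # two-sided critical value (about 1.96)
z_one = nd.inv_cdf(1 - 0.05)         # one-sided critical value (about 1.645)
# n is proportional to (z_alpha + z_beta)^2, so the relative increase is:
ratio = (z_two + z_beta) ** 2 / (z_one + z_beta) ** 2
print(f"two-sided needs {(ratio - 1) * 100:.0f}% more patients")   # about 27%
```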
It should be noted that although investigators may believe that a drug is better than the placebo, it is not impossible that the results will turn out to be unexpected (Fleiss, 1987).
Figure 2.6.3 Sampling distribution of one-sided test.
Table 2.6.2 Level of Proof Required for Clinical Investigation

                            Type of Tests
Number of Trials       One-Sided       Two-Sided
One trial              1/20            1/40
Two trials             1/400           1/1600
Ellenberg (1990) indicates that the use of a one-sided test is usually a signal that the trial has too small a sample size and that the investigators are attempting to squeeze out a significant result by a statistical maneuver. These observations certainly argue against the use of a one-sided test for the evaluation of effectiveness in clinical trials. Cochran and Cox (1957) suggest that a one-sided test be used when it is known that the drug must be at least as good as the placebo, while a two-sided test should be used when it is not known which treatment is better.
As indicated by Dubey (1991), the FDA tends to oppose the use of a one-sided test.
However, this position has been challenged by several drug sponsors on the Drug Efficacy Study Implementation (DESI) drugs at the administrative hearings. As an example, Dubey (1991) points out that several views favoring the use of a one-sided test were discussed in an administrative hearing. Some drug sponsors argued that the one-sided test is appropriate in the following situations: (1) where there is truly concern only with outcomes in one tail and (2) where it is completely inconceivable that the results could go in the opposite direction. In this hearing the sponsors inferred that the prophylactic value of the combination drug is greater than that posited by the null hypothesis of equal incidence, and therefore the risk of finding an effect when none in fact exists is located only in the upper tail. As a result, a one-sided test is called for. However, the FDA feels that a two-sided test should be applied to account not only for the possibility that the combination drugs are better than the single agent alone at preventing candidiasis but also for the possibility that they are worse at doing so.
Dubey's opinion is that one-sided tests may be justified in some situations such as toxicity studies, safety evaluation, analysis of occurrences of adverse drug reaction data, risk evaluation, and laboratory research data. Fisher (1991) argues that one-sided tests are appropriate for drugs that are tested against placebos at the 0.05 level of significance in two well-controlled trials. If, on the other hand, only one clinical trial rather than two is conducted, a one-sided test should be applied at the 0.025 level of significance. However, Fisher agrees that two-sided tests are more appropriate for active control trials.
It is critical to specify the hypotheses to be tested in the protocol. A one-sided test or two-sided test can then be justified based on the hypotheses. It should be noted that the FDA is against a post hoc decision to create significance or near significance on any parameters when significance did not previously exist. This critical switch cannot be adequately explained and hence is considered an invalid practice by the FDA. More discussion regarding the use of one-sided versus two-sided tests from the perspectives of the pharmaceutical industry, academia, an FDA Advisory Committee member, and the FDA can be found in Peace (1991), Koch (1991), Fisher (1991), and Dubey (1991), respectively.