As indicated in the hypotheses of (2.6.1), the objective of most clinical trials is to detect the existence of a predefined clinical difference using a statistical testing procedure such as the unpaired two-sample t-test. If this predefined difference is clinically meaningful, then it is of clinical significance. If the null hypothesis in (2.6.1) is rejected at the α level of significance, then we conclude that a statistically significant difference exists between treatments.
In other words, an observed difference that is unlikely to occur by chance alone is considered a statistically significant difference. However, whether a difference is statistically significant depends on the sample size of the trial. A trial with a small sample size usually provides little information regarding the efficacy and safety of the test drug under investigation. On the other hand, a trial with a large sample size provides substantial evidence of the efficacy and safety of the test drug product. An observed statistically significant difference that is of little or no clinical meaning will not be able to address the scientific/clinical questions that the clinical trial was intended to answer in the first place.
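The dependence of statistical significance on sample size can be illustrated with a minimal sketch. It assumes two equal-sized arms, a known common standard deviation, and a large-sample normal (z) approximation in place of the two-sample t-test; the function name and all numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(observed_diff, common_sd, n_per_arm):
    """Large-sample z-test p-value for a difference in means between two equal-sized arms."""
    standard_error = common_sd * sqrt(2.0 / n_per_arm)
    z = abs(observed_diff) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical numbers: the same small observed difference at three trial sizes.
observed_diff = 1.0
common_sd = 8.0
for n_per_arm in (20, 200, 2000):
    p = two_sided_p_value(observed_diff, common_sd, n_per_arm)
    print(f"n per arm = {n_per_arm:5d}   p-value = {p:.4f}")
```

With these assumed inputs the observed difference never changes, yet the p-value drops below 0.05 only at the largest trial size.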
The magnitude of a clinically significant difference varies. In practice, no precise definition exists for the clinically significant difference, which depends on the disease, indication, therapeutic area, class of drugs, and primary efficacy and safety endpoints. For example, for antidepressant agents (e.g., Serzone), a change from baseline of 8 on the Hamilton depression (Ham-D) scale, or a 50% reduction from baseline on the Ham-D scale with a baseline score over 20, may be considered of clinical importance. For antimicrobial agents (e.g., Cefil), a 15% reduction in bacteriologic eradication rate could be considered a significant improvement. Similarly, we could also consider a reduction of 10 mm Hg in sitting diastolic blood pressure as clinically significant for ACE inhibitor agents in treating hypertensive patients.
The examples of clinical significance for antidepressant or antihypertensive agents are those of individual clinical significance, which can be applied to the evaluation of treatment for individual patients in usual clinical practice. Because individual clinical significance only reflects the clinical change after the therapy, it cannot be employed to compare the clinical change of a therapy to that of no therapy or of a different therapy. Temple (1982) pointed out that, in the evaluation of one phase II clinical trial for an ACE inhibitor, although the ACE inhibitor at 150 mg t.i.d. can produce a mean reduction from baseline in diastolic blood pressure of 16 mm Hg, the corresponding mean reduction from baseline for the placebo is as large as 9 mm Hg. It is easy to see that a sizable proportion of the patients in the placebo group reached the level of individual clinical significance of 10 mm Hg. This example therefore illustrates that individual clinical significance alone cannot be used to establish the effectiveness of a new treatment.
For assessment of the efficacy/safety of a new treatment modality, the modality is compared, within the same trial, with either a placebo or another treatment, usually the standard therapy. If the concurrent competitor in the same study is placebo, the effectiveness of the new modality can then be established, based on some primary endpoints, by providing evidence of an average difference between the new modality and placebo that is larger than some prespecified difference of clinical importance to investigators or to the medical/scientific community. This observed average difference is said to be of comparative clinical significance.
The ability of a placebo-controlled clinical trial to provide such an observed difference of both comparative clinical significance and statistical significance is referred to as assay sensitivity. A similar definition of assay sensitivity is also given in the ICH E10 guidance entitled Choice of Control Group in Clinical Trials (ICH, 1999).
On the other hand, when the concurrent competitor in the trial is the standard treatment or another active treatment, the efficacy of the new treatment can be established by showing that the test treatment is as good as or at least no worse than the standard treatment. In this situation, however, the proof of efficacy for the new treatment rests on a crucial assumption: that the standard treatment or active competitor has established its own efficacy by demonstrating a difference of comparative clinical significance with respect to placebo in adequate placebo-controlled studies. This assumption is referred to as the sensitivity-to-drug-effects (ICH E10, 1999).
Table 2.7.1 presents the results first reported in Leber (1989), which were again used by Temple (1983) and Temple and Ellenberg (2000) to illustrate the issues and difficulties in evaluating and interpreting active controlled trials. All six trials compare nomifensine (a test antidepressant) to imipramine (a standard tricyclic antidepressant) concurrently with placebo. The common baseline means and 4-week adjusted group means based on the Hamilton depression scale are given in Table 2.7.1. Except for trial V311(2), based on the Hamilton depression scale, both nomifensine and imipramine showed more than 50% mean reduction. However, the magnitudes of average reduction on the Hamilton depression scale at 4 weeks for the placebo are almost the same as those for the two active treatments in these five trials. Therefore, these five trials do not have assay sensitivity. It should be noted that trial V311(2) is the smallest trial, with a total sample size of only 22 patients. However, it was the only trial in Table 2.7.1 that demonstrated that both nomifensine and imipramine are better than placebo in the sense of both comparative clinical significance and statistical significance.
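The comparison with placebo can be tabulated directly from the four-week adjusted means in Table 2.7.1. The sketch below is only a descriptive calculation, not a statistical test; the variable and function names are hypothetical. A positive difference means a lower (better) Hamilton depression score on the active drug than on placebo.

```python
# Four-week adjusted Ham-D means from Table 2.7.1:
# (study, nomifensine, imipramine, placebo)
ADJUSTED_MEANS = [
    ("R301", 13.4, 12.8, 14.8),
    ("G305", 13.0, 13.4, 13.9),
    ("C311(1)", 19.4, 20.3, 18.9),
    ("V311(2)", 7.3, 9.5, 23.5),
    ("F313", 21.9, 21.9, 22.0),
    ("K317", 11.2, 10.8, 10.5),
]


def placebo_minus_active(rows):
    """Map each study to (placebo - nomifensine, placebo - imipramine), rounded to 0.1."""
    return {
        study: (round(placebo - nomifensine, 1), round(placebo - imipramine, 1))
        for study, nomifensine, imipramine, placebo in rows
    }


differences = placebo_minus_active(ADJUSTED_MEANS)
for study, (vs_nomifensine, vs_imipramine) in differences.items():
    print(f"{study:8s}  placebo - nomifensine = {vs_nomifensine:5.1f}   "
          f"placebo - imipramine = {vs_imipramine:5.1f}")
```

Only V311(2) shows a large separation from placebo; in the other five trials the adjusted means differ from placebo by at most 2 points in either direction.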
Basically, there are four different outcomes for significant differences in a clinical trial.
The result may show that (1) the difference is both statistically and clinically significant, (2) there is a statistically significant difference yet the difference is not clinically significant, (3) the difference is of clinical significance yet not statistically significant, and (4) the difference is neither statistically significant nor clinically significant. If the difference is both clinically and statistically significant or if it is neither clinically nor statistically significant, then there is no confusion. The conclusion can be drawn based on the results from the clinical data.
However, in many cases a statistically significant difference does not agree with the clinically significant difference. For example, a statistical test may reveal that there is a statistically significant difference. However, if the difference is too small to be of any clinical importance (which may be due to an unusually small variability or a relatively large sample size), then it is not clinically significant. In this case a small p-value may be instrumental in concluding the effectiveness of the treatment. On the other hand, the result may indicate that there is a clinically significant difference but the sample size is too small (or the variability is too large) to claim a statistically significant difference. In this case the evidence of effectiveness is not substantial due to a large p-value. This inconsistency has created confusion and arguments among clinicians and biostatisticians in the assessment of the efficacy and safety of clinical trials.
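The four outcomes can be written as a small decision rule. This is a minimal sketch under stated assumptions: statistical significance is taken as a p-value below α, and clinical significance as an observed difference at least as large as a prespecified clinically meaningful difference. The function name, parameter names, and example numbers are hypothetical.

```python
def classify_outcome(p_value, observed_diff, clinical_delta, alpha=0.05):
    """Classify a trial result into one of the four statistical/clinical outcomes."""
    statistically_significant = p_value < alpha
    clinically_significant = abs(observed_diff) >= clinical_delta
    if statistically_significant and clinically_significant:
        return "statistically and clinically significant"
    if statistically_significant:
        return "statistically significant only"
    if clinically_significant:
        return "clinically significant only"
    return "neither statistically nor clinically significant"


# Hypothetical results, with 10 mm Hg taken as the clinically meaningful difference.
print(classify_outcome(p_value=0.001, observed_diff=2.0, clinical_delta=10.0))
print(classify_outcome(p_value=0.200, observed_diff=12.0, clinical_delta=10.0))
```

The two printed cases are the discordant ones: a small but precisely estimated difference, and a large but imprecisely estimated one.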
Table 2.7.1 Summary of Means of Hamilton Depression Scales of Six Trials Comparing Nomifensine, Imipramine, and Placebo

           Common          Four-Week Adjusted Mean (Number of Subjects)
Study      Baseline Mean   Nomifensine    Imipramine    Placebo
R301       23.9            13.4 (33)      12.8 (33)     14.8 (36)
G305       26.0            13.0 (39)      13.4 (30)     13.9 (36)
C311(1)    28.1            19.4 (11)      20.3 (11)     18.9 (13)
V311(2)    29.6             7.3 (7)        9.5 (8)      23.5 (7)
F313       37.6            21.9 (7)       21.9 (8)      22.0 (8)
K317       26.1            11.2 (37)      10.8 (32)     10.5 (36)

Source: Temple and Ellenberg (2000).

As indicated earlier, for the assessment of efficacy and safety of a drug product, a typical approach is to first demonstrate that there is a statistically significant difference between the drug products in terms of some clinical endpoints by testing hypotheses (2.6.1), repeated below:
H0: There is no difference;
vs. Ha: There is a difference.
Equivalently
H0: $\mu_D = \mu_P$
vs. Ha: $\mu_D \neq \mu_P$,
where $\mu_D$ and $\mu_P$ are the means of the primary clinical endpoint for the drug product and the placebo, respectively. If we reject the null hypothesis of no difference at the α level of significance, then there is a statistically significant difference between the drug product and the placebo in terms of the primary clinical endpoint. We then further evaluate whether there is sufficient power to correctly detect a clinically significant difference. If there is, then we can conclude that the drug product is effective and safe. Note that the above hypotheses are known as point hypotheses. In practice, it is recognized that no two treatments will have exactly the same mean responses. Therefore, if the mean responses of the two treatments differ by less than a meaningful limit (i.e., a clinically important difference), the two treatments can be considered clinically equivalent. Based on this idea, Schuirmann (1987) first introduced the use of interval hypotheses for assessing bioequivalence. The interval hypotheses for clinical equivalence can be formulated as
H0: The two drugs are not equivalent;
vs. Ha: The two drugs are equivalent. (2.7.1)
Or put differently,
H0: $\mu_A - \mu_B \le L$ or $\mu_A - \mu_B \ge U$;
vs. Ha: $L < \mu_A - \mu_B < U$,
where $\mu_A$ and $\mu_B$ are the means of the primary clinical endpoint for drugs A and B, respectively, and L and U are some clinically meaningful limits. The idea behind interval hypotheses (2.7.1) is to show equivalence by rejecting the null hypothesis of inequivalence. The above hypotheses can be decomposed into two sets of one-sided hypotheses
H01: Drug A is superior to drug B (i.e., $\mu_A - \mu_B \ge U$);
vs. Ha1: Drug A is not superior to drug B;
and
H02: Drug A is inferior to drug B (i.e., $\mu_A - \mu_B \le L$);
vs. Ha2: Drug A is not inferior to drug B.
The first set of hypotheses is to verify that drug A is not superior to drug B, while the second set of hypotheses is to verify that drug A is not worse than drug B. A relatively large or small observed difference may raise concern about the comparability of the two drug products. Therefore the rejection of H01 and H02 will lead to the conclusion of clinical equivalence. This is equivalent to rejecting H0 in (2.7.1). In practice, if L is chosen to be $-U$, then we can conclude clinical equivalence if
$|\mu_A - \mu_B| < \Delta$,
where $\Delta = U = -L$ is the clinically significant difference. For example, for the assessment of bioequivalence between a generic drug product and an innovator drug product (or reference drug product), the bioequivalence limit $\Delta$ is often chosen to be 20% of the bioavailability of the reference product. In other words, in terms of the ratio of means $\mu_A/\mu_B$, the limits become L = 80% and U = 120%. When log-transformed data are analyzed, the FDA suggests using L = 80% and U = 125%. More detail on the assessment of bioequivalence between drug products can be found in Chow and Liu (2000).
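The decomposition into two one-sided hypotheses can be sketched in code. This is an illustrative sketch only: it works from summary statistics and uses a large-sample normal approximation rather than the t distribution that a small trial would call for, and the function name and example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_equivalence(mean_a, mean_b, sd_a, sd_b, n_a, n_b, lower, upper, alpha=0.05):
    """Two one-sided z-tests for H0: diff <= lower or diff >= upper, with diff = mu_A - mu_B."""
    diff = mean_a - mean_b
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    normal = NormalDist()
    # H02: mu_A - mu_B <= L, rejected when the difference sits well above L.
    p_lower = 1.0 - normal.cdf((diff - lower) / standard_error)
    # H01: mu_A - mu_B >= U, rejected when the difference sits well below U.
    p_upper = normal.cdf((diff - upper) / standard_error)
    equivalent = p_lower < alpha and p_upper < alpha
    return equivalent, p_lower, p_upper


# Hypothetical summary data with equivalence limits L = -1 and U = 1.
for n in (10, 100):
    equivalent, p_lower, p_upper = tost_equivalence(10.2, 10.0, 2.0, 2.0, n, n, -1.0, 1.0)
    print(f"n per arm = {n:3d}  p(H02) = {p_lower:.4f}  p(H01) = {p_upper:.4f}  "
          f"equivalent = {equivalent}")
```

Equivalence is declared only when both one-sided null hypotheses are rejected, so the same observed difference can fail at a small sample size and succeed at a larger one.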
When two drugs are shown to be clinically equivalent, they are comparable to each other.
Consequently, they can be used as substitutes for each other. It should be noted that there is a difference between the assessment of a possible difference and the assessment of equivalence. Hypotheses (2.6.1) are set for the assessment of a possible difference between treatments, while hypotheses (2.7.1) are for the assessment of equivalence. The demonstration of equality does not necessarily imply equivalence. This is because the sample size selected for testing equality may not be sufficient for assessing equivalence. Besides, when we fail to reject the null hypothesis of equality, it does not imply that the two treatments are equivalent, even if there is sufficient power for the detection of a clinically significant difference.
Note that the current FDA regulations do not allow sponsors to establish clinical equivalence/noninferiority based on clinical trials designed for the detection of treatment differences. Likewise, the ICH E9 guideline stresses that it is inappropriate to conclude equivalence/noninferiority based on a statistically nonsignificant test result for the null hypothesis in (2.6.1) that there is no difference between the investigational drug and the active competitor. Clinical equivalence/noninferiority between two drug products must be established based on the interval hypotheses described in (2.7.1). In general, a confidence interval approach is used to establish clinical equivalence/noninferiority. If the entire confidence interval for the average difference between the investigational product and the active competitor is within some prespecified equivalence limits, clinical equivalence is inferred. Clinical noninferiority is concluded if an upper one-sided confidence limit is smaller than the prespecified limit. From the above discussion, the sample size determination for equivalence/noninferiority trials should specify the value of $\Delta$, which is the largest difference between the investigational product and the active competitor that can be judged as clinically acceptable. In addition, the power for concluding equivalence/noninferiority using a prespecified value of $\Delta$ should be given. For testing interval hypotheses, several statistical procedures have been proposed. See, for example, Blackwelder (1982), Wellek (1993), Jennison and Turnball (1993), and Liu (1995a).
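The confidence interval approach can be sketched as follows. Several choices here are assumptions rather than anything the text prescribes: a large-sample normal interval, a two-sided level of 100(1 − 2α)% (the level that corresponds to two one-sided tests at level α), symmetric limits ±∆, and a difference defined so that larger values mean the investigational product performs worse. All function names and numbers are hypothetical.

```python
from statistics import NormalDist


def confidence_limits(diff, standard_error, alpha=0.05):
    """Lower and upper limits of a 100(1 - 2*alpha)% normal-theory confidence interval."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return diff - z * standard_error, diff + z * standard_error


def is_equivalent(diff, standard_error, delta, alpha=0.05):
    """Equivalence: the entire interval lies strictly inside (-delta, delta)."""
    lower, upper = confidence_limits(diff, standard_error, alpha)
    return -delta < lower and upper < delta


def is_noninferior(diff, standard_error, delta, alpha=0.05):
    """Noninferiority: the upper one-sided limit is smaller than delta.

    Assumes diff is oriented so that larger values are worse for the investigational product.
    """
    _, upper = confidence_limits(diff, standard_error, alpha)
    return upper < delta


# Hypothetical estimate of the average difference with delta = 1.0.
print(confidence_limits(0.2, 0.2828))
print(is_equivalent(0.2, 0.2828, delta=1.0), is_noninferior(0.2, 0.2828, delta=1.0))
```

Noninferiority uses only the upper limit, so a trial can support noninferiority without supporting equivalence when the lower limit falls below −∆.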
Equivalence/noninferiority trials without inclusion of a placebo group are not internally valid and rely on external validation of the assumed sensitivity-to-drug-effects. Furthermore, the selection of equivalence limits is a very controversial issue that has recently sparked heated arguments between sponsors and regulatory agencies. See Jones et al. (1996), Rohmel (1998), Ebbut and Firth (1998), Fisher et al. (2001), Fleming (2000), Siegel (2000), Temple and Ellenberg (2000), and Ellenberg and Temple (2000). More details are given in Chapter 7.