10.6. Example: A polymorphism in the estrogen receptor gene
Table 10.1. Effect of estrogen receptor genotype on age at diagnosis among 59 breast cancer patients (Parl et al., 1989).

                                          Genotype∗
                                  1.6/1.6       1.6/0.7       0.7/0.7       Total
Number of patients                14            29            16            59
Age at breast cancer diagnosis
  Mean                            64.643        64.379        50.375        60.644
  Standard deviation              11.18         13.26         10.64         13.49
  95% confidence interval
    Equation (10.3)               (58.1–71.1)   (59.9–68.9)   (44.3–56.5)
    Equation (10.4)               (58.2–71.1)   (59.3–69.4)   (44.7–56.0)   (57.1–64.2)

∗The numbers 0.7 and 1.6 identify the alleles of the estrogen receptor genes that were studied (see text). Patients were either homozygous for the 1.6 kb pattern allele (had two copies of the same allele), were heterozygous (had one copy of each allele), or were homozygous for the 0.7 kb pattern allele.
age at diagnosis does not vary with genotype, we perform a one-way analysis of variance on the ages of patients in these three groups using model (10.1).
In this analysis, $n = 59$, $k = 3$, and $\beta_1$, $\beta_2$ and $\beta_3$ represent the expected age of breast cancer diagnosis among patients with the 1.6/1.6, 1.6/0.7, and 0.7/0.7 genotypes, respectively. The estimates of these parameters are the average ages given in Table 10.1. The $F$ test from this analysis equals 7.86. This statistic has $k - 1 = 2$ and $n - k = 56$ degrees of freedom. The P value associated with this test equals 0.001. Hence, we can reject the null hypothesis that these three population means are equal.
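The F statistic quoted above can be checked from the summary statistics in Table 10.1 alone. The following is a minimal sketch rather than the analysis actually run for this example: it assumes that the rounded means and standard deviations in the table are accurate enough to rebuild the between-group and within-group sums of squares, and all variable names are illustrative. The P value uses the closed form for the upper tail of an F distribution with 2 numerator degrees of freedom, which avoids any dependence on a statistics library.

```python
# Rebuild the one-way ANOVA F test from the summary statistics in Table 10.1.
# Group sizes, means and standard deviations are copied from the table;
# the variable names are illustrative only.
n = [14, 29, 16]                    # 1.6/1.6, 1.6/0.7, 0.7/0.7
mean = [64.643, 64.379, 50.375]
sd = [11.18, 13.26, 10.64]

N = sum(n)                          # total number of patients
k = len(n)                          # number of groups
grand_mean = sum(ni * mi for ni, mi in zip(n, mean)) / N

# Between-group and within-group sums of squares.
ss_between = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, mean))
ss_within = sum((ni - 1) * si ** 2 for ni, si in zip(n, sd))

ms_between = ss_between / (k - 1)
mse = ss_within / (N - k)           # mean squared error
f_stat = ms_between / mse

# With 2 numerator degrees of freedom, the upper tail of the F distribution
# has the closed form (1 + 2F/d2)^(-d2/2), where d2 = n - k.
d2 = N - k
p_value = (1 + 2 * f_stat / d2) ** (-d2 / 2)

print(f"grand mean = {grand_mean:.3f}")
print(f"MSE = {mse:.2f}, F = {f_stat:.2f} on {k - 1} and {d2} df")
print(f"P = {p_value:.4f}")
```

Because the inputs are rounded, the results should agree with the values in the text only to about the precision reported there.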
The root MSE estimate of $\sigma$ from this analysis of variance is $s = \sqrt{147.25} = 12.135$. The critical value $t_{56,0.025}$ equals 2.003. Substituting these values into equation (10.3) gives that a 95% confidence interval for the age of diagnosis of women with the 1.6/0.7 genotype is $64.38 \pm 2.003 \times 12.135/\sqrt{29} = (59.9, 68.9)$. The within-group standard deviations shown in Table 10.1 are quite similar, and Bartlett’s test for equal standard deviations is not significant (P = 0.58). Hence, it is reasonable to use equation (10.3) rather than equation (10.4) to calculate the confidence intervals for the mean age at diagnosis for each genotype. In Table 10.1, these intervals are calculated for each genotype using both of these equations. Note that, in this example, these equations produce similar estimates. If the equal standard deviation assumption is true, then equation (10.3) will provide more accurate confidence intervals than equation (10.4) since it uses all of the data to calculate
Table 10.2. Comparison of mean age of breast cancer diagnosis among patients with the three estrogen receptor genotypes studied by Parl et al. (1989). The one-way analysis of variance of these data shows that there is a significant difference between the mean age of diagnosis among women with these three genotypes (P = 0.001).

                         Difference in                                  P value
                         mean age of      95% confidence
Comparison               diagnosis        interval              Eq. (10.6)    Rank-sum∗
1.6/0.7 vs. 1.6/1.6      −0.264           (−8.17 to 7.65)       0.95          0.96
0.7/0.7 vs. 1.6/1.6      −14.268          (−23.2 to −5.37)      0.002         0.003
0.7/0.7 vs. 1.6/0.7      −14.004          (−21.6 to −6.43)      <0.0005       0.002

∗Wilcoxon–Mann–Whitney rank-sum test
the common standard deviation estimate s. However, equation (10.4) is more robust than equation (10.3) since it does not make any assumptions about the standard deviation within each patient group.
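The comparison between the two kinds of interval can be reproduced from Table 10.1. The sketch below makes several assumptions that are not stated in this section: that equation (10.3) is the group mean plus or minus $t_{56,0.025}\, s/\sqrt{n_i}$ with the pooled $s$, that equation (10.4) is the group mean plus or minus $t_{n_i-1,0.025}\, s_i/\sqrt{n_i}$ with the group's own standard deviation $s_i$, and that Bartlett's statistic is computed with its usual correction factor. The critical values for 13, 28 and 15 degrees of freedom are taken from a standard t table, and the variable names are illustrative.

```python
import math

# Summary statistics from Table 10.1 (order: 1.6/1.6, 1.6/0.7, 0.7/0.7).
labels = ["1.6/1.6", "1.6/0.7", "0.7/0.7"]
n = [14, 29, 16]
mean = [64.643, 64.379, 50.375]
sd = [11.18, 13.26, 10.64]

N, k = sum(n), len(n)
pooled_var = sum((ni - 1) * si ** 2 for ni, si in zip(n, sd)) / (N - k)
s = math.sqrt(pooled_var)           # root MSE, about 12.135 in the text

# Bartlett's statistic from summary data (assumed standard form).
numerator = (N - k) * math.log(pooled_var) - sum(
    (ni - 1) * math.log(si ** 2) for ni, si in zip(n, sd)
)
correction = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
bartlett = numerator / correction
# With k - 1 = 2 degrees of freedom the chi-squared upper tail is exp(-x/2).
bartlett_p = math.exp(-bartlett / 2)

T_POOLED = 2.003                    # t_{56,0.025}, quoted in the text
T_GROUP = [2.160, 2.048, 2.131]     # t_{13}, t_{28}, t_{15} at 0.025 (t table)

ci_pooled, ci_separate = [], []
for ni, mi, si, ti in zip(n, mean, sd, T_GROUP):
    half_pooled = T_POOLED * s / math.sqrt(ni)    # assumed form of eq. (10.3)
    half_separate = ti * si / math.sqrt(ni)       # assumed form of eq. (10.4)
    ci_pooled.append((mi - half_pooled, mi + half_pooled))
    ci_separate.append((mi - half_separate, mi + half_separate))

print(f"Bartlett P = {bartlett_p:.2f}")
for lab, a, b in zip(labels, ci_pooled, ci_separate):
    print(f"{lab}: pooled ({a[0]:.1f}, {a[1]:.1f})  separate ({b[0]:.1f}, {b[1]:.1f})")
```

Under these assumptions the intervals should agree with the two confidence interval rows of Table 10.1 up to rounding, which is some evidence that the assumed forms of the two equations are the intended ones.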
The $F$ test from the analysis of variance permits us to reject the null hypothesis that the mean age of diagnosis is the same for each group. Hence, it is reasonable to investigate if there are pair-wise differences in these ages (see Section 10.2). This can be done using either independent $t$-tests or equation (10.6). For example, the difference in average age of diagnosis between women with the 0.7/0.7 genotype and those with the 1.6/1.6 genotype is $-14.268$. From equation (10.6), the $t$ statistic to test whether this difference is significantly different from zero is $t = -14.268/(12.135\sqrt{1/14 + 1/16}) = -3.21$. The P value for this statistic, which has 56 degrees of freedom, is 0.002. The 95% confidence interval for this difference using equation (10.7) is $-14.268 \pm 2.003 \times 12.135 \times \sqrt{1/14 + 1/16} = (-23.2, -5.37)$. Table 10.2 gives estimates of the difference between the mean ages of these three groups. In this table, confidence intervals are derived using equation (10.7) and P values are calculated using equation (10.6). It is clear that the age of diagnosis among women who are homozygous for the 0.7 kb pattern allele is less than that of women with the other two genotypes.
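The arithmetic behind the differences and confidence intervals in Table 10.2 can be sketched as follows. The code takes the root MSE (12.135) and critical value (2.003) quoted in the text, and assumes that equations (10.6) and (10.7) have the form used in the worked example above, that is, a standard error of $s\sqrt{1/n_i + 1/n_j}$. The function name `compare` and the dictionary layout are illustrative only. P values are not computed, since that would need a t distribution function.

```python
import math

S = 12.135        # root MSE quoted in the text
T_CRIT = 2.003    # t_{56,0.025} quoted in the text

# (number of patients, mean age at diagnosis) from Table 10.1.
groups = {
    "1.6/1.6": (14, 64.643),
    "1.6/0.7": (29, 64.379),
    "0.7/0.7": (16, 50.375),
}

def compare(a, b):
    """Return (difference, t statistic, CI lower, CI upper) for mean(a) - mean(b)."""
    n_a, m_a = groups[a]
    n_b, m_b = groups[b]
    diff = m_a - m_b
    se = S * math.sqrt(1 / n_a + 1 / n_b)   # pooled standard error of the difference
    return diff, diff / se, diff - T_CRIT * se, diff + T_CRIT * se

pairs = [("1.6/0.7", "1.6/1.6"), ("0.7/0.7", "1.6/1.6"), ("0.7/0.7", "1.6/0.7")]
results = {pair: compare(*pair) for pair in pairs}

for (a, b), (diff, t, lo, hi) in results.items():
    print(f"{a} vs. {b}: diff = {diff:.3f}, t = {t:.2f}, 95% CI ({lo:.2f} to {hi:.2f})")
```

Each comparison uses the same pooled $s$ and 56 degrees of freedom, which is what distinguishes this approach from three separate two-sample t-tests.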
Figure 10.1 shows box plots for the age at diagnosis for the three genotypes. The vertical lines under these plots indicate the ages of diagnosis. The number of line segments equals the number of women diagnosed at each age. Although these plots are mildly asymmetric, they indicate that these age distributions are sufficiently close to normal to justify the analysis of variance given above. Of course, the Kruskal–Wallis analysis of variance is
[Figure 10.1: box plots for the 0.7/0.7, 1.6/0.7, and 1.6/1.6 genotypes; horizontal axis: Age at Breast Cancer Diagnosis, 35 to 85.]
Figure 10.1 Box plots of age at breast cancer diagnosis subdivided by estrogen receptor genotype in the study by Parl et al. (1989). The vertical lines under each box plot mark the actual ages of diagnosis. The number of line segments equals the number of women diagnosed at each age. Women who were homozygous for the 0.7 pattern allele had a significantly younger age of breast cancer diagnosis than did women in the other two groups.
also valid and avoids these normality assumptions. The Kruskal–Wallis test statistic for these data is $H = 12.1$. Under the null hypothesis that the age distributions of the three patient groups are the same, $H$ will have a chi-squared distribution with $k - 1 = 2$ degrees of freedom. The P value for this test is 0.0024, which allows us to reject this hypothesis. Note that this P value is larger (less statistically significant) than that obtained from the analogous conventional analysis of variance. This illustrates the slight loss of statistical power of the Kruskal–Wallis test, which is the cost of avoiding the normality assumptions of the conventional analysis of variance. Table 10.2 also gives the P values from pair-wise comparisons of the three groups using the Wilcoxon–Mann–Whitney rank sum test. These tests lead to the same conclusions that we obtained from the conventional analysis of variance.
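The Kruskal–Wallis P value can be checked directly from $H$, because the upper tail of a chi-squared distribution with 2 degrees of freedom has the simple closed form $e^{-x/2}$. The short sketch below does this and, for comparison, recomputes the P value of the conventional F test from the statistic quoted earlier (7.86 on 2 and 56 degrees of freedom). It works only from the reported test statistics, not from the patient data, and the variable names are illustrative.

```python
import math

H = 12.1                 # Kruskal-Wallis statistic reported in the text
# Upper tail of a chi-squared distribution with 2 degrees of freedom.
p_kruskal = math.exp(-H / 2)

F, d2 = 7.86, 56         # F statistic and denominator df reported in the text
# Upper tail of an F distribution with 2 numerator degrees of freedom.
p_anova = (1 + 2 * F / d2) ** (-d2 / 2)

print(f"Kruskal-Wallis P = {p_kruskal:.4f}")
print(f"ANOVA F test P   = {p_anova:.4f}")
print("Kruskal-Wallis P is larger:", p_kruskal > p_anova)
```

Both closed forms apply only because there are three groups, so that $k - 1 = 2$; with more groups a chi-squared or F distribution function from a statistics package would be needed.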