Similar caveats hold for the parametric ANOVA approach to the analysis of two-factor experimental design with two additions:
1. The sample sizes must be the same in each cell; that is, the design must be balanced.
2. A test for interaction must precede any test for main effects.
Imbalance in the design will result in the confounding of main effects with interactions. Consider the following two-factor model for crop yield:

$X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$

Now suppose that the observations in a two-factor experimental design are normally distributed as in the following diagram taken from Cornfield and Tukey (1956):

N(0,1)   N(2,1)
N(2,1)   N(0,1)

There are no main effects in this example; both row means and both column means have the same expectations, but there is a clear interaction represented by the two nonzero off-diagonal elements.

If the design is balanced, with equal numbers per cell, the lack of significant main effects and the presence of a significant interaction should and will be confirmed by our analysis. But suppose that the design is not in balance, that for every 10 observations in the first column, we have only one observation in the second. Because of this imbalance, when we use the F ratio or equivalent statistic to test for the main effect, we will uncover a false “row” effect that is actually due to the interaction between rows and columns. The main effect is confounded with the interaction.

If a design is unbalanced as in the preceding example, we cannot test for a “pure” main effect or a “pure” interaction. But we may be able to test for the combination of a main effect with an interaction by using the statistic that we would use to test for the main effect alone. This combined effect will not be confounded with the main effects of other unrelated factors.

Whether or not the design is balanced, the presence of an interaction may zero out a cofactor-specific main effect or make such an effect impossible to detect. More important, the presence of a significant interaction may render the concept of a single “main effect” meaningless. For example, suppose we decide to test the effect of fertilizer and sunlight on plant growth. With too little sunlight, a fertilizer would be completely ineffective. Its effects only appear when sufficient sunlight is present.
Aspirin and warfarin can both reduce the likelihood of repeated heart attacks when used alone; you don’t want to mix them!
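The confounding produced by imbalance can be seen with simple arithmetic on the cell means of the Cornfield and Tukey diagram. The sketch below (the 10-to-1 ratio is the imbalance described above) compares the expected row means under a balanced and an unbalanced allocation:

```python
# Cell means from the Cornfield-Tukey diagram: no true row or column effect,
# but a pure interaction (the nonzero off-diagonal cells).
cell_mean = [[0.0, 2.0],   # row 1: N(0,1), N(2,1)
             [2.0, 0.0]]   # row 2: N(2,1), N(0,1)

def row_means(n_col1, n_col2):
    """Expected observed row means when each row has n_col1 observations
    in column 1 and n_col2 observations in column 2."""
    n = n_col1 + n_col2
    return [(n_col1 * row[0] + n_col2 * row[1]) / n for row in cell_mean]

balanced = row_means(10, 10)   # equal cell counts: both rows expect 1.0
unbalanced = row_means(10, 1)  # 10 observations in column 1 per 1 in column 2

print(balanced)    # identical row expectations: no row effect
print(unbalanced)  # row expectations now differ: a spurious "row" effect
```

The pooled row means differ only because the unequal cell counts weight the interaction cells unequally; the interaction has leaked into the apparent row effect.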
Gunter Hartel offers the following example: Using five observations per cell and random normals as indicated in Cornfield and Tukey’s diagram, a two-way ANOVA without interaction yields the following results:
Source   df   Sum of Squares   F Ratio   Prob > F
Row       1       0.15590273    0.0594     0.8104
Col       1       0.10862944    0.0414     0.8412
Error    17      44.639303
Adding the interaction term yields
Source    df   Sum of Squares   F Ratio   Prob > F
Row        1        0.155903    0.1012     0.7545
Col        1        0.108629    0.0705     0.7940
Row*Col    1       19.986020   12.9709     0.0024
Error     16       24.653283
When the first row of the experiment is expanded to 80 observations rather than 10, the main-effects-only table becomes
Source   df   Sum of Squares   F Ratio   Prob > F
Row       1        0.080246    0.0510     0.8218
Col       1       57.028458   36.2522     <.0001
Error    88      138.43327
But with the interaction term it is:
Source    df   Sum of Squares   F Ratio   Prob > F
Row        1        0.075881    0.0627     0.8029
Col        1        0.053909    0.0445     0.8333
Row*Col    1       33.145790   27.3887     <.0001
Error     87      105.28747
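Hartel's balanced case can be reproduced qualitatively with a short simulation. The sketch below (pure Python, with an arbitrary seed rather than Hartel's actual draws) generates five observations per cell from the Cornfield and Tukey cells and computes the standard balanced two-way sums of squares; in a balanced design the four components decompose the total sum of squares exactly, which is what makes the separate tests possible:

```python
import random

random.seed(1)  # arbitrary seed; Hartel's exact numbers came from different draws
MEANS = [[0.0, 2.0], [2.0, 0.0]]  # Cornfield-Tukey cell means
N = 5                             # observations per cell

data = [[[random.gauss(MEANS[i][j], 1.0) for _ in range(N)]
         for j in range(2)] for i in range(2)]

grand = sum(x for row in data for cell in row for x in cell) / (4 * N)
row_mean = [sum(x for cell in row for x in cell) / (2 * N) for row in data]
col_mean = [sum(x for row in data for x in row[j]) / (2 * N) for j in range(2)]
cell_mean = [[sum(c) / N for c in row] for row in data]

# Balanced two-way ANOVA sums of squares
ss_row = 2 * N * sum((m - grand) ** 2 for m in row_mean)
ss_col = 2 * N * sum((m - grand) ** 2 for m in col_mean)
ss_int = N * sum((cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                 for i in range(2) for j in range(2))
ss_err = sum((x - cell_mean[i][j]) ** 2
             for i in range(2) for j in range(2) for x in data[i][j])
ss_tot = sum((x - grand) ** 2 for row in data for cell in row for x in cell)

print(f"Row {ss_row:.4f}  Col {ss_col:.4f}  "
      f"Row*Col {ss_int:.4f}  Error {ss_err:.4f}")
```

With these cell means the interaction component will typically dwarf the row and column components, just as in Hartel's second table.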
Independent Tests
Normally distributed random variables (as in Figure 7.1) have some remarkable properties:
• The sum (or difference) of two independent normally distributed random variables is a normally distributed random variable.
• The square of a normally distributed random variable has the chi- square distribution (to within a multiplicative constant); the sum of two variables with the chi-square distribution also has a chi- square distribution (with additional degrees of freedom).
• A variable with the chi-square distribution can be decomposed into the sum of several independent chi-square variables.
As a consequence of these properties, the variance of a sum of independent normally distributed random variables can be decomposed into the sum of a series of independent chi-square variables. We use these independent variables in the analysis of variance (ANOVA) to construct a series of independent tests of the model parameters.
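These properties are easy to check empirically; the following Monte Carlo sketch (arbitrary seed and sample size) verifies that the square of a standard normal behaves like a chi-square variable with one degree of freedom (mean 1, variance 2) and that the sum of two such squares behaves like a chi-square with two degrees of freedom (mean 2):

```python
import random

random.seed(0)  # arbitrary seed
n = 100_000
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]

# z**2 should behave like chi-square with 1 df: mean 1, variance 2
sq = [z * z for z in z1]
mean_sq = sum(sq) / n
var_sq = sum((s - mean_sq) ** 2 for s in sq) / n

# z1**2 + z2**2 should behave like chi-square with 2 df: mean 2
mean_sum = sum(a * a + b * b for a, b in zip(z1, z2)) / n

print(round(mean_sq, 2), round(var_sq, 2), round(mean_sum, 2))
```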
Unfortunately, even slight deviations from normality negate these properties; not only are ANOVA p-values in error because they are taken from the wrong distribution, but they are in error because the various tests are interdependent.
When constructing a permutation test for multifactor designs, we must also proceed with great caution for fear that the resulting tests will be interdependent.
The residuals in a two-way complete experimental design are not exchangeable, even if the design is balanced, as they are both correlated and functions of all the data (Lehmann and D’Abrera, 1988). To see this, suppose our model is $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, where $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$. Eliminating the main effects in the traditional manner, that is, setting $X'_{ijk} = X_{ijk} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X}_{...}$, one obtains the test statistic

$I = \sum_i \sum_j \Big( \sum_k X'_{ijk} \Big)^2,$

first derived by Still and White [1981]. A permutation test based on the statistic $I$ will not be exact because even if the error terms $\{\varepsilon_{ijk}\}$ are exchangeable, the residuals $X'_{ijk} = \varepsilon_{ijk} - \bar{\varepsilon}_{i..} - \bar{\varepsilon}_{.j.} + \bar{\varepsilon}_{...}$ are weakly correlated, with the correlation depending on the subscripts.

Nonetheless, the literature is filled with references to permutation tests for the two-way and higher-order designs that produce misleading values. Included in this category are those permutation tests based on the ranks of the observations that may be found in many statistics software packages.
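The alignment and the Still and White statistic are straightforward to compute; the sketch below uses made-up data for a 2 × 2 design with two observations per cell (it is the weak correlation among the aligned residuals, not the computation itself, that spoils exactness):

```python
# X[i][j] is the list of observations in cell (i, j) of an I x J design.
X = [[[1.0, 2.0], [5.0, 4.0]],
     [[3.0, 3.0], [0.0, 1.0]]]

I, J, K = 2, 2, 2
grand = sum(x for row in X for cell in row for x in cell) / (I * J * K)
row_bar = [sum(x for cell in row for x in cell) / (J * K) for row in X]
col_bar = [sum(x for row in X for x in row[j]) / (I * K) for j in range(J)]

# Aligned residuals X'_ijk = X_ijk - row mean - column mean + grand mean
resid = [[[X[i][j][k] - row_bar[i] - col_bar[j] + grand for k in range(K)]
          for j in range(J)] for i in range(I)]

# Still and White's statistic: sum over cells of (sum of residuals)^2
I_stat = sum(sum(resid[i][j]) ** 2 for i in range(I) for j in range(J))
print(I_stat)
```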
The recent efforts of Salmaso [2003] and Pesarin [2001] have resulted in a breakthrough that extends to higher-order designs. The key lies in the concept of weak exchangeability with respect to a subset of the possible permutations. The simplified discussion of weak exchangeability presented here is abstracted from Good [2003].
Think of the set of observations $\{X_{ijk}\}$ in terms of a rectangular lattice $L$ with $K$ colored, shaped balls at each vertex. All the balls in the same column have the same color initially, a color which is distinct from the color of the balls in any other column. All the balls in the same row have the same shape initially, a shape which is distinct from the shape of the balls in any other row. See Fig. 5.1.
Let $P$ denote the set of rearrangements or permutations that preserve the number of balls at each row and column of the lattice. $P$ is a group.7
Let $P_R$ denote the set of exchanges of balls among rows and within columns which (a) preserve the number of balls at each row and column of the lattice and (b) result in the numbers of each shape within each row being the same in each column. $P_R$ is the basis of a subgroup of $P$. See Fig. 5.2.
Let $P_C$ denote the set of exchanges of balls among columns and within rows which (a) preserve the number of balls at each row and column of the lattice and (b) result in the numbers of each color within each column being the same in each row. $P_C$ is the basis of a subgroup of $P$. See Fig. 5.3.
Let $P_{RC}$ denote the set of exchanges of balls that preserve the number of balls at each row and column of the lattice and which result in (a) an exchange of balls between both rows and columns (or no exchange at all), (b) the numbers of each color within each column being the same in each row, and (c) the numbers of each shape within each row being the same in each column. $P_{RC}$ is the basis of a subgroup of $P$.

FIGURE 5.1 A 2 × 3 Design with Three Observations per Cell.

FIGURE 5.2 A 2 × 3 Design with Three Observations per Cell after $p \in P_R$.

7 See Hungerford [1974] or http://members.tripod.com/~dogschool/ for a thorough discussion of algebraic group properties.
The only element these three subgroups $P_{RC}$, $P_R$, and $P_C$ have in common is the rearrangement that leaves the observations with the same row and column labels they had to begin with. As a result, tests based on these three different subsets of permutations are independent of one another.
For testing $H_3$: $\gamma_{ij} = 0$ for all $i$ and $j$, determine the distribution of the values of $S = \sum_{1 \le i < i' \le I_1} \sum_{1 \le j < j' \le I_2} (X_{ij} + X_{i'j'} - X_{i'j} - X_{ij'})$ with respect to the rearrangements in $P_{RC}$. If the value of $S$ for the observations as they were originally labeled is not an extreme value of this permutation distribution, then we can accept the hypothesis $H_3$ of no interactions and proceed to test for main effects.
For testing $H_1$: $\alpha_i = 0$ for all $i$, choose one of the following test statistics as we did in the section on one-way analysis, $F_{12} = \sum_i (\sum_j \sum_k x_{ijk})^2$, $F_{11} = \sum_i |\sum_j \sum_k x_{ijk}|$, or $R_1 = \sum_i g[i] \sum_j \sum_k x_{ijk}$, where $g[i]$ is a monotone function of $i$, and determine the distribution of its values with respect to the rearrangements in $P_R$.

For testing $H_2$: $\beta_j = 0$ for all $j$, choose one of the following test statistics as we did in the section on one-way analysis, $F_{22} = \sum_j (\sum_i \sum_k x_{ijk})^2$, $F_{21} = \sum_j |\sum_i \sum_k x_{ijk}|$, or $R_2 = \sum_j g[j] \sum_i \sum_k x_{ijk}$, where $g[j]$ is a monotone function of $j$, and determine the distribution of its values with respect to the rearrangements in $P_C$.
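The statistics themselves are simple functions of the cell and row totals. The sketch below evaluates $S$, $F_{12}$, $F_{11}$, and $R_1$ on a hypothetical 2 × 2 layout of cell totals (generating and enumerating the rearrangements in $P_{RC}$ and $P_R$, which is the hard part, is omitted here; $g[i] = i + 1$ is just one monotone choice):

```python
# cell[i][j] holds the total of the observations in cell (i, j); made-up data
cell = [[3.0, 9.0],
        [6.0, 1.0]]
I1, I2 = 2, 2  # number of rows, number of columns

# Interaction statistic S: sum of the 2x2 interaction contrasts
S = sum(cell[i][j] + cell[ip][jp] - cell[ip][j] - cell[i][jp]
        for i in range(I1) for ip in range(i + 1, I1)
        for j in range(I2) for jp in range(j + 1, I2))

# Row-effect statistics, built from the row totals
row_tot = [sum(cell[i]) for i in range(I1)]
F12 = sum(t ** 2 for t in row_tot)
F11 = sum(abs(t) for t in row_tot)
R1 = sum((i + 1) * row_tot[i] for i in range(I1))  # g[i] = i + 1, hypothetical

print(S, F12, F11, R1)
```

A permutation test compares each statistic's original value against its distribution over the appropriate set of rearrangements.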
Tests for the parameters of three-way and higher-order experimental designs can be obtained via the same approach; use a multidimensional lattice and such additional multivalued properties of the balls as charm and spin. Proofs may be seen at http://users.oco.net/drphilgood/resamp.htm.
Unbalanced Designs
Unbalanced designs with unequal numbers per cell may result from unanticipated losses during the conduct of an experiment or survey (or from an extremely poor initial design). There are two approaches to their analysis:
FIGURE 5.3 A 2 × 3 Design with Three Observations per Cell after $p \in P_C$.
First, we might make use of all the observations, recognizing that the results may be somewhat tainted. Permutation tests can be applied to unbalanced as well as balanced experimental designs, providing only that there are sufficient observations in each cell to avoid confounding of the main effects and interactions. Even in the latter case, exact permutation tests are available; see Pesarin [2001, p. 237].

Second, we might bootstrap along one of the following lines:
• If only one or two observations are missing, create a balanced design by discarding observations at random; repeat to obtain a distribution of p-values (Baker, 1995).
• If there are actual holes in the design, so that there are missing combinations, create a test statistic that does not require the missing data. Obtain its distribution by bootstrap means. See Good [2000, pp. 68–70] for an example.
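Baker's suggestion can be sketched as follows. The data, seed, and test statistic below are all hypothetical, and the per-resample p-value comes from a plain Monte Carlo permutation of observations between rows within columns; it is illustrative only, not the exact synchronized scheme discussed earlier:

```python
import random

random.seed(2)  # arbitrary seed

# Hypothetical unbalanced 2 x 2 layout: cell[(row, col)] -> observations
cell = {(0, 0): [4.1, 3.8, 5.0, 4.4], (0, 1): [6.2, 5.9, 6.5],
        (1, 0): [4.0, 4.6],           (1, 1): [6.0, 6.8, 6.1, 5.7]}

def row_stat(c):
    """|difference of row means| -- a crude row-effect statistic."""
    r0 = [x for (i, _), xs in c.items() if i == 0 for x in xs]
    r1 = [x for (i, _), xs in c.items() if i == 1 for x in xs]
    return abs(sum(r0) / len(r0) - sum(r1) / len(r1))

def perm_pvalue(c, n_perm=200):
    """Monte Carlo p-value, permuting observations between rows within
    each column (illustrative; not an exact test)."""
    obs, hits = row_stat(c), 0
    for _ in range(n_perm):
        shuffled = {}
        for j in (0, 1):
            pooled = c[(0, j)] + c[(1, j)]
            random.shuffle(pooled)
            shuffled[(0, j)] = pooled[:len(c[(0, j)])]
            shuffled[(1, j)] = pooled[len(c[(0, j)]):]
        if row_stat(shuffled) >= obs:
            hits += 1
    return hits / n_perm

# Baker's idea: balance the design by discarding observations at random,
# and repeat to obtain a distribution of p-values.
n_min = min(len(v) for v in cell.values())
pvalues = []
for _ in range(20):
    balanced = {k: random.sample(v, n_min) for k, v in cell.items()}
    pvalues.append(perm_pvalue(balanced))

print(sorted(pvalues))
```

The spread of the resulting p-values indicates how sensitive the conclusion is to which observations happen to be discarded.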