C5 Analysis of Variance


Analysis of Variance: Weight Gain, Foster Feeding in Rats, Water Hardness and Male Egyptian Skulls

5.1 Introduction

The data in Table 5.1 (from Hand et al., 1994) arise from an experiment to study the gain in weight of rats fed on four different diets, distinguished by amount of protein (low and high) and by source of protein (beef and cereal). Ten rats are randomised to each of the four treatments and the weight gain in grams is recorded. The question of interest is how diet affects weight gain.

Table 5.1: Rat weight gain for diets differing by the amount of protein (type) and source of protein (source). Columns: source, type, weightgain.


The data in Table 5.2 are from a foster feeding experiment with rat mothers and litters of four different genotypes: A, B, I and J (Hand et al., 1994). The measurement is the litter weight (in grams) after a trial feeding period. Here the investigator’s interest lies in uncovering the effect of genotype of mother and litter on litter weight.

Table 5.2: Foster feeding experiment for rats with different genotypes of the litter (litgen) and mother (motgen). Columns: litgen, motgen, weight.


The data in Table 5.3 (from Hand et al., 1994) give four measurements made on Egyptian skulls from five epochs. The data were collected with a view to deciding if there are any differences between the skulls from the five epochs. The measurements are:

mb: maximum breadths of the skull,

bh: basibregmatic heights of the skull,

bl: basialveolar length of the skull, and

nh: nasal heights of the skull.

Non-constant measurements of the skulls over time would indicate interbreeding with immigrant populations.

Table 5.3: Measurements of four variables taken from Egyptian skulls of five periods.

epoch     mb   bh   bl   nh
c4000BC  131  138   89   49
c4000BC  125  131   92   48
c4000BC  131  132   99   50
c4000BC  119  132   96   44
c4000BC  136  143  100   54
c4000BC  138  137   89   56
c4000BC  139  130  108   48
c4000BC  125  136   93   48
c4000BC  131  134  102   51
c4000BC  134  134   99   51
c4000BC  129  138   95   50
c4000BC  134  121   95   53
c4000BC  126  129  109   51
c4000BC  132  136  100   50
c4000BC  141  140  100   51
c4000BC  131  134   97   54
c4000BC  135  137  103   50
c4000BC  132  133   93   53
c4000BC  139  136   96   50
c4000BC  132  131  101   49
c4000BC  126  133  102   51
c4000BC  135  135  103   47
c4000BC  134  124   93   53
...


5.2 Analysis of Variance

For each of the data sets described in the previous section, the question of interest involves assessing whether certain populations differ in mean value for, in Tables 5.1 and 5.2, a single variable, and in Table 5.3, for a set of four variables. In the first two cases we shall use analysis of variance (ANOVA) and in the last multivariate analysis of variance (MANOVA) for the analysis of the data. Both Tables 5.1 and 5.2 are examples of factorial designs, with the factors in the first data set being amount of protein with two levels, and source of protein also with two levels. In the second, the factors are the genotype of the mother and the genotype of the litter, both with four levels. The analysis of each data set can be based on the same model (see below) but the two data sets differ in that the first is balanced, i.e., there are the same number of observations in each cell, whereas the second is unbalanced, having different numbers of observations in the 16 cells of the design. This distinction leads to complications in the analysis of the unbalanced design that we will come to in the next section. But the model used in the analysis of each is

$$y_{ijk} = \mu + \gamma_i + \beta_j + (\gamma\beta)_{ij} + \varepsilon_{ijk}$$

where $y_{ijk}$ represents the $k$th measurement made in cell $(i, j)$ of the factorial design, $\mu$ is the overall mean, $\gamma_i$ is the main effect of the first factor, $\beta_j$ is the main effect of the second factor, $(\gamma\beta)_{ij}$ is the interaction effect of the two factors and $\varepsilon_{ijk}$ is the residual or error term assumed to have a normal distribution with mean zero and variance $\sigma^2$. In R, the model is specified by a model formula. The two-way layout with interactions specified above reads

y ~ a + b + a:b

where the variable a is the first and the variable b is the second factor. The interaction term $(\gamma\beta)_{ij}$ is denoted by a:b. An equivalent model formula is

y ~ a * b

Note that the mean $\mu$ is implicitly defined in the formula shown above. In case $\mu = 0$, one needs to remove the intercept term from the formula explicitly, i.e.,

y ~ a + b + a:b - 1

For a more detailed description of model formulae we refer to R Development Core Team (2009a) and help("lm").
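The equivalence of the two spellings can be checked directly, since R expands a formula into its terms without needing any data. The sketch below uses the placeholder names y, a and b from the text; they do not refer to a real data set.

```r
# Expand both spellings of the two-way layout and compare their terms.
f_long  <- y ~ a + b + a:b
f_short <- y ~ a * b

attr(terms(f_long),  "term.labels")
attr(terms(f_short), "term.labels")

# The intercept is part of the model unless it is removed with "- 1".
attr(terms(f_short), "intercept")               # 1: intercept present
attr(terms(y ~ a + b + a:b - 1), "intercept")   # 0: intercept removed
```

Both formulas expand to the same main effects and the same interaction term, so they describe the same model.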

The model as specified above is overparameterised, i.e., there are infinitely many solutions to the corresponding estimation equations, and so the parameters have to be constrained in some way, commonly by requiring them to sum to zero; see Everitt (2001) for a full discussion. The analysis of the rat weight gain data below explains some of these points in more detail (see also Chapter 6).
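A small sketch may make the overparameterisation concrete. It builds the unconstrained design matrix for a hypothetical 2 x 2 layout (the factor names a and b and their level labels are illustrative only) and compares the number of parameters with the rank of the matrix.

```r
# One observation per cell of a 2 x 2 layout is enough to see the problem.
cells <- expand.grid(a = factor(c("a1", "a2")), b = factor(c("b1", "b2")))

# Unconstrained design: one column for mu, one per level of each factor,
# and one per cell for the interaction.
X_full <- cbind(
  mu = 1,
  model.matrix(~ a - 1, data = cells),
  model.matrix(~ b - 1, data = cells),
  model.matrix(~ interaction(a, b) - 1, data = cells)
)
ncol(X_full)      # number of parameters in the unconstrained model
qr(X_full)$rank   # number of linearly independent columns

# The design matrix R builds by default (treatment contrasts) has full rank.
X_trt <- model.matrix(~ a * b, data = cells)
ncol(X_trt)
qr(X_trt)$rank
```

The unconstrained matrix has nine columns but only four of them are linearly independent, one per cell, which is why constraints are needed before the parameters can be estimated.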

The model given above leads to a partition of the variation in the observations into parts due to main effects and interaction plus an error term that enables a series of F-tests to be calculated that can be used to test hypotheses about the main effects and the interaction. These calculations are generally set out in the familiar analysis of variance table. The assumptions made in deriving the F-tests are:

• The observations are independent of each other,

• The observations in each cell arise from a population having a normal distribution, and

• The observations in each cell are from populations having the same variance.
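The second and third assumptions can be examined informally once a model has been fitted; independence depends on the design of the experiment and cannot be checked this way. The sketch below is one possible set of checks, not a procedure prescribed by the text. It uses the warpbreaks data that ship with R as a stand-in two-factor data set, so that it runs without extra packages.

```r
# Stand-in data: warpbreaks ships with R and has two factors
# (wool, tension) with nine observations per cell.
fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# Equal variances across the six cells: compare cell standard deviations
# and run Bartlett's test on the cells.
with(warpbreaks, tapply(breaks, list(wool, tension), sd))
cell_var_test <- bartlett.test(breaks ~ interaction(wool, tension),
                               data = warpbreaks)
cell_var_test

# Normality: test the residuals from the fitted model.
resid_norm_test <- shapiro.test(residuals(fit))
resid_norm_test
```

A normal quantile plot of residuals(fit), drawn with qqnorm and qqline, gives a graphical view of the same normality question.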

The multivariate analysis of variance, or MANOVA, is an extension of the univariate analysis of variance to the situation where a set of variables is measured on each individual or object observed. For the data in Table 5.3 there is a single factor, epoch, and four measurements taken on each skull; so we have a one-way MANOVA design. The linear model used in this case is

$$y_{ijh} = \mu_h + \gamma_{jh} + \varepsilon_{ijh}$$

where $\mu_h$ is the overall mean for variable $h$, $\gamma_{jh}$ is the effect of the $j$th level of the single factor on the $h$th variable, and $\varepsilon_{ijh}$ is a random error term. The vector $\varepsilon_{ij}^\top = (\varepsilon_{ij1}, \varepsilon_{ij2}, \ldots, \varepsilon_{ijq})$, where $q$ is the number of response variables (four in the skull example), is assumed to have a multivariate normal distribution with null mean vector and covariance matrix $\Sigma$, assumed to be the same in each level of the grouping factor. The hypothesis of interest is that the population mean vectors for the different levels of the grouping factor are the same.

In the multivariate situation, when there are more than two levels of the grouping factor, no single test statistic can be derived which is always the most powerful for all types of departures from the null hypothesis of the equality of mean vectors. A number of different test statistics are available which may give different results when applied to the same data set, although the final conclusion is often the same. The principal test statistics for the multivariate analysis of variance are the Hotelling-Lawley trace, Wilks’ ratio of determinants, Roy’s greatest root, and the Pillai trace. Details are given in Morrison (2005).
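All four statistics are functions of the same eigenvalues, those of the hypothesis sums-of-products matrix relative to the error matrix. The sketch below recomputes them by hand and compares them with the values reported by summary() for a manova fit. It uses the built-in iris data as a stand-in for a one-way design with four responses. Note that R reports Roy's statistic as the largest eigenvalue itself, whereas some texts use a different scaling.

```r
# Stand-in data: iris ships with R (four responses, one grouping factor).
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)

# Eigenvalues for the Species term, as reported by summary().
eig <- summary(fit)$Eigenvalues[1, ]

by_hand <- c(
  Pillai             = sum(eig / (1 + eig)),
  Wilks              = prod(1 / (1 + eig)),
  "Hotelling-Lawley" = sum(eig),
  Roy                = max(eig)
)

# The same four statistics as computed by summary() for each test option.
from_summary <- sapply(names(by_hand),
                       function(test) summary(fit, test = test)$stats[1, 2])

cbind(by_hand, from_summary)
```

The two columns agree, which shows that the choice of test changes only how the eigenvalues are combined, not the underlying fit.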

5.3 Analysis Using R

5.3.1 Weight Gain in Rats

Before applying analysis of variance to the data in Table 5.1 we should try to summarise the main features of the data by calculating means and standard deviations and by producing some hopefully informative graphs. The data are available in the data.frame weightgain. The following R code produces the required summary statistics.

R> data("weightgain", package = "HSAUR2")

R> tapply(weightgain$weightgain,

+ list(weightgain$source, weightgain$type), mean)

R> plot.design(weightgain)

Figure 5.1 Plot of mean weight gain for each level of the two factors. [Plot not reproduced; the factors shown are source (Beef, Cereal) and type (High, Low).]

R> tapply(weightgain$weightgain,

+ list(weightgain$source, weightgain$type), sd)

Cereal 15.02184 15.70881

The cell variances are relatively similar and there is no apparent relationship between cell mean and cell variance, so the homogeneity assumption of the analysis of variance looks reasonable for these data. The plot of cell means in Figure 5.1 suggests that there is a considerable difference in weight gain for the amount of protein factor, with the gain for the high-protein diet being far more than for the low-protein diet. A smaller difference is seen for the source factor, with beef leading to a higher gain than cereal.

To apply analysis of variance to the data we can use the aov function in R and then the summary method to give us the usual analysis of variance table. The model formula specifies a two-way layout with interaction terms, where the first factor is source, and the second factor is type.

R> wg_aov <- aov(weightgain ~ source * type, data = weightgain)

R> summary(wg_aov)

Figure 5.2 R output of the ANOVA fit for the weightgain data

The resulting analysis of variance table in Figure 5.2 shows that the main effect of type is highly significant, confirming what was seen in Figure 5.1. The main effect of source is not significant. But interpretation of both these main effects is complicated by the type × source interaction, which approaches significance at the 5% level. To try to understand this interaction effect it will be useful to plot the mean weight gain for low- and high-protein diets for each level of source of protein, beef and cereal. The required R code is given with Figure 5.3. From the resulting plot we see that for low-protein diets, the use of cereal as the source of the protein leads to a greater weight gain than using beef. For high-protein diets the reverse is the case, with the beef/high diet leading to the highest weight gain.

The estimates of the intercept and the main and interaction effects can be extracted from the model fit by

R> coef(wg_aov)

sourceCereal:typeLow

18.8

Note that the model was fitted with the restrictions $\gamma_1 = 0$ (corresponding to Beef) and $\beta_1 = 0$ (corresponding to High) because treatment contrasts were used as default, as can be seen from

R> options("contrasts")

$contrasts

R> interaction.plot(weightgain$type, weightgain$source,
+                  weightgain$weightgain)

Figure 5.3 Interaction plot of type and source. [Plot not reproduced; the horizontal axis is weightgain$type and the traces are the levels of weightgain$source (Beef, Cereal).]

Thus, the coefficient for source of −14.1 can be interpreted as an estimate of the difference $\gamma_2 - \gamma_1$. Alternatively, we can use the restriction $\sum_i \gamma_i = 0$ by

R> coef(aov(weightgain ~ source + type + source:type,

+ data = weightgain, contrasts = list(source = contr.sum)))

source1:typeLow

-9.40
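The two codings describe the same fitted model and differ only in how the effects are parameterised. The sketch below illustrates this on simulated data; the data frame sim and its column gain are hypothetical stand-ins for the weightgain data, with the same factor names and level labels so that the coefficient names match those shown above.

```r
# Hypothetical balanced 2 x 2 data standing in for weightgain.
set.seed(1)
sim <- expand.grid(rep = 1:10,
                   source = factor(c("Beef", "Cereal")),
                   type = factor(c("High", "Low")))
sim$gain <- rnorm(nrow(sim), mean = 90, sd = 15)

fit_trt <- aov(gain ~ source * type, data = sim)
fit_sum <- aov(gain ~ source * type, data = sim,
               contrasts = list(source = contr.sum))

coef(fit_trt)
coef(fit_sum)

# Same fitted cell means, different parameterisation.
all.equal(fitted(fit_trt), fitted(fit_sum))

# With two levels, a sum-coded source effect is minus half the
# corresponding treatment-coded effect.
coef(fit_sum)["source1:typeLow"]
-coef(fit_trt)["sourceCereal:typeLow"] / 2
```

This is the same relation seen in the output above, where the interaction coefficient of 18.8 under treatment contrasts becomes -9.40 under the sum-to-zero restriction.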

5.3.2 Foster Feeding of Rats of Different Genotype

As in the previous subsection we will begin the analysis of the foster feeding data in Table 5.2 with a plot of the mean litter weight for the different types of mother and litter (see Figure 5.4). The data are in the data.frame foster

R> data("foster", package = "HSAUR2")

R> plot.design(foster)

Figure 5.4 Plot of mean litter weight for each level of the two factors for the foster data. [Plot not reproduced; each factor has the levels A, B, I and J.]

Figure 5.4 indicates that differences in litter weight for the four levels of mother’s genotype are substantial; the corresponding differences for the genotype of the litter are much smaller.

As in the previous example we can now apply analysis of variance using the aov function, but there is a complication caused by the unbalanced nature of the data. Here, where there are unequal numbers of observations in the 16 cells of the two-way layout, it is no longer possible to partition the variation in the data into non-overlapping or orthogonal sums of squares representing main effects and interactions. In an unbalanced two-way layout with factors A and B there is a proportion of the variance of the response variable that can be attributed to either A or B. The consequence is that A and B together explain less of the variation of the dependent variable than the sum of what each explains alone. The result is that the sum of squares corresponding to a factor depends on which other terms are currently in the model for the observations, so the sums of squares depend on the order in which the factors are considered and represent a comparison of models. For example, for the order a, b, a × b, the sums of squares are such that

• SSa: compares the model containing only the a main effect with one containing only the overall mean.

• SSb|a: compares the model including both main effects, but no interaction, with one including only the main effect of a.

• SSab|a, b: compares the model including an interaction and main effects with one including only main effects.
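The three comparisons listed above can be reproduced by fitting the nested models explicitly. The following sketch does so on simulated data; the data frame d, its cell sizes and its effect sizes are hypothetical and chosen only to give an unbalanced layout.

```r
# Hypothetical unbalanced two-way data (cell sizes differ).
set.seed(2)
n_cell <- c(2, 5, 3, 6)   # observations in the four cells
d <- data.frame(
  a = factor(rep(c("a1", "a1", "a2", "a2"), times = n_cell)),
  b = factor(rep(c("b1", "b2", "b1", "b2"), times = n_cell))
)
d$y <- rnorm(nrow(d), mean = 50 + 5 * (d$a == "a2") + 3 * (d$b == "b2"), sd = 4)
table(d$a, d$b)

# Nested models matching the order a, b, a x b.
m0  <- lm(y ~ 1, data = d)
ma  <- lm(y ~ a, data = d)
mab <- lm(y ~ a + b, data = d)
mi  <- lm(y ~ a * b, data = d)

# Each sequential sum of squares is a drop in residual sum of squares.
by_hand <- c(a     = deviance(m0)  - deviance(ma),
             b     = deviance(ma)  - deviance(mab),
             "a:b" = deviance(mab) - deviance(mi))
by_hand
anova(mi)[c("a", "b", "a:b"), "Sum Sq"]

# Reversing the order changes the main-effect sums of squares.
anova(lm(y ~ b * a, data = d))[c("a", "b"), "Sum Sq"]
```

The hand-computed differences match the sequential table from anova(), and the main-effect entries change when b is entered before a.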

The use of these sums of squares (sometimes known as Type I sums of squares) in a series of tables in which the effects are considered in different orders provides the most appropriate approach to the analysis of unbalanced designs.

We can derive the two analyses of variance tables for the foster feeding example by applying the R code

R> summary(aov(weight ~ litgen * motgen, data = foster))

to give

and then the code

R> summary(aov(weight ~ motgen * litgen, data = foster))

to give

There are (small) differences in the sum of squares for the two main effects and, consequently, in the associated F-tests and p-values. This would not be true if in the previous example in Subsection 5.3.1 we had used the code

R> summary(aov(weightgain ~ type * source, data = weightgain))

instead of the code which produced Figure 5.2 (readers should confirm that this is the case).
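The same kind of check can be run on any balanced layout. As a sketch that works without extra packages, the code below uses the built-in warpbreaks data as a stand-in balanced design and fits the model with the factors in both orders.

```r
# Stand-in balanced design: warpbreaks has nine observations in every cell.
table(warpbreaks$wool, warpbreaks$tension)

ss_wt <- anova(lm(breaks ~ wool * tension, data = warpbreaks))
ss_tw <- anova(lm(breaks ~ tension * wool, data = warpbreaks))

# The main-effect sums of squares agree regardless of order.
ss_wt[c("wool", "tension"), "Sum Sq"]
ss_tw[c("wool", "tension"), "Sum Sq"]
```

With equal cell sizes the main-effect sums of squares are orthogonal, so the order of the terms in the formula does not affect them.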

Although for the foster feeding data the differences in the two analyses of variance with different orders of main effects are very small, this may not always be the case and care is needed in dealing with unbalanced designs. For a more complete discussion see Nelder (1977) and Aitkin (1978).

Both ANOVA tables indicate that the main effect of mother’s genotype is highly significant and that genotype B leads to the greatest litter weight and genotype J to the smallest litter weight.

We can investigate the effect of genotype B on litter weight in more detail by the use of multiple comparison procedures (see Everitt, 1996, and Chapter 14). Such procedures allow a comparison of all pairs of levels of a factor whilst maintaining the nominal significance level at its specified value and producing adjusted confidence intervals for mean differences. One such procedure is called Tukey honest significant differences, suggested by Tukey (1953); see also Hochberg and Tamhane (1987). Here, we are interested in simultaneous confidence intervals for the weight differences between all four genotypes of the mother. First, an ANOVA model is fitted

R> foster_aov <- aov(weight ~ litgen * motgen, data = foster)

which serves as the basis of the multiple comparisons, here with all pair-wise differences by

R> foster_hsd <- TukeyHSD(foster_aov, "motgen")

R> foster_hsd

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = weight ~ litgen * motgen, data = foster)

$motgen

J-B -9.896537 -17.197624 -2.5954489 0.0040509

A convenient plot method exists for this object and we can get a graphical representation of the multiple confidence intervals as shown in Figure 5.5. It appears that there is only evidence for a difference in the B and J genotypes. Note that the particular method implemented in TukeyHSD is applicable only to balanced and mildly unbalanced designs (which is the case here). Alternative approaches, applicable to unbalanced designs and more general research questions, will be introduced and discussed in Chapter 14.
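The structure of a TukeyHSD result can be explored on any fitted aov model. The sketch below uses the built-in warpbreaks data as a stand-in (an assumption made so that the code runs without the HSAUR2 package) and picks out the pairs whose simultaneous interval excludes zero.

```r
# Stand-in data: warpbreaks (built in), comparing the three tension levels.
wb_aov <- aov(breaks ~ wool * tension, data = warpbreaks)
wb_hsd <- TukeyHSD(wb_aov, "tension", conf.level = 0.95)

# One row per pair of levels: difference, simultaneous interval, adjusted p.
ci <- wb_hsd$tension
ci

# Pairs whose simultaneous 95% interval excludes zero.
rownames(ci)[ci[, "lwr"] > 0 | ci[, "upr"] < 0]

# plot(wb_hsd) draws the same intervals, one line per pair.
```

A factor with three levels yields three pairwise comparisons; for the four mother genotypes in the foster data there are six.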

5.3.3 Water Hardness and Mortality

The water hardness and mortality data for 61 large towns in England and Wales (see Table 3.3) was analysed in Chapter 3 and here we will extend the analysis by an assessment of the differences of both hardness and mortality
