11.4 Regression in groups
Frequently data are classified into groups, and within each group a linear regression of y on x may be postulated. For example, the regression of forced expiratory volume on age may be considered separately for men in different occupational groups. Possible differences between the regression lines are then often of interest.
In this section we consider comparisons of the slopes of the regression lines. If the slopes clearly differ from one group to another, then so, of course, must the mean values of y, at least for some values of x. In Fig 11.4(a), the slopes of the regression lines differ from group to group. The lines for groups (i) and (ii) cross. Those for (i) and (iii), and for (ii) and (iii), would also cross if extended sufficiently far, but here there is some doubt as to whether the linear regressions would remain valid outside the range of observed values of x.
If the slopes do not differ, the lines are parallel, as in Fig 11.4(b) and (c), and here it becomes interesting to ask whether, as in (b), the lines differ in their height above the x axis (which depends on the coefficient a in the equation E(y) = a + bx), or whether, as in (c), the lines coincide. In practice, the fitted regression lines would rarely have precisely the same slope or position, and the question is to what extent differences between the lines can be attributed to random variation. Differences in position between parallel lines are discussed in §11.5. In this section we concentrate on the question of differences between slopes.

Fig 11.4 Regression lines for three groups, (i), (ii) and (iii), which differ in slope and position in (a), differ only in position in (b), and coincide in (c).
Suppose that there are k groups, with ni pairs of observations in the ith group. Denote the mean values of x and y in the ith group by x̄i and ȳi, and the regression line, calculated as in §7.2, by

Yi = ȳi + bi(x − x̄i).
If all the ni are reasonably large, a satisfactory approach is to estimate the variance of each bi by (7.16) and to ignore the imprecision in these estimates of variance. Changing the notation of (7.16) somewhat, we shall denote the residual mean square for the ith group by si² and the sum of squares of x about x̄i by Σ(i)(x − x̄i)². Note that the parenthesized suffix i attached to the summation sign indicates summation only over the specified group i; that is, Σ(i)(x − x̄i)² is the sum of (x − x̄i)² taken over the ni observations in the ith group. The estimated variance of bi is then

var(bi) = si² / Σ(i)(x − x̄i)² = 1/wi.    (11.14)
If a common slope is assumed, it may be estimated by the weighted mean of the separate slopes,

b = Σwi bi / Σwi,    (11.16)

with an estimated variance

var(b) = 1/Σwi.    (11.17)

The sampling variation of b is approximately normal.
It is difficult to say how large the ni must be for this `large-sample' approach to be used with safety. There would probably be little risk in adopting it if none of the ni fell below 20.
A more exact treatment is available provided that an extra assumption is made: that the residual variances si² are all equal. Suppose the common value is s². We consider first the situation where k = 2, as a comparison of two slopes can be effected by use of the t distribution. For k > 2 an analysis of variance is required.
Writing Sxx1, Sxy1, etc., for the sums of squares and products of deviations about the means in the two groups, the two slopes are b1 = Sxy1/Sxx1 and b2 = Sxy2/Sxx2, and s² is estimated by the pooled residual mean square about the two separate lines, with n1 + n2 − 4 DF. The estimated variance of the difference between the slopes is

var(b1 − b2) = s²(1/Sxx1 + 1/Sxx2),    (11.19)

and

t = (b1 − b2)/SE(b1 − b2)    (11.20)

may be referred to the t distribution on n1 + n2 − 4 DF.

If a common value is assumed for the regression slope in the two groups, its value b may be estimated by

b = (Sxy1 + Sxy2)/(Sxx1 + Sxx2),    (11.21)

with an estimated variance

var(b) = s²/(Sxx1 + Sxx2).    (11.22)
Equations (11.21) and (11.22) can easily be seen to be equivalent to (11.16) and (11.17) if, in the calculation of wi in (11.14), the separate estimates of residual variance si² are replaced by the common estimate s². For tests or the calculation of confidence limits for b using (11.22), the t distribution on n1 + n2 − 4 DF should be used. Where a common slope is accepted it would be more usual to estimate s² as the residual mean square about the parallel lines (11.34), which would have n1 + n2 − 3 DF.
Example 11.3

We shall first illustrate the calculations for two groups by amalgamating groups A1 and A2 (denoting the pooled group by A) and comparing groups A and B.
The sums of squares and products of deviations about the mean, and the separate slopes bi, are as follows:

Group   i   ni    Sxxi       Sxyi        Syyi       bi
A       1   40    4397.38    −236.385    26.5812    −0.0538
B       2   44    6197.16    −189.712    20.6067    −0.0306
Total            10594.54    −426.097    47.1879    (−0.0402)

The SSq about the regressions are

Σ(1)(y − Y1)² = 26.5812 − (−236.385)²/4397.38 = 13.8741

and

Σ(2)(y − Y2)² = 20.6067 − (−189.712)²/6197.16 = 14.7991.

Thus,

s² = (13.8741 + 14.7991)/(40 + 44 − 4) = 0.3584,

and the difference between b1 and b2 may be tested using (11.19) and (11.20).
Table 11.4 Ages and vital capacities for three groups of workers in the cadmium industry: x, age last birthday (years); y, vital capacity (litres). Group A1, exposed > 10 years; Group A2, exposed < 10 years; Group B, not exposed.
Fig 11.5 Scatter diagram showing age and vital capacity of 84 men working in the cadmium industry, divided into three groups (Table 11.4): Group A1, Group A2 and Group B.
The steeper slope for group A may be partly or wholly due to curvature in the regression: there is a suggestion that the mean value of y at high values of x is lower than is predicted by the linear regressions (see p. 330). Alternatively it may be that a linear regression is appropriate for each group, but that for group A the vital capacity declines more rapidly with age than for group B.
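The two-group comparison can be checked directly from the quoted sums of squares and products. The following Python sketch (not part of the original text) assumes the usual pooled-variance form for the standard error of b1 − b2, which is what (11.19) and (11.20) are taken to be here.

```python
# Minimal sketch of the two-group slope comparison from summary statistics.
# The standard error of b1 - b2 is assumed to be sqrt(s^2 (1/Sxx1 + 1/Sxx2)),
# i.e. the usual pooled-variance form referred to as (11.19)-(11.20).
from scipy import stats

def compare_two_slopes(Sxx1, Sxy1, Syy1, n1, Sxx2, Sxy2, Syy2, n2):
    b1, b2 = Sxy1 / Sxx1, Sxy2 / Sxx2
    # Residual sums of squares about the two separate lines
    rss1 = Syy1 - Sxy1**2 / Sxx1
    rss2 = Syy2 - Sxy2**2 / Sxx2
    df = n1 + n2 - 4                      # two slopes and two intercepts estimated
    s2 = (rss1 + rss2) / df               # common residual variance
    se_diff = (s2 * (1 / Sxx1 + 1 / Sxx2)) ** 0.5
    t = (b1 - b2) / se_diff
    p = 2 * stats.t.sf(abs(t), df)
    return b1, b2, s2, t, df, p

# Groups A and B from the worked example (vital capacity on age)
b1, b2, s2, t, df, p = compare_two_slopes(4397.38, -236.385, 26.5812, 40,
                                          6197.16, -189.712, 20.6067, 44)
print(f"bA = {b1:.4f}, bB = {b2:.4f}, s^2 = {s2:.4f}")
print(f"t = {t:.2f} on {df} DF, P = {p:.3f}")
```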
More than two groups
With any number of groups the pooled slope b is given by a generalization of (11.21):

b = Σi Sxyi / Σi Sxxi.    (11.23)
Again, it can be shown that the sums of squares of these components can be added in the same way. This means that

Within-Groups SSq = Residual SSq about separate lines
    + SSq due to differences between the bi and b
    + SSq due to fitting common slope b.    (11.26)

The middle term on the right is the one that particularly concerns us now. It can be obtained by noting that the SSq due to the common slope is (Σi Sxyi)² / Σi Sxxi, so that the SSq due to differences between the slopes is

Σi (Sxyi² / Sxxi) − (Σi Sxyi)² / Σi Sxxi.    (11.30)
It should be noted that (11.30) is equivalent to

Σi Wi bi² − (Σi Wi bi)² / Σi Wi,

where Wi = Sxxi = s²/var(bi), and that the pooled slope b equals ΣWi bi / ΣWi, the weighted mean of the bi. The SSq due to differences in slope is thus essentially a weighted sum of squares of the bi about their weighted mean b, the weights being (as usual) inversely proportional to the sampling variances (see §8.2).
The analysis is summarized in Table 11.5. There is only one DF for the common slope, since the SSq is proportional to the square of one linear contrast, b. The k − 1 DF for the second line follow because the SSq measures differences between k independent slopes, bi. The residual DF follow because there are ni − 2 DF for the ith group and Σi(ni − 2) = n − 2k. The total DF within groups are, correctly, n − k. The F test for differences between slopes follows immediately.
Table 11.5 Analysis of variance for differences between regression slopes.

                                       DF
Due to common slope                    1
Differences between slopes             k − 1
Residual about separate lines          n − 2k
Within groups                          n − k
We now test the significance of the differences between the three slopes. The sums of squares and products of deviations about the mean, and the separate slopes, are calculated for each group as before (Group i, ni, Sxxi, Sxyi, Syyi, bi). Applying (11.30), the SSq due to differences between the slopes is 2.4995, and the analysis of variance may now be completed.
                        SSq       DF   MSq      VR
Common slope            14.8590    1   14.8590  42.09  (P < 0.001)
Between slopes           2.4995    2    1.2498   3.54  (P = 0.034)
Separate residuals      27.5351   78    0.3530
Within groups           44.8936   81
The differences between slopes are more significant than in the two-group analysis. The estimates of the separate slopes, with their standard errors calculated in terms of the Residual MSq on 78 DF, are

bA1 = −0.0851 ± 0.0197,  bA2 = −0.0465 ± 0.0124,  bB = −0.0306 ± 0.0075.

The most highly exposed group, A1, provides the steepest slope. Figure 11.6 shows the separate regressions as well as the three parallel lines.
The doubt about linearity suggests further that a curvilinear regression might be more suitable; however, analysis with a quadratic regression line (see §11.1) shows the non-linearity to be quite non-significant.
The analysis of variance test can, of course, be applied even for k = 2. The results will be entirely equivalent to the t test described at the beginning of this section, the value of F being, as usual, the square of the corresponding value of t.
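The decomposition of Table 11.5 can also be coded directly from per-group sums of squares and products. A sketch is given below; since the separate figures for groups A1 and A2 are not reproduced in the text, the two-group values quoted above are used as a check, for which F should equal the square of the earlier t statistic.

```python
# Sketch of the analysis of variance for differences between regression slopes
# (as in Table 11.5), given within-group sums of squares and products.
import numpy as np
from scipy import stats

def slopes_anova(Sxx, Sxy, Syy, n):
    Sxx, Sxy, Syy, n = map(np.asarray, (Sxx, Sxy, Syy, n))
    k, N = len(Sxx), n.sum()
    ss_common = Sxy.sum() ** 2 / Sxx.sum()            # 1 DF
    ss_separate = (Sxy ** 2 / Sxx).sum()              # fitting k separate slopes
    ss_between = ss_separate - ss_common              # k - 1 DF, as in (11.30)
    ss_resid = (Syy - Sxy ** 2 / Sxx).sum()           # N - 2k DF
    F = (ss_between / (k - 1)) / (ss_resid / (N - 2 * k))
    P = stats.f.sf(F, k - 1, N - 2 * k)
    return ss_common, ss_between, ss_resid, F, P

# Two-group check using the values for groups A and B (F = t^2)
print(slopes_anova([4397.38, 6197.16], [-236.385, -189.712],
                   [26.5812, 20.6067], [40, 44]))
```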
11.5 Analysis of covariance
If, after an analysis of the type described in the last section, there is no strong reason for postulating differences between the slopes of the regression lines in the various groups, the following questions arise. What can be said about the relative position of parallel regression lines? Is there good reason to believe that the true lines differ in position, as in Fig 11.4(b), or could they coincide, as in Fig 11.4(c)? What sampling error is to be attached to an estimate of the difference in positions of lines for two particular groups? The set of techniques associated with these questions is called the analysis of covariance.

Before describing technical details, it may be useful to note some important differences in the purposes of the analysis of covariance and in the circumstances in which it may be used.
1 Main purpose
(a) To correct for bias. If it is known that changes in x affect the mean value of y, and that the groups under comparison differ in their values of x, it will follow that some of the differences between the values of y can be ascribed partly to differences between the xs. We may want to remove this effect as far as possible. For example, if y is forced expiratory volume (FEV) and x is age, a comparison of mean FEVs for men in different occupational groups may be affected by differences in their mean ages. A comparison would be desirable of the mean FEVs at the same age. If the regressions are linear and parallel, this means a comparison of the relative position of the regression lines.
(b) To increase precision. Even if the groups have very similar values of x, precision in the comparison of values of y can be increased by using the residual variation of y about regression on x rather than by analysing the ys alone.
2 Type of investigation
(a) Uncontrolled study. In many situations the observations will be made on units which fall naturally into the groups in question, with no element of controlled allocation. Indeed, it will often be this lack of control which leads to the bias discussed in 1(a).
(b) Controlled study. In a planned experiment, in which experimental units are allocated randomly to the different groups, the differences between values of x in the various groups will be no greater in the long run than would be expected by sampling theory. Of course, there will occasionally be large fortuitous differences in the xs; it may then be just as important to correct for their effect as it would be in an uncontrolled study. In any case, even with very similar values of x, the extra precision referred to in 1(b) may well be worth acquiring.
Two groups
If the t test based on (11.20) reveals no significant difference in slopes, two parallel lines may be fitted with a common slope b given by (11.23). The equations of the two parallel lines are (as in (11.24))

Yc1 = ȳ1 + b(x − x̄1)

and

Yc2 = ȳ2 + b(x − x̄2).

The vertical distance between the lines, which is the same at all values of x, is

d = ȳ1 − ȳ2 − b(x̄1 − x̄2),    (11.32)

with estimated variance

var(d) = var(ȳ1) + var(ȳ2) + (x̄1 − x̄2)² var(b)    (11.33)
       = s²[1/n1 + 1/n2 + (x̄1 − x̄2)²/(Sxx1 + Sxx2)].    (11.34)
The residual mean square s² in (11.34) is taken about parallel lines (since parallelism is an initial assumption in the analysis of covariance), and would be obtained from Table 11.5 by pooling the second and third lines of the analysis. The resultant DF would be (k − 1) + (n − 2k) = n − k − 1, which gives the n1 + n2 − 3 = n − 3 of (11.34) when k = 2.
The standard error (SE) of d, the square root of (11.33), may be used in a t test. On the null hypothesis that the regression lines coincide, E(d) = 0, and

t = d/SE(d)    (11.35)

has n1 + n2 − 3 DF. Confidence limits for the true difference, E(d), are obtained in the usual way.
The relative positions of the lines are conveniently expressed by the calculation of a corrected or adjusted mean value of y for each group. Suppose that group 1 had had a mean value of x equal to some arbitrary value x0 rather than x̄1. Often x0 is given the value of the mean of x over all the groups. From (11.24) we should estimate that the mean y would have been

y′1 = ȳ1 + b(x0 − x̄1),    (11.36)

with a similar expression for y′2. The difference between y′1 and y′2 can easily be seen to be equal to d given by (11.32). If the regression lines coincide, y′1 and y′2 will be equal. If the line for group 1 lies above that for group 2, at a fixed value of x, y′1 will exceed y′2.
The sampling variance of a corrected mean y′i is var(ȳi) + (x0 − x̄i)² var(b), which varies with group not only through ni but also because of the term (x0 − x̄i)², which increases as x̄i gets further from x0 in either direction. The arbitrary choice of x0 is avoided by concentrating on the difference between the corrected means, which is equal to d for all values of x0.
Example 11.4
The data of Example 11.3 will now be used in an analysis of covariance. We shall first illustrate the calculations for two groups, the pooled exposed group A and the unexposed group B.
Using the sums of squares and products given earlier we have

t = 0.0835/0.1334 = 0.63 on 81 DF.

The difference d is clearly not significant (P = 0.53). Corrected values can be calculated, using (11.36), with x0 = 40.5, the overall mean.
This analysis is based on the assumption that the regression lines are parallel for groups A and B but, as noted in Example 11.3, there is at least suggestive evidence that the slopes differ. Suppose we abandon the assumption of parallelism and fit lines with separate slopes, bA and bB. The most pronounced difference between predicted values occurs at high ages. The difference at, say, age 60 is

d′ = ȳA − ȳB + bA(60 − x̄A) − bB(60 − x̄B)
   = −0.5306,

and

var(d′) = var(ȳA) + var(ȳB) + (60 − x̄A)² var(bA) + (60 − x̄B)² var(bB).
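The two-group analysis of covariance can be sketched in code from the same kind of summary statistics. In the sketch below the group means of x and y are hypothetical placeholders (they are not given in the text above); the sums of squares are those quoted for groups A and B.

```python
# Hedged sketch of the two-group analysis of covariance from summary statistics:
# common slope b (11.21), vertical separation d (11.32), its standard error
# (11.33)-(11.34), the t test (11.35), and corrected means (11.36).
# The group means xbar/ybar below are illustrative placeholders only.
from scipy import stats

def ancova_two_groups(n1, xbar1, ybar1, Sxx1, Sxy1, Syy1,
                      n2, xbar2, ybar2, Sxx2, Sxy2, Syy2, x0):
    b = (Sxy1 + Sxy2) / (Sxx1 + Sxx2)                       # common slope
    # residual SSq about the two parallel lines
    rss = (Syy1 + Syy2) - (Sxy1 + Sxy2) ** 2 / (Sxx1 + Sxx2)
    df = n1 + n2 - 3
    s2 = rss / df
    d = (ybar1 - ybar2) - b * (xbar1 - xbar2)
    var_d = s2 * (1 / n1 + 1 / n2 + (xbar1 - xbar2) ** 2 / (Sxx1 + Sxx2))
    t = d / var_d ** 0.5
    p = 2 * stats.t.sf(abs(t), df)
    y1_corr = ybar1 + b * (x0 - xbar1)                      # corrected means
    y2_corr = ybar2 + b * (x0 - xbar2)
    return b, d, t, df, p, y1_corr, y2_corr

# Illustrative call with hypothetical group means
print(ancova_two_groups(40, 41.0, 3.95, 4397.38, -236.385, 26.5812,
                        44, 40.0, 4.20, 6197.16, -189.712, 20.6067, 40.5))
```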
More than two groups
Parallel lines may be fitted to several groups, as indicated in §11.4. The pooled slope, b, is given by (11.23) and the line for the ith group is given by (11.24). The relative positions of the lines are again conveniently expressed in terms of corrected mean values (11.36). On the null hypothesis that the true regression lines for the different groups coincide, the corrected means will differ purely by sampling error. For two groups the test of this null hypothesis is the t test (11.35). With k > 2 groups this test becomes an F test with k − 1 DF in the numerator. The construction of this test is complicated by the fact that the sampling errors of the corrected means y′i are not independent, since the random variable b enters into each expression (11.36). The test is produced most conveniently using multiple regression, which is discussed in the next section, and we return to this topic in §11.7. Other aspects of the analysis are similar to the analysis with two groups. The data of Example 11.4 are analysed in three groups as Example 11.7 (p. 353) using multiple regression.
In considering the possible use of the analysis of covariance with a particular set of data, special care should be given to the identification of the dependent and independent variables. If, in the analysis of covariance of y on x, there are significant differences between groups, it does not follow that the same will be true of the regression of x on y. In many cases this is an academic point because the investigator is clearly interested in differences between groups in the mean value of y, after correction for x, and not in the reverse problem. Occasionally, when x and y have a symmetric type of relation to each other, as in §11.2, both of the analyses of covariance (of y on x, and of x on y) will be misleading. Lines representing the general trend of a functional relationship may well be coincident (as in Fig 11.8), and yet both sets of regression lines are non-coincident. Here the difficulties of §11.2 apply, and lines drawn by eye, or the other methods discussed on p. 319, may provide the most satisfactory description of the data. For a fuller discussion, see Ehrenberg (1968).
The analysis of covariance described in this section is appropriate for data forming a one-way classification into groups. The method is essentially a combination of linear regression (§7.2) and a one-way analysis of variance (§8.1). Similar problems arise in the analysis of more complex data. For example, in the analysis of a variable y in a Latin square one may wish to adjust the apparent treatment effects to correct for variation in another variable x. In particular, x may be some pretreatment characteristic known to be associated with y; the covariance adjustment would then be expected to increase the precision with which the treatments can be compared. The general procedure is an extension of that considered above and the details will not be given.
Fig 11.8 Scatter diagram showing a common trend for two groups of observations but with non-coincident regression lines: the regression of y on x and the regression of x on y are shown for each group.

The methods presented in this and the previous section were developed before the use of computers became widespread, and were designed to simplify the calculations. Nowadays this is not a problem and the fitting of regressions in groups, using either separate non-parallel lines or parallel lines, as in the analysis of covariance, may be accomplished using multiple regression, as discussed in the next two sections.
11.6 Multiple regression
In the earlier discussions of regression in Chapter 7 and the earlier sections of this chapter, we have been concerned with the relationship between the mean value of one variable and the value of another variable, concentrating particularly on the situation in which this relationship can be represented by a straight line.

It is often useful to express the mean value of one variable in terms not of one other variable but of several others. Some examples will illustrate some slightly different purposes of this approach.
1 The primary purpose may be to study the effect on variable y of changes in a particular single variable x1, but it may be recognized that y may be affected by several other variables x2, x3, etc. The effect on y of simultaneous changes in x1, x2, x3, etc., must therefore be studied. In the analysis of data on respiratory function of workers in a particular industry, such as those considered in Examples 11.3 and 11.4, the effect of duration of exposure to a hazard may be of primary interest. However, respiratory function is affected by age, and age is related to duration of exposure. The simultaneous effect on respiratory function of age and exposure must therefore be studied so that the effect of exposure on workers of a fixed age may be estimated.
2 One may wish to derive insight into some causative mechanism by discovering which of a set of variables x1, x2, ..., has apparently most influence on a dependent variable y. For example, the stillbirth rate varies considerably in different towns in Britain. By relating the stillbirth rate simultaneously to a large number of variables describing the towns (economic, social, meteorological or demographic variables, for instance) it may be possible to find which factors exert particular influence on the stillbirth rate (see Sutherland, 1946). Another example is in the study of variations in the cost per patient in different hospitals. This presumably depends markedly on the `patient mix' (the proportions of different types of patient admitted) as well as on other factors. A study of the simultaneous effects of many such variables may explain much of the variation in hospital costs and, by drawing attention to particular hospitals whose high or low costs are out of line with the prediction, may suggest new factors of importance.
3 To predict the value of the dependent variable in future individuals. After treatment of patients with advanced breast cancer by ablative procedures, prognosis is very uncertain. If future progress can be shown to depend on several variables available at the time of the operation, it may be possible to predict which patients have a poor prognosis and to consider alternative methods of treatment for them (Armitage et al., 1969).
The appropriate technique is called multiple regression. In general, the approach is to express the mean value of the dependent variable in terms of the values of a set of other variables, usually called independent variables. The nomenclature is confusing, since some of the latter variables may be either closely related to each other logically (e.g. one might be age and another the square of the age) or highly correlated (e.g. height and arm length). It is preferable to use the terms predictor or explanatory variables, or covariates, and we shall usually follow this practice.

The data to be analysed consist of observations on a set of n individuals, each individual providing a value of the dependent variable, y, and a value of each of the predictor variables, x1, x2, ..., xp. The number of predictor variables, p, should preferably be considerably less than the number of observations, n, and the same p predictor variables must be available for each individual in any one analysis.
Suppose that, for particular values of x1, x2, ..., xp, an observed value of y is specified by the linear model:

y = β0 + β1x1 + β2x2 + ... + βpxp + ε,    (11.38)

where ε is an error term. The various values of ε for different individuals are supposed to be independently normally distributed with zero mean and variance σ². The constants β1, β2, ..., βp are called partial regression coefficients; β0 is sometimes called the intercept. The coefficient β1 is the amount by which y changes on the average when x1 changes by one unit and all the other xs remain constant. In general, β1 will be different from the ordinary regression coefficient of y on x1, because the latter represents the effect of changes in x1 on the average values of y with no attempt to keep the other variables constant.
The coefficients β0, β1, β2, ..., βp are idealized quantities, measurable only from an infinite number of observations. In practice, from n observations, we have to obtain estimates of the coefficients and thus an estimated regression equation:

Y = b0 + b1x1 + b2x2 + ... + bpxp.    (11.39)

Statistical theory tells us that a satisfactory method of obtaining the estimated regression equation is to choose the coefficients such that the sum of squares of residuals, Σ(y − Y)², is minimized, that is, by the method of least squares, which was introduced in §7.2. Note that here y is an observed value and Y is the value predicted by (11.39) in terms of the predictor variables. A consequence of this approach is that the regression equation (11.39) is satisfied if all the variables are given their mean values. Thus,

ȳ = b0 + b1x̄1 + b2x̄2 + ... + bpx̄p,

and consequently b0 can be replaced in (11.39) by

ȳ − b1x̄1 − b2x̄2 − ... − bpx̄p

to give the following form to the regression equation:

Y = ȳ + b1(x1 − x̄1) + b2(x2 − x̄2) + ... + bp(xp − x̄p).    (11.40)

The equivalent result for simple regression was given at (7.4).
We are now left with the problem of finding the partial regression coefficients, bi. An extension of the notation introduced in §11.4 will be used; for instance,

Sxjxj = Σ(xj − x̄j)² = Σxj² − (Σxj)²/n.

The method of least squares gives a set of p simultaneous linear equations in the p unknowns, b1, b2, ..., bp, as follows:
Sx1x1 b1 + Sx1x2 b2 + ... + Sx1xp bp = Sx1y
Sx2x1 b1 + Sx2x2 b2 + ... + Sx2xp bp = Sx2y
   ...
Sxpx1 b1 + Sxpx2 b2 + ... + Sxpxp bp = Sxpy.    (11.41)

These are the so-called normal equations; they are the multivariate extension of the equation for b in §7.2. In general, there is a unique solution. The numerical coefficients of the left side of (11.41) form a matrix which is symmetric about the diagonal running from top left to bottom right, since, for example, Sx1x2 = Sx2x1. These coefficients involve only the xs. The terms on the right also involve the ys.
The complexity of the calculations necessary to solve the normal equations increases rapidly with the value of p but, since standard computer programs are available for multiple regression analysis, this need not trouble the investigator. Those familiar with matrix algebra will recognize this problem as being soluble in terms of the inverse matrix. We shall return to the matrix representation of multiple regression later in this section but, for now, shall simply note that the equations have a solution which may be obtained by using a computer with statistical software.
Sampling variation
As in simple regression, the Total SSq of y may be divided into the SSq due to regression and the SSq about regression, with (11.2) and (11.3) applying and illustrated by Fig 11.1, although in multiple regression the corresponding figure could not be drawn, since it would be in (p + 1)-dimensional space. Thus, again we have

Total SSq = SSq about regression + SSq due to regression,

which provides the opportunity for an analysis of variance. The subdivision of DF is as follows:

Total            About regression        Due to regression
n − 1            n − p − 1               p
The variance ratio

F = (MSq due to regression)/(MSq about regression)    (11.42)

provides a composite test of the null hypothesis that β1 = β2 = ... = βp = 0, i.e. that all the predictor variables are irrelevant.
We have so far discussed the significance of the joint relationship of y with the predictor variables. It is usually interesting to study the sampling variation of each bj separately. This not only provides information about the precision of the partial regression coefficients, but also enables each of them to be tested for a significant departure from zero.

The variance and standard error of bj are obtained by the multivariate extension of (7.16) as

var(bj) = s²cjj    (11.43)

and

SE(bj) = s√cjj,    (11.44)

where s² is the Residual MSq and cjj is the jth diagonal element of the inverse of the p × p matrix of corrected sums of squares and products of the predictor variables.
If a particular bj is not significantly different from zero, it may be thought sensible to call it zero (i.e. to drop it from the regression equation) to make the equation as simple as possible. It is important to realize, though, that if this is done the remaining bjs would be changed; in general, new values would be obtained by doing a new analysis on the remaining xjs.
R² must increase as further variables are introduced into a regression and therefore cannot be used to compare regressions with different numbers of variables. The value of R² may be adjusted to take account of the chance contribution of each variable included by subtracting the value that would be expected if none of the variables was associated with y. The adjusted value, R²a, is calculated from

R²a = 1 − (1 − R²)(n − 1)/(n − p − 1).    (11.46)
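The analysis of variance of a fitted multiple regression, together with R² and its adjusted value, can be sketched as follows. The data are again simulated and the adjusted-R² formula used is the standard one given as (11.46) above.

```python
# Sketch of the ANOVA for a multiple regression, with R^2 and adjusted R^2.
import numpy as np
from scipy import stats

def regression_anova(x, y):
    n, p = x.shape
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ b
    ss_total = ((y - y.mean()) ** 2).sum()
    ss_about = ((y - fitted) ** 2).sum()             # residual (about regression)
    ss_due = ss_total - ss_about                     # due to regression
    F = (ss_due / p) / (ss_about / (n - p - 1))
    P = stats.f.sf(F, p, n - p - 1)
    R2 = ss_due / ss_total
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)    # adjusted value, as in (11.46)
    return b, F, P, R2, R2_adj

rng = np.random.default_rng(2)
x = rng.normal(size=(53, 2))
y = 23 + x @ np.array([24.0, -0.7]) + rng.normal(scale=15, size=53)
print(regression_anova(x, y))
```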
Example 11.5
The data shown in Table 11.6 are taken from a clinical trial to compare two hypotensive drugs used to lower the blood pressure during operations (Robertson & Armitage, 1959). The dependent variable, y, is the `recovery time' (in minutes) elapsing between the time at which the drug was discontinued and the time at which the systolic blood pressure had returned to a specified level.
of systolic blood pressure during hypotension (mmHg); y, recovery time (min).
x1: log (quantity of drug used, mg);
x2: mean level of systolic blood pressure during hypotension (mmHg)
Group A: `Minor' non-thoracic (n A 20)
The 53 patients are divided into three groups according to the type of operation. Initially we shall ignore this classification, the data being analysed as a single group. Possible differences between groups are considered in §11.7.

Table 11.6 shows the values of x1, x2 and y for the 53 patients. The columns headed Y and y − Y will be referred to later. Below the data are shown the means and standard deviations of the three variables.
Using standard statistical software for multiple regression, the regression equation is

Y = 23.011 + 23.639x1 − 0.71467x2.

The recovery time increases on the average by about 24 min for each increase of 1 in the log dose (i.e. each 10-fold increase in dose), and decreases by 0.71 min for every increase of 1 mmHg in the mean blood pressure during hypotension.
The analysis of variance of y is

                        SSq        DF    MSq      VR
Due to regression       2783.2      2    1391.6   6.32  (P = 0.004)
About regression       11007.9     50     220.2
Total                  13791.2     52
The variance ratio is highly significant and there is thus little doubt that either x1 or x2 is, or both are, associated with y. The squared multiple correlation coefficient, R², is 2783.2/13791.2 = 0.2018; R = √0.2018 = 0.45. Of the total sum of squares of y, about 80% (0.80 = 1 − R²) is still present after prediction of y from x1 and x2. The predictive value of x1 and x2 is thus rather low, even though it is highly significant.

From the analysis of variance, s² = 220.2, so that the SD about the regression, s, is 14.8. The standard errors of the partial regression coefficients may be obtained from (11.43) and (11.44).
Analysis of variance test for deletion of variables
The t test for a particular regression coefficient, say, bj, tests whether the corresponding predictor variable, xj, can be dropped from the regression equation without any significant effect on the variation of y.
Sometimes we may wish to test whether variability is significantly affected by the deletion of a group of predictor variables. For example, in a clinical study there may be three variables concerned with body size: height (x1), weight (x2) and chest measurement (x3). If all other variables represent quite different characteristics, it may be useful to know whether all three of the size variables can be dispensed with. Suppose that q variables are to be deleted, out of a total of p. If two multiple regressions are done, (a) with all p variables, and (b) with the reduced set of p − q variables, the following analysis of variance is obtained:

                                                     DF
(i)   Due to regression (a) on p variables           p
(ii)  Due to regression (b) on p − q variables       p − q
(iii) Due to deletion of q variables                 q
(iv)  Residual about regression (a)                  n − p − 1

The SSq for (i) and (iv) are obtained from regression (a), that for (ii) from regression (b) and that for (iii) by subtraction: (iii) = (i) − (ii). The variance ratio from (iii) and (iv) provides the required F test.
It is usually very simple to arrange for regressions to be done on several different subsets of predictor variables, using a multiple regression package on a computer. The `TEST' option within the SAS program PROC REG gives the F test for the deletion of the variables.
When only one variable, say, the jth, is to be deleted, the same procedure could, in principle, be followed instead of the t test. The analysis of variance would be as above with q = 1. If this were done, it would be found that the variance ratio test, F, would be equal to t², thus giving the familiar equivalence between an F test on 1 and n − p − 1 DF, and a t test on n − p − 1 DF.
It will sometimes happen that two or more predictor variables all give non-significant partial regression coefficients, and yet the deletion of the whole group has a significant effect by the F test. This often happens when the variables within the group are highly correlated; any one of them can be dispensed with, without appreciably affecting the prediction of y; the remaining variables in the group act as effective substitutes. If the whole group is omitted, though, there may be no other variables left to do the job. With a large number of interrelated predictor variables it often becomes quite difficult to sort out the meaning of the various partial regression coefficients (see p. 358).
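The deletion test described above amounts to comparing a full and a reduced regression. A small sketch of that comparison, with simulated data and an illustrative choice of which columns to retain, is given here.

```python
# Sketch of the F test for deleting q predictor variables: fit regressions (a)
# and (b) and compare residual sums of squares, as in rows (i)-(iv) above.
import numpy as np
from scipy import stats

def rss(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ b) ** 2).sum()

def deletion_F_test(x_full, y, keep):
    """keep: column indices retained in the reduced regression (b)."""
    n, p = x_full.shape
    q = p - len(keep)
    Xa = np.column_stack([np.ones(n), x_full])           # regression (a)
    Xb = np.column_stack([np.ones(n), x_full[:, keep]])  # regression (b)
    ss_deletion = rss(Xb, y) - rss(Xa, y)                # SSq due to deletion, q DF
    ms_resid = rss(Xa, y) / (n - p - 1)                  # residual about regression (a)
    F = (ss_deletion / q) / ms_resid
    return F, q, n - p - 1, stats.f.sf(F, q, n - p - 1)

rng = np.random.default_rng(3)
x = rng.normal(size=(60, 5))
y = 1 + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=60)
print(deletion_F_test(x, y, keep=[0, 1]))                # test deletion of x3, x4, x5
```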
Weighted regression

When the observations carry unequal weights w, the analysis may follow the general multiple regression method, replacing all sums like Σxj and Σy by Σwxj and Σwy (the subscript i, identifying the individual observation, has been dropped here to avoid confusion with j, which identifies a particular explanatory variable). Similarly, all sums of squares and products are weighted: Σy² is replaced by Σwy², Σxjxk by Σwxjxk, and in calculating corrected sums of squares and products n is replaced by Σw (see (8.15) and (8.16)).
The standard t tests, F tests and confidence intervals are then valid. The Residual MSq is an estimate of σ², and may, in certain situations, be checked against an independent estimate of σ². For example, in the situation discussed on p. 360, where the observations fall into groups with particular combinations of values of predictor variables, ȳi may be taken to be the mean of ni observations at a specified combination of xs. The variance of ȳi is then σ²/ni, and the analysis may be carried out by weighted regression, with wi = ni. The Residual MSq will be the same as that derived from line (iii) of the analysis on p. 360, and may be compared (as indicated in the previous discussion) against the Within-Groups MSq in line (iv).
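A sketch of weighted least squares, written directly in terms of the substitution described above (every sum becomes a weighted sum), might look like this. The weights here are arbitrary illustrative values; weights equal to group sizes ni would be used in the situation just described.

```python
# Sketch of weighted regression: weighted normal equations, weighted residual
# mean square, and the variance-covariance matrix of the coefficients.
import numpy as np

def weighted_regression(x, y, w):
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    W = np.diag(w)
    XtWX = X.T @ W @ X
    XtWy = X.T @ W @ y
    b = np.linalg.solve(XtWX, XtWy)
    resid = y - X @ b
    s2 = (w * resid ** 2).sum() / (n - X.shape[1])   # weighted residual mean square
    cov_b = s2 * np.linalg.inv(XtWX)                 # variances and covariances of b
    return b, s2, cov_b

rng = np.random.default_rng(4)
x = rng.normal(size=(30, 2))
w = rng.integers(1, 6, size=30).astype(float)        # e.g. group sizes n_i
y = 2 + x @ np.array([1.0, -0.5]) + rng.normal(scale=1 / np.sqrt(w))
print(weighted_regression(x, y, w)[0])
```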
Matrix representation
As noted, the normal equations (11.41) may be concisely written as a matrix equation, and the solution may be expressed in terms of an inverse matrix. In fact, the whole theory of multiple regression may be expressed in matrix notation and this approach will be indicated briefly. A fuller description is given in Chapters 4 and 5 of Draper and Smith (1998); see also Healy (2000) for a general description of matrices as applied in statistics. Readers unfamiliar with matrix notation could omit study of the details in this subsection, noting only that the matrix notation is a shorthand method used to represent a multivariate set of equations. With the more complex models that are considered in Chapter 12, the use of matrix notation is essential.
A matrix is a rectangular array of elements. The size of the array is expressed as the number of rows and the number of columns, so that an r × c matrix consists of r rows and c columns. A matrix with one column is termed a vector. We shall write matrices and vectors in bold type, with vectors in lower case and other matrices in upper case.
A theory of matrix algebra has been developed which defines operations that may be carried out. Two matrices, A and B, may be multiplied together provided that the number of columns of A is the same as the number of rows of B. If A is an r × s matrix and B is an s × t matrix, then their product C is an r × t matrix. A matrix with the same number of rows and columns is termed a square matrix. If A is a square r × r matrix then (provided it is non-singular) it has an inverse A⁻¹ such that the products AA⁻¹ and A⁻¹A both equal the r × r identity matrix, I, which has ones on the diagonal and zeros elsewhere. The identity matrix has the property that when it is multiplied by any other matrix the product is the same as the original matrix, that is, IB = B.
It is convenient to combine the intercept β0 and the regression coefficients into a single set so that the (p + 1) × 1 vector β represents the coefficients β0, β1, β2, ..., βp. Correspondingly a dummy variable x0 is defined which takes the value of 1 for every individual. The data for each individual are the x variables, x0, x1, x2, ..., xp, and the whole of the data may be written as the n × (p + 1) matrix, X, with element (i, j + 1) being the variable xj for individual i. If y represents the vector of n observations on y then the multiple regression equation (11.38) may be written as

y = Xβ + ε,    (11.47)

where ε is the vector of error terms. The least-squares estimates b of β satisfy the normal equations, now including a term for the intercept,

XᵀXb = Xᵀy.    (11.48)

The solution of (11.48) is then given by

b = (XᵀX)⁻¹Xᵀy.    (11.49)
In matrix notation the SSq about the regression may be written as

(y − Xb)ᵀ(y − Xb) = yᵀy − bᵀXᵀy,    (11.50)

and dividing this by n − p − 1 gives the residual mean square, s², as an estimate of the variance about the regression. The estimated variance-covariance matrix of b is

s²(XᵀX)⁻¹.    (11.51)
It can be shown that (11.51) gives the same variances and covariances for b1, b2, ..., bp as (11.43) and (11.44), even though (11.51) involves the inverse of a (p + 1) × (p + 1) matrix of uncorrected sums of squares and products, whilst (11.43) and (11.44) involve the inverse of a p × p matrix of corrected sums of squares and products.
Using (11.49), the right side of (11.50) may be rewritten as

yᵀy − yᵀHy = yᵀ(I − H)y,    (11.52)

where the n × n matrix H = X(XᵀX)⁻¹Xᵀ, and I is the (n × n) identity matrix.

Weighted regression may be expressed in matrix notation as follows. First define a square n × n matrix, V, with diagonal elements equal to 1/wi, where wi is the weight for the ith subject, and all off-diagonal terms zero. Then the normal equations (11.48) become

XᵀV⁻¹Xb = XᵀV⁻¹y,    (11.53)

with solution

b = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.    (11.54)
Further generalizations are possible. Suppose that the usual assumption that the error terms, ε, of different subjects are independent does not apply, and that there are correlations between them. Then the matrix V could represent this situation by non-zero off-diagonal terms corresponding to the covariances between subjects. The matrix V is termed the dispersion matrix. Then (11.53) and (11.54) still apply and the method is termed generalized least squares. This will not be pursued here but will be taken up again in the context of multilevel models (§12.5) and in the analysis of longitudinal data (§12.6).
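The matrix expressions translate almost line for line into numpy; the sketch below is illustrative only (simulated data) and uses an explicit inverse purely to mirror the formulas, although solving the linear system directly is the usual numerical advice.

```python
# Matrix form of the estimates, as in (11.47)-(11.54): ordinary least squares,
# the hat matrix H, and the weighted/generalized version with dispersion matrix V.
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

# Ordinary least squares: b = (X'X)^-1 X'y, var(b) = s^2 (X'X)^-1
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid_ss = y @ y - b @ X.T @ y                    # as in (11.50)
s2 = resid_ss / (n - p - 1)
cov_b = s2 * XtX_inv                              # as in (11.51)
H = X @ XtX_inv @ X.T                             # hat matrix in (11.52)

# Weighted / generalized least squares with a diagonal dispersion matrix V
V = np.diag(1.0 / rng.uniform(0.5, 2.0, size=n))  # 1/w_i on the diagonal
Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # (11.53)-(11.54)
print(b, b_gls, sep="\n")
```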
11.7 Multiple regression in groups
When the observations fall into k groups by a one-way classification, questions of the types discussed in §§11.4 and 11.5 may arise. Can equations with the same partial regression coefficients (the bi's), although perhaps different intercepts (the b0's), be fitted to different groups, or must each group have its own set of bi's? (This is a generalization of the comparison of slopes in §11.4.) If the same bi's are appropriate for all groups, can the same b0 be used (thus leading to one equation for the whole data), or must each group have its own b0? (This is a generalization of the analysis of covariance, §11.5.)
The methods of approach are rather straightforward developments of those used previously and will be indicated only briefly. Suppose there are, in all, n observations falling into k groups, with p predictor variables observed throughout. To test whether the same bi's are appropriate for all groups, an analysis of variance analogous to Table 11.5 may be derived, with the following subdivision of DF:

                                                       DF
(i)   Due to regression with common slopes (bi's)      p
(ii)  Differences between slopes (bi's)                p(k − 1)
(iii) Residual about separate regressions              n − (p + 1)k
The DF agree with those of Table 11.5 when p = 1. The SSq within groups is exactly the same as in Table 11.5. The SSq for (iii) is obtained by fitting a separate regression equation to each of the k groups and adding the resulting Residual SSq. The residual for the ith group has ni − p − 1 DF, and these add to n − (p + 1)k. The SSq for (i) is obtained by a simple multiple regression calculation using the pooled sums of squares and products within groups throughout; this is the appropriate generalization of the first line of Table 11.5. The SSq for (ii) is obtained by subtraction. The DF, obtained also by subtraction, are plausible as this SSq represents differences between k values of b1, between k values of b2, and so on; there are p predictor variables, each corresponding to k − 1 DF.
It may be more useful to have a rather more specific comparison of some regression coefficients than is provided by the composite test described above. For a particular coefficient, bj, for instance, the k separate multiple regressions will provide k values, each with its standard error. Straightforward comparisons of these, e.g. using (11.15) with weights equal to the reciprocals of the variances of the separate values, will often suffice.
The analysis of covariance assumes common values for the bi's and tests for differences between the b0's. The analysis of covariance has the following DF:

                                                       DF
(iv)  Between intercepts (b0's)                        k − 1
(v)   Residual about within-groups regression          n − k − p
(vi)  Total: residual about overall regression         n − p − 1

The corrected Total SSq (vi) is obtained from a single multiple regression calculation for the whole data; the DF are n − p − 1 as usual. That for (v) is obtained as the residual for the regression calculation using within-groups sums of squares and products; it is in fact the residual corresponding to the regression term (i) in the previous table, and is the sum of the SSq for (ii) and (iii) in that table. That for (iv) is obtained by subtraction.
As indicated in §11.5, multiple regression techniques offer a convenient approach to the analysis of covariance, enabling the whole analysis to be done by one application of multiple regression. Consider first the case of two groups. Let us introduce a new variable, z, which is given the value 1 for all observations in group 1 and 0 for all observations in group 2. As a model for the data as a whole, suppose that

E(y) = β0 + δz + β1x1 + β2x2 + ... + βpxp.    (11.55)

Because of the definition of z, (11.55) is equivalent to assuming that

E(y) = β0 + δ + β1x1 + β2x2 + ... + βpxp   for group 1
     = β0 + β1x1 + β2x2 + ... + βpxp       for group 2,    (11.56)

so that δ represents a constant difference between the two groups at any given values of the xs. The whole analysis can therefore be carried out by a single multiple regression of y on z, x1, x2, ..., xp. The new variable z is called a dummy, or indicator, variable. The coefficient δ is the partial regression coefficient of y on z, and is estimated in the usual way by the multiple regression analysis, giving an estimate d, say. The variance of d is estimated as usual from (11.43) or (11.51), and the appropriate tests and confidence limits follow by use of the t distribution. Note that the Residual MSq has n − p − 2 DF (since the introduction of z increases the number of predictor variables from p to p + 1), and that this agrees with (v) on p. 348 (putting k = 2).
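A small sketch of this dummy-variable approach, with simulated data and a single covariate, follows; the coefficient on z plays the role of d, and its t test uses n − p − 2 DF as just described.

```python
# Sketch of analysis of covariance as a single multiple regression (11.55):
# a dummy variable z (1 in group 1, 0 in group 2) is added to the covariate(s),
# and its coefficient estimates the corrected group difference. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n1, n2 = 25, 30
x = np.concatenate([rng.normal(40, 8, n1), rng.normal(45, 8, n2)])   # covariate
z = np.concatenate([np.ones(n1), np.zeros(n2)])                      # dummy variable
y = 10 + 0.3 * x + 2.0 * z + rng.normal(scale=2, size=n1 + n2)

X = np.column_stack([np.ones(n1 + n2), z, x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # [b0, d, b1]
resid = y - X @ b
s2 = (resid ** 2).sum() / (n1 + n2 - 3)    # n - p - 2 DF with p = 1 covariate
se_d = np.sqrt(s2 * XtX_inv[1, 1])
t = b[1] / se_d
print(f"d = {b[1]:.2f}, SE = {se_d:.2f}, t = {t:.2f}, "
      f"P = {2 * stats.t.sf(abs(t), n1 + n2 - 3):.3f}")
```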
When k > 2, the procedure described above is generalized by the introduction of k − 1 dummy variables. These can be defined in many equivalent ways. One convenient method is to let zi take the value 1 for observations in group i and 0 otherwise (i = 1, ..., k − 1), so that all the dummy variables are zero for observations in group k. The model is then

E(y) = β0 + δ1z1 + ... + δk−1zk−1 + β1x1 + ... + βpxp,    (11.57)

and the estimated regression equation is

Y = b0 + d1z1 + ... + dk−1zk−1 + b1x1 + ... + bpxp.    (11.58)
Trang 29The regression coefficients d1, d2, , dk 1represent contrasts between the meanvalues of y for groups 1, 2, , k 1 and that for group k, after correction fordifferences in the xs The overall significance test for the null hypothesis that
d1 d2 dk 1 0 corresponds to the F test on k 1 and n k p DF((iv) and (v) on p 348) The equivalent procedure here is to test the compositesignificance of d1, d2, , dk 1by deleting the dummy variables from the analysis(p 344) This leads to exactly the same F test
Corrected means analogous to (11.36) are obtained as

y′i = ȳi + Σj bj(x0j − x̄ij),

where x0j is the value of xj (often the overall mean) at which the correction is made. The difference between the corrected means for, say, groups 1 and 2 is

y′1 − y′2 = ȳ1 − ȳ2 − Σj bj(x̄1j − x̄2j).    (11.61)

Its estimated variance is obtained from the variances and covariances of the ds. For a contrast between group k and one of the other groups, i, say, the appropriate di and its standard error are given directly by the regression analysis; di is in fact the same as the difference between corrected means y′i − y′k. For a contrast between two groups other than group k, say, groups 1 and 2, we use the fact that this difference equals d1 − d2, with estimated variance var(d1) + var(d2) − 2cov(d1, d2).
Example 11.6
The data of Table 11.6 have so far been analysed as a single group. However, Table 11.6 shows a grouping of the patients according to the type of operation, and we should clearly enquire whether a single multiple regression equation is appropriate for all three groups. Indeed, this question should be raised at an early stage of the analysis, before too much effort is expended on examining the overall regression.
An exploratory analysis can be carried out by examining the residuals, y − Y, in Table 11.6. The mean values of the residuals in the three groups are:

A: −2.52    B: 0.46    C: 4.60
These values suggest the possibility that the mean recovery time, at given values of x1 and x2, increases from group A, through group B, to group C. However, the residuals are very variable, and it is not immediately clear whether these differences are significant. A further question is whether the regression coefficients are constant from group to group.

As a first step in a more formal analysis, we calculate separate multiple regressions in the three groups. The results are summarized as follows:
                    Group A        Group B        Group C

Predicted values from the three multiple regressions, for values of x1 and x2 equal to the overall means, 1.9925 and 66.340, respectively, are:

A: 19.03    B: 22.44    C: 26.24,

closer together than the unadjusted means shown in the table above.
There are substantial differences between the estimated regression coefficients across the three groups, but the standard errors are large. Using the methods of §11.4, we can test for homogeneity of each coefficient separately, the reciprocals of the variances being used as weights. This gives, as approximate χ² statistics on 2 DF, 1.86 for b1 and 4.62 for b2, neither of which is significant at the usual levels. We cannot add these χ²(2) statistics, since b1 and b2 are not independent. For a more complete analysis of the whole data set we need a further analysis, using a model like (11.57). We shall need two dummy variables, z1 (taking the value 1 in group A and 0 elsewhere) and z2 (1 in group B and 0 elsewhere). Combining the Residual SSqs from the initial overall regression, the new regression with two dummy variables, and the regressions within separate groups (pooling the three residuals), we have:
                                        DF      SSq       MSq      VR
Residual from overall regression        50    11007.9
  Between intercepts                     2      421.3     210.7    0.96
Residual from (11.57)                   48    10586.6     220.6
  Between slopes                         4     1171.7     292.9    1.37
Residual from separate regressions      44     9414.9     214.0
In the table on p. 351, the SSq between intercepts and between slopes are obtained by subtraction. Neither term is at all significant. Note that the test for differences between slopes is approximately equivalent to χ²(4) = 4 × 1.37 = 5.48, rather less than the sum of the two χ²(2) statistics given earlier, confirming the non-independence of the two previous tests. Note also that even if the entire SSq between intercepts were ascribed to a trend across the groups, with 1 DF, the variance ratio (VR) would be only 421.33/220.6 = 1.91, far from significant. The point can be made in another way, by comparing the intercepts, in the model of (11.57), for the two extreme groups, A and C. These groups differ in the values taken by z1 (1 and 0, respectively). In the analysis of the model (11.57), the estimate d1 of the regression coefficient on z1 is 7.39 ± 5.39, again not significant.

Of course, the selection of the two groups with the most extreme contrast biases the analysis in favour of finding a significant difference, but, as we see, even this contrast is not significantly large.

In summary, it appears that the overall multiple regression fitted initially to the data of Table 11.6 can be taken to apply to all three groups of patients.
The between-slopes SSq was obtained as the difference between the residual from fitting (11.57) and the sum of the three separate residuals from fitting regressions for each group separately. It is often more convenient to do the computations as analyses of the total data set, as follows.
Consider first two groups with a dummy variable for group 1, z, defined as earlier. Now define new variables wj, for j = 1 to p, by wj = zxj. Since z is zero in group 2, all the wj are also zero in group 2; z = 1 in group 1 and therefore wj = xj in group 1. Consider the following model:

E(y) = β0 + δz + β1x1 + ... + βpxp + γ1w1 + ... + γpwp.    (11.64)
Because of the definitions of z and the wj, (11.64) is equivalent to

E(y) = (β0 + δ) + (β1 + γ1)x1 + ... + (βp + γp)xp   for group 1
     = β0 + β1x1 + ... + βpxp                       for group 2,    (11.65)

which gives lines of different slopes and intercepts for the two groups. The coefficient γj is the difference between the slopes on xj in the two groups. The null hypothesis that γ1 = γ2 = ... = γp = 0 is tested by deleting the wj variables from the regression, to give an F test on p and n − 2p − 2 DF.
When k > 2, the above procedure is generalized by deriving p(k − 1) variables, wij = zixj, and an overall test of parallel regressions in all the groups is given by the composite test of the regression coefficients for all the wij. This is an F test with p(k − 1) and n − (p + 1)k DF.
In the above procedure the order of introducing (or deleting) the variables is crucial. The wij should only be included in regressions in which the corresponding zi and xj variables are also included. Three regressions are fitted in sequence on the following variables:

(i) x1, ..., xp;
(ii) x1, ..., xp, z1, ..., zk−1;
(iii) x1, ..., xp, z1, ..., zk−1, w11, ..., wk−1,p.

These three regressions in turn correspond to Fig 11.4(c), 11.4(b) and 11.4(a), respectively.
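The sequence of three nested regressions can be sketched in code as follows. The data are simulated (three groups, one covariate), and the two F tests compare successive models: dummies added to the covariate, then dummy-by-covariate products added to give separate slopes.

```python
# Sketch of the three nested regressions (i)-(iii): covariates only, then group
# dummies (parallel lines), then dummy-by-covariate products (separate slopes).
import numpy as np
from scipy import stats

def fit_rss(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ b) ** 2).sum(), X.shape[1]

rng = np.random.default_rng(7)
group = np.repeat([0, 1, 2], 25)
x = rng.normal(40, 10, size=75)
y = 4 - 0.05 * x - 0.3 * (group == 0) + rng.normal(scale=0.5, size=75)

ones = np.ones(75)
Z = np.column_stack([(group == g).astype(float) for g in (0, 1)])  # k - 1 dummies
W = Z * x[:, None]                                                  # interaction terms
X1 = np.column_stack([ones, x])            # (i)   single line
X2 = np.column_stack([X1, Z])              # (ii)  parallel lines
X3 = np.column_stack([X2, W])              # (iii) separate slopes

(r1, p1), (r2, p2), (r3, p3) = fit_rss(X1, y), fit_rss(X2, y), fit_rss(X3, y)
n = 75
for label, (ra, pa), (rb, pb) in [("parallel vs single", (r1, p1), (r2, p2)),
                                  ("separate vs parallel", (r2, p2), (r3, p3))]:
    df1, df2 = pb - pa, n - pb
    F = ((ra - rb) / df1) / (rb / df2)
    print(label, f"F = {F:.2f} on {df1}, {df2} DF, P = {stats.f.sf(F, df1, df2):.3f}")
```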
Example 11.7
The three-group analysis of Examples 11.3 and 11.4 may be done by introducing two dummy variables: z1, taking the value 1 in Group A1 and 0 otherwise; and z2, taking the value 1 in Group A2 and 0 otherwise. As before, y represents vital capacity and x age. Two new variables are derived: w1 = xz1 and w2 = xz2. The multiple regressions of y on x, of y on x, z1 and z2, and of y on x, z1, z2, w1 and w2 give the following analysis of variance table.
                                                        DF     SSq       MSq      VR
(4) Due to regression on x, z1 and z2                    3    17.6063
(5) Due to introduction of w1 and w2 (= (6) − (4))       2     2.4994    1.2497   3.54
(6) Due to regression on x, z1, z2, w1 and w2            5    20.1057
(7) Residual about regression (6)                       78    27.5351    0.3530
The coefficients d1 and d2 are estimates of the corrected differences between Groups A1 and A2, respectively, and Group B; neither is significant. The coefficient b1, representing the age effect, is highly significant.
For the model with separate slopes the coefficients are such that (using c to represent estimates of the γs) the estimated slopes in the three groups, b1 + c1, b1 + c2 and b1, are −0.0851, −0.0465 and −0.0306, respectively, agreeing with the values found earlier from fitting the regressions for each group separately.
11.8 Multiple regression in the analysis of non-orthogonal data
The analysis of variance was used in Chapter 9 to study the separate effects of various factors, for data classified in designs exhibiting some degree of balance. These so-called orthogonal designs enable sums of squares representing different sources of variation to be presented simultaneously in the same analysis.

Many sets of data follow too unbalanced a design for any of the standard forms of analysis to be appropriate. The trouble here is that the various linear contrasts which together represent the sources of variation in which we are interested may not be orthogonal in the sense of §8.4, and the corresponding sums of squares do not add up to the Total SSq. We referred briefly to the analysis of unbalanced block designs in §9.2, and also to some devices that can be used if a design is balanced except for a small number of missing readings. Non-orthogonality in a design often causes considerable complication in the analysis. Multiple regression, introduced in §11.6, is a powerful method of studying the simultaneous effect on a random variable of various predictor variables, and no special conditions of balance are imposed on their values. The effect of any variable, or any group of variables, can, as we have seen, be exhibited by an analysis of variance. We might, therefore, expect that the methods of multiple regression would be useful for the analysis of data classified in an unbalanced design. In §11.7 the analysis of covariance was considered as a particular instance of multiple regression, the one-way classification into groups being represented by a system of dummy variables. This approach can be adopted for any factor with two or more levels. A significance test for any factor is obtained by performing two multiple regressions: first, including the dummy variables representing the factor and, secondly, without those variables. For a full analysis of any set of data, many multiple regressions may be needed.
Many statistical packages have programs for performing this sort of analysis. The name general linear model is often used, and should be distinguished from the more complex generalized linear model to be described in Chapters 12 and 14. In interpreting the output of such an analysis, care must be taken to ensure that the test for the effect of a particular factor is not distorted by the presence or absence of correlated factors. In particular, for data with a factorial structure, the warning given on p. 255, against the testing of main effects in the presence of interactions involving the relevant factors, should be heeded.
Freund et al. (1986) describe four different ways in which sums of squares can be calculated in general linear models, and the four different types are particularly relevant to output from PROC GLM of SAS. Briefly, Type I is appropriate when factors are introduced in a predetermined order; Type II shows the contribution of each factor in the presence of all others except interactions involving that factor; in Types III and IV, any interactions defined in the model are retained even for the testing of main effects.

In the procedure recommended on p. 353 to distinguish between different forms of multiple regression within groups, we noted that the order of introducing the variables is critical. This procedure corresponds to Type I SSq since the variables are introduced in a predetermined order. The Type I SSq for differences between groups, in the analysis of covariance model with no interactions between the effects of the covariates and the groups, is also a Type II SSq, since the group differences are then tested in a model that excludes interactions. Type I SSq are also useful in split-unit designs, where it is necessary to introduce terms for the treatments allocated to main units before terms for the subunit treatments and interactions (§9.6).
The rationale of Type III and IV SSq is to try and reproduce the same type of analysis for main effects and interactions that would have occurred if the design had been balanced. In a balanced design, as we saw in §9.3, the SSq for main effects and interactions are orthogonal and this means that the value of the SSq for a main effect is the same whether an interaction involving that main effect is included or not. Type III SSq arise when contrasts are created corresponding to main effects and interactions, and these contrasts are defined to be orthogonal. Another feature of an orthogonal design is that if there is an interaction involving a main effect, then the estimate of that main effect is effectively the average value of the effect of that factor over the levels of the other factors with which it interacts. Type IV SSq are defined to reproduce this property. Type III and IV SSq will be identical unless there are some combinations of factors with no observations.
If there are interactions then the Type III and IV SSq for main effects correspond to averages of heterogeneous effects. The use of Type II SSq corresponds to a more commendable strategy of testing interactions in the presence of main effects, and either testing main effects without interactions (if the latter can safely be ignored) or not testing main effects at all (if interactions are regarded as being present), since it would rarely be useful to correct a main effect for an interaction. Type I SSq are also useful since they allow an effect to be corrected for any other effects as required in the context of the nature of the variables and the purpose of the analysis; in some cases the extraction of all the required Type I SSq may require several analyses, with the variables introduced in different orders. To summarize, in our view, appropriately chosen Type I SSq and Type II SSq are useful, but Types III and IV are unnecessary.
11.9 Checking the model
In this section we consider methods that can be used to check that a fitted regression model is valid in a statistical sense, that is, that the values of the regression coefficients, their standard errors, and inferences made from test statistics may be accepted.
Selecting the best regression
First we consider how the `best' regression model may be identified. For a fuller discussion of this topic see Berry and Simpson (1998), Draper and Smith (1998, Chapter 15) or Kleinbaum et al. (1998, Chapter 16).
The general form of the multiple regression model (11.38), where the variables are labelled x1, x2, ..., xp, includes all the explanatory variables on an apparently equal basis. In practice it would rarely, if ever, be the case that all variables had equal status. One or more of the explanatory variables may be of particular importance because they relate to the main objective of the analysis. Such a variable or variables should always be retained in the regression model, since estimation of their regression coefficients is of interest irrespective of whether or not they are statistically significant. Other variables may be included only because they may influence the response variable and, if so, it is important to correct for their effects, but otherwise they may be excluded. Other variables might be created to represent an interactive effect between two other explanatory variables, for example, the wj variables in (11.64), and these variables must be assessed before the main effects of the variables contributing to the interaction may be considered. All of these considerations imply that the best regression is not an unambiguous concept that could be obtained by a competent analyst without consideration of what the variables represent. Rather, determination of the best regression requires considerable input based on the purpose of the analysis, the nature of the possible explanatory variables and existing knowledge of the relationships involved.
For example, the effect of smoking on FEV may increase with years of smoking, and hence with age, so that there is an interactive effect of age and smoking on FEV. This would be included by creating an interaction variable defined as the product of the age and smoking variables.
Automatic selection procedures
A number of procedures have been developed whereby the computer selects the `best' subset of predictor variables, the criterion of optimality being somewhat arbitrary. There are four main approaches.
1 Step-up (forward-entry) procedure. The computer first tries all the p simple regressions with just one predictor variable, choosing that which provides the highest Regression SSq. Retaining this variable as the first choice, it now tries all the p − 1 two-variable regressions obtained by the various possibilities for the second variable, choosing that which adds the largest increment to the Regression SSq. The process continues, all variables chosen at any stage being retained at subsequent stages. The process stops when the increments to the Regression SSq cease to be (in some sense) large in comparison with the Residual SSq.
2 Step-down (backward-elimination) procedure. The computer first does the regression on all predictor variables. It then eliminates the least significant and does a regression on the remaining p − 1 variables. The process stops when all the retained regression coefficients are (in some sense) significant.
3 Stepwise procedure. This is an elaboration of the step-up procedure (1), but allowing elimination, as in the step-down procedure (2). After each change in the set of variables included in the regression, the contribution of each variable is assessed and, if the least significant makes insufficient contribution, by some criterion, it is eliminated. It is thus possible for a variable included at some stage to be eliminated at a later stage because other variables, introduced since it was included, have made it unnecessary. The criterion for inclusion and elimination of variables could be, for example, that a variable will be included if its partial regression coefficient is significant at the 0.05 level and eliminated if its partial regression coefficient fails to be significant at the 0.1 level.
4. Best-subset selection procedure. Methods 1, 2 and 3 do not necessarily reach the same final choice, even if they end with the same number of retained variables. None will necessarily choose the best possible regression (i.e. that with the largest Regression SSq) for any given number of predictor variables. Computer algorithms are available for selecting the best subset of variables, where `best' may be defined as the regression with the largest adjusted R² (11.46) or the related Mallows Cp statistic (see Draper & Smith, 1998, §15.1).
All these methods of model selection are available in the SAS program PROC REG under the SELECTION option; method 4 requires much more computer time than the others and may not be feasible for more than a few explanatory variables.
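As an informal sketch of the step-up procedure in method 1 (not a reproduction of the SAS SELECTION option), the following Python code adds at each stage the variable giving the largest increase in the Regression SSq and stops when the F statistic for the best remaining candidate falls below a threshold. The threshold of 4.0, the function name step_up and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def step_up(y, X, f_to_enter=4.0):
    """Step-up (forward-entry) selection on the columns of the DataFrame X."""
    remaining, chosen = list(X.columns), []
    while remaining:
        # Fit each candidate model and keep the one with the largest Regression SSq
        fits = {v: sm.OLS(y, sm.add_constant(X[chosen + [v]])).fit() for v in remaining}
        best = max(fits, key=lambda v: fits[v].ess)
        new_fit = fits[best]
        base_ess = sm.OLS(y, sm.add_constant(X[chosen])).fit().ess if chosen else 0.0
        # F statistic for the increment in Regression SSq against the Residual MSq
        f_stat = (new_fit.ess - base_ess) / new_fit.mse_resid
        if f_stat < f_to_enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic example: only x1 and x3 truly influence y
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=['x1', 'x2', 'x3', 'x4'])
y = 2.0 * X['x1'] - 1.5 * X['x3'] + rng.normal(size=50)
print(step_up(y, X))
```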
None of these methods provides infallible tactics in the difficult problem of selecting predictor variables. As discussed earlier in this section, sometimes certain variables should be retained even though they have non-significant effects, because of their logical importance in the particular problem. Sometimes logical relationships between some of the variables suggest that a particular one should be retained in preference to another. In some cases a set of variables has to be included or eliminated as a group, rather than individually; for example, a set of dummy variables representing a single characteristic (§11.7). In other cases some variables can only logically be included if one or more other variables are also included; for example, an interaction term should only be included in the presence of the corresponding main effects (§11.8). Many statistical computing packages do not allow the specification of all of these constraints and a sequence of applications may be necessary to produce a legitimate model. Nevertheless, automatic selection is often a useful exploratory device, even when the selected set of variables has to be modified on common-sense grounds.
Collinearity
We mentioned on p. 344 that difficulties may arise if some of the explanatory variables are highly correlated. This situation is known as collinearity or multicollinearity. More generally, collinearity arises if there is an almost linear relationship between some of the explanatory variables in the regression. In this case, large changes in one explanatory variable can be effectively compensated for by large changes in other variables, so that very different sets of regression coefficients provide very nearly the same residual sum of squares. This leads to the consequences that the regression coefficients will have large standard errors and, in extreme cases, the computations, even with the precision achieved by computers, may make nonsense of the analysis. The regression coefficients may be numerically quite implausible and perhaps largely useless.
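A brief numerical sketch of this instability, using made-up data rather than anything from the text: two almost identical explanatory variables give individually erratic coefficient estimates with large standard errors, even though their sum is well determined.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 differs from x1 only by tiny noise
y = x1 + x2 + rng.normal(size=n)         # true coefficients are (1, 1)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.params)   # the two slope estimates can be far from (1, 1) ...
print(fit.bse)      # ... and their standard errors are very large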
Collinearity is a feature of the explanatory variables independent of the values of the dependent variable. Often collinearity can be recognized from the correlation matrix of the explanatory variables. If two variables are highly correlated then this implies collinearity between those two variables. On the other hand, it is possible for collinearity to occur between a set of three or more variables without any of the correlations between these variables being particularly large, so it is useful to have a more formal check in the regression calculations. A measure of the collinearity between x_i and the other explanatory variables is provided by the proportion of the variability in x_i that is explained by the other variables, R_i², when x_i is the dependent variable in a regression on all the other xs. The variance of b_i, in the regression of y on x1 to xp, is proportional to the reciprocal of 1 − R_i² (see Wetherill et al., 1986, §4.3), so values of R_i² close to 1 will lead to an increased variance for the estimate of b_i. The variance inflation factor (VIF) for variable i is defined as

VIF_i = 1 / (1 − R_i²).
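The definition above translates directly into code. The minimal sketch below (the function name vif is chosen for the example) regresses each explanatory variable on all the others and returns 1/(1 − R_i²) for each; most regression packages report these factors directly.

```python
import numpy as np
import statsmodels.api as sm

def vif(X):
    """Variance inflation factors for the columns of X (an n-by-p array of
    explanatory variables, without a constant column): regress each column
    on all the others and return 1 / (1 - R_i^2)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)
```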
The problems of collinearity may be overcome in several ways. In some situations the collinearity has arisen purely as a computational problem and may be solved by alternative definitions of some of the variables. For example, if both x and x² are included as explanatory variables and all the values of x are positive, then x and x² are likely to be highly correlated. This can be overcome by redefining the quadratic term as (x − x̄)², which will reduce the correlation whilst leading to an equivalent regression. This device is called centring. When the collinearity is purely computational, values of the VIF much greater than 10 can be accepted before there are any problems of computational accuracy, using a modern regression package, such as the SAS program. This is demonstrated in the following example.
Example 11.8
In Example 11.6 when the analysis is carried out using the model defined by (11.64), extended to three groups, then there are high correlations between w_ij, z_i and x_j. For the full model with eight variables fitted, the values of the VIF for the z_i were both greater than 140, and two of the w_ij had similar VIFs, whilst the other two had VIFs of 86 and 92. Thus, six of the eight VIFs were particularly large and this might cause some concern about collinearity. These high VIFs can be avoided by centring x1 and x2, when the maximum VIF was 6.8. However, the original analysis lost no important precision, using SAS regression PROC REG, so that the collinearity was of no practical concern.
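A brief sketch of the centring device, using made-up positive x values rather than the data of Example 11.6: the raw quadratic term is almost collinear with x, whereas the centred quadratic term is nearly uncorrelated with it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=200)     # all values of x are positive

print(np.corrcoef(x, x**2)[0, 1])        # close to 1: x and x^2 highly correlated

xc2 = (x - x.mean())**2                  # centred quadratic term
print(np.corrcoef(x, xc2)[0, 1])         # much closer to 0
```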
In other situations the correlation is an intrinsic feature of the variables; for example, if both diastolic and systolic blood pressure were included, then a high correlation between these two measures may lead to a collinearity problem. The appropriate action is to use only one of the measures or possibly replace them by the mean of the pair. This leads to no real loss because the high correlation effectively means that only one measure is needed to use almost all the information.
In most situations the reasons for collinearity will be readily identified from the variance inflation factors and the nature of the variables, but where this is not the case more complex methods based on the principal components (§13.2) of the explanatory variables may be used (see Draper & Smith, 1998, §§16.4–16.5, or Kleinbaum et al., 1998, §12.5.2).
Another approach, which we shall not pursue, is to use ridge regression, a technique which tends to give more stable estimates of regression coefficients, usually closer to zero, than the least squares estimates. For a fuller discussion, see Draper and Smith (1998, Chapter 17). As they point out, this is an entirely reasonable approach from a Bayesian standpoint if one takes the view that numerically large coefficients are intrinsically implausible, and prior knowledge, or belief, on the distribution of the regression coefficients may be formally incorporated into the method. In other circumstances the method might be inappropriate and we agree with Draper and Smith's advice against its indiscriminate use.
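For readers who want to see the idea in outline only, a minimal sketch using scikit-learn's Ridge estimator follows; the penalty alpha=10 and the simulated data are arbitrary illustrative choices, and the sketch is not an endorsement of routine use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # a nearly collinear pair
y = x1 + x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

print(LinearRegression().fit(X, y).coef_)   # least squares: unstable coefficients
print(Ridge(alpha=10.0).fit(X, y).coef_)    # ridge: shrunk towards zero, more stable
```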
The SSq for (iii) is obtained by subtraction, (i) − (ii), and the adequacy of the model is tested by the variance ratio from lines (iii) and (iv).
In general, the above approach will not be feasible since the observations will not fall into groups of replicates.
Residual plots
Much information may be gained by graphical study of the residuals, y − Y. These values, and the predicted values Y, are often printed in computer output. The values for the data in Example 11.5 are shown in Table 11.6. We now describe some potentially useful scatter diagrams involving the residuals, illustrating these by Fig. 11.9, which relates to Example 11.5.
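The diagrams described below are easy to produce from any fitted regression. The following sketch uses hypothetical data (not the data of Example 11.5 or Fig. 11.9) and plots the residuals against the fitted values, against one explanatory variable, and against a product of two explanatory variables.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data with two explanatory variables
rng = np.random.default_rng(4)
x1 = rng.uniform(0.0, 10.0, size=80)
x2 = rng.uniform(0.0, 5.0, size=80)
y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=80)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fitted, resid)
axes[0].set_xlabel('fitted value Y'); axes[0].set_ylabel('residual y - Y')
axes[1].scatter(x1, resid)
axes[1].set_xlabel('x1')
axes[2].scatter(x1 * x2, resid)
axes[2].set_xlabel('x1 * x2')
plt.tight_layout()
plt.show()
```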
1. Plot of y − Y against Y (Fig. 11.9a). The residuals are always uncorrelated with the predicted value. Nevertheless, the scatter diagram may provide some useful pointers. The distribution of the residuals may be markedly non-normal; in Fig. 11.9(a) there is some suggestion of positive skewness. This positive skewness and other departures from normality can also be detected by constructing a histogram of the residuals or a normal plot (see p. 371; Example 11.10 is for the data of Example 11.5). The variability of the residuals may not be constant; in Fig. 11.9(a) it seems to increase as Y increases. Both these deficiencies may sometimes be remedied by transformation of the y variable and reanalysis (see §10.8). There is some indication in this example that it would be appropriate to repeat the analysis after a logarithmic transformation (§2.5) of y and this is explored in §11.10. In some cases the trend in variability may call for a weighted analysis (§11.6). Even though the correlation is zero there may be a marked non-linear trend; if so, it is likely to appear also in the plots of type 2 below.
2. Plot of y − Y against x_j (Fig. 11.9b,c). The residuals may be plotted against the values of any or all of the predictor variables. Again, the correlation will always be zero. There may, however, be a non-linear trend: for example, with the residuals tending to rise to a maximum somewhere near the mean, x̄_j, and falling away on either side, as is perhaps suggested in Fig. 11.9(c); or showing a trend with a minimum value near x̄_j. Such trends suggest that the effect of x_j is not adequately expressed by the linear term in the model. The simplest suggestion would be to add a term involving the square of x_j as an extra predictor variable. This so-called quadratic regression is described in §12.1.
3. Plot of y − Y against the product x_j x_h (Fig. 11.9d). The model (11.38) postulates no interaction between the xs, in the sense of §9.3. That is, the effect of changing one predictor variable is independent of the values taken by any other. This would not be so if a term x_j x_h were introduced into the model. If such a term is needed, but has been omitted, the residuals will tend to be