Statistical Methods in Medical Research - part 5

11.4 Regression in groups

Frequently data are classified into groups, and within each group a linear regression of y on x may be postulated. For example, the regression of forced expiratory volume on age may be considered separately for men in different occupational groups. Possible differences between the regression lines are then often of interest.

In this section we consider comparisons of the slopes of the regression lines. If the slopes clearly differ from one group to another, then so, of course, must the mean values of y, at least for some values of x. In Fig. 11.4(a), the slopes of the regression lines differ from group to group. The lines for groups (i) and (ii) cross. Those for (i) and (iii), and for (ii) and (iii), would also cross if extended sufficiently far, but here there is some doubt as to whether the linear regressions would remain valid outside the range of observed values of x.

If the slopes do not differ, the lines are parallel, as in Fig. 11.4(b) and (c), and here it becomes interesting to ask whether, as in (b), the lines differ in their height above the x axis (which depends on the coefficient a in the equation E(y) = a + bx), or whether, as in (c), the lines coincide. In practice, the fitted regression lines would rarely have precisely the same slope or position, and the question is to what extent differences between the lines can be attributed to random variation. Differences in position between parallel lines are discussed in §11.5. In this section we concentrate on the question of differences between slopes.

[Fig. 11.4 Regression lines for three groups, (i), (ii) and (iii), which differ in slope and position in (a), differ only in position in (b), and coincide in (c).]

Suppose that there are k groups, with n_i pairs of observations in the ith group. Denote the mean values of x and y in the ith group by x̄_i and ȳ_i, and the regression line, calculated as in §7.2, by

    Y_i = ȳ_i + b_i(x − x̄_i).

If all the n_i are reasonably large, a satisfactory approach is to estimate the variance of each b_i by (7.16) and to ignore the imprecision in these estimates of variance. Changing the notation of (7.16) somewhat, we shall denote the residual mean square for the ith group by s_i² and the sum of squares of x about x̄_i by Σ_(i)(x − x̄_i)². Note that the parenthesized suffix i attached to the summation sign indicates summation only over the specified group i. The estimated variance of b_i is then s_i²/Σ_(i)(x − x̄_i)², and each slope may be given a weight

    w_i = 1/var(b_i) = Σ_(i)(x − x̄_i)²/s_i².    (11.14)

Heterogeneity of the slopes may be tested approximately by referring Σ w_i(b_i − b)² to the χ² distribution on k − 1 DF (11.15), where b is the weighted mean slope

    b = Σ w_i b_i/Σ w_i,    (11.16)

with an estimated variance

    var(b) = 1/Σ w_i.    (11.17)

The sampling variation of b is approximately normal.

It is difficult to say how large the n_i must be for this 'large-sample' approach to be used with safety. There would probably be little risk in adopting it if none of the n_i fell below 20.
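As a minimal sketch of this large-sample approach (not from the book; it assumes each group is supplied as a pair of numpy arrays, and the function names are ours), the weights, the pooled slope (11.16), its variance (11.17) and an approximate chi-squared test of equal slopes might be computed as follows:

```python
import numpy as np
from scipy import stats

def slope_and_weight(x, y):
    """Within-group slope b_i, and weight w_i = Sxx_i / s_i^2 = 1 / var(b_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean()) ** 2)
    b = sxy / sxx
    s2 = (syy - sxy ** 2 / sxx) / (n - 2)   # residual mean square about the group's own line
    return b, sxx / s2

def pooled_slope(groups):
    """Weighted mean slope, its variance, and an approximate chi-squared test of equal slopes."""
    bw = np.array([slope_and_weight(x, y) for x, y in groups])
    b_i, w_i = bw[:, 0], bw[:, 1]
    b = np.sum(w_i * b_i) / np.sum(w_i)      # (11.16)
    var_b = 1.0 / np.sum(w_i)                # (11.17)
    het = np.sum(w_i * (b_i - b) ** 2)       # heterogeneity of the slopes
    p = stats.chi2.sf(het, df=len(groups) - 1)
    return b, var_b, het, p
```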

A more exact treatment is available provided that an extra assumption is made: that the residual variances s_i² are all equal. Suppose the common value is s². We consider first the situation where k = 2, as a comparison of two slopes can be effected by use of the t distribution. For k > 2 an analysis of variance is required.

With two groups, s² may be estimated by pooling the residual sums of squares about the two separate lines and dividing by n_1 + n_2 − 4. The difference between the two slopes has estimated variance

    var(b_1 − b_2) = s²(1/Sxx_1 + 1/Sxx_2),    (11.19)

and

    t = (b_1 − b_2)/SE(b_1 − b_2)    (11.20)

may be referred to the t distribution on n_1 + n_2 − 4 DF.

If a common value is assumed for the regression slope in the two groups, its value b may be estimated by

    b = (Sxy_1 + Sxy_2)/(Sxx_1 + Sxx_2),    (11.21)

with an estimated variance

    var(b) = s²/(Sxx_1 + Sxx_2).    (11.22)

Equations (11.21) and (11.22) can easily be seen to be equivalent to (11.16) and (11.17) if, in the calculation of w_i in (11.14), the separate estimates of residual variance s_i² are replaced by the common estimate s². For tests or the calculation of confidence limits for b using (11.22), the t distribution on n_1 + n_2 − 4 DF should be used. Where a common slope is accepted it would be more usual to estimate s² as the residual mean square about the parallel lines (11.34), which would have n_1 + n_2 − 3 DF.
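A hedged sketch of this exact two-group comparison, assuming a common residual variance; the helper name compare_two_slopes is ours, not the book's:

```python
import numpy as np
from scipy import stats

def compare_two_slopes(x1, y1, x2, y2):
    """t test of equal slopes for two groups, with a pooled residual mean square."""
    def ssq(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxx = np.sum((x - x.mean()) ** 2)
        sxy = np.sum((x - x.mean()) * (y - y.mean()))
        syy = np.sum((y - y.mean()) ** 2)
        return sxx, sxy, syy
    sxx1, sxy1, syy1 = ssq(x1, y1)
    sxx2, sxy2, syy2 = ssq(x2, y2)
    b1, b2 = sxy1 / sxx1, sxy2 / sxx2
    # pooled residual mean square about the two separate lines, n1 + n2 - 4 DF
    df = len(x1) + len(x2) - 4
    s2 = ((syy1 - sxy1 ** 2 / sxx1) + (syy2 - sxy2 ** 2 / sxx2)) / df
    se_diff = np.sqrt(s2 * (1 / sxx1 + 1 / sxx2))   # square root of (11.19)
    t = (b1 - b2) / se_diff                          # (11.20)
    p = 2 * stats.t.sf(abs(t), df)
    b_common = (sxy1 + sxy2) / (sxx1 + sxx2)         # (11.21)
    return b1, b2, b_common, t, p
```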

We shall first illustrate the calculations for two groups by amalgamating groups A1 and A2 (denoting the pooled group by A) and comparing groups A and B. The sums of squares and products of deviations about the mean, and the separate slopes b_i, are as follows:

    Group   i   n_i   Sxx_i      Sxy_i      Syy_i     b_i
    A       1   40     4397.38   −236.385   26.5812   −0.0538
    B       2   44     6197.16   −189.712   20.6067   −0.0306
    Total             10594.54   −426.097   47.1879   (−0.0402)

The SSq about the regressions are

    Σ_(1)(y − Y_1)² = 26.5812 − (−236.385)²/4397.38 = 13.8741

and

    Σ_(2)(y − Y_2)² = 20.6067 − (−189.712)²/6197.16 = 14.7991.

Thus,

    s² = (13.8741 + 14.7991)/(40 + 44 − 4) = 0.3584,

and, for the difference between b_1 and b_2, (11.19) and (11.20) give

    var(b_1 − b_2) = 0.3584(1/4397.38 + 1/6197.16) = 0.000139,
    t = −0.0232/0.0118 = −1.97 on 80 DF (P ≈ 0.05),

a difference of borderline significance.

[Table 11.4 Ages and vital capacities for three groups of workers in the cadmium industry: x, age last birthday (years); y, vital capacity (litres). Group A1, exposed > 10 years; Group A2, exposed < 10 years; Group B, not exposed.]

[Fig. 11.5 Scatter diagram showing age and vital capacity of 84 men working in the cadmium industry, divided into three groups (Table 11.4); separate-slope and parallel fitted lines are indicated.]

The steeper slope for group A may be partly or wholly due to a curvature in the regression: there is a suggestion that the mean value of y at high values of x is lower than is predicted by the linear regressions (see p. 330). Alternatively it may be that a linear regression is appropriate for each group, but that for group A the vital capacity declines more rapidly with age than for group B.

More than two groups

With any number of groups the pooled slope b is given by a generalization of (11.21):

    b = Σ_i Sxy_i/Σ_i Sxx_i,    (11.23)

and the fitted parallel line for the ith group is

    Y_i = ȳ_i + b(x − x̄_i).    (11.24)

Again, it can be shown that the sums of squares of these components can be added in the same way. This means that

    Within-groups SSq = Residual SSq about separate lines
                          + SSq due to differences between the b_i and b
                          + SSq due to fitting common slope b.    (11.26)

The middle term on the right is the one that particularly concerns us now. It can be obtained by noting that the SSq due to fitting the common slope is (Σ_i Sxy_i)²/Σ_i Sxx_i, while the SSq due to fitting the separate slopes is Σ_i Sxy_i²/Sxx_i; the SSq due to differences between the slopes is therefore

    Σ_i Sxy_i²/Sxx_i − (Σ_i Sxy_i)²/Σ_i Sxx_i.    (11.30)

It should be noted that (11.30) is equivalent to

    Σ_i W_i b_i² − (Σ_i W_i b_i)²/Σ_i W_i,

where W_i = Sxx_i = s²/var(b_i), and that the pooled slope b equals Σ W_i b_i/Σ W_i, the weighted mean of the b_i. The SSq due to differences in slope is thus essentially a weighted sum of squares of the b_i about their weighted mean b, the weights being (as usual) inversely proportional to the sampling variances (see §8.2).

The analysis is summarized in Table 11.5. There is only one DF for the common slope, since the SSq is proportional to the square of one linear contrast, b. The k − 1 DF for the second line follow because the SSq measures differences between k independent slopes, b_i. The residual DF follow because there are n_i − 2 DF for the ith group and Σ_i(n_i − 2) = n − 2k. The total DF within groups are, correctly, n − k. The F test for differences between slopes follows immediately.

Table 11.5 Analysis of variance for differences between regression slopes.

                                        DF
    Due to common slope                  1
    Differences between slopes           k − 1
    Residual about separate lines        n − 2k
    Within groups                        n − k

We now test the significance between the three slopes. The sums of squares and products of deviations about the mean, and the separate slopes, are as follows:

    Group   i   n_i   Sxx_i   Sxy_i   Syy_i   b_i

The SSq due to differences between the slopes is, from (11.30),

    Σ_i Sxy_i²/Sxx_i − (−373.574)²/9392.12 = 2.4995.

The analysis of variance may now be completed.

                         SSq       DF    MSq       VR
    Common slope         14.8590    1    14.8590   42.09   (P < 0.001)
    Between slopes        2.4995    2     1.2498    3.54   (P = 0.034)
    Separate residuals   27.5351   78     0.3530
    Within groups        44.8936   81

The differences between slopes are more significant than in the two-group analysis. The estimates of the separate slopes, with their standard errors calculated in terms of the Residual MSq on 78 DF, are

    b_A1 = −0.0851 ± 0.0197,  b_A2 = −0.0465 ± 0.0124,  b_B = −0.0306 ± 0.0075.

The most highly exposed group, A1, provides the steepest slope. Figure 11.6 shows the separate regressions as well as the three parallel lines.

The doubt about linearity suggests further that a curvilinear regression might be more suitable; however, analysis with a quadratic regression line (see §11.1) shows the non-linearity to be quite non-significant.

The analysis of variance test can, of course, be applied even for k = 2. The results will be entirely equivalent to the t test described at the beginning of this section, the value of F being, as usual, the square of the corresponding value of t.
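The partition of Table 11.5 can be reproduced directly. The sketch below is ours, not the book's; it assumes each group is given as an (x, y) pair of numpy arrays and uses only numpy and scipy:

```python
import numpy as np
from scipy import stats

def slopes_anova(groups):
    """Analysis of variance for differences between regression slopes (Table 11.5 layout)."""
    sxx, sxy, syy, n = [], [], [], 0
    for x, y in groups:
        x, y = np.asarray(x, float), np.asarray(y, float)
        n += len(x)
        sxx.append(np.sum((x - x.mean()) ** 2))
        sxy.append(np.sum((x - x.mean()) * (y - y.mean())))
        syy.append(np.sum((y - y.mean()) ** 2))
    sxx, sxy, syy = map(np.array, (sxx, sxy, syy))
    k = len(groups)

    ss_within = syy.sum()                               # within-groups SSq of y
    ss_common = sxy.sum() ** 2 / sxx.sum()              # 1 DF, common slope
    ss_separate = np.sum(sxy ** 2 / sxx)                # due to fitting separate slopes
    ss_between = ss_separate - ss_common                # k - 1 DF, differences between slopes
    ss_resid = ss_within - ss_separate                  # n - 2k DF, about separate lines

    ms_between = ss_between / (k - 1)
    ms_resid = ss_resid / (n - 2 * k)
    F = ms_between / ms_resid
    p = stats.f.sf(F, k - 1, n - 2 * k)
    return {"common slope SSq": ss_common, "between slopes SSq": ss_between,
            "residual SSq": ss_resid, "F": F, "P": p}
```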


11.5 Analysis of covariance

If, after an analysis of the type described in the last section, there is no strong reason for postulating differences between the slopes of the regression lines in the various groups, the following questions arise. What can be said about the relative position of parallel regression lines? Is there good reason to believe that the true lines differ in position, as in Fig. 11.4(b), or could they coincide, as in Fig. 11.4(c)? What sampling error is to be attached to an estimate of the difference in positions of lines for two particular groups? The set of techniques associated with these questions is called the analysis of covariance.

Before describing technical details, it may be useful to note some important differences in the purposes of the analysis of covariance and in the circumstances in which it may be used.

1 Main purpose

(a) To correct for bias. If it is known that changes in x affect the mean value of y, and that the groups under comparison differ in their values of x, it will follow that some of the differences between the values of y can be ascribed partly to differences between the xs. We may want to remove this effect as far as possible. For example, if y is forced expiratory volume (FEV) and x is age, a comparison of mean FEVs for men in different occupational groups may be affected by differences in their mean ages. A comparison would be desirable of the mean FEVs at the same age. If the regressions are linear and parallel, this means a comparison of the relative position of the regression lines.

(b) To increase precision. Even if the groups have very similar values of x, precision in the comparison of values of y can be increased by using the residual variation of y about regression on x rather than by analysing the ys alone.

2 Type of investigation

(a) Uncontrolled study. In many situations the observations will be made on units which fall naturally into the groups in question, with no element of controlled allocation. Indeed, it will often be this lack of control which leads to the bias discussed in 1(a).

(b) Controlled study. In a planned experiment, in which experimental units are allocated randomly to the different groups, the differences between values of x in the various groups will be no greater in the long run than would be expected by sampling theory. Of course, there will occasionally be large fortuitous differences in the xs; it may then be just as important to correct for their effect as it would be in an uncontrolled study. In any case, even with very similar values of x, the extra precision referred to in 1(b) may well be worth acquiring.

Two groups

If the t test based on (11.20) reveals no significant difference in slopes, two parallel lines may be fitted with a common slope b given by (11.23). The equations of the two parallel lines are (as in (11.24))

    Y_c1 = ȳ_1 + b(x − x̄_1)

and

    Y_c2 = ȳ_2 + b(x − x̄_2).

The difference in position of the two lines is estimated by the vertical distance between them,

    d = ȳ_1 − ȳ_2 − b(x̄_1 − x̄_2),    (11.32)

with

    var(d) = var(ȳ_1) + var(ȳ_2) + (x̄_1 − x̄_2)² var(b).    (11.33)

The residual mean square used in (11.33) and (11.34) is taken about parallel lines (since parallelism is an initial assumption in the analysis of covariance), and would be obtained from Table 11.5 by pooling the second and third lines of the analysis. The resultant DF would be (k − 1) + (n − 2k) = n − k − 1, which gives the n_1 + n_2 − 3 (= n − 3) of (11.34) when k = 2.

The standard error (SE) of d, the square root of (11.33), may be used in a t test. On the null hypothesis that the regression lines coincide, E(d) = 0, and

    t = d/SE(d)    (11.35)

has n_1 + n_2 − 3 DF. Confidence limits for the true difference, E(d), are obtained in the usual way.
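A minimal sketch of this two-group analysis of covariance (our naming; it assumes the parallel-lines model and numpy arrays as input):

```python
import numpy as np
from scipy import stats

def ancova_two_groups(x1, y1, x2, y2):
    """Difference in position of two parallel regression lines: d, its SE, and the t test."""
    x1, y1, x2, y2 = (np.asarray(a, float) for a in (x1, y1, x2, y2))
    n1, n2 = len(x1), len(x2)
    sxx = lambda x: np.sum((x - x.mean()) ** 2)
    sxy = lambda x, y: np.sum((x - x.mean()) * (y - y.mean()))
    syy = lambda y: np.sum((y - y.mean()) ** 2)

    b = (sxy(x1, y1) + sxy(x2, y2)) / (sxx(x1) + sxx(x2))        # common slope (11.23)
    # residual SSq about the two parallel lines, n1 + n2 - 3 DF
    ss_res = (syy(y1) + syy(y2)) - (sxy(x1, y1) + sxy(x2, y2)) ** 2 / (sxx(x1) + sxx(x2))
    s2 = ss_res / (n1 + n2 - 3)

    d = y1.mean() - y2.mean() - b * (x1.mean() - x2.mean())       # (11.32)
    var_d = s2 * (1 / n1 + 1 / n2) \
        + (x1.mean() - x2.mean()) ** 2 * s2 / (sxx(x1) + sxx(x2))  # (11.33)
    t = d / np.sqrt(var_d)                                         # (11.35)
    p = 2 * stats.t.sf(abs(t), n1 + n2 - 3)
    return d, np.sqrt(var_d), t, p
```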

The relative positions of the lines are conveniently expressed by the calculation of a corrected or adjusted mean value of y for each group. Suppose that group 1 had had a mean value of x equal to some arbitrary value x_0 rather than x̄_1. Often x_0 is given the value of the mean of x over all the groups. From (11.24) we should estimate that the mean y would have been

    y'_1 = ȳ_1 + b(x_0 − x̄_1),    (11.36)

with a similar expression for y'_2. The difference between y'_1 and y'_2 can easily be seen to be equal to d given by (11.32). If the regression lines coincide, y'_1 and y'_2 will be equal. If the line for group 1 lies above that for group 2 at a fixed value of x, y'_1 will exceed y'_2. The estimated variance of a corrected mean is

    var(y'_i) = s²[1/n_i + (x_0 − x̄_i)²/(Sxx_1 + Sxx_2)],

which varies with group not only through n_i but also because of the term (x_0 − x̄_i)², which increases as x̄_i gets further from x_0 in either direction. The arbitrary choice of x_0 is avoided by concentrating on the difference between the corrected means, which is equal to d for all values of x_0.
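As a small extension of the previous sketch, and under the same assumptions (b and s2 are the common slope and residual mean square about the parallel lines), the corrected means of (11.36) and their standard errors might be computed as:

```python
import numpy as np

def corrected_means(groups, b, s2, x0=None):
    """Adjusted means y'_i = ybar_i + b(x0 - xbar_i) and their standard errors.

    `groups` is a list of (x, y) array pairs; `b` and `s2` come from a parallel-lines fit.
    """
    xbar = np.array([np.mean(x) for x, _ in groups])
    ybar = np.array([np.mean(y) for _, y in groups])
    n = np.array([len(x) for x, _ in groups])
    sxx_total = sum(np.sum((np.asarray(x, float) - np.mean(x)) ** 2) for x, _ in groups)
    if x0 is None:                        # default: overall mean of x
        x0 = sum(np.sum(x) for x, _ in groups) / n.sum()
    adj = ybar + b * (x0 - xbar)          # (11.36)
    se = np.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / sxx_total))
    return adj, se
```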

Example 11.4

The data of Example 11.3 will now be used in an analysis of covariance. We shall first illustrate the calculations for two groups, the pooled exposed group A and the unexposed group B.

Using the sums of squares and products given earlier, we obtain d = 0.0835 with SE(d) = 0.1334, so that

    t = 0.0835/0.1334 = 0.63 on 81 DF.

The difference d is clearly not significant (P = 0.53). Corrected values can be calculated, using (11.36), with x_0 = 40.5, the overall mean.

This analysis is based on the assumption that the regression lines are parallel for groups A and B but, as noted in Example 11.3, there is at least suggestive evidence that the slopes differ. Suppose we abandon the assumption of parallelism and fit lines with separate slopes, b_A and b_B. The most pronounced difference between predicted values occurs at high ages. The difference at, say, age 60 is

    d' = ȳ_A − ȳ_B + b_A(60 − x̄_A) − b_B(60 − x̄_B)
       = 0.5306,

and

    var(d') = var(ȳ_A) + var(ȳ_B) + (60 − x̄_A)² var(b_A) + (60 − x̄_B)² var(b_B).

More than two groups

Parallel lines may be fitted to several groups, as indicated in §11.4. The pooled slope, b, is given by (11.23) and the line for the ith group is given by (11.24). The relative positions of the lines are again conveniently expressed in terms of corrected mean values (11.36). On the null hypothesis that the true regression lines for the different groups coincide, the corrected means will differ purely by sampling error. For two groups the test of this null hypothesis is the t test (11.35). With k (> 2) groups this test becomes an F test with k − 1 DF in the numerator. The construction of this test is complicated by the fact that the sampling errors of the corrected means y'_i are not independent, since the random variable b enters into each expression (11.36). The test is produced most conveniently using multiple regression, which is discussed in the next section, and we return to this topic in §11.7. Other aspects of the analysis are similar to the analysis with two groups. The data of Example 11.4 are analysed in three groups as Example 11.7 (p. 353) using multiple regression.

In considering the possible use of the analysis of covariance with a particular set of data, special care should be given to the identification of the dependent and independent variables. If, in the analysis of covariance of y on x, there are significant differences between groups, it does not follow that the same will be true of the regression of x on y. In many cases this is an academic point because the investigator is clearly interested in differences between groups in the mean value of y, after correction for x, and not in the reverse problem. Occasionally, when x and y have a symmetric type of relation to each other, as in §11.2, both of the analyses of covariance (of y on x, and of x on y) will be misleading. Lines representing the general trend of a functional relationship may well be coincident (as in Fig. 11.8), and yet both sets of regression lines are non-coincident. Here the difficulties of §11.2 apply, and lines drawn by eye, or the other methods discussed on p. 319, may provide the most satisfactory description of the data. For a fuller discussion, see Ehrenberg (1968).

The analysis of covariance described in this section is appropriate for data forming a one-way classification into groups. The method is essentially a combination of linear regression (§7.2) and a one-way analysis of variance (§8.1). Similar problems arise in the analysis of more complex data. For example, in the analysis of a variable y in a Latin square one may wish to adjust the apparent treatment effects to correct for variation in another variable x. In particular, x may be some pretreatment characteristic known to be associated with y; the covariance adjustment would then be expected to increase the precision with which the treatments can be compared. The general procedure is an extension of that considered above and the details will not be given.

[Fig. 11.8 Scatter diagram showing a common trend for two groups of observations but with non-coincident regression lines of y on x and of x on y.]

The methods presented in this and the previous section were developed before the use of computers became widespread, and were designed to simplify the calculations. Nowadays this is not a problem, and the fitting of regressions in groups, using either separate (non-parallel) lines or parallel lines, as in the analysis of covariance, may be accomplished using multiple regression, as discussed in the next two sections.

11.6 Multiple regression

In the earlier discussions of regression in Chapter 7 and the earlier sections of this chapter, we have been concerned with the relationship between the mean value of one variable and the value of another variable, concentrating particularly on the situation in which this relationship can be represented by a straight line.

It is often useful to express the mean value of one variable in terms not of one other variable but of several others. Some examples will illustrate some slightly different purposes of this approach.

1 The primary purpose may be to study the effect on variable y of changes in a particular single variable x_1, but it may be recognized that y may be affected by several other variables x_2, x_3, etc. The effect on y of simultaneous changes in x_1, x_2, x_3, etc., must therefore be studied. In the analysis of data on respiratory function of workers in a particular industry, such as those considered in Examples 11.3 and 11.4, the effect of duration of exposure to a hazard may be of primary interest. However, respiratory function is affected by age, and age is related to duration of exposure. The simultaneous effect on respiratory function of age and exposure must therefore be studied so that the effect of exposure on workers of a fixed age may be estimated.

2 One may wish to derive insight into some causative mechanism by discovering which of a set of variables x_1, x_2, ..., has apparently most influence on a dependent variable y. For example, the stillbirth rate varies considerably in different towns in Britain. By relating the stillbirth rate simultaneously to a large number of variables describing the towns (economic, social, meteorological or demographic variables, for instance), it may be possible to find which factors exert particular influence on the stillbirth rate (see Sutherland, 1946). Another example is in the study of variations in the cost per patient in different hospitals. This presumably depends markedly on the 'patient mix', the proportions of different types of patient admitted, as well as on other factors. A study of the simultaneous effects of many such variables may explain much of the variation in hospital costs and, by drawing attention to particular hospitals whose high or low costs are out of line with the prediction, may suggest new factors of importance.

3 To predict the value of the dependent variable in future individuals. After treatment of patients with advanced breast cancer by ablative procedures, prognosis is very uncertain. If future progress can be shown to depend on several variables available at the time of the operation, it may be possible to predict which patients have a poor prognosis and to consider alternative methods of treatment for them (Armitage et al., 1969).

The appropriate technique is called multiple regression. In general, the approach is to express the mean value of the dependent variable in terms of the values of a set of other variables, usually called independent variables. The nomenclature is confusing, since some of the latter variables may be either closely related to each other logically (e.g. one might be age and another the square of the age) or highly correlated (e.g. height and arm length). It is preferable to use the terms predictor or explanatory variables, or covariates, and we shall usually follow this practice.

The data to be analysed consist of observations on a set of n individuals, each individual providing a value of the dependent variable, y, and a value of each of the predictor variables, x_1, x_2, ..., x_p. The number of predictor variables, p, should preferably be considerably less than the number of observations, n, and the same p predictor variables must be available for each individual in any one analysis.

Suppose that, for particular values of x_1, x_2, ..., x_p, an observed value of y is specified by the linear model:

    y = β_0 + β_1x_1 + β_2x_2 + ... + β_px_p + e,    (11.38)

where e is an error term. The various values of e for different individuals are supposed to be independently normally distributed with zero mean and variance σ². The constants β_1, β_2, ..., β_p are called partial regression coefficients; β_0 is sometimes called the intercept. The coefficient β_1 is the amount by which y changes on the average when x_1 changes by one unit and all the other x_i remain constant. In general, β_1 will be different from the ordinary regression coefficient of y on x_1 because the latter represents the effect of changes in x_1 on the average values of y with no attempt to keep the other variables constant.

The coefficients β_0, β_1, β_2, ..., β_p are idealized quantities, measurable only from an infinite number of observations. In practice, from n observations, we have to obtain estimates of the coefficients and thus an estimated regression equation:

    Y = b_0 + b_1x_1 + b_2x_2 + ... + b_px_p.    (11.39)

Statistical theory tells us that a satisfactory method of obtaining the estimated regression equation is to choose the coefficients such that the sum of squares of residuals, Σ(y − Y)², is minimized, that is, by the method of least squares, which was introduced in §7.2. Note that here y is an observed value and Y is the value predicted by (11.39) in terms of the predictor variables. A consequence of this approach is that the regression equation (11.39) is satisfied if all the variables are given their mean values. Thus,

    ȳ = b_0 + b_1x̄_1 + b_2x̄_2 + ... + b_px̄_p,

and consequently b_0 can be replaced in (11.39) by

    ȳ − b_1x̄_1 − b_2x̄_2 − ... − b_px̄_p

to give the following form to the regression equation:

    Y = ȳ + b_1(x_1 − x̄_1) + b_2(x_2 − x̄_2) + ... + b_p(x_p − x̄_p).    (11.40)

The equivalent result for simple regression was given at (7.4).

We are now left with the problem of finding the partial regression coefficients, b_i. An extension of the notation introduced in §11.4 will be used; for instance,

    Sx_jx_j = Σ(x_j − x̄_j)² = Σx_j² − (Σx_j)²/n,

with corresponding definitions for sums of products such as Sx_jx_k and Sx_jy. The method of least squares gives a set of p simultaneous linear equations in the p unknowns, b_1, b_2, ..., b_p, as follows:

    (Sx_1x_1)b_1 + (Sx_1x_2)b_2 + ... + (Sx_1x_p)b_p = Sx_1y
    (Sx_2x_1)b_1 + (Sx_2x_2)b_2 + ... + (Sx_2x_p)b_p = Sx_2y
    ...
    (Sx_px_1)b_1 + (Sx_px_2)b_2 + ... + (Sx_px_p)b_p = Sx_py.    (11.41)

These are the so-called normal equations; they are the multivariate extension of the equation for b in §7.2. In general, there is a unique solution. The numerical coefficients of the left side of (11.41) form a matrix which is symmetric about the diagonal running from top left to bottom right, since, for example, Sx_1x_2 = Sx_2x_1. These coefficients involve only the xs. The terms on the right also involve the ys.

The complexity of the calculations necessary to solve the normal equations increases rapidly with the value of p but, since standard computer programs are available for multiple regression analysis, this need not trouble the investigator. Those familiar with matrix algebra will recognize this problem as being soluble in terms of the inverse matrix. We shall return to the matrix representation of multiple regression later in this section but, for now, shall simply note that the equations have a solution which may be obtained by using a computer with statistical software.
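As an illustration (not taken from the book), the normal equations (11.41) can be solved with a few lines of numpy; the function name and input layout are assumptions:

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of Y = b0 + b1*x1 + ... + bp*xp.

    X is an (n, p) array of predictor values, y an (n,) array of responses.
    The coefficients are found from the corrected sums of squares and products,
    mirroring (11.40)-(11.41).
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    Xc = X - X.mean(axis=0)                  # deviations from the means
    yc = y - y.mean()
    S_xx = Xc.T @ Xc                         # matrix of Sxjxk
    S_xy = Xc.T @ yc                         # vector of Sxjy
    b = np.linalg.solve(S_xx, S_xy)          # normal equations (11.41)
    b0 = y.mean() - X.mean(axis=0) @ b       # intercept from (11.40)
    return b0, b
```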

Sampling variation

As in simple regression, the Total SSq of y may be divided into the SSq due to regression and the SSq about regression, with (11.2) and (11.3) applying and illustrated by Fig. 11.1, although in multiple regression the corresponding figure could not be drawn, since it would be in (p + 1)-dimensional space. Thus, again we have

    Total SSq = SSq about regression + SSq due to regression,

which provides the opportunity for an analysis of variance. The subdivision of DF is as follows:

    Total  =  About regression  +  Due to regression
    n − 1  =  n − p − 1         +  p

The variance ratio

    F = (MSq due to regression)/(MSq about regression)    (11.42)

provides a composite test of the null hypothesis that β_1 = β_2 = ... = β_p = 0, i.e. that all the predictor variables are irrelevant.
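A minimal sketch of this analysis of variance and the F test (11.42), under the same assumed input layout as above:

```python
import numpy as np
from scipy import stats

def regression_anova(X, y):
    """SSq due to and about a multiple regression, and the composite F test (11.42)."""
    X = np.asarray(X, float); y = np.asarray(y, float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])             # add the intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ b
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_about = np.sum((y - fitted) ** 2)             # about regression, n - p - 1 DF
    ss_due = ss_total - ss_about                     # due to regression, p DF
    F = (ss_due / p) / (ss_about / (n - p - 1))      # (11.42)
    P = stats.f.sf(F, p, n - p - 1)
    return ss_due, ss_about, F, P
```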

We have so far discussed the significance of the joint relationship of y with the predictor variables. It is usually interesting to study the sampling variation of each b_j separately. This not only provides information about the precision of the partial regression coefficients, but also enables each of them to be tested for a significant departure from zero.

The variance and standard error of b_j are obtained by the multivariate extension of (7.16) as

    var(b_j) = c_jj s²,    (11.43)

and the covariance of two coefficients is

    cov(b_j, b_k) = c_jk s²,    (11.44)

where c_jk is the (j, k) element of the inverse of the p × p matrix of corrected sums of squares and products on the left side of (11.41), and s² is the residual mean square about the regression, on n − p − 1 DF. The standard error of b_j is the square root of (11.43), and b_j/SE(b_j) may be referred to the t distribution on n − p − 1 DF.

If a particular b_j is not significantly different from zero, it may be thought sensible to call it zero (i.e. to drop it from the regression equation) to make the equation as simple as possible. It is important to realize, though, that if this is done the remaining b_js would be changed; in general, new values would be obtained by doing a new analysis on the remaining x_js.

The proportion of the Total SSq of y accounted for by the regression is the squared multiple correlation coefficient, R²; its square root, R, is the multiple correlation coefficient. R² must increase as further variables are introduced into a regression and therefore cannot be used to compare regressions with different numbers of variables. The value of R² may be adjusted to take account of the chance contribution of each variable included by subtracting the value that would be expected if none of the variables was associated with y. The adjusted value, R²_a, is calculated from

    R²_a = 1 − (1 − R²)(n − 1)/(n − p − 1).    (11.46)

Example 11.5

The data shown in Table 11.6 are taken from a clinical trial to compare two hypotensive drugs used to lower the blood pressure during operations (Robertson & Armitage, 1959). The dependent variable, y, is the 'recovery time' (in minutes) elapsing between the time at which the drug was discontinued and the time at which the systolic blood pressure had returned to a specified level.

[Table 11.6 Data on the use of a hypotensive drug: x_1, log (quantity of drug used, mg); x_2, mean level of systolic blood pressure during hypotension (mmHg); y, recovery time (min). Group A: 'Minor' non-thoracic (n_A = 20).]

The 53 patients are divided into three groups according to the type of operation. Initially we shall ignore this classification, the data being analysed as a single group. Possible differences between groups are considered in §11.7.

Table 11.6 shows the values of x_1, x_2 and y for the 53 patients. The columns headed Y and y − Y will be referred to later. Below the data are shown the means and standard deviations of the three variables.

Using standard statistical software for multiple regression, the regression equation is

    Y = 23.011 + 23.639x_1 − 0.71467x_2.

The recovery time increases on the average by about 24 min for each increase of 1 in the log dose (i.e. each 10-fold increase in dose), and decreases by 0.71 min for every increase of 1 mmHg in the mean blood pressure during hypotension.

The analysis of variance of y is

                           SSq      DF   MSq      VR
    Due to regression    2783.2      2   1391.6   6.32   (P = 0.004)
    About regression    11007.9     50    220.2
    Total               13791.2     52

The variance ratio is highly significant and there is thus little doubt that either x_1 or x_2 is, or both are, associated with y. The squared multiple correlation coefficient, R², is 2783.2/13791.2 = 0.2018; R = √0.2018 = 0.45. Of the total sum of squares of y, about 80% (0.80 = 1 − R²) is still present after prediction of y from x_1 and x_2. The predictive value of x_1 and x_2 is thus rather low, even though it is highly significant.

From the analysis of variance, s² = 220.2, so that the SD about the regression, s, is 14.8. The standard errors of the partial regression coefficients follow from (11.43).

Analysis of variance test for deletion of variables

The t test for a particular regression coefficient, say, b_j, tests whether the corresponding predictor variable, x_j, can be dropped from the regression equation without any significant effect on the variation of y.

Sometimes we may wish to test whether variability is significantly affected by the deletion of a group of predictor variables. For example, in a clinical study there may be three variables concerned with body size: height (x_1), weight (x_2) and chest measurement (x_3). If all other variables represent quite different characteristics, it may be useful to know whether all three of the size variables can be dispensed with. Suppose that q variables are to be deleted, out of a total of p. If two multiple regressions are done, (a) with all p variables, and (b) with the reduced set of p − q variables, the following analysis of variance is obtained:

                                                        DF
    (i)   Due to regression (a) on all p variables      p
    (ii)  Due to regression (b) on p − q variables      p − q
    (iii) Due to deletion of q variables                q
    (iv)  Residual about regression (a)                 n − p − 1

The SSq for (i) and (iv) are obtained from regression (a), that for (ii) from regression (b) and that for (iii) by subtraction: (iii) = (i) − (ii). The variance ratio from (iii) and (iv) provides the required F test.

It is usually very simple to arrange for regressions to be done on several different subsets of predictor variables, using a multiple regression package on a computer. The 'TEST' option within the SAS program PROC REG gives the F test for the deletion of the variables.

When only one variable, say, the jth, is to be deleted, the same procedure could, in principle, be followed instead of the t test. The analysis of variance would be as above with q = 1. If this were done, it would be found that the variance ratio test, F, would be equal to t², thus giving the familiar equivalence between an F test on 1 and n − p − 1 DF, and a t test on n − p − 1 DF.
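The deletion test can be sketched as two regressions and an F ratio; the code below is illustrative only, with X_full and X_reduced assumed to be numpy arrays whose columns are the p and p − q predictors:

```python
import numpy as np
from scipy import stats

def residual_ss(X, y):
    """Residual sum of squares about the least-squares regression of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

def deletion_F_test(X_full, X_reduced, y):
    """F test for deleting q variables: compare regressions (a) and (b) as in the text."""
    y = np.asarray(y, float)
    n, p = np.asarray(X_full).shape
    q = p - np.asarray(X_reduced).shape[1]
    ss_a = residual_ss(X_full, y)        # residual about regression (a), n - p - 1 DF
    ss_b = residual_ss(X_reduced, y)     # residual about regression (b)
    F = ((ss_b - ss_a) / q) / (ss_a / (n - p - 1))
    return F, stats.f.sf(F, q, n - p - 1)
```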

It will sometimes happen that two or more predictor variables all give non-significant partial regression coefficients, and yet the deletion of the whole group has a significant effect by the F test. This often happens when the variables within the group are highly correlated; any one of them can be dispensed with, without appreciably affecting the prediction of y; the remaining variables in the group act as effective substitutes. If the whole group is omitted, though, there may be no other variables left to do the job. With a large number of interrelated predictor variables it often becomes quite difficult to sort out the meaning of the various partial regression coefficients (see p. 358).

Weighted regression

When the observations are to be given unequal weights w, the calculations follow the general multiple regression method, replacing all sums like Σx_j and Σy by Σwx_j and Σwy (the subscript i, identifying the individual observation, has been dropped here to avoid confusion with j, which identifies a particular explanatory variable). Similarly, all sums of squares and products are weighted: Σy² is replaced by Σwy², Σx_jx_k by Σwx_jx_k, and in calculating corrected sums of squares and products n is replaced by Σw (see (8.15) and (8.16)).

The standard t tests, F tests and confidence intervals are then valid. The Residual MSq is an estimate of σ², and may, in certain situations, be checked against an independent estimate of σ². For example, in the situation discussed on p. 360, where the observations fall into groups with particular combinations of values of predictor variables, y_i may be taken to be the mean of n_i observations at a specified combination of xs. The variance of y_i is then σ²/n_i, and the analysis may be carried out by weighted regression, with w_i = n_i. The Residual MSq will be the same as that derived from line (iii) of the analysis on p. 360, and may be compared (as indicated in the previous discussion) against the Within-Groups MSq in line (iv).
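A hedged sketch of weighted regression via the weighted normal equations (our formulation; the text's recipe of weighted sums of squares and products gives the same coefficients):

```python
import numpy as np

def weighted_regression(X, y, w):
    """Weighted least squares: minimise the sum of w * (y - Y)^2."""
    X = np.asarray(X, float); y = np.asarray(y, float); w = np.asarray(w, float)
    A = np.column_stack([np.ones(len(y)), X])         # intercept plus predictors
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # (X'WX) b = X'Wy
    resid = y - A @ coef
    n, pp1 = A.shape
    s2 = np.sum(w * resid ** 2) / (n - pp1)           # residual mean square
    return coef, s2
```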

Matrix representation

As noted, the normal equations (11.41) may be concisely written as a matrix equation, and the solution may be expressed in terms of an inverse matrix. In fact, the whole theory of multiple regression may be expressed in matrix notation and this approach will be indicated briefly. A fuller description is given in Chapters 4 and 5 of Draper and Smith (1998); see also Healy (2000) for a general description of matrices as applied in statistics. Readers unfamiliar with matrix notation could omit study of the details in this subsection, noting only that the matrix notation is a shorthand method used to represent a multivariate set of equations. With the more complex models that are considered in Chapter 12, the use of matrix notation is essential.

A matrix is a rectangular array of elements. The size of the array is expressed as the number of rows and the number of columns, so that an r × c matrix consists of r rows and c columns. A matrix with one column is termed a vector. We shall write matrices and vectors in bold type, with vectors in lower case and other matrices in upper case.

A theory of matrix algebra has been developed which defines operations that may be carried out. Two matrices, A and B, may be multiplied together provided that the number of columns of A is the same as the number of rows of B. If A is an r × s matrix and B is an s × t matrix, then their product C is an r × t matrix. A matrix with the same number of rows and columns is termed a square matrix. If A is a square r × r matrix then it has an inverse A⁻¹ such that the products AA⁻¹ and A⁻¹A both equal the r × r identity matrix, I, which has ones on the diagonal and zeros elsewhere. The identity matrix has the property that when it is multiplied by any other matrix the product is the same as the original matrix, that is, IB = B.

It is convenient to combine the intercept β_0 and the regression coefficients into a single set so that the (p + 1) × 1 vector β represents the coefficients β_0, β_1, β_2, ..., β_p. Correspondingly a dummy variable x_0 is defined which takes the value of 1 for every individual. The data for each individual are the x variables, x_0, x_1, x_2, ..., x_p, and the whole of the data may be written as the n × (p + 1) matrix, X, with element (i, j + 1) being the variable x_j for individual i. If y represents the vector of n observations on y, then the multiple regression equation (11.38) may be written as

    y = Xβ + e,    (11.47)

where e is the vector of error terms. The normal equations then take the form

    (XᵀX)b = Xᵀy.    (11.48)

The solution of (11.48) is then given by

    b = (XᵀX)⁻¹Xᵀy.    (11.49)

In matrix notation the SSq about the regression may be written as

    (y − Xb)ᵀ(y − Xb) = yᵀy − bᵀXᵀy,    (11.50)

and dividing this by n − p − 1 gives the residual mean square, s², as an estimate of the variance about the regression. The estimated variance-covariance matrix of b is

    var(b) = s²(XᵀX)⁻¹.    (11.51)

It can be shown that (11.51) gives the same variances and covariances for b_1, b_2, ..., b_p as (11.43) and (11.44), even though (11.51) involves the inverse of a (p + 1)-square matrix of uncorrected sums of squares and products, whilst (11.43) and (11.44) involve the inverse of a p-square matrix of corrected sums of squares and products.

Using (11.49), the right side of (11.50) may be rewritten as

    yᵀy − yᵀHy = yᵀ(I − H)y,    (11.52)

where the (n × n) matrix H = X(XᵀX)⁻¹Xᵀ, and I is the (n × n) identity matrix.

Weighted regression may be expressed in matrix notation as follows. First define a square n × n matrix, V, with diagonal elements equal to 1/w_i, where w_i is the weight for the ith subject, and all off-diagonal terms zero. Then the normal equations (11.48) become

    (XᵀV⁻¹X)b = XᵀV⁻¹y,    (11.53)

with solution

    b = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.    (11.54)

Further generalizations are possible. Suppose that the usual assumption that the error terms, e, of different subjects are independent does not apply, and that there are correlations between them. Then the matrix V could represent this situation by non-zero off-diagonal terms corresponding to the covariances between subjects. The matrix V is termed the dispersion matrix. Then (11.53) and (11.54) still apply and the method is termed generalized least squares. This will not be pursued here but will be taken up again in the context of multilevel models (§12.5) and in the analysis of longitudinal data (§12.6).
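The matrix formulae (11.49)-(11.52) translate almost verbatim into numpy; the sketch below assumes X already carries the leading column of ones (the dummy variable x_0):

```python
import numpy as np

def matrix_fit(X, y):
    """Multiple regression in matrix form: b, s^2 and var(b) = s^2 (X'X)^{-1}."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p_plus_1 = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                      # (11.49)
    H = X @ XtX_inv @ X.T                      # hat matrix, as in (11.52)
    ss_about = y @ (np.eye(n) - H) @ y         # SSq about the regression
    s2 = ss_about / (n - p_plus_1)             # residual mean square
    var_b = s2 * XtX_inv                       # (11.51)
    return b, s2, var_b
```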

11.7 Multiple regression in groups

When the observations fall into k groups by a one-way classification, questions of the types discussed in §§11.4 and 11.5 may arise. Can equations with the same partial regression coefficients (b_is), although perhaps different intercepts (b_0s), be fitted to different groups, or must each group have its own set of b_is? (This is a generalization of the comparison of slopes in §11.4.) If the same b_is are appropriate for all groups, can the same b_0 be used (thus leading to one equation for the whole data), or must each group have its own b_0? (This is a generalization of the analysis of covariance, §11.5.)

The methods of approach are rather straightforward developments of those used previously and will be indicated only briefly. Suppose there are, in all, n observations falling into k groups, with p predictor variables observed throughout. To test whether the same b_is are appropriate for all groups, an analysis of variance analogous to Table 11.5 may be derived, with the following subdivision of DF:

                                                       DF
    (i)   Due to regression with common slopes (b_is)  p
    (ii)  Differences between slopes (b_is)            p(k − 1)
    (iii) Residual about separate regressions          n − (p + 1)k

The DF agree with those of Table 11.5 when p = 1. The SSq within groups is exactly the same as in Table 11.5. The SSq for (iii) is obtained by fitting a separate regression equation to each of the k groups and adding the resulting Residual SSq. The residual for the ith group has n_i − (p + 1) DF, and these add to n − (p + 1)k. The SSq for (i) is obtained by a simple multiple regression calculation using the pooled sums of squares and products within groups throughout; this is the appropriate generalization of the first line of Table 11.5. The SSq for (ii) is obtained by subtraction. The DF, obtained also by subtraction, are plausible as this SSq represents differences between k values of b_1, between k values of b_2, and so on; there are p predictor variables, each corresponding to k − 1 DF.

It may be more useful to have a rather more specific comparison of some regression coefficients than is provided by the composite test described above. For a particular coefficient, b_j, for instance, the k separate multiple regressions will provide k values, each with its standard error. Straightforward comparisons of these, e.g. using (11.15) with weights equal to the reciprocals of the variances of the separate values, will often suffice.

The analysis of covariance assumes common values for the b_is and tests for differences between the b_0s. The analysis of covariance has the following DF:

                                                       DF
    (iv)  Between groups                                k − 1
    (v)   Residual about within-groups regression       n − k − p
    (vi)  Total: residual about overall regression      n − p − 1

The corrected Total SSq (vi) is obtained from a single multiple regression calculation for the whole data; the DF are n − p − 1 as usual. That for (v) is obtained as the residual for the regression calculation using within-groups sums of squares and products; it is in fact the residual corresponding to the regression term (i) in the previous table, and is the sum of the SSq for (ii) and (iii) in that table. That for (iv) is obtained by subtraction.

As indicated in §11.5, multiple regression techniques offer a convenient approach to the analysis of covariance, enabling the whole analysis to be done by one application of multiple regression. Consider first the case of two groups. Let us introduce a new variable, z, which is given the value 1 for all observations in group 1 and 0 for all observations in group 2. As a model for the data as a whole, suppose that

    E(y) = β_0 + δz + β_1x_1 + β_2x_2 + ... + β_px_p.    (11.55)

Because of the definition of z, (11.55) is equivalent to assuming that

    E(y) = β_0 + δ + β_1x_1 + β_2x_2 + ... + β_px_p   for group 1
         = β_0 + β_1x_1 + β_2x_2 + ... + β_px_p       for group 2,    (11.56)

so that δ represents the difference in position between the two groups. The whole analysis may be carried out by a single multiple regression of y on z, x_1, x_2, ..., x_p. The new variable z is called a dummy, or indicator, variable. The coefficient δ is the partial regression coefficient of y on z, and is estimated in the usual way by the multiple regression analysis, giving an estimate d, say. The variance of d is estimated as usual from (11.43) or (11.51), and the appropriate tests and confidence limits follow by use of the t distribution. Note that the Residual MSq has n − p − 2 DF (since the introduction of z increases the number of predictor variables from p to p + 1), and that this agrees with (v) on p. 348 (putting k = 2).
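A minimal sketch of the dummy-variable approach for k groups (our naming and coding convention: the last group is the reference, as in the text):

```python
import numpy as np
from scipy import stats

def ancova_dummy(X, y, group):
    """Analysis of covariance via multiple regression with dummy variables.

    `group` holds integer labels 0..k-1; group k-1 is the reference, so the
    dummy coefficients estimate corrected differences from that group.
    """
    X = np.asarray(X, float); y = np.asarray(y, float); group = np.asarray(group)
    k = group.max() + 1
    Z = np.column_stack([(group == g).astype(float) for g in range(k - 1)])
    A = np.column_stack([np.ones(len(y)), Z, X])      # intercept, dummies, covariates
    XtX_inv = np.linalg.inv(A.T @ A)
    b = XtX_inv @ A.T @ y
    resid = y - A @ b
    df = len(y) - A.shape[1]
    s2 = np.sum(resid ** 2) / df
    se = np.sqrt(s2 * np.diag(XtX_inv))
    d, se_d = b[1:k], se[1:k]                         # estimates of the group contrasts
    t = d / se_d
    p = 2 * stats.t.sf(np.abs(t), df)
    return d, se_d, t, p
```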

When k > 2, the procedure described above is generalized by the introduction of k − 1 dummy variables. These can be defined in many equivalent ways. One convenient method is as follows: z_i takes the value 1 for all observations in group i (i = 1, 2, ..., k − 1) and 0 otherwise, so that every dummy variable is 0 for observations in group k. The model is then

    E(y) = β_0 + δ_1z_1 + ... + δ_{k−1}z_{k−1} + β_1x_1 + ... + β_px_p,    (11.57)

and the fitted regression equation is

    Y = b_0 + d_1z_1 + ... + d_{k−1}z_{k−1} + b_1x_1 + ... + b_px_p.    (11.58)

The regression coefficients d_1, d_2, ..., d_{k−1} represent contrasts between the mean values of y for groups 1, 2, ..., k − 1 and that for group k, after correction for differences in the xs. The overall significance test for the null hypothesis that δ_1 = δ_2 = ... = δ_{k−1} = 0 corresponds to the F test on k − 1 and n − k − p DF ((iv) and (v) on p. 348). The equivalent procedure here is to test the composite significance of d_1, d_2, ..., d_{k−1} by deleting the dummy variables from the analysis (p. 344). This leads to exactly the same F test.

Corrected means analogous to (11.36) are obtained as

    y'_i = ȳ_i + Σ_j b_j(x_0j − x̄_ij),

where x_0j is the value of x_j (often the overall mean) at which the correction is made. The difference between the corrected means of groups 1 and 2, say, is

    y'_1 − y'_2 = (ȳ_1 − ȳ_2) − Σ_j b_j(x̄_1j − x̄_2j).    (11.61)

Its estimated variance is obtained from the variances and covariances of the ds. For a contrast between group k and one of the other groups, i, say, the appropriate d_i and its standard error are given directly by the regression analysis; d_i is in fact the same as the difference between corrected means y'_i − y'_k. For a contrast between two groups other than group k, say, groups 1 and 2, we use the fact that d_1 − d_2 = y'_1 − y'_2, whose variance is var(d_1) + var(d_2) − 2 cov(d_1, d_2).

Example 11.6

The data of Table 11.6 have so far been analysed as a single group. However, Table 11.6 shows a grouping of the patients according to the type of operation, and we should clearly enquire whether a single multiple regression equation is appropriate for all three groups. Indeed, this question should be raised at an early stage of the analysis, before too much effort is expended on examining the overall regression.

An exploratory analysis can be carried out by examining the residuals, y − Y, in Table 11.6. The mean values of the residuals in the three groups are:

    A: −2.52    B: −0.46    C: 4.60

These values suggest the possibility that the mean recovery time, at given values of x_1 and x_2, increases from group A, through group B, to group C. However, the residuals are very variable, and it is not immediately clear whether these differences are significant. A further question is whether the regression coefficients are constant from group to group.

As a first step in a more formal analysis, we calculate separate multiple regressions in the three groups. The results are summarized as follows:

    Group A    Group B    Group C

Predicted values from the three multiple regressions, for values of x_1 and x_2 equal to the overall means, 1.9925 and 66.340, respectively, are:

    A: 19.03    B: 22.44    C: 26.24,

closer together than the unadjusted means shown in the table above.

There are substantial differences between the estimated regression coefficients across the three groups, but the standard errors are large. Using the methods of §11.4, we can test for homogeneity of each coefficient separately, the reciprocals of the variances being used as weights. This gives, as approximate χ² statistics on 2 DF, 1.86 for b_1 and 4.62 for b_2, neither of which is significant at the usual levels. We cannot add these χ²_(2) statistics, since b_1 and b_2 are not independent. For a more complete analysis of the whole data set we need a further analysis, using a model like (11.57). We shall need two dummy variables, z_1 (taking the value 1 in group A and 0 elsewhere) and z_2 (1 in group B and 0 elsewhere). Combining the Residual SSqs from the initial overall regression, the new regression with two dummy variables, and the regressions within separate groups (pooling the three residuals), we have:

                                            DF    SSq       MSq     VR
    Residual from overall regression        50   11007.9
    Between intercepts                       2     421.3    210.7   0.96
    Residual from (11.57)                   48   10586.6    220.6
    Between slopes                           4    1171.7    292.9   1.37
    Residual from separate regressions      44    9414.9    214.0

In the table on p. 351, the SSq between intercepts and between slopes are obtained by subtraction. Neither term is at all significant. Note that the test for differences between slopes is approximately equivalent to χ²_(4) = 4 × 1.37 = 5.48, rather less than the sum of the two χ²_(2) statistics given earlier, confirming the non-independence of the two previous tests. Note also that even if the entire SSq between intercepts were ascribed to a trend across the groups, with 1 DF, the variance ratio (VR) would be only 421.33/220.6 = 1.91, far from significant. The point can be made in another way, by comparing the intercepts, in the model of (11.57), for the two extreme groups, A and C. These groups differ in the values taken by z_1 (1 and 0, respectively). In the analysis of the model (11.57), the estimate d_1 of the regression coefficient on z_1 is −7.39 ± 5.39, again not significant.

Of course, the selection of the two groups with the most extreme contrast biases the analysis in favour of finding a significant difference, but, as we see, even this contrast is not significantly large.

In summary, it appears that the overall multiple regression fitted initially to the data of Table 11.6 can be taken to apply to all three groups of patients.

The between-slopes SSq was obtained as the difference between the residual fitting (11.57) and the sum of the three separate residuals fitting regressions for each group separately. It is often more convenient to do the computations as analyses of the total data set, as follows.

Consider first two groups with a dummy variable for group 1, z, defined as earlier. Now define new variables w_j, for j = 1 to p, by w_j = z x_j. Since z is zero in group 2, all the w_j are also zero in group 2; z = 1 in group 1 and therefore w_j = x_j in group 1. Consider the following model:

    E(y) = β_0 + δz + β_1x_1 + ... + β_px_p + γ_1w_1 + ... + γ_pw_p.    (11.64)

Because of the definitions of z and the w_j, (11.64) is equivalent to

    E(y) = β_0 + δ + (β_1 + γ_1)x_1 + ... + (β_p + γ_p)x_p   for group 1
         = β_0 + β_1x_1 + ... + β_px_p                        for group 2,    (11.65)

which gives lines of different slopes and intercepts for the two groups. The coefficient γ_j is the difference between the slopes on x_j in the two groups. The null hypothesis that γ_1 = γ_2 = ... = γ_p = 0 is tested by deleting the w_j variables from the regression to give the F test on p and n − 2p − 2 DF.

When k > 2, the above procedure is generalized by deriving p(k − 1) variables, w_ij = z_i x_j, and an overall test of parallel regressions in all the groups is given by the composite test of the regression coefficients for all the w_ij. This is an F test with p(k − 1) and n − (p + 1)k DF.

In the above procedure the order of introducing (or deleting) the variables is crucial. The w_ij should only be included in regressions in which the corresponding z_i and x_j variables are also included. Three regressions are fitted in sequence on the following variables, as illustrated in the sketch after this list:

    (i)   x_1, ..., x_p,
    (ii)  x_1, ..., x_p, z_1, ..., z_{k−1},
    (iii) x_1, ..., x_p, z_1, ..., z_{k−1}, w_11, ..., w_{k−1,p}.

These three regressions in turn correspond to Fig. 11.4(c), 11.4(b) and 11.4(a), respectively.
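The three sequential regressions (i)-(iii) can be sketched as follows (illustrative only; group labels are assumed to be integers 0, ..., k − 1 and the last group is the reference):

```python
import numpy as np
from scipy import stats

def rss(A, y):
    """Residual SS from least squares of y on the columns of A (A includes the intercept)."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

def groups_regression_tests(X, y, group):
    """Sequential fits (i)-(iii): covariates only, + dummies, + dummy-by-covariate products."""
    X = np.asarray(X, float); y = np.asarray(y, float); group = np.asarray(group)
    n, p = X.shape
    k = group.max() + 1
    ones = np.ones((n, 1))
    Z = np.column_stack([(group == g).astype(float) for g in range(k - 1)])
    W = np.column_stack([Z[:, [i]] * X for i in range(k - 1)])       # w_ij = z_i * x_j

    rss_i = rss(np.hstack([ones, X]), y)                  # (i) single equation
    rss_ii = rss(np.hstack([ones, X, Z]), y)              # (ii) parallel equations
    rss_iii = rss(np.hstack([ones, X, Z, W]), y)          # (iii) separate equations

    df_iii = n - (p + 1) * k
    F_slopes = ((rss_ii - rss_iii) / (p * (k - 1))) / (rss_iii / df_iii)
    F_intercepts = ((rss_i - rss_ii) / (k - 1)) / (rss_ii / (n - p - k))
    return {"F slopes": F_slopes, "P slopes": stats.f.sf(F_slopes, p * (k - 1), df_iii),
            "F intercepts": F_intercepts,
            "P intercepts": stats.f.sf(F_intercepts, k - 1, n - p - k)}
```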

Example 11.7

The three-group analysis of Examples 11.3 and 11.4 may be done by introducing two dummy variables: z_1, taking the value 1 in Group A1 and 0 otherwise; and z_2, taking the value 1 in Group A2 and 0 otherwise. As before, y represents vital capacity and x age. Two new variables are derived: w_1 = x z_1 and w_2 = x z_2. The multiple regressions of y on x, of y on x, z_1 and z_2, and of y on x, z_1, z_2, w_1 and w_2 give the following analysis of variance table:

                                                           DF    SSq       MSq      VR
    (4) Due to regression on x, z_1 and z_2                 3    17.6063
    (5) Due to introduction of w_1 and w_2 (= (6) − (4))    2     2.4994   1.2497   3.54
    (6) Due to regression on x, z_1, z_2, w_1 and w_2       5    20.1057
    (7) Residual about regression (6)                      78    27.5351   0.3530

The coefficients d_1 and d_2 are estimates of the corrected differences between Groups A1 and A2, respectively, and Group B; neither is significant. The coefficient b_1, representing the age effect, is highly significant.

For the model with separate slopes the coefficients are (using c to represent estimates of the γs) such that the estimated slopes for the three groups are b_1 + c_1 = −0.0851, b_1 + c_2 = −0.0465, and b_1 = −0.0306, respectively, agreeing with the values found earlier from fitting the regressions for each group separately.

11.8 Multiple regression in the analysis of non-orthogonal data

The analysis of variance was used in Chapter 9 to study the separate effects of various factors, for data classified in designs exhibiting some degree of balance. These so-called orthogonal designs enable sums of squares representing different sources of variation to be presented simultaneously in the same analysis.

Many sets of data follow too unbalanced a design for any of the standard forms of analysis to be appropriate. The trouble here is that the various linear contrasts which together represent the sources of variation in which we are interested may not be orthogonal in the sense of §8.4, and the corresponding sums of squares do not add up to the Total SSq. We referred briefly to the analysis of unbalanced block designs in §9.2, and also to some devices that can be used if a design is balanced except for a small number of missing readings. Non-orthogonality in a design often causes considerable complication in the analysis.

Multiple regression, introduced in §11.6, is a powerful method of studying the simultaneous effect on a random variable of various predictor variables, and no special conditions of balance are imposed on their values. The effect of any variable, or any group of variables, can, as we have seen, be exhibited by an analysis of variance. We might, therefore, expect that the methods of multiple regression would be useful for the analysis of data classified in an unbalanced design. In §11.7 the analysis of covariance was considered as a particular instance of multiple regression, the one-way classification into groups being represented by a system of dummy variables. This approach can be adopted for any factor with two or more levels. A significance test for any factor is obtained by performing two multiple regressions: first, including the dummy variables representing the factor and, secondly, without those variables. For a full analysis of any set of data, many multiple regressions may be needed.

Many statistical packages have programs for performing this sort of analysis. The name general linear model is often used, and should be distinguished from the more complex generalized linear model to be described in Chapters 12 and 14.

In interpreting the output of such an analysis, care must be taken to ensure that the test for the effect of a particular factor is not distorted by the presence or absence of correlated factors. In particular, for data with a factorial structure, the warning given on p. 255, against the testing of main effects in the presence of interactions involving the relevant factors, should be heeded.

Freund et al. (1986) describe four different ways in which sums of squares can be calculated in general linear models, and the four different types are particularly relevant to output from PROC GLM of SAS. Briefly, Type I is appropriate when factors are introduced in a predetermined order; Type II shows the contribution of each factor in the presence of all others except interactions involving that factor; in Types III and IV, any interactions defined in the model are retained even for the testing of main effects.

In the procedure recommended on p. 353 to distinguish between different forms of multiple regression within groups, we noted that the order of introducing the variables is critical. This procedure corresponds to Type I SSq since the variables are introduced in a predetermined order. The Type I SSq for differences between groups, in the analysis of covariance model with no interactions between the effects of the covariates and the groups, is also a Type II SSq, since the group differences are then tested in a model that excludes interactions. Type I SSq are also useful in split-unit designs, where it is necessary to introduce terms for the treatments allocated to the main unit before terms for the subunit treatments and interactions (§9.6).

The rationale of Type III and IV SSq is to try and reproduce the same type of analysis for main effects and interactions that would have occurred if the design had been balanced. In a balanced design, as we saw in §9.3, the SSq for main effects and interactions are orthogonal and this means that the value of the SSq for a main effect is the same whether an interaction involving that main effect is included or not. Type III SSq arise when contrasts are created corresponding to main effects and interactions, and these contrasts are defined to be orthogonal. Another feature of an orthogonal design is that if there is an interaction involving a main effect, then the estimate of that main effect is effectively the average value of the effect of that factor over the levels of the other factors with which it interacts. Type IV SSq are defined to reproduce this property. Type III and IV SSq will be identical unless there are some combinations of factors with no observations.

If there are interactions then the Type III and IV SSq for main effects correspond to averages of heterogeneous effects. The use of Type II SSq corresponds to a more commendable strategy of testing interactions in the presence of main effects, and either testing main effects without interactions (if the latter can safely be ignored) or not testing main effects at all (if interactions are regarded as being present), since it would rarely be useful to correct a main effect for an interaction. Type I SSq are also useful since they allow an effect to be corrected for any other effects as required in the context of the nature of the variables and the purpose of the analysis; in some cases the extraction of all the required Type I SSq may require several analyses, with the variables introduced in different orders. To summarize, in our view, appropriately chosen Type I SSq and Type II SSq are useful, but Types III and IV are unnecessary.
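As an illustration of Type I (sequential) SSq, the sketch below (ours, not SAS's) adds blocks of dummy-variable columns in a predetermined order and records the SSq attributable to each block:

```python
import numpy as np

def type_I_ssq(blocks, y):
    """Sequential (Type I) sums of squares: add each block of columns in the stated order.

    `blocks` is an ordered list of (name, design-matrix-columns) pairs, e.g. dummy
    variables for factor A, then factor B, then their interaction.
    """
    y = np.asarray(y, float)
    n = len(y)
    A = np.ones((n, 1))                        # start with the intercept only
    rss_prev = np.sum((y - y.mean()) ** 2)
    out = []
    for name, cols in blocks:
        cols = np.asarray(cols, float).reshape(n, -1)
        A = np.hstack([A, cols])
        b, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ b) ** 2)
        out.append((name, rss_prev - rss))     # SSq attributable to this block, given earlier ones
        rss_prev = rss
    return out
```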

11.9 Checking the model

In this section we consider methods that can be used to check that a fitted regression model is valid in a statistical sense, that is, that the values of the regression coefficients, their standard errors, and inferences made from test statistics may be accepted.

Selecting the best regression

First we consider how the 'best' regression model may be identified. For a fuller discussion of this topic see Berry and Simpson (1998), Draper and Smith (1998, Chapter 15) or Kleinbaum et al. (1998, Chapter 16).

The general form of the multiple regression model (11.38), where the variables are labelled x_1, x_2, ..., x_p, includes all the explanatory variables on an apparently equal basis. In practice it would rarely, if ever, be the case that all variables had equal status. One or more of the explanatory variables may be of particular importance because they relate to the main objective of the analysis. Such a variable or variables should always be retained in the regression model, since estimation of their regression coefficients is of interest irrespective of whether or not they are statistically significant. Other variables may be included only because they may influence the response variable and, if so, it is important to correct for their effects, but otherwise they may be excluded. Other variables might be created to represent an interactive effect between two other explanatory variables (for example, the w_i variables in (11.64)), and these variables must be assessed before the main effects of the variables contributing to the interaction may be considered. All of these considerations imply that the best regression is not an unambiguous concept that could be obtained by a competent analyst without consideration of what the variables represent. Rather, determination of the best regression requires considerable input based on the purpose of the analysis, the nature of the possible explanatory variables and existing knowledge of the subject-matter. For example, the effect of smoking on a respiratory variable such as FEV may increase with years of smoking, and hence with age, so that there is an interactive effect of age and smoking on FEV. This would be included by creating an interaction variable defined as the product of the age and smoking variables.

Automatic selection procedures

A number of procedures have been developed whereby the computer selects the 'best' subset of predictor variables, the criterion of optimality being somewhat arbitrary. There are four main approaches.

1 Step-up (forward-entry) procedure. The computer first tries all the p simple regressions with just one predictor variable, choosing that which provides the highest Regression SSq. Retaining this variable as the first choice, it now tries all the p − 1 two-variable regressions obtained by the various possibilities for the second variable, choosing that which adds the largest increment to the Regression SSq. The process continues, all variables chosen at any stage being retained at subsequent stages. The process stops when the increments to the Regression SSq cease to be (in some sense) large in comparison with the Residual SSq.

2 Step-down (backward-elimination) procedure. The computer first does the regression on all predictor variables. It then eliminates the least significant and does a regression on the remaining p − 1 variables. The process stops when all the retained regression coefficients are (in some sense) significant.

3 Stepwise procedure. This is an elaboration of the step-up procedure (1), but allowing elimination, as in the step-down procedure (2). After each change in the set of variables included in the regression, the contribution of each variable is assessed and, if the least significant makes insufficient contribution, by some criterion, it is eliminated. It is thus possible for a variable included at some stage to be eliminated at a later stage because other variables, introduced since it was included, have made it unnecessary. The criterion for inclusion and elimination of variables could be, for example, that a variable will be included if its partial regression coefficient is significant at the 0.05 level and eliminated if its partial regression coefficient fails to be significant at the 0.1 level.

4 Best-subset selection procedure. Methods 1, 2 and 3 do not necessarily reach the same final choice, even if they end with the same number of retained variables. None will necessarily choose the best possible regression (i.e. that with the largest Regression SSq) for any given number of predictor variables. Computer algorithms are available for selecting the best subset of variables, where `best' may be defined as the regression with the largest adjusted R² (11.46) or the related Mallows Cp statistic (see Draper & Smith, 1998, §15.1).

All these methods of model selection are available in the SAS program PROC REG under the SELECTION option; method 4 requires much more computer time than the others and may not be feasible for more than a few explanatory variables.
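To make the logic of the step-up procedure concrete, here is a rough sketch written directly in Python/NumPy rather than via PROC REG; the F-to-enter threshold of 4.0 is an arbitrary illustrative choice, not a recommendation.

```python
# Sketch of the step-up (forward-entry) procedure: at each stage the
# variable giving the largest increment to the Regression SSq is added,
# and the process stops when that increment is small relative to the
# Residual mean square (judged here by a crude F-to-enter rule).
import numpy as np

def regression_ssq(y, x_subset):
    """Regression SSq about the mean for a model with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), x_subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    return np.sum((fitted - y.mean()) ** 2)

def forward_selection(y, X, f_to_enter=4.0):
    n, p = X.shape
    total_ssq = np.sum((y - y.mean()) ** 2)
    selected, remaining = [], list(range(p))
    current_ssq = 0.0
    while remaining:
        # increment to the Regression SSq from adding each candidate
        increments = {j: regression_ssq(y, X[:, selected + [j]]) - current_ssq
                      for j in remaining}
        best = max(increments, key=increments.get)
        new_ssq = current_ssq + increments[best]
        resid_df = n - len(selected) - 2            # df after adding `best`
        resid_ms = (total_ssq - new_ssq) / resid_df
        if increments[best] / resid_ms < f_to_enter:
            break                                   # increment no longer 'large'
        selected.append(best)
        remaining.remove(best)
        current_ssq = new_ssq
    return selected

# selected = forward_selection(y, X)   # indices of the retained columns
```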

None of these methods provides infallible tactics in the difficult problem of selecting predictor variables. As discussed earlier in this section, sometimes certain variables should be retained even though they have non-significant effects, because of their logical importance in the particular problem. Sometimes logical relationships between some of the variables suggest that a particular one should be retained in preference to another. In some cases a set of variables has to be included or eliminated as a group, rather than individually: for example, a set of dummy variables representing a single characteristic (§11.7). In other cases some variables can only logically be included if one or more other variables are also included; for example, an interaction term should only be included in the presence of the corresponding main effects (§11.8). Many statistical computing packages do not allow the specification of all of these constraints and a sequence of applications may be necessary to produce a legitimate model. Nevertheless, automatic selection is often a useful exploratory device, even when the selected set of variables has to be modified on common-sense grounds.

Collinearity

We mentioned on p. 344 that difficulties may arise if some of the explanatory variables are highly correlated. This situation is known as collinearity or multicollinearity. More generally, collinearity arises if there is an almost linear relationship between some of the explanatory variables in the regression. In this case, large changes in one explanatory variable can be effectively compensated for by large changes in other variables, so that very different sets of regression coefficients provide very nearly the same residual sum of squares. This leads to the consequences that the regression coefficients will have large standard errors and, in extreme cases, the computations, even with the precision achieved by computers, may make nonsense of the analysis. The regression coefficients may be numerically quite implausible and perhaps largely useless.


Collinearity is a feature of the explanatory variables independent of the values of the dependent variable. Often collinearity can be recognized from the correlation matrix of the explanatory variables. If two variables are highly correlated then this implies collinearity between those two variables. On the other hand, it is possible for collinearity to occur between a set of three or more variables without any of the correlations between these variables being particularly large, so it is useful to have a more formal check in the regression calculations. A measure of the collinearity between xi and the other explanatory variables is provided by the proportion of the variability in xi that is explained by the other variables, Ri², when xi is the dependent variable in a regression on all the other xs. The variance of bi, in the regression of y on x1 to xp, is proportional to the reciprocal of 1 − Ri² (see Wetherill et al., 1986, §4.3), so values of Ri² close to 1 will lead to an increased variance for the estimate of bi. The variance inflation factor (VIF) for variable i is defined as

VIFi = 1/(1 − Ri²).
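A sketch of this calculation (Python/NumPy; X is a hypothetical n by p array holding the explanatory variables) is given below. Packaged routines, such as the variance_inflation_factor function in the Python statsmodels library, return the same quantity.

```python
# Sketch: VIF for each explanatory variable, computed directly from the
# R² of that variable regressed on all the others (an intercept is
# included in each auxiliary regression).
import numpy as np

def variance_inflation_factors(X):
    n, p = X.shape
    vifs = np.empty(p)
    for i in range(p):
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ beta
        r2_i = 1.0 - np.sum(resid ** 2) / np.sum((xi - xi.mean()) ** 2)
        vifs[i] = 1.0 / (1.0 - r2_i)
    return vifs

# print(variance_inflation_factors(X))   # values near 1 indicate little collinearity
```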

The problems of collinearity may be overcome in several ways. In some situations the collinearity has arisen purely as a computational problem and may be solved by alternative definitions of some of the variables. For example, if both x and x² are included as explanatory variables and all the values of x are positive, then x and x² are likely to be highly correlated. This can be overcome by redefining the quadratic term as (x − x̄)², which will reduce the correlation whilst leading to an equivalent regression. This device is called centring. When the collinearity is purely computational, values of the VIF much greater than 10 can be accepted before there are any problems of computational accuracy, using a modern regression package, such as the SAS program. This is demonstrated in the following example.

Example 11.8

In Example 11.6 when the analysis is carried out using the model defined by (11.64), extended to three groups, then there are high correlations between wij, zi and xj. For the full model with eight variables fitted, the values of the VIF for the zi were both greater than 140, and two of the wij had similar VIFs, whilst the other two had VIFs of 86 and 92. Thus, six of the eight VIFs were particularly large and this might cause some concern about collinearity. These high VIFs can be avoided by centring x1 and x2, when the maximum VIF was 6.8. However, the original analysis lost no important precision, using SAS regression PROC REG, so that the collinearity was of no practical concern.
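The effect of centring is easily reproduced. The sketch below (Python/NumPy, with invented data rather than the data of Example 11.6) shows the high correlation between x and x², its reduction after centring, and the equivalence of the two fitted regressions.

```python
# Sketch illustrating centring: with all x positive, x and x² are highly
# correlated, but x and (x - x̄)² are not, and the two quadratic models
# give identical fitted values (the data are invented).
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(40, 80, size=n)                 # strictly positive values
xc = x - x.mean()                               # centred version
y = 2.0 + 0.1 * x + 0.01 * x ** 2 + rng.normal(0.0, 1.0, size=n)

print(np.corrcoef(x, x ** 2)[0, 1])             # close to 1
print(np.corrcoef(x, xc ** 2)[0, 1])            # much smaller

X_raw = np.column_stack([np.ones(n), x, x ** 2])
X_cen = np.column_stack([np.ones(n), x, xc ** 2])
b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_cen, *_ = np.linalg.lstsq(X_cen, y, rcond=None)
print(np.allclose(X_raw @ b_raw, X_cen @ b_cen))  # True: equivalent regressions
```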

In other situations the correlation is an intrinsic feature of the variables; for example, if both diastolic and systolic blood pressure were included, then a high correlation between these two measures may lead to a collinearity problem. The appropriate action is to use only one of the measures or possibly replace them by the mean of the pair. This leads to no real loss because the high correlation effectively means that only one measure is needed to use almost all the information.

In most situations the reasons for collinearity will be readily identified from the variance inflation factors and the nature of the variables, but where this is not the case more complex methods based on the principal components (§13.2) of the explanatory variables may be used (see Draper & Smith, 1998, §§16.4–16.5, or Kleinbaum et al., 1998, §12.5.2).

Another approach, which we shall not pursue, is to use ridge regression, a technique which tends to give more stable estimates of regression coefficients, usually closer to zero, than the least squares estimates. For a fuller discussion, see Draper and Smith (1998, Chapter 17). As they point out, this is an entirely reasonable approach from a Bayesian standpoint if one takes the view that numerically large coefficients are intrinsically implausible, and prior knowledge, or belief, on the distribution of the regression coefficients may be formally incorporated into the method. In other circumstances the method might be inappropriate and we agree with Draper and Smith's advice against its indiscriminate use.
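For completeness, a minimal sketch of the ridge estimator in its usual closed form, b(k) = (X'X + kI)⁻¹X'y applied to standardized predictors, is given below (Python/NumPy; X, y and the values of the ridge constant k are hypothetical and purely illustrative).

```python
# Sketch of ridge regression via its closed form; k = 0 recovers ordinary
# least squares and increasing k shrinks the coefficients towards zero.
import numpy as np

def ridge_coefficients(X, y, k):
    """Ridge estimates for standardized predictors and centred response."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # assumes no constant column in X
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ yc)

# for k in (0.0, 1.0, 10.0):
#     print(k, ridge_coefficients(X, y, k))
```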



The SSq for (iii) is obtained by subtraction, (i) − (ii), and the adequacy of the model is tested by the variance ratio from lines (iii) and (iv).

In general, the above approach will not be feasible since the observations will not fall into groups of replicates.
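Where replicate groups are available, the check referred to above amounts to the usual partition of the Residual SSq into pure-error and lack-of-fit components. The sketch below (Python/NumPy, with hypothetical y, X and group labels) follows this standard construction, although its labelling need not correspond exactly to the lines (i) to (iv) referred to above.

```python
# Sketch of the pure-error/lack-of-fit partition of the Residual SSq when
# replicate observations are available. `groups` holds the replicate-group
# label of each observation; the F ratio tests the adequacy of the model.
import numpy as np

def lack_of_fit_test(y, X, groups):
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid_ssq = np.sum((y - design @ beta) ** 2)      # residual about the model
    resid_df = n - p - 1

    labels = np.unique(groups)
    pure_ssq = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                   for g in labels)                   # within replicate groups
    pure_df = n - len(labels)

    lof_ssq = resid_ssq - pure_ssq                    # lack of fit, by subtraction
    lof_df = resid_df - pure_df
    f_ratio = (lof_ssq / lof_df) / (pure_ssq / pure_df)
    return f_ratio, lof_df, pure_df
```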

Residual plots

Much information may be gained by graphical study of the residuals, y − Y. These values, and the predicted values Y, are often printed in computer output. The values for the data in Example 11.5 are shown in Table 11.6. We now describe some potentially useful scatter diagrams involving the residuals, illustrating these by Fig. 11.9, which relates to Example 11.5.
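The plots described below require only the fitted values Y and the residuals y − Y. A minimal sketch of how they might be produced outside a regression package is given here (Python with NumPy and Matplotlib; y, X and the column indices j and h are hypothetical placeholders).

```python
# Sketch: residuals and fitted values from a least squares fit, and the
# three kinds of residual plot described below.
import numpy as np
import matplotlib.pyplot as plt

def fitted_and_residuals(y, X):
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    return fitted, y - fitted

# fitted, resid = fitted_and_residuals(y, X)
# plt.scatter(fitted, resid)                 # 1: residuals against Y
# plt.scatter(X[:, j], resid)                # 2: residuals against a predictor xj
# plt.scatter(X[:, j] * X[:, h], resid)      # 3: residuals against the product xjxh
# plt.axhline(0.0, linestyle="--")
# plt.show()
```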

1 Plot of y − Y against Y (Fig. 11.9a). The residuals are always uncorrelated with the predicted value. Nevertheless, the scatter diagram may provide some useful pointers. The distribution of the residuals may be markedly non-normal; in Fig. 11.9(a) there is some suggestion of positive skewness. This positive skewness and other departures from normality can also be detected by constructing a histogram of the residuals or a normal plot (see p. 371; Example 11.10 is for the data of Example 11.5). The variability of the residuals may not be constant; in Fig. 11.9(a) it seems to increase as Y increases. Both these deficiencies may sometimes be remedied by transformation of the y variable and reanalysis (see §10.8). There is some indication in this example that it would be appropriate to repeat the analysis after a logarithmic transformation (§2.5) of y and this is explored in §11.10. In some cases the trend in variability may call for a weighted analysis (§11.6). Even though the correlation is zero there may be a marked non-linear trend; if so, it is likely to appear also in the plots of type 2 below.

2 Plot of y − Y against xj (Fig. 11.9b,c). The residuals may be plotted against the values of any or all of the predictor variables. Again, the correlation will always be zero. There may, however, be a non-linear trend: for example, with the residuals tending to rise to a maximum somewhere near the mean, x̄j, and falling away on either side, as is perhaps suggested in Fig. 11.9(c); or showing a trend with a minimum value near x̄j. Such trends suggest that the effect of xj is not adequately expressed by the linear term in the model. The simplest suggestion would be to add a term involving the square of xj as an extra predictor variable. This so-called quadratic regression is described in §12.1.

3 Plot of y − Y against the product xjxh (Fig. 11.9d). The model (11.38) postulates no interaction between the xs, in the sense of §9.3. That is, the effect of changing one predictor variable is independent of the values taken by any other. This would not be so if a term xjxh were introduced into the model. If such a term is needed, but has been omitted, the residuals will tend to be correlated with the values of the product xjxh.
