CHAPTER 6
Simple and Multiple Linear Regression: How Old is the Universe and Cloud Seeding
6.1 Introduction
Freedman et al. (2001) give the relative velocity and the distance of 24 galaxies, according to measurements made using the Hubble Space Telescope; the data are contained in the gamair package accompanying Wood (2006), see Table 6.1. Velocities are assessed by measuring the Doppler red shift in the spectrum of light observed from the galaxies concerned, although some correction for 'local' velocity components is required. Distances are measured using the known relationship between the period of Cepheid variable stars and their luminosity. How can these data be used to estimate the age of the universe? Here we shall show how this can be done using simple linear regression.
Table 6.1: hubble data. Distance and velocity for 24 galaxies (columns: galaxy, velocity, distance).
Source: From Freedman, W. L., et al., The Astrophysical Journal, 553, 47–72, 2001. With permission.
Table 6.2: clouds data. Cloud seeding experiments in Florida; see below for explanations of the variables (columns: seeding, time, sne, cloudcover, prewetness, echomotion, rainfall).
Weather modification, or cloud seeding, is the treatment of individual clouds or storm systems with various inorganic and organic materials in the hope of achieving an increase in rainfall. Introduction of such material into a cloud that contains supercooled water, that is, liquid water colder than zero degrees Celsius, has the aim of inducing freezing, with the consequent ice particles growing at the expense of liquid droplets and becoming heavy enough to fall as rain from clouds that otherwise would produce none.
The data shown in Table 6.2 were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide (100 to 1000 grams per cloud) in cloud seeding to increase rainfall (Woodley et al., 1977). In the experiment, which was conducted in an area of Florida, 24 days were judged suitable for seeding on the basis that a measured suitability criterion, denoted S-Ne, was not less than 1.5. Here S is the 'seedability', the difference between the maximum height of a cloud if seeded and the same cloud if not seeded predicted by a suitable cloud model, and Ne is the number of hours between 1300 and 1600 G.M.T. with 10 centimetre echoes in the target; this quantity biases the decision for experimentation against naturally rainy days. Consequently, optimal days for seeding are those on which seedability is large and the natural rainfall early in the day is small.
On suitable days, a decision was taken at random as to whether to seed or not. For each day the following variables were measured:

seeding: a factor indicating whether seeding action occurred (yes or no),
time: number of days after the first day of the experiment,
cloudcover: the percentage cloud cover in the experimental area, measured using radar,
prewetness: the total rainfall in the target area one hour before seeding (in cubic metres × 10^7),
echomotion: a factor showing whether the radar echo was moving or stationary,
rainfall: the amount of rain in cubic metres × 10^7,
sne: suitability criterion S-Ne, see above.
The objective in analysing these data is to see how rainfall is related to the explanatory variables and, in particular, to determine the effectiveness of seeding. The method to be used is multiple linear regression.
6.2 Simple Linear Regression
Assume $y_i$ represents the value of what is generally known as the response variable for the $i$th individual and that $x_i$ represents the individual's value on what is most often called an explanatory variable. The simple linear regression model is

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

where $\beta_0$ is the intercept and $\beta_1$ is the slope of the linear relationship assumed between the response and explanatory variables, and $\varepsilon_i$ is an error term. (The 'simple' here means that the model contains only a single explanatory variable; we shall deal with the situation where there are several explanatory variables in the next section.) The error terms are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$.
The regression coefficients, $\beta_0$ and $\beta_1$, may be estimated as $\hat\beta_0$ and $\hat\beta_1$ using least squares estimation, in which the sum of squared differences between the observed values of the response variable $y_i$ and the values 'predicted' by the regression equation $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ is minimised, leading to the estimates

\[
\hat\beta_1 = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x,
\]

where $\bar y$ and $\bar x$ are the means of the response and explanatory variable, respectively.
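These estimates are easy to compute directly; the following is a minimal sketch with made-up data (any numeric vectors x and y would do):

R> x <- c(1, 2, 3, 4, 5)
R> y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
R> b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
R> b0 <- mean(y) - b1 * mean(x)
R> c(b0, b1)   # agrees with coef(lm(y ~ x))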
The predicted values of the response variable $y$ from the model are $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The variance $\sigma^2$ of the error terms is estimated as

\[
\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

The estimated variance of the estimate of the slope parameter is

\[
\mathrm{Var}(\hat\beta_1) = \frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2},
\]
whereas the estimated variance of a predicted value $y_{\mathrm{pred}}$ at a given value of $x$, say $x_0$, is

\[
\mathrm{Var}(y_{\mathrm{pred}}) = \hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right).
\]
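This is the variance that underlies the prediction intervals reported by R's predict method for linear models; a minimal sketch, reusing the made-up data from above:

R> d <- data.frame(x = c(1, 2, 3, 4, 5),
+      y = c(2.1, 3.9, 6.2, 7.8, 10.1))
R> fit <- lm(y ~ x, data = d)
R> predict(fit, newdata = data.frame(x = 6), interval = "prediction")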
In some applications of simple linear regression a model without an intercept is required (when the data are such that the line must go through the origin), i.e., a model of the form

\[
y_i = \beta_1 x_i + \varepsilon_i.
\]

In this case application of least squares gives the following estimator for $\beta_1$:
\[
\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. \tag{6.1}
\]
6.3 Multiple Linear Regression
Assume $y_i$ represents the value of the response variable for the $i$th individual, and that $x_{i1}, x_{i2}, \dots, x_{iq}$ represent the individual's values on $q$ explanatory variables, with $i = 1, \dots, n$. The multiple linear regression model is given by

\[
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_q x_{iq} + \varepsilon_i.
\]

The error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. Consequently, the distribution of the random response variable, $y$, is also normal with expected value given by the linear combination of the explanatory variables

\[
\mathsf{E}(y \mid x_1, \dots, x_q) = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q
\]

and with variance $\sigma^2$.
The parameters of the model, $\beta_k$, $k = 1, \dots, q$, are known as regression coefficients, with $\beta_0$ corresponding to the overall mean. The regression coefficients represent the expected change in the response variable associated with a unit change in the corresponding explanatory variable, when the remaining explanatory variables are held constant. The 'linear' in multiple linear regression applies to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of a response variable is modelled in terms of quadratic functions of some of the explanatory variables would be included in this class of models.
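In R such models are still specified with lm(); for instance, a sketch with hypothetical variable names, regressing a log-transformed response on a quadratic function of one explanatory variable:

R> fit <- lm(log(y) ~ x1 + I(x1^2) + x2, data = d)

The model is non-linear in x1 but linear in the parameters, so it remains a linear model.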
The multiple linear regression model can be written most conveniently for all $n$ individuals by using matrices and vectors as $y = X\beta + \varepsilon$, where $y^\top = (y_1, \dots, y_n)$ is the vector of response variables, $\beta^\top = (\beta_0, \beta_1, \dots, \beta_q)$ is the vector of regression coefficients, and $\varepsilon^\top = (\varepsilon_1, \dots, \varepsilon_n)$ are the error terms. The design or model matrix $X$ consists of the $q$ continuously measured explanatory variables and a column of ones corresponding to the intercept term:

\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1q} \\
1 & x_{21} & x_{22} & \cdots & x_{2q} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nq}
\end{pmatrix}.
\]
In case one or more of the explanatory variables are nominal or ordinal variables, they are represented by a zero-one dummy coding. Assume that $x_1$ is a factor at $m$ levels; the submatrix of $X$ corresponding to $x_1$ is then an $n \times m$ matrix of zeros and ones, where the $j$th element in the $i$th row is one when $x_{i1}$ is at the $j$th level.
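As a small illustration with a hypothetical three-level factor, the full zero-one dummy coding can be inspected via model.matrix:

R> f <- factor(c("low", "high", "mid", "low"))
R> model.matrix(~ f - 1)   # n x m matrix of zeros and ones, one column per level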
Assuming that the cross-product $X^\top X$ is non-singular, i.e., can be inverted, the least squares estimator of the parameter vector $\beta$ is unique and can be calculated as $\hat\beta = (X^\top X)^{-1} X^\top y$. The expectation and covariance of this estimator $\hat\beta$ are given by $\mathsf{E}(\hat\beta) = \beta$ and $\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}$. The diagonal elements of the covariance matrix $\mathrm{Var}(\hat\beta)$ give the variances of $\hat\beta_j$, $j = 0, \dots, q$, whereas the off-diagonal elements give the covariances between pairs of $\hat\beta_j$ and $\hat\beta_k$. The square roots of the diagonal elements of the covariance matrix are thus the standard errors of the estimates $\hat\beta_j$.
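These formulas translate directly into R; a minimal sketch, assuming a design matrix X (including the column of ones) and a response vector y are available. In practice lm() is preferable, as it uses a numerically more stable QR decomposition:

R> betahat <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
R> rss <- sum((y - X %*% betahat)^2)                # residual sum of squares
R> sigma2hat <- rss / (nrow(X) - ncol(X))           # estimate of sigma^2
R> se <- sqrt(diag(sigma2hat * solve(t(X) %*% X)))  # standard errors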
If the cross-product $X^\top X$ is singular, we need to reformulate the model to $y = XC\beta^\star + \varepsilon$ such that $X^\star = XC$ has full rank. The matrix $C$ is called the contrast matrix in S and R, and the result of the model fit is an estimate $\hat\beta^\star$. By default, a contrast matrix derived from treatment contrasts is used. For the theoretical details we refer to Searle (1971); the implementation of contrasts in S and R is discussed by Chambers and Hastie (1992) and Venables and Ripley (2002).
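For example, the treatment contrast matrix for a factor with three levels can be inspected directly:

R> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1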
The regression analysis can be assessed using the following analysis of variance table (Table 6.3):

Table 6.3: Analysis of variance table for the multiple linear regression model.

Source of variation    Sum of squares                          Degrees of freedom
Regression             $\sum_{i=1}^n (\hat y_i - \bar y)^2$    $q$
Residual               $\sum_{i=1}^n (y_i - \hat y_i)^2$       $n - q - 1$
Total                  $\sum_{i=1}^n (y_i - \bar y)^2$         $n - 1$

where $\hat y_i$ is the predicted value of the response variable for the $i$th individual, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_q x_{iq}$, and $\bar y = \sum_{i=1}^n y_i / n$ is the mean of the response variable.
The mean square ratio

\[
F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / q}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n - q - 1)}
\]

provides an $F$-test of the general hypothesis

\[
H_0: \beta_1 = \cdots = \beta_q = 0.
\]

Under $H_0$, the test statistic $F$ has an $F$-distribution with $q$ and $n - q - 1$ degrees of freedom. An estimate of the variance $\sigma^2$ is

\[
\hat\sigma^2 = \frac{1}{n - q - 1} \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

The correlation between the observed values $y_i$ and the fitted values $\hat y_i$ is known as the multiple correlation coefficient. Individual regression coefficients can be assessed by using the ratio t-statistics $t_j = \hat\beta_j / \sqrt{\mathrm{Var}(\hat\beta)_{jj}}$, although these ratios should be used only as rough guides to the 'significance' of the coefficients. The problem of selecting the 'best' subset of variables to be included in a model is one of the most delicate ones in statistics, and we refer to Miller (2002) for the theoretical details and practical limitations (and see Exercise 6.4).
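The F statistic and variance estimate above can be reproduced directly from a fitted model; a minimal sketch, assuming a fitted lm object fit with an intercept and q numeric explanatory variables:

R> y <- model.response(model.frame(fit))
R> yhat <- fitted(fit)
R> n <- length(y)
R> q <- length(coef(fit)) - 1    # number of explanatory variables
R> Fstat <- (sum((yhat - mean(y))^2) / q) /
+      (sum((y - yhat)^2) / (n - q - 1))
R> pf(Fstat, q, n - q - 1, lower.tail = FALSE)   # p-value of the F-test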
6.3.1 Regression Diagnostics
The possible influence of outliers and the checking of assumptions made in fitting the multiple regression model, i.e., constant variance and normality of error terms, can both be undertaken using a variety of diagnostic tools, of which the simplest and most well known are the estimated residuals, i.e., the differences between the observed values of the response and the fitted values of the response. In essence these residuals estimate the error terms in the simple and multiple linear regression model. So, after estimation, the next stage in the analysis should be an examination of such residuals from fitting the chosen model to check on the normality and constant variance assumptions and to identify outliers. The most useful plots of these residuals are:
• A plot of residuals against each explanatory variable in the model. The presence of a non-linear relationship, for example, may suggest that a higher-order term in the explanatory variable should be considered.
• A plot of residuals against fitted values. If the variance of the residuals appears to increase with predicted value, a transformation of the response variable may be in order.
• A normal probability plot of the residuals. After all the systematic variation has been removed from the data, the residuals should look like a sample from a standard normal distribution. A plot of the ordered residuals against the expected order statistics from a normal distribution provides a graphical check of this assumption.
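Each of these plots is straightforward to produce for a fitted lm object; a minimal sketch, where fit is an assumed fitted model and d$x an assumed explanatory variable:

R> plot(d$x, residuals(fit))            # residuals vs an explanatory variable
R> plot(fitted(fit), residuals(fit))    # residuals vs fitted values
R> qqnorm(residuals(fit))               # normal probability plot
R> qqline(residuals(fit))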
6.4 Analysis Using R
6.4.1 Estimating the Age of the Universe
Prior to applying a simple regression to the data it will be useful to look at a plot to assess their major features. The R code given in Figure 6.1 produces a scatterplot of velocity and distance. The diagram shows a clear, strong relationship between velocity and distance. The next step is to fit a simple linear regression model to the data, but in this case the nature of the data requires a model without intercept because if distance is zero so is relative speed. So the model to be fitted to these data is

velocity = $\beta_1$ distance + $\varepsilon$.

This is essentially what astronomers call Hubble's Law, and $\beta_1$ is known as Hubble's constant; $\beta_1^{-1}$ gives an approximate age of the universe.
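The hubble data can be loaded from the gamair package mentioned in Section 6.1 (assuming the package is installed):

R> data("hubble", package = "gamair")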
To fit this model we need to estimate $\beta_1$ using formula (6.1). Although this operation is rather easy to carry out directly
R> sum(hubble$distance * hubble$velocity) /
+ sum(hubble$distance^2)
[1] 76.58117
it is more convenient to apply R’s linear modelling function
R> plot(velocity ~ distance, data = hubble)

Figure 6.1 Scatterplot of velocity and distance.
R> hmod <- lm(velocity ~ distance - 1, data = hubble)
Note that the model formula specifies a model without intercept. We can now extract the estimated model coefficients via
R> coef(hmod)
distance
76.58117
and add this estimated regression line to the scatterplot; the result is shown in Figure 6.2. In addition, we produce a scatterplot of the residuals $y_i - \hat y_i$ against the fitted values $\hat y_i$ to assess the quality of the model fit. It seems that for higher distance values the variance of velocity increases; however, we are interested only in the estimated parameter $\hat\beta_1$, which remains valid under variance heterogeneity (in contrast to t-tests and associated p-values).
R> layout(matrix(1:2, ncol = 2))
R> plot(velocity ~ distance, data = hubble)
R> abline(hmod)
R> plot(hmod, which = 1)
Figure 6.2 Scatterplot of velocity and distance with estimated regression line (left) and plot of residuals against fitted values (right).
Now we can use the estimated value of $\beta_1$ to find an approximate value for the age of the universe. The Hubble constant itself has units of km $\times$ sec$^{-1}$ $\times$ Mpc$^{-1}$. A mega-parsec (Mpc) is $3.09 \times 10^{19}$ km, so we need to divide the estimated value of $\beta_1$ by this amount in order to obtain Hubble's constant with units of sec$^{-1}$. The approximate age of the universe in seconds will then be the inverse of this calculation. Carrying out the necessary computations

R> Mpc <- 3.09 * 10^19
R> ysec <- 60^2 * 24 * 365.25
R> Mpcyear <- Mpc / ysec
R> 1 / (coef(hmod) / Mpcyear)
distance
12785935335
gives an estimated age of roughly 12.8 billion years.
6.4.2 Cloud Seeding
Again, a graphical display highlighting the most important aspects of the data will be helpful. Here we will construct boxplots of the rainfall in each category of the dichotomous explanatory variables and scatterplots of rainfall against each of the continuous explanatory variables.
Both the boxplots (Figure 6.3) and the scatterplots (Figure 6.4) show some evidence of outliers. The row names of the extreme observations in the clouds data.frame can be identified via

R> rownames(clouds)[clouds$rainfall %in% c(bxpseeding$out,
+      bxpecho$out)]

[1] "1" "15"
where bxpseeding and bxpecho are variables created by boxplot in Figure 6.3. We shall not remove these observations for now, but bear in mind during the modelling process that they may cause problems.
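Figure 6.3 itself is not reproduced here; a minimal sketch of how the two boxplot objects might have been created (the axis labels are assumptions):

R> bxpseeding <- boxplot(rainfall ~ seeding, data = clouds,
+      ylab = "Rainfall", xlab = "Seeding")
R> bxpecho <- boxplot(rainfall ~ echomotion, data = clouds,
+      ylab = "Rainfall", xlab = "Echo Motion")

boxplot invisibly returns a list whose out component contains the outlier values used above.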
In this example it is sensible to assume that the effect of some of the other explanatory variables is modified by seeding, and therefore to consider a model that includes seeding as a covariate and, furthermore, allows interaction terms for seeding with each of the covariates except time. This model can be described by the formula

R> clouds_formula <- rainfall ~ seeding +
+      seeding:(sne + cloudcover + prewetness + echomotion) +
+      time
and the design matrix $X^\star$ can be computed via
R> Xstar <- model.matrix(clouds_formula, data = clouds)
By default, treatment contrasts have been applied to the dummy codings of the factors seeding and echomotion, as can be seen from an inspection of the contrasts attribute of the model matrix:
R> attr(Xstar, "contrasts")
$seeding
[1] "contr.treatment"
$echomotion
[1] "contr.treatment"
The default contrasts can be changed via the contrasts.arg argument to model.matrix or the contrasts argument to the fitting function, for example lm or aov, as shown in Chapter 5.
However, such internals are hidden and performed by high-level model-fitting functions such as lm, which will be used to fit the linear model defined by the formula clouds_formula:
R> clouds_lm <- lm(clouds_formula, data = clouds)
R> class(clouds_lm)
[1] "lm"
The result of the model fit is an object of class lm for which a summary method showing the conventional regression analysis output is available.