CHAPTER 6
Simple and Multiple Linear Regression: How Old is the Universe and Cloud Seeding
6.1 Introduction
Freedman et al. (2001) give the relative velocity and the distance of 24 galaxies, according to measurements made using the Hubble Space Telescope; the data are contained in the gamair package accompanying Wood (2006), see Table 6.1. Velocities are assessed by measuring the Doppler red shift in the spectrum of light observed from the galaxies concerned, although some correction for 'local' velocity components is required. Distances are measured using the known relationship between the period of Cepheid variable stars and their luminosity. How can these data be used to estimate the age of the universe? Here we shall show how this can be done using simple linear regression.
Table 6.1: hubble data. Distance and velocity for 24 galaxies (columns: galaxy, velocity, distance).
Source: From Freedman, W. L., et al., The Astrophysical Journal, 553, 47–72, 2001. With permission.
Table 6.2: clouds data. Cloud seeding experiments in Florida; see below for explanations of the variables (columns: seeding, time, sne, cloudcover, prewetness, echomotion, rainfall).
Weather modification, or cloud seeding, is the treatment of individual clouds or storm systems with various inorganic and organic materials in the hope of achieving an increase in rainfall. Introduction of such material into a cloud that contains supercooled water, that is, liquid water colder than zero degrees Celsius, has the aim of inducing freezing, with the consequent ice particles growing at the expense of liquid droplets and becoming heavy enough to fall as rain from clouds that otherwise would produce none.
The data shown in Table 6.2 were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide (100 to 1000 grams per cloud) in cloud seeding to increase rainfall (Woodley et al., 1977). In the experiment, which was conducted in an area of Florida, 24 days were judged suitable for seeding on the basis that a measured suitability criterion, denoted S-Ne, was not less than 1.5. Here S is the 'seedability', the difference between the maximum height of a cloud if seeded and the same cloud if not seeded predicted by a suitable cloud model, and Ne is the number of hours between 1300 and 1600 G.M.T. with 10 centimetre echoes in the target; this quantity biases the decision for experimentation against naturally rainy days. Consequently, optimal days for seeding are those on which seedability is large and the natural rainfall early in the day is small.
On suitable days, a decision was taken at random as to whether to seed or not. For each day the following variables were measured:

seeding: a factor indicating whether seeding action occurred (yes or no),
time: number of days after the first day of the experiment,
cloudcover: the percentage cloud cover in the experimental area, measured using radar,
prewetness: the total rainfall in the target area one hour before seeding (in cubic metres × 10^7),
echomotion: a factor showing whether the radar echo was moving or stationary,
rainfall: the amount of rain in cubic metres × 10^7,
sne: suitability criterion S-Ne, see above.
The objective in analysing these data is to see how rainfall is related to the explanatory variables and, in particular, to determine the effectiveness of seeding. The method to be used is multiple linear regression.
6.2 Simple Linear Regression
Assume $y_i$ represents the value of what is generally known as the response variable for the $i$th individual and that $x_i$ represents the individual's value on what is most often called an explanatory variable. The simple linear regression model is

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

where $\beta_0$ is the intercept and $\beta_1$ is the slope of the linear relationship assumed between the response and explanatory variables, and $\varepsilon_i$ is an error term. (The 'simple' here means that the model contains only a single explanatory variable; we shall deal with the situation where there are several explanatory variables in the next section.) The error terms are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$.
The regression coefficients, $\beta_0$ and $\beta_1$, may be estimated as $\hat\beta_0$ and $\hat\beta_1$ using least squares estimation, in which the sum of squared differences between the observed values of the response variable $y_i$ and the values 'predicted' by the regression equation $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ is minimised, leading to the estimates

\[
\hat\beta_1 = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x,
\]

where $\bar y$ and $\bar x$ are the means of the response and explanatory variable, respectively.
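These estimates are easy to compute directly; the following is a minimal sketch with made-up data (any numeric vectors x and y would do):

R> x <- c(1, 2, 3, 4, 5)
R> y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
R> b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
R> b0 <- mean(y) - b1 * mean(x)
R> c(b0, b1)   # agrees with coef(lm(y ~ x))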
The predicted values of the response variable $y$ from the model are $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The variance $\sigma^2$ of the error terms is estimated as

\[
\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

The estimated variance of the estimate of the slope parameter is

\[
\mathrm{Var}(\hat\beta_1) = \frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2},
\]
whereas the estimated variance of a predicted value $y_{\mathrm{pred}}$ at a given value of $x$, say $x_0$, is

\[
\mathrm{Var}(y_{\mathrm{pred}}) = \hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right).
\]
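This is the variance that underlies the prediction intervals reported by R's predict method for linear models; a minimal sketch, reusing the made-up data from above:

R> d <- data.frame(x = c(1, 2, 3, 4, 5),
+      y = c(2.1, 3.9, 6.2, 7.8, 10.1))
R> fit <- lm(y ~ x, data = d)
R> predict(fit, newdata = data.frame(x = 6), interval = "prediction")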
In some applications of simple linear regression a model without an intercept is required (when the data are such that the line must go through the origin), i.e., a model of the form

\[
y_i = \beta_1 x_i + \varepsilon_i.
\]

In this case application of least squares gives the following estimator for $\beta_1$:
\[
\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. \tag{6.1}
\]
6.3 Multiple Linear Regression
Assume $y_i$ represents the value of the response variable for the $i$th individual, and that $x_{i1}, x_{i2}, \dots, x_{iq}$ represent the individual's values on $q$ explanatory variables, with $i = 1, \dots, n$. The multiple linear regression model is given by

\[
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_q x_{iq} + \varepsilon_i.
\]

The error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. Consequently, the distribution of the random response variable, $y$, is also normal with expected value given by the linear combination of the explanatory variables

\[
\mathsf{E}(y \mid x_1, \dots, x_q) = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q
\]

and with variance $\sigma^2$.
The parameters of the model, $\beta_k$, $k = 1, \dots, q$, are known as regression coefficients, with $\beta_0$ corresponding to the overall mean. The regression coefficients represent the expected change in the response variable associated with a unit change in the corresponding explanatory variable, when the remaining explanatory variables are held constant. The 'linear' in multiple linear regression applies to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of a response variable is modelled in terms of quadratic functions of some of the explanatory variables would be included in this class of models.
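In R such models are still specified with lm(); for instance, a sketch with hypothetical variable names, regressing a log-transformed response on a quadratic function of one explanatory variable:

R> fit <- lm(log(y) ~ x1 + I(x1^2) + x2, data = d)

The model is non-linear in x1 but linear in the parameters, so it remains a linear model.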
The multiple linear regression model can be written most conveniently for all $n$ individuals by using matrices and vectors as $y = X\beta + \varepsilon$, where $y^\top = (y_1, \dots, y_n)$ is the vector of response variables, $\beta^\top = (\beta_0, \beta_1, \dots, \beta_q)$ is the vector of regression coefficients, and $\varepsilon^\top = (\varepsilon_1, \dots, \varepsilon_n)$ are the error terms. The design or model matrix $X$ consists of the $q$ continuously measured explanatory variables and a column of ones corresponding to the intercept term:

\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1q} \\
1 & x_{21} & x_{22} & \cdots & x_{2q} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nq}
\end{pmatrix}.
\]
In case one or more of the explanatory variables are nominal or ordinal variables, they are represented by a zero-one dummy coding. Assume that $x_1$ is a factor at $m$ levels; the submatrix of $X$ corresponding to $x_1$ is then an $n \times m$ matrix of zeros and ones, where the $j$th element in the $i$th row is one when $x_{i1}$ is at the $j$th level.
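As a small illustration with a hypothetical three-level factor, the full zero-one dummy coding can be inspected via model.matrix:

R> f <- factor(c("low", "high", "mid", "low"))
R> model.matrix(~ f - 1)   # n x m matrix of zeros and ones, one column per level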
Assuming that the cross-product $X^\top X$ is non-singular, i.e., can be inverted, the least squares estimator of the parameter vector $\beta$ is unique and can be calculated as $\hat\beta = (X^\top X)^{-1} X^\top y$. The expectation and covariance of this estimator $\hat\beta$ are given by $\mathsf{E}(\hat\beta) = \beta$ and $\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}$. The diagonal elements of the covariance matrix $\mathrm{Var}(\hat\beta)$ give the variances of $\hat\beta_j$, $j = 0, \dots, q$, whereas the off-diagonal elements give the covariances between pairs of $\hat\beta_j$ and $\hat\beta_k$. The square roots of the diagonal elements of the covariance matrix are thus the standard errors of the estimates $\hat\beta_j$.
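These formulas translate directly into R; a minimal sketch, assuming a design matrix X (including the column of ones) and a response vector y are available. In practice lm() is preferable, as it uses a numerically more stable QR decomposition:

R> betahat <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
R> rss <- sum((y - X %*% betahat)^2)                # residual sum of squares
R> sigma2hat <- rss / (nrow(X) - ncol(X))           # estimate of sigma^2
R> se <- sqrt(diag(sigma2hat * solve(t(X) %*% X)))  # standard errors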
If the cross-product $X^\top X$ is singular, we need to reformulate the model to $y = XC\beta^\star + \varepsilon$ such that $X^\star = XC$ has full rank. The matrix $C$ is called the contrast matrix in S and R, and the result of the model fit is an estimate $\hat\beta^\star$. By default, a contrast matrix derived from treatment contrasts is used. For the theoretical details we refer to Searle (1971); the implementation of contrasts in S and R is discussed by Chambers and Hastie (1992) and Venables and Ripley (2002).
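For example, the treatment contrast matrix for a factor with three levels can be inspected directly:

R> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1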
The regression analysis can be assessed using the following analysis of variance table (Table 6.3):

Table 6.3: Analysis of variance table for the multiple linear regression model.

Source of variation    Sum of squares                          Degrees of freedom
Regression             $\sum_{i=1}^n (\hat y_i - \bar y)^2$    $q$
Residual               $\sum_{i=1}^n (y_i - \hat y_i)^2$       $n - q - 1$
Total                  $\sum_{i=1}^n (y_i - \bar y)^2$         $n - 1$

where $\hat y_i$ is the predicted value of the response variable for the $i$th individual, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_q x_{iq}$, and $\bar y = \sum_{i=1}^n y_i / n$ is the mean of the response variable.
The mean square ratio

\[
F = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / q}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n - q - 1)}
\]

provides an $F$-test of the general hypothesis

\[
H_0: \beta_1 = \cdots = \beta_q = 0.
\]

Under $H_0$, the test statistic $F$ has an $F$-distribution with $q$ and $n - q - 1$ degrees of freedom. An estimate of the variance $\sigma^2$ is

\[
\hat\sigma^2 = \frac{1}{n - q - 1} \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

The correlation between the observed values $y_i$ and the fitted values $\hat y_i$ is known as the multiple correlation coefficient. Individual regression coefficients can be assessed by using the ratio t-statistics $t_j = \hat\beta_j / \sqrt{\mathrm{Var}(\hat\beta)_{jj}}$, although these ratios should be used only as rough guides to the 'significance' of the coefficients. The problem of selecting the 'best' subset of variables to be included in a model is one of the most delicate ones in statistics, and we refer to Miller (2002) for the theoretical details and practical limitations (and see Exercise 6.4).
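The F statistic and variance estimate above can be reproduced directly from a fitted model; a minimal sketch, assuming a fitted lm object fit with an intercept and q numeric explanatory variables:

R> y <- model.response(model.frame(fit))
R> yhat <- fitted(fit)
R> n <- length(y)
R> q <- length(coef(fit)) - 1    # number of explanatory variables
R> Fstat <- (sum((yhat - mean(y))^2) / q) /
+      (sum((y - yhat)^2) / (n - q - 1))
R> pf(Fstat, q, n - q - 1, lower.tail = FALSE)   # p-value of the F-test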
6.3.1 Regression Diagnostics
The possible influence of outliers and the checking of assumptions made in fitting the multiple regression model, i.e., constant variance and normality of error terms, can both be undertaken using a variety of diagnostic tools, of which the simplest and most well known are the estimated residuals, i.e., the differences between the observed values of the response and the fitted values of the response. In essence these residuals estimate the error terms in the simple and multiple linear regression model. So, after estimation, the next stage in the analysis should be an examination of such residuals from fitting the chosen model to check on the normality and constant variance assumptions and to identify outliers. The most useful plots of these residuals are:
• A plot of residuals against each explanatory variable in the model. The presence of a non-linear relationship, for example, may suggest that a higher-order term in the explanatory variable should be considered.
• A plot of residuals against fitted values. If the variance of the residuals appears to increase with predicted value, a transformation of the response variable may be in order.
• A normal probability plot of the residuals. After all the systematic variation has been removed from the data, the residuals should look like a sample from a standard normal distribution. A plot of the ordered residuals against the expected order statistics from a normal distribution provides a graphical check of this assumption.
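Each of these plots is straightforward to produce for a fitted lm object; a minimal sketch, where fit is an assumed fitted model and d$x an assumed explanatory variable:

R> plot(d$x, residuals(fit))            # residuals vs an explanatory variable
R> plot(fitted(fit), residuals(fit))    # residuals vs fitted values
R> qqnorm(residuals(fit))               # normal probability plot
R> qqline(residuals(fit))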
6.4 Analysis Using R
6.4.1 Estimating the Age of the Universe
Prior to applying a simple regression to the data it will be useful to look at a plot to assess their major features. The R code given in Figure 6.1 produces a scatterplot of velocity and distance. The diagram shows a clear, strong relationship between velocity and distance. The next step is to fit a simple linear regression model to the data, but in this case the nature of the data requires a model without intercept because if distance is zero so is relative speed. So the model to be fitted to these data is

velocity = $\beta_1$ distance + $\varepsilon$.

This is essentially what astronomers call Hubble's Law, and $\beta_1$ is known as Hubble's constant; $\beta_1^{-1}$ gives an approximate age of the universe.
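The hubble data can be loaded from the gamair package mentioned in Section 6.1 (assuming the package is installed):

R> data("hubble", package = "gamair")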
To fit this model we need to estimate $\beta_1$ using formula (6.1). Although this operation is rather easy to carry out directly
R> sum(hubble$distance * hubble$velocity) /
+ sum(hubble$distance^2)
[1] 76.58117
it is more convenient to apply R’s linear modelling function
R> plot(velocity ~ distance, data = hubble)

Figure 6.1 Scatterplot of velocity and distance.
R> hmod <- lm(velocity ~ distance - 1, data = hubble)
Note that the model formula specifies a model without intercept. We can now extract the estimated model coefficients via
R> coef(hmod)
distance
76.58117
and add this estimated regression line to the scatterplot; the result is shown in Figure 6.2. In addition, we produce a scatterplot of the residuals $y_i - \hat y_i$ against the fitted values $\hat y_i$ to assess the quality of the model fit. It seems that for higher distance values the variance of velocity increases; however, we are interested only in the estimated parameter $\hat\beta_1$, which remains valid under variance heterogeneity (in contrast to t-tests and associated p-values).
R> layout(matrix(1:2, ncol = 2))
R> plot(velocity ~ distance, data = hubble)
R> abline(hmod)
R> plot(hmod, which = 1)
Figure 6.2 Scatterplot of velocity and distance with estimated regression line (left) and plot of residuals against fitted values (right).
Now we can use the estimated value of $\beta_1$ to find an approximate value for the age of the universe. The Hubble constant itself has units of km $\times$ sec$^{-1}$ $\times$ Mpc$^{-1}$. A mega-parsec (Mpc) is $3.09 \times 10^{19}$ km, so we need to divide the estimated value of $\beta_1$ by this amount in order to obtain Hubble's constant with units of sec$^{-1}$. The approximate age of the universe in seconds will then be the inverse of this calculation. Carrying out the necessary computations

R> Mpc <- 3.09 * 10^19
R> ysec <- 60^2 * 24 * 365.25
R> Mpcyear <- Mpc / ysec
R> 1 / (coef(hmod) / Mpcyear)
distance
12785935335
gives an estimated age of roughly 12.8 billion years.
6.4.2 Cloud Seeding
Again, a graphical display highlighting the most important aspects of the data will be helpful. Here we will construct boxplots of the rainfall in each category of the dichotomous explanatory variables and scatterplots of rainfall against each of the continuous explanatory variables.
Both the boxplots (Figure 6.3) and the scatterplots (Figure 6.4) show some evidence of outliers. The row names of the extreme observations in the clouds data.frame can be identified via

R> rownames(clouds)[clouds$rainfall %in% c(bxpseeding$out,
+      bxpecho$out)]

[1] "1" "15"
where bxpseeding and bxpecho are variables created by boxplot in Figure 6.3. We shall not remove these observations for now, but bear in mind during the modelling process that they may cause problems.
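Figure 6.3 itself is not reproduced here; a minimal sketch of how the two boxplot objects might have been created (the axis labels are assumptions):

R> bxpseeding <- boxplot(rainfall ~ seeding, data = clouds,
+      ylab = "Rainfall", xlab = "Seeding")
R> bxpecho <- boxplot(rainfall ~ echomotion, data = clouds,
+      ylab = "Rainfall", xlab = "Echo Motion")

boxplot invisibly returns a list whose out component contains the outlier values used above.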
In this example it is sensible to assume that the effect of some of the other explanatory variables is modified by seeding, and therefore to consider a model that includes seeding as a covariate and, furthermore, allows interaction terms for seeding with each of the covariates except time. This model can be described by the formula

R> clouds_formula <- rainfall ~ seeding +
+      seeding:(sne + cloudcover + prewetness + echomotion) +
+      time
and the design matrix $X^\star$ can be computed via
R> Xstar <- model.matrix(clouds_formula, data = clouds)
By default, treatment contrasts have been applied to the dummy codings of the factors seeding and echomotion, as can be seen from an inspection of the contrasts attribute of the model matrix:
R> attr(Xstar, "contrasts")
$seeding
[1] "contr.treatment"
$echomotion
[1] "contr.treatment"
The default contrasts can be changed via the contrasts.arg argument to model.matrix or the contrasts argument to the fitting function, for example lm or aov, as shown in Chapter 5.
However, such internals are hidden and performed by high-level model-fitting functions such as lm, which will be used to fit the linear model defined by the formula clouds_formula:
R> clouds_lm <- lm(clouds_formula, data = clouds)
R> class(clouds_lm)
[1] "lm"
The result of the model fit is an object of class lm for which a summary method showing the conventional regression analysis output is available.