1.1 FITTING A MODEL TO DATA
1.1.1 What is Regression?
1.1.1.1 Historical Note
Regression is, arguably, the most commonly used technique in applied statistics. It can be used with data that are collected in a very structured way, such as sample surveys or experiments, but it can also be applied to observational data. This flexibility is its strength but also its weakness, if used in an unthinking manner.
The history of the method can be traced to Sir Francis Galton, who published in 1885 a paper with the title ``Regression toward mediocrity in hereditary stature.'' In essence, he measured the heights of parents, found the median height of each mother-father pair, and compared these medians with the heights of their adult offspring. He concluded that those with very tall parents were generally taller than average but were not as tall as the median height of their parents; those with short parents tended to be below average height but were not as short as the median height of their parents. Female offspring were combined with males by multiplying female heights by a factor of 1.08.
Regression can be used to explain relationships or to predict outcomes. In Galton's data, the median height of the parents is the explanatory or predictor variable, which we denote by X, while the response or predicted variable is the height of the offspring, denoted by Y. While the individual value of Y cannot be forecast exactly, the average value can be for a given value of the explanatory variable, X.
1.1.1.2 Brief Overview
Uppermost in the minds of the authors of this chapter is the desire to relate some basic theory to the application and practice of regression. In Sec. 1.1, we set out some terminology and basic theory. Section 1.2 examines statistics and graphs to explore how well the regression model fits the data. Section 1.3 concentrates on variables and how to select a small but effective model. Section 1.4 looks at individual data points and seeks out peculiar observations.
We will attempt to relate the discussion to some data sets which are shown in Sec. 1.5. Note that data may have many different forms, and the questions asked of the data will vary considerably from one application to another. The variety of types of data is evident from the descriptions of some of these data sets.
Example 1 Pairs (Triplets, etc.) of Variables (Sec. 1.5.1): The Y-variable in this example is the heat developed in mixing the components of certain cements, which have varying amounts of four X-variables, or chemicals, in the mixture. There is no information about how the various amounts of the X-variables have been chosen. All variables are continuous variables.
Example 2 Grouping Variables (Sec. 1.5.2): Qualitative variables are introduced to indicate groups allocated to different safety programs. These qualitative variables differ from other variables in that they only take the values of 0 or 1.
Example 3 A Designed Experiment (Sec. 1.5.3): In this example, the values of the X-variables have been set in advance, as the study is structured as a three-factor composite experimental design. The X-variables form a pattern chosen to ensure that they are uncorrelated.
1.1.1.3 What Is a Statistical Model?
A statistical model is an abstraction from the actual data and refers to all possible values of Y in the population and the relationship between Y and the corresponding X in the model. In practice, we only have sample values, y and x, so that we can only check to ascertain whether the model is a reasonable fit to these data values.
In some areas of science, there are laws, such as the relationship E = mc^2, in which it is assumed that the model is an exact relationship. In other words, this law is a deterministic model in which there is no error. In statistical models, we assume that the model is stochastic, by which we mean that there is an error term, e, so that the model can be written as

Y = f(X = x) + e
In a regression model, f(.) indicates a linear function of the X-terms. The error term is assumed to be random with a mean of zero and a variance which is constant, that is, it does not depend on the value taken by the X-term. It may reflect error in the measurement of the Y-variable or the effect of variables or conditions not defined in the model. The X-variable, on the other hand, is assumed to be measured without error.
In Galton's data on heights of parents and offspring, the error term may be due to measurement error in obtaining the heights or the natural variation that is likely to occur in the physical attributes of offspring compared with their parents.
There is a saying that ``No model is correct but some are useful.'' In other words, no model will exactly capture all the peculiarities of a data set, but some models will fit better than others.
1.1.2 How to Fit a Model
1.1.2.1 Least-Squares Method
We consider Example 1, but concentrate on the effect of the first variable, x1, which is tricalcium aluminate, on the response variable, which is the heat generated. The plot of heat on tricalcium aluminate, with the least-squares regression line, is shown in Fig. 1. The least-squares line is shown by the solid line and can be written as

ŷ = f(X = x1) = a + bx1 = 81.5 + 1.87x1     (1)

where ŷ is the predicted value of y for the given value x1 of the variable X1.
Figure 1 Plot of heat, y, on tricalcium aluminate, x1
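As an illustrative sketch (not part of the chapter's own output), the coefficients of Eq. (1) can be reproduced in Python, assuming the x1 and y columns of Sec. 1.5.1 take the standard Hald cement values, which are consistent with the sums and coefficients quoted in the text:

import numpy as np

# x1: tricalcium aluminate, y: heat evolved (assumed standard Hald cement values)
x1 = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

# Least-squares slope and intercept of y-hat = a + b*x1
b = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)
a = y.mean() - b * x1.mean()
print(round(a, 1), round(b, 2))   # 81.5 1.87, as in Eq. (1)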
All the points represented by (x1, y) do not fall on the line but are scattered about it. The vertical distance between each observation, y, and its respective predicted value, ŷ, is called the residual, which we denote by e. The residual is positive if the observed value of y falls above the line and negative if below it. Notice in Sec. 1.5.1 that for the fourth row in the table, the fitted value is 102.04 and the residual (shown by e in Fig. 1) is -14.44, which corresponds to one of the four points below the regression line, namely the point (x1, y) = (11, 87.6).
At each of the x1 values in the data set we assume that the population values of Y can be written as a linear model, by which we mean that the model is linear in the parameters:

Y = α + βx + ε     (2)

For convenience, we drop the subscript in the following discussion. More correctly, Y should be written as Y | x, which is read as ``Y given X = x.''
Notice that a model, in this case a regression model, is a hypothetical device which explains relationships in the population for all possible values of Y for given values of X. The error (or deviation) term, ε, is assumed to have, for each point in the sample, a population mean of zero and a constant variance of σ², so that for X equal to a particular value x, Y has the following distribution:

Y | x is distributed with mean α + βx and variance σ².
It is also assumed that for any two points in the sample, i and j, the deviations εi and εj are uncorrelated.
The method of least squares uses the sample of n (= 13 here) values of x and y to find the least-squares estimates, a and b, of the population parameters α and β by minimizing the deviations. More specifically, we seek to minimize the sum of squares of e, which we denote by S², which can be written as

S² = Σe² = Σ[y - f(x)]² = Σ(y - a - bx)²     (3)

The symbol Σ indicates the summation over the n = 13 points in the sample.
1.1.2.2 Normal Equations
The values of the coefficients a and b which minimize S² can be found by solving the following equations, which are called the normal equations. We do not prove this statement, but the reader may refer to a textbook on regression, such as Brook and Arnold [1]:

Σ(y - a - bx) = 0   or   na + bΣx = Σy
Σx(y - a - bx) = 0   or   aΣx + bΣx² = Σxy     (4)

From Sec. 1.5.1, we see that the mean of x is 7.5 and of y is 95.4. The normal equations become

13a + 97b = 1240.5
97a + 1139b = 10,032

Simple arithmetic gives the solutions as a = 81.5 and b = 1.87.
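A minimal sketch of solving the normal equations numerically (again assuming the standard Hald values for x1 and y):

import numpy as np

x1 = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])
n = len(x1)

# Normal equations, Eq. (4):  n*a + (sum x)*b = sum y
#                             (sum x)*a + (sum x^2)*b = sum x*y
A = np.array([[n, x1.sum()], [x1.sum(), (x1 ** 2).sum()]])
rhs = np.array([y.sum(), (x1 * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(round(a, 1), round(b, 2))   # 81.5 1.87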
1.1.3 Simple Transformations
1.1.3.1 Scaling
The size of the coefficients in a fitted model will depend on the scales of the variables, predicted and predictor. In the cement example, the X-variables are measured in grams. Clearly, if these variables were changed to kilograms, the values of the X would be divided by 1000 and, consequently, the sizes of the least-squares coefficients would be multiplied by 1000. In this example, the coefficients would be large and it would be clumsy to use such a transformation.
In some examples, it is not clear what scales should be used. To measure the consumption of petrol (gas), it is usual to quote the number of miles per gallon, but for those countries which use the metric system, it is the inverse which is often quoted, namely the number of liters per 100 km travelled.
1.1.3.2 Centering of Data
In some situations, it may be an advantage to change x to its deviation from its mean, that is, x - x̄. The fitted equation becomes

ŷ = a + b(x - x̄)

but these values of a and b may differ from those in Eq. (1). Notice that the sum of the (x - x̄) terms is zero, so the constant term becomes the mean of y, while the slope can be shown to be the same as in Eq. (5). The fitted line is

ŷ = 95.42 + 1.87(x - x̄)

If the y variable is also centered, the fitted line through the two centered variables is

y = 1.87x

with both variables now measured as deviations from their means.
The important point of this section is that the inclusion of a constant term in the model leads to the same coefficient for the X term as transforming X to be centered about its mean. In practice, we do not need to perform this transformation of centering, as the inclusion of a constant term in the model leads to the same estimated coefficient for the X variable.
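This equivalence is easy to check numerically. A sketch (assuming the standard Hald values) fits the model once with the raw predictor and once with the centered predictor:

import numpy as np

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

# Fit with a constant term (columns: 1, x)
X = np.column_stack([np.ones_like(x), x])
coef_raw, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fit with the predictor centered (columns: 1, x - xbar)
Xc = np.column_stack([np.ones_like(x), x - x.mean()])
coef_centered, *_ = np.linalg.lstsq(Xc, y, rcond=None)

print(coef_raw)        # approx [81.5, 1.87]
print(coef_centered)   # approx [95.4, 1.87]: same slope, intercept is ybar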
1.1.4 Correlations
Readers will be familiar with the correlation coefficient between two variables. In particular, the correlation between y and x is given by

r_xy = S_xy / sqrt(S_xx S_yy)     (8)

There is a duality in this formula in that interchanging x and y would not change the value of r. The relationship between correlation and regression is that the coefficient b in the simple regression line above can be written as

b = r sqrt(S_yy / S_xx)     (9)

In regression, the duality of x and y does not hold. A regression line of y on x will differ from a regression line of x on y.
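A short sketch (standard Hald values assumed) of Eqs. (8) and (9), which also shows that the slope of x on y is not simply the reciprocal of the slope of y on x:

import numpy as np

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

r = Sxy / np.sqrt(Sxx * Syy)          # Eq. (8)
b_y_on_x = r * np.sqrt(Syy / Sxx)     # Eq. (9): slope of y on x
b_x_on_y = r * np.sqrt(Sxx / Syy)     # slope of x on y, not 1/b_y_on_x
print(r, b_y_on_x, b_x_on_y)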
1.1.5 Vectors
1.1.5.1 Vector Notation
The data for the cement example (Sec. 1.5) appear as equal-length columns. This is typical of data sets in regression analysis. Each column could be considered as a column vector with 13 components. We focus on the three variables y (heat generated), ŷ (FITS1, the predicted values of y), and e (RESI1, the residuals). Notice that we represent a vector by bold type: y, ŷ, and e.
The vectors simplify the columns of data to two aspects, the lengths and directions of the vectors and, hence, the angles between them. The length of a vector can be found by the inner, or scalar, product. The reader will recall that the inner product of y with itself is represented as y.y or yᵀy, which is simply the sum of the squares of the individual elements.
Of more interest is the inner product of ŷ with e, which can be shown to be zero. These two vectors are said to be orthogonal or ``at right angles'' as indicated in Fig. 2.
We will not go into many details about the geometry of the vectors, but it is usual to talk of ŷ being the projection of y in the direction of x. Similarly, e is the projection of y in a direction orthogonal to x, orthogonal being a generalization to many dimensions of ``at right angles to,'' which becomes clear when the angle is considered.
Notice that e and ŷ are ``at right angles'' or ``orthogonal.'' It can be shown that a necessary and sufficient condition for this to be true is that eᵀŷ = 0.
In vector terms, the predicted value of y is

ŷ = a1 + bx

and the fitted model is

y = a1 + bx + e

Writing the constant term as a column vector of 1's paves the way for the introduction of matrices in Sec. 1.1.7.
Figure 2 Relationship between y, ŷ, and e.
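The orthogonality of ŷ and e, and the resulting Pythagorean split of yᵀy, can be verified numerically. A sketch, again assuming the standard Hald values:

import numpy as np

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

X = np.column_stack([np.ones_like(x), x])      # constant written as a column of 1's
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef                               # FITS1
e = y - y_hat                                  # RESI1

print(np.dot(e, y_hat))                        # essentially 0: e and y-hat are orthogonal
print(np.dot(y, y), np.dot(y_hat, y_hat) + np.dot(e, e))   # the two sums of squares agree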
1.1.5.2 Vectors: Centering and Correlations
In this section, we write the vector terms in such a way that the components are deviations from the mean; we center both the y and x vectors.
The length of the vector y, written as |y|, is the square root of yᵀy, which is 52.11. Similarly, the lengths of ŷ and e are 38.08 and 35.57, respectively.
The inner product of y with the vector of fitted values, ŷ, determines the angle θ between these two vectors. As y and x are centered, the correlation coefficient of y on x can be shown to be cos θ.
1.1.6 Residuals and Fits
We return to the actual values of the X and Y variables, not the centered values as above. Figure 2 provides more insight into the normal equations, as the least-squares solution to the normal equations occurs when the vector of residuals is orthogonal to the vector of predicted values. Notice that ŷᵀe = 0 can be expanded to

(a1 + bx)ᵀe = a1ᵀe + bxᵀe = 0     (12)

This condition will be true if each of the two parts is equal to zero, which leads to the normal equations, Eq. (4), above.
Notice that the last column of Sec. 1.5.1 confirms that the sum of the residuals is zero. It can be shown that the corollary of this is that the sum of the observed y values is the same as the sum of the fitted y values; if the sums are equal the means are equal, and Sec. 1.5.1 shows that they are both 95.4.
The second normal equation in Eq. (4) could be checked by multiplying the components of the two columns marked x1 and RESI1 and then adding the result.
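Both checks are easy to carry out. A sketch (standard Hald values assumed) confirms that the residuals satisfy the two normal equations and that the observed and fitted means agree:

import numpy as np

x1 = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

X = np.column_stack([np.ones_like(x1), x1])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                     # the RESI1 column

print(e.sum())                       # first normal equation: approximately 0
print((x1 * e).sum())                # second normal equation: approximately 0
print(y.mean(), (X @ coef).mean())   # both means are 95.4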
In Fig. 3, we would expect the residuals to fall approximately into a horizontal band on either side of the zero line. If the data satisfy the assumptions, we would expect that there would not be any systematic trend in the residuals. At times, our eyes may deceive us into thinking there is such a trend when in fact there is not one. We pick this topic up again later.

1.1.7 Adding a Variable
1.1.7.1 Two Predictor Variables
Suppose a second predictor variable, x2, is added to the model. The constant term can be written as b0x0 where, without loss of generality, x0 = 1.
The normal equations follow a similar pattern to those indicated by Eq. (4), namely,

Σ(b0 + b1x1 + b2x2) = Σy
Σx1(b0 + b1x1 + b2x2) = Σx1y
Σx2(b0 + b1x1 + b2x2) = Σx2y     (13)
Figure 3 Plot of residuals against fitted values for y on x1.
Note that the entries in bold type are the same as those in the normal equations of the model with one predictor variable. It is clear that the solutions for b0 and b1 will differ from those of a and b in the normal equations, Eq. (6). It can be shown that the solutions are: b0 = 52.6, b1 = 1.47, and b2 = 0.662.
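A sketch of the two-predictor fit (assuming the x1, x2, and y columns of Sec. 1.5.1 are the standard Hald values) reproduces these solutions:

import numpy as np

x1 = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
x2 = np.array([26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])

X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 3))   # approx [52.6, 1.468, 0.662]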
Note:
1. By adding the second predictor variable, x2, the coefficient for the constant term has changed from a = 81.5 to b0 = 52.6. Likewise the coefficient for x1 has changed from 1.87 to 1.47. The structure of the normal equations gives some indication why this is so.
2. The coefficients would not change in value if the variables were orthogonal to each other. For example, if x0 were orthogonal to x2, Σx0x2 would be zero. This would occur if x2 were in the form of deviations from its mean. Likewise, if x1 and x2 were orthogonal, Σx1x2 would be zero.
3. What is the meaning of the coefficients, for example b1? From the fitted regression equation, one is tempted to say that ``b1 is the increase in y when x1 increases by 1.'' From 2, we have to add to this the words ``in the presence of the other variables in the model.'' Hence, if you change the variables, the meaning of b1 also changes.
When other variables are added to the model, the formulas for the coefficients become very clumsy and it is much easier to extend the notation of vectors to that of matrices. Matrices provide a clear, generic approach to the problem.
1.1.7.2 Vectors and Matrices
As an illustration, we use the cement data in which there are four predictor variables. The model is

y = β0x0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

The fitted regression equation can be written in vector notation,

y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + e     (15)

The data are displayed in Sec. 1.5.1. Notice that each column vector has n = 13 entries and there are k = 5 vectors. As a block of five vectors, the predictors can be written as an n × k = 13 × 5 matrix, X.
The fitted regression equation leads to the normal equations

13b0 + 97b1 + 626b2 + 153b3 + 390b4 = 1240.5
97b0 + 1139b1 + 4922b2 + 769b3 + 2620b4 = 10,032
626b0 + 4922b1 + 33,050b2 + 7201b3 + 15,739b4 = 62,027.8
153b0 + 769b1 + 7201b2 + 2293b3 + 4628b4 = 13,981.5
390b0 + 2620b1 + 15,739b2 + 4628b3 + 15,062b4 = 34,733.3

Notice the symmetry in the coefficients of the bi. The matrix solution is

b = (XᵀX)⁻¹XᵀY

bᵀ = (62.4, 1.55, 0.510, 0.102, -0.144)     (18)

With the solution to the normal equations written as above, it is easy to see that the least-squares estimates of the parameters are weighted means of all the y values in the data. The estimates can be written as

bi = Σ wi yi

where the weights wi are functions of the x values. The regression coefficients reflect the strengths and weaknesses of means. The strengths are that each point in the data set contributes to each estimate, but the weaknesses are that one or two unusual values in the data set can have a disproportionate effect on the resulting estimates.
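The matrix solution of Eq. (18) can be sketched directly, assuming the four predictor columns of Sec. 1.5.1 are the standard Hald values:

import numpy as np

x1 = np.array([ 7,  1, 11, 11,  7, 11,  3,  1,  2, 21,  1, 11, 10], dtype=float)
x2 = np.array([26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68], dtype=float)
x3 = np.array([ 6, 15,  8,  8,  6,  9, 17, 22, 18,  4, 23,  9,  8], dtype=float)
x4 = np.array([60, 52, 20, 47, 33, 22,  6, 44, 22, 26, 34, 12, 12], dtype=float)
y  = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
               93.1, 115.9, 83.8, 113.3, 109.4])

X = np.column_stack([np.ones(13), x1, x2, x3, x4])   # n x k = 13 x 5 matrix
b = np.linalg.solve(X.T @ X, X.T @ y)                # b = (X'X)^-1 X'y
print(np.round(b, 3))   # approx [62.4, 1.551, 0.510, 0.102, -0.144]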
1.1.7.3 The Projection Matrix, P
From the matrix solution, the fitted regression equation becomes

ŷ = Xb = X(XᵀX)⁻¹Xᵀy = Py     (19)

P = X(XᵀX)⁻¹Xᵀ is called the projection matrix and it has some nice properties, namely:
1. Pᵀ = P, that is, it is symmetrical.
2. PᵀP = P, that is, it is idempotent.
3. The residual vector e = y - ŷ = (I - P)y, where I is the identity matrix with diagonal elements being 1 and the off-diagonal elements being 0.
4. From the triangle diagram, e is orthogonal to ŷ, which is easy to see as eᵀŷ = yᵀ(I - P)ᵀPy = yᵀ(P - PᵀP)y = 0.
5. P is the projection matrix onto X and ŷ is the projection of y onto X.
6. I - P is the projection matrix orthogonal to X, and the residual, e, is the projection of y onto a direction orthogonal to X.
The vector diagram of Fig. 2 becomes Fig. 4.
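These properties are easy to confirm numerically. A small sketch using an arbitrary full-rank X (not the cement data):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(13), rng.normal(size=(13, 2))])   # any full-rank X
y = rng.normal(size=13)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection matrix onto the columns of X
I = np.eye(13)

print(np.allclose(P, P.T))                # property 1: symmetric
print(np.allclose(P @ P, P))              # property 2: idempotent
e = (I - P) @ y                           # property 3: residual vector
y_hat = P @ y                             # property 5: projection of y onto X
print(abs(e @ y_hat) < 1e-10)             # property 4: e orthogonal to y-hat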
1.1.8 Normality
1.1.8.1 Assumptions about the Models
In the discussion so far, we have seen some of the relationships and estimates which result from the least-squares method and which depend on assumptions about the error, or deviation, term in the model. We now add a further restriction to these assumptions, namely that the error term, e, is distributed normally. This allows us to find the distribution of the residuals, to find confidence intervals for certain estimates, and to carry out hypothesis tests on them.
The addition of the assumption of normality adds to the concept of correlation, as a zero correlation coefficient between two variables will then mean that they are statistically independent.
1.1.8.2 Confidence Interval for the Coefficients
We are usually more interested in the coefficient of the x term. The confidence interval (CI) for this coefficient, β1, is given by

CI = b1 ± t(n-2) sqrt(s²/Sxx)     (21)
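A sketch of Eq. (21) for the one-predictor cement model, assuming the standard Hald values and using SciPy for the t quantile:

import numpy as np
from scipy import stats

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)   # residual mean square, s^2

t = stats.t.ppf(0.975, df=n - 2)                # 95% two-sided t quantile
half = t * np.sqrt(s2 / Sxx)
print(b1 - half, b1 + half)                     # 95% CI for beta1, Eq. (21)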
1.1.8.3 Confidence Interval for the Mean
The 95% confidence interval for the predicted value, ŷ, when x = x0 is given by

CI = ŷ0 ± t(n-2) s sqrt(1/n + (x0 - x̄)²/Sxx)

The interval is smallest when x0 is close to x̄. This confidence interval is illustrated in Fig. 5 using the cement data.
1.1.8.4 Prediction Interval for a Future Value
At times one wants to forecast the value of y for a given single future value, x0, of x. This prediction interval for a future single point is wider than the confidence interval of the mean, as the variance of a single value of y around the mean is σ². In fact, the ``1'' under the square root symbol may dominate the other terms. The formula is given by

PI = ŷ0 ± t(n-2) s sqrt(1 + 1/n + (x0 - x̄)²/Sxx)

Figure 4 Projections of y in terms of P.
Figure 5 Confidence and prediction intervals.
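A rough numerical sketch of both intervals for the one-predictor cement model; the data are assumed to be the standard Hald values, and x0 = 10 is an arbitrary illustrative value:

import numpy as np
from scipy import stats

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
t = stats.t.ppf(0.975, df=n - 2)

x0 = 10.0                                   # hypothetical future value of x
y0 = b0 + b1 * x0
half_ci = t * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)       # CI for the mean
half_pi = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)   # PI for a single value
print(y0 - half_ci, y0 + half_ci)
print(y0 - half_pi, y0 + half_pi)           # wider, because of the extra "1"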
1.1.9 Summary
Regression is a widely used and flexible tool, applicable to many situations.
The method of least squares is the most commonly used in regression.
The resulting estimates are weighted means of the response variable at each data point. Means may not be resistant to extreme values of either X or y.
The normal, Gaussian, distribution is closely linked to least squares, which facilitates the use of the standard statistical methods of confidence intervals and hypothesis tests.
In fitting a model to data, an important result of the least-squares approach is that the vector of fitted or predicted values is orthogonal to the vector of residuals. With the added assumption of normality, the residuals are statistically independent of the fitted values.
The data appear as columns which can be considered as vectors. Groups of X vectors can be manipulated as a matrix. A projection matrix is a useful tool in understanding the relationships between the observed values of y, the predicted y, and the residuals.
1.2 GOODNESS OF FIT OF THE MODEL
1.2.1 Regression Printout from MINITAB
1.2.1.1 Regression with One or More Predictor Variables
In this section, comments are made on the printout from a MINITAB program on the cement data, using the heat evolved as y and the number of grams of tricalcium aluminate as x. This is extended to two or more predictor variables.
We have noted in Sec. 1.1.7.1 that the estimated coefficients will vary depending on the other variables in the model. With the first two variables in the model, the fitted regression equation represents a plane and the least-squares solution is

ŷ = 52.6 + 1.47x1 + 0.662x2

In vector terms, it is clear that x1 is not orthogonal to x2.

1.2.1.3 Distribution of the Coefficients
The formula for the standard deviation (also called the standard error by some authors) of the constant term and of the x1 term is given in Sec. 1.1.8.1.
The T is the t-statistic: (estimate - hypothesized parameter)/standard deviation. The hypothesized parameter is its value under the null hypothesis, which is zero in this situation. The degrees of freedom are the same as those for the error or residual term. One measure of the goodness of fit of the model is whether the values of the estimated coefficients, and hence the values of the respective t-statistics, could have arisen by chance, and this is indicated by the p-values.
The p-value is the probability of obtaining a more extreme t-value by chance. As the p-values here are small, we conclude that it is unlikely that the t-values arose by chance rather than from the presence of x1 in the model. In other words, as the probabilities are small (< 0.05, which is the commonly used level), both the constant and b1 are significant at the 5% level.
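A sketch of the t-statistic and two-sided p-value for b1 in the one-predictor model (standard Hald values assumed):

import numpy as np
from scipy import stats

x = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
              93.1, 115.9, 83.8, 113.3, 109.4])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

se_b1 = np.sqrt(s2 / Sxx)                        # standard error of b1
t_b1 = (b1 - 0) / se_b1                          # hypothesized value is 0
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)       # two-sided p-value
print(t_b1, p_b1)                                # t is about 3.5, p well below 0.05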
1.2.1.4 R-Squared and Standard Error
S = 10.73 R-Sq = 53.4% R-Sq(adj) = 49.2%
S = 10.73 is the standard error of the residual term. We would prefer to use lower case, s; it is the estimate of σ obtained from the S² of Eq. (3), with s² = S²/(n - 2).
R-Sq (short for R-squared) is the coefficient of determination, R², which indicates the proportion of the variation of Y explained by the regression equation:

R² = S_ŷŷ / S_yy, and recall that S_yy = Σ(y - ȳ)²

It can be shown that R is the correlation coefficient between ŷ and y, provided that the x and y terms have been centered, in which case S_ŷŷ = ŷᵀŷ = yᵀPy.
R² lies between 0, if the regression equation does not explain any of the variation of Y, and 1, if the regression equation explains all of the variation. Some authors and programs such as MINITAB write R² as a percentage between 0 and 100%. In this case, R² is only about 50%, which does not indicate a good fit. After all, this means that 50% of the variation of y is unaccounted for.
As more variables are added to the model, the value of R² will increase, as shown in the following table. The variables x1, x2, x3, and x4 were sequentially added to the model. Some authors and computer programs consider the increase in R², denoted by ΔR². In this example, x2 adds a considerable amount to R² but the next two variables add very little. In fact, x4 appears not to add any prediction power to the model, but this would suggest that the vector x4 is orthogonal to the others. It is more likely that some rounding error has occurred.
One peculiarity of R² is that it will, by chance, give a value between 0 and 100% even if the X variable is a column of random numbers. To adjust for the random effect of the k variables in the model, the R², as a proportion, is reduced by k/(n - 1) and then rescaled to fall between 0 and 1 to give the adjusted R². It could be multiplied by 100 to become a percent:

adjusted R² = [R² - k/(n - 1)] (n - 1)/(n - 1 - k)
The analysis of variance (ANOVA) table is based on the partition

Sum of squares of y = Sum of squares of ŷ + Sum of squares of e

That is,

Sum of squares, total = Sum of squares for regression + Sum of squares for residual     (26)

The ANOVA table is set up to test the hypothesis that the parameter β1 = 0. If there is more than one predictor variable, the hypothesis would be

H: β1 = β2 = ... = βk = 0