CHAPTER 10
Scatterplot Smoothers and Generalised Additive Models: The Men’s Olympic 1500m, Air Pollution in the USA, and
Risk Factors for Kyphosis
10.1 Introduction
The modern Olympics began in 1896 in Greece and have been held every four years since, apart from interruptions due to the two world wars. On the track the blue ribbon event has always been the 1500m for men, since competitors that want to win must have a unique combination of speed, strength and stamina combined with an acute tactical awareness. For the spectator the event lasts long enough to be interesting (unlike, say, the 100m dash) but not so long as to become boring (as do most 10,000m races). The event has been witness to some of the most dramatic scenes in Olympic history; who can forget Herb Elliott winning by a street in 1960, breaking the world record and continuing his sequence of never being beaten in a 1500m or mile race in his career? And remembering the joy and relief etched on the face of Seb Coe when winning and beating his arch rival Steve Ovett still brings a tear to the eye of many of us.
The complete record of winners of the men's 1500m from 1896 to 2004 is given in Table 10.1. Can we use these winning times as the basis of a suitable statistical model that will enable us to predict the winning times for future Olympics?
Table 10.1: men1500m data. Olympic Games 1896 to 2004 winners of the men's 1500m.

Year Venue Winner Country Time (s)
1900 Paris C Bennett Great Britain 246.20
1912 Stockholm A Jackson Great Britain 236.80
1920 Antwerp A Hill Great Britain 241.80
1932 Los Angeles L Beccali Italy 231.20
1936 Berlin J Lovelock New Zealand 227.80
1952 Helsinki J Barthel Luxembourg 225.10
1956 Melbourne R Delaney Ireland 221.20
1976 Montreal J Walker New Zealand 219.17
1980 Moscow S Coe Great Britain 218.40
1984 Los Angeles S Coe Great Britain 212.53
2004 Athens H El Guerrouj Morocco 214.18
The data in Table 10.2 relate to air pollution in 41 US cities as reported by Sokal and Rohlf (1981). The annual mean concentration of sulphur dioxide, in micrograms per cubic metre, is a measure of the air pollution of the city. The question of interest here is what aspects of climate and human ecology, as measured by the other six variables in the table, determine pollution. Thus, we are interested in a regression model from which we can infer the relationship between each of the explanatory variables and the response (SO2 content). Details of the seven measurements are:
SO2: SO2 content of air in micrograms per cubic metre,
temp: average annual temperature in Fahrenheit,
manu: number of manufacturing enterprises employing 20 or more workers,
popul: population size (1970 census), in thousands,
wind: average annual wind speed in miles per hour,
precip: average annual precipitation in inches,
predays: average number of days with precipitation per year.
Table 10.2: USairpollution data. Air pollution in 41 US cities.
SO2 temp manu popul wind precip predays
Source: From Sokal, R. R., Rohlf, F. J., Biometry, W. H. Freeman, San Francisco, USA, 1981. With permission.
The final data set to be considered in this chapter is taken from Hastie and Tibshirani (1990). The data are shown in Table 10.3 and involve observations on 81 children undergoing corrective surgery of the spine. There are a number of risk factors for kyphosis, or outward curvature of the spine in excess of 40 degrees from the vertical, following surgery; these are age in months (Age), the starting vertebral level of the surgery (Start) and the number of vertebrae involved (Number). Here we would like to model the data to determine which risk factors are of most importance for the occurrence of kyphosis.
Table 10.3: kyphosis data (package rpart). Children who have had corrective spinal surgery.
Kyphosis Age Number Start Kyphosis Age Number Start
10.2 Scatterplot Smoothers and Generalised Additive Models
Each of the three data sets described in the Introduction appears to be a perfect candidate to be analysed by one of the methods described in earlier chapters. Simple linear regression could, for example, be applied to the 1500m times and multiple linear regression to the pollution data; the kyphosis data could be analysed using logistic regression. But instead of assuming we know the linear functional form for a regression model, we might consider an alternative approach in which the appropriate functional form is estimated from the data. How is this achieved? The secret is to replace the global estimates from the regression models considered in earlier chapters with local estimates, in which the statistical dependency between two variables is described, not with a single parameter such as a regression coefficient, but with a series of local estimates. For example, a regression might be estimated between the two variables for some restricted range of values of each variable and the process repeated across the range of each variable. The series of local estimates is then aggregated by drawing a line to summarise the relationship between the two variables. In this way no particular functional form is imposed on the relationship. Such an approach is particularly useful when
• the relationship between the variables is expected to be of a complex form, not easily fitted by standard linear or nonlinear models;
• there is no a priori reason for using a particular model;
• we would like the data themselves to suggest the appropriate functional form.
The starting point for a local estimation approach to fitting relationships between variables is scatterplot smoothers, which are described in the next subsection.
10.2.1 Scatterplot Smoothers
The scatterplot is an excellent first exploratory graph to study the dependence of two variables, and all readers will be familiar with plotting the outcome of a simple linear regression fit onto the graph to help in a better understanding of the pattern of dependence. But many readers will probably be less familiar with some non-parametric alternatives to linear regression fits that may be more useful than the latter in many situations. These alternatives are labelled non-parametric since, unlike parametric techniques such as linear regression, they do not summarise the relationship between two variables with a parameter such as a regression or correlation coefficient. Instead, non-parametric 'smoothers' summarise the relationship between two variables with a line drawing. The simplest of this collection of non-parametric smoothers is a locally weighted regression or lowess fit, first suggested by Cleveland (1979). In essence this approach assumes that the independent variable xi and a response yi are related by
yi = g(xi) + εi,   i = 1, …, n

where g is a locally defined p-degree polynomial function in the predictor variable, xi, and the εi are random variables with mean zero and constant scale. Values ŷi = g(xi) are used to estimate the yi at each xi and are found by fitting the polynomials using weighted least squares, with large weights for points near to xi and small weights otherwise. Two parameters control the shape of a lowess curve; the first is a smoothing parameter, α (often known as the span, the width of the local neighbourhood), with larger values leading to smoother curves – typical values are 0.25 to 1. In essence the span decides the trade-off between reduction in bias and increase in variance. If the span is too large, the non-parametric regression estimate will be biased, but if the span is too small, the estimate will be overfitted, with inflated variance. Keele (2008) gives an extended discussion of the influence of the choice of span on the non-parametric regression. The second parameter, λ, is the degree of the polynomials that are fitted by the method; λ can be 0, 1, or 2. In any specific application, the choice of the two parameters must be based on a combination of judgement and of trial and error. Residual plots may be helpful in judging a particular combination of values.
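The effect of the span can be seen in a few lines of base R; the simulated data and the two span values below are illustrative assumptions, not part of the examples analysed later in the chapter:

```r
## Simulated nonlinear data: a sine signal plus Gaussian noise
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

## In lowess(), the argument f plays the role of the span alpha:
## the fraction of the data used in each local neighbourhood
fit_smooth <- lowess(x, y, f = 2/3)   # large span: smooth, possibly biased
fit_wiggly <- lowess(x, y, f = 0.1)   # small span: follows the noise

plot(x, y)
lines(fit_smooth, lwd = 2)
lines(fit_wiggly, lty = 2)
```

The dashed curve obtained with the small span is visibly rougher than the solid one, illustrating the bias-variance trade-off described above.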
An alternative smoother that can often be usefully applied to bivariate data is some form of spline function. (A spline is a term for a flexible strip of metal or rubber used by a draftsman to draw curves.) Spline functions are polynomials within intervals of the x-variable that are smoothly connected across different values of x. Figure 10.1, for example, shows a linear spline function, i.e., a piecewise linear function, of the form

f(x) = β0 + β1x + β2(x − a)+ + β3(x − b)+ + β4(x − c)+

where (u)+ = u for u > 0 and zero otherwise. The interval endpoints, a, b, and c, are called knots. The number of knots can vary according to the amount of data available for fitting the function.
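Such a linear spline can be fitted by ordinary least squares once the truncated power terms (x − a)+ are constructed as extra regressors. The sketch below does this for the knots of Figure 10.1; the simulated data and true coefficients are assumptions made purely for illustration:

```r
## Simulated data with a change of slope at x = 3
set.seed(2)
x <- runif(200, 0, 6)
y <- 1 + 0.5 * x - 1.5 * pmax(x - 3, 0) + rnorm(200, sd = 0.2)

## Truncated power basis: one (x - k)+ column per knot k = a, b, c
knots <- c(1, 3, 5)
basis <- sapply(knots, function(k) pmax(x - k, 0))
colnames(basis) <- paste0("k", knots)

## Least-squares fit of the piecewise linear function
fit <- lm(y ~ x + basis)
coef(fit)   # beta0, beta1 and one slope change per knot
```

Each knot coefficient estimates the change in slope at that knot; here only the coefficient for the knot at 3 should be substantially different from zero.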
Figure 10.1 A linear spline function with knots at a = 1, b = 3 and c = 5.
The linear spline is simple and can approximate some relationships, but it is not smooth and so will not fit highly curved functions well. The problem is overcome by using smoothly connected piecewise polynomials – in particular, cubics, which have been found to have nice properties with good ability to fit a variety of complex relationships. The result is a cubic spline. Again we wish to fit a smooth curve, g(x), that summarises the dependence of y on x. A natural first attempt might be to try to determine g by least squares as the curve that minimises
Σ_{i=1}^n (yi − g(xi))^2    (10.1)

But this would simply result in a very wiggly curve interpolating the observations. Instead of (10.1), the criterion used to determine g is
Σ_{i=1}^n (yi − g(xi))^2 + λ ∫ g′′(x)^2 dx    (10.2)

where g′′(x) represents the second derivative of g(x) with respect to x. Although written formally this criterion looks a little formidable, it is really nothing more than an effort to govern the trade-off between the goodness-of-fit of the data (as measured by Σ (yi − g(xi))^2) and the 'wiggliness' or departure from linearity of g (measured by ∫ g′′(x)^2 dx); for a linear function, this part of (10.2) would be zero. The parameter λ governs the smoothness of g, with larger values resulting in a smoother curve.
The cubic spline which minimises (10.2) is a series of cubic polynomials joined at the unique observed values of the explanatory variable, xi (for more details, see Keele, 2008).
The 'effective number of parameters' (analogous to the number of parameters in a parametric fit) or degrees of freedom of a cubic spline smoother is generally used to specify its smoothness, rather than λ directly. A numerical search is then used to determine the value of λ corresponding to the required degrees of freedom. Roughly, the complexity of a cubic spline is about the same as that of a polynomial of degree one less than the degrees of freedom (see Keele, 2008, for details). But the cubic spline smoother 'spreads out' its parameters in a more even way and hence is much more flexible than polynomial regression. The spline smoother does have a number of technical advantages over the lowess smoother, such as providing the best mean square error and avoiding overfitting that can cause smoothers to display unimportant variation between x and y that is of no real interest. But in practice the lowess smoother and the cubic spline smoother will give very similar results in many examples.
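In base R a cubic smoothing spline of this kind is available via smooth.spline(), which, as discussed above, lets the smoothness be specified through the effective degrees of freedom. The data and the choice of six degrees of freedom below are illustrative assumptions:

```r
## Simulated curved data
set.seed(3)
x <- seq(0, 10, length.out = 150)
y <- cos(x) + rnorm(150, sd = 0.25)

## Smoothness given as effective degrees of freedom rather than lambda;
## smooth.spline() searches numerically for the corresponding lambda
cs <- smooth.spline(x, y, df = 6)
lw <- lowess(x, y, f = 0.4)

plot(x, y)
lines(predict(cs, x), lwd = 2)   # cubic spline smoother
lines(lw, lty = 2)               # lowess, for comparison
```

On data such as these the two smoothers typically produce nearly indistinguishable curves, in line with the remark above.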
10.2.2 Generalised Additive Models
The scatterplot smoothers described above are the basis of a more general, semi-parametric approach to modelling situations where there is more than a single explanatory variable, such as the air pollution data in Table 10.2 and the kyphosis data in Table 10.3. These models are usually called generalised additive models (GAMs) and allow the investigator to model the relationship between the response variable and some of the explanatory variables using the non-parametric lowess or cubic spline smoothers, with the relationship for the other explanatory variables being estimated in the usual parametric fashion.
So, returning for a moment to the multiple linear regression model described in Chapter 6, in which there is a dependent variable, y, and a set of explanatory variables, x1, …, xq, the model assumed is

y = β0 + Σ_{j=1}^q βj xj + ε
Additive models replace the linear function, βj xj, by a smooth non-parametric function, g, to give the model

y = β0 + Σ_{j=1}^q gj(xj) + ε    (10.3)

where gj can be one of the scatterplot smoothers described in the previous subsection or, if the investigator chooses, it can also be a linear function for particular explanatory variables.
A generalised additive model arises from (10.3) in the same way as a generalised linear model arises from a multiple regression model (see Chapter 7), namely that some function of the expectation of the response variable is now modelled by a sum of non-parametric and parametric functions. So, for example, the logistic additive model with binary response variable y is

logit(π) = β0 + Σ_{j=1}^q gj(xj)

where π is the probability that the response variable takes the value one.
Fitting a generalised additive model involves either iteratively weighted least squares, an optimisation algorithm similar to the algorithm used to fit generalised linear models, or what is known as a backfitting algorithm. The smooth functions gj are fitted one at a time, by taking the residuals

y − Σ_{k≠j} gk(xk)

and fitting them against xj using one of the scatterplot smoothers described previously. The process is repeated until it converges. Linear terms in the model are fitted by least squares. The mgcv package fits generalised additive models using the iteratively weighted least squares algorithm, which in this case has the advantage that inference procedures, such as confidence intervals, can be derived more easily. Full details are given in Hastie and Tibshirani (1990), Wood (2006), and Keele (2008).
Various tests are available to assess the non-linear contributions of the fitted smoothers, and generalised additive models can be compared with, say, linear models fitted to the same data by means of an F-test on the residual sums of squares of the competing models. In this process the fitted smooth curve is assigned an estimated equivalent number of degrees of freedom. However, such a procedure has to be used with care. For full details, again, see Wood (2006) and Keele (2008).
Two alternative approaches to the variable selection and model choice problem are helpful. As always, a graphical inspection of the model properties, ideally guided by subject-matter knowledge, helps to identify the most important aspects of the fitted regression function. A more formal approach is to fit the model using algorithms that, implicitly or explicitly, have nice variable selection properties, one of which is mentioned in the following section.
10.2.3 Variable Selection and Model Choice
Quantifying the influence of covariates on the response variable in generalised additive models does not merely relate to the problem of estimating regression coefficients but more generally calls for careful implementation of variable selection (determination of the relevant subset of covariates to enter the model) and model choice (specifying the particular form of the influence of a variable). The latter task requires choosing between linear and nonlinear modelling of covariate effects. While variable selection and model choice issues are already complicated in linear models (see Chapter 6) and generalised linear models (see Chapter 7) and still receive considerable attention in the statistical literature, they become even more challenging in generalised additive models. Here, variable selection and model choice need to provide an answer to the complicated question: should a continuous covariate be included in the model at all and, if so, as a linear effect or as a flexible, smooth effect? Methods to deal with this problem are currently being actively researched. Two general approaches can be distinguished: one can fit models using a target function incorporating a penalty term which increases for increasingly complex models (similar to (10.2)), or one can iteratively fit simple, univariate models which sum to a more complex generalised additive model. The latter approach is called boosting and requires a careful determination of the stopping criterion for the iterative model fitting algorithm. The technical details are far too complex to be sketched here, and we refer the interested reader to the review paper by Bühlmann and Hothorn (2007).
10.3 Analysis Using R
10.3.1 Olympic 1500m Times
To begin we will construct a scatterplot of winning time against the year the Games were held. The R code and the resulting plot are shown in Figure 10.2. There is a very clear downward trend in the times over the years and, in addition, there is a very clear outlier, namely the winning time for 1896. We shall remove this time from the data set and concentrate on the remaining times. First we will fit a simple linear regression to the data and plot the fit onto the scatterplot. The code and the resulting plot are shown in Figure 10.3. Clearly the linear regression model captures in general terms the downward trend in the times. Now we can add the fits given by the lowess smoother and by a cubic spline smoother; the resulting graph and the extra R code needed are shown in Figure 10.4.
Both non-parametric fits suggest some distinct departure from linearity and clearly point to a quadratic model being more sensible than a linear model here. And fitting a parametric model that includes both a linear and a quadratic effect for year gives a prediction curve very similar to the non-parametric curves; see Figure 10.5.
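The quadratic fit can be sketched using only the winning times reproduced in the excerpt of Table 10.1 above; since this is a subset of the full men1500m data, the estimates below are illustrative rather than the values reported in the chapter:

```r
## Winning times (seconds) for the Games listed in the Table 10.1 excerpt
year <- c(1900, 1912, 1920, 1932, 1936, 1952, 1956, 1976, 1980, 1984, 2004)
time <- c(246.20, 236.80, 241.80, 231.20, 227.80, 225.10, 221.20,
          219.17, 218.40, 212.53, 214.18)

m_lin  <- lm(time ~ year)              # simple linear trend
m_quad <- lm(time ~ year + I(year^2))  # linear plus quadratic effect

## predicted winning times for hypothetical future Games
pred <- predict(m_quad, newdata = data.frame(year = c(2008, 2012)))
```

The quadratic term allows the fitted curve to flatten in recent decades, which is what the lowess and spline smoothers suggest.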
Here use of the non-parametric smoothers has effectively diagnosed our