Introduction to Generalized Linear Models
Heather Turner
ESRC National Centre for Research Methods, UK
and Department of Statistics University of Warwick, UK
WU, 2008–04–22-24
This short course provides an overview of generalized linear models (GLMs).
We shall see that these models extend the linear modelling framework to variables that are not Normally distributed.
GLMs are most commonly used to model binary or count data, so we will focus on models for these types of data.
Exercises
Part II: Binary Data
Binary Data
Models for Binary Data
Model Selection
Model Evaluation
Exercises
Part III: Count Data
Count Data
Modelling Rates
Modelling Contingency Tables
Exercises
Part I
Introduction to Generalized Linear Models
Review of Linear Models
Structure
The General Linear Model
In a general linear model
$$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \epsilon_i$$
the response $y_i$, $i = 1, \dots, n$, is modelled by a linear function of explanatory variables $x_j$, $j = 1, \dots, p$, plus an error term.
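General and Linear
Here general refers to the dependence on potentially more than one explanatory variable, as opposed to the simple linear model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
The model is linear in the parameters, e.g.
$$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_i$$
$$y_i = \beta_0 + \gamma_1 \delta_1 x_1 + \exp(\beta_2) x_2 + \epsilon_i$$
but not e.g.
$$y_i = \beta_0 + \beta_1 x_1^{\beta_2} + \epsilon_i$$
$$y_i = \beta_0 \exp(\beta_1 x_1) + \epsilon_i$$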
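A minimal sketch of the "linear in the parameters" idea, using simulated data (the variable names and true coefficient values are assumptions made for illustration). A model that is quadratic in x1 is still linear in its coefficients, so it can be fitted with lm:

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
x1 <- runif(100, 0, 5)
y <- 1 + 2 * x1 - 0.5 * x1^2 + rnorm(100, sd = 0.3)

# Quadratic in x1, but linear in beta0, beta1 and beta2, so lm() applies
quadFit <- lm(y ~ x1 + I(x1^2))
coef(quadFit)
```

A model such as y = β0 exp(β1 x1) + ε cannot be written this way and would need a non-linear fitting routine instead.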
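Error structure
We assume that the errors $\epsilon_i$ are independent and identically distributed such that
$$E[\epsilon_i] = 0 \quad \text{and} \quad \mathrm{var}[\epsilon_i] = \sigma^2$$
Typically we assume
$$\epsilon_i \sim N(0, \sigma^2)$$
as a basis for inference, e.g. t-tests on parameters.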
Examples
Some Examples
[Figure: plot with axes abdomin and biceps, annotated with the fitted equation]
bodyfat = −14.59 + 0.7 * biceps − 0.9 * abdomin
[Figure: points plotted with the symbols 1, 2 and 3; axis label: operator]
Restrictions
Restrictions of Linear Models
Although a very useful framework, there are some situations where general linear models are not appropriate:
the range of Y is restricted (e.g. binary, count)
the variance of Y depends on the mean
Generalized linear models extend the general linear model framework to address both of these issues.
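Generalized Linear Models
Structure
Generalized Linear Models (GLMs)
A generalized linear model is made up of a linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$$
and two functions:
a link function $g$ that describes how the mean, $E(Y_i) = \mu_i$, depends on the linear predictor
$$g(\mu_i) = \eta_i$$
a variance function $V$ that describes how the variance, $\mathrm{var}(Y_i)$, depends on the mean
$$\mathrm{var}(Y_i) = \phi V(\mu_i)$$
where the dispersion parameter $\phi$ is a constant.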
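Normal General Linear Model as a Special Case
For the general linear model with $\epsilon_i \sim N(0, \sigma^2)$ we have the linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$$
the link function
$$g(\mu_i) = \mu_i$$
and the variance function
$$V(\mu_i) = 1$$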
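Modelling Poisson Data
Suppose
$$Y_i \sim \mathrm{Poisson}(\lambda_i)$$
Then
$$E(Y_i) = \lambda_i \qquad \mathrm{var}(Y_i) = \lambda_i$$
so the variance function is $V(\mu_i) = \mu_i$. The link function must map from $(0, \infty)$ to $(-\infty, \infty)$; a natural choice is $g(\mu_i) = \log(\mu_i)$.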
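As a quick numerical check of the Poisson mean-variance relationship, the following sketch simulates counts with an arbitrarily chosen rate (the value 4 and the sample size are assumptions for illustration) and compares the sample mean and variance:

```r
# Simulated Poisson counts (the rate 4 is an arbitrary illustrative choice)
set.seed(10)
lambda <- 4
y <- rpois(1e5, lambda)

# For Poisson data the mean and the variance are both equal to lambda,
# which is why the variance function is V(mu) = mu
c(mean = mean(y), variance = var(y))

# The log link maps the positive mean onto the whole real line
log(mean(y))
```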
Transformation vs GLM
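In some situations a response variable can be transformed to improve linearity and homogeneity of variance so that a general linear model can be applied.
This approach has some drawbacks:
the response variable has changed
the transformation must simultaneously improve linearity and homogeneity of variance
the transformation may not be defined on the boundaries of the sample space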
For example, a log transformation of the response gives a linear model for the mean of log Y, which may not always be appropriate. E.g. if Y is income perhaps we are really interested in the mean income of population subgroups, in which case it would be better to model E(Y) using a glm:
$$\log E(Y_i) = \beta_0 + \beta_1 x_1$$
with $V(\mu) = \mu$. This also avoids difficulties with $y = 0$.
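The difference between modelling the mean of log Y and modelling log E(Y) can be sketched on simulated data. Everything below (the gamma-distributed response, the coefficient values, the object names) is an assumption made for illustration; the quasi family is used here simply as one way to request a log link with V(µ) = µ for a continuous response.

```r
# Simulated positive response with E(y) = mu (hypothetical data)
set.seed(1)
x1 <- runif(2000)
mu <- exp(1 + 0.8 * x1)
y <- rgamma(2000, shape = 2, rate = 2 / mu)

# Linear model for the mean of log(y)
logFit <- lm(log(y) ~ x1)

# GLM for log E(y) with V(mu) = mu; y itself is not transformed
glmFit <- glm(y ~ x1, family = quasi(link = "log", variance = "mu"))

# The intercepts differ because E(log Y) is smaller than log E(Y)
rbind(logFit = coef(logFit), glmFit = coef(glmFit))
```

Only the GLM coefficients describe the mean of Y directly; back-transforming the lm fit gives a statement about the mean of log Y instead.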
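Exponential Family
Most of the commonly used statistical distributions, e.g. Normal, Binomial and Poisson, are members of the exponential family of distributions, whose densities can be written in the form
$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\}$$
where $\phi$ is the dispersion parameter and $\theta$ is the canonical parameter. It can be shown that
$$E(Y) = b'(\theta) = \mu \quad \text{and} \quad \mathrm{var}(Y) = \phi b''(\theta) = \phi V(\mu)$$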
For an exponential family response the mean is $\mu_i = b'(\theta_i)$, and the canonical link is defined as
$$g = (b')^{-1} \Rightarrow g(\mu_i) = \theta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$$
Canonical links lead to desirable statistical properties of the glm and hence tend to be used by default. However, there is no a priori reason why the systematic effects in the model should be additive on the scale given by this link.
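As a worked illustration (not taken from the original slides), the Poisson distribution can be put in exponential family form to read off its canonical link. Here $b$ and $c$ are the functions of the exponential family density, with the mean given by $b'(\theta)$:

```latex
\begin{align*}
f(y; \lambda) &= \frac{\lambda^{y} e^{-\lambda}}{y!}
              = \exp\{\, y \log\lambda - \lambda - \log y! \,\} \\
\theta &= \log\lambda, \qquad b(\theta) = e^{\theta}, \qquad \phi = 1, \qquad c(y, \phi) = -\log y! \\
\mu &= b'(\theta) = e^{\theta} = \lambda, \qquad V(\mu) = b''(\theta) = e^{\theta} = \mu \\
g &= (b')^{-1} \;\Rightarrow\; g(\mu) = \log\mu
\end{align*}
```

So the log link is the canonical link for Poisson data, and the variance function is $V(\mu) = \mu$.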
Estimation
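Estimation of the Model Parameters
A single algorithm can be used to estimate the parameters of an exponential family glm using maximum likelihood.
The log-likelihood for the sample $y_1, \dots, y_n$ is
$$l = \sum_{i=1}^{n} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)$$
The maximum likelihood estimates are obtained by solving the score equations
$$s(\beta_j) = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0$$
for the parameters $\beta_j$.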
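We assume that
$$\phi_i = \frac{\phi}{a_i}$$
where $\phi$ is a single dispersion parameter and $a_i$ are known prior weights; for example binomial proportions with known index $n_i$ have $\phi = 1$ and $a_i = n_i$.
The estimating equations are then
$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0$$
which do not depend on $\phi$ (which may be unknown).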
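A general method of solving score equations is the iterative algorithm known as Fisher's Method of Scoring (derived from a Taylor expansion of $s(\beta)$). In the $r$-th iteration, the new estimate $\beta^{(r+1)}$ is obtained from the previous estimate $\beta^{(r)}$ by
$$\beta^{(r+1)} = \beta^{(r)} + \left( E\left[ -H(\beta^{(r)}) \right] \right)^{-1} s(\beta^{(r)})$$
where $H$ is the Hessian matrix: the matrix of second derivatives of the log-likelihood.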
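The estimates can be found using an Iteratively (Re-)Weighted Least Squares algorithm:
1. Start with initial estimates $\mu_i^{(r)}$
2. Calculate working responses $z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)}) g'(\mu_i^{(r)})$ and working weights $w_i^{(r)} = a_i / \{ V(\mu_i^{(r)}) g'(\mu_i^{(r)})^2 \}$
3. Calculate $\beta^{(r+1)}$ by weighted least squares regression of $z^{(r)}$ on $X$ with weights $w^{(r)}$
4. Repeat 2 and 3 till convergence
For models with the canonical link, this is simply the Newton-Raphson method.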
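The iteration can be sketched in a few lines of R for a Poisson model with log link. This is an illustrative implementation on simulated data, not the code used inside glm; the starting values, tolerance and object names are assumptions. For this model g'(µ) = 1/µ and V(µ) = µ, so the working response is z = η + (y − µ)/µ and the working weight is w = µ (prior weights taken as 1).

```r
# Simulated Poisson data (hypothetical, for illustration only)
set.seed(1)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))
X <- cbind(1, x)                      # design matrix with intercept

mu <- y + 0.1                         # step 1: initial estimates of the means
beta <- c(0, 0)
for (r in 1:25) {
  eta <- log(mu)
  z <- eta + (y - mu) / mu            # step 2: working responses
  w <- mu                             # step 2: working weights 1 / (V(mu) g'(mu)^2)
  # step 3: weighted least squares, beta = (X'WX)^{-1} X'Wz
  beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  mu <- as.vector(exp(X %*% beta))
  if (converged) break                # step 4: stop once estimates settle
}

# Compare with the estimates from glm()
fit <- glm(y ~ x, family = poisson)
cbind(iwls = as.vector(beta), glm = coef(fit))
```

Because the log link is canonical for the Poisson family, each pass of this loop is also a Newton-Raphson step.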
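Standard Errors
The estimates $\hat\beta$ have the usual properties of maximum likelihood estimators. In particular, $\hat\beta$ is asymptotically
$$N(\beta, i^{-1})$$
where
$$i(\beta) = \phi^{-1} X^T W X$$
Standard errors for the $\beta_j$ may therefore be calculated as the square roots of the diagonal elements of
$$\widehat{\mathrm{cov}}(\hat\beta) = \phi (X^T \hat{W} X)^{-1}$$
If $\phi$ is unknown, an estimate is required.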
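A short sketch of this calculation on simulated Poisson data (the data and object names are assumptions for illustration). It rebuilds the covariance matrix φ(XᵀŴX)⁻¹ from the working weights stored in the fitted object and compares the result with the standard errors reported by summary:

```r
# Simulated Poisson data (hypothetical, for illustration only)
set.seed(3)
x <- runif(150)
y <- rpois(150, exp(0.2 + 1.0 * x))
fit <- glm(y ~ x, family = poisson)

X <- model.matrix(fit)
W <- fit$weights                       # working weights from the final IWLS iteration
phi <- summary(fit)$dispersion         # fixed at 1 for the poisson family
covBeta <- phi * solve(t(X) %*% (W * X))
se <- sqrt(diag(covBeta))

cbind(manual = se, summary = coef(summary(fit))[, "Std. Error"])
```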
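There are practical difficulties in estimating the dispersion $\phi$ by maximum likelihood. Therefore it is usually estimated by the method of moments. If $\beta$ were known, an unbiased estimate of $\phi = a_i \mathrm{var}(Y_i) / V(\mu_i)$ would be
$$\frac{1}{n} \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}$$
Allowing for the fact that $\beta$ must be estimated, we obtain
$$\frac{1}{n - p} \sum_{i=1}^{n} \frac{a_i (y_i - \hat\mu_i)^2}{V(\hat\mu_i)}$$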
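The moment estimate of the dispersion can be reproduced by hand. The sketch below uses simulated Gamma data with a log link (all values are assumptions for illustration); for the Gamma family V(µ) = µ², and the prior weights are 1.

```r
# Simulated Gamma data with true dispersion 1/shape = 0.2 (hypothetical)
set.seed(4)
x <- runif(200)
mu <- exp(1 + 0.5 * x)
y <- rgamma(200, shape = 5, rate = 5 / mu)
fit <- glm(y ~ x, family = Gamma(link = "log"))

# Pearson chi-squared statistic divided by n - p
muHat <- fitted(fit)
phiHat <- sum((y - muHat)^2 / muHat^2) / df.residual(fit)

c(manual = phiHat, summary = summary(fit)$dispersion)
```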
GLMs in R
glm Function
The glm Function
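Generalized linear models can be fitted in R using the glm function, which is similar to the lm function for fitting linear models.
The arguments to a glm call are as follows:
glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE,
    contrasts = NULL, ...)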
All specified variables must be in the workspace or in the data frame passed to the data argument.
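Other symbols that can be used in the formula include:
a:b for an interaction between a and b
a*b which expands to a + b + a:b
. for first order terms of all variables in data
- to exclude a term or terms
1 to include an intercept (included by default)
0 to exclude an intercept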
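The exponential family functions available in R are:
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")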
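A family object bundles the link and variance functions, and these can be inspected directly. The short sketch below evaluates a few of them at arbitrarily chosen values:

```r
pois <- poisson()
pois$link                 # name of the default link
pois$linkfun(10)          # g(mu) = log(mu)
pois$linkinv(0)           # inverse link: exp(eta)
pois$variance(3)          # V(mu) = mu

gaussian()$variance(3)    # V(mu) = 1 for the Normal model
binomial()$variance(0.5)  # V(mu) = mu * (1 - mu)
Gamma()$link              # default link for the Gamma family
```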
Extractor Functions
The glm function returns an object of class c("glm", "lm").
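Because the fitted object inherits from both classes, the standard extractor functions apply to it. A brief sketch on simulated binary data (the data and object names are assumptions for illustration):

```r
# Simulated binary response (hypothetical, for illustration only)
set.seed(2)
x <- runif(50)
y <- rbinom(50, size = 1, prob = plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

class(fit)                              # "glm" "lm"
coef(fit)                               # estimated coefficients
deviance(fit)                           # residual deviance
head(fitted(fit))                       # fitted means on the response scale
head(residuals(fit))                    # deviance residuals by default
head(predict(fit, type = "response"))   # same values as fitted(fit)
```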
Example with Normal Data
Example: Household Food Expenditure
Griffiths, Hill and Judge (1993) present a dataset on food expenditure for households that have three family members. We consider two variables, the logarithm of expenditure on food and the household income:
dat <- read.table("GHJ_food_income.txt", header = TRUE)
attach(dat)
plot(Food ~ Income, xlab = "Weekly Household Income ($)",
     ylab = "Weekly Household Expenditure on Food (Log $)")
It would seem that a simple linear model would fit the data well.
Summary of Fit Using lm
F-statistic: 19.94 on 1 and 38 DF, p-value: 6.951e-05
Summary of Fit Using glm
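The default family for glm is "gaussian" so the arguments of the call are unchanged.
A five-number summary of the deviance residuals is given. Since the response is assumed to be normally distributed these are the same as the residuals returned from lm.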
The estimated coefficients are unchanged.
(Dispersion parameter for gaussian family taken to be 0.07650739)
Partial t-tests test the significance of each coefficient in the presence of the others. The dispersion parameter for the gaussian family is equal to the residual variance.
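These two statements can be checked on any Normal-response data. The sketch below uses simulated stand-in data rather than the food expenditure file, so the variable names and values are assumptions for illustration:

```r
# Simulated stand-in for a Normal-response data set (hypothetical values)
set.seed(123)
income <- runif(40, 20, 120)
food <- 3 + 0.01 * income + rnorm(40, sd = 0.3)

lmFit <- lm(food ~ income)
glmFit <- glm(food ~ income)            # family = gaussian by default

# Same coefficient estimates from both functions
all.equal(coef(lmFit), coef(glmFit))

# Dispersion parameter equals the residual variance from lm
all.equal(summary(glmFit)$dispersion, summary(lmFit)$sigma^2)
```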
Different model summaries are reported for GLMs. First we have the deviance of two models:
The first two refer to the null model, in which all of the terms are excluded, except the intercept if present. The degrees of freedom for this model are the number of data points n minus 1 if an intercept is fitted.
The second two refer to the fitted model, which has n − p degrees of freedom, where p is the number of parameters, including any intercept.
In the saturated model, the number of parameters is equal to the number of observations, so the fitted values equal the observed values.
For linear regression with Normal data, the deviance is equal to the residual sum of squares.
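A small sketch on simulated Normal data (hypothetical values and object names) shows where the null and residual deviances and their degrees of freedom are stored, and confirms that the residual deviance matches the residual sum of squares from lm:

```r
# Simulated Normal data (hypothetical, for illustration only)
set.seed(123)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.5)

fit <- glm(y ~ x)
nullFit <- glm(y ~ 1)                   # intercept-only (null) model

c(null = fit$null.deviance, residual = deviance(fit))
c(df.null = fit$df.null, df.residual = fit$df.residual)   # n - 1 and n - p

# Null deviance is the deviance of the intercept-only model
all.equal(fit$null.deviance, deviance(nullFit))

# With Normal data the deviance is the residual sum of squares
all.equal(deviance(fit), sum(residuals(lm(y ~ x))^2))
```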
Akaike Information Criterion (AIC)
Finally we have:
AIC: 14.649
Number of Fisher Scoring iterations: 2
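The AIC is a measure of fit that penalizes for the number of parameters p:
$$\mathrm{AIC} = -2 l_{mod} + 2p$$
where $l_{mod}$ is the maximized log-likelihood of the fitted model.
Smaller values indicate better fit and thus the AIC can be used to compare models (not necessarily nested).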
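As an illustration, the sketch below compares two non-nested Poisson models on simulated data (the data, predictors and object names are assumptions) and recomputes the AIC of one of them from its log-likelihood:

```r
# Simulated counts that depend on x1 but not on x2 (hypothetical data)
set.seed(7)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- rpois(100, exp(0.3 + 0.6 * x1))

fit1 <- glm(y ~ x1, family = poisson)
fit2 <- glm(y ~ x2, family = poisson)   # not nested within fit1

AIC(fit1, fit2)                         # smaller is better

# AIC = -2 * log-likelihood + 2 * number of parameters
aicByHand <- -2 * as.numeric(logLik(fit1)) + 2 * length(coef(fit1))
all.equal(AIC(fit1), aicByHand)
```

Both models use the same error distribution, so their AIC values are directly comparable.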
Deviance residuals are the default used in R, since they reflect the same criterion as used in the fitting.
For example we can plot the deviance residuals against the fitted values (on the response scale) as follows:
plot(residuals(foodGLM) ~ fitted(foodGLM),
xlab = expression(hat(y)[i]),
ylab = expression(r[i]))
abline(0, 0, lty = 2)
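The plot function gives the usual choice of residual plots, based on the deviance residuals. By default:
residuals vs. fitted values
Normal Q-Q plot of deviance residuals standardised to unit variance
scale-location plot of standardised deviance residuals
standardised deviance residuals vs. leverage with Cook's distance contours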
Exercises
data in the help file
variable
2. Produce appropriate plots to examine the bivariate relationships of wages with the other variables in the data set. Which variables
variables. Look at a summary of the fit. Do the results appear to
residuals from the fit. Which modelling assumptions appear to be invalid?
response variable. Confirm that the residuals are more consistent with the modelling assumptions. Can any variables be dropped from the model?
Investigate whether two-way and three-way interactions should be added to the model.
log(Y) is normally distributed. If $X = \log(Y) \sim N(\mu, \sigma^2)$, then $Y$ has a log-Normal distribution with parameters $\mu$ and $\sigma^2$.
predictor variables as your chosen model in question 4. Look at a summary of the fit and compare with the log-Normal model. Are the inferences the same? Are the parameter estimates similar? Note that t statistics rather than z statistics are given for the parameters since the dispersion φ has had to be estimated.
6. (Extra time!) Go back and fit your chosen model in question 4
Gamma model? Note that the AIC values are not comparable here: constants in the likelihood functions are dropped when computing the AIC, so these values are only comparable when fitting models with the same error distribution.