Second edition 2007
Third edition 2012
Fourth edition 2018
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
NetCourseNow is a trademark of StataCorp LLC.
LaTeX2e is a trademark of the American Mathematical Society.
In all editions of this text, this dedication page was written as the final deliverable to the editor—after all the work and all the requested changes and additions had been addressed. In every previous edition, Joe and I co-dedicated this book to our wives and children. This time, he is very proud of our collaboration, but even more proud of his family: Cheryl, Heather, Michael, and Mitchell.
5.1 Pearson residuals obtained from linear model
5.2 Normal scores versus sorted Pearson residuals obtained from linear model
5.3 Pearson residuals versus kilocalories; Pearson residuals obtained from linear model
5.4 Pearson residuals obtained from log-Gaussian model (two outliers removed)
5.5 Pearson residuals versus fitted values from log-Gaussian model (two outliers removed)
5.6 Pearson residuals from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.7 Pearson residuals versus fitted values from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.8 Normal scores versus sorted Pearson residuals obtained from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.9 Pearson residuals versus kilocalories; Pearson residuals obtained from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
10.1 Probit and logit functions
10.2 Predicted probabilities for probit and logit link functions in grouped binary models. The observed (sample) proportions are included as well
10.3 Complementary log-log and log-log functions
10.4 Probit, logit, and identity functions
10.5 Observed proportion of carrot fly damage for each treatment (see table 10.3)
20.2 Additional built-in support for the bayes prefix
A.3 First derivatives of link functions
A.4 First derivatives of inverse link functions
A.5 Second derivatives of link functions
A.6 Second derivatives of inverse link functions
5.5 IRLS algorithm for log-Gaussian models using OIM
10.2 IRLS algorithm for binary probit regression using OIM
10.3 IRLS algorithm for binary clog-log regression using EIM
10.4 IRLS algorithm for binary clog-log regression using OIM
We have added several new models to the discussion of extended generalized linear models (GLMs). We have included new software and discussion of extensions to negative binomial regression because of Waring and Famoye. We have also added discussion of heaped data and bias-corrected GLMs because of Firth. There are two new chapters on multivariate outcomes and Bayes GLMs. In addition, we have expanded the clustered data discussion to cover more of the commands available in Stata.
We now include even more examples using synthetically created models to illustrate estimation results, and we illustrate to readers how to construct synthetic Monte Carlo models for binomial and major count models. Code for creating synthetic Poisson, negative binomial, zero-inflated, hurdle, and finite mixture models is provided and further explained. We have enhanced discussion of marginal effects and discrete change for GLMs.
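The synthetic-data code in the book is written in Stata. As a language-neutral sketch of the same Monte Carlo idea, the following Python fragment fixes known coefficients, builds a log-link linear predictor, and draws Poisson outcomes from the implied means; the function names, coefficient values, and sampler are our own illustrative choices, not the book's code.

```python
import math
import random

def rpoisson(mu, rng):
    """Draw one Poisson(mu) variate via Knuth's multiplication
    method (adequate for the small means used in illustrations)."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def synth_poisson(n, b0, b1, seed=1):
    """Synthesize (x, y) pairs from a log-link Poisson model:
    y_i ~ Poisson(exp(b0 + b1 * x_i))."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]                      # covariate in (0, 1)
    y = [rpoisson(math.exp(b0 + b1 * xi), rng) for xi in x]   # Poisson outcomes
    return x, y
```

A model fit to such data should recover estimates near the chosen (b0, b1), which is precisely the point of the Monte Carlo exercise.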
This fourth edition of Generalized Linear Models and Extensions is written for the active researcher as well as for the theoretical statistician. Our goal has been to clarify the nature and scope of GLMs and to demonstrate how all the families, links, and variations of GLMs fit together in an understandable whole. In a step-by-step manner, we detail the foundations and provide working algorithms that readers can use to construct and better understand models that they wish to develop. In a sense, we offer readers a workbook or handbook of how to deal with data using GLM and GLM extensions.
This text is intended as a textbook on GLMs and as a handbook of advice for researchers. We continue to use this book as the required text for a web-based short course through Statistics.com (also known as the Institute for Statistical Education); see http://www.statistics.com. The students of this six-week course include university professors and active researchers from hospitals, government agencies, research institutes, educational concerns, and other institutions across the world. This latest edition reflects the experiences we have had in communicating to our readers and students the relevant materials over the past decade.
Many people have contributed to the ideas presented in the new edition of this book, including Norman Breslow, Berwin Turlach, Gordon Johnston, Thomas Lumley, Bill Sribney, Vince Wiggins, Mario Cleves, William Greene, Andrew Robinson, Heather Presnal, and others. Specifically, for this edition, we thank Tammy Cummings, Chelsea Deroche, Xinling Xu, Roy Bower, Julie Royer, James Hussey, Alex McLain, Rebecca Wardrop, Gelareh Rahimi, Michael G. Smith, Marco Geraci, Bo Cai, and Feifei Xiao.
As always, we thank William Gould, president of StataCorp, for his encouragement in this project. His statistical computing expertise and his contributions to statistical modeling have had a deep impact on this book.
We are grateful to StataCorp's editorial staff for their equanimity in reading and editing our manuscript, especially to Patricia Branton and Lisa Gilmore for their insightful and patient contributions in this area. Finally, we thank Kristin MacDonald and Isabel Canette-Fernandez, Stata statisticians at StataCorp, for their expert assistance on various programming issues, and Nikolay Balov, Senior Statistician and Software Developer at StataCorp, for his helpful assistance with chapter 20 on Bayesian GLMs. We would also like to thank Rose Medeiros, Senior Statistician at StataCorp, for her assistance in the final passes of this edition.
Stata Press allowed us to dictate some of the style of this text. In writing this material in other forms for short courses, we have always included equation numbers for all equations rather than only for those equations mentioned in the text. Although this is not the standard editorial style for textbooks, we enjoy the benefits of students being able to communicate questions and comments more easily (and efficiently). We hope that readers find this practice as beneficial as our short-course participants have found it.
Errata, datasets, and supporting Stata programs (do-files and ado-files) may be found at the publisher's site http://www.stata-press.com/books/generalized-linear-models-and-extensions/. We also maintain these materials on the author sites at http://www.thirdwaystat.com/jameshardin/ and at https://works.bepress.com/joseph_hilbe/. We are very pleased to be able to produce this newest edition. Working on this text from the first edition in 2001
James W. Hardin
Joseph M. Hilbe
March 2018
Introduction
In updating this text, our primary goal is to convey the practice of analyzing data via generalized linear models to researchers across a broad spectrum of scientific fields. We lay out the framework used for describing various aspects of data and for communicating tools for data analysis. This initial part of the text contains no examples. Rather, we focus on the lexicon of generalized linear models used in later chapters. These later chapters include examples from fields such as biostatistics, economics, and survival analysis.
In developing analysis tools, we illustrate techniques via their genesis in estimation algorithms. We believe that motivating the discussion through the estimation algorithms clarifies the origin and usefulness of all generalized linear models. Instead of detailed theoretical exposition, we refer to texts and papers that present such material so that we may focus our detailed presentations on the algorithms and their justification. Our detailed presentations are mostly algebraic; we have minimized matrix notation whenever possible.
We often present illustrations of models using data that we synthesize. Although it is preferable to use real data to illustrate interpretation in context, there is a distinct advantage to examples using simulated data. The advantage of worked examples relying on data synthesis is that the data-generating process offers yet another glimpse into the associations between variables and outcomes that are to be captured in the procedure. The associations are thus seen from both the results of the model and the origins of data generation.
We wrote this text for researchers who want to understand the scope and application of generalized linear models while being introduced to the underlying theory. For brevity's sake, we use the acronym GLM to refer to the generalized linear model, but we acknowledge that GLM has been used elsewhere as an acronym for the general linear model. The latter usage, of course, refers to the area of statistical modeling based solely on the normal or Gaussian probability distribution.
We take GLM to be the generalization of the general, because that is precisely what GLMs are. They are the result of extending ordinary least-squares (OLS) regression, or the normal model, to a model that is appropriate for a variety of response distributions, specifically to those distributions that compose the single-parameter exponential family of distributions. We examine exactly how this extension is accomplished. We also aim to provide the reader with a firm understanding of how GLMs are evaluated and when their use is appropriate. We even advance a bit beyond the traditional GLM and give the reader a look at how GLMs can be extended to model certain types of data that do not fit exactly within the GLM framework.
Nearly every text that addresses a statistical topic uses one or more statistical computing packages to calculate and display results. We use Stata exclusively, though we do refer occasionally to other software packages—especially when it is important to highlight differences.
Some specific statistical models that make up GLMs are often found as standalone software modules, typically fit using maximum likelihood methods based on quantities from model-specific derivations. Stata has several such commands for specific GLMs, including poisson, logistic, and regress. Some of these procedures were included in the Stata package from its first version. More models have been addressed through commands written by users of Stata's programming language, leading to the creation of highly complex statistical models. Some of these community-contributed commands have since been incorporated into the official Stata package. We highlight these commands and illustrate how to fit models in the absence of a packaged command; see especially chapter 14.
StataCorp's continued updates to the command
Readers of technical books often need to know about prerequisites, especially how much math and statistics background is required. To gain full advantage from this text and follow its every statement and algorithm, you should have an understanding equal to a two-semester calculus-based course on statistical theory. Without a background in statistical theory, the reader can accept the presentation of the theoretical underpinnings and follow the (mostly) algebraic derivations that do not require more than a mastery of simple derivatives. We assume prior knowledge of multiple regression, but no other specialized knowledge is required.
We believe that GLMs are best understood if their computational basis is clear. Hence, we begin our exposition with an explanation of the foundations and computation of GLMs; there are two major methodologies for developing algorithms. We then show how simple changes to the base algorithms lead to different GLM families, links, and even further extensions. In short, we attempt to lay the GLM open to inspection and to make every part of it as clear as possible. In this fashion, the reader can understand exactly how and why GLM algorithms can be used, as well as altered, to better model a desired dataset.
Perhaps more than any other text in this area, we alternately examine two major computational GLM algorithms and their modifications: iteratively reweighted least squares (IRLS) and Newton–Raphson. Models estimated via IRLS extend naturally to generalized estimating equations (GEEs). On the other hand, truncated models that do not fit neatly into the exponential family of distributions are modeled using Newton–Raphson methods—and for this, too, we show why. Again, focusing on the details of calculation should help the reader understand both the scope and the limits of a particular model.
Whenever possible, we present the log likelihood for the model under discussion. In writing the log likelihood, we include offsets so that interested programmers can see how those elements enter estimation. In fact, we attempt to offer programmers the ability to understand and write their own working GLMs, plus many useful extensions. As programmers ourselves, we believe that there is value in such a presentation; we would have much enjoyed having it at our fingertips when we first entered this statistical domain.
We use L to denote the likelihood and the script ℒ to denote the log likelihood. We use X to denote the design matrix of independent (explanatory) variables. When appropriate, we use boldface type to emphasize that we are referring to a matrix; a lowercase letter with a subscript, xᵢ, will refer to the ith row from the X matrix.

We use y to denote the dependent (response) variable and refer to the vector β as the coefficients of the design matrix. We use β̂ when we wish to discuss or emphasize the fitted coefficients. Throughout the text, we discuss the role of the (vector) linear predictor η = Xβ. In generalizing this concept, we also refer to the augmented (by an offset) version of the linear predictor.

Finally, we use the notation E(y) to refer to the expectation of a random variable and the notation V(y) to refer to the variance of a random variable. We describe other notational conventions at the time of their first use.
A common question regarding texts concerns their focus. Is the text applied or theoretical? Our text is both. However, we would argue that it is basically applied. We show enough technical details for the theoretician to understand the underlying basis of GLMs. However, we believe that understanding the use and limitations of a GLM includes understanding its estimation algorithm. For some, dealing with formulas and algorithms appears thoroughly theoretical. We believe that it aids in understanding the scope and limits of proper application. Perhaps we can call the text a bit of both and not worry about classification. In any case, for those who fear formulas, each formula and algorithm is thoroughly explained. We hope that by book's end the formulas and algorithms will seem simple and meaningful. For completeness, we give the reader references to texts that discuss more advanced topics and theory.
Part I of the text deals with the basic foundations of GLM. We detail the various components of GLM, including various family, link, variance, deviance, and log-likelihood functions. We also provide a thorough background and detailed particulars of both the Newton–Raphson and iteratively reweighted least-squares algorithms. The chapters that follow highlight this discussion, which describes the framework through which the models of interest arise.
We also give the reader an overview of GLM residuals, introducing some that are not widely known, but that nevertheless can be extremely useful for analyzing a given model's worth. We discuss the general notion of goodness of fit and provide a framework through which you can derive more extensions to GLM. We conclude this part with discussion and illustration of simulation and data synthesis.
We often advise participants only interested in the application and interpretation of models to skip this first part of the book. Even those interested in the theoretical underpinnings will find that this first part of the book can serve more as an appendix. That is, the information in this part often turns out to be most useful in subsequent readings of the material.
Part II addresses the continuous family of distributions, including the Gaussian, gamma, inverse Gaussian, and power families. We derive the related formulas and relevant algorithms for each family and then discuss the ancillary or scale parameters appropriate to each model. We also examine noncanonical links and generalizations to the basic model. Finally, we give examples, showing how a given dataset may be analyzed using each family and link. We give examples dealing with model application, including discussion of the appropriate criteria for the analysis of fit. We have expanded the number of examples in this new edition to highlight both model fitting and assessment.
Part III addresses binomial response models. It includes exposition of the general binomial model and of the various links. Major links described include the canonical logit, as well as the noncanonical links probit, log-log, and complementary log-log. We also cover other links. We present examples and criteria for analysis of fit throughout the part. This new edition includes extensions to generalized binomial regression resulting from a special case of
We also give considerable space to overdispersion. We discuss the problem's nature, how it is identified, and how it can be dealt with in the context of discovery and analysis. We explain how to adjust the binomial model to accommodate overdispersion. You can accomplish this task by internal adjustment to the base model, or you may need to reformulate the base model itself. We also introduce methods of adjusting the variance–covariance matrix of the model to produce robust standard errors. The problem of dealing with overdispersion continues in the chapters on count data.
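One common diagnostic for identifying overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. As a small illustrative sketch (in Python rather than the book's Stata, with a function name of our choosing), for a Poisson fit it can be computed as:

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-squared dispersion estimate for a Poisson fit.

    Sums the squared Pearson residuals (y - mu)^2 / V(mu), with
    V(mu) = mu for the Poisson, and divides by the residual degrees
    of freedom. Values well above 1 suggest overdispersion.
    """
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)
```

When this statistic is markedly greater than 1, the adjustments discussed above (internal adjustment, model reformulation, or robust standard errors) become relevant.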
Part IV addresses count response data. We include examinations of the Poisson, the geometric, and the negative binomial models. With respect to the negative binomial, we show how the standard models can be further extended to derive a class called heterogeneous negative binomial models. There are several "brands" of negative binomial, and it is wise for the researcher to know how each is best used. The distinction of these models is typically denoted NB-1 and NB-2 and relates to the variance-to-mean ratio of the resulting derivation of the model. We have updated this discussion to include the generalized Poisson regression model, which is similar to NB-1. In this edition, we present several other variations of negative binomial regression.
Part V addresses categorical response regression models. Typically considered extensions to the basic GLM, categorical response models are divided into two general varieties: unordered response models, also known as multinomial models, and ordered response models. We begin by considering ordered response models. In such models, the discrete outcomes are ordered, but the integer labels applied to the ordered levels of outcome are not necessarily equally spaced. A simple example is the set of outcomes "bad", "average", and "good". We also cover unordered multinomial responses, whose outcomes are given no order. For an example of an unordered outcome, consider choosing the type of entertainment that is available for an evening. The following choices may be given as "movie", "restaurant", "dancing", or "reading". Ordered response models are themselves divisible into two varieties: 1) ordered binomial models, including ordered logit, ordered probit, ordered complementary log-log, and ordered log-log; and 2) the generalized ordered binomial model with the same links as the nongeneralized parameterization. We have expanded our discussion to include more ordered outcome models,
Finally, part VI is about extensions to GLMs. In particular, we examine models that illustrate the directions in which statistical modeling is moving, hence laying a foundation for future research and for ever-more-appropriate GLMs. Moreover, we have expanded each section of the original version of this text to bring new and expanded regression models into focus. Our attempt, as always, is to illustrate these new models within the context of the GLM.
All the data used in this book are freely available for you to download from the Stata Press website, http://www.stata-press.com. In fact, when we introduce new datasets, we merely load them into Stata the same way that you would. For example,

To download the datasets and do-files for this book, type

The datasets and do-files will be downloaded to your current working directory. We suggest that you create a new directory into which the materials will be downloaded.
Part I Foundations of Generalized Linear Models
GLMs
Nelder and Wedderburn (1972) introduced the theory of GLMs. The authors derived an underlying unity for an entire class of regression models. This class consisted of models whose single response variable, the variable that the model is to explain, is hypothesized to have the variance that is reflected by a member of the single-parameter exponential family of probability distributions. This family of distributions includes the Gaussian or normal, binomial, Poisson, gamma, inverse Gaussian, geometric, and negative binomial.
To establish a basis, we begin discussion of GLMs by initially recalling important results on linear models, specifically those results for linear regression. The standard linear regression model relies on several assumptions, among which are the following:

1. Each observation of the response variable is characterized by the normal or Gaussian distribution: yᵢ ∼ N(μᵢ, σᵢ²).
2. The distributions for all observations have a common variance: σᵢ² = σ² for all i.
3. There is a direct or "identical" relationship between the linear predictor (linear combination of covariate values and associated parameters) and the expected values of the model: xᵢβ = μᵢ.
The purpose of GLMs, and the linear models that they generalize, is to specify the relationship between the observed response variable and some number of covariates. The outcome variable is viewed as a realization from a random variable.
Nelder and Wedderburn showed that general models could be developed by relaxing the assumptions of the linear model. By restructuring the relationship between the linear predictor and the fit, we can "linearize" relationships that initially seem to be nonlinear. Nelder and Wedderburn accordingly dubbed these models "generalized linear models".
Most models that were placed under the original GLM framework were well
or initial estimates for parameters must be provided, and considerable work is required to derive model-specific quantities to ultimately obtain parameter estimates and their standard errors. In the next chapter, we show how much effort is involved.
Ordinary least squares (OLS) extends ML linear regression such that the properties of OLS estimates depend only on the assumptions of constant variance and independence. ML linear regression carries the more restrictive distributional assumption of normality. Similarly, although we may derive likelihoods from specific distributions in the exponential family, the second-order properties of our estimates are shown to depend only on the assumed mean–variance relationship and on the independence of the observations rather than on a more restrictive assumption that observations follow a particular distribution.
The classical linear model assumes that the observations that our dependent variable y represents are independent normal variates with constant variance σ². Also, covariates are related to the expected value of the dependent variable such that

E(y) = μ = Xβ = η

This last equation shows the "identical" or identity relationship between the linear predictor η and the mean μ.
Whereas the linear model conceptualizes the outcome y as the sum of its mean μ and a random variable ε, Nelder and Wedderburn linearized each GLM family member by means of a link function. They then altered a previously used algorithm called iterative weighted least squares, which was used in fitting weighted least-squares regression models. Aside from introducing the link function relating the linear predictor to the fitted values, they also introduced the variance function as an element in the weighting of the regression. The iterations of the algorithm update parameter estimates to produce appropriate linear predictors, fitted values, and standard errors. We will clarify exactly how all this falls together in the section on the iteratively reweighted least-squares (IRLS) algorithm.
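The book develops IRLS within Stata. As a hedged illustration of the mechanics just described (link function, working weights, and a weighted least-squares step at each iteration), here is a minimal Python sketch for a one-covariate log-link Poisson model; the function name and the two-parameter setup are our own simplifications, not the book's algorithm listing.

```python
import math

def irls_poisson(x, y, iters=25):
    """Fit a log-link Poisson GLM, y ~ exp(b0 + b1*x), by IRLS.

    Each iteration solves the weighted least-squares problem
    (X'WX) beta = X'Wz with weights W = mu and working response
    z = eta + (y - mu)/mu, which are the standard quantities for
    the Poisson family with its canonical log link.
    """
    b0, b1 = 0.0, 0.0                          # start from eta = 0
    for _ in range(iters):
        a11 = a12 = a22 = c1 = c2 = 0.0        # accumulate X'WX and X'Wz
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                 # linear predictor
            mu = math.exp(eta)                 # inverse log link
            w = mu                             # IRLS weight
            z = eta + (yi - mu) / mu           # working response
            a11 += w
            a12 += w * xi
            a22 += w * xi * xi
            c1 += w * z
            c2 += w * xi * z
        det = a11 * a22 - a12 * a12
        b0 = (a22 * c1 - a12 * c2) / det       # 2x2 solve (Cramer's rule)
        b1 = (a11 * c2 - a12 * c1) / det
    return b0, b1
```

Applied to noiseless data generated with known coefficients, the iterations converge to those coefficients, which makes the fixed-point character of the algorithm easy to see.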
previously considered to be nonlinear by restructuring them into GLMs. Later, it was discovered that an even more general class of linear models results from more relaxations of assumptions for GLMs.
However, even though the historical roots of GLMs are based on IRLS methodology, many generalizations to the linear model still require Newton–Raphson techniques common to ML methods. We take the position here that GLMs should not be constrained to those models first discussed by Nelder and Wedderburn but rather that they encompass all such linear generalizations to the standard model.
Many other books and journal articles followed the cornerstone article by Nelder and Wedderburn (1972) as well as the text by McCullagh and Nelder (1989) (the original text was published in 1983). Lindsey (1997) illustrates the application of GLMs to biostatistics, most notably focusing on survival models. Hilbe (1994) gives an overview of the GLM and its support from various software packages. Software was developed early on. In fact, Nelder was instrumental in developing the first statistical program based entirely on GLM principles—generalized linear interactive modeling (GLIM). Published by the Numerical Algorithms Group (NAG), the software package has been widely used since the mid-1970s. Other vendors began offering GLM capabilities in the 1980s, including GENSTAT and S-Plus. Stata and SAS included it in their software offerings in 1993 and 1994, respectively.
This text covers much of the same foundation material as other books. What distinguishes our presentation of the material is twofold. First, we focus on the estimation of various models via the estimation technique. Second, we present our derivation of the methods of estimation in a more accessible manner than that presented in other sources. In fact, where possible, we present complete algebraic derivations that include nearly every step in the illustrations. Pedagogically, we have found that this manner of exposition imparts a more solid understanding and "feel" of the area than do other approaches. The idea is this: if you can write your own GLM, then you are probably more able to know how it works, when and why it does not work, and how it is to be evaluated. Of course, we also discuss methods of fit assessment and testing. To model data without subjecting them to evaluation is like taking a test without checking the answers. Hence, we will spend considerable time dealing with model evaluation as well as algorithm construction.
Cited in various places such as Hilbe (1993b) and Francis, Green, and Payne (1993), GLMs are characterized by an expanded itemized list given by the following:

1. A random component for the response, y, which has the characteristic variance of a distribution that belongs to the exponential family.
2. A linear systematic component relating the linear predictor, η = Xβ, to the product of the design matrix X and the parameters β.
3. A known monotonic, one-to-one, differentiable link function g(·) relating the linear predictor to the fitted values. Because the function is one-to-one, there is an inverse function relating the mean expected response, μ = E(y), to the linear predictor: μ = g⁻¹(η).
traditional viewpoint. Adjustments to the weight function have been added to match the usual Newton–Raphson algorithms more closely and so that more appropriate standard errors may be calculated for noncanonical link models. Such features as scaling and robust variance estimators have also been added to the basic algorithm. More importantly, sometimes a traditional GLM must be restructured and fit using a model-specific Newton–Raphson algorithm. Of course, one may simply define a GLM as a model requiring only the standard approach, but doing so would severely limit the range of possible models. We prefer to think of a GLM as a model that is ultimately based on the probability function belonging to the exponential family of distributions, but with the proviso that this criterion may be relaxed to include quasilikelihood models as well as certain types of multinomial, truncated, censored, and inflated models. Most of the latter type require a Newton–Raphson approach rather than the
Early GLM software development constrained GLMs to those models that could be fit using the originally described estimation algorithm. As we will illustrate, the traditional algorithm is relatively simple to implement and requires little computing power. In the days when RAM was scarce and expensive, this was an optimal production strategy for software development. Because this is no longer the case, a wider range of GLMs can more easily be fit using a variety of algorithms. We will discuss these implementation details at length.
In the classical linear model, the observations of the dependent variable y are independent normal variates with constant variance σ². We assume that the mean value of y may depend on other quantities (predictors) denoted by the column vectors x₁, x₂, …, xₚ. In the simplest situation, we assume that this dependency is linear and write

E(y) = β₀ + β₁x₁ + ⋯ + βₚxₚ    (2.3)

and attempt to estimate the vector β.
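As a concrete sketch of this estimation problem (in Python rather than the book's Stata, with a function name of our own choosing), the single-predictor case has the familiar closed-form least-squares solution:

```python
def ols_simple(x, y):
    """Estimate (b0, b1) in E(y) = b0 + b1*x by least squares.

    Uses the closed-form normal-equation solution:
    b1 = Sxy / Sxx, b0 = ybar - b1 * xbar.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = ybar - b1 * xbar          # intercept
    return b0, b1
```

With exactly linear data the estimates recover the generating coefficients, which is the identity-link special case that GLMs generalize.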
GLMs specify a relationship between the mean of the random variable and a function of the linear combination of the predictors. This generalization admits a model specification allowing for continuous or discrete outcomes and allows a description of the variance as a function of the mean.
The link function g(μ) = η relates the mean μ to the linear predictor η, and the variance function relates the variance to a function of the mean, V(y) = φ v(μ), where φ is the scale factor. For the Poisson, binomial, and negative binomial variance models, φ = 1.
Breslow (1996) points out that the critical assumptions in the GLM frameworkmay be stated as follows:
GLMs are traditionally formulated within the framework of the exponential family of distributions. In the associated representation, we can derive a general model that may be fit using the scoring process (IRLS) detailed in section 3.3. Many people confuse the estimation method with the class of GLMs. This is a mistake because there are many estimation methods. Some software implementations allow specification of more diverse models than others. We will point this out throughout the text.
The exponential family is usually (there are other algebraically equivalent forms in the literature) written as

f_y(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }    (2.4)

where θ is the canonical (natural) parameter of location and φ is the parameter of scale. The location parameter (also known as the canonical link function) relates to the means, and the scalar parameter relates to the variances for members of the exponential family of distributions, including Gaussian, gamma, inverse Gaussian, and others. Using the notation of the exponential family provides a means to specify models for continuous, discrete, proportional, count, and binary outcomes.
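As a worked example of this general form (a standard textbook derivation, stated here for concreteness), the Poisson density can be rearranged into exponential-family notation:

```latex
f(y;\lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}
             = \exp\left\{\, y\ln\lambda \;-\; \lambda \;-\; \ln y! \,\right\}
```

Matching terms against the general form identifies the components: the canonical parameter is θ = ln λ (the log link), b(θ) = e^θ = λ, a(φ) = 1, and c(y, φ) = −ln y!.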
In the exponential family presentation, we construe each of the observations yᵢ as being defined in terms of the parameters θᵢ. Because the observations are independent, the joint density of the sample of observations (y₁, …, yₙ), given parameters θ and φ, is defined by the product of the density over the individual observations (review section 2.2). Interested readers can review Barndorff-Nielsen (1976) for the theoretical justification that allows this factorization:

f(y₁, …, yₙ; θ, φ) = ∏ᵢ₌₁ⁿ exp{ [yᵢθᵢ − b(θᵢ)] / a(φ) + c(yᵢ, φ) }    (2.5)