Second edition 2007
Third edition 2012
Fourth edition 2018
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
NetCourseNow is a trademark of StataCorp LLC.
LaTeX2e is a trademark of the American Mathematical Society.
In all editions of this text, this dedication page was written as the final deliverable to the editor—after all the work and all the requested changes and additions had been addressed. In every previous edition, Joe and I co-dedicated this book to our wives and children. This time, he is very proud of our collaboration, but even more proud of his family: Cheryl, Heather, Michael, and Mitchell.
5.1 Pearson residuals obtained from linear model
5.2 Normal scores versus sorted Pearson residuals obtained from linear model
5.3 Pearson residuals versus kilocalories; Pearson residuals obtained from linear model
5.4 Pearson residuals obtained from log-Gaussian model (two outliers removed)
5.5 Pearson residuals versus fitted values from log-Gaussian model (two outliers removed)
5.6 Pearson residuals from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.7 Pearson residuals versus fitted values from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.8 Normal scores versus sorted Pearson residuals obtained from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
5.9 Pearson residuals versus kilocalories; Pearson residuals obtained from lognormal model (log-transformed outcome, two outliers removed, and zero outcome removed)
10.1 Probit and logit functions
10.2 Predicted probabilities for probit and logit link functions in grouped binary models. The observed (sample) proportions are included as well
10.3 Complementary log-log and log-log functions
10.4 Probit, logit, and identity functions
10.5 Observed proportion of carrot fly damage for each treatment (see table 10.3)
20.2 Additional built-in support for the bayes prefix
A.3 First derivatives of link functions
A.4 First derivatives of inverse link functions
A.5 Second derivatives of link functions
A.6 Second derivatives of inverse link functions
5.5 IRLS algorithm for log-Gaussian models using OIM
10.2 IRLS algorithm for binary probit regression using OIM
10.3 IRLS algorithm for binary clog-log regression using EIM
10.4 IRLS algorithm for binary clog-log regression using OIM
We have added several new models to the discussion of extended generalized linear models (GLMs). We have included new software and discussion of extensions to negative binomial regression because of Waring and Famoye. We have also added discussion of heaped data and bias-corrected GLMs because of Firth. There are two new chapters on multivariate outcomes and Bayes GLMs. In addition, we have expanded the clustered data discussion to cover more of the commands available in Stata.
We now include even more examples using synthetically created models to illustrate estimation results, and we illustrate to readers how to construct synthetic Monte Carlo models for binomial and major count models. Code for creating synthetic Poisson, negative binomial, zero-inflated, hurdle, and finite mixture models is provided and further explained. We have enhanced discussion of marginal effects and discrete change for GLMs.
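The synthetic-data code in the book is written in Stata. As a language-neutral sketch of the same Monte Carlo idea, the following Python fragment fixes known coefficients, builds a log-link linear predictor, and draws Poisson outcomes from the implied means; the function names, coefficient values, and sampler are our own illustrative choices, not the book's code.

```python
import math
import random

def rpoisson(mu, rng):
    """Draw one Poisson(mu) variate via Knuth's multiplication
    method (adequate for the small means used in illustrations)."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def synth_poisson(n, b0, b1, seed=1):
    """Synthesize (x, y) pairs from a log-link Poisson model:
    y_i ~ Poisson(exp(b0 + b1 * x_i))."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]                      # covariate in (0, 1)
    y = [rpoisson(math.exp(b0 + b1 * xi), rng) for xi in x]   # Poisson outcomes
    return x, y
```

A model fit to such data should recover estimates near the chosen (b0, b1), which is precisely the point of the Monte Carlo exercise.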
This fourth edition of Generalized Linear Models and Extensions is written for the active researcher as well as for the theoretical statistician. Our goal has been to clarify the nature and scope of GLMs and to demonstrate how all the families, links, and variations of GLMs fit together in an understandable whole. In a step-by-step manner, we detail the foundations and provide working algorithms that readers can use to construct and better understand models that they wish to develop. In a sense, we offer readers a workbook or handbook of how to deal with data using GLM and GLM extensions.
This text is intended as a textbook on GLMs and as a handbook of advice for researchers. We continue to use this book as the required text for a web-based short course through Statistics.com (also known as the Institute for Statistical Education); see http://www.statistics.com. The students of this six-week course include university professors and active researchers from hospitals, government agencies, research institutes, educational concerns, and other institutions across the world. This latest edition reflects the experiences we have had in communicating to our readers and students the relevant materials over the past decade.
Many people have contributed to the ideas presented in the new edition of this book, including Norman Breslow, Berwin Turlach, Gordon Johnston, Thomas Lumley, Bill Sribney, Vince Wiggins, Mario Cleves, William Greene, Andrew Robinson, Heather Presnal, and others. Specifically, for this edition, we thank Tammy Cummings, Chelsea Deroche, Xinling Xu, Roy Bower, Julie Royer, James Hussey, Alex McLain, Rebecca Wardrop, Gelareh Rahimi, Michael G. Smith, Marco Geraci, Bo Cai, and Feifei Xiao.
As always, we thank William Gould, president of StataCorp, for his encouragement in this project. His statistical computing expertise and his contributions to statistical modeling have had a deep impact on this book.
We are grateful to StataCorp's editorial staff for their equanimity in reading and editing our manuscript, especially to Patricia Branton and Lisa Gilmore for their insightful and patient contributions in this area. Finally, we thank Kristin MacDonald and Isabel Canette-Fernandez, Stata statisticians at StataCorp, for their expert assistance on various programming issues, and Nikolay Balov, Senior Statistician and Software Developer at StataCorp, for his helpful assistance with chapter 20 on Bayesian GLMs. We would also like to thank Rose Medeiros, Senior Statistician at StataCorp, for her assistance in the final passes of this edition.
Stata Press allowed us to dictate some of the style of this text. In writing this material in other forms for short courses, we have always included equation numbers for all equations rather than only for those equations mentioned in the text. Although this is not the standard editorial style for textbooks, we enjoy the benefits of students being able to communicate questions and comments more easily (and efficiently). We hope that readers find this practice as beneficial as our short-course participants have found it.
Errata, datasets, and supporting Stata programs (do-files and ado-files) may be found at the publisher's site http://www.stata-press.com/books/generalized-linear-models-and-extensions/. We also maintain these materials on the author sites at http://www.thirdwaystat.com/jameshardin/ and at https://works.bepress.com/joseph_hilbe/. We are very pleased to be able to produce this newest edition. Working on this text from the first edition in 2001
James W. Hardin
Joseph M. Hilbe
March 2018
Introduction
In updating this text, our primary goal is to convey the practice of analyzing data via generalized linear models to researchers across a broad spectrum of scientific fields. We lay out the framework used for describing various aspects of data and for communicating tools for data analysis. This initial part of the text contains no examples. Rather, we focus on the lexicon of generalized linear models used in later chapters. These later chapters include examples from fields such as biostatistics, economics, and survival analysis.
In developing analysis tools, we illustrate techniques via their genesis in estimation algorithms. We believe that motivating the discussion through the estimation algorithms clarifies the origin and usefulness of all generalized linear models. Instead of detailed theoretical exposition, we refer to texts and papers that present such material so that we may focus our detailed presentations on the algorithms and their justification. Our detailed presentations are mostly algebraic; we have minimized matrix notation whenever possible.
We often present illustrations of models using data that we synthesize. Although it is preferable to use real data to illustrate interpretation in context, there is a distinct advantage to examples using simulated data. The advantage of worked examples relying on data synthesis is that the data-generating process offers yet another glimpse into the associations between variables and outcomes that are to be captured in the procedure. The associations are thus seen from both the results of the model and the origins of data generation.
We wrote this text for researchers who want to understand the scope and application of generalized linear models while being introduced to the underlying theory. For brevity's sake, we use the acronym GLM to refer to the generalized linear model, but we acknowledge that GLM has been used elsewhere as an acronym for the general linear model. The latter usage, of course, refers to the area of statistical modeling based solely on the normal or Gaussian probability distribution.
We take GLM to be the generalization of the general, because that is precisely what GLMs are. They are the result of extending ordinary least-squares (OLS) regression, or the normal model, to a model that is appropriate for a variety of response distributions, specifically to those distributions that compose the single-parameter exponential family of distributions. We examine exactly how this extension is accomplished. We also aim to provide the reader with a firm understanding of how GLMs are evaluated and when their use is appropriate. We even advance a bit beyond the traditional GLM and give the reader a look at how GLMs can be extended to model certain types of data that do not fit exactly within the GLM framework.
Nearly every text that addresses a statistical topic uses one or more statistical computing packages to calculate and display results. We use Stata exclusively, though we do refer occasionally to other software packages—especially when it is important to highlight differences.
Some specific statistical models that make up GLMs are often found as standalone software modules, typically fit using maximum likelihood methods based on quantities from model-specific derivations. Stata has several such commands for specific GLMs, including poisson, logistic, and regress. Some of these procedures were included in the Stata package from its first version. More models have been addressed through commands written by users of Stata's programming language, leading to the creation of highly complex statistical models. Some of these community-contributed commands have since been incorporated into the official Stata package. We highlight these commands and illustrate how to fit models in the absence of a packaged command; see especially chapter 14.
StataCorp's continued updates to the command
Readers of technical books often need to know about prerequisites, especially how much math and statistics background is required. To gain full advantage from this text and follow its every statement and algorithm, you should have an understanding equal to a two-semester calculus-based course on statistical theory. Without a background in statistical theory, the reader can accept the presentation of the theoretical underpinnings and follow the (mostly) algebraic derivations that do not require more than a mastery of simple derivatives. We assume prior knowledge of multiple regression, but no other specialized knowledge is required.
We believe that GLMs are best understood if their computational basis is clear. Hence, we begin our exposition with an explanation of the foundations and computation of GLMs; there are two major methodologies for developing algorithms. We then show how simple changes to the base algorithms lead to different GLM families, links, and even further extensions. In short, we attempt to lay the GLM open to inspection and to make every part of it as clear as possible. In this fashion, the reader can understand exactly how and why GLM algorithms can be used, as well as altered, to better model a desired dataset.
Perhaps more than any other text in this area, we alternately examine two major computational GLM algorithms and their modifications: iteratively reweighted least squares (IRLS) and Newton–Raphson. Models estimated via IRLS extend naturally to generalized estimating equations (GEEs). On the other hand, truncated models that do not fit neatly into the exponential family of distributions are modeled using Newton–Raphson methods—and for this, too, we show why. Again, focusing on the details of calculation should help the reader understand both the scope and the limits of a particular model.
Whenever possible, we present the log likelihood for the model under discussion. In writing the log likelihood, we include offsets so that interested programmers can see how those elements enter estimation. In fact, we attempt to offer programmers the ability to understand and write their own working GLMs, plus many useful extensions. As programmers ourselves, we believe that there is value in such a presentation; we would have much enjoyed having it at our fingertips when we first entered this statistical domain.
We use L to denote the likelihood and the script ℒ to denote the log likelihood. We use X to denote the design matrix of independent (explanatory) variables. When appropriate, we use boldface type to emphasize that we are referring to a matrix; a lowercase letter with a subscript, xᵢ, will refer to the ith row from the X matrix.

We use y to denote the dependent (response) variable and refer to the vector β as the coefficients of the design matrix. We use β̂ when we wish to discuss or emphasize the fitted coefficients. Throughout the text, we discuss the role of the (vector) linear predictor η = Xβ. In generalizing this concept, we also refer to the augmented (by an offset) version of the linear predictor.

Finally, we use the notation E(y) to refer to the expectation of a random variable and the notation V(y) to refer to the variance of a random variable. We describe other notational conventions at the time of their first use.
A common question regarding texts concerns their focus. Is the text applied or theoretical? Our text is both. However, we would argue that it is basically applied. We show enough technical details for the theoretician to understand the underlying basis of GLMs. However, we believe that understanding the use and limitations of a GLM includes understanding its estimation algorithm. For some, dealing with formulas and algorithms appears thoroughly theoretical. We believe that it aids in understanding the scope and limits of proper application. Perhaps we can call the text a bit of both and not worry about classification. In any case, for those who fear formulas, each formula and algorithm is thoroughly explained. We hope that by book's end the formulas and algorithms will seem simple and meaningful. For completeness, we give the reader references to texts that discuss more advanced topics and theory.
Part I of the text deals with the basic foundations of GLM. We detail the various components of GLM, including various family, link, variance, deviance, and log-likelihood functions. We also provide a thorough background and detailed particulars of both the Newton–Raphson and iteratively reweighted least-squares algorithms. The chapters that follow highlight this discussion, which describes the framework through which the models of interest arise.
We also give the reader an overview of GLM residuals, introducing some that are not widely known, but that nevertheless can be extremely useful for analyzing a given model's worth. We discuss the general notion of goodness of fit and provide a framework through which you can derive more extensions to GLM. We conclude this part with discussion and illustration of simulation and data synthesis.
We often advise participants only interested in the application and interpretation of models to skip this first part of the book. Even those interested in the theoretical underpinnings will find that this first part of the book can serve more as an appendix. That is, the information in this part often turns out to be most useful in subsequent readings of the material.
Part II addresses the continuous family of distributions, including the Gaussian, gamma, inverse Gaussian, and power families. We derive the related formulas and relevant algorithms for each family and then discuss the ancillary or scale parameters appropriate to each model. We also examine noncanonical links and generalizations to the basic model. Finally, we give examples, showing how a given dataset may be analyzed using each family and link. We give examples dealing with model application, including discussion of the appropriate criteria for the analysis of fit. We have expanded the number of examples in this new edition to highlight both model fitting and assessment.
Part III addresses binomial response models. It includes exposition of the general binomial model and of the various links. Major links described include the canonical logit, as well as the noncanonical links probit, log-log, and complementary log-log. We also cover other links. We present examples and criteria for analysis of fit throughout the part. This new edition includes extensions to generalized binomial regression resulting from a special case of
We also give considerable space to overdispersion. We discuss the problem's nature, how it is identified, and how it can be dealt with in the context of discovery and analysis. We explain how to adjust the binomial model to accommodate overdispersion. You can accomplish this task by internal adjustment to the base model, or you may need to reformulate the base model itself. We also introduce methods of adjusting the variance–covariance matrix of the model to produce robust standard errors. The problem of dealing with overdispersion continues in the chapters on count data.
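One common diagnostic for identifying overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. As a small illustrative sketch (in Python rather than the book's Stata, with a function name of our choosing), for a Poisson fit it can be computed as:

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-squared dispersion estimate for a Poisson fit.

    Sums the squared Pearson residuals (y - mu)^2 / V(mu), with
    V(mu) = mu for the Poisson, and divides by the residual degrees
    of freedom. Values well above 1 suggest overdispersion.
    """
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)
```

When this statistic is markedly greater than 1, the adjustments discussed above (internal adjustment, model reformulation, or robust standard errors) become relevant.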
Part IV addresses count response data. We include examinations of the Poisson, the geometric, and the negative binomial models. With respect to the negative binomial, we show how the standard models can be further extended to derive a class called heterogeneous negative binomial models. There are several "brands" of negative binomial, and it is wise for the researcher to know how each is best used. The distinction of these models is typically denoted NB-1 and NB-2 and relates to the variance-to-mean ratio of the resulting derivation of the model. We have updated this discussion to include the generalized Poisson regression model, which is similar to NB-1. In this edition, we present several other variations of negative binomial regression.
Part V addresses categorical response regression models. Typically considered extensions to the basic GLM, categorical response models are divided into two general varieties: unordered response models, also known as multinomial models, and ordered response models. We begin by considering ordered response models. In such models, the discrete outcomes are ordered, but the integer labels applied to the ordered levels of outcome are not necessarily equally spaced. A simple example is the set of outcomes "bad", "average", and "good". We also cover unordered multinomial responses, whose outcomes are given no order. For an example of an unordered outcome, consider choosing the type of entertainment that is available for an evening. The following choices may be given as "movie", "restaurant", "dancing", or "reading". Ordered response models are themselves divisible into two varieties: 1) ordered binomial models, including ordered logit, ordered probit, ordered complementary log-log, and ordered log-log; and 2) the generalized ordered binomial model with the same links as the nongeneralized parameterization. We have expanded our discussion to include more ordered outcome models,
Finally, part VI is about extensions to GLMs. In particular, we examine models that illustrate the directions in which statistical modeling is moving, hence laying a foundation for future research and for ever-more-appropriate GLMs. Moreover, we have expanded each section of the original version of this text to bring new and expanded regression models into focus. Our attempt, as always, is to illustrate these new models within the context of the GLM.
All the data used in this book are freely available for you to download from the Stata Press website, http://www.stata-press.com. In fact, when we introduce new datasets, we merely load them into Stata the same way that you would. For example,

To download the datasets and do-files for this book, type

The datasets and do-files will be downloaded to your current working directory. We suggest that you create a new directory into which the materials will be downloaded.
Part I Foundations of Generalized Linear Models
GLMs
Nelder and Wedderburn (1972) introduced the theory of GLMs. The authors derived an underlying unity for an entire class of regression models. This class consisted of models whose single response variable, the variable that the model is to explain, is hypothesized to have the variance that is reflected by a member of the single-parameter exponential family of probability distributions. This family of distributions includes the Gaussian or normal, binomial, Poisson, gamma, inverse Gaussian, geometric, and negative binomial.
To establish a basis, we begin discussion of GLMs by initially recalling important results on linear models, specifically those results for linear regression. The standard linear regression model relies on several assumptions, among which are the following:

1. Each observation of the response variable is characterized by the normal or Gaussian distribution: yᵢ ∼ N(μᵢ, σᵢ²).
2. The distributions for all observations have a common variance: σᵢ² = σ² for all i.
3. There is a direct or "identical" relationship between the linear predictor (linear combination of covariate values and associated parameters) and the expected values of the model: xᵢβ = μᵢ.
The purpose of GLMs, and the linear models that they generalize, is to specify the relationship between the observed response variable and some number of covariates. The outcome variable is viewed as a realization from a random variable.
Nelder and Wedderburn showed that general models could be developed by relaxing the assumptions of the linear model. By restructuring the relationship between the linear predictor and the fit, we can "linearize" relationships that initially seem to be nonlinear. Nelder and Wedderburn accordingly dubbed these models "generalized linear models".
Most models that were placed under the original GLM framework were well
or initial estimates for parameters must be provided, and considerable work is required to derive model-specific quantities to ultimately obtain parameter estimates and their standard errors. In the next chapter, we show how much effort is involved.
Ordinary least squares (OLS) extends ML linear regression such that the properties of OLS estimates depend only on the assumptions of constant variance and independence. ML linear regression carries the more restrictive distributional assumption of normality. Similarly, although we may derive likelihoods from specific distributions in the exponential family, the second-order properties of our estimates are shown to depend only on the assumed mean–variance relationship and on the independence of the observations rather than on a more restrictive assumption that observations follow a particular distribution.
The classical linear model assumes that the observations that our dependent variable y represents are independent normal variates with constant variance σ². Also, covariates are related to the expected value of the dependent variable such that

E(y) = μ = Xβ = η

This last equation shows the "identical" or identity relationship between the linear predictor η and the mean μ.
Whereas the linear model conceptualizes the outcome y as the sum of its mean μ and a random variable ε, Nelder and Wedderburn linearized each GLM family member by means of a link function. They then altered a previously used algorithm called iterative weighted least squares, which was used in fitting weighted least-squares regression models. Aside from introducing the link function relating the linear predictor to the fitted values, they also introduced the variance function as an element in the weighting of the regression. The iterations of the algorithm update parameter estimates to produce appropriate linear predictors, fitted values, and standard errors. We will clarify exactly how all this falls together in the section on the iteratively reweighted least-squares (IRLS) algorithm.
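The book develops IRLS within Stata. As a hedged illustration of the mechanics just described (link function, working weights, and a weighted least-squares step at each iteration), here is a minimal Python sketch for a one-covariate log-link Poisson model; the function name and the two-parameter setup are our own simplifications, not the book's algorithm listing.

```python
import math

def irls_poisson(x, y, iters=25):
    """Fit a log-link Poisson GLM, y ~ exp(b0 + b1*x), by IRLS.

    Each iteration solves the weighted least-squares problem
    (X'WX) beta = X'Wz with weights W = mu and working response
    z = eta + (y - mu)/mu, which are the standard quantities for
    the Poisson family with its canonical log link.
    """
    b0, b1 = 0.0, 0.0                          # start from eta = 0
    for _ in range(iters):
        a11 = a12 = a22 = c1 = c2 = 0.0        # accumulate X'WX and X'Wz
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                 # linear predictor
            mu = math.exp(eta)                 # inverse log link
            w = mu                             # IRLS weight
            z = eta + (yi - mu) / mu           # working response
            a11 += w
            a12 += w * xi
            a22 += w * xi * xi
            c1 += w * z
            c2 += w * xi * z
        det = a11 * a22 - a12 * a12
        b0 = (a22 * c1 - a12 * c2) / det       # 2x2 solve (Cramer's rule)
        b1 = (a11 * c2 - a12 * c1) / det
    return b0, b1
```

Applied to noiseless data generated with known coefficients, the iterations converge to those coefficients, which makes the fixed-point character of the algorithm easy to see.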
previously considered to be nonlinear by restructuring them into GLMs. Later, it was discovered that an even more general class of linear models results from more relaxations of assumptions for GLMs.
However, even though the historical roots of GLMs are based on IRLS methodology, many generalizations to the linear model still require Newton–Raphson techniques common to ML methods. We take the position here that GLMs should not be constrained to those models first discussed by Nelder and Wedderburn but rather that they encompass all such linear generalizations to the standard model.
Many other books and journal articles followed the cornerstone article by Nelder and Wedderburn (1972) as well as the text by McCullagh and Nelder (1989) (the original text was published in 1983). Lindsey (1997) illustrates the application of GLMs to biostatistics, most notably focusing on survival models. Hilbe (1994) gives an overview of the GLM and its support from various software packages. Software was developed early on. In fact, Nelder was instrumental in developing the first statistical program based entirely on GLM principles—generalized linear interactive modeling (GLIM). Published by the Numerical Algorithms Group (NAG), the software package has been widely used since the mid-1970s. Other vendors began offering GLM capabilities in the 1980s, including GENSTAT and S-Plus. Stata and SAS included it in their software offerings in 1993 and 1994, respectively.
This text covers much of the same foundation material as other books. What distinguishes our presentation of the material is twofold. First, we focus on the estimation of various models via the estimation technique. Second, we present our derivation of the methods of estimation in a more accessible manner than that presented in other sources. In fact, where possible, we present complete algebraic derivations that include nearly every step in the illustrations. Pedagogically, we have found that this manner of exposition imparts a more solid understanding and "feel" of the area than do other approaches. The idea is this: if you can write your own GLM, then you are probably more able to know how it works, when and why it does not work, and how it is to be evaluated. Of course, we also discuss methods of fit assessment and testing. To model data without subjecting them to evaluation is like taking a test without checking the answers. Hence, we will spend considerable time dealing with model evaluation as well as algorithm construction.
Cited in various places such as Hilbe (1993b) and Francis, Green, and Payne (1993), GLMs are characterized by an expanded itemized list given by the following:

1. A random component for the response, y, which has the characteristic variance of a distribution that belongs to the exponential family.
2. A linear systematic component relating the linear predictor, η = Xβ, to the product of the design matrix X and the parameters β.
3. A known monotonic, one-to-one, differentiable link function g(·) relating the linear predictor to the fitted values. Because the function is one-to-one, there is an inverse function relating the mean expected response, μ = E(y), to the linear predictor: μ = g⁻¹(η).
traditional viewpoint. Adjustments to the weight function have been added to match the usual Newton–Raphson algorithms more closely and so that more appropriate standard errors may be calculated for noncanonical link models. Such features as scaling and robust variance estimators have also been added to the basic algorithm. More importantly, sometimes a traditional GLM must be restructured and fit using a model-specific Newton–Raphson algorithm. Of course, one may simply define a GLM as a model requiring only the standard approach, but doing so would severely limit the range of possible models. We prefer to think of a GLM as a model that is ultimately based on the probability function belonging to the exponential family of distributions, but with the proviso that this criterion may be relaxed to include quasilikelihood models as well as certain types of multinomial, truncated, censored, and inflated models. Most of the latter type require a Newton–Raphson approach rather than the
Early GLM software development constrained GLMs to those models that could be fit using the originally described estimation algorithm. As we will illustrate, the traditional algorithm is relatively simple to implement and requires little computing power. In the days when RAM was scarce and expensive, this was an optimal production strategy for software development. Because this is no longer the case, a wider range of GLMs can more easily be fit using a variety of algorithms. We will discuss these implementation details at length.
In the classical linear model, the observations of the dependent variable y are independent normal variates with constant variance σ². We assume that the mean value of y may depend on other quantities (predictors) denoted by the column vectors x₁, x₂, …, xₚ. In the simplest situation, we assume that this dependency is linear and write

E(y) = β₀ + β₁x₁ + ⋯ + βₚxₚ    (2.3)

and attempt to estimate the vector β.
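As a concrete sketch of this estimation problem (in Python rather than the book's Stata, with a function name of our own choosing), the single-predictor case has the familiar closed-form least-squares solution:

```python
def ols_simple(x, y):
    """Estimate (b0, b1) in E(y) = b0 + b1*x by least squares.

    Uses the closed-form normal-equation solution:
    b1 = Sxy / Sxx, b0 = ybar - b1 * xbar.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = ybar - b1 * xbar          # intercept
    return b0, b1
```

With exactly linear data the estimates recover the generating coefficients, which is the identity-link special case that GLMs generalize.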
GLMs specify a relationship between the mean of the random variable and a function of the linear combination of the predictors. This generalization admits a model specification allowing for continuous or discrete outcomes and allows a description of the variance as a function of the mean.
The link function g(μ) = η relates the mean μ to the linear predictor η, and the variance function relates the variance to a function of the mean, V(y) = φ v(μ), where φ is the scale factor. For the Poisson, binomial, and negative binomial variance models, φ = 1.
Breslow (1996) points out that the critical assumptions in the GLM frameworkmay be stated as follows:
GLMs are traditionally formulated within the framework of the exponential family of distributions. In the associated representation, we can derive a general model that may be fit using the scoring process (IRLS) detailed in section 3.3. Many people confuse the estimation method with the class of GLMs. This is a mistake because there are many estimation methods. Some software implementations allow specification of more diverse models than others. We will point this out throughout the text.
The exponential family is usually (there are other algebraically equivalent forms in the literature) written as

f_y(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }    (2.4)

where θ is the canonical (natural) parameter of location and φ is the parameter of scale. The location parameter (also known as the canonical link function) relates to the means, and the scalar parameter relates to the variances for members of the exponential family of distributions, including Gaussian, gamma, inverse Gaussian, and others. Using the notation of the exponential family provides a means to specify models for continuous, discrete, proportional, count, and binary outcomes.
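As a worked example of this general form (a standard textbook derivation, stated here for concreteness), the Poisson density can be rearranged into exponential-family notation:

```latex
f(y;\lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}
             = \exp\left\{\, y\ln\lambda \;-\; \lambda \;-\; \ln y! \,\right\}
```

Matching terms against the general form identifies the components: the canonical parameter is θ = ln λ (the log link), b(θ) = e^θ = λ, a(φ) = 1, and c(y, φ) = −ln y!.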
In the exponential family presentation, we construe each of the observations yᵢ as being defined in terms of the parameters θᵢ. Because the observations are independent, the joint density of the sample of observations (y₁, …, yₙ), given parameters θ and φ, is defined by the product of the density over the individual observations (review section 2.2). Interested readers can review Barndorff-Nielsen (1976) for the theoretical justification that allows this factorization:

f(y₁, …, yₙ; θ, φ) = ∏ᵢ₌₁ⁿ exp{ [yᵢθᵢ − b(θᵢ)] / a(φ) + c(yᵢ, φ) }    (2.5)