Of course, the failure to reject a model by a goodness-of-fit test does not prove that the model is true or fits the data well. For this reason, residual analyses are always advisable for any results that are to be published. A residual analysis for a logistic regression model is analogous to one for linear regression. Although the standard error of $d_j$ is $\mathrm{se}[d_j] = \sqrt{n_j\pi_j(1-\pi_j)}$, the standard error of the residual $d_j - n_j\hat{\pi}_j$ is less than $\mathrm{se}[d_j]$ because the maximum likelihood estimates of the parameters tend to shift $n_j\hat{\pi}_j$ in the direction of $d_j$. The ability of an individual covariate pattern to reduce the standard deviation of its associated residual is measured by the leverage $h_j$ (Pregibon, 1981). The formula for $h_j$ is complex and not terribly edifying. For our purposes, we can define $h_j$ by the formula
$$\mathrm{var}[d_j - n_j\hat{\pi}_j] = n_j\hat{\pi}_j(1-\hat{\pi}_j)(1-h_j) \cong \mathrm{var}[d_j - n_j\pi_j](1-h_j). \tag{5.51}$$
In other words, $100h_j$ is the approximate percent reduction in the variance of the $j$th residual due to the fact that the estimate $n_j\hat{\pi}_j$ is pulled towards $d_j$. The value of $h_j$ lies between 0 and 1. When $h_j$ is very small, $d_j$ has almost no effect on its estimated expected value $n_j\hat{\pi}_j$. When $h_j$ is close to one, $d_j \cong n_j\hat{\pi}_j$. This implies that both the residual $d_j - n_j\hat{\pi}_j$ and its variance will be close to zero. This definition of leverage is highly analogous to that given for linear regression. See, in particular, equation (2.24).
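As a small numerical illustration of equation (5.51), the following sketch evaluates the approximate variance of a residual for a single covariate pattern at several leverage values. The function name `residual_variance` and the input values are hypothetical and chosen only for illustration; the leverage is taken as given rather than computed from Pregibon's formula.

```python
def residual_variance(n_j, pi_hat_j, h_j):
    """Approximate var[d_j - n_j * pi_hat_j] following equation (5.51)."""
    if not 0.0 <= h_j <= 1.0:
        raise ValueError("leverage h_j must lie between 0 and 1")
    # Estimated binomial variance of d_j for this covariate pattern
    binomial_variance = n_j * pi_hat_j * (1.0 - pi_hat_j)
    # The residual variance shrinks by the factor (1 - h_j)
    return binomial_variance * (1.0 - h_j)


# Hypothetical pattern: 20 patients with estimated case probability 0.3
for h in (0.0, 0.5, 0.9):
    print(f"h_j = {h:.1f}: residual variance = {residual_variance(20, 0.3, h):.2f}")
```

As the leverage approaches one, the computed variance approaches zero, which matches the statement that the residual and its variance both vanish in that limit.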
5.28.1. Standardized Pearson Residual
The standardized Pearson residual for the $j$th covariate pattern is the residual divided by its standard error. That is,
$$r_{sj} = \frac{d_j - n_j\hat{\pi}_j}{\sqrt{n_j\hat{\pi}_j(1-\hat{\pi}_j)(1-h_j)}} = \frac{r_j}{\sqrt{1-h_j}}. \tag{5.52}$$
This residual is analogous to the standardized residual for linear regression (see equation 2.25). The key difference between equation (2.25) and equation (5.52) is that the standardized residual has a known $t$ distribution under the linear model. Although $r_{sj}$ has mean zero and standard error one, it does not have a normally shaped distribution when $n_j$ is small. The square of the standardized Pearson residual is denoted by
$$X_j^2 = r_{sj}^2 = r_j^2/(1-h_j). \tag{5.53}$$
We will use the critical value $(z_{0.025})^2 = 1.96^2 = 3.84$ as a very rough guide to identifying large values of $X_j^2$. Approximately 95% of these squared residuals should be less than 3.84 if the logistic regression model is correct.
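The calculation in equations (5.52) and (5.53) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the leverage is supplied by the caller, and the numerical inputs are invented rather than taken from the esophageal cancer data.

```python
import math

CRITICAL_VALUE = 1.96 ** 2  # rough guide for large squared residuals (about 3.84)


def standardized_pearson_residual(d_j, n_j, pi_hat_j, h_j):
    """Residual divided by its standard error, as in equation (5.52)."""
    # Ordinary Pearson residual r_j
    r_j = (d_j - n_j * pi_hat_j) / math.sqrt(n_j * pi_hat_j * (1.0 - pi_hat_j))
    # Dividing by sqrt(1 - h_j) accounts for the leverage of the pattern
    return r_j / math.sqrt(1.0 - h_j)


def squared_standardized_residual(d_j, n_j, pi_hat_j, h_j):
    """Square of the standardized Pearson residual, as in equation (5.53)."""
    return standardized_pearson_residual(d_j, n_j, pi_hat_j, h_j) ** 2


# Hypothetical pattern: 9 cases among 20 patients, fitted probability 0.3, leverage 0.2
x2 = squared_standardized_residual(9, 20, 0.3, 0.2)
print(f"squared residual = {x2:.3f}, large = {x2 > CRITICAL_VALUE}")
```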
5.28.2. $\Delta\hat{\beta}_j$ Influence Statistic
Covariate patterns that are associated with both high leverage and large residuals can have a substantial influence on the parameter estimates of the model. The $\Delta\hat{\beta}_j$ influence statistic is a measure of the influence of the $j$th covariate pattern on all of the parameter estimates taken together (Pregibon, 1981), and equals
$$\Delta\hat{\beta}_j = r_{sj}^2 h_j/(1-h_j). \tag{5.54}$$
Note that $\Delta\hat{\beta}_j$ increases with both the magnitude of the standardized residual and the size of the leverage. It is analogous to Cook's distance for linear regression (see Section 3.20.2). Covariate patterns associated with large values of $X_j^2$ and $\Delta\hat{\beta}_j$ merit special attention.
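A minimal sketch of equation (5.54) is given below. The function name `delta_beta` and the example inputs are hypothetical; the standardized residual and leverage are assumed to have been obtained already from the fitted model.

```python
def delta_beta(r_sj, h_j):
    """Influence statistic of equation (5.54): r_sj^2 * h_j / (1 - h_j)."""
    if not 0.0 <= h_j < 1.0:
        raise ValueError("leverage h_j must satisfy 0 <= h_j < 1")
    return r_sj ** 2 * h_j / (1.0 - h_j)


# Hypothetical values showing that influence grows with both inputs
for r_sj, h_j in [(1.0, 0.1), (2.0, 0.1), (2.0, 0.5)]:
    print(f"r_sj = {r_sj}, h_j = {h_j}: influence = {delta_beta(r_sj, h_j):.3f}")
```

The loop holds one input fixed while raising the other, so the printed values rise from the first row to the last.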
5.28.3. Residual Plots of the Ille-et-Vilaine Data on Esophageal Cancer
Figure 5.1 shows a plot of the squared residuals $X_j^2$ against the estimated cancer probability for model (5.40). Each circle represents the squared residual associated with a unique covariate pattern. The area of each circle is proportional to $\Delta\hat{\beta}_j$. Black circles are used to indicate positive residuals while gray circles indicate negative residuals. Hosmer and Lemeshow (1989) first suggested this form of residual plot. They recommend that the area of the plotted circles be 1.5 times the magnitude of $\Delta\hat{\beta}_j$. Figure 5.1 does not reveal any obvious relationship between the magnitude of the residuals and the values of $\hat{\pi}_j$. There are 68 unique covariate patterns in this data set. Five percent of 68 equals 3.4. Hence, if model (5.40) is correct we would expect three or four squared residuals to be greater than 3.84. There are six such residuals with two of them being close to 3.84. Thus, the magnitude of the residuals is reasonably consistent with model (5.40).
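The arithmetic behind this rough check can be sketched as follows. The binomial tail calculation is an added illustration, not part of the original analysis: it assumes that the 68 squared residuals are independent and that each exceeds 3.84 with probability exactly 0.05, neither of which holds exactly for a fitted logistic model. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), summed directly from the mass function."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


n_patterns = 68   # unique covariate patterns in the data set
p_exceed = 0.05   # nominal chance that a squared residual exceeds 3.84

expected = n_patterns * p_exceed
tail = binomial_upper_tail(n_patterns, p_exceed, 6)
print(f"expected number above 3.84: {expected:.1f}")
print(f"rough chance of six or more above 3.84: {tail:.3f}")
```

Under these simplifying assumptions, a tail probability that is not small would support the reading that six large residuals are not alarming.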
[Figure 5.1 plot: horizontal axis $\hat{\pi}_j$ (0 to 1.0); vertical axis squared standardized Pearson residual $X_j^2$ (0 to 8), with the value 3.84 marked; circles A and B labeled.]
Figure 5.1 Squared residual plot of $X_j^2$ against $\hat{\pi}_j$ for the esophageal cancer data analyzed with model (5.40). A separate circle is plotted for each distinct covariate pattern. The area of each circle is proportional to the influence statistic $\Delta\hat{\beta}_j$. $X_j^2$ is the squared standardized Pearson residual for the $j$th covariate pattern; $\hat{\pi}_j$ is the estimated probability that a study subject with this pattern is one of the case patients. Black and gray circles indicate positive and negative residuals, respectively. Two circles associated with covariate patterns having large influence and big squared residuals are labeled A and B (see text).
There are two large squared residuals in Figure 5.1 that have high influence. These squared residuals are labeled A and B in this figure. Residual A is associated with patients who are age 55–64 and consume, on a daily basis, at least 120 gm of alcohol and 0–9 gm of tobacco. Residual B is associated with patients who are age 55–64 and consume, on a daily basis, 0–39 gm of alcohol and at least 30 gm of tobacco. The $\Delta\hat{\beta}_j$ influence statistics associated with residuals A and B are 6.16 and 4.15, respectively. Table 5.6 shows the effects of deleting patients with these covariate patterns from the analysis.
Column 3 of this table repeats the odds ratio given in Table 5.5. Columns 5 and 7 show the odds ratios that result when patients with covariate patterns A and B are deleted from model (5.40). Deleting patients with pattern A increases the odds ratio for men who smoke 0–9 gm and drink ≥120 gm from 65.1 to 274. This is a 321% increase that places this odds ratio outside of its 95% confidence interval based on the complete data. The other odds ratios in Table 5.5 are not greatly changed by deleting these patients. Deleting the patients associated with covariate pattern B causes a 78% reduction in the odds ratio for men who smoke at least 30 gm and drink 0–39 gm a day. Their deletion does not greatly affect the other odds ratios in this table.
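The percent changes quoted above follow from the usual relative-change formula, as the short sketch below shows. The function name `percent_change` is hypothetical; the odds ratios are the ones quoted in the text.

```python
def percent_change(complete, deleted):
    """Percent change in an odds ratio relative to the complete-data estimate."""
    return 100.0 * (deleted - complete) / complete


# Pattern A deleted: 0-9 gm tobacco and at least 120 gm alcohol
change_a = percent_change(65.1, 274)
# Pattern B deleted: at least 30 gm tobacco and 0-39 gm alcohol
change_b = percent_change(8.65, 1.88)

print(f"pattern A deleted: {change_a:.0f}%")
print(f"pattern B deleted: {change_b:.0f}%")
```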
Table 5.6. Effects on odds ratios from model (5.40) due to deleting patients with covariate patterns A and B identified in Figure 5.1 (see text).
| Daily tobacco consumption | Daily alcohol consumption | Complete data: odds ratio | Complete data: 95% confidence interval | Pattern A† deleted: odds ratio | Pattern A† deleted: percent change from complete data | Pattern B‡ deleted: odds ratio | Pattern B‡ deleted: percent change from complete data |
|---|---|---|---|---|---|---|---|
| 0–9 gm | 0–39 gm | 1.0∗ | | 1.0∗ | | 1.0∗ | |
| 0–9 gm | 40–79 gm | 7.55 | (3.4–17) | 7.53 | −0.26% | 7.70 | 2.0% |
| 0–9 gm | 80–119 gm | 12.7 | (5.2–31) | 12.6 | −0.79% | 13.0 | 2.4% |
| 0–9 gm | ≥120 gm | 65.1 | (20–210) | 274 | 321% | 66.8 | 2.6% |
| 10–29 gm | 0–39 gm | 3.80 | (1.6–9.2) | 3.77 | −0.79% | 3.86 | 1.6% |
| 10–29 gm | 40–79 gm | 9.34 | (4.2–21) | 9.30 | −0.43% | 9.53 | 2.0% |
| 10–29 gm | 80–119 gm | 16.1 | (6.8–38) | 16.0 | −0.62% | 16.6 | 3.1% |
| 10–29 gm | ≥120 gm | 92.3 | (29–290) | 95.4 | 3.4% | 94.0 | 1.8% |
| ≥30 gm | 0–39 gm | 8.65 | (2.4–31) | 8.66 | 0.12% | 1.88 | −78% |
| ≥30 gm | 40–79 gm | 32.9 | (10–110) | 33.7 | 2.4% | 33.5 | 1.8% |
| ≥30 gm | 80–119 gm | 72.3 | (15–350) | 73.0 | 0.97% | 74.2 | 2.6% |
| ≥30 gm | ≥120 gm | 196 | (30–1300) | 198 | 1.02% | 203 | 3.6% |
∗Denominator of odds ratios
†Patients age 55–64 who drink at least 120 gm a day and smoke 0–9 gm a day deleted
‡Patients age 55–64 who drink 0–39 gm a day and smoke at least 30 gm a day deleted
How should these analyses guide the way in which we present these results? Here, reasonable investigators may disagree on the best way to proceed. My own inclination would be to publish Table 5.5. This table provides compelling evidence that tobacco and alcohol are strong independent risk factors for esophageal cancer and indicates an impressive synergy between these two risk factors. Deleting patients with covariate patterns A and B does not greatly alter this conclusion, although it does profoundly alter the size of two of these odds ratios. On the other hand, the size of some of the $\Delta\hat{\beta}_j$ influence statistics in Figure 5.1 and the width of the confidence intervals in Table 5.5 provide a clear warning that model (5.40) is approaching the upper limit of complexity that is reasonable for this data set. A more conservative approach would be to not report the combined effects of alcohol and smoking, or to use just two levels of consumption for each drug rather than three or four. Model (5.38) could be used to report the odds ratios associated with different levels of alcohol consumption adjusted for tobacco usage. This model could also be used to estimate odds ratios associated with different tobacco levels adjusted for alcohol.
Residual analyses in logistic regression are in many ways similar to those for linear regression. There is, however, one important difference. In linear regression, an influential observation is made on a single patient and there is always the possibility that this result is invalid and should be discarded from the analysis. In logistic regression, an influential observation usually is due to the response from multiple patients with the same covariate pattern.
Hence, deleting these observations is not an option. Nevertheless, residual analyses are worthwhile in that they help us evaluate how well the model fits the data and can indicate instabilities that can arise from excessively complicated models.