. *
. * Perform Pearson chi-squared and Hosmer-Lemeshow tests of
. * goodness of fit.
. *
. lfit {1}
Logistic model for cancer, goodness-of-fit test
       number of observations =       975
 number of covariate patterns =        68
             Pearson chi2(51) =     55.85
                  Prob > chi2 =    0.2977
. lfit, group(10) table {2}
Logistic model for cancer, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
  _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
       1   0.0046        0      0.3      116    115.7      116
       2   0.0273        2      2.0      118    118.0      120
       3   0.0418        4      3.1       76     76.9       80
       4   0.0765        4      5.1       87     85.9       91
       5   0.1332        5      7.8       81     78.2       86
       6   0.2073       21     20.2       91     91.8      112
       7   0.2682       22     22.5       65     64.5       87
       8   0.3833       32     28.5       56     59.5       88
       9   0.5131       46     41.6       52     56.4       98
      10   0.9440       64     68.9       33     28.1       97
189 5.29. Using stata for goodness-of-fit tests and residual analyses
       number of observations =       975
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      4.73
                  Prob > chi2 =    0.7862
. *
. * Perform residual analysis
. *
. predict p, p {3}
. predict dx2, dx2 {4}
(57 missing values generated)
. predict rstandard, rstandard {5}
(57 missing values generated)
. generate dx2_pos = dx2 if rstandard >= 0 {6}
(137 missing values generated)
. generate dx2_neg = dx2 if rstandard < 0
(112 missing values generated)
. predict dbeta, dbeta {7}
(57 missing values generated)
. generate bubble = 1.5*dbeta
(57 missing values generated)
. graph dx2_pos dx2_neg p [weight=bubble], symbol(OO) xlabel(0.1 to 1.0) {8}
> xtick(0.05 0.1 to 0.95) ylabel(0 1 to 8) ytick(.5 1 to 7.5) yline(3.84)
. save temporary, replace {9}
file temporary.dta saved
. drop if patients == 0 {10}
(57 observations deleted)
. generate ca_no = cancer*patients
. collapse (sum) n = patients ca = ca_no, by(age alcohol smoke dbeta dx2 p) {11}
. *
. * Identify covariate patterns associated with large squared residuals
. *
. list n ca age alcohol smoke dbeta dx2 p if dx2 > 3.84, nodisplay {12}
       n   ca     age   alcohol   smoke      dbeta        dx2          p
 11.   2    1   25-34    >= 120   10-29   1.335425   7.942312    .060482
 17.  37    4   35-44     40-79   10-29   1.890465   5.466789    .041798
 22.   3    2   35-44    >= 120     0-9   .9170162   3.896309   .2331274
190 5. Multiple logistic regression
 25.  28    0   45-54      0-39   10-29   1.564479   4.114906   .0962316
 38.   6    4   55-64      0-39   >= 30   4.159096   6.503713   .2956251
 45.  10    5   55-64    >= 120     0-9   6.159449   6.949361   .7594333
. *
. * Rerun analysis without the covariate pattern A
. *
. use temporary, clear {13}
. drop if age == 4 & alcohol == 4 & smoke == 1 {14}
(2 observations deleted)
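. xi: logistic cancer i.age i.alcohol*i.smoke [freq=patients] {15}
{Output omitted}
------------------------------------------------------------------------------
      cancer | Odds Ratio   Std. Err.        z   P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
{Output omitted}
 _Ialcohol_2 |   7.525681    3.032792    5.008   0.000    3.416001    16.57958
 _Ialcohol_3 |   12.62548    5.790079    5.529   0.000    5.139068    31.01781
 _Ialcohol_4 |   273.8578    248.0885    6.196   0.000    46.38949    1616.705
   _Ismoke_2 |    3.76567      1.6883    2.957   0.003    1.563921    9.067132
   _Ismoke_3 |    8.65512    5.583627    3.345   0.001    2.444232    30.64811
{Output omitted}
. lincom _Ialcohol_2 + _Ismoke_2 + _IalcXsmo_2_2, or
{Output omitted}
------------------------------------------------------------------------------
      cancer | Odds Ratio   Std. Err.        z   P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   9.298176    3.811849    5.439   0.000    4.163342    20.76603
------------------------------------------------------------------------------
. lincom _Ialcohol_2 + _Ismoke_3 + _IalcXsmo_2_3, or
{Output omitted}
         (1) |    33.6871    20.40138    5.808   0.000    10.27932    110.3985
------------------------------------------------------------------------------
. lincom _Ialcohol_3 + _Ismoke_2 + _IalcXsmo_3_2, or
{Output omitted}
         (1) |   16.01118    7.097924    6.256   0.000    6.715472     38.1742
------------------------------------------------------------------------------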
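. lincom _Ialcohol_3 + _Ismoke_3 + _IalcXsmo_3_3, or
{Output omitted}
         (1) |   73.00683    58.92606    5.316   0.000    15.00833    355.1358
------------------------------------------------------------------------------
. lincom _Ialcohol_4 + _Ismoke_2 + _IalcXsmo_4_2, or
{Output omitted}
         (1) |   95.43948    56.55247    7.693   0.000    29.87792    304.8638
------------------------------------------------------------------------------
. lincom _Ialcohol_4 + _Ismoke_3 + _IalcXsmo_4_3, or
{Output omitted}
         (1) |   197.7124    192.6564    5.426   0.000    29.28192     1334.96
------------------------------------------------------------------------------
. *
. * Rerun analysis without the covariate pattern B
. *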
. use temporary, clear {16}
. drop if age == 4 & alcohol == 1 & smoke == 3 {17}
(2 observations deleted)
. xi: logistic cancer i.age i.alcohol*i.smoke [freq=patients] {18}
{Output omitted}
------------------------------------------------------------------------------
      cancer | Odds Ratio   Std. Err.        z   P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
{Output omitted}
 _Ialcohol_2 |   7.695185    3.109016    5.051   0.000    3.485907    16.98722
 _Ialcohol_3 |   13.04068    5.992019    5.589   0.000    5.298882    32.09342
 _Ialcohol_4 |   66.83578    40.63582    6.912   0.000    20.29938     220.057
   _Ismoke_2 |   3.864114    1.735157    3.010   0.003    1.602592    9.317017
   _Ismoke_3 |   1.875407    2.107209    0.560   0.576    .2073406    16.96315
{Output omitted}
. lincom _Ialcohol_2 + _Ismoke_2 + _IalcXsmo_2_2, or
{Output omitted}
------------------------------------------------------------------------------
      cancer | Odds Ratio   Std. Err.        z   P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   9.526812    3.914527    5.486   0.000     4.25787    21.31586
------------------------------------------------------------------------------
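. lincom _Ialcohol_2 + _Ismoke_3 + _IalcXsmo_2_3, or
{Output omitted}
         (1) |   33.48594    20.08865    5.853   0.000    10.33274    108.5199
------------------------------------------------------------------------------
. lincom _Ialcohol_3 + _Ismoke_2 + _IalcXsmo_3_2, or
{Output omitted}
         (1) |   16.58352    7.369457    6.320   0.000    6.940903    39.62209
------------------------------------------------------------------------------
. lincom _Ialcohol_3 + _Ismoke_3 + _IalcXsmo_3_3, or
{Output omitted}
         (1) |   74.22997    59.24187    5.397   0.000    15.53272    354.7406
------------------------------------------------------------------------------
. lincom _Ialcohol_4 + _Ismoke_2 + _IalcXsmo_4_2, or
{Output omitted}
         (1) |    94.0049    54.92414    7.776   0.000    29.91024     295.448
------------------------------------------------------------------------------
. lincom _Ialcohol_4 + _Ismoke_3 + _IalcXsmo_4_3, or
{Output omitted}
         (1) |   202.6374    194.6184    5.530   0.000    30.84628    1331.179
------------------------------------------------------------------------------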
Comments
1 The lfit command is a post-estimation command that can be used with logistic regression. Without options it calculates the Pearson chi-squared goodness-of-fit test for the preceding logistic regression analysis. In this example, the preceding logistic command analyzed model (5.40) (see Section 5.23). As indicated in Section 5.27.1, this statistic equals 55.85 and has 51 degrees of freedom. The associated P value is 0.30.
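As a quick check on comment 1, the P value can be recomputed from the reported statistic and its degrees of freedom. This is only a sketch: chi2tail() is Stata's upper-tail chi-squared function, and the estat gof line assumes a more recent Stata release than the one that produced this log.

```stata
* Upper-tail probability of a chi-squared variate with 51 degrees of
* freedom at 55.85. The result should agree with the Prob > chi2 line
* in the log, up to rounding of the statistic.
display chi2tail(51, 55.85)

* Assumed equivalent of lfit in more recent Stata releases
* (run immediately after the logistic command):
estat gof
```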
2 The group(10) option causes lfit to calculate the Hosmer–Lemeshow goodness-of-fit test with the study subjects subdivided into g = 10 groups.
The table option displays information about these groups. The columns in the subsequent table are defined as follows: _Group = k is the group number, _Prob is the maximum value of $\hat{\pi}_j$ in the kth group, _Obs_1 = $o_k$ is the observed number of events in the kth group, _Exp_1 = $\sum n_j \hat{\pi}_j$ is the expected number of events in the kth group (the sum runs over the covariate patterns in that group), _Obs_0 = $m_k - o_k$ is the number of subjects who did not have events in the kth group, _Exp_0 = $m_k - \sum n_j \hat{\pi}_j$ is the expected number of subjects who did not have events in the kth group, and _Total = $m_k$ is the total number of subjects in the kth group. The Hosmer–Lemeshow goodness-of-fit statistic equals 4.73 with eight degrees of freedom. The P value associated with this test is 0.79.
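The arithmetic behind comment 2 can be sketched with display. The first line recomputes the P value from the statistic and its g - 2 = 8 degrees of freedom. The second shows how a single group contributes to the statistic, using the usual (observed - expected)^2 / expected form for the event and non-event columns. Because the table entries are rounded, summing such terms over all ten groups only approximates the reported 4.73.

```stata
* P value for the Hosmer-Lemeshow statistic with g - 2 = 8 df
display chi2tail(10 - 2, 4.73)

* Contribution of group 5, using the rounded table entries
* (observed 5 and 81, expected 7.8 and 78.2)
display (5 - 7.8)^2/7.8 + (81 - 78.2)^2/78.2
```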
3 The p option in this predict command defines the variable p to equal $\hat{\pi}_j$. In this and the next two predict commands the name of the newly defined variable is the same as the command option.
4 Define the variable dx2 to equal $\Delta X_j^2$. All records with the same covariate pattern are given the same value of dx2.
5 Define rstandard to equal the standardized Pearson residual $r_{sj}$.
6 We are going to draw a scatterplot of $\Delta X_j^2$ against $\hat{\pi}_j$. We would like to color code the plotting symbols to indicate whether the residual is positive or negative. This command defines dx2_pos to equal $\Delta X_j^2$ if and only if $r_{sj}$ is non-negative. The next command defines dx2_neg to equal $\Delta X_j^2$ if $r_{sj}$ is negative. See comment 8 below.
7 Define the variable dbeta to equal $\Delta\hat{\beta}_j$. The values of dx2, dbeta and rstandard are affected by the number of subjects with a given covariate pattern, and the number of events that occur to these subjects. They are not affected by the number of records used to record this information. Hence, it makes no difference whether there is one record per patient or just two records specifying the number of subjects with the specified covariate pattern who did, or did not, suffer the event of interest.
8 This graph produces a scatterplot of $\Delta X_j^2$ against $\hat{\pi}_j$ that is similar to Figure 5.1. The [weight=bubble] command modifier causes the plotting symbols to be circles whose area is proportional to the variable bubble. (We set bubble equal to $1.5 \times \Delta\hat{\beta}_j$ following the recommendation of Hosmer and Lemeshow (1989) for these residual plots.) We plot both dx2_pos and dx2_neg against p in order to be able to assign different pen colors to values of $\Delta X_j^2$ that are associated with positive or negative residuals.
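The graph syntax in the log belongs to an older Stata release. A rough equivalent for more recent releases might look like the sketch below, which assumes the variables p, dx2_pos, dx2_neg and bubble already exist as defined in the log. The axis titles are illustrative wording and do not come from the source, and the /// line continuation works only in a do-file.

```stata
* Two overlaid scatterplots so that positive and negative residuals
* receive different default colors. With aweights the marker size
* grows with bubble; Oh requests large hollow circles.
twoway (scatter dx2_pos p [aweight=bubble], msymbol(Oh))  ///
       (scatter dx2_neg p [aweight=bubble], msymbol(Oh)), ///
       yline(3.84) xlabel(0.1(0.1)1.0) ylabel(0(1)8)      ///
       ytitle("Squared standardized Pearson residual")    ///
       xtitle("Estimated probability")
```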
9 We need to identify and delete patients with covariate patterns A and B in Figure 5.1. Before doing this we save the current data file so that we can restore it to its current form when needed. This save command saves the data in a file called temporary, which is located in the Stata default file folder.
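An alternative to saving a temporary file, not used in this log, is Stata's preserve and restore pair, which snapshots the data in memory and later returns to that snapshot. A minimal sketch, assuming the same variables as in the log:

```stata
* Take a snapshot of the data currently in memory
preserve
drop if patients == 0
* (collapse and list the covariate patterns here, as in the log)
* Return to the snapshot taken by preserve
restore
```

Unlike save followed by use, this leaves no file on disk. The snapshot is consumed by the restore command, so a second deletion (for example, of pattern B) needs its own preserve.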
10 Delete covariate patterns that do not pertain to any patients in the study.
11 The collapse (sum) command reduces the data to one record for each unique combination of values for the variables listed in the by option. This command defines n and ca to be the sum of patients and ca_no, respectively, over all records with identical values of age, alcohol, smoke, dbeta, dx2 and p. In other words, for each specific pattern of these covariates, n is the number of patients and ca is the number of cancer cases with this pattern. All other covariates that are not included in the by option are deleted from memory. The covariates age, alcohol and smoke uniquely define the covariate pattern. The variables dbeta, dx2 and p are the same for all patients with the same covariate pattern. However, we include them in this by statement in order to be able to list them in the following command.
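Comment 11 relies on dbeta, dx2 and p being constant within each covariate pattern. One way to confirm this before collapsing is sketched below. It assumes the data are still in their pre-collapse form, with records having patients == 0 already dropped. Note that bysort re-sorts the data.

```stata
* Each assert stops with an error if the statistic varies within any
* combination of age, alcohol and smoke.
bysort age alcohol smoke: assert dx2   == dx2[1]
bysort age alcohol smoke: assert dbeta == dbeta[1]
bysort age alcohol smoke: assert p     == p[1]
```

If all three pass, adding these variables to the by option of collapse cannot split a covariate pattern into more than one record.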
12 List the covariate values and other variables for all covariate patterns for which $\Delta X_j^2 > 3.84$. The two largest values of $\Delta\hat{\beta}_j$ are highlighted. The record with $\Delta\hat{\beta}_j = 6.16$ corresponds to squared residual A in Figure 5.1. Patients with the covariate pattern associated with this residual are age 55–64, drink at least 120 gm of alcohol and smoke less than 10 gm of tobacco a day. Squared residual B has $\Delta\hat{\beta}_j = 4.16$. The associated residual pattern is for patients aged 55–64 who drink 0–39 gm alcohol and smoke ≥30 gm tobacco a day.
The nodisplay option forces the output to be given in tabular format rather than display format. Display format looks better when there are lots of variables but requires more lines per patient.
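The cutoff of 3.84 used in the list command and in the yline option matches the 5% critical value of a chi-squared distribution with one degree of freedom. This reading is an inference from the value itself, since the present section does not state it. It can be checked as follows:

```stata
* 95th percentile of chi-squared with 1 degree of freedom (about 3.84)
display invchi2tail(1, 0.05)

* The same value as the square of the two-sided 5% normal critical value
display invnormal(0.975)^2
```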
13 Restore the complete data file that we saved earlier.
14 Delete records with covariate pattern A. That is, the record is deleted if age = 4 and alcohol = 4 and smoke = 1. These coded values correspond to age 55–64, ≥120 gm alcohol, and 0–9 gm tobacco, respectively.
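The three conditions in comment 14 must be joined with the & (logical and) operator in the if qualifier. A cautious sketch counts the matching records first, so that the number can be compared with the "(2 observations deleted)" message in the log:

```stata
* Count the records for covariate pattern A before deleting them
count if age == 4 & alcohol == 4 & smoke == 1

* Delete them; all three conditions must hold for a record to be dropped
drop  if age == 4 & alcohol == 4 & smoke == 1
```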
15 Analyze the data with covariate pattern A deleted using model (5.40).
The highlighted odds ratios in the subsequent output are also given in column 5 of Table 5.6.
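The xi: prefix and the _I* variable names are the conventions of the Stata release that produced this log. In more recent releases the same model could probably be specified with factor-variable notation, as in the sketch below. It assumes the coded variables age, alcohol and smoke from the log and is not the author's own code.

```stata
* Assumed modern equivalent of
*   xi: logistic cancer i.age i.alcohol*i.smoke [freq=patients]
logistic cancer i.age i.alcohol##i.smoke [fweight=patients]

* Odds ratio for alcohol level 2 combined with smoke level 2,
* relative to the reference levels of both variables
lincom 2.alcohol + 2.smoke + 2.alcohol#2.smoke, or
```

Here 2.alcohol#2.smoke plays the role of _IalcXsmo_2_2 in the log.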
16 Restore complete database.
17 Delete records with covariate pattern B.
18 Analyze the data with covariate pattern B deleted using model (5.40).
The highlighted odds ratios in the subsequent output are also given in column 7 of Table 5.6.
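The lincom output reports the standard error on the odds-ratio scale, while the z statistic and confidence limits are computed on the log scale. The sketch below reconstructs the first lincom row obtained with pattern A deleted (odds ratio 9.298176, standard error 3.811849). It assumes the usual delta-method relation, in which the standard error of the odds ratio equals the odds ratio times the standard error of its logarithm. The scalar names are illustrative.

```stata
* Values copied from the first lincom output with pattern A deleted
scalar or_hat = 9.298176
scalar se_or  = 3.811849

* Standard error of the log odds ratio under the delta method
scalar se_log = se_or/or_hat

* z statistic, then lower and upper 95% confidence limits
display ln(or_hat)/se_log
display exp(ln(or_hat) - invnormal(0.975)*se_log)
display exp(ln(or_hat) + invnormal(0.975)*se_log)
```

The three displayed values should reproduce the z statistic and confidence interval in that row, up to rounding.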