Frequently, data sets will contain missing values of some covariates. Most regression programs, including those of Stata, deal with missing values by excluding all records with missing values in any of the covariates. This can result in the discarding of substantial amounts of information and a considerable loss of power. Some statisticians recommend using methods of data imputation to estimate the values of missing covariates. The basic idea of these methods is as follows. Suppose that $x_{ij}$ represents the $j$th covariate value on the $i$th patient, and that $x_{kj}$ is missing for the $k$th patient. We first identify all patients who have non-missing values of both the $j$th covariate
196 5. Multiple logistic regression
and all other covariates that are available on the $k$th patient. Using this patient subset, we regress $x_{ij}$ against these other covariates. We use the results of this regression to predict the value of $x_{kj}$ from the other known covariate values of the $k$th patient. This predicted value is called the imputed value of $x_{kj}$. This process is then repeated for all of the other missing covariates in the data set. The imputed covariate values are then used in the final regression in place of the missing values.
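The imputation steps just described can be sketched in Stata as follows. This is only a minimal illustration under stated assumptions: the data are made up, and the variable names `x`, `w`, `xhat` and `x_imp` are hypothetical and do not come from any study data set. The `regress` command fits the model on the records with complete data, and `predict` then supplies a fitted value for the record in which `x` is missing.

```stata
* Toy data (hypothetical): x is missing for the fourth record
clear
input x w
 2.1 1
 3.9 2
 6.2 3
   . 4
10.1 5
end

* Regress x on the other covariate using the complete records only
regress x w

* Predicted (fitted) values for every record with a known w
predict xhat

* Keep observed values; use the prediction only where x is missing
generate x_imp = x
replace x_imp = xhat if missing(x)
list
```

With several covariates, the same pattern would be repeated for each covariate that has missing values.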
These methods work well if the values of $x_{ij}$ that are available are a representative sample from the entire target population. Unfortunately, this is often not the case in medical studies. Consider the following example.
5.32.1. Cardiac Output in the Ibuprofen in Sepsis Study
An important variable for assessing and managing severe pulmonary morbidity is oxygen delivery, which is the rate at which oxygen is delivered to the body by the lungs. Oxygen delivery is a function of cardiac output and several other variables (Marini and Wheeler, 1997). Unfortunately, cardiac output can only be reliably measured by inserting a catheter into the pulmonary artery. This is an invasive procedure that is only performed in the sickest patients. In the Ibuprofen in Sepsis study, baseline oxygen delivery was measured in 37% of patients. However, we cannot assume that the oxygen delivery was similar in patients who were, or were not, catheterized.
Hence, any analysis that assesses the influence of baseline oxygen delivery on 30 day mortality must take into account the fact that this covariate is only known on a biased sample of study subjects.
Let us restrict our analyses to patients who are either black or white.
Consider the model
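\[
\operatorname{logit}[E[d_i \mid x_i, y_i]] = \alpha + \beta_1 x_i + \beta_2 y_i, \tag{5.55}
\]
where
\[
d_i = \begin{cases} 1 & \text{if the } i\text{th patient dies within 30 days} \\ 0 & \text{otherwise,} \end{cases}
\]
\[
x_i = \begin{cases} 1 & \text{if the } i\text{th patient is black} \\ 0 & \text{otherwise,} \end{cases}
\]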
and $y_i$ is the rate of oxygen delivery for the $i$th patient. The analysis of model (5.55) excludes patients with missing oxygen delivery. This, together with the exclusion of patients of other races, restricts this analysis to 161 of the 455 subjects in the study. The results of this analysis are given in the left-hand side of Table 5.7. The mortality odds ratio for blacks of 1.38 is not significantly different from one, and the confidence interval for this odds ratio is wide. As one would expect, survival improves with increasing oxygen delivery (P = 0.01).

Table 5.7. Effect of race and baseline oxygen delivery on mortality in the Ibuprofen in Sepsis study. Oxygen delivery can only be reliably measured in patients with pulmonary artery catheters. In the analysis of model (5.55), 262 patients were excluded because of missing oxygen delivery. These patients were retained in the analysis of model (5.56). In this latter model, black patients had a significantly higher mortality than white patients, and uncatheterized patients had a significantly lower mortality than those who were catheterized. In contrast, race did not significantly affect mortality in the analysis of model (5.55) (see text).

                                        Model (5.55)                      Model (5.56)
                               Odds    95% confidence   P        Odds    95% confidence   P
Risk factor                    ratio   interval         value    ratio   interval         value
Race
  White                        1.0∗                              1.0∗
  Black                        1.38    0.60–3.2         0.45     1.85    1.2–2.9          0.006
Unit increase in oxygen
  delivery†                    0.9988  0.9979–0.9997    0.01     0.9988  0.9979–0.9997    0.01
Pulmonary artery catheter
  Yes                                                            1.0∗
  No                                                             0.236   0.087–0.64       0.005

∗Denominator of odds ratio
†Oxygen delivery is missing in patients who did not have a pulmonary artery catheter
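The odds ratio of 0.9988 for oxygen delivery in Table 5.7 refers to a one-unit increase, which is why it lies so close to one. As a brief worked illustration (the choice of a 100-unit increase is arbitrary and is not taken from the study), the logistic model implies that the odds ratio for a $c$-unit increase is the one-unit odds ratio raised to the power $c$:

```latex
\[
\mathrm{OR}_c \;=\; \exp(c\,\beta_2) \;=\; \bigl(e^{\beta_2}\bigr)^{c},
\qquad\text{so, for example,}\qquad
\mathrm{OR}_{100} \;\approx\; 0.9988^{100} \;\approx\; 0.89 .
\]
```

On this scale the association is easier to read: under the fitted model, a 100-unit increase in oxygen delivery corresponds to roughly an 11% reduction in the odds of death.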
In this study, oxygen delivery was measured in every patient who received a pulmonary artery catheter. Hence, a missing value for oxygen delivery indicates that the patient was not catheterized. A problem with model (5.55) is that it excludes 262 patients of known race because they did not have their oxygen delivery measured. A better model is
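\[
\operatorname{logit}[E[d_i \mid x_i, y_i, z_i]] = \alpha + \beta_1 x_i + \beta_2 y_i + \beta_3 z_i, \tag{5.56}
\]
where $d_i$ and $x_i$ are as in model (5.55),
\[
y_i = \begin{cases} \text{the oxygen delivery for the } i\text{th patient} & \text{if measured} \\ 0 & \text{if oxygen delivery was not measured, and} \end{cases}
\]
\[
z_i = \begin{cases} 1 & \text{if oxygen delivery was not measured for the } i\text{th patient} \\ 0 & \text{otherwise.} \end{cases}
\]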
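The definitions of $y_i$ and $z_i$ translate directly into two derived variables. The following is a compact sketch on made-up data. The names `o2del`, `o2del1` and `o2mis` follow the Stata log in Section 5.32.2, while the four data values are hypothetical. It uses Stata's `missing()` and `cond()` functions rather than the `generate` and `replace` pairs used in that log; both approaches should give the same variables.

```stata
* Toy data (hypothetical): oxygen delivery is missing for two records
clear
input o2del
420
.
980
.
end

* z_i: 1 if oxygen delivery was not measured, 0 otherwise
generate o2mis = missing(o2del)

* y_i: the measured value, or 0 when it was not measured
generate o2del1 = cond(missing(o2del), 0, o2del)
list
```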
An analysis of model (5.56) gives the odds ratio estimates in the right half of Table 5.7. Note that the mortality odds ratio for blacks is higher than in model (5.55) and is significantly different from one. The confidence interval for this odds ratio is substantially narrower than in model (5.55) because it is based on all 423 subjects rather than just the 161 patients who were catheterized. The odds ratio associated with oxygen delivery is the same in both models to the precision shown in Table 5.7. This is because $\beta_2$ only enters the likelihood function through the linear predictor, and $y_i$ is always 0 when oxygen delivery is missing. Hence, in model (5.56), patients with missing oxygen delivery contribute nothing directly to the maximum likelihood estimate of $\beta_2$; they can affect it only indirectly, through the shared parameters $\alpha$ and $\beta_1$.
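The argument about $\beta_2$ can be made explicit with the score equation. The notation $\ell$ for the log likelihood, $\eta_i$ for the linear predictor and $\pi_i$ for the fitted probability of death is introduced here for illustration and is not used elsewhere in this section.

```latex
\[
\ell = \sum_i \Bigl[ d_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr],
\qquad
\eta_i = \alpha + \beta_1 x_i + \beta_2 y_i + \beta_3 z_i,
\qquad
\pi_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}},
\]
\[
\frac{\partial \ell}{\partial \beta_2}
  = \sum_i y_i \,(d_i - \pi_i)
  = \sum_{i:\, z_i = 0} y_i \,(d_i - \pi_i),
\]
```

since $y_i = 0$ whenever $z_i = 1$. Patients without a measured oxygen delivery therefore contribute no terms to the score equation for $\beta_2$. The fitted probabilities $\pi_i$ of the catheterized patients still depend on $\alpha$ and $\beta_1$, which are estimated from all patients. This is consistent with the small difference between the two logged odds ratios for oxygen delivery in Section 5.32.2 (.9988218 versus .9987949).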
It is particularly noteworthy that the odds ratio associated with $z_i$ is both highly significant and substantially less than one. This means that patients who were not catheterized were far less likely to die than patients who were catheterized. Thus, we need to be very cautious in interpreting the meaning of the significant odds ratio for oxygen delivery. We can only say that increased oxygen delivery was beneficial among those patients in whom it was measured. The effect of oxygen delivery on mortality among the uncatheterized patients may be quite different, since this group had a much better prognosis. For example, it is possible that oxygen delivery in the uncatheterized is sufficiently good that variation in the rate of oxygen delivery has little effect on mortality. Using a data imputation method for these data would be highly inappropriate.
This analysis provides evidence that blacks have a higher mortality rate from sepsis than whites and catheterized patients have higher mortality than uncatheterized patients. It says nothing, however, about why these rates differ. As a group, blacks may differ from whites with respect to the etiology of their sepsis and the time between onset of illness and admission to hospital. Certainly, critical care physicians do not catheterize patients unless they consider it necessary for their care, and it is plausible that patients who are at the greatest risk of death are most likely to be monitored in this way.
5.32.2. Modeling Missing Values with Stata
The following Stata log file regresses death within 30 days against race and baseline oxygen delivery in the Ibuprofen in Sepsis study using models (5.55) and (5.56).
. * 5.32.2.Sepsis.log
. *
. * Regress fate against race and oxygen delivery in black and
. * white patients from the Ibuprofen in Sepsis study (Bernard et al., 1997).
. *
. use C:\WDDtext\1.4.11.Sepsis.dta, clear

. keep if race <2                                                          {1}
(32 observations deleted)

. logistic fate race o2del                                                 {2}

Logit estimates                                   Number of obs   =        161
                                                  LR chi2(2)      =       7.56
                                                  Prob > chi2     =     0.0228
Log likelihood = -105.19119                       Pseudo R2       =     0.0347

------------------------------------------------------------------------------
    fate | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
    race |   1.384358   .5933089      0.759   0.448      .5976407    3.206689
   o2del |   .9988218   .0004675     -2.519   0.012      .9979059    .9997385
------------------------------------------------------------------------------

. *
. * Let o2mis indicate whether o2del is missing. Set o2del1 = o2del when
. * oxygen delivery is available and = 0 when it is not.
. *
. generate o2mis = 0

. replace o2mis = 1 if o2del == .                                          {3}
(262 real changes made)

. generate o2del1 = o2del
(262 missing values generated)

. replace o2del1 = 0 if o2del == .
(262 real changes made)

. logistic fate race o2del1 o2mis                                          {4}

Logit estimates                                   Number of obs   =        423
                                                  LR chi2(3)      =      14.87
                                                  Prob > chi2     =     0.0019
Log likelihood = -276.33062                       Pseudo R2       =     0.0262

------------------------------------------------------------------------------
    fate | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
    race |   1.847489   .4110734      2.759   0.006      1.194501    2.857443
  o2del1 |   .9987949   .0004711     -2.557   0.011      .9978721    .9997186
   o2mis |   .2364569   .1205078     -2.829   0.005      .0870855    .6420338
------------------------------------------------------------------------------
Comments
1 The values of race are 0 and 1 for whites and blacks, respectively. This statement excludes patients of other races from our analyses.
2 The variable o2del denotes baseline oxygen delivery. We regress fate against race and o2del using model (5.55).
3 Missing values are represented in Stata by a period; "o2del == ." is true when o2del is missing.
4 We regress fate against race, o2del1 and o2mis using model (5.56).
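As a check on how the columns of the logistic output relate to one another, the z statistic and confidence interval for race in model (5.55) can be recomputed from the logged odds ratio and its standard error. The sketch below assumes that the standard error of the log odds ratio equals the reported standard error divided by the odds ratio (the usual delta-method relationship) and that the interval is symmetric on the log scale. The scalar names `oratio`, `se_or`, `se_b`, `z`, `lo` and `hi` are hypothetical.

```stata
* Odds ratio and standard error for race, copied from the log above
scalar oratio = 1.384358
scalar se_or  = .5933089

* Standard error on the log-odds scale (delta-method assumption)
scalar se_b = se_or / oratio

* z statistic and 95% confidence limits
scalar z  = ln(oratio) / se_b
scalar lo = exp(ln(oratio) - invnormal(0.975) * se_b)
scalar hi = exp(ln(oratio) + invnormal(0.975) * se_b)

display "z     = " z
display "lower = " lo
display "upper = " hi
```

The three displayed values should agree closely with the z value and confidence limits printed for race in the first logistic table. Substituting the values from any other row of either table applies the same check to that row.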