Estimating the hazard function

sts graph can also be used to plot an estimate of the hazard function, h(t). Because the hazard is the derivative of the cumulative hazard, H(t), it would seem straightforward to estimate the hazard itself. However, examination of figure 8.6 and the subsequent graphs reveals that the estimated cumulative hazards available to us are step functions and thus cannot be directly differentiated. That is not to say that we cannot take figure 8.6 and picture in our minds what the derivative of the cumulative hazard would look like; for the control group, it would be fairly linear (because the cumulative hazard is parabolic), and for the treatment group, the derivative would start off as constant for some time (because the cumulative hazard is initially linear) and then increase.

We can estimate the hazard by taking the steps of the Nelson-Aalen cumulative hazard and smoothing them with a kernel smoother. More precisely, for each observed death time, tj, if we define the estimated hazard contribution to be

Chapter 8 Nonparametric analysis

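$$\Delta\widehat{H}(t_j) = \widehat{H}(t_j) - \widehat{H}(t_{j-1})$$

we can obtain these hazard contributions using sts generate newvar = h. Then we can estimate h(t) with

$$\widehat{h}(t) = b^{-1} \sum_{j=1}^{D} K_t\!\left(\frac{t - t_j}{b}\right) \Delta\widehat{H}(t_j)$$

for some kernel function $K_t()$ and bandwidth $b$; the summation is over the $D$ times at which failure occurs (Klein and Moeschberger 2003, 167; Muller and Wang 1994).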
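To make the smoothing step concrete, here is a minimal Python sketch, not Stata's implementation. It assumes the Nelson-Aalen increment at each failure time is d_j/n_j (failures over number at risk), uses a plain Gaussian kernel with no boundary correction, and runs on made-up toy data; the function names are ours.

```python
import math

def nelson_aalen_increments(times, events):
    """Return (failure_times, increments); increment j is d_j / n_j."""
    data = sorted(zip(times, events))
    n = len(data)
    fail_times, increments = [], []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i              # subjects with observed time >= t
        d = 0
        j = i
        while j < n and data[j][0] == t:
            d += data[j][1]          # event flag: 1 = failure, 0 = censored
            j += 1
        if d > 0:
            fail_times.append(t)
            increments.append(d / at_risk)
        i = j
    return fail_times, increments

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def smoothed_hazard(t, fail_times, increments, b):
    """Kernel-weighted sum of the cumulative-hazard steps, divided by b."""
    total = sum(gaussian_kernel((t - tj) / b) * dh
                for tj, dh in zip(fail_times, increments))
    return total / b

# Toy data (hypothetical): analysis times and failure indicators.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
ft, dh = nelson_aalen_increments(toy_times, toy_events)
curve = [(t, smoothed_hazard(t, ft, dh, b=1.0)) for t in range(0, 7)]
```

Because the Gaussian kernel integrates to one, the area under the smoothed curve equals the sum of the hazard contributions, which is a quick sanity check on any implementation.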

This whole process can be automated by specifying option hazard to sts graph.

Using our hip-fracture data, we can graph the estimated hazards for both the treatment and control groups as follows:

. use http://www.stata-press.com/data/cggm3/hip2, clear
(hip fracture study)

. sts graph, hazard by(protect) kernel(gaussian) width(4 5)

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

This sts graph command produces figure 8.9. The graph agrees with our informal analysis of the Nelson-Aalen cumulative hazards. In applying the kernel smoother, we specified a Gaussian (normal) kernel function and bandwidths of four for the control group (protect==0) and five for the treatment group (protect==1), although defaults would have been provided had we not specified these; see [R] kdensity for a list of available kernel functions and their definitions.

[Figure: smoothed hazard estimates plotted against analysis time for protect = 0 and protect = 1]

Figure 8.9. Smoothed hazard functions

In practice, you will often find that the plotting ranges of smoothed hazard curves are narrower than those of their cumulative-hazard and survivor counterparts. Kernel smoothing requires averaging values over a moving window of data. Near the endpoints of the plotting range, these windows contain insufficient data for accurate estimation, and the resulting estimators are said to contain boundary bias. The standard approach to dealing with this problem is simply to restrict the plotting range so that estimates exhibiting boundary bias are not displayed. An alternative approach, which results in wider plotting ranges, is to use a boundary kernel; see the technical note below for details.

We should clarify the notation for the kernel function $K_t()$ we used in the above formula. The conventional kernel estimator uses a symmetric kernel function $K()$. Applying this estimator directly to obtain a hazard function smoother results in biased estimates in the boundary regions near the endpoints.

Consider the left and right boundary regions, $B_L = \{t : t_{\min} \le t < b\}$ and $B_R = \{t : t_{\max} - b < t \le t_{\max}\}$, respectively, where $t_{\min}$ and $t_{\max}$ are the minimum and maximum observed failure times. Using a symmetric kernel in these regions leads to biased estimates because there are no failures observed before time $t_{\min}$ nor after time $t_{\max}$; that is, the support of the kernel exceeds the available range of the data. A more appropriate choice is to use in the boundary region some asymmetric kernel function, referred to as a boundary kernel, $K_{\rm bnd}()$. This method consists of using a symmetric kernel for time points in the interior region, $K_t() = K()$, and a respective boundary kernel for time points in the boundary regions, $K_t() = K_{\rm bnd}()$. The method of boundary kernels is described more thoroughly in Gray (1990), Muller and Wang (1994), and Klein and Moeschberger (2003, 167), to name a few.

sts graph, hazard uses the boundary adjustments suggested in Muller and Wang (1994) with the epan2, biweight, and rectangular kernels. For other kernels (see [R] kdensity), no boundary adjustment is made. Instead, the default graphing range is constrained to be the range $[t_{\min} + b, t_{\max} - b]$, following the advice of Breslow and Day (1987). You can also request that no boundary-bias adjustment be made at all by specifying option noboundary.
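The region definitions and the restricted graphing range can be expressed in a few lines. The following Python sketch is only an illustration of the set definitions in this note; the function names and the numeric values (failure times from 1 to 39, bandwidth 4) are hypothetical and are not taken from Stata.

```python
def region(t, t_min, t_max, b):
    """Classify a time point as 'left', 'right', or 'interior',
    following the boundary-region definitions B_L and B_R."""
    if t_min <= t < b:
        return "left"
    if t_max - b < t <= t_max:
        return "right"
    return "interior"

def restricted_range(t_min, t_max, b):
    """Graphing range used when no boundary kernel is available."""
    return (t_min + b, t_max - b)

# Hypothetical values for illustration only.
labels = [region(t, 1, 39, 4) for t in (2, 20, 37)]
plot_range = restricted_range(1, 39, 4)
```

A symmetric kernel would be applied to points labeled interior, and a boundary kernel (or no estimate at all) to the others; a larger bandwidth widens both boundary regions and so shrinks the restricted range.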

We change the kernel to epan2 in the above to obtain figure 8.10.

. sts graph, hazard by(protect) kernel(epan2) width(6 7)

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

We also specified larger bandwidths in width(6 7). The modified (boundary) epan2 kernel is used to correct for the left- and right-boundary bias. Notice the wider plotting range compared with the Gaussian example.

[Figure: smoothed hazard estimates plotted against analysis time for protect = 0 and protect = 1]

Figure 8.10. Smoothed hazard functions with the modified epan2 kernel for the left and right boundaries

There are few subjects remaining at risk past analysis time 15 in the control group (protect==0). This leads to a poor estimate of the hazard function in that region; Hess, Serachitopol, and Brown (1999) found that kernel-based hazard function estimators tend to perform poorly if there are fewer than 10 subjects at risk. There are 11 subjects at risk at time 8 and 7 subjects at risk at time 11 in the control group. It would be reasonable to limit the graphing range to, say, time 10 (option tmax(10)) when graphing the hazard function for the control group.

One interesting feature of smoothed hazard plots is that you can assess the assumption of proportional hazards (the importance of which will be discussed in chapter 9 on the Cox model) by plotting the estimated hazards on a log scale.

. sts graph, hazard by(protect) kernel(gaussian) width(4 5) yscale(log)

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

By examining figure 8.11, we find the lines to be somewhat parallel, meaning that the proportionality assumption is violated only slightly. When hazards are proportional, the proportionality can be exploited by using a Cox model to assess the effects of treatment more efficiently; see chapter 9.
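The reason the log scale is informative can be written in one line. Here $h_0(t)$ and $h_1(t)$ denote the hazards of the control and treatment groups, and $c$ is an assumed constant hazard ratio; this notation is introduced only for illustration.

```latex
h_1(t) = c\,h_0(t)
\quad\Longrightarrow\quad
\ln h_1(t) = \ln c + \ln h_0(t)
```

Under proportional hazards, the two log-hazard curves therefore differ by the constant vertical shift $\ln c$ at every $t$, which is why approximately parallel lines on the log scale are consistent with the assumption.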

Estimating mean and median survival times

[Figure: smoothed hazard estimates on a log scale plotted against analysis time for protect = 0 and protect = 1]

Figure 8.11. Smoothed hazard functions, log scale


As we mentioned in section 8.1, standard univariate methods used to estimate means and percentiles may not be appropriate with complex survival data. In section 2.4, we gave definitions of the mean and median survival times as functions of the survivor function. These relationships form the basis of the survival methods for estimating means and percentiles.

Recall the definition of the median survival time, $t_{50} = \widetilde{\mu}_T$, as the time beyond which 50% of subjects are expected to survive, that is, $S(\widetilde{\mu}_T) = 0.5$. A natural way to estimate the median then is to use the Kaplan-Meier estimator $\widehat{S}(t)$ in place of $S(t)$. Because this estimator is a step function, the nonparametric estimator of the median survival time is defined to be

$$\widehat{\widetilde{\mu}}_T = \min\{t_i : \widehat{S}(t_i) \le 0.5\}$$

More generally, the estimate of the pth percentile of survival times, $t_p$, is obtained as the smallest observed time $t_i$ for which $\widehat{S}(t_i) \le 1 - p/100$ for any $p$ between 0 and 100.
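The percentile rule is easy to state in code. The following Python sketch computes a Kaplan-Meier curve on made-up toy data and then returns the smallest failure time at which the estimated survivor function is at or below 1 - p/100. The function names are ours, and this is an illustration of the definition rather than what stci does internally.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct failure time."""
    data = sorted(zip(times, events))
    n = len(data)
    curve, s, i = [], 1.0, 0
    while i < n:
        t = data[i][0]
        at_risk = n - i              # subjects with observed time >= t
        d = 0
        j = i
        while j < n and data[j][0] == t:
            d += data[j][1]          # event flag: 1 = failure, 0 = censored
            j += 1
        if d > 0:
            s *= 1 - d / at_risk
            curve.append((t, s))
        i = j
    return curve

def percentile(curve, p):
    """Smallest failure time with S(t) <= 1 - p/100, or None if never reached."""
    target = 1 - p / 100
    for t, s in curve:
        if s <= target:
            return t
    return None

# Toy data (hypothetical): analysis times and failure indicators.
km = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
median = percentile(km, 50)
```

When the estimated survivor function never drops to the target level, for example because of heavy censoring, the function returns None; the percentile is then not estimable from the data.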

In Stata, you can use stci to obtain the median survival time, which is reported by default, or use the p(#) option with stci to obtain any other percentile. For our hip-fracture data, we can obtain median survival times for both the treatment and control groups as follows:

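. use http://www.stata-press.com/data/cggm3/hip2
(hip fracture study)

. stci, by(protect)

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

             |    no. of
protect      |  subjects         50%     Std. Err.     [95% Conf. Interval]
-------------+-------------------------------------------------------------
           0 |        20           8     2.077294             4         12
           1 |        28          28     4.452617            22          .
-------------+-------------------------------------------------------------
       total |        48          16     4.753834            11         23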

From the output, the estimated median in the control group is 8 months, and in the treatment group the median is 2.3 years (28 months). This agrees with the graphs of the survivor curves given in figure 8.2.

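The large-sample standard errors reported above are based on the following formula given by Collett (2003, 35) and Klein and Moeschberger (2003, 122):

$$\widehat{\mathrm{SE}}(\widehat{t}_p) = \frac{\sqrt{\widehat{\mathrm{Var}}\{\widehat{S}(t_p)\}}}{\widehat{f}(t_p)}$$

where $\widehat{\mathrm{Var}}\{\widehat{S}(t_p)\}$ is the Greenwood pointwise variance estimate (8.2) and $\widehat{f}(t_p)$ is the estimated density function at the pth percentile.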

Confidence intervals, however, are not calculated based on these standard errors. For a given confidence level $\alpha$, they are obtained by inverting the respective $100(1-\alpha)\%$ confidence interval for $S(t_p)$ based on a $\ln\{-\ln S(t)\}$ transformation, as given in section 8.2.1. That is, the confidence interval for the pth percentile $t_p$ is defined by the pair $(t_L, t_U)$ such that $P\{L(t_L) \le S(t_p) \le U(t_U)\} = 1 - \alpha$, where $L(\cdot)$ and $U(\cdot)$ are the lower and upper pointwise confidence limits of $S(t)$; see section 8.2.1.

Computationally, $t_L$ and $t_U$ are estimated as the smallest observed times at which the lower and upper confidence limits for $S(t)$, respectively, are less than or equal to $1 - p/100$. For a review of other methods of obtaining confidence intervals see, for example, Collett (2003), Klein and Moeschberger (2003), and Keiding and Andersen (2006).

In the above example, the 95% confidence interval for the median survival time of the protect==0 group is (4, 12) (months) and of the protect==1 group is (22, .). A missing value (dot) for the upper confidence limit for the protect==1 group indicates that this limit could not be determined. For this group, the estimated upper confidence limit of the survivor function never falls below 0.5:
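The inversion rule can be sketched as a simple search over the pointwise confidence limits. The Python function names below are ours. The input values are the treatment-group limits at the three grid times for which sts list, at(5) reports them, so the grid is coarser than the full set of failure times that stci uses, and the lower limit found here (27) differs from the 22 reported by stci. The point of the sketch is the missing upper limit.

```python
def first_time_at_or_below(times, values, target):
    """Smallest listed time whose value is <= target, or None if there is none."""
    for t, v in zip(times, values):
        if v <= target:
            return t
    return None

def percentile_ci(times, lower, upper, p):
    """Invert pointwise limits of S(t) into limits (t_L, t_U) for the pth percentile."""
    target = 1 - p / 100
    t_low = first_time_at_or_below(times, lower, target)
    t_up = first_time_at_or_below(times, upper, target)
    return (t_low, t_up)

# Coarse grid of pointwise 95% limits for protect==1 (from sts list, at(5)).
grid_times = [16, 27, 38]
lower_limits = [0.5242, 0.2824, 0.1135]
upper_limits = [0.8798, 0.7120, 0.5871]
ci = percentile_ci(grid_times, lower_limits, upper_limits, 50)
```

The second element of ci is None because the upper limit never reaches 0.5, which corresponds to the dot that stci prints for the upper confidence limit.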

. sts list if protect==1, at(5)

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

              Beg.                 Survivor       Std.
    Time     Total      Fail       Function      Error     [95% Conf. Int.]
---------------------------------------------------------------------------
       5        28         0         1.0000          .          .         .
      16        18         6         0.7501     0.0891     0.5242    0.8798
      27        10         4         0.5193     0.1141     0.2824    0.7120
      38         2         2         0.3408     0.1318     0.1135    0.5871
      49         1         0              .          .          .         .
---------------------------------------------------------------------------
Note: survivor function is calculated over full data and evaluated at
      indicated times; it is not calculated from aggregates shown at left.

Although the median survival time is commonly used to estimate the location of the survival distribution because it tends to be right skewed, the mean of the distribution may also be of interest in some applications.

The mean $\mu_T$ is defined as an integral from zero to infinity of the survivor function $S(t)$. Similar to the median estimation, a natural way of estimating the mean then is to plug in the Kaplan-Meier estimator $\widehat{S}(t)$ for $S(t)$ in the integral expression. The nonparametric estimator of the mean survival time is defined as follows:

$$\widehat{\mu}_T = \int_0^{t_{\max}} \widehat{S}(t)\,dt$$

where $t_{\max}$ is the maximum observed failure time. The integral above is restricted to the range $[0, t_{\max}]$ because the Kaplan-Meier estimator is not defined beyond the largest observed failure time. Therefore, the mean estimated by using the above formula is often referred to as a restricted mean. A restricted mean $\widehat{\mu}_T$ will underestimate the true mean $\mu_T$ if the last observed analysis time is censored.

The standard error for the estimated restricted mean is given by Klein and Moeschberger (2003, 118) and Collett (2003, 340) as follows:

$$\widehat{\mathrm{SE}}\{\widehat{\mu}_T\} = \sqrt{\sum_{i=1}^{D} \widehat{A}_i^{\,2}\, \frac{d_i}{R_i (R_i - d_i)}}$$

where the sum is over all distinct failure times, $\widehat{A}_i$ is the estimated area under the Kaplan-Meier product-limit survivor curve from time $t_i$ to $t_{\max}$, $R_i$ is the number of subjects at risk at time $t_i$, and $d_i$ is the number of failures at time $t_i$.

The $100(1-\alpha)\%$ confidence interval for the estimated restricted mean is computed as $\widehat{\mu}_T \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}\{\widehat{\mu}_T\}$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution.
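As a worked illustration of these formulas, the following Python sketch computes the restricted mean as the area under the Kaplan-Meier step function and its standard error as the square root of the sum of A_i^2 d_i / {R_i (R_i - d_i)} over the failure times. It runs on made-up toy data, the function names are ours, and the term with A_i = 0 (the last failure time) is skipped so that no 0/0 arises; treat it as a sketch of the formulas rather than of stci itself.

```python
import math

def km_table(times, events):
    """Rows (t_i, R_i, d_i, S(t_i)) at each distinct failure time."""
    data = sorted(zip(times, events))
    n, rows, s, i = len(data), [], 1.0, 0
    while i < n:
        t, at_risk, d, j = data[i][0], n - i, 0, i
        while j < n and data[j][0] == t:
            d += data[j][1]          # event flag: 1 = failure, 0 = censored
            j += 1
        if d > 0:
            s *= 1 - d / at_risk
            rows.append((t, at_risk, d, s))
        i = j
    return rows

def area_from(rows, start_t, start_s, first):
    """Area under the step curve from start_t to the last failure time."""
    area, prev_t, prev_s = 0.0, start_t, start_s
    for t, _, _, s in rows[first:]:
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area

def restricted_mean(rows):
    return area_from(rows, 0.0, 1.0, 0)

def restricted_mean_se(rows):
    total = 0.0
    for i, (t, r, d, s) in enumerate(rows):
        a = area_from(rows, t, s, i + 1)    # A_i: area from t_i to t_max
        if a > 0:                           # A_i = 0 contributes nothing
            total += a * a * d / (r * (r - d))
    return math.sqrt(total)

# Toy data (hypothetical): analysis times and failure indicators.
rows = km_table([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
mean = restricted_mean(rows)
se = restricted_mean_se(rows)
z = 1.959964                                # 0.975 standard normal quantile
ci = (mean - z * se, mean + z * se)
```

Unlike the percentile interval, this normal-approximation interval always has both limits, because it depends only on the point estimate and its standard error.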

Continuing the hip-fracture example, we obtain group-specific restricted means by specifying option rmean with stci:

. stci, by(protect) rmean

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

             |    no. of   restricted
protect      |  subjects         mean      Std. Err.    [95% Conf. Interval]
-------------+--------------------------------------------------------------
           0 |        20     8.938312        1.39057      6.21285    11.6638
           1 |        28     26.75578(*)    2.503475      21.8491    31.6625
-------------+--------------------------------------------------------------
       total |        48     18.89901(*)    2.006556      14.9662    22.8318

(*) largest observed analysis time is censored, mean is underestimated

The estimated mean survival time of the control group (8.94 months) is smaller than the estimated mean survival time of the experimental group (26.76 months). The respective 95% confidence intervals (6.21, 11.66) and (21.85, 31.66) do not overlap, suggesting that the treatment-group patients have a higher expected time-without-fracture. This agrees with the results obtained earlier for the median survival times. Here we have an estimate of the upper confidence limit for the treatment group because the confidence interval is based on the estimates of the restricted mean and its standard error.

Notice the note at the bottom of the table, which is flagged for the mean estimate of the treatment group and for the overall mean. In this group, and in the combined sample, the last observed analysis time is censored; therefore, the restricted mean underestimates the true mean. Stata detects this and warns you about it.

Stata offers a way to alleviate this problem by computing an extended mean. The extended mean is computed by extending the Kaplan-Meier product-limit survivor curve to zero by using an exponentially fit curve and then computing the area under the entire curve (Klein and Moeschberger 2003, 100, 122). This approximation is ad hoc and must be evaluated with care. We recommend plotting the extended survivor function by using the graph option of stci.
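To see what an exponential extension does to the area, consider the following Python sketch. It rests on an assumption that is ours and is not confirmed by the text: the tail is taken to be exp(-rate * t) with the rate chosen so that the curve passes through the last point of the Kaplan-Meier estimate, (t_end, s_end). Stata's exact fitting rule may differ, and the function name is hypothetical.

```python
import math

def extended_mean(area_to_t_end, t_end, s_end):
    """Add the area under an assumed exponential tail beyond t_end.

    area_to_t_end: area under the Kaplan-Meier curve on [0, t_end]
    s_end:         estimated survivor function at t_end
    """
    if s_end <= 0:
        return area_to_t_end            # curve already reaches zero: no extension
    if s_end >= 1:
        raise ValueError("no failures observed; the tail rate is undefined")
    rate = -math.log(s_end) / t_end     # assumption: exp(-rate * t_end) == s_end
    tail_area = s_end / rate            # integral of exp(-rate * t) from t_end on
    return area_to_t_end + tail_area

# Hypothetical inputs for illustration only.
no_extension = extended_mean(10.0, 5.0, 0.0)
with_extension = extended_mean(3.0, 2.0, math.exp(-1))
```

The tail area grows quickly as s_end approaches one, which is one reason to inspect the extended curve graphically before relying on the extended mean.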

For the above example,

. stci, by(protect) emean

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

             |    no. of     extended
protect      |  subjects         mean
-------------+-----------------------
           0 |        20     8.938312(*)
           1 |        28     39.10101
-------------+-----------------------
       total |        48     23.07244

(*) no extension needed

the extended mean of the treatment group is 39.10 months and is noticeably larger than the previous estimate of 26.76. The estimate of the overall mean increases from 18.90 to 23.07. The extended mean of the control group is the same as the restricted mean because the last observed time in this group ends in failure.

By examining figures 8.12 and 8.13 of the extended overall and treatment-group survivor curves, we conclude that the assumption of the exponential survivor function beyond the last observed failure time may be reasonable.

. stci, emean graph
  (output omitted)

[Figure: exponentially extended survivor function plotted against analysis time]

Figure 8.12. Exponentially extended Kaplan-Meier estimate

. stci if protect==1, emean graph
  (output omitted)

[Figure: exponentially extended survivor function plotted against analysis time]

Figure 8.13. Exponentially extended Kaplan-Meier estimate, treatment group

The survivor and hazard functions

Interpreting the cumulative hazard and hazard rate