Example: Hip fracture data

Pretend that a study was performed to quantify the benefit of a new inflatable device to protect elderly persons from hip fractures resulting from falls. The device is worn around the hips at all times. It is hypothesized that the device will reduce the incidence of hip fractures in this population. Forty-eight women over the age of 60, with no previous histories of hip trauma, were recruited for this study. Of these 48 women, 28 were randomly given the device and instructed on how to wear it. The remaining 20 women were not provided with the device and were used as study control subjects. All 48 women were monitored closely, and blood calcium levels were drawn approximately every 5 months. The time to hip fracture or censoring was recorded in months. It was decided at study onset that, if a woman was ever hospitalized during follow-up, she would not be considered at risk of falling and fracturing her hip. This creates gaps in the data.

The dataset below is real, so feel free to use it. The data, however, are fictional.

. use http://www.stata-press.com/data/cggm3/hip
(hip fracture study)

. describe

Contains data from http://www.stata-press.com/data/cggm3/hip.dta
  obs:           106                          hip fracture study
 vars:             7                          30 Jan 2009 11:58
 size:         1,484 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
id              byte   %4.0g                  patient ID
time0           byte   %5.0g                  begin of span
time1           byte   %5.0g                  end of span
fracture        byte   %8.0g                  fracture event
protect         byte   %8.0g                  wears device
age             byte   %4.0g                  age at enrollment
calcium         float  %8.0g                  blood calcium level
------------------------------------------------------------------------
Sorted by:

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       106    28.21698    13.09599          1         48
       time0 |       106    4.792453    5.631065          0         15
       time1 |       106     11.5283    8.481024          1         39
    fracture |       106    .2924528    .4570502          0          1
     protect |        48    .5833333    .4982238          0          1
         age |        48      70.875    5.659205         62         82
     calcium |       106    10.10849    1.407355       7.25      12.32

. sort id time0

. by id: list time0-calcium

-> id = 1

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       1          1         0    76      9.35 |

-> id = 2

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       1          1         0    80       7.8 |

  (output omitted)

-> id = 17

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       8          0         0    66     11.48 |
  2. |     8      15          1         .     .     10.79 |

-> id = 18

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       5          0         0    64     11.58 |
  2. |    15      17          1         .     .     11.59 |

  (output omitted)

-> id = 47

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       5          0         1    63     12.18 |
  2. |     5      15          0         .     .     11.64 |
  3. |    15      35          0         .     .     11.79 |

-> id = 48

     | time0   time1   fracture   protect   age   calcium |
  1. |     0       5          0         1    67     11.21 |
  2. |     5      15          0         .     .     11.43 |
  3. |    15      39          0         .     .     11.29 |

Here time is already recorded in analysis-time units, which just means we will not have to bother with the origin() option when we type stset.

Our data do, however, have multiple observations per subject to accommodate the time-varying covariate calcium, and we will assume that the value of this variable is fixed over the interval spanned by each record.

age records the age of each participant at the time of enrollment in the study.

Glancing at our data, you will notice that age appears to be coded only in the first record for each subject. All the records are like that. If we later copy down this value of age (that is, propagate age values from past to future observations), we will be treating age as fixed.
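To make "copying down" concrete, here is a minimal sketch that does it by hand on a scratch copy of age. The variable name age_cd is hypothetical and is dropped at the end, so the original age variable (and everything that follows in this section) is unaffected. It assumes the hip data are in memory exactly as loaded above.

```stata
* Sketch only: copy age down within subject into a scratch variable.
* age_cd is a hypothetical name; the original age is left untouched.
sort id time0
generate age_cd = age
* replace works through the observations in order, so a filled-in value
* is available to the next record of the same subject
by id: replace age_cd = age_cd[_n-1] if missing(age_cd) & _n > 1
list id time0 time1 age age_cd if id == 17 | id == 18, sepby(id)
drop age_cd
```

In the listing, age should be missing on each subject's second record while age_cd carries the first-record value forward.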

In any case, the stset command for this dataset is

. stset time1, id(id) time0(time0) failure(fracture)

                id:  id
     failure event:  fracture != 0 & fracture < .
obs. time interval:  (time0, time1]
 exit on or before:  failure

      106  total obs.
        0  exclusions

      106  obs. remaining, representing
       48  subjects
       31  failures in single failure-per-subject data
      714  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        39

Let us now go through the data verification process described earlier in this chapter.

We begin by looking at the above output; examining _t0, _t, _d, and _st; confirming that stdescribe does not reveal any surprises; confirming that stvary makes a similarly unsurprising report; and finally using stfill to fix any problems uncovered by stvary.

First, we look at _t0, _t, _d, and _st in the beginning, middle, and end of the data.

. list id time0 time1 frac _t0 _t _d _st if id<=3

      | id   time0   time1   fracture   _t0   _t   _d   _st |
   1. |  1       0       1          1     0    1    1     1 |
   2. |  2       0       1          1     0    1    1     1 |
   3. |  3       0       2          1     0    2    1     1 |

. list id time0 time1 frac _t0 _t _d _st if 16<=id & id<=18, sepby(id)

      | id   time0   time1   fracture   _t0   _t   _d   _st |
  23. | 16       0       5          0     0    5    0     1 |
  24. | 16       5      12          1     5   12    1     1 |
      |-----------------------------------------------------|
  25. | 17       0       8          0     0    8    0     1 |
  26. | 17       8      15          1     8   15    1     1 |
      |-----------------------------------------------------|
  27. | 18       0       5          0     0    5    0     1 |
  28. | 18      15      17          1    15   17    1     1 |

. list id time0 time1 frac _t0 _t _d _st if id>=47, sepby(id)

      | id   time0   time1   fracture   _t0   _t   _d   _st |
 101. | 47       0       5          0     0    5    0     1 |
 102. | 47       5      15          0     5   15    0     1 |
 103. | 47      15      35          0    15   35    0     1 |
      |-----------------------------------------------------|
 104. | 48       0       5          0     0    5    0     1 |
 105. | 48       5      15          0     5   15    0     1 |
 106. | 48      15      39          0    15   39    0     1 |

That looks good. Let's see what stdescribe has to say:

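. stdescribe

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

                                   |-------------- per subject --------------|
Category                   total        mean         min     median        max
------------------------------------------------------------------------------
no. of subjects               48
no. of records               106    2.208333           1          2          3

(first) entry time                         0           0          0          0
(final) exit time                       15.5           1       12.5         39

subjects with gap              3
time on gap if gap            30          10          10         10         10
time at risk                 714      14.875           1       11.5         39

failures                      31    .6458333           0          1          1
------------------------------------------------------------------------------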

Starting with the first two lines, stdescribe reports that there are 48 subjects in our data with 106 records; the average number of records per subject is just over two, with three being the maximum number of records for any one subject. From this, we see that the st system correctly recognizes that there are multiple records per subject.

In cases where there is only one observation per subject, the reported totals in the first and second line would be equal, and the mean, min, and max number of records per subject would all be equal to 1.

In lines 3 and 4, stdescribe reports that everyone entered at time 0 (there is no delayed entry) and that the average exit time was 15.5 months, with a minimum of 1 month and a maximum of 39. Be careful when interpreting this reported average exit time. This is just the average of the follow-up times; it is not the average survival time because some of our subjects are censored. When there are no censored observations, the average exit time reported by stdescribe does equal the average survival time.
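The exit-time line can be checked by hand. The following is a minimal sketch, assuming the data are stset as above; lastrec is a hypothetical scratch variable marking each subject's final record, and it is dropped afterward so that later commands do not report on it.

```stata
* Sketch: summarize each subject's final exit time by hand.
* lastrec is a hypothetical name, not part of the original dataset.
sort id _t0 _t
by id: generate byte lastrec = (_n == _N)
* _t on the last record of a subject is that subject's exit time
summarize _t if lastrec, detail
drop lastrec
```

The mean, minimum, median (50th percentile), and maximum from summarize should agree with the "(final) exit time" line of stdescribe. This is still only the average follow-up time, not an estimate of mean survival.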

In lines 5 and 6, stdescribe reports that there are three subjects with gaps, each being 10 months long. This is a strange finding. In most datasets with time gaps, the gaps vary in length. We were immediately suspicious of these results and wanted to verify them. One way to identify these three subjects is to make use of the fact that, when there are no gaps between consecutive observations for a subject, the ending time of the first record equals the beginning time of the next record. So, we will sort the data by subject and time, and then we will generate a dummy variable, gap = 1, for those observations with gaps. Then we can list the observations.

. sort id _t0 _t

. quietly by id: gen gap=1 if _t0 != _t[_n-1] & _n>1

. list id if gap==1

      | id |
  28. | 18 |
  55. | 30 |
  63. | 33 |

. list id time0 time1 _t0 _t if id==18 | id==30 | id==33, sepby(id)

      | id   time0   time1   _t0   _t |
  27. | 18       0       5     0    5 |
  28. | 18      15      17    15   17 |
      |-------------------------------|
  54. | 30       0       5     0    5 |
  55. | 30      15      19    15   19 |
      |-------------------------------|
  62. | 33       0       5     0    5 |
  63. | 33      15      23    15   23 |

All appears well; each of these subjects had a gap lasting 10 months, from time 5 to time 15, yet perhaps we still might want to check that the data were entered correctly.
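The length of each gap can also be computed directly rather than read off the listing. This is a sketch under the same assumptions as above (data stset, sorted within subject); gaplen is a hypothetical scratch variable.

```stata
* Sketch: compute gap length as the distance between the end of the
* previous record and the start of the current one, within subject.
* gaplen is a hypothetical name and is dropped afterward.
sort id _t0 _t
by id: generate gaplen = _t0 - _t[_n-1] if _n > 1 & _t0 != _t[_n-1]
* tabulate ignores missing values, so only records that follow a gap are counted
tabulate gaplen
drop gaplen
```

If the gaps are as stdescribe reports, the tabulation should show three records, each with a gap length of 10.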

Returning to the stdescribe output, on line 7 we observe that subjects were at risk of failure for a total of 714 months. This is simply the sum of the time spanned by the records: stdescribe calculates the length of the interval (_t0, _t] represented by each record and then sums these lengths over the records for which _st==1.
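The same total can be reproduced in a few lines. This is a sketch, assuming the data are stset as above; span is a hypothetical scratch variable.

```stata
* Sketch: total time at risk as the sum of record lengths.
* span is a hypothetical name and is dropped afterward.
generate double span = _t - _t0 if _st == 1
quietly summarize span
display "total time at risk = " r(sum)
drop span
```

The displayed sum should match the "time at risk" total reported by stdescribe and stset.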

Finally, in line 8, stdescribe reports that there were 31 failures, or 31 hip fractures, in our data. The maximum number of per-subject failures is one, indicating that we have single failure-per-subject data, and the minimum number of failures is zero, indicating the presence of censored observations. Of course, we can also see that there are censored observations when we compare the total number of failures with the total number of subjects on line 1. As we may expect, for a dataset with multiple failures per subject, stdescribe will provide in line 8 summary statistics indicating the existence of multiple failures-per-subject data.

So, all looks fine based on stdescribe. Next let's try stvary:

. capture drop gap

. stvary

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

               subjects for whom the variable is
                                               never      always   sometimes
    variable |  constant    varying          missing     missing     missing
-------------+--------------------------------------------------------------
     protect |        48          0                8           0          40
         age |        48          0                8           0          40
     calcium |         8         40               48           0           0

By default, stvary reports on all variables in the dataset, omitting the variables used or created by stset. (stvary optionally allows variable lists, so we can specify those variables that we wish to examine.)
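As a small illustration of the variable-list form just mentioned, the report can be restricted to the variables of interest. This sketch assumes the data are stset as above.

```stata
* Sketch: restrict stvary's report to selected variables.
stvary calcium
stvary age protect
```

Each call produces the same kind of table as the default report, limited to the variables named.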

The variable protect records if the subject is in the experimental or the control group, age records the subject's age at enrollment in the study, and calcium records the subject's blood calcium concentration. Recall that this last characteristic was examined approximately every 5 months, so the variable varies over time.

Looking at stvary's output, what catches our eye is the many "sometimes missings". Well, we already knew that protect and age did not have the values filled in on subsequent records and that we would have to fix that. Let us now follow our own advice and use streset and stfill to fix this problem. Here streset is unnecessary because there are no observations for which _st==0, but it never hurts to be cautious:

. streset, past future
  (output omitted)

. stfill age protect, forward

         failure _d:  fracture
   analysis time _t:  (time1-origin)
             origin:  min
  exit on or before:  time .
                 id:  id

replace missing values with previously observed values:
             age:  58 real changes made
         protect:  58 real changes made

. streset
  (output omitted)

Now stvary reports

. stvary

         failure _d:  fracture
   analysis time _t:  time1
                 id:  id

               subjects for whom the variable is
                                               never      always   sometimes
    variable |  constant    varying          missing     missing     missing
-------------+--------------------------------------------------------------
     protect |        48          0               48           0           0
         age |        48          0               48           0           0
     calcium |         8         40               48           0           0

Satisfied, we can save these data as hip2.dta.

. save hip2
file hip2.dta saved

When you stset a dataset and save it, Stata remembers how the data were set the next time you use the data.
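A quick way to see this is to reload the saved file and ask Stata to redisplay the settings. The sketch below assumes hip2.dta was saved in the current working directory as shown above.

```stata
* Sketch: reload the saved data and confirm the st settings came with it.
use hip2, clear
* typed with no arguments, stset redisplays how the data are currently set
stset
* st commands work immediately, with no need to respecify the settings
stdescribe
```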

Nonparametric analysis

The previous two chapters served as a tutorial on stset. Once you stset your data, you can use any st survival command, and the nice thing is that you do not have to continually restate the definitions of analysis time, failure, and rules for inclusion.

As previously discussed in chapter 1, the analysis of survival data can take one of three forms (nonparametric, semiparametric, and parametric), all depending on what we are willing to assume about the form of the survivor function and about how the survival experience is affected by covariates.

Nonparametric analysis follows the philosophy of letting the dataset speak for itself and making no assumption about the functional form of the survivor function (and thus no assumption about, for example, the hazard or the cumulative hazard). The effects of covariates are not modeled, either; the comparison of the survival experience is done at a qualitative level across the values of the covariates.

Most of Stata's nonparametric survival analysis is performed via the sts command, which calculates estimates, saves estimates as data, draws graphs, and performs tests, among other things; see [ST] sts.
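As a rough preview of those four roles, here is a sketch using the hip fracture data saved earlier. It assumes hip2.dta is in the current working directory and is already stset; the variable name km is hypothetical.

```stata
* Sketch: one sts subcommand for each role named above.
use hip2, clear
* calculate: list the Kaplan-Meier survivor function
sts list
* graph: survivor curves for the device and control groups
sts graph, by(protect)
* test: log-rank test of equality of survivor functions across groups
sts test protect
* save as data: store the Kaplan-Meier estimate in a new variable
sts generate km = s
```

Because the data are already stset, none of these commands needs to be told which variables hold analysis time, failure, or subject ID.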

The survivor and hazard functions

Interpreting the cumulative hazard and hazard rate