# Exercise 1: Non-parametric estimation of cumulative hazard and survival function

Back to course page

## Theory

The cumulative hazard can be estimated non-parametrically by the Nelson-Aalen estimator defined as

$\widehat{\Lambda}(t)=\int_0^t\frac{d\overline{N}(u)}{\overline{Y}(u)}= \sum_{\{i: t_i\leq t\}} \frac{\Delta\overline{N}(t_i)}{\overline{Y}(t_i)},$ where $$0<t_1<t_2<\cdots<t_d$$ are ordered distinct failure times.

The Kaplan-Meier [KM] estimator of the survival function is $\widehat{S}(t)=\prod_{\{i: t_i\leq t\}}\biggl[1-\frac{\Delta\overline{N}(t_i)}{\overline{Y}(t_i)}\biggr].$ If the distribution of $$T_i$$ is continuous, one could define an alternative survival function estimator using the relationship $$S(t)=\exp\{-\Lambda(t)\}$$ applied to the Nelson-Aalen estimator, that is $\widehat{S}_*(t)=\exp\{-\widehat{\Lambda}(t)\}=\prod_{\{i: t_i\leq t\}}\exp\biggl\{-\frac{\Delta\overline{N}(t_i)}{\overline{Y}(t_i)}\biggr\}.$ This is called the Fleming-Harrington [FH] estimator (sometimes also the Breslow estimator). Because $$\exp\{-h\}\approx 1-h$$ for small $$h$$, the difference between the KM and FH estimators is small unless the risk set size $$\overline{Y}(u)$$ is not large enough, which happens at the largest failure times. The FH estimator is not suitable for discrete data, while the KM estimator works well for both discrete and continuous cases.

## Understanding the formulae

The dataset fans [Nelson, Journal of Quality Technology, 1:27-52, 1969] comes from an engineering study of the time to failure of diesel generator fans. Seventy generators were studied. For each one, the number of hours of running time from its first being put into service until fan failure or until the end of the study (whichever came first) was recorded.

The data is available in the RData format from this link. The dataframe is called fans, the variable fans$time includes the censored failure time $$X_i$$ (in hours), the variable fans$fail is the failure indicator $$\delta_i$$.

We will use this dataset to demonstrate the calculation of cumulative hazard and survival function estimates.

Task 1

Using ordinary R functions (not the survival library), calculate and print the following table:

$$i$$ $$t_i$$ $$d_i$$ $$y_i$$ $$d_i/y_i$$ $$L_i$$ $$S_i$$ $$S^*_i$$
1 â€¦ â€¦ â€¦ â€¦ â€¦ â€¦ â€¦
â€¦ â€¦ â€¦ â€¦ â€¦ â€¦ â€¦ â€¦
d â€¦ â€¦ â€¦ â€¦ â€¦ â€¦ â€¦

where $$i$$ is the order of the failures, $$t_i$$ is the ordered failure time, $$d_i=\Delta\overline{N}(t_i)$$, $$y_i=\overline{Y}(t_i)$$, $$L_i=\widehat{\Lambda}(t_i)$$, $$S_i=\widehat{S}(t_i)$$, and $$S^*_i=\widehat{S}_*(t_i)$$.

Then plot the cumulative hazard estimator and both survival function estimators. Include in your report the code for calculating and plotting the results and the output you obtained.

## Calculating the estimators in R

The basic functions for censored data analysis are available in R library survival. Censored data are entered as special survival objects created in this way:

Surv(x,del)

where x is a numeric vector containing the censored failure times and del is a numeric vector containing failure indicators (1 = failure, 0 = censored). Survival objects are used as input to functions providing analysis of censored data.

For demonstration, we will use the dataset aml included in library(survival).

print(aml)
##    time status             x
## 1     9      1    Maintained
## 2    13      1    Maintained
## 3    13      0    Maintained
## 4    18      1    Maintained
## 5    23      1    Maintained
## 6    28      0    Maintained
## 7    31      1    Maintained
## 8    34      1    Maintained
## 9    45      0    Maintained
## 10   48      1    Maintained
## 11  161      0    Maintained
## 12    5      1 Nonmaintained
## 13    5      1 Nonmaintained
## 14    8      1 Nonmaintained
## 15    8      1 Nonmaintained
## 16   12      1 Nonmaintained
## 17   16      0 Nonmaintained
## 18   23      1 Nonmaintained
## 19   27      1 Nonmaintained
## 20   30      1 Nonmaintained
## 21   33      1 Nonmaintained
## 22   43      1 Nonmaintained
## 23   45      1 Nonmaintained

Survival function estimates are calculated by the function survfit. The most rudimentary use of this function is

fit <- survfit(Surv(x,del)~1, data=data)

This calculates the KM estimator. The FH estimator is obtained by adding an option type

fit <- survfit(Surv(x,del)~1, type="fleming", data=data)

The results can be plotted easily by calling

plot(fit)

In our example, we calculate and plot KM and FH estimates like this

fit1 <- survfit(Surv(time,status)~1, data=aml)
fit2 <- survfit(Surv(time,status)~1, type="fleming", data=aml)
plot(fit1,conf.int=FALSE,col="blue")
lines(fit2,conf.int=FALSE,col="red")

We suppressed the plotting of the confidence intervals.

Using the formula notation, we can easily create and plot separate curves for two distinct subgroups of observations:

fit3 <- survfit(Surv(time,status)~x, data=aml)
plot(fit3,col=c("green","magenta"))

It is interesting to investigate the contents of the survfit object by calling str(fit1).

There is no specific function to calculate the Nelson-Aalen estimator of the cumulative hazard. The easiest way to obtain it is by transforming the FH estimate of survival function.

Task 2

Calculate and plot the cumulative hazard estimator and both survival function estimators for fan data using the function survfit. Compare the results to those you obtained in Task 1. Include the code, output and comments in your report.

Task 3

Download dataset mort.RData. There are two dataframes mort.m and mort.f containing mortality data from the Czech republic in the year 2013 for males and females. Each dataframe has three columns: i is an age group (going from the age $$i$$ years to $$i+1$$ years), Di is the number of deaths in the age group in 2013, Yi is the number of people alive when they reached the age $$i$$.

Your task is to calculate and plot Nelson-Aalen and Kaplan-Meier estimates for males and females. Do you feel comfortable to use the data in this way? Do we need any additional assumptions to make the results interpretable?

Compare your results to the official mortality tables created by Czech Statistical Office: here for the males, here for the females. Can you recognize the columns in their table? Are their results comparable to yours?

Include the code, output and comments in your report.

## Homework

1. Show that the Nelson-Aalen estimator satisfies the equation $\sum_{i=1}^n \widehat{\Lambda}(X_i)=\sum_{i=1}^n \delta_i$

2. Show that if the data are uncensored, the Kaplan-Meier estimator corresponds to the empirical distribution function but the Fleming-Harrington estimator does not.

Due date: Nov. 16