Exercise 4: Actuarial (lifetable) estimator of the survival


Back to course page


Lifetable or actuarial estimator

Applies when the data are grouped

Our goal is still to estimate the survival function, hazard, and density function, but this is complicated by the fact that we don’t know exactly when during each time interval an event occurs.

Notation:

  • the \(j\)-th time interval is \([t_{j-1},t_j)\),
  • \(c_j\) - the number of censorings in the \(j\)-th interval,
  • \(d_j\) - the number of failures (deaths) in the \(j\)-th interval,
  • \(r_j\) - the number of units entering the interval.

We could apply the K-M formula directly. However, this approach is unsatisfactory for grouped data … it treats the problem as though it were in discrete time, with events happening only at 1 yr, 2 yr, etc. In fact, what we are trying to calculate here is the conditional probability of dying within the interval, given survival to the beginning of it.

We now make the assumption that the censoring process is such that the censored survival times occur uniformly throughout the \(j\)-th interval. So, we can assume that censorings occur on average halfway through the interval: \[ r_j'=r_j-c_j/2. \] I.e., average number of individuals who are at risk during this interval. The assumption yields the Actuarial Estimator. It is appropriate if censorings occur uniformly throughout the interval. This assumption is sometimes known as the actuarial assumption.

The event that an individual survives beyond \([t_{j-1},t_j)\) is equivalent to the event that an individual survives beyond \([t_{0},t_1)\) and … and that an individual survives beyond \([t_{j-2},t_{j-1})\) and that an individual survives beyond \([t_{j-1},t_j)\). Let us degfine the following quantities:

  • \(P_j=P\{\mbox{an individual survives beyond }[t_{j-1},t_j)\}\),
  • \(p_j=P\{\mbox{an individual survives beyond }[t_{j-1},t_j) |\mbox{ he survives beyond }[t_{j-2},t_{t-1})\}\).

Then, \[ P_j=p_1\ldots p_j. \] The conditional probability of an death in \([t_{j-1},t_j)\) given that the individual survives beyond \([t_{j-2},t_{j-1})\) is estimated by \[ d_j/r_j'. \] Thus, \(p_j\) is estimated by \[ \frac{r_j'-d_j}{r_j'}. \] So, the actuarial (lifetime) estimate of the survival function is \[ \widehat{S}(t)=\prod_{j=1}^k\frac{r_j'-d_j}{r_j'} \] for \(t_{j-1}\leq t<t_j\) and \(j=1,\ldots,k\).

Remarks:

  • Because the intervals are defined as \([t_{j-1}, t_j)\), the first interval typically starts with \(t_0 = 0\).
  • The implication in R is that \(\widehat{S}(t_0)=1\).

The form of the estimated survivor function obtained using this method is sensitive to the choice of the intervals used in its construction. On the other hand, the lifetable estimate is particularly well suited to situations in which the actual death times are unknown, and only available information is the number of deaths and the number of censored observations which occur in a series of consecutive time intervals.

Quantities estimated

  • Midpoint \(t_{mj}=(t_j+t_{j-1})/2\).
  • Width \(b_j=t_j-t_{j-1}\).
  • Conditional probability of dying \(\widehat{q}_j=d_j/r_j'\).
  • Conditional probability of surviving \(\widehat{p}_j=1-\widehat{q}_j\).
  • Cumulative probability of surviving at \(t_j\): \(\widehat{S}(t)=\prod_{l\leq j}\widehat{p}_l\).
  • Hazard in the \(j\)-th interval \[ \widehat{\lambda}(t_{mj})=\frac{\widehat{q}_j}{b_j(1-\widehat{q}_j/2)} \] the number of deaths in the interval divided by the average number of survivors at the midpoint.
  • Density at the midpoint of the \(j\)-th interval \[ \widehat{f}(t_{mj})=\frac{\widehat{S}(t_{j-1})\widehat{q}_j}{b_j}. \]

Constructing the lifetable

R with KMsurv package requires three elements:

  • a vector of \(k + 1\) interval endpoints
  • a \(k\) vector of death counts in each interval
  • a \(k\) vector of censored counts in each interval

Incidentally, KM in KMsurv stands for Klein and Moeschberger, not Kaplan-Meier!

library(KMsurv)

Duration of nursing home stay data

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. ‘Treated’ nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

Variables include:

  • los - Length of stay of a resident (in days)
  • age - Age of a resident
  • rx - Nursing home assignment (1:bonuses, 0:no bonuses)
  • gender - Gender (1:male, 0:female)
  • married - (1:married, 0:not married)
  • health - Health status (2:second best, 5:worst)
  • censor - Censoring indicator (1:censored, 0:discharged)

Suppose we wish to use the actuarial method, but the data do not come grouped.

Consider the treated nursing home patients, with length of stay (los) grouped into 100 day intervals:

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
  head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurshome$int = floor(nurshome$los/100)*100    # group in 100 day intervals
nurshome = nurshome[nurshome$rx==1,]   # keep only treated homes
nurshome[1:3,]    # check out a few observations
##    los age rx gender married health censor  int
## 1   37  86  1      0       0      2      0    0
## 2   61  77  1      0       0      4      0    0
## 3 1084  75  1      0       0      3      1 1000
tab = table(nurshome$int, nurshome$censor)
intEndpts = (0:11)*100
ntotal = dim(nurshome)[1]  # nr patients
cens = tab[,2]   # nr censored in each interval
released = tab[,1]   # nr released in each interval
fitlt = lifetab(tis = intEndpts, ninit=ntotal, nlost=cens, nevent=released)
round(fitlt, 4)        # restrict output to 4 decimals
##           nsubs nlost nrisk nevent   surv    pdf hazard se.surv se.pdf
## 0-100       712     0 712.0    330 1.0000 0.0046 0.0060  0.0000  2e-04
## 100-200     382     0 382.0     86 0.5365 0.0012 0.0025  0.0187  1e-04
## 200-300     296     0 296.0     65 0.4157 0.0009 0.0025  0.0185  1e-04
## 300-400     231     0 231.0     38 0.3244 0.0005 0.0018  0.0175  1e-04
## 400-500     193     1 192.5     32 0.2711 0.0005 0.0018  0.0167  1e-04
## 500-600     160     0 160.0     13 0.2260 0.0002 0.0008  0.0157  1e-04
## 600-700     147     0 147.0     13 0.2076 0.0002 0.0009  0.0152  1e-04
## 700-800     134    30 119.0     10 0.1893 0.0002 0.0009  0.0147  0e+00
## 800-900      94    29  79.5      4 0.1734 0.0001 0.0005  0.0143  0e+00
## 900-1000     61    30  46.0      4 0.1647 0.0001 0.0009  0.0142  1e-04
## 1000-1100    27    27  13.5      0 0.1503     NA     NA  0.0147     NA
##           se.hazard
## 0-100         3e-04
## 100-200       3e-04
## 200-300       3e-04
## 300-400       3e-04
## 400-500       3e-04
## 500-600       2e-04
## 600-700       3e-04
## 700-800       3e-04
## 800-900       3e-04
## 900-1000      5e-04
## 1000-1100        NA

Estimated Survival

names(fitlt)  # check out components of fitlt object
##  [1] "nsubs"     "nlost"     "nrisk"     "nevent"    "surv"     
##  [6] "pdf"       "hazard"    "se.surv"   "se.pdf"    "se.hazard"
x = rep(intEndpts, rep(2,12))[2:23]
y = rep(fitlt$surv, rep(2,11))
plot(x, y, type="l", col=4, xlab="Time Interval (Days)", ylab="Survival (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)

Estimated Hazard

y = rep(fitlt$hazard, rep(2,11))
plot(x, y, type="l", col=6, xlab="Time Interval (Days)", ylab="Hazard (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)

Task 1

Estimated the survival and hazard for Angina Pectoris data below using the lifetable estimator.

intEndpts = 0:16    # interval endpoints
deaths = c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 0)
cens = c(0, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30) # n censored

Due date: Dec. 14