**Applies when the data are grouped**

Our goal is still to estimate the survival function, hazard, and density function, but this is complicated by the fact that we donâ€™t know exactly when during each time interval an event occurs.

**Notation:**

- the \(j\)-th time interval is \([t_{j-1},t_j)\),
- \(c_j\) - the number of censorings in the \(j\)-th interval,
- \(d_j\) - the number of failures (deaths) in the \(j\)-th interval,
- \(r_j\) - the number of units
*entering*the interval.

We could apply the K-M formula directly. However, this approach is unsatisfactory for grouped data â€¦ it treats the problem as though it were in discrete time, with events happening only at 1 yr, 2 yr, etc. In fact, what we are trying to calculate here is the conditional probability of dying *within the interval*, given survival to the beginning of it.

We now make the assumption that the censoring process is such that the censored survival times occur uniformly throughout the \(j\)-th interval. So, we can assume that censorings occur on average *halfway* through the interval: \[
r_j'=r_j-c_j/2.
\] I.e., average number of individuals who are at risk during this interval. The assumption yields the *Actuarial Estimator*. It is appropriate if censorings occur uniformly throughout the interval. This assumption is sometimes known as the actuarial assumption.

The event that an individual survives beyond \([t_{j-1},t_j)\) is equivalent to the event that an individual survives beyond \([t_{0},t_1)\) and â€¦ and that an individual survives beyond \([t_{j-2},t_{j-1})\) and that an individual survives beyond \([t_{j-1},t_j)\). Let us degfine the following quantities:

- \(P_j=P\{\mbox{an individual survives beyond }[t_{j-1},t_j)\}\),
- \(p_j=P\{\mbox{an individual survives beyond }[t_{j-1},t_j) |\mbox{ he survives beyond }[t_{j-2},t_{t-1})\}\).

Then, \[ P_j=p_1\ldots p_j. \] The conditional probability of an death in \([t_{j-1},t_j)\) given that the individual survives beyond \([t_{j-2},t_{j-1})\) is estimated by \[ d_j/r_j'. \] Thus, \(p_j\) is estimated by \[ \frac{r_j'-d_j}{r_j'}. \] So, the actuarial (lifetime) estimate of the survival function is \[ \widehat{S}(t)=\prod_{j=1}^k\frac{r_j'-d_j}{r_j'} \] for \(t_{j-1}\leq t<t_j\) and \(j=1,\ldots,k\).

**Remarks:**

- Because the intervals are defined as \([t_{j-1}, t_j)\), the first interval typically starts with \(t_0 = 0\).
- The implication in
`R`

is that \(\widehat{S}(t_0)=1\).

The form of the estimated survivor function obtained using this method is sensitive to the choice of the intervals used in its construction. On the other hand, the lifetable estimate is particularly well suited to situations in which the actual death times are unknown, and only available information is the number of deaths and the number of censored observations which occur in a series of consecutive time intervals.

- Midpoint \(t_{mj}=(t_j+t_{j-1})/2\).
- Width \(b_j=t_j-t_{j-1}\).
- Conditional probability of dying \(\widehat{q}_j=d_j/r_j'\).
- Conditional probability of surviving \(\widehat{p}_j=1-\widehat{q}_j\).
- Cumulative probability of surviving at \(t_j\): \(\widehat{S}(t)=\prod_{l\leq j}\widehat{p}_l\).
- Hazard in the \(j\)-th interval \[ \widehat{\lambda}(t_{mj})=\frac{\widehat{q}_j}{b_j(1-\widehat{q}_j/2)} \] the number of deaths in the interval divided by the average number of survivors at the midpoint.
- Density at the midpoint of the \(j\)-th interval \[ \widehat{f}(t_{mj})=\frac{\widehat{S}(t_{j-1})\widehat{q}_j}{b_j}. \]

`R`

with `KMsurv`

package requires three elements:

- a vector of \(k + 1\) interval endpoints
- a \(k\) vector of death counts in each interval
- a \(k\) vector of censored counts in each interval

Incidentally, KM in `KMsurv`

stands for Klein and Moeschberger, not Kaplan-Meier!

`library(KMsurv)`

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. â€˜Treatedâ€™ nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patientâ€™s health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

*Variables include*:

`los`

- Length of stay of a resident (in days)`age`

- Age of a resident`rx`

- Nursing home assignment (1:bonuses, 0:no bonuses)`gender`

- Gender (1:male, 0:female)`married`

- (1:married, 0:not married)`health`

- Health status (2:second best, 5:worst)`censor`

- Censoring indicator (1:censored, 0:discharged)

Suppose we wish to use the actuarial method, but the data *do not come grouped*.

Consider the treated nursing home patients, with length of stay (`los`

) grouped into 100 day intervals:

```
nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurshome$int = floor(nurshome$los/100)*100 # group in 100 day intervals
nurshome = nurshome[nurshome$rx==1,] # keep only treated homes
nurshome[1:3,] # check out a few observations
```

```
## los age rx gender married health censor int
## 1 37 86 1 0 0 2 0 0
## 2 61 77 1 0 0 4 0 0
## 3 1084 75 1 0 0 3 1 1000
```

```
tab = table(nurshome$int, nurshome$censor)
intEndpts = (0:11)*100
ntotal = dim(nurshome)[1] # nr patients
cens = tab[,2] # nr censored in each interval
released = tab[,1] # nr released in each interval
fitlt = lifetab(tis = intEndpts, ninit=ntotal, nlost=cens, nevent=released)
round(fitlt, 4) # restrict output to 4 decimals
```

```
## nsubs nlost nrisk nevent surv pdf hazard se.surv se.pdf
## 0-100 712 0 712.0 330 1.0000 0.0046 0.0060 0.0000 2e-04
## 100-200 382 0 382.0 86 0.5365 0.0012 0.0025 0.0187 1e-04
## 200-300 296 0 296.0 65 0.4157 0.0009 0.0025 0.0185 1e-04
## 300-400 231 0 231.0 38 0.3244 0.0005 0.0018 0.0175 1e-04
## 400-500 193 1 192.5 32 0.2711 0.0005 0.0018 0.0167 1e-04
## 500-600 160 0 160.0 13 0.2260 0.0002 0.0008 0.0157 1e-04
## 600-700 147 0 147.0 13 0.2076 0.0002 0.0009 0.0152 1e-04
## 700-800 134 30 119.0 10 0.1893 0.0002 0.0009 0.0147 0e+00
## 800-900 94 29 79.5 4 0.1734 0.0001 0.0005 0.0143 0e+00
## 900-1000 61 30 46.0 4 0.1647 0.0001 0.0009 0.0142 1e-04
## 1000-1100 27 27 13.5 0 0.1503 NA NA 0.0147 NA
## se.hazard
## 0-100 3e-04
## 100-200 3e-04
## 200-300 3e-04
## 300-400 3e-04
## 400-500 3e-04
## 500-600 2e-04
## 600-700 3e-04
## 700-800 3e-04
## 800-900 3e-04
## 900-1000 5e-04
## 1000-1100 NA
```

**Estimated Survival**

`names(fitlt) # check out components of fitlt object`

```
## [1] "nsubs" "nlost" "nrisk" "nevent" "surv"
## [6] "pdf" "hazard" "se.surv" "se.pdf" "se.hazard"
```

```
x = rep(intEndpts, rep(2,12))[2:23]
y = rep(fitlt$surv, rep(2,11))
```

```
plot(x, y, type="l", col=4, xlab="Time Interval (Days)", ylab="Survival (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)
```

**Estimated Hazard**

`y = rep(fitlt$hazard, rep(2,11))`

```
plot(x, y, type="l", col=6, xlab="Time Interval (Days)", ylab="Hazard (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)
```

Task 1

Estimated the survival and hazard for Angina Pectoris data below using the lifetable estimator.

```
intEndpts = 0:16 # interval endpoints
deaths = c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 0)
cens = c(0, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30) # n censored
```

Due date: Dec. 14