Assignment 3: Matched case-control analysis via conditional logistic regression

Author

Martin Otava

library(dplyr)
library(ggplot2)
library(forcats)
library(tidyr)

library(Epi)
library(survival)

Assignment

The Los Angeles Study of Endometrial Cancer was a matched case-control study conducted in California in the 1970’s (description in [BD1], Chap. 5.1, p. 162?163, data in [BD1], App. III, p. 290?296). There are 63 cases of endometrial cancer, all women age 55 or over, each matched to four controls living in the same retirement community. The primary exposure of interest was estrogen use. The secondary exposure was gallbladder disease. The Epi library in R includes two versions of the data: the full dataset bdendo and a subset containing a single control matched to each case bdendo11 .

  1. Conduct descriptive analysis similar to Table 5.1 of [BD1], p. 163. Use bdendo11 data to estimate odds ratios using the method for 1:1 matching and binary exposure.
  2. Conduct conditional logistic analysis of the bdendo11 dataset (1:1 matching) using the function glm. See [BD1], Chap. 7.3, p. 253?259, for inspiration and comparison of results.
  3. Conduct conditional logistic analysis of the bdendo dataset (1:4 matching) using one of the conditional logistic regression functions available in R (function clogistic from Epi library or function clogit from survival library. See [BD1], Chap. 7.4, p. 253?268, for inspiration and comparison of results.

Data

Classical R framework

First some data exploration

load("C:/Práce/Medical Studies/2018/bdendo.RData")
load("C:/Práce/Medical Studies/2018/bdendo11.RData")

dim(bdendo)
[1] 315  13
names(bdendo)
 [1] "set"      "d"        "gall"     "hyp"      "ob"       "est"     
 [7] "dur"      "non"      "duration" "age"      "cest"     "agegrp"  
[13] "age3"    
head(bdendo)
  set d gall hyp   ob est dur non duration age cest agegrp  age3
1   1 1   No  No  Yes Yes   4 Yes       96  74    3  70-74 65-74
2   1 0   No  No <NA>  No   0  No        0  75    0  70-74 65-74
3   1 0   No  No <NA>  No   0  No        0  74    0  70-74 65-74
4   1 0   No  No <NA>  No   0  No        0  74    0  70-74 65-74
5   1 0   No  No  Yes Yes   3 Yes       48  75    1  70-74 65-74
6   2 1   No  No   No Yes   4 Yes       96  67    3  65-69 65-74
head(bdendo11)
  set d gall hyp   ob est dur non duration age cest agegrp  age3
1   1 1   No  No  Yes Yes   4 Yes       96  74    3  70-74 65-74
2   1 0   No  No <NA>  No   0  No        0  75    0  70-74 65-74
3   2 1   No  No   No Yes   4 Yes       96  67    3  65-69 65-74
4   2 0   No  No   No Yes   1  No        5  67    3  65-69 65-74
5   3 1   No Yes  Yes Yes   1 Yes        9  76    1  75-79   75+
6   3 0   No Yes  Yes Yes   4 Yes       96  76    2  75-79   75+

Task 1: Reproduce the descriptive results

First let us see how we can easily obtain results of Table 5.1 for given variable using Epi library. We directly obtain result including the confidence intervals. Note that RR mentioned there is in fact odds ratio, as correctly described in the text.

## Reproduce the tables

with(bdendo, twoby2(d, est))
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : No 
Comparing : 0 vs. 1 

   No Yes    P(No) 95% conf. interval
0 125 127   0.4960    0.4347   0.5575
1   7  56   0.1111    0.0539   0.2152

                                   95% conf. interval
             Relative Risk: 4.4643    2.1961   9.0752
         Sample Odds Ratio: 7.8740    3.4554  17.9429
Conditional MLE Odds Ratio: 7.8292    3.3856  21.1587
    Probability difference: 0.3849    0.2669   0.4681

             Exact P-value: 0.0000 
        Asymptotic P-value: 0.0000 
------------------------------------------------------
with(bdendo, twoby2(d, gall))
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : No 
Comparing : 0 vs. 1 

   No Yes    P(No) 95% conf. interval
0 228  24   0.9048    0.8618   0.9353
1  46  17   0.7302    0.6080   0.8252

                                   95% conf. interval
             Relative Risk: 1.2391    1.0608   1.4474
         Sample Odds Ratio: 3.5109    1.7480   7.0518
Conditional MLE Odds Ratio: 3.4929    1.6241   7.4108
    Probability difference: 0.1746    0.0714   0.2988

             Exact P-value: 0.0006 
        Asymptotic P-value: 0.0004 
------------------------------------------------------
with(bdendo11, twoby2(d, est))
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : No 
Comparing : 0 vs. 1 

  No Yes    P(No) 95% conf. interval
0 33  30   0.5238    0.4015   0.6433
1  7  56   0.1111    0.0539   0.2152

                                   95% conf. interval
             Relative Risk: 4.7143    2.2559   9.8517
         Sample Odds Ratio: 8.8000    3.4778  22.2669
Conditional MLE Odds Ratio: 8.6344    3.2632  26.0178
    Probability difference: 0.4127    0.2550   0.5437

             Exact P-value: 0.0000 
        Asymptotic P-value: 0.0000 
------------------------------------------------------
with(bdendo11, twoby2(d, gall))
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : No 
Comparing : 0 vs. 1 

  No Yes    P(No) 95% conf. interval
0 54   9   0.8571    0.7476   0.9240
1 46  17   0.7302    0.6080   0.8252

                                   95% conf. interval
             Relative Risk: 1.1739    0.9797   1.4066
         Sample Odds Ratio: 2.2174    0.9028   5.4462
Conditional MLE Odds Ratio: 2.2035    0.8346   6.1824
    Probability difference: 0.1270   -0.0154   0.2643

             Exact P-value: 0.1223 
        Asymptotic P-value: 0.0824 
------------------------------------------------------

Now, let us calculate the odds ratios using mh function (Mantel-Haenszel analyses of cohort and case-control studies) for 1:1 matching. Unlike in analysis above, here we consider the matching and therefore the resulting odds ratio is different.

tg11Gall = with(bdendo11, table(d, gall, set))
mhor11Gall = mh(tg11Gall["1",,], tg11Gall["0",,], compare = "gall", levels = c(2, 1))
outGall <- c(mhor11Gall$ratio, mhor11Gall$cl.lower, mhor11Gall$cl.upper, mhor11Gall$p.value)
names(outGall) <- c("Odds ratio", "CI lower", "CI upper", "p-val")
round(outGall,3)
Odds ratio   CI lower   CI upper      p-val 
     2.600      1.094      6.179      0.059 
tg11Est = with(bdendo11,table(d, est, set))
mhor11Est = mh(tg11Est["1",,], tg11Est["0",,], compare = "est", levels = c(2, 1))
outEst <- c(mhor11Est$ratio, mhor11Est$cl.lower, mhor11Est$cl.upper, mhor11Est$p.value)
names(outEst) <- c("Odds ratio", "CI lower", "CI upper", "p-val")
round(outEst,3)
Odds ratio   CI lower   CI upper      p-val 
     9.667      3.565     26.213      0.000 

Task 2: Conduct conditional logistic analysis of 1:1 matching using glm

This task demonstrates on how to use the glm framework to be able to fit the conditional logistic regression for 1:1 matching. In order to have a correct fit, following steps need to be performed:

  • To have matching results wit [BD1], assign the age of case to control.
  • Create variables representing all interactions to be used for modelling
  • Create new data set that contains difference between case - control for response and all covariates for each set.
  • Fit this new data using glm without intercept.

First, let us look at the preparation of the data.

# matching the age
bdendo11temp <- bdendo11 
bdendo11temp[bdendo11temp$d == 0, "age"] <- bdendo11temp[bdendo11temp$d == 1, "age"]
bdendo11temp[bdendo11temp$d == 0, "agegrp"] <- bdendo11temp[bdendo11temp$d == 1, "agegrp"]

# create the data
bdendo11_sub <- bdendo11temp %>%
  # reformat the factors into 0/1
  mutate(gall2 = as.numeric(gall)-1, 
         est2  = as.numeric(est)-1,
         age4 =  as.numeric(age > 69.5),
         cest = as.numeric(as.character(cest))) %>%
  # create other covariates: interactions and conjugate estrogen doses
  mutate(gall2ageGr = gall2*age4, 
         gall2ageCont  = gall2*(age-70),
         dose1 = as.numeric(cest == 1), 
         dose2 = as.numeric(cest == 2),
         dose3 = as.numeric(cest == 3),
         dose = cest,
         doseAge = dose*(age-70)
         ) %>%
  # select the columns to keep for modelling
    select(set, d, gall2, est2, gall2ageGr, gall2ageCont, dose1, dose2, 
                 dose3, dose, doseAge)

# create template for reduced data set
bdendo11_diff <- bdendo11_sub[bdendo11_sub$d == 1, ] 

# subtract within each set the two rows
for(i in 1:nrow(bdendo11_diff)){
  bdendo11_diff[i, -1] <- bdendo11_sub[2*i-1, -1] - bdendo11_sub[2*i, -1]
}

Now we are ready for fitting the models as in Table 7.3. Compare the model results with the book. The modelling results and confidence intervals can be used for the selection of the model and inference. See the [BD1] for interpretation of model results and comments on model building. Note that exponential of the coefficient corresponds to the estimate of odds ratio and gives us the same number as for Mantel-Haensel approach.

# Estrogen
mE1 <- glm(d ~ - 1 + est2, family = binomial, data = bdendo11_diff)
round(summary(mE1)$coeff[, "Estimate"], 3)
[1] 2.269
confint(mE1)
   2.5 %   97.5 % 
1.234413 3.698096 
exp(summary(mE1)$coeff[, "Estimate"])
[1] 9.666666
# Gall blader
mg1 <- glm(d ~ -1 + gall2, family = binomial, data = bdendo11_diff)
mg2 <- glm(d ~ -1 + gall2 + gall2ageGr, family = binomial, data = bdendo11_diff)
mg3 <- glm(d ~ -1 + gall2 + gall2ageCont, family = binomial, data = bdendo11_diff)

round(summary(mg1)$coeff[, "Estimate"], 3)
[1] 0.956
round(summary(mg2)$coeff[, "Estimate"], 3)
     gall2 gall2ageGr 
     1.946     -1.540 
round(summary(mg3)$coeff[, "Estimate"], 3)
       gall2 gall2ageCont 
       1.052       -0.066 
# Estrogen dose

md1 <- glm(d ~ -1 + dose1 + dose2 + dose3, family = binomial, data = bdendo11_diff)
md2 <- glm(d ~ -1 + dose, family = binomial, data = bdendo11_diff)
md3 <- glm(d ~ -1 + dose + doseAge, family = binomial, data = bdendo11_diff)

round(summary(md1)$coeff[, "Estimate"], 3)
dose1 dose2 dose3 
1.524 1.266 2.120 
round(summary(md2)$coeff[, "Estimate"], 3)
[1] 0.69
round(summary(md3)$coeff[, "Estimate"], 3)
   dose doseAge 
  0.686   0.008 
# all except md3 works fine

Task 3: Conduct conditional logistic analysis of the 1:M matching using clogit

The conditional regression can be fitted with the specialized functions. We show first clogit from survival library. These allows us to fit directly complicated models without need of preparatory work and also allows for geneal 1:M matching.

First, check that for 1:1 dataset, we would obtain exactly same results as before.

mc11_e1 <- clogit(d ~ est + strata(set), bdendo11)
summary(mc11_e1)$coeff[, "exp(coef)"]
[1] 9.666667

Now we can continue with 1:M dataset and models from Table 7.5.

mc_e1 <- clogit(d ~ est + strata(set), bdendo)
mc_e2 <- clogit(d ~ est*age3 + strata(set), bdendo)

summary(mc_e1)$coeff[, "exp(coef)"]
[1] 7.954681
summary(mc_e2)$coeff[, "exp(coef)"]
          estYes        age365-74          age375+ estYes:age365-74 
        4.182159               NA               NA         2.333573 
  estYes:age375+ 
        2.181779 
bdendoC <- bdendo %>%
  mutate(cest = as.numeric(as.character(cest)))

mc_ce1 <- clogit(d ~ cest + strata(set), bdendoC)
mc_ce2 <- clogit(d ~ cest*age3 + strata(set), bdendoC)

summary(mc_ce1)$coeff[, "exp(coef)"]
[1] 2.152568
summary(mc_ce2)$coeff[, "exp(coef)"]
          cest      age365-74        age375+ cest:age365-74   cest:age375+ 
      1.628299             NA             NA       1.313193       1.968193 

More complicated model with missing data can be fitted as in Table 7.7.

mc_1 <- clogit(d ~ gall + hyp + ob + non + est + strata(set), bdendo)
summary(mc_1)$coeff[, "exp(coef)"]
  gallYes    hypYes     obYes    nonYes    estYes 
3.5939491 0.7234795 1.5441293 1.9125479 6.9096593 

Finally, we can do same work with clogistic function from Epi library.

mc_e1_Epi <- clogistic(d ~ est, strata = set, data = bdendo)
exp(mc_e1_Epi$coeff)
  estYes 
7.954678 
mc_1_Epi <- clogistic(d ~ gall + hyp + ob + non + est, strata = set, data = bdendo)
exp(mc_1_Epi$coeff)
  gallYes    hypYes     obYes    nonYes    estYes 
3.5939473 0.7234796 1.5441294 1.9125469 6.9096378