library(dplyr)
library(ggplot2)
library(forcats)
library(tidyr)
library(Epi)
library(survival)
Assignment 3: Matched case-control analysis via conditional logistic regression
Assignment
The Los Angeles Study of Endometrial Cancer was a matched case-control study conducted in California in the 1970’s (description in [BD1], Chap. 5.1, p. 162?163, data in [BD1], App. III, p. 290?296). There are 63 cases of endometrial cancer, all women age 55 or over, each matched to four controls living in the same retirement community. The primary exposure of interest was estrogen use. The secondary exposure was gallbladder disease. The Epi library in R includes two versions of the data: the full dataset bdendo and a subset containing a single control matched to each case bdendo11 .
- Conduct descriptive analysis similar to Table 5.1 of [BD1], p. 163. Use bdendo11 data to estimate odds ratios using the method for 1:1 matching and binary exposure.
- Conduct conditional logistic analysis of the bdendo11 dataset (1:1 matching) using the function glm. See [BD1], Chap. 7.3, p. 253?259, for inspiration and comparison of results.
- Conduct conditional logistic analysis of the bdendo dataset (1:4 matching) using one of the conditional logistic regression functions available in R (function clogistic from Epi library or function clogit from survival library. See [BD1], Chap. 7.4, p. 253?268, for inspiration and comparison of results.
Data
Classical R framework
First some data exploration
load("C:/Práce/Medical Studies/2018/bdendo.RData")
load("C:/Práce/Medical Studies/2018/bdendo11.RData")
dim(bdendo)
[1] 315 13
names(bdendo)
[1] "set" "d" "gall" "hyp" "ob" "est"
[7] "dur" "non" "duration" "age" "cest" "agegrp"
[13] "age3"
head(bdendo)
set d gall hyp ob est dur non duration age cest agegrp age3
1 1 1 No No Yes Yes 4 Yes 96 74 3 70-74 65-74
2 1 0 No No <NA> No 0 No 0 75 0 70-74 65-74
3 1 0 No No <NA> No 0 No 0 74 0 70-74 65-74
4 1 0 No No <NA> No 0 No 0 74 0 70-74 65-74
5 1 0 No No Yes Yes 3 Yes 48 75 1 70-74 65-74
6 2 1 No No No Yes 4 Yes 96 67 3 65-69 65-74
head(bdendo11)
set d gall hyp ob est dur non duration age cest agegrp age3
1 1 1 No No Yes Yes 4 Yes 96 74 3 70-74 65-74
2 1 0 No No <NA> No 0 No 0 75 0 70-74 65-74
3 2 1 No No No Yes 4 Yes 96 67 3 65-69 65-74
4 2 0 No No No Yes 1 No 5 67 3 65-69 65-74
5 3 1 No Yes Yes Yes 1 Yes 9 76 1 75-79 75+
6 3 0 No Yes Yes Yes 4 Yes 96 76 2 75-79 75+
Task 1: Reproduce the descriptive results
First let us see how we can easily obtain results of Table 5.1 for given variable using Epi library. We directly obtain result including the confidence intervals. Note that RR mentioned there is in fact odds ratio, as correctly described in the text.
## Reproduce the tables
with(bdendo, twoby2(d, est))
2 by 2 table analysis:
------------------------------------------------------
Outcome : No
Comparing : 0 vs. 1
No Yes P(No) 95% conf. interval
0 125 127 0.4960 0.4347 0.5575
1 7 56 0.1111 0.0539 0.2152
95% conf. interval
Relative Risk: 4.4643 2.1961 9.0752
Sample Odds Ratio: 7.8740 3.4554 17.9429
Conditional MLE Odds Ratio: 7.8292 3.3856 21.1587
Probability difference: 0.3849 0.2669 0.4681
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
with(bdendo, twoby2(d, gall))
2 by 2 table analysis:
------------------------------------------------------
Outcome : No
Comparing : 0 vs. 1
No Yes P(No) 95% conf. interval
0 228 24 0.9048 0.8618 0.9353
1 46 17 0.7302 0.6080 0.8252
95% conf. interval
Relative Risk: 1.2391 1.0608 1.4474
Sample Odds Ratio: 3.5109 1.7480 7.0518
Conditional MLE Odds Ratio: 3.4929 1.6241 7.4108
Probability difference: 0.1746 0.0714 0.2988
Exact P-value: 0.0006
Asymptotic P-value: 0.0004
------------------------------------------------------
with(bdendo11, twoby2(d, est))
2 by 2 table analysis:
------------------------------------------------------
Outcome : No
Comparing : 0 vs. 1
No Yes P(No) 95% conf. interval
0 33 30 0.5238 0.4015 0.6433
1 7 56 0.1111 0.0539 0.2152
95% conf. interval
Relative Risk: 4.7143 2.2559 9.8517
Sample Odds Ratio: 8.8000 3.4778 22.2669
Conditional MLE Odds Ratio: 8.6344 3.2632 26.0178
Probability difference: 0.4127 0.2550 0.5437
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
with(bdendo11, twoby2(d, gall))
2 by 2 table analysis:
------------------------------------------------------
Outcome : No
Comparing : 0 vs. 1
No Yes P(No) 95% conf. interval
0 54 9 0.8571 0.7476 0.9240
1 46 17 0.7302 0.6080 0.8252
95% conf. interval
Relative Risk: 1.1739 0.9797 1.4066
Sample Odds Ratio: 2.2174 0.9028 5.4462
Conditional MLE Odds Ratio: 2.2035 0.8346 6.1824
Probability difference: 0.1270 -0.0154 0.2643
Exact P-value: 0.1223
Asymptotic P-value: 0.0824
------------------------------------------------------
Now, let us calculate the odds ratios using mh function (Mantel-Haenszel analyses of cohort and case-control studies) for 1:1 matching. Unlike in analysis above, here we consider the matching and therefore the resulting odds ratio is different.
= with(bdendo11, table(d, gall, set))
tg11Gall = mh(tg11Gall["1",,], tg11Gall["0",,], compare = "gall", levels = c(2, 1))
mhor11Gall <- c(mhor11Gall$ratio, mhor11Gall$cl.lower, mhor11Gall$cl.upper, mhor11Gall$p.value)
outGall names(outGall) <- c("Odds ratio", "CI lower", "CI upper", "p-val")
round(outGall,3)
Odds ratio CI lower CI upper p-val
2.600 1.094 6.179 0.059
= with(bdendo11,table(d, est, set))
tg11Est = mh(tg11Est["1",,], tg11Est["0",,], compare = "est", levels = c(2, 1))
mhor11Est <- c(mhor11Est$ratio, mhor11Est$cl.lower, mhor11Est$cl.upper, mhor11Est$p.value)
outEst names(outEst) <- c("Odds ratio", "CI lower", "CI upper", "p-val")
round(outEst,3)
Odds ratio CI lower CI upper p-val
9.667 3.565 26.213 0.000
Task 2: Conduct conditional logistic analysis of 1:1 matching using glm
This task demonstrates on how to use the glm framework to be able to fit the conditional logistic regression for 1:1 matching. In order to have a correct fit, following steps need to be performed:
- To have matching results wit [BD1], assign the age of case to control.
- Create variables representing all interactions to be used for modelling
- Create new data set that contains difference between case - control for response and all covariates for each set.
- Fit this new data using glm without intercept.
First, let us look at the preparation of the data.
# matching the age
<- bdendo11
bdendo11temp $d == 0, "age"] <- bdendo11temp[bdendo11temp$d == 1, "age"]
bdendo11temp[bdendo11temp$d == 0, "agegrp"] <- bdendo11temp[bdendo11temp$d == 1, "agegrp"]
bdendo11temp[bdendo11temp
# create the data
<- bdendo11temp %>%
bdendo11_sub # reformat the factors into 0/1
mutate(gall2 = as.numeric(gall)-1,
est2 = as.numeric(est)-1,
age4 = as.numeric(age > 69.5),
cest = as.numeric(as.character(cest))) %>%
# create other covariates: interactions and conjugate estrogen doses
mutate(gall2ageGr = gall2*age4,
gall2ageCont = gall2*(age-70),
dose1 = as.numeric(cest == 1),
dose2 = as.numeric(cest == 2),
dose3 = as.numeric(cest == 3),
dose = cest,
doseAge = dose*(age-70)
%>%
) # select the columns to keep for modelling
select(set, d, gall2, est2, gall2ageGr, gall2ageCont, dose1, dose2,
dose3, dose, doseAge)
# create template for reduced data set
<- bdendo11_sub[bdendo11_sub$d == 1, ]
bdendo11_diff
# subtract within each set the two rows
for(i in 1:nrow(bdendo11_diff)){
-1] <- bdendo11_sub[2*i-1, -1] - bdendo11_sub[2*i, -1]
bdendo11_diff[i, }
Now we are ready for fitting the models as in Table 7.3. Compare the model results with the book. The modelling results and confidence intervals can be used for the selection of the model and inference. See the [BD1] for interpretation of model results and comments on model building. Note that exponential of the coefficient corresponds to the estimate of odds ratio and gives us the same number as for Mantel-Haensel approach.
# Estrogen
<- glm(d ~ - 1 + est2, family = binomial, data = bdendo11_diff)
mE1 round(summary(mE1)$coeff[, "Estimate"], 3)
[1] 2.269
confint(mE1)
2.5 % 97.5 %
1.234413 3.698096
exp(summary(mE1)$coeff[, "Estimate"])
[1] 9.666666
# Gall blader
<- glm(d ~ -1 + gall2, family = binomial, data = bdendo11_diff)
mg1 <- glm(d ~ -1 + gall2 + gall2ageGr, family = binomial, data = bdendo11_diff)
mg2 <- glm(d ~ -1 + gall2 + gall2ageCont, family = binomial, data = bdendo11_diff)
mg3
round(summary(mg1)$coeff[, "Estimate"], 3)
[1] 0.956
round(summary(mg2)$coeff[, "Estimate"], 3)
gall2 gall2ageGr
1.946 -1.540
round(summary(mg3)$coeff[, "Estimate"], 3)
gall2 gall2ageCont
1.052 -0.066
# Estrogen dose
<- glm(d ~ -1 + dose1 + dose2 + dose3, family = binomial, data = bdendo11_diff)
md1 <- glm(d ~ -1 + dose, family = binomial, data = bdendo11_diff)
md2 <- glm(d ~ -1 + dose + doseAge, family = binomial, data = bdendo11_diff)
md3
round(summary(md1)$coeff[, "Estimate"], 3)
dose1 dose2 dose3
1.524 1.266 2.120
round(summary(md2)$coeff[, "Estimate"], 3)
[1] 0.69
round(summary(md3)$coeff[, "Estimate"], 3)
dose doseAge
0.686 0.008
# all except md3 works fine
Task 3: Conduct conditional logistic analysis of the 1:M matching using clogit
The conditional regression can be fitted with the specialized functions. We show first clogit from survival library. These allows us to fit directly complicated models without need of preparatory work and also allows for geneal 1:M matching.
First, check that for 1:1 dataset, we would obtain exactly same results as before.
<- clogit(d ~ est + strata(set), bdendo11)
mc11_e1 summary(mc11_e1)$coeff[, "exp(coef)"]
[1] 9.666667
Now we can continue with 1:M dataset and models from Table 7.5.
<- clogit(d ~ est + strata(set), bdendo)
mc_e1 <- clogit(d ~ est*age3 + strata(set), bdendo)
mc_e2
summary(mc_e1)$coeff[, "exp(coef)"]
[1] 7.954681
summary(mc_e2)$coeff[, "exp(coef)"]
estYes age365-74 age375+ estYes:age365-74
4.182159 NA NA 2.333573
estYes:age375+
2.181779
<- bdendo %>%
bdendoC mutate(cest = as.numeric(as.character(cest)))
<- clogit(d ~ cest + strata(set), bdendoC)
mc_ce1 <- clogit(d ~ cest*age3 + strata(set), bdendoC)
mc_ce2
summary(mc_ce1)$coeff[, "exp(coef)"]
[1] 2.152568
summary(mc_ce2)$coeff[, "exp(coef)"]
cest age365-74 age375+ cest:age365-74 cest:age375+
1.628299 NA NA 1.313193 1.968193
More complicated model with missing data can be fitted as in Table 7.7.
<- clogit(d ~ gall + hyp + ob + non + est + strata(set), bdendo)
mc_1 summary(mc_1)$coeff[, "exp(coef)"]
gallYes hypYes obYes nonYes estYes
3.5939491 0.7234795 1.5441293 1.9125479 6.9096593
Finally, we can do same work with clogistic function from Epi library.
<- clogistic(d ~ est, strata = set, data = bdendo)
mc_e1_Epi exp(mc_e1_Epi$coeff)
estYes
7.954678
<- clogistic(d ~ gall + hyp + ob + non + est, strata = set, data = bdendo)
mc_1_Epi exp(mc_1_Epi$coeff)
gallYes hypYes obYes nonYes estYes
3.5939473 0.7234796 1.5441294 1.9125469 6.9096378