NMST432 Advanced Regression Models

Tutorial on grouped data analysis in R


Back to course page


Grouped data structures

We will need library("nlme") to access functions for longitudinal data analysis and library("lattice") for high-level plotting functions.

The examples use dataset Orthodont available in the package nlme. Access it by calling data(Orthodont, package = "nlme").

This data frame contains the following columns:

pterygomaxillary fissure

There are four measurements of distance on each subject, obtained at ages 8, 10, 12, and 14. The goal of the analysis is to describe how the distance changes with age, how the growth varies between subjects and how it depends on gender.


Construction of groupedData objects

As you can see by checking

str(Orthodont)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   108 obs. of  4 variables:
##  $ distance: num  26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
##  $ age     : num  8 10 12 14 8 10 12 14 8 10 ...
##  $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "outer")=Class 'formula'  language ~Sex
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  - attr(*, "formula")=Class 'formula'  language distance ~ age | Subject
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Age"
##   ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(yr)"
##   ..$ y: chr "(mm)"
##  - attr(*, "FUN")=function (x)  
##   ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
##  - attr(*, "order.groups")= logi TRUE
head(Orthodont)
## Grouped Data: distance ~ age | Subject
##   distance age Subject  Sex
## 1     26.0   8     M01 Male
## 2     25.0  10     M01 Male
## 3     29.0  12     M01 Male
## 4     31.0  14     M01 Male
## 5     21.5   8     M02 Male
## 6     22.5  10     M02 Male

the dataset has an additional special structure. It is a groupedData object. Let us explain how such objects can be constructed.

Remove the groupedData structure from Orthodont by creating a classical data.frame called Ort:

Ort <- as.data.frame(Orthodont)
str(Ort)
## 'data.frame':    108 obs. of  4 variables:
##  $ distance: num  26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
##  $ age     : num  8 10 12 14 8 10 12 14 8 10 ...
##  $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
head(Ort)
##   distance age Subject  Sex
## 1     26.0   8     M01 Male
## 2     25.0  10     M01 Male
## 3     29.0  12     M01 Male
## 4     31.0  14     M01 Male
## 5     21.5   8     M02 Male
## 6     22.5  10     M02 Male

Now recreate a grouped data object it by running the command

Ort2 <- groupedData(formula = distance ~ age | Subject, 
                 data = Ort, outer = ~Sex,
                 labels = list(x = "Age", y = "Distance from pituitary to pterygomaxillary fissure"),
                 units = list(x = "[yr]", y = "[mm]"))
str(Ort2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   108 obs. of  4 variables:
##  $ distance: num  26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
##  $ age     : num  8 10 12 14 8 10 12 14 8 10 ...
##  $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language distance ~ age | Subject
##   .. ..- attr(*, ".Environment")=<environment: 0x906d1a8> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Age"
##   ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "[yr]"
##   ..$ y: chr "[mm]"
##  - attr(*, "outer")=Class 'formula'  language ~Sex
##   .. ..- attr(*, ".Environment")=<environment: 0x906d1a8> 
##  - attr(*, "FUN")=function (x)  
##  - attr(*, "order.groups")= logi TRUE
head(Ort2)
## Grouped Data: distance ~ age | Subject
## <environment: 0x906d1a8>
##   distance age Subject  Sex
## 1     26.0   8     M01 Male
## 2     25.0  10     M01 Male
## 3     29.0  12     M01 Male
## 4     31.0  14     M01 Male
## 5     21.5   8     M02 Male
## 6     22.5  10     M02 Male

This call to the groupedData function specifies the following:

See also help(groupedData). Functions available in library("nlme") are able to use the structure specified by the groupedData object.


Exploratory analysis of grouped data

The standard R functions for descriptive analysis may not be useful for describing grouped data because they do not take into account the group structure. Problems may be caused, e.g., by the repetition of the values of outer variables on each subject. For example:

summary(Ort)
##     distance          age          Subject       Sex    
##  Min.   :16.50   Min.   : 8.0   M16    : 4   Male  :64  
##  1st Qu.:22.00   1st Qu.: 9.5   M05    : 4   Female:44  
##  Median :23.75   Median :11.0   M02    : 4              
##  Mean   :24.02   Mean   :11.0   M11    : 4              
##  3rd Qu.:26.00   3rd Qu.:12.5   M07    : 4              
##  Max.   :31.50   Max.   :14.0   M08    : 4              
##                                 (Other):84

Instead of using summary, one can use gsummary on groupedData object to see summaries of variables by group. Continuous variables are summarized by mean, which can be changed by providing the argument FUN.

gsummary(Ort2)
##     distance age Subject    Sex
## M16   23.000  11     M16   Male
## M05   23.000  11     M05   Male
## M02   23.375  11     M02   Male
## M11   23.625  11     M11   Male
## M07   23.750  11     M07   Male
## M08   23.875  11     M08   Male
## M03   24.250  11     M03   Male
## M12   24.250  11     M12   Male
## M13   24.250  11     M13   Male
## M14   24.875  11     M14   Male
## M09   25.125  11     M09   Male
## M15   25.875  11     M15   Male
## M06   26.375  11     M06   Male
## M04   26.625  11     M04   Male
## M01   27.750  11     M01   Male
## M10   29.500  11     M10   Male
## F10   18.500  11     F10 Female
## F09   21.125  11     F09 Female
## F06   21.125  11     F06 Female
## F01   21.375  11     F01 Female
## F05   22.625  11     F05 Female
## F07   23.000  11     F07 Female
## F02   23.000  11     F02 Female
## F08   23.375  11     F08 Female
## F03   23.750  11     F03 Female
## F04   24.875  11     F04 Female
## F11   26.375  11     F11 Female
gsummary(Ort2, FUN = max)
##     distance age Subject    Sex
## M16     25.0  14     M16   Male
## M05     26.0  14     M05   Male
## M02     26.5  14     M02   Male
## M11     25.0  14     M11   Male
## M07     26.5  14     M07   Male
## M08     25.5  14     M08   Male
## M03     27.5  14     M03   Male
## M12     28.0  14     M12   Male
## M13     29.5  14     M13   Male
## M14     26.0  14     M14   Male
## M09     31.0  14     M09   Male
## M15     30.0  14     M15   Male
## M06     28.5  14     M06   Male
## M04     27.5  14     M04   Male
## M01     31.0  14     M01   Male
## M10     31.5  14     M10   Male
## F10     19.5  14     F10 Female
## F09     22.0  14     F09 Female
## F06     22.5  14     F06 Female
## F01     23.0  14     F01 Female
## F05     23.5  14     F05 Female
## F07     25.0  14     F07 Female
## F02     25.5  14     F02 Female
## F08     24.0  14     F08 Female
## F03     26.0  14     F03 Female
## F04     26.5  14     F04 Female
## F11     28.0  14     F11 Female

This output can be further summarized across all groups by the function summary:

summary(gsummary(Ort2))
##     distance          age        Subject       Sex    
##  Min.   :18.50   Min.   :11   M16    : 1   Male  :16  
##  1st Qu.:23.00   1st Qu.:11   M05    : 1   Female:11  
##  Median :23.75   Median :11   M02    : 1              
##  Mean   :24.02   Mean   :11   M11    : 1              
##  3rd Qu.:25.00   3rd Qu.:11   M07    : 1              
##  Max.   :29.50   Max.   :11   M08    : 1              
##                               (Other):21

Compare this to the output of

summary(Ort2)
##     distance          age          Subject       Sex    
##  Min.   :16.50   Min.   : 8.0   M16    : 4   Male  :64  
##  1st Qu.:22.00   1st Qu.: 9.5   M05    : 4   Female:44  
##  Median :23.75   Median :11.0   M02    : 4              
##  Mean   :24.02   Mean   :11.0   M11    : 4              
##  3rd Qu.:26.00   3rd Qu.:12.5   M07    : 4              
##  Max.   :31.50   Max.   :14.0   M08    : 4              
##                                 (Other):84

The package nlme adds methods for creating subject-level plots of groupedData objects.

plot(Ort2)

plot of chunk unnamed-chunk-9

Note that such plot is only useful if the number of subjects is not too high. With a high number of subjects, user-tailored plots must be used! Another strategy is to select (randomly) a reasonable number of subjects and then to create plots for only this subset.

A two-panel plot by gender is obtained by calling

plot(Ort2, outer = TRUE, aspect = "fill")

plot of chunk unnamed-chunk-10

In case there are more subject-specific covariates in the data, only some of them can be selected to split the plot:

plot(Ort2, outer = ~Sex, aspect = "fill")

plot of chunk unnamed-chunk-11

For more details, see help(plot.nfnGroupedData).


Building a linear mixed model for longitudinal dara

It is not appropriate to analyze longitudinal and in general grouped data by a linear model using least squares. Nevertheless, we will still try this model to compare the results with the analysis by the linear mixed model we will consider later. Fitting a classical linear model can (sometimes!!!) be useful in an exploratory part of the analysis. A classical linear model is useful exploratory tool in two ways:

  1. Using ordinary least squares (OLS) applied to the full data set while ignoring dependencies. This basically means fitting a marginal model in which we (mostly incorrectly) assume homoscedastic and independent errors. This is especially useful to decide (at least preliminary) on the fixed effects part of the model based on residual plots used for a classical linear model. Nevertheless, do not forget that formal tests are in general invalid due to correlated errors which are also not necessarily homoscedastic.

  2. Using OLS applied separately to each subject. This is especially useful to decide (again at least preliminary) on the random effects part of the model. When doing so, we can

  3. Use residuals from point 1 as outcome (= detrended observations);

  4. Assume (and estimate) common residual variance for all subjects (which is basically achieved by considering Subject as a categorical covariate (factor) in the linear model).

In case there are no time-dependent regressors except time and no or only few (and categorical) outer regressors, steps 1 and 2 can be done almost together.

To obtain a meaningful interpretation of the intercept, we subtract 8 (the age at the initial measurement) from age. Then the intercept estimates the expected value at the start of the study.

OLS applied to full-data set

Let us first fit a classical linear model while assuming a linear trend over time with possibly different parameters of the line for male and female.

fitlm <- lm(distance ~ I(age - 8) * Sex, data = Ort2)
summary(fitlm)
## 
## Call:
## lm(formula = distance ~ I(age - 8) * Sex, data = Ort2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6156 -1.3219 -0.1682  1.3299  5.2469 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           22.6156     0.4721  47.907  < 2e-16 ***
## I(age - 8)             0.7844     0.1262   6.217 1.07e-08 ***
## SexFemale             -1.4065     0.7396  -1.902    0.060 .  
## I(age - 8):SexFemale  -0.3048     0.1977  -1.542    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.257 on 104 degrees of freedom
## Multiple R-squared:  0.4227, Adjusted R-squared:  0.4061 
## F-statistic: 25.39 on 3 and 104 DF,  p-value: 2.108e-12

Classical basic residual plots:

par(mfrow = c(2, 2))
plot(fitlm, pch = 23, col = "darkblue", bg = "skyblue")

plot of chunk unnamed-chunk-13

par(mfrow = c(1, 1))

Especially from the first plot we can conclude that the linear trend with parameters of the lines depending on Sex is a reasonable model for the marginal evolution of the response, i.e., for the fixed effects part of the subsequent linear mixed model.

One more classical plot (mainly to see how the trends for male and female differ):

COL <- c("darkblue", "red")
BG <- c("skyblue", "pink")
PCH <- c(24, 25)
names(COL) <- names(BG) <- names(PCH) <- c("Male", "Female")
plot(distance ~ age, col = COL[Sex], pch = PCH[Sex], bg = BG[Sex], data = Ort2)
abline(lm(distance ~ age, data = subset(Ort2, Sex == "Female")), col="red", lwd = 2)
abline(lm(distance ~ age, data = subset(Ort2, Sex == "Male")), col = "blue", lwd = 2)

plot of chunk unnamed-chunk-14

However, the residuals for measurements taken on the same subject are correlated, as the following plot testifies.

bwplot(Ort2$Subject ~ residuals(fitlm))

plot of chunk unnamed-chunk-15

But we should not be surprised by that…

OLS applied separately to each subject

Let us now explore subject-specific evolutions of the response over time. In our dataset, we have just one outer covariate (Sex) which, moreover, has only two levels. Hence it is not really necessary to detrend the outcomes before fitting subject-specific models. We can easily take Sex into account at later stage as will be shown.

We will now consider fitting subject-specific regression lines to these data. At the same time, we will assume the same residual variance for all subjects. The subject-specific regression lines are obtained from the function lmList.

fitslm <- lmList(Ort2)

The information on how to identify the subjects and what models to fit is extracted from the groupedData structure of Ort2. The structure of the model can also be specified explicitely (and must be specified in this way if a simpler linear model is to be fitted than that implied by the groupedData structure). The lmList function can also be used with data stored in a standard data.frame:

fitslm <- lmList(distance ~ age | Subject, data = Ort)

The estimated subject-specific coefficients (intercepts and slopes) can be summarized and the residuals plotted easily.

summary(fitslm)
## Call:
##   Model: distance ~ age | Subject 
##    Data: Ort 
## 
## Coefficients:
##    (Intercept) 
##     Estimate Std. Error   t value     Pr(>|t|)
## M16    16.95   3.288173 5.1548379 3.695247e-06
## M05    13.65   3.288173 4.1512411 1.181678e-04
## M02    14.85   3.288173 4.5161854 3.458934e-05
## M11    20.05   3.288173 6.0976106 1.188838e-07
## M07    14.95   3.288173 4.5465974 3.116705e-05
## M08    19.75   3.288173 6.0063745 1.665712e-07
## M03    16.00   3.288173 4.8659237 1.028488e-05
## M12    13.25   3.288173 4.0295930 1.762580e-04
## M13     2.80   3.288173 0.8515366 3.982319e-01
## M14    19.10   3.288173 5.8086964 3.449588e-07
## M09    14.40   3.288173 4.3793313 5.509579e-05
## M15    13.50   3.288173 4.1056231 1.373664e-04
## M06    18.95   3.288173 5.7630783 4.078189e-07
## M04    24.70   3.288173 7.5117696 6.081644e-10
## M01    17.30   3.288173 5.2612799 2.523621e-06
## M10    21.25   3.288173 6.4625549 3.065505e-08
## F10    13.55   3.288173 4.1208291 1.306536e-04
## F09    18.10   3.288173 5.5045761 1.047769e-06
## F06    17.00   3.288173 5.1700439 3.499774e-06
## F01    17.25   3.288173 5.2460739 2.665260e-06
## F05    19.60   3.288173 5.9607565 1.971127e-07
## F07    16.95   3.288173 5.1548379 3.695247e-06
## F02    14.20   3.288173 4.3185072 6.763806e-05
## F08    21.45   3.288173 6.5233789 2.443813e-08
## F03    14.40   3.288173 4.3793313 5.509579e-05
## F04    19.65   3.288173 5.9759625 1.863600e-07
## F11    18.95   3.288173 5.7630783 4.078189e-07
##    age 
##     Estimate Std. Error   t value     Pr(>|t|)
## M16    0.550  0.2929338 1.8775576 6.584707e-02
## M05    0.850  0.2929338 2.9016799 5.361639e-03
## M02    0.775  0.2929338 2.6456493 1.065760e-02
## M11    0.325  0.2929338 1.1094659 2.721458e-01
## M07    0.800  0.2929338 2.7309929 8.511442e-03
## M08    0.375  0.2929338 1.2801529 2.059634e-01
## M03    0.750  0.2929338 2.5603058 1.328807e-02
## M12    1.000  0.2929338 3.4137411 1.222240e-03
## M13    1.950  0.2929338 6.6567951 1.485652e-08
## M14    0.525  0.2929338 1.7922141 7.870160e-02
## M09    0.975  0.2929338 3.3283976 1.577941e-03
## M15    1.125  0.2929338 3.8404587 3.247135e-04
## M06    0.675  0.2929338 2.3042752 2.508117e-02
## M04    0.175  0.2929338 0.5974047 5.527342e-01
## M01    0.950  0.2929338 3.2430540 2.030113e-03
## M10    0.750  0.2929338 2.5603058 1.328807e-02
## F10    0.450  0.2929338 1.5361835 1.303325e-01
## F09    0.275  0.2929338 0.9387788 3.520246e-01
## F06    0.375  0.2929338 1.2801529 2.059634e-01
## F01    0.375  0.2929338 1.2801529 2.059634e-01
## F05    0.275  0.2929338 0.9387788 3.520246e-01
## F07    0.550  0.2929338 1.8775576 6.584707e-02
## F02    0.800  0.2929338 2.7309929 8.511442e-03
## F08    0.175  0.2929338 0.5974047 5.527342e-01
## F03    0.850  0.2929338 2.9016799 5.361639e-03
## F04    0.475  0.2929338 1.6215270 1.107298e-01
## F11    0.675  0.2929338 2.3042752 2.508117e-02
## 
## Residual standard error: 1.31004 on 54 degrees of freedom
plot(fitslm)

plot of chunk unnamed-chunk-19

However, we would rather adjust the intercept to the expected distance at the age 8, so we redefine the fitted model as follows:

fitslm2 <- lmList(distance ~ I(age - 8), data = Ort2)

or

fitslm2 <- lmList(distance ~ I(age - 8) | Subject, data = Ort)
summary(fitslm2)
## Call:
##   Model: distance ~ I(age - 8) | Subject 
##    Data: Ort 
## 
## Coefficients:
##    (Intercept) 
##     Estimate Std. Error  t value     Pr(>|t|)
## M16    21.35   1.096058 19.47890 4.359278e-26
## M05    20.45   1.096058 18.65778 3.319674e-25
## M02    21.05   1.096058 19.20519 8.514661e-26
## M11    22.65   1.096058 20.66497 2.596148e-27
## M07    21.35   1.096058 19.47890 4.359278e-26
## M08    22.75   1.096058 20.75621 2.100698e-27
## M03    22.00   1.096058 20.07194 1.046935e-26
## M12    21.25   1.096058 19.38766 5.444880e-26
## M13    18.40   1.096058 16.78744 4.366662e-23
## M14    23.30   1.096058 21.25800 6.639625e-28
## M09    22.20   1.096058 20.25441 6.794018e-27
## M15    22.50   1.096058 20.52812 3.571686e-27
## M06    24.35   1.096058 22.21598 7.809554e-29
## M04    26.10   1.096058 23.81261 2.589060e-30
## M01    24.90   1.096058 22.71778 2.621421e-29
## M10    27.25   1.096058 24.86183 3.054764e-31
## F10    17.15   1.096058 15.64699 1.031736e-21
## F09    20.30   1.096058 18.52092 4.686286e-25
## F06    20.00   1.096058 18.24721 9.391685e-25
## F01    20.25   1.096058 18.47530 5.259209e-25
## F05    21.80   1.096058 19.88946 1.618187e-26
## F07    21.35   1.096058 19.47890 4.359278e-26
## F02    20.60   1.096058 18.79463 2.355967e-25
## F08    22.85   1.096058 20.84744 1.701036e-27
## F03    21.20   1.096058 19.34205 6.087007e-26
## F04    23.45   1.096058 21.39486 4.867852e-28
## F11    24.35   1.096058 22.21598 7.809554e-29
##    I(age - 8) 
##     Estimate Std. Error   t value     Pr(>|t|)
## M16    0.550  0.2929338 1.8775576 6.584707e-02
## M05    0.850  0.2929338 2.9016799 5.361639e-03
## M02    0.775  0.2929338 2.6456493 1.065760e-02
## M11    0.325  0.2929338 1.1094659 2.721458e-01
## M07    0.800  0.2929338 2.7309929 8.511442e-03
## M08    0.375  0.2929338 1.2801529 2.059634e-01
## M03    0.750  0.2929338 2.5603058 1.328807e-02
## M12    1.000  0.2929338 3.4137411 1.222240e-03
## M13    1.950  0.2929338 6.6567951 1.485652e-08
## M14    0.525  0.2929338 1.7922141 7.870160e-02
## M09    0.975  0.2929338 3.3283976 1.577941e-03
## M15    1.125  0.2929338 3.8404587 3.247135e-04
## M06    0.675  0.2929338 2.3042752 2.508117e-02
## M04    0.175  0.2929338 0.5974047 5.527342e-01
## M01    0.950  0.2929338 3.2430540 2.030113e-03
## M10    0.750  0.2929338 2.5603058 1.328807e-02
## F10    0.450  0.2929338 1.5361835 1.303325e-01
## F09    0.275  0.2929338 0.9387788 3.520246e-01
## F06    0.375  0.2929338 1.2801529 2.059634e-01
## F01    0.375  0.2929338 1.2801529 2.059634e-01
## F05    0.275  0.2929338 0.9387788 3.520246e-01
## F07    0.550  0.2929338 1.8775576 6.584707e-02
## F02    0.800  0.2929338 2.7309929 8.511442e-03
## F08    0.175  0.2929338 0.5974047 5.527342e-01
## F03    0.850  0.2929338 2.9016799 5.361639e-03
## F04    0.475  0.2929338 1.6215270 1.107298e-01
## F11    0.675  0.2929338 2.3042752 2.508117e-02
## 
## Residual standard error: 1.31004 on 54 degrees of freedom

Let us now plot the estimated intercepts and slopes against each other.

pairs(fitslm2, id = 0.01, adj = -0.5)

plot of chunk unnamed-chunk-23

The arguments id and adj are used to identify outliers. The subject M13 had an unusually large slope (fast skull growth). The intercepts and slopes, however, do not seem to be highly correlated. This would be different if we did not shift age by 8 years:

pairs(fitslm, id = 0.01, adj = -0.5)

plot of chunk unnamed-chunk-24

This plot shows a relatively large correlation between the intercepts and slopes.

Next, we will calculate and plot confidence intervals for the subject-specific intercepts and slopes.

plot(intervals(fitslm2))

plot of chunk unnamed-chunk-25

This plot tells us better how much the intercepts and slopes vary between individuals. It seems that the variability in the slopes is not as large as that in the intercepts.

Even better insight into the structure of the subject-specific coefficients is obtained if we use detrended outcomes. In our case, we may want to adjust the outcomes for Sex. This means, we take residuals from the overall OLS as outcomes when fitting the subject-specific models.

Ort  <- transform(Ort, resdistance = residuals(fitlm))
resfitslm  <- lmList(resdistance ~ age | Subject, data = Ort)
resfitslm2 <- lmList(resdistance ~ I(age - 8) | Subject, data = Ort)
plot(resfitslm)

plot of chunk unnamed-chunk-28

pairs(resfitslm, id = 0.01, adj = -0.5)