We will need library("nlme")
to access functions for longitudinal data analysis and library("lattice")
for high-level plotting functions.
The examples use dataset Orthodont available in the package nlme
. Access it by calling data(Orthodont, package = "nlme")
.
This data frame contains the following columns:
distance
A numeric vector of distances (mm) from the pituitary (hypophysis gland) to the pterygomaxillary fissure (see the picture below). These distances are measured on x-ray images of the skull.age
A numeric vector of ages of the subject (yr).Subject
An ordered factor indicating the subject on which the measurement was made. The levels are labelled M01
to M16
for the males and F01
to F13
for the females. The ordering is by increasing average distance within sex.Sex
A factor with levels Male
and Female
.There are four measurements of distance
on each subject, obtained at ages 8, 10, 12, and 14. The goal of the analysis is to describe how the distance changes with age, how the growth varies between subjects and how it depends on gender.
As you can see by checking
str(Orthodont)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 108 obs. of 4 variables:
## $ distance: num 26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
## $ age : num 8 10 12 14 8 10 12 14 8 10 ...
## $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "outer")=Class 'formula' language ~Sex
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "formula")=Class 'formula' language distance ~ age | Subject
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Age"
## ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
## - attr(*, "units")=List of 2
## ..$ x: chr "(yr)"
## ..$ y: chr "(mm)"
## - attr(*, "FUN")=function (x)
## ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
## - attr(*, "order.groups")= logi TRUE
head(Orthodont)
## Grouped Data: distance ~ age | Subject
## distance age Subject Sex
## 1 26.0 8 M01 Male
## 2 25.0 10 M01 Male
## 3 29.0 12 M01 Male
## 4 31.0 14 M01 Male
## 5 21.5 8 M02 Male
## 6 22.5 10 M02 Male
the dataset has an additional special structure. It is a groupedData
object. Let us explain how such objects can be constructed.
Remove the groupedData structure from Orthodont by creating a classical data.frame called Ort
:
Ort <- as.data.frame(Orthodont)
str(Ort)
## 'data.frame': 108 obs. of 4 variables:
## $ distance: num 26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
## $ age : num 8 10 12 14 8 10 12 14 8 10 ...
## $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
head(Ort)
## distance age Subject Sex
## 1 26.0 8 M01 Male
## 2 25.0 10 M01 Male
## 3 29.0 12 M01 Male
## 4 31.0 14 M01 Male
## 5 21.5 8 M02 Male
## 6 22.5 10 M02 Male
Now recreate a grouped data object it by running the command
Ort2 <- groupedData(formula = distance ~ age | Subject,
data = Ort, outer = ~Sex,
labels = list(x = "Age", y = "Distance from pituitary to pterygomaxillary fissure"),
units = list(x = "[yr]", y = "[mm]"))
str(Ort2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 108 obs. of 4 variables:
## $ distance: num 26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
## $ age : num 8 10 12 14 8 10 12 14 8 10 ...
## $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language distance ~ age | Subject
## .. ..- attr(*, ".Environment")=<environment: 0x906d1a8>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Age"
## ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
## - attr(*, "units")=List of 2
## ..$ x: chr "[yr]"
## ..$ y: chr "[mm]"
## - attr(*, "outer")=Class 'formula' language ~Sex
## .. ..- attr(*, ".Environment")=<environment: 0x906d1a8>
## - attr(*, "FUN")=function (x)
## - attr(*, "order.groups")= logi TRUE
head(Ort2)
## Grouped Data: distance ~ age | Subject
## <environment: 0x906d1a8>
## distance age Subject Sex
## 1 26.0 8 M01 Male
## 2 25.0 10 M01 Male
## 3 29.0 12 M01 Male
## 4 31.0 14 M01 Male
## 5 21.5 8 M02 Male
## 6 22.5 10 M02 Male
This call to the groupedData
function specifies the following:
formula
defines the grouping structure, the response variable and the primary covariate (usually time in case of longitudinal data). The groups are specified by listing the variable Subject
after the vertical bar at the end of the formula. The primary covariate in this case is age.outer
defines additional outer covariates of interest. Outer covariate is a covariate measured at the group level (all measurements within the same group share the same value of the outer covariate). [The opposite: inner covariate.]labels
and units
arguments specify the labels and units for the response (component y
) and the primary covariate (component x
). These arguments are optional.See also help(groupedData)
. Functions available in library("nlme")
are able to use the structure specified by the groupedData
object.
The standard R functions for descriptive analysis may not be useful for describing grouped data because they do not take into account the group structure. Problems may be caused, e.g., by the repetition of the values of outer variables on each subject. For example:
summary(Ort)
## distance age Subject Sex
## Min. :16.50 Min. : 8.0 M16 : 4 Male :64
## 1st Qu.:22.00 1st Qu.: 9.5 M05 : 4 Female:44
## Median :23.75 Median :11.0 M02 : 4
## Mean :24.02 Mean :11.0 M11 : 4
## 3rd Qu.:26.00 3rd Qu.:12.5 M07 : 4
## Max. :31.50 Max. :14.0 M08 : 4
## (Other):84
Instead of using summary
, one can use gsummary
on groupedData
object to see summaries of variables by group. Continuous variables are summarized by mean, which can be changed by providing the argument FUN
.
gsummary(Ort2)
## distance age Subject Sex
## M16 23.000 11 M16 Male
## M05 23.000 11 M05 Male
## M02 23.375 11 M02 Male
## M11 23.625 11 M11 Male
## M07 23.750 11 M07 Male
## M08 23.875 11 M08 Male
## M03 24.250 11 M03 Male
## M12 24.250 11 M12 Male
## M13 24.250 11 M13 Male
## M14 24.875 11 M14 Male
## M09 25.125 11 M09 Male
## M15 25.875 11 M15 Male
## M06 26.375 11 M06 Male
## M04 26.625 11 M04 Male
## M01 27.750 11 M01 Male
## M10 29.500 11 M10 Male
## F10 18.500 11 F10 Female
## F09 21.125 11 F09 Female
## F06 21.125 11 F06 Female
## F01 21.375 11 F01 Female
## F05 22.625 11 F05 Female
## F07 23.000 11 F07 Female
## F02 23.000 11 F02 Female
## F08 23.375 11 F08 Female
## F03 23.750 11 F03 Female
## F04 24.875 11 F04 Female
## F11 26.375 11 F11 Female
gsummary(Ort2, FUN = max)
## distance age Subject Sex
## M16 25.0 14 M16 Male
## M05 26.0 14 M05 Male
## M02 26.5 14 M02 Male
## M11 25.0 14 M11 Male
## M07 26.5 14 M07 Male
## M08 25.5 14 M08 Male
## M03 27.5 14 M03 Male
## M12 28.0 14 M12 Male
## M13 29.5 14 M13 Male
## M14 26.0 14 M14 Male
## M09 31.0 14 M09 Male
## M15 30.0 14 M15 Male
## M06 28.5 14 M06 Male
## M04 27.5 14 M04 Male
## M01 31.0 14 M01 Male
## M10 31.5 14 M10 Male
## F10 19.5 14 F10 Female
## F09 22.0 14 F09 Female
## F06 22.5 14 F06 Female
## F01 23.0 14 F01 Female
## F05 23.5 14 F05 Female
## F07 25.0 14 F07 Female
## F02 25.5 14 F02 Female
## F08 24.0 14 F08 Female
## F03 26.0 14 F03 Female
## F04 26.5 14 F04 Female
## F11 28.0 14 F11 Female
This output can be further summarized across all groups by the function summary
:
summary(gsummary(Ort2))
## distance age Subject Sex
## Min. :18.50 Min. :11 M16 : 1 Male :16
## 1st Qu.:23.00 1st Qu.:11 M05 : 1 Female:11
## Median :23.75 Median :11 M02 : 1
## Mean :24.02 Mean :11 M11 : 1
## 3rd Qu.:25.00 3rd Qu.:11 M07 : 1
## Max. :29.50 Max. :11 M08 : 1
## (Other):21
Compare this to the output of
summary(Ort2)
## distance age Subject Sex
## Min. :16.50 Min. : 8.0 M16 : 4 Male :64
## 1st Qu.:22.00 1st Qu.: 9.5 M05 : 4 Female:44
## Median :23.75 Median :11.0 M02 : 4
## Mean :24.02 Mean :11.0 M11 : 4
## 3rd Qu.:26.00 3rd Qu.:12.5 M07 : 4
## Max. :31.50 Max. :14.0 M08 : 4
## (Other):84
The package nlme
adds methods for creating subject-level plots of groupedData
objects.
plot(Ort2)
Note that such plot is only useful if the number of subjects is not too high. With a high number of subjects, user-tailored plots must be used! Another strategy is to select (randomly) a reasonable number of subjects and then to create plots for only this subset.
A two-panel plot by gender is obtained by calling
plot(Ort2, outer = TRUE, aspect = "fill")
In case there are more subject-specific covariates in the data, only some of them can be selected to split the plot:
plot(Ort2, outer = ~Sex, aspect = "fill")
For more details, see help(plot.nfnGroupedData)
.
It is not appropriate to analyze longitudinal and in general grouped data by a linear model using least squares. Nevertheless, we will still try this model to compare the results with the analysis by the linear mixed model we will consider later. Fitting a classical linear model can (sometimes!!!) be useful in an exploratory part of the analysis. A classical linear model is useful exploratory tool in two ways:
Using ordinary least squares (OLS) applied to the full data set while ignoring dependencies. This basically means fitting a marginal model in which we (mostly incorrectly) assume homoscedastic and independent errors. This is especially useful to decide (at least preliminary) on the fixed effects
part of the model based on residual plots used for a classical linear model. Nevertheless, do not forget that formal tests are in general invalid due to correlated errors which are also not necessarily homoscedastic.
Using OLS applied separately to each subject. This is especially useful to decide (again at least preliminary) on the random effects
part of the model. When doing so, we can
Use residuals from point 1 as outcome (= detrended observations);
Assume (and estimate) common residual variance for all subjects (which is basically achieved by considering Subject
as a categorical covariate (factor
) in the linear model).
In case there are no time-dependent regressors except time and no or only few (and categorical) outer
regressors, steps 1 and 2 can be done almost together.
To obtain a meaningful interpretation of the intercept, we subtract 8 (the age at the initial measurement) from age. Then the intercept estimates the expected value at the start of the study.
Let us first fit a classical linear model while assuming a linear trend over time with possibly different parameters of the line for male and female.
fitlm <- lm(distance ~ I(age - 8) * Sex, data = Ort2)
summary(fitlm)
##
## Call:
## lm(formula = distance ~ I(age - 8) * Sex, data = Ort2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6156 -1.3219 -0.1682 1.3299 5.2469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.6156 0.4721 47.907 < 2e-16 ***
## I(age - 8) 0.7844 0.1262 6.217 1.07e-08 ***
## SexFemale -1.4065 0.7396 -1.902 0.060 .
## I(age - 8):SexFemale -0.3048 0.1977 -1.542 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.257 on 104 degrees of freedom
## Multiple R-squared: 0.4227, Adjusted R-squared: 0.4061
## F-statistic: 25.39 on 3 and 104 DF, p-value: 2.108e-12
Classical basic residual plots:
par(mfrow = c(2, 2))
plot(fitlm, pch = 23, col = "darkblue", bg = "skyblue")
par(mfrow = c(1, 1))
Especially from the first plot we can conclude that the linear trend with parameters of the lines depending on Sex
is a reasonable model for the marginal evolution of the response, i.e., for the fixed effects
part of the subsequent linear mixed model.
One more classical plot (mainly to see how the trends for male and female differ):
COL <- c("darkblue", "red")
BG <- c("skyblue", "pink")
PCH <- c(24, 25)
names(COL) <- names(BG) <- names(PCH) <- c("Male", "Female")
plot(distance ~ age, col = COL[Sex], pch = PCH[Sex], bg = BG[Sex], data = Ort2)
abline(lm(distance ~ age, data = subset(Ort2, Sex == "Female")), col="red", lwd = 2)
abline(lm(distance ~ age, data = subset(Ort2, Sex == "Male")), col = "blue", lwd = 2)
However, the residuals for measurements taken on the same subject are correlated, as the following plot testifies.
bwplot(Ort2$Subject ~ residuals(fitlm))
But we should not be surprised by that…
Let us now explore subject-specific evolutions of the response over time. In our dataset, we have just one outer
covariate (Sex
) which, moreover, has only two levels. Hence it is not really necessary to detrend the outcomes before fitting subject-specific models. We can easily take Sex
into account at later stage as will be shown.
We will now consider fitting subject-specific regression lines to these data. At the same time, we will assume the same residual variance for all subjects. The subject-specific regression lines are obtained from the function lmList
.
fitslm <- lmList(Ort2)
The information on how to identify the subjects and what models to fit is extracted from the groupedData
structure of Ort2
. The structure of the model can also be specified explicitely (and must be specified in this way if a simpler linear model is to be fitted than that implied by the groupedData
structure). The lmList
function can also be used with data stored in a standard data.frame
:
fitslm <- lmList(distance ~ age | Subject, data = Ort)
The estimated subject-specific coefficients (intercepts and slopes) can be summarized and the residuals plotted easily.
summary(fitslm)
## Call:
## Model: distance ~ age | Subject
## Data: Ort
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## M16 16.95 3.288173 5.1548379 3.695247e-06
## M05 13.65 3.288173 4.1512411 1.181678e-04
## M02 14.85 3.288173 4.5161854 3.458934e-05
## M11 20.05 3.288173 6.0976106 1.188838e-07
## M07 14.95 3.288173 4.5465974 3.116705e-05
## M08 19.75 3.288173 6.0063745 1.665712e-07
## M03 16.00 3.288173 4.8659237 1.028488e-05
## M12 13.25 3.288173 4.0295930 1.762580e-04
## M13 2.80 3.288173 0.8515366 3.982319e-01
## M14 19.10 3.288173 5.8086964 3.449588e-07
## M09 14.40 3.288173 4.3793313 5.509579e-05
## M15 13.50 3.288173 4.1056231 1.373664e-04
## M06 18.95 3.288173 5.7630783 4.078189e-07
## M04 24.70 3.288173 7.5117696 6.081644e-10
## M01 17.30 3.288173 5.2612799 2.523621e-06
## M10 21.25 3.288173 6.4625549 3.065505e-08
## F10 13.55 3.288173 4.1208291 1.306536e-04
## F09 18.10 3.288173 5.5045761 1.047769e-06
## F06 17.00 3.288173 5.1700439 3.499774e-06
## F01 17.25 3.288173 5.2460739 2.665260e-06
## F05 19.60 3.288173 5.9607565 1.971127e-07
## F07 16.95 3.288173 5.1548379 3.695247e-06
## F02 14.20 3.288173 4.3185072 6.763806e-05
## F08 21.45 3.288173 6.5233789 2.443813e-08
## F03 14.40 3.288173 4.3793313 5.509579e-05
## F04 19.65 3.288173 5.9759625 1.863600e-07
## F11 18.95 3.288173 5.7630783 4.078189e-07
## age
## Estimate Std. Error t value Pr(>|t|)
## M16 0.550 0.2929338 1.8775576 6.584707e-02
## M05 0.850 0.2929338 2.9016799 5.361639e-03
## M02 0.775 0.2929338 2.6456493 1.065760e-02
## M11 0.325 0.2929338 1.1094659 2.721458e-01
## M07 0.800 0.2929338 2.7309929 8.511442e-03
## M08 0.375 0.2929338 1.2801529 2.059634e-01
## M03 0.750 0.2929338 2.5603058 1.328807e-02
## M12 1.000 0.2929338 3.4137411 1.222240e-03
## M13 1.950 0.2929338 6.6567951 1.485652e-08
## M14 0.525 0.2929338 1.7922141 7.870160e-02
## M09 0.975 0.2929338 3.3283976 1.577941e-03
## M15 1.125 0.2929338 3.8404587 3.247135e-04
## M06 0.675 0.2929338 2.3042752 2.508117e-02
## M04 0.175 0.2929338 0.5974047 5.527342e-01
## M01 0.950 0.2929338 3.2430540 2.030113e-03
## M10 0.750 0.2929338 2.5603058 1.328807e-02
## F10 0.450 0.2929338 1.5361835 1.303325e-01
## F09 0.275 0.2929338 0.9387788 3.520246e-01
## F06 0.375 0.2929338 1.2801529 2.059634e-01
## F01 0.375 0.2929338 1.2801529 2.059634e-01
## F05 0.275 0.2929338 0.9387788 3.520246e-01
## F07 0.550 0.2929338 1.8775576 6.584707e-02
## F02 0.800 0.2929338 2.7309929 8.511442e-03
## F08 0.175 0.2929338 0.5974047 5.527342e-01
## F03 0.850 0.2929338 2.9016799 5.361639e-03
## F04 0.475 0.2929338 1.6215270 1.107298e-01
## F11 0.675 0.2929338 2.3042752 2.508117e-02
##
## Residual standard error: 1.31004 on 54 degrees of freedom
plot(fitslm)
However, we would rather adjust the intercept to the expected distance at the age 8, so we redefine the fitted model as follows:
fitslm2 <- lmList(distance ~ I(age - 8), data = Ort2)
or
fitslm2 <- lmList(distance ~ I(age - 8) | Subject, data = Ort)
summary(fitslm2)
## Call:
## Model: distance ~ I(age - 8) | Subject
## Data: Ort
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## M16 21.35 1.096058 19.47890 4.359278e-26
## M05 20.45 1.096058 18.65778 3.319674e-25
## M02 21.05 1.096058 19.20519 8.514661e-26
## M11 22.65 1.096058 20.66497 2.596148e-27
## M07 21.35 1.096058 19.47890 4.359278e-26
## M08 22.75 1.096058 20.75621 2.100698e-27
## M03 22.00 1.096058 20.07194 1.046935e-26
## M12 21.25 1.096058 19.38766 5.444880e-26
## M13 18.40 1.096058 16.78744 4.366662e-23
## M14 23.30 1.096058 21.25800 6.639625e-28
## M09 22.20 1.096058 20.25441 6.794018e-27
## M15 22.50 1.096058 20.52812 3.571686e-27
## M06 24.35 1.096058 22.21598 7.809554e-29
## M04 26.10 1.096058 23.81261 2.589060e-30
## M01 24.90 1.096058 22.71778 2.621421e-29
## M10 27.25 1.096058 24.86183 3.054764e-31
## F10 17.15 1.096058 15.64699 1.031736e-21
## F09 20.30 1.096058 18.52092 4.686286e-25
## F06 20.00 1.096058 18.24721 9.391685e-25
## F01 20.25 1.096058 18.47530 5.259209e-25
## F05 21.80 1.096058 19.88946 1.618187e-26
## F07 21.35 1.096058 19.47890 4.359278e-26
## F02 20.60 1.096058 18.79463 2.355967e-25
## F08 22.85 1.096058 20.84744 1.701036e-27
## F03 21.20 1.096058 19.34205 6.087007e-26
## F04 23.45 1.096058 21.39486 4.867852e-28
## F11 24.35 1.096058 22.21598 7.809554e-29
## I(age - 8)
## Estimate Std. Error t value Pr(>|t|)
## M16 0.550 0.2929338 1.8775576 6.584707e-02
## M05 0.850 0.2929338 2.9016799 5.361639e-03
## M02 0.775 0.2929338 2.6456493 1.065760e-02
## M11 0.325 0.2929338 1.1094659 2.721458e-01
## M07 0.800 0.2929338 2.7309929 8.511442e-03
## M08 0.375 0.2929338 1.2801529 2.059634e-01
## M03 0.750 0.2929338 2.5603058 1.328807e-02
## M12 1.000 0.2929338 3.4137411 1.222240e-03
## M13 1.950 0.2929338 6.6567951 1.485652e-08
## M14 0.525 0.2929338 1.7922141 7.870160e-02
## M09 0.975 0.2929338 3.3283976 1.577941e-03
## M15 1.125 0.2929338 3.8404587 3.247135e-04
## M06 0.675 0.2929338 2.3042752 2.508117e-02
## M04 0.175 0.2929338 0.5974047 5.527342e-01
## M01 0.950 0.2929338 3.2430540 2.030113e-03
## M10 0.750 0.2929338 2.5603058 1.328807e-02
## F10 0.450 0.2929338 1.5361835 1.303325e-01
## F09 0.275 0.2929338 0.9387788 3.520246e-01
## F06 0.375 0.2929338 1.2801529 2.059634e-01
## F01 0.375 0.2929338 1.2801529 2.059634e-01
## F05 0.275 0.2929338 0.9387788 3.520246e-01
## F07 0.550 0.2929338 1.8775576 6.584707e-02
## F02 0.800 0.2929338 2.7309929 8.511442e-03
## F08 0.175 0.2929338 0.5974047 5.527342e-01
## F03 0.850 0.2929338 2.9016799 5.361639e-03
## F04 0.475 0.2929338 1.6215270 1.107298e-01
## F11 0.675 0.2929338 2.3042752 2.508117e-02
##
## Residual standard error: 1.31004 on 54 degrees of freedom
Let us now plot the estimated intercepts and slopes against each other.
pairs(fitslm2, id = 0.01, adj = -0.5)
The arguments id
and adj
are used to identify outliers. The subject M13 had an unusually large slope (fast skull growth). The intercepts and slopes, however, do not seem to be highly correlated. This would be different if we did not shift age by 8 years:
pairs(fitslm, id = 0.01, adj = -0.5)
This plot shows a relatively large correlation between the intercepts and slopes.
Next, we will calculate and plot confidence intervals for the subject-specific intercepts and slopes.
plot(intervals(fitslm2))
This plot tells us better how much the intercepts and slopes vary between individuals. It seems that the variability in the slopes is not as large as that in the intercepts.
Even better insight into the structure of the subject-specific coefficients is obtained if we use detrended outcomes. In our case, we may want to adjust the outcomes for Sex
. This means, we take residuals from the overall OLS as outcomes when fitting the subject-specific models.
Ort <- transform(Ort, resdistance = residuals(fitlm))
resfitslm <- lmList(resdistance ~ age | Subject, data = Ort)
resfitslm2 <- lmList(resdistance ~ I(age - 8) | Subject, data = Ort)
plot(resfitslm)
pairs(resfitslm, id = 0.01, adj = -0.5)