Základy regrese | NMFM 334

Letný semester 2024 | Cvičenie 6 | 02.04.2024
HTML Markdown – Rmd source file (UTF encoding)

Outline of the lab session no.5:

Model assumptions and diagnostic tools
Graphical diagnostic tools

Firstly, some necessary R libraries and the working dataset:

 library("mffSM")
 library("splines")       # B-splines modelling
 library("MASS")          # function stdres for standardized residuals
 
 ### Working data 
 data(Dris, package = "mffSM")
 help("Dris")
 
 head(Dris)

 ##   yield   N  P   K Ca Mg
 ## 1  5.47 470 47 320 47 11
 ## 2  5.63 530 48 357 60 16
 ## 3  5.63 530 48 310 63 16
 ## 4  4.84 482 47 357 47 13
 ## 5  4.84 506 48 294 52 15
 ## 6  4.21 500 45 283 61 14

 summary(Dris)

 ##      yield             N               P               K        
 ##  Min.   :1.920   Min.   :268.0   Min.   :32.00   Min.   :200.0  
 ##  1st Qu.:4.040   1st Qu.:427.0   1st Qu.:43.00   1st Qu.:338.0  
 ##  Median :4.840   Median :473.0   Median :49.00   Median :377.0  
 ##  Mean   :4.864   Mean   :469.9   Mean   :48.65   Mean   :375.4  
 ##  3rd Qu.:5.558   3rd Qu.:518.0   3rd Qu.:54.00   3rd Qu.:407.0  
 ##  Max.   :8.792   Max.   :657.0   Max.   :72.00   Max.   :580.0  
 ##        Ca              Mg       
 ##  Min.   :29.00   Min.   : 8.00  
 ##  1st Qu.:41.75   1st Qu.:10.00  
 ##  Median :49.00   Median :11.00  
 ##  Mean   :51.45   Mean   :11.64  
 ##  3rd Qu.:59.00   3rd Qu.:13.00  
 ##  Max.   :95.00   Max.   :22.00

1. Basic model diagnostics

We will use the following model at the beginning:

 aLogMg <- lm(yield ~ log2(Mg), data = Dris)
 
 summary(aLogMg)

 ## 
 ## Call:
 ## lm(formula = yield ~ log2(Mg), data = Dris)
 ## 
 ## Residuals:
 ##     Min      1Q  Median      3Q     Max 
 ## -3.1194 -0.7412 -0.0741  0.7451  3.9841 
 ## 
 ## Coefficients:
 ##             Estimate Std. Error t value Pr(>|t|)    
 ## (Intercept)   1.4851     0.7790   1.907   0.0574 .  
 ## log2(Mg)      0.9605     0.2208   4.349 1.77e-05 ***
 ## ---
 ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 ## 
 ## Residual standard error: 1.07 on 366 degrees of freedom
 ## Multiple R-squared:  0.04915,   Adjusted R-squared:  0.04655 
 ## F-statistic: 18.92 on 1 and 366 DF,  p-value: 1.772e-05

We will use the model above and we will closely inspect the model residuals. Three specific plots will be of particular interest:

Raw residuals $u_1, \dots, u_n$, that are supposed to satisfy:
- $E[u_i | \boldsymbol{X}_i] = 0$
- $Var(u_i|\boldsymbol{X}_i) = \sigma^2 m_{ii}$, where $m_ii$ is a diagonal element of the matrix $\mathbb{M} = (\mathbb{i} - \mathbb{H})$
- in a normal linear model, it also holds that $u_i$ follow the normal distribution, however, with different variance for different $i \in \{1, \dots, n\}$
Standardized residuals $v_i = u_i / \sqrt{MSe \times m_Pii{}}$, for $i = 1, \dots, n$ that satify
- $E[v_i | \boldsymbol{X}_i] = 0$
- $Var(v_i | \boldsymbol{X}_i) = 1$
- however, the standardized residuals are not normal – not even under the normal linear model

plotLM(aLogMg)

Indidividual work

How can we manually reconstruct the plots above?

First plot:

plot(aLogMg, which = 1)

### manual computation
points(fitted(aLogMg),residuals(aLogMg),lwd=2,col="blue4",bg="skyblue",pch=21)
lines(lowess(fitted(aLogMg),residuals(aLogMg)),col=2,lwd=3)

Second plot:

plot(aLogMg, which = 2, ask = FALSE)

### standardized residuals
H = qr.Q(aLogMg[["qr"]]) %*% t(qr.Q(aLogMg[["qr"]]))
M = diag(nrow(H))-H
stdres = residuals(aLogMg)/sqrt(deviance(aLogMg)/aLogMg$df.residual*diag(M))   
qq = qqnorm(stdres,plot.it=FALSE)
points(qq$x, qq$y,lwd=2,col="blue4",bg="skyblue",pch=21)
all.equal(qq$y,stdres)

## [1] TRUE

qqline(stdres,lwd=2,lty=3)

Third plot

plot(aLogMg, which = 3, ask = FALSE)  

### manual computation
points(fitted(aLogMg),sqrt(abs(stdres)),lwd=2,col="blue4",bg="skyblue",pch=21)
lines(lowess(fitted(aLogMg),sqrt(abs(stdres))),col=2,lwd=3)

Note, that there are actually infinitely many possibilities how to check that $E(u_i| \boldsymbol{X}_i) = 0$ (a scatterplot of fitted values vs. residuals is only one of them, other common possibility is to plot the covariate values vs. the residuals.

plot(log2(Dris$Mg), residuals(aLogMg), xlab=expression(log[2](Mg)), 
    ylab="Residuals", main="Residuals vs. covariate", pch = 21, col = "blue4", 
    bg = "skyblue");
lines(lowess(residuals(aLogMg)~log2(Dris$Mg)), col="red")

Also check the folowing and explain the results:

sum(aLogMg$residuals)

## [1] -5.08274e-14

sum(aLogMg$residuals*log2(Dris$Mg))

## [1] -2.429376e-13

2. Polynomials and splines as regressors

For some example of possible model improvements, we will consider a different dataset now and we will fit a series of different model that we will mutually compare.

A dataset providing a series of (indepeendent) measurements of head acceleration in a simulated motorcycle accident (used to test crash helmets) is given in the datafile Motocycle (within the R packages mffSM).

data(Motorcycle, package = "mffSM")
help(Motorcycle)

head(Motorcycle)

##   time haccel
## 1  2.4    0.0
## 2  2.6   -1.3
## 3  3.2   -2.7
## 4  3.6    0.0
## 5  4.0   -2.7
## 6  6.2   -2.7

dim(Motorcycle)

## [1] 133   2

summary(Motorcycle)

##       time           haccel       
##  Min.   : 2.40   Min.   :-134.00  
##  1st Qu.:15.80   1st Qu.: -54.90  
##  Median :24.00   Median : -13.30  
##  Mean   :25.55   Mean   : -25.55  
##  3rd Qu.:35.20   3rd Qu.:   0.00  
##  Max.   :65.60   Max.   :  75.00

We are primarily interested in how the head acceleration (covariate haccel) depends on the time after the accident (given in miliseconds – the timecovaraite).

BTY <- "o";
par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
    xlab = "Time [ms]", ylab = "Head acceleration [g]")

fit0 = lm(haccel~time, data=Motorcycle)
abline(fit0)

Using the diagnostic plots, we can clearly see that the model denoted as fit0 is hardly appropriate:

plotLM(fit0)

As far as there is a quite flexible shape observed in the plot we need some more flexibile model to capture the structure of the underlying conditional expectation of the acceleraton given the time. We will use polynomials for fiting the model (while still staying within the linear regression framework).

### Fitting polynomial of various degrees
fit6 <- lm(haccel~poly(time, 6, raw=TRUE), data=Motorcycle)  
fit12 <- lm(haccel~poly(time, 12, raw=TRUE), data=Motorcycle)

The model can be visualized as follows:

xgrid <- seq(0, 60, length = 500)

par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
    xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(fit6, newdata=data.frame(time=xgrid)), lwd=2, col="blue")
lines(xgrid, predict(fit12, newdata=data.frame(time=xgrid)), lwd=2, 
     col="darkgreen")
legend(37,-70, lty=rep(1, 2), col=c("blue", "darkgreen"), 
      legend=paste("p=", c(6,12),sep=""), lwd=c(2,2), cex=1.2)

Classical polynomials are, however, not computationally very effective. Especially for higher order of the polynomial approximation the solution tends to be unstable.

One possible solution how to introduce some reasonable stability into the model where some higher order polynomials are used, are splines.

### Choice of knots 
knots <- c(0, 11, 12, 13, 20, 30, 32, 34, 40, 50, 60) 
(inner <- knots[-c(1, length(knots))])  # inner knots

## [1] 11 12 13 20 30 32 34 40 50

(bound <- knots[c(1, length(knots))])   # boundary knots

## [1]  0 60

### B-splines of degree 3
DEGREE <- 3; # piecewise cubic 
Bgrid <- bs(xgrid, knots = inner, Boundary.knots = bound, degree = DEGREE, 
           intercept = TRUE)
dim(Bgrid)   # 13-dimensional linear space of piecewise polynomials

## [1] 500  13

COL <- rainbow_hcl(ncol(Bgrid), c = 80)
par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(xgrid, Bgrid[,1], type = "l", col = COL[1], lwd = 2, xlab = "x", 
    ylab = "B(x)", xlim = range(knots), ylim = range(Bgrid, na.rm = TRUE))
for (j in 2:ncol(Bgrid)) lines(xgrid, Bgrid[,j], col = COL[j], lwd = 2)
points(knots, rep(0, length(knots)), pch = 21, bg = "blue", col = "skyblue", 
      cex = 1.5)

Thus, we can used B-splines to fit the model.
More details about the B-splines can be found, for instance, here http://www.chebfun.org/examples/approx/BSplineConv.html

m1 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
                    degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
names(m1$coefficients) <- paste("B", 1:13)
summary(m1)

## 
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
##     degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.857 -11.740  -0.268  10.356  55.249 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## B 1   -11.614     83.252  -0.140    0.889    
## B 2    12.450     81.282   0.153    0.879    
## B 3   -13.988     45.345  -0.308    0.758    
## B 4     2.984     18.894   0.158    0.875    
## B 5     6.115     12.139   0.504    0.615    
## B 6  -237.283     18.850 -12.588  < 2e-16 ***
## B 7    17.346     14.094   1.231    0.221    
## B 8    53.248     12.635   4.214 4.88e-05 ***
## B 9     5.102     13.051   0.391    0.697    
## B 10   12.609     17.146   0.735    0.464    
## B 11  -21.738     23.825  -0.912    0.363    
## B 12   10.879     16.284   0.668    0.505    
## B 13    7.885     17.624   0.447    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared:  0.8447,    Adjusted R-squared:  0.8278 
## F-statistic: 50.19 on 13 and 120 DF,  p-value: < 2.2e-16

Note, that as the intercept is not formally included in in the model (because of the -1 in the model formula), neither ‘Multiple R-squared’ nor the ‘F-statistic’ quantity in the output are meaningful numbers.

### Indidividual work Can you explain why the values reported for $R^2$ and $R_{adj}^2$ are not appropriate? What is the right meaning of this numbers? Can you manually reconstruct them?

The $R^2$ value from the model is

summary(m1)$r.squared

## [1] 0.8446597

and the manual calculations will give

1 - deviance(m1)/sum(Motorcycle$haccel^2)

## [1] 0.8446597

sum(fitted(m1)^2)/sum(Motorcycle$haccel^2)

## [1] 0.8446597

Interpretation the F-test from the summary output above is a test of a submodel

summary(m1)$fstatistic

##     value     numdf     dendf 
##  50.19211  13.00000 120.00000

# the usual F-test is the following
m2 <- lm(haccel ~ 1, data = Motorcycle);
anova(m2,m1)

## Analysis of Variance Table
## 
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = TRUE) - 1
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    132 308223                                 
## 2    120  61362 12    246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# in a model without intercept it is this one
(m0 <- lm(haccel ~ -1, data=Motorcycle))

## 
## Call:
## lm(formula = haccel ~ -1, data = Motorcycle)
## 
## No coefficients

anova(m0,m1)

## Analysis of Variance Table
## 
## Model 1: haccel ~ -1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = TRUE) - 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    133 395017                                  
## 2    120  61362 13    333655 50.192 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nevertheless, as the intercept paramter is in the linear span of the columns of $, it makes sense to calculcate the (usual) multiple $R^2$ value. This can do it ‘manually’ as follows:
```
var(fitted(m1))/var(Motorcycle$haccel)
```
```
## [1] 0.8009163
```
```
cor(Motorcycle$haccel, fitted(m1))^2
```
```
## [1] 0.8009163
```
```
1 - deviance(m1)/deviance(m2)
```
```
## [1] 0.8009163
```

Alternatively one can use B-splines (the basis without intercept) with an intercept in the model to get some meaningful numbers in the output

m0 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
                    degree = DEGREE, intercept = FALSE), data = Motorcycle)
names(m0$coefficients) <- c("(Intercept)", paste("B", 1:12))
summary(m0)

## 
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
##     degree = DEGREE, intercept = FALSE), data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.857 -11.740  -0.268  10.356  55.249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -11.614     83.252  -0.140   0.8893  
## B 1           24.064    161.970   0.149   0.8821  
## B 2           -2.374     57.835  -0.041   0.9673  
## B 3           14.598     93.327   0.156   0.8760  
## B 4           17.729     80.496   0.220   0.8261  
## B 5         -225.669     88.698  -2.544   0.0122 *
## B 6           28.960     82.780   0.350   0.7271  
## B 7           64.862     84.773   0.765   0.4457  
## B 8           16.716     83.997   0.199   0.8426  
## B 9           24.223     85.172   0.284   0.7766  
## B 10         -10.124     86.455  -0.117   0.9070  
## B 11          22.493     84.873   0.265   0.7915  
## B 12          19.498     85.159   0.229   0.8193  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared:  0.8009,    Adjusted R-squared:  0.781 
## F-statistic: 40.23 on 12 and 120 DF,  p-value: < 2.2e-16

anova(m2,m0)

## Analysis of Variance Table
## 
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = FALSE)
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    132 308223                                 
## 2    120  61362 12    246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about the interpretation of the classical diagnostic plots for the model fitted with B-slines?

plotLM(m1)