Základy regrese | NMFM 334

Letný semester 2024 | Cvičenie 7 | 09.04.2024
HTML Markdown – Rmd source file (UTF encoding)

Outline of the lab session no.6:

Statistical inference in a normal linear model
Confidence intervals, confidence regions, statistical tests
Prediction interval
Different bounds for and along the linear regression line

We will consider a simple linear regression model (modelling the amount of yield given the information about the concentration of magnesium – Mg). For some model improvements we will use the logarithm transformation of the independent covariate (Mg).

 library("mffSM")
 library("splines")       # B-splines modelling
 library("MASS")          # function stdres for standardized residuals
 
 ### Working data 
 data(Dris, package = "mffSM")
 help("Dris")
 
 head(Dris)

 ##   yield   N  P   K Ca Mg
 ## 1  5.47 470 47 320 47 11
 ## 2  5.63 530 48 357 60 16
 ## 3  5.63 530 48 310 63 16
 ## 4  4.84 482 47 357 47 13
 ## 5  4.84 506 48 294 52 15
 ## 6  4.21 500 45 283 61 14

 summary(Dris)

 ##      yield             N               P               K        
 ##  Min.   :1.920   Min.   :268.0   Min.   :32.00   Min.   :200.0  
 ##  1st Qu.:4.040   1st Qu.:427.0   1st Qu.:43.00   1st Qu.:338.0  
 ##  Median :4.840   Median :473.0   Median :49.00   Median :377.0  
 ##  Mean   :4.864   Mean   :469.9   Mean   :48.65   Mean   :375.4  
 ##  3rd Qu.:5.558   3rd Qu.:518.0   3rd Qu.:54.00   3rd Qu.:407.0  
 ##  Max.   :8.792   Max.   :657.0   Max.   :72.00   Max.   :580.0  
 ##        Ca              Mg       
 ##  Min.   :29.00   Min.   : 8.00  
 ##  1st Qu.:41.75   1st Qu.:10.00  
 ##  Median :49.00   Median :11.00  
 ##  Mean   :51.45   Mean   :11.64  
 ##  3rd Qu.:59.00   3rd Qu.:13.00  
 ##  Max.   :95.00   Max.   :22.00

 ### a simple linear model fitted in the last session
 aLogMg <- lm(yield ~ log2(Mg), data = Dris)
 
 summary(aLogMg)

 ## 
 ## Call:
 ## lm(formula = yield ~ log2(Mg), data = Dris)
 ## 
 ## Residuals:
 ##     Min      1Q  Median      3Q     Max 
 ## -3.1194 -0.7412 -0.0741  0.7451  3.9841 
 ## 
 ## Coefficients:
 ##             Estimate Std. Error t value Pr(>|t|)    
 ## (Intercept)   1.4851     0.7790   1.907   0.0574 .  
 ## log2(Mg)      0.9605     0.2208   4.349 1.77e-05 ***
 ## ---
 ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 ## 
 ## Residual standard error: 1.07 on 366 degrees of freedom
 ## Multiple R-squared:  0.04915,   Adjusted R-squared:  0.04655 
 ## F-statistic: 18.92 on 1 and 366 DF,  p-value: 1.772e-05

We will use the model above and we will construct three different types of the confidence intervals/bounds for the model:

Pointwise confidence intervals (region/band) around the regression line
Confidence region (band) FOR the (whole) regression line simultaneously
Prediction interval (for some new given value of \(\boldsymbol{X}\))

1. Pointwise confidence intervals (region/band) around the regression line<

Firstly, use the help session in R to find out, what is the purpose of the predict command? Try the following:

 help(predict)
 help(predict.lm)

Indidividual work

How can we manually reconstruct the plots above?

What is the difference between the outputs from “predict” and “fitted”? A few values are shown in the output below.

data.frame(Predict = predict(aLogMg), Fitted = fitted(aLogMg))[1:10,]

##     Predict   Fitted
## 1  4.807918 4.807918
## 2  5.327136 5.327136
## 3  5.327136 5.327136
## 4  5.039407 5.039407
## 5  5.237704 5.237704
## 6  5.142100 5.142100
## 7  5.142100 5.142100
## 8  5.327136 5.327136
## 9  5.327136 5.327136
## 10 5.142100 5.142100

all.equal(predict(aLogMg), fitted(aLogMg))

## [1] TRUE

Now, we will use the same command, however, for some new value of \(X\) (concentration of magnesium)

predict(aLogMg, newdata = data.frame(Mg = 10))

##        1 
## 4.675846

(point.fit<-aLogMg$coef[1]+aLogMg$coef[2]*log2(10))

## (Intercept) 
##    4.675846

It can be even obtained with the corresponding standard error

predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)

## $fit
##        1 
## 4.675846 
## 
## $se.fit
## [1] 0.07063962
## 
## $df
## [1] 366
## 
## $residual.scale
## [1] 1.069745

The standard error can be also reconstructed manually

lvec <- c(1, log2(10))
(se.fit <- c(sqrt(t(lvec) %*% vcov(aLogMg) %*% lvec)))

## [1] 0.07063962

all.equal(predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)$se.fit,
         se.fit)

## [1] TRUE

# Recall - residual degrees of freedom, i.e. $df
aLogMg$df.residual

## [1] 366

# sqrt(MS_e) = estimate of sigma, i.e. $residual scale
sqrt(deviance(aLogMg)/aLogMg$df.residual)

## [1] 1.069745

summary(aLogMg)$sigma

## [1] 1.069745

Alternativelly, the output can be also obtained from the LSest() function in the R library mffSM

LSest(aLogMg, L = c(1, log2(10)))

##   Estimate Std. Error  t value    P value    Lower    Upper
## 1 4.675846 0.07063962 66.19296 < 2.22e-16 4.536935 4.814756

Whan we are interested in the estimate of the conditional mean \(E[Y | \boldsymbol{X} = \boldsymbol{x}]\) we can either provide the point estimate or the interval estimate instead. For the interval estimate, we need the confidence level and the estimate of the variance parameter.

It can be given automatically as follows;

predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence", 
       level = 0.95)

##        fit      lwr      upr
## 1 4.675846 4.536935 4.814756

Or, alternatively, we can reconstruct everything manually:

### Manual computation of the confidence interval 
crossprod(c(1, log2(10)), coef(aLogMg))   ## What is this?

##          [,1]
## [1,] 4.675846

as.numeric(crossprod(c(1, log2(10)), coef(aLogMg))) + 
 c(-1, 1) * se.fit * qt(0.975, df = aLogMg[["df.residual"]])

## [1] 4.536935 4.814756

### We can calculate a sequence of confidence intervals for E(Y|x) 
### for different values of x
predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", 
       level = 0.95)

##         fit      lwr      upr
## 1  4.675846 4.536935 4.814756
## 2  4.807918 4.695321 4.920516
## 3  4.928491 4.815074 5.041908
## 4  5.039407 4.904194 5.174620
## 5  5.142100 4.975415 5.308785
## 6  5.237704 5.036446 5.438962
## 7  5.327136 5.090943 5.563328
## 8  5.411144 5.140735 5.681553
## 9  5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168

predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", 
       level = 0.95, se.fit = TRUE)

## $fit
##         fit      lwr      upr
## 1  4.675846 4.536935 4.814756
## 2  4.807918 4.695321 4.920516
## 3  4.928491 4.815074 5.041908
## 4  5.039407 4.904194 5.174620
## 5  5.142100 4.975415 5.308785
## 6  5.237704 5.036446 5.438962
## 7  5.327136 5.090943 5.563328
## 8  5.411144 5.140735 5.681553
## 9  5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168
## 
## $se.fit
##          1          2          3          4          5          6          7 
## 0.07063962 0.05725882 0.05767573 0.06875935 0.08476368 0.10234493 0.12011023 
##          8          9         10         11 
## 0.13751003 0.15433283 0.17050848 0.18602860 
## 
## $df
## [1] 366
## 
## $residual.scale
## [1] 1.069745

### A denser grid: preparation of a grid of points for the pointwise confidence 
### band around the regression line
x.Mg <- seq(min(Dris[, "Mg"]) - 1, max(Dris[, "Mg"]) + 1, length = 101)

### Calculate the confidence intervals
ciEY.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "confidence", 
                      level = 0.95)

Observations together with the estimated regression line and a sequence of confidence intervals for \(E[Y|\boldsymbol{X} = \boldsymbol{x}]\) are visualized as a band – confidence region (band) AROUND the regression line.

plot(yield ~ log2(Mg), pch = 23, col = "orange", bg = "beige", 
    xlab = "Log2(Mg)", ylab = "Yield", data = Dris)
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)

### Where is the band narrowest? 
lines(rep(mean(log2(Dris$Mg)), 2), c(min(Dris$yield), max(Dris$yield)))
lines(c(min(log2(Dris$Mg)), max(log2(Dris$Mg))), rep(mean(Dris$yield), 2))

Everything can be also plotted with respect to the original scale of the magnesium concentration:

plot(yield ~ Mg, pch = 23, col = "orange", bg = "beige", xlab = "Mg", 
    ylab = "Yield", data = Dris)
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)

### What about now?
lines(rep(mean(Dris$Mg), 2), c(min(Dris$yield), max(Dris$yield)))
lines(c(min(Dris$Mg), max(Dris$Mg)), rep(mean(Dris$yield), 2))

2. Confidence region (band) FOR the (whole) regression line simultaneously

The second option provides a confidence band for the whole regression line simultaneously. Note, that there is diffferent interpretation of the point-wise band constructed above (where each interval can be interpreted only individually, for some given concentration of magnesium – generally, for some given value of the explanatory vector \(\boldsymbol{X} = \boldsymbol{x}\)).

Some theoretical details towards the band FOR the whole regression line simultaneously were given during the lecture, nevertheless, the code provided below can be considered as purely illustrative. It is only important to understand the difference between point-wise band “AROUND” the regression line constructed in 1). and the simultaneous band “FOR” the whole regression line at once.

ciEY2.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), se.fit = TRUE)
forEY.aLogMg <- data.frame(fit = ciEY2.aLogMg[["fit"]],
         lwr = ciEY2.aLogMg[["fit"]] - 
                  ciEY2.aLogMg[["se.fit"]] * 
                    sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2),
         upr = ciEY2.aLogMg[["fit"]] + 
                  ciEY2.aLogMg[["se.fit"]] * 
                    sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2))
head(forEY.aLogMg)

##        fit      lwr      upr
## 1 4.181597 3.772108 4.591086
## 2 4.212914 3.820056 4.605772
## 3 4.243538 3.866851 4.620226
## 4 4.273501 3.912532 4.634470
## 5 4.302829 3.957135 4.648524
## 6 4.331550 4.000692 4.662407

The computation of the confidence band is based on the confidence ellipse for the regression parameters (beta_0, beta_1)

confidenceEllipse(aLogMg)

It can be visualized on the logarithm scale

plot(yield ~ log2(Mg), pch = 23, col = "orange", bg = "beige", xlab = "Log(Mg)",
    ylab = "Yield", data = Dris)
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
### Band AROUND
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### Band FOR
lines(log2(x.Mg), forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend(4, 8.5, legend = c("For", "Around"), col = c("blue2", "red3"), 
      lty = c(4, 2), lwd = 2)

or the original scale of the logarithm concentration:

plot(yield ~ Mg, pch = 23, col = "orange", bg = "beige", xlab = "Mg", 
    ylab = "Yield", data = Dris)
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
### Band AROUND
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### Band FOR
lines(x.Mg, forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend(18, 8.5, legend = c("For", "Around"), col = c("blue2", "red3"), 
      lty = c(4, 2), lwd = 2)

3. Prediction interval (for some new given value of \(\boldsymbol{X}\))

Finally, we will compare both confidence bands constructed above with one more confidence band – the prediction band (which will be again interpreted in a point-wise manner as the first confidence band around the regression line).

The idea is to construct an interval which covers—with probability \(1 - \alpha\) the NEW value of the response \(Y\) obtained for \(\boldsymbol{X}\) being set to \(\boldsymbol{x}\).

Compare the following two commands:

predict(aLogMg, newdata = data.frame(Mg = 10), interval = "prediction")

##        fit      lwr      upr
## 1 4.675846 2.567646 6.784046

predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence")

##        fit      lwr      upr
## 1 4.675846 4.536935 4.814756

Manual calculation of the prediction interval can follow the lines below:

s = summary(aLogMg)$sigma
se.fit2 = sqrt(s^2+(se.fit)^2)
as.numeric(crossprod(c(1, log2(10)), coef(aLogMg))) + 
 c(-1, 1) * se.fit2 * qt(0.975, df = aLogMg[["df.residual"]])

## [1] 2.567646 6.784046

### Prediction band - a sequence of prediction intervals for a dense grid of 
### x-values
pred.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "prediction", 
                      level = 0.95)
head(pred.aLogMg)

##        fit      lwr      upr
## 1 4.181597 2.052618 6.310576
## 2 4.212914 2.085942 6.339886
## 3 4.243538 2.118440 6.368637
## 4 4.273501 2.150150 6.396852
## 5 4.302829 2.181106 6.424553
## 6 4.331550 2.211341 6.451759

Finally, we can visualize the result – using the logarithm scale

plot(yield ~ log2(Mg), pch = 23, col = "orange", bg = "beige", xlab = "Log(Mg)",
    ylab = "Yield", data = Dris)
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
### Band AROUND the regression line
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### Prediction band
lines(log2(x.Mg), pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topleft", legend = c("Prediction", "Around the regression line"), 
      col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

or using the original scale of the magnesium concentration:

plot(yield ~ Mg, pch = 23, col = "orange", bg = "beige", xlab = "Mg", 
    ylab = "Yield", data = Dris)
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
### Band AROUND the regression line
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### Prediction band
lines(x.Mg, pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("bottomright", legend = c("Prediction", "Around the regression line"), 
      col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

Indidividual work

How can we manually reconstruct the plots above?

Recall the theoretical foundations of all three confidence bands constructed above. What are the main principial differences?
What are particular utilization purposees for each of the confidence bands constructed above? What type inferential tool do they offer?