Download this R markdown as: R, Rmd.
Outline of this lab session:
anova, drop1, add1,
vif.library("mffSM") # Cars2004nh
library("colorspace") # rainbow_hcl() colors
library("psych") # pairs.panels()
library("car") # vif()
We will work again with the dataset peat and
Cars2004nh from previous exercises.
peat <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/peat.csv",
header = TRUE, stringsAsFactors = TRUE)
data(Cars2004nh, package = "mffSM")
Given response \(\mathbf{Y}\) and available set of covariates \(\mathbb{Z}\) we consider two different model matrices \(\mathbb{X}^{0}\) and \(\mathbb{X}^1\) of ranks \(r_0\) and \(r_1\). These correspond to two different models: \[ \mathrm{M}_0: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right), \qquad \mathrm{M}_1: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^1 \boldsymbol{\beta}^1, \, \sigma_1^2 \mathbb{I}_n\right) \] Model \(\mathrm{M}_0\) is considered to be a submodel of model \(\mathrm{M}_1\) if the linear subspace of \(\mathbb{R}^n\) generated by the columns of \(\mathbb{X}^0\) (\(\mathcal{M}(\mathbb{X}^0)\)) is a subspace of the space generated by the columns of \(\mathbb{X}^1\). In particular, \(\mathcal{M}(\mathbb{X}^0) \subset \mathcal{M}(\mathbb{X}^1)\).
Let us label the quantities computed from models \(\mathrm{M}_0\) and \(\mathrm{M}_1\) with a subscript or superscript \(0\) or \(1\), respectively. If submodel \(\mathrm{M}_0\) holds, then
If additionally, a normal linear model is assumed, \(\mathbb{Y} \mid \mathbb{Z} \sim \mathsf{N}\left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right)\), then
\[ F = \frac{\frac{\mathsf{RSS}_0 - \mathsf{RSS}_1}{r_1 - r_0}}{\frac{\mathsf{RSS}_1}{n - r_1}} \sim \mathsf{F}_{r_1 - r_0, n - r_1}. \]
This result can used for testing the following statistical hypotheses: \[\begin{align*} \mathrm{H}_0:& \text{ model } \mathrm{M}_0 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^0), \\ \mathrm{H}_1:& \text{ model } \mathrm{M}_1 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^1) \setminus \mathcal{M}(\mathbb{X}^0). \end{align*}\] This aims at answering the following questions:
\(F\)-test on submodel on significance level \(\alpha\):
Consider the peat data from previous exercises.

We will compute the \(F\)-test for comparison of the additive (parallel lines) vs. interaction (group-specific lines) model. Clearly, additive model is a submodel of the interaction model.
fit_add <- lm(N ~ depth + group, data = peat)
fit_int <- lm(N ~ depth * group, data = peat)
(RSS0 <- sum(residuals(fit_add)^2))
## [1] 5.567
(RSS1 <- sum(residuals(fit_int)^2))
## [1] 2.568
(D2 <- RSS0 - RSS1)
## [1] 2.999
(df_dif <- fit_add$df.residual - fit_int$df.residual)
## [1] 3
(Fstat <- D2/df_dif / (RSS1/fit_int$df.residual))
## [1] 54.48
(Fcritical <- qf(0.95, df1 = df_dif, df2 = fit_int$df.residual))
## [1] 2.669
(rejectH0 <- (Fstat > Fcritical))
## [1] TRUE
(pval <- pf(Fstat, df1 = df_dif, df2 = fit_int$df.residual, lower.tail = FALSE))
## [1] 2.129e-23
Beyond any doubt we proved that the interaction model significantly
improves the additive model. The submodel \(F\)-test is already implemented by
anova function.
anova(fit_add, fit_int)
## Analysis of Variance Table
##
## Model 1: N ~ depth + group
## Model 2: N ~ depth * group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 143 5.57
## 2 140 2.57 3 3 54.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This function also excepts also a single model and provides intermediate results for model construction. In this case it sequentially compares the following models:
N ~ 1 (intercept-only model),N ~ depth (adds one coefficient),N ~ depth + group (adds three coefficients),N ~ depth + group + depth:group (adds three
coefficients).anova(fit_int)
## Analysis of Variance Table
##
## Response: N
## Df Sum Sq Mean Sq F value Pr(>F)
## depth 1 0.63 0.626 34.1 3.4e-08 ***
## group 3 3.34 1.113 60.7 < 2e-16 ***
## depth:group 3 3.00 1.000 54.5 < 2e-16 ***
## Residuals 140 2.57 0.018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the tests provided by anova this way depend
on the order of the terms.
Also note that when compared to the intercept-only model we get the
\(F\)-test provided in the basic
summary of the model:
summary(fit_int)
##
## Call:
## lm(formula = N ~ depth * group, data = peat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3683 -0.0847 -0.0079 0.0611 0.3771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.84696 0.05119 16.54 < 2e-16 ***
## depth 0.00505 0.00572 0.88 0.38
## groupCB-VJJ 0.32211 0.06442 5.00 1.7e-06 ***
## groupVJJ -0.06404 0.07537 -0.85 0.40
## groupVJJ-CB 0.07591 0.06442 1.18 0.24
## depth:groupCB-VJJ -0.03260 0.00739 -4.41 2.0e-05 ***
## depth:groupVJJ 0.04394 0.00831 5.29 4.7e-07 ***
## depth:groupVJJ-CB 0.04159 0.00739 5.63 9.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.135 on 140 degrees of freedom
## Multiple R-squared: 0.731, Adjusted R-squared: 0.717
## F-statistic: 54.2 on 7 and 140 DF, p-value: <2e-16
anova(lm(N ~ 1, data = peat), fit_int)
## Analysis of Variance Table
##
## Model 1: N ~ 1
## Model 2: N ~ depth * group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 9.53
## 2 140 2.57 7 6.96 54.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the rest of the exercise we will work with Cars2004nh
dataset from mffSM package. It contains information about
$n=$425 new cars of the year 2004. We will work with only a subset of
variables suitable for predicting the consumption of the vehicle:
y <- "consumption"
x <- c("ftype", "fdrive", "engine.size", "ncylinder", "horsepower", "weight", "length", "width", "wheel.base")
head(Cars2004nh[, c(y, x)])
## consumption ftype fdrive engine.size ncylinder horsepower weight length width wheel.base
## 1 7.65 personal front 1.6 4 103 1075 424 168 249
## 2 7.65 personal front 1.6 4 103 1065 389 168 249
## 3 7.70 personal front 2.2 4 140 1187 465 175 264
## 4 7.70 personal front 2.2 4 140 1214 465 173 264
## 5 7.70 personal front 2.2 4 140 1187 465 175 264
## 6 7.30 personal front 2.0 4 132 1171 442 170 267
summary(Cars2004nh[, c(y, x)])
## consumption ftype fdrive engine.size ncylinder horsepower weight
## Min. : 5.65 personal:242 front:223 Min. :1.30 Min. :-1.00 Min. :100 Min. : 923
## 1st Qu.: 9.65 wagon : 30 rear :110 1st Qu.:2.40 1st Qu.: 4.00 1st Qu.:165 1st Qu.:1412
## Median :10.70 SUV : 60 4x4 : 92 Median :3.00 Median : 6.00 Median :210 Median :1577
## Mean :10.75 pickup : 24 Mean :3.21 Mean : 5.79 Mean :217 Mean :1626
## 3rd Qu.:11.65 sport : 49 3rd Qu.:3.90 3rd Qu.: 6.00 3rd Qu.:255 3rd Qu.:1804
## Max. :21.55 minivan : 20 Max. :8.30 Max. :12.00 Max. :500 Max. :3261
## NA's :14 NA's :2
## length width wheel.base
## Min. :363 Min. :163 Min. :226
## 1st Qu.:450 1st Qu.:175 1st Qu.:262
## Median :472 Median :180 Median :272
## Mean :471 Mean :181 Mean :275
## 3rd Qu.:490 3rd Qu.:185 3rd Qu.:284
## Max. :577 Max. :206 Max. :366
## NA's :26 NA's :28 NA's :2
# psych::pairs.panels(Cars2004nh[, c(y,"weight", "iweight", "lweight")])
# keep only
data <- na.omit(Cars2004nh[, c(y, x)])
# exploration
table(data$ncylinder)
##
## 4 5 6 8 12
## 121 7 179 75 2
data$fcyl <- cut(data$ncylinder, breaks = c(4, 6, 8, 12), right = FALSE, include.lowest = TRUE)
summary(data$fcyl)
## [4,6) [6,8) [8,12]
## 128 179 77
data$ftype <- factor(data$ftype, levels = setdiff(levels(data$ftype), "pickup")) # pickups not represented in data
# consumption vs categorical variables
xcat <- c("ftype", "fdrive", "fcyl")
par(mfrow = c(1, length(xcat)), mar = c(4,4,0.5,0.5))
for(x in xcat){
plot(as.formula(paste("consumption ~ ", x)), data = data)
}

# consumption vs numeric variables
xnum <- c("engine.size", "horsepower", "weight", "length", "width", "wheel.base")
psych::pairs.panels(data[, c(y,xnum)])
The ultimate goal is to find as simple as possible (for interpretation)
model for explaining consumption of a vehicle given other
characteristics. First, we fit an additive model with 3 categorical and
6 numerical regressors.
fit_add <- lm(consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base, data = data)
summary(fit_add)
##
## Call:
## lm(formula = consumption ~ ftype + fdrive + fcyl + engine.size +
## horsepower + weight + length + width + wheel.base, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.029 -0.465 -0.035 0.433 3.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.860512 1.326066 7.44 7.3e-13 ***
## ftypewagon 0.004821 0.145516 0.03 0.9736
## ftypeSUV 0.844818 0.164007 5.15 4.2e-07 ***
## ftypesport 0.112602 0.175649 0.64 0.5219
## ftypeminivan 0.200317 0.214683 0.93 0.3514
## fdriverear 0.177462 0.117085 1.52 0.1305
## fdrive4x4 0.239530 0.118021 2.03 0.0431 *
## fcyl[6,8) 0.281606 0.141894 1.98 0.0479 *
## fcyl[8,12] -0.004098 0.274750 -0.01 0.9881
## engine.size 0.470259 0.117939 3.99 8.1e-05 ***
## horsepower 0.004388 0.001038 4.23 3.0e-05 ***
## weight 0.004572 0.000348 13.12 < 2e-16 ***
## length -0.005002 0.002774 -1.80 0.0722 .
## width -0.011962 0.009478 -1.26 0.2077
## wheel.base -0.017872 0.005462 -3.27 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.716 on 369 degrees of freedom
## Multiple R-squared: 0.884, Adjusted R-squared: 0.88
## F-statistic: 202 on 14 and 369 DF, p-value: <2e-16
The model could be sequentially built by adding one variable at a
time, which can be monitored by anova:
anova(fit_add)
## Analysis of Variance Table
##
## Response: consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## ftype 4 605 151.3 295.16 < 2e-16 ***
## fdrive 2 225 112.7 219.89 < 2e-16 ***
## fcyl 2 411 205.6 401.11 < 2e-16 ***
## engine.size 1 72 72.1 140.70 < 2e-16 ***
## horsepower 1 42 41.7 81.28 < 2e-16 ***
## weight 1 68 67.9 132.50 < 2e-16 ***
## length 1 18 18.1 35.23 6.8e-09 ***
## width 1 1 1.0 2.03 0.1554
## wheel.base 1 5 5.5 10.71 0.0012 **
## Residuals 369 189 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, as mentioned above the individual tests are heavily dependent on the order of the variables despite reaching the same model in the end:
anova(lm(consumption ~ width + wheel.base + length + ftype + engine.size + fcyl + horsepower + weight + fdrive, data = data))
## Analysis of Variance Table
##
## Response: consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## width 1 702 702 1369.70 < 2e-16 ***
## wheel.base 1 2 2 2.95 0.08682 .
## length 1 48 48 92.88 < 2e-16 ***
## ftype 4 273 68 133.02 < 2e-16 ***
## engine.size 1 248 248 483.75 < 2e-16 ***
## fcyl 2 9 4 8.61 0.00022 ***
## horsepower 1 51 51 98.58 < 2e-16 ***
## weight 1 114 114 222.76 < 2e-16 ***
## fdrive 2 3 1 2.58 0.07734 .
## Residuals 369 189 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The submodel test compares two models where both or just the larger ones need to hold. So when an important variable is missing the conclusions may be invalid.
Hence, we need a different kind of test to determine the significance
of a variable. The function drop1 compares the original
large model with many different submodels, which are created by
dropping one term without breaking the hierarchy:
drop1(fit_add, test = "F")
## Single term deletions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 189 -241.9
## ftype 4 15.1 204 -220.4 7.36 1.0e-05 ***
## fdrive 2 2.6 192 -240.5 2.58 0.0773 .
## fcyl 2 7.1 196 -231.7 6.94 0.0011 **
## engine.size 1 8.2 197 -227.7 15.90 8.1e-05 ***
## horsepower 1 9.2 198 -225.7 17.89 3.0e-05 ***
## weight 1 88.3 278 -96.8 172.21 < 2e-16 ***
## length 1 1.7 191 -240.5 3.25 0.0722 .
## width 1 0.8 190 -242.2 1.59 0.2077
## wheel.base 1 5.5 195 -232.9 10.71 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we know which variables are significant while adjusting for the
effect of other remaining variables. The insignificant ones seem to be
redundant, thus, we can remove them from the model by
update function. Now all remaining variables are
significant:
fit1 <- update(fit_add, .~. - fdrive - length - width)
anova(fit1, fit_add) # fit_add is
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 373 198
## 2 369 189 4 8.86 4.32 0.002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit1, test = "F")
## Single term deletions
##
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 198 -232.3
## ftype 4 21.8 220 -200.1 10.29 6.5e-08 ***
## fcyl 2 6.9 205 -223.1 6.54 0.0016 **
## engine.size 1 4.8 203 -225.0 9.13 0.0027 **
## horsepower 1 13.7 212 -208.7 25.76 6.1e-07 ***
## weight 1 114.7 313 -58.9 215.98 < 2e-16 ***
## wheel.base 1 26.3 224 -186.4 49.58 9.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, by removing multiple variables at once we might have
eliminated too many variables. It can happen that one of the eliminated
variables may still significantly improve a model. As we can see in
anova output above fit_add is a significant
improvement of fit1 despite all eliminated variables being
insignificant in fit_add. Similarly as we can check
dropping one variable, we can also inspect, how model improves upon
adding one more term with add1 function.
The scope parameter should be a vector of terms that we
wish to add to the current model. Or alternatively set
scope as a larger model, where only terms that do not
appear in fit1 will be added - one at a time:
# add1(fit1, scope = c("fdrive", "length", "width"), test = "F")
add1(fit1, scope = fit_add, test = "F")
## Single term additions
##
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 198 -232
## fdrive 2 4.82 193 -238 4.63 0.0104 *
## length 1 5.34 193 -241 10.31 0.0014 **
## width 1 3.22 195 -237 6.14 0.0136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems like adding length back could improve the
model:
fit2 <- update(fit1, .~. + length)
anova(fit2, fit_add) # fit2 comparable to fit_add
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 372 193
## 2 369 189 3 3.51 2.29 0.078 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit2, test = "F") # all variables are significant
## Single term deletions
##
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base + length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 193 -240.8
## ftype 4 14.7 207 -220.6 7.08 1.7e-05 ***
## fcyl 2 7.5 200 -230.2 7.21 0.00085 ***
## engine.size 1 7.4 200 -228.3 14.27 0.00018 ***
## horsepower 1 10.6 203 -222.3 20.41 8.4e-06 ***
## weight 1 118.0 311 -59.3 227.84 < 2e-16 ***
## wheel.base 1 6.1 199 -230.8 11.85 0.00064 ***
## length 1 5.3 198 -232.3 10.31 0.00144 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit2, scope = fit_add, test = "F") # none other can improve the model
## Single term additions
##
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base + length
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 193 -241
## fdrive 2 2.698 190 -242 2.63 0.074 .
## width 1 0.872 192 -240 1.69 0.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After adding just length we end with model
fit2 that
drop1 output),fdrive and width) do not
improve the model (\(p > 0.05\) in
add1 output).So far we have neglected an option for the effects of variables to be modified by other variables: different lines for different groups, etc. We will start with a model with all possible variables and eliminate the insignificant terms on our way to final model. Here, we will limit ourselves to two-way interactions that are usually sufficient for modelling:
fit_int <- lm(consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base)^2, data = data)
anova(fit_add, fit_int) # two-way interaction improve additive model
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 369 189
## 2 288 113 81 76.2 2.4 5.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(d1fit_int <- drop1(fit_int, test = "F")) # however, majority seem irrelevant in the full model
## Single term deletions
##
## Model:
## consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base)^2
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 113 -278
## ftype:fdrive 7 0.77 114 -289 0.28 0.961
## ftype:fcyl 7 1.92 115 -285 0.70 0.673
## ftype:engine.size 4 3.22 116 -275 2.05 0.087 .
## ftype:horsepower 4 1.15 114 -282 0.73 0.570
## ftype:weight 4 1.32 114 -281 0.84 0.498
## ftype:length 4 0.41 113 -284 0.26 0.903
## ftype:width 4 3.77 117 -273 2.40 0.050 .
## ftype:wheel.base 4 1.60 115 -280 1.02 0.399
## fdrive:fcyl 4 1.92 115 -279 1.22 0.302
## fdrive:engine.size 2 0.36 113 -280 0.45 0.636
## fdrive:horsepower 2 0.34 113 -280 0.43 0.651
## fdrive:weight 2 0.09 113 -281 0.12 0.890
## fdrive:length 2 1.20 114 -278 1.53 0.219
## fdrive:width 2 1.05 114 -278 1.34 0.263
## fdrive:wheel.base 2 1.81 115 -276 2.30 0.102
## fcyl:engine.size 2 1.89 115 -275 2.40 0.092 .
## fcyl:horsepower 2 2.27 115 -274 2.90 0.057 .
## fcyl:weight 2 0.78 114 -279 0.99 0.374
## fcyl:length 2 0.90 114 -279 1.15 0.318
## fcyl:width 2 2.09 115 -275 2.66 0.071 .
## fcyl:wheel.base 2 0.28 113 -281 0.35 0.703
## engine.size:horsepower 1 0.62 114 -278 1.57 0.211
## engine.size:weight 1 1.71 115 -274 4.37 0.038 *
## engine.size:length 1 0.04 113 -280 0.11 0.736
## engine.size:width 1 0.00 113 -280 0.01 0.922
## engine.size:wheel.base 1 0.08 113 -279 0.21 0.647
## horsepower:weight 1 0.06 113 -280 0.15 0.699
## horsepower:length 1 0.01 113 -280 0.03 0.858
## horsepower:width 1 0.69 114 -277 1.75 0.186
## horsepower:wheel.base 1 0.65 114 -278 1.66 0.198
## weight:length 1 1.13 114 -276 2.88 0.091 .
## weight:width 1 0.06 113 -280 0.16 0.690
## weight:wheel.base 1 0.00 113 -280 0.00 0.956
## length:width 1 0.00 113 -280 0.01 0.915
## length:wheel.base 1 0.14 113 -279 0.35 0.552
## width:wheel.base 1 0.66 114 -277 1.69 0.194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(a1fit_int <- add1(fit_add, scope = fit_int, test = "F")) # a dozen improves the additive model
## Single term additions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 189 -242
## ftype:fdrive 7 3.72 185 -236 1.04 0.40420
## ftype:fcyl 7 12.10 177 -253 3.53 0.00110 **
## ftype:engine.size 4 11.56 178 -258 5.94 0.00012 ***
## ftype:horsepower 4 3.33 186 -241 1.63 0.16501
## ftype:weight 4 5.42 184 -245 2.69 0.03088 *
## ftype:length 4 4.08 185 -242 2.01 0.09235 .
## ftype:width 4 3.04 186 -240 1.49 0.20408
## ftype:wheel.base 4 2.39 187 -239 1.17 0.32368
## fdrive:fcyl 4 5.89 183 -246 2.93 0.02085 *
## fdrive:engine.size 2 3.78 185 -246 3.74 0.02468 *
## fdrive:horsepower 2 1.53 188 -241 1.50 0.22539
## fdrive:weight 2 3.19 186 -244 3.14 0.04432 *
## fdrive:length 2 2.77 186 -244 2.73 0.06649 .
## fdrive:width 2 1.97 187 -242 1.93 0.14618
## fdrive:wheel.base 2 1.55 188 -241 1.51 0.22162
## fcyl:engine.size 2 5.12 184 -248 5.10 0.00653 **
## fcyl:horsepower 2 7.04 182 -252 7.09 0.00095 ***
## fcyl:weight 2 6.44 183 -251 6.46 0.00174 **
## fcyl:length 2 0.79 188 -240 0.77 0.46398
## fcyl:width 2 5.62 184 -250 5.62 0.00393 **
## fcyl:wheel.base 2 0.21 189 -238 0.21 0.81308
## engine.size:horsepower 1 0.62 189 -241 1.21 0.27261
## engine.size:weight 1 3.45 186 -247 6.85 0.00925 **
## engine.size:length 1 0.32 189 -240 0.63 0.42915
## engine.size:width 1 1.33 188 -243 2.60 0.10760
## engine.size:wheel.base 1 0.56 189 -241 1.10 0.29555
## horsepower:weight 1 0.02 189 -240 0.03 0.85696
## horsepower:length 1 2.70 186 -245 5.33 0.02149 *
## horsepower:width 1 0.76 188 -241 1.48 0.22458
## horsepower:wheel.base 1 0.02 189 -240 0.03 0.85637
## weight:length 1 0.59 189 -241 1.15 0.28358
## weight:width 1 1.12 188 -242 2.18 0.14030
## weight:wheel.base 1 0.31 189 -240 0.60 0.43840
## length:width 1 0.03 189 -240 0.06 0.80406
## length:wheel.base 1 0.07 189 -240 0.13 0.71576
## width:wheel.base 1 0.55 189 -241 1.08 0.29992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We start slowly by eliminating those interaction terms that have
\(p\)-value in drop1 lower
than \(0.5\), which is not a mistake,
we do not want to eliminate too much variables at the same time.
(insignif_d1 <- rownames(d1fit_int)[-1][d1fit_int$`Pr(>F)`[-1] > 0.5])
## [1] "ftype:fdrive" "ftype:fcyl" "ftype:horsepower" "ftype:length"
## [5] "fdrive:engine.size" "fdrive:horsepower" "fdrive:weight" "fcyl:wheel.base"
## [9] "engine.size:length" "engine.size:width" "engine.size:wheel.base" "horsepower:weight"
## [13] "horsepower:length" "weight:width" "weight:wheel.base" "length:width"
## [17] "length:wheel.base"
fit3 <- update(fit_int, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
(d1fit3 <- drop1(fit3, test = "F")) # many remaining are significant
## Single term deletions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl +
## fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size +
## fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width +
## engine.size:horsepower + engine.size:weight + horsepower:width +
## horsepower:wheel.base + weight:length + width:wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 125 -318
## ftype:engine.size 4 7.08 132 -305 4.64 0.00117 **
## ftype:weight 4 4.59 129 -312 3.01 0.01846 *
## ftype:width 4 9.08 134 -299 5.95 0.00012 ***
## ftype:wheel.base 4 5.42 130 -310 3.55 0.00746 **
## fdrive:fcyl 4 2.22 127 -319 1.45 0.21624
## fdrive:length 2 4.11 129 -309 5.39 0.00500 **
## fdrive:width 2 0.58 125 -320 0.76 0.46738
## fdrive:wheel.base 2 5.75 130 -305 7.54 0.00063 ***
## fcyl:engine.size 2 3.25 128 -312 4.26 0.01491 *
## fcyl:horsepower 2 2.19 127 -315 2.87 0.05822 .
## fcyl:weight 2 2.41 127 -315 3.15 0.04399 *
## fcyl:length 2 4.23 129 -309 5.55 0.00426 **
## fcyl:width 2 7.48 132 -300 9.81 7.3e-05 ***
## engine.size:horsepower 1 2.79 128 -311 7.32 0.00718 **
## engine.size:weight 1 3.12 128 -310 8.18 0.00452 **
## horsepower:width 1 4.04 129 -308 10.59 0.00125 **
## horsepower:wheel.base 1 0.03 125 -320 0.08 0.77590
## weight:length 1 4.34 129 -307 11.39 0.00083 ***
## width:wheel.base 1 5.52 130 -303 14.49 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit3, scope = fit_int, test = "F") # none should be added back
## Single term additions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl +
## fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size +
## fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width +
## engine.size:horsepower + engine.size:weight + horsepower:width +
## horsepower:wheel.base + weight:length + width:wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 125 -318
## ftype:fdrive 7 2.44 122 -312 0.91 0.497
## ftype:fcyl 7 2.56 122 -312 0.96 0.461
## ftype:horsepower 4 3.20 122 -320 2.13 0.077 .
## ftype:length 4 0.63 124 -312 0.41 0.803
## fdrive:engine.size 2 1.08 124 -317 1.42 0.244
## fdrive:horsepower 2 2.20 122 -321 2.92 0.056 .
## fdrive:weight 2 0.11 125 -314 0.14 0.867
## fcyl:wheel.base 2 0.97 124 -317 1.27 0.283
## engine.size:length 1 0.09 125 -316 0.25 0.619
## engine.size:width 1 0.07 125 -316 0.18 0.669
## engine.size:wheel.base 1 0.36 124 -317 0.94 0.332
## horsepower:weight 1 0.69 124 -318 1.82 0.178
## horsepower:length 1 0.02 125 -316 0.06 0.810
## weight:width 1 0.12 125 -316 0.31 0.579
## weight:wheel.base 1 0.00 125 -316 0.00 0.984
## length:width 1 0.04 125 -316 0.10 0.754
## length:wheel.base 1 0.73 124 -318 1.92 0.167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit3, fit_int) # comparable to all two-way interaction model
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl +
## fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size +
## fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width +
## engine.size:horsepower + engine.size:weight + horsepower:width +
## horsepower:wheel.base + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 327 125
## 2 288 113 39 11.7 0.76 0.85
anova(fit_add, fit3) # significantly improves additive model
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl +
## fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size +
## fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width +
## engine.size:horsepower + engine.size:weight + horsepower:width +
## horsepower:wheel.base + weight:length + width:wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 369 189
## 2 327 125 42 64.5 4.03 3.9e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we eliminate at \(0.05\) level
for \(p\)-value in
drop1:
(insignif_d1 <- rownames(d1fit3)[-1][d1fit3$`Pr(>F)`[-1] > 0.05])
## [1] "fdrive:fcyl" "fdrive:width" "fcyl:horsepower" "horsepower:wheel.base"
fit4 <- update(fit3, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
drop1(fit4, test = "F") # all remaining are significant
## Single term deletions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 129 -322
## ftype:engine.size 4 8.72 138 -305 5.67 0.00020 ***
## ftype:weight 4 4.95 134 -316 3.22 0.01300 *
## ftype:width 4 10.69 140 -300 6.95 2.2e-05 ***
## ftype:wheel.base 4 5.48 135 -314 3.56 0.00734 **
## fdrive:length 2 4.07 133 -314 5.29 0.00549 **
## fdrive:wheel.base 2 5.32 135 -311 6.91 0.00114 **
## fcyl:engine.size 2 2.65 132 -318 3.44 0.03325 *
## fcyl:weight 2 3.58 133 -316 4.66 0.01011 *
## fcyl:length 2 4.92 134 -312 6.39 0.00188 **
## fcyl:width 2 8.76 138 -301 11.39 1.6e-05 ***
## engine.size:horsepower 1 1.55 131 -320 4.02 0.04576 *
## engine.size:weight 1 3.49 133 -314 9.07 0.00280 **
## horsepower:width 1 6.56 136 -305 17.06 4.6e-05 ***
## weight:length 1 3.43 133 -314 8.92 0.00303 **
## width:wheel.base 1 5.74 135 -307 14.92 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit4, scope = fit_int, test = "F") # none interaction should be added back
## Single term additions
##
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 129 -322
## ftype:fdrive 7 1.39 128 -312 0.51 0.83
## ftype:fcyl 7 3.26 126 -318 1.22 0.29
## ftype:horsepower 4 1.94 127 -320 1.26 0.28
## ftype:length 4 0.78 128 -316 0.50 0.73
## fdrive:fcyl 4 1.91 127 -320 1.24 0.29
## fdrive:engine.size 2 0.16 129 -319 0.21 0.81
## fdrive:horsepower 2 0.33 129 -319 0.42 0.66
## fdrive:weight 2 0.17 129 -319 0.23 0.80
## fdrive:width 2 0.55 129 -320 0.72 0.49
## fcyl:horsepower 2 1.62 128 -323 2.12 0.12
## fcyl:wheel.base 2 1.19 128 -322 1.55 0.21
## engine.size:length 1 0.00 129 -320 0.00 1.00
## engine.size:width 1 0.00 129 -320 0.01 0.94
## engine.size:wheel.base 1 0.33 129 -321 0.86 0.35
## horsepower:weight 1 0.27 129 -321 0.71 0.40
## horsepower:length 1 0.07 129 -320 0.19 0.67
## horsepower:wheel.base 1 0.00 129 -320 0.00 0.95
## weight:width 1 0.11 129 -320 0.27 0.60
## weight:wheel.base 1 0.02 129 -320 0.05 0.83
## length:width 1 0.06 129 -320 0.14 0.71
## length:wheel.base 1 0.96 128 -323 2.52 0.11
anova(fit4, fit_int) # comparable to all two-way interaction model, spares 48 coefs
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 336 129
## 2 288 113 48 16.2 0.86 0.73
anova(fit_add, fit4) # significantly improves additive model
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 369 189
## 2 336 129 33 59.9 4.72 4.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit4, fit3) # comparable to fit3, spares 9 coefs
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl +
## fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size +
## fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width +
## engine.size:horsepower + engine.size:weight + horsepower:width +
## horsepower:wheel.base + weight:length + width:wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 336 129
## 2 327 125 9 4.56 1.33 0.22
The final model seems to be fit4 where every variable
appears in some interaction, even width and
fdrive which were excluded in model fit2 when
no interactions were considered:
anova(fit2, fit4)
## Analysis of Variance Table
##
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight +
## wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower +
## weight + length + width + wheel.base + ftype:engine.size +
## ftype:weight + ftype:width + ftype:wheel.base + fdrive:length +
## fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length +
## fcyl:width + engine.size:horsepower + engine.size:weight +
## horsepower:width + weight:length + width:wheel.base
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 372 193
## 2 336 129 36 63.4 4.58 2.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
General exceptions from eliminating variables:
Elimination despite significance:
Our model was fitted on many highly correlated variables, see the exploration above. We compute the so called variance inflation factor (VIF) that reveals which variables may cause inflation in the estimation of variance of coefficients. It compares the current estimates of variance with the variability of estimates under orthogonal (uncorrelated) regressors.
car::vif(fit4, type = "predictor")
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## ftype 1.233e+11 26 1.634 engine.size, weight, width, wheel.base
## fdrive 1.058e+07 8 2.748 length, wheel.base
## fcyl 1.809e+09 16 1.947 engine.size, weight, length, width
## engine.size 6.213e+27 23 4.020 ftype, fcyl, horsepower, weight
## horsepower 1.680e+07 5 5.279 engine.size, width
## weight 6.364e+24 25 3.134 ftype, fcyl, engine.size, length
## length 1.789e+18 13 5.035 fdrive, fcyl, weight
## width 1.709e+20 21 3.032 ftype, fcyl, horsepower, wheel.base
## wheel.base 4.004e+19 19 3.280 ftype, fdrive, width
## Other Predictors
## ftype fdrive, fcyl, horsepower, length
## fdrive ftype, fcyl, engine.size, horsepower, weight, width
## fcyl ftype, fdrive, horsepower, wheel.base
## engine.size fdrive, length, width, wheel.base
## horsepower ftype, fdrive, fcyl, weight, length, wheel.base
## weight fdrive, horsepower, width, wheel.base
## length ftype, engine.size, horsepower, width, wheel.base
## width fdrive, engine.size, weight, length
## wheel.base fcyl, engine.size, horsepower, weight, length