Download this R markdown as: R, Rmd.

Outline of this lab session:

Loading the data and libraries

library("mffSM")         # Cars2004nh
library("colorspace")    # rainbow_hcl() colors
library("psych")         # pairs.panels()
library("car")           # vif()  

We will work again with the dataset peat and Cars2004nh from previous exercises.

peat <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/peat.csv", 
                 header = TRUE, stringsAsFactors = TRUE)
data(Cars2004nh, package = "mffSM")

Theory

Given response \(\mathbf{Y}\) and available set of covariates \(\mathbb{Z}\) we consider two different model matrices \(\mathbb{X}^{0}\) and \(\mathbb{X}^1\) of ranks \(r_0\) and \(r_1\). These correspond to two different models: \[ \mathrm{M}_0: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right), \qquad \mathrm{M}_1: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^1 \boldsymbol{\beta}^1, \, \sigma_1^2 \mathbb{I}_n\right) \] Model \(\mathrm{M}_0\) is considered to be a submodel of model \(\mathrm{M}_1\) if the linear subspace of \(\mathbb{R}^n\) generated by the columns of \(\mathbb{X}^0\) (\(\mathcal{M}(\mathbb{X}^0)\)) is a subspace of the space generated by the columns of \(\mathbb{X}^1\). In particular, \(\mathcal{M}(\mathbb{X}^0) \subset \mathcal{M}(\mathbb{X}^1)\).

Let us label the quantities computed from models \(\mathrm{M}_0\) and \(\mathrm{M}_1\) with a subscript or superscript \(0\) or \(1\), respectively. If submodel \(\mathrm{M}_0\) holds, then

If additionally, a normal linear model is assumed, \(\mathbb{Y} \mid \mathbb{Z} \sim \mathsf{N}\left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right)\), then

\[ F = \frac{\frac{\mathsf{RSS}_0 - \mathsf{RSS}_1}{r_1 - r_0}}{\frac{\mathsf{RSS}_1}{n - r_1}} \sim \mathsf{F}_{r_1 - r_0, n - r_1}. \]

This result can used for testing the following statistical hypotheses: \[\begin{align*} \mathrm{H}_0:& \text{ model } \mathrm{M}_0 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^0), \\ \mathrm{H}_1:& \text{ model } \mathrm{M}_1 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^1) \setminus \mathcal{M}(\mathbb{X}^0). \end{align*}\] This aims at answering the following questions:

\(F\)-test on submodel on significance level \(\alpha\):

F-test on a submodel

Consider the peat data from previous exercises.

We will compute the \(F\)-test for comparison of the additive (parallel lines) vs. interaction (group-specific lines) model. Clearly, additive model is a submodel of the interaction model.

fit_add <- lm(N ~ depth + group, data = peat)
fit_int <- lm(N ~ depth * group, data = peat)

(RSS0 <- sum(residuals(fit_add)^2))
## [1] 5.567
(RSS1 <- sum(residuals(fit_int)^2))
## [1] 2.568
(D2 <- RSS0 - RSS1)
## [1] 2.999
(df_dif <- fit_add$df.residual - fit_int$df.residual)
## [1] 3
(Fstat <- D2/df_dif / (RSS1/fit_int$df.residual))
## [1] 54.48
(Fcritical <- qf(0.95, df1 = df_dif, df2 = fit_int$df.residual))
## [1] 2.669
(rejectH0 <- (Fstat > Fcritical))
## [1] TRUE
(pval <- pf(Fstat, df1 = df_dif, df2 = fit_int$df.residual, lower.tail = FALSE))
## [1] 2.129e-23

Beyond any doubt we proved that the interaction model significantly improves the additive model. The submodel \(F\)-test is already implemented by anova function.

anova(fit_add, fit_int)
## Analysis of Variance Table
## 
## Model 1: N ~ depth + group
## Model 2: N ~ depth * group
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)    
## 1    143 5.57                             
## 2    140 2.57  3         3 54.5 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This function also excepts also a single model and provides intermediate results for model construction. In this case it sequentially compares the following models:

anova(fit_int)
## Analysis of Variance Table
## 
## Response: N
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## depth         1   0.63   0.626    34.1 3.4e-08 ***
## group         3   3.34   1.113    60.7 < 2e-16 ***
## depth:group   3   3.00   1.000    54.5 < 2e-16 ***
## Residuals   140   2.57   0.018                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the tests provided by anova this way depend on the order of the terms.

Also note that when compared to the intercept-only model we get the \(F\)-test provided in the basic summary of the model:

summary(fit_int)
## 
## Call:
## lm(formula = N ~ depth * group, data = peat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3683 -0.0847 -0.0079  0.0611  0.3771 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.84696    0.05119   16.54  < 2e-16 ***
## depth              0.00505    0.00572    0.88     0.38    
## groupCB-VJJ        0.32211    0.06442    5.00  1.7e-06 ***
## groupVJJ          -0.06404    0.07537   -0.85     0.40    
## groupVJJ-CB        0.07591    0.06442    1.18     0.24    
## depth:groupCB-VJJ -0.03260    0.00739   -4.41  2.0e-05 ***
## depth:groupVJJ     0.04394    0.00831    5.29  4.7e-07 ***
## depth:groupVJJ-CB  0.04159    0.00739    5.63  9.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.135 on 140 degrees of freedom
## Multiple R-squared:  0.731,  Adjusted R-squared:  0.717 
## F-statistic: 54.2 on 7 and 140 DF,  p-value: <2e-16
anova(lm(N ~ 1, data = peat), fit_int)
## Analysis of Variance Table
## 
## Model 1: N ~ 1
## Model 2: N ~ depth * group
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)    
## 1    147 9.53                             
## 2    140 2.57  7      6.96 54.2 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building hierarchically well-structured model

In the rest of the exercise we will work with Cars2004nh dataset from mffSM package. It contains information about $n=$425 new cars of the year 2004. We will work with only a subset of variables suitable for predicting the consumption of the vehicle:

y <- "consumption"
x <- c("ftype", "fdrive", "engine.size", "ncylinder", "horsepower", "weight", "length", "width", "wheel.base")
head(Cars2004nh[, c(y, x)])
##   consumption    ftype fdrive engine.size ncylinder horsepower weight length width wheel.base
## 1        7.65 personal  front         1.6         4        103   1075    424   168        249
## 2        7.65 personal  front         1.6         4        103   1065    389   168        249
## 3        7.70 personal  front         2.2         4        140   1187    465   175        264
## 4        7.70 personal  front         2.2         4        140   1214    465   173        264
## 5        7.70 personal  front         2.2         4        140   1187    465   175        264
## 6        7.30 personal  front         2.0         4        132   1171    442   170        267
summary(Cars2004nh[, c(y, x)])
##   consumption         ftype       fdrive     engine.size     ncylinder       horsepower      weight    
##  Min.   : 5.65   personal:242   front:223   Min.   :1.30   Min.   :-1.00   Min.   :100   Min.   : 923  
##  1st Qu.: 9.65   wagon   : 30   rear :110   1st Qu.:2.40   1st Qu.: 4.00   1st Qu.:165   1st Qu.:1412  
##  Median :10.70   SUV     : 60   4x4  : 92   Median :3.00   Median : 6.00   Median :210   Median :1577  
##  Mean   :10.75   pickup  : 24               Mean   :3.21   Mean   : 5.79   Mean   :217   Mean   :1626  
##  3rd Qu.:11.65   sport   : 49               3rd Qu.:3.90   3rd Qu.: 6.00   3rd Qu.:255   3rd Qu.:1804  
##  Max.   :21.55   minivan : 20               Max.   :8.30   Max.   :12.00   Max.   :500   Max.   :3261  
##  NA's   :14                                                                              NA's   :2     
##      length        width       wheel.base 
##  Min.   :363   Min.   :163   Min.   :226  
##  1st Qu.:450   1st Qu.:175   1st Qu.:262  
##  Median :472   Median :180   Median :272  
##  Mean   :471   Mean   :181   Mean   :275  
##  3rd Qu.:490   3rd Qu.:185   3rd Qu.:284  
##  Max.   :577   Max.   :206   Max.   :366  
##  NA's   :26    NA's   :28    NA's   :2
# psych::pairs.panels(Cars2004nh[, c(y,"weight", "iweight", "lweight")])
# keep only 
data <- na.omit(Cars2004nh[, c(y, x)])

# exploration
table(data$ncylinder)
## 
##   4   5   6   8  12 
## 121   7 179  75   2
data$fcyl <- cut(data$ncylinder, breaks = c(4, 6, 8, 12), right = FALSE, include.lowest = TRUE)
summary(data$fcyl)
##  [4,6)  [6,8) [8,12] 
##    128    179     77
data$ftype <- factor(data$ftype, levels = setdiff(levels(data$ftype), "pickup")) # pickups not represented in data

# consumption vs categorical variables
xcat <- c("ftype", "fdrive", "fcyl")
par(mfrow = c(1, length(xcat)), mar = c(4,4,0.5,0.5))
for(x in xcat){
  plot(as.formula(paste("consumption ~ ", x)), data = data)
}

# consumption vs numeric variables
xnum <- c("engine.size", "horsepower", "weight", "length", "width", "wheel.base")
psych::pairs.panels(data[, c(y,xnum)])

The ultimate goal is to find as simple as possible (for interpretation) model for explaining consumption of a vehicle given other characteristics. First, we fit an additive model with 3 categorical and 6 numerical regressors.

fit_add <- lm(consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base, data = data)
summary(fit_add)
## 
## Call:
## lm(formula = consumption ~ ftype + fdrive + fcyl + engine.size + 
##     horsepower + weight + length + width + wheel.base, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.029 -0.465 -0.035  0.433  3.584 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.860512   1.326066    7.44  7.3e-13 ***
## ftypewagon    0.004821   0.145516    0.03   0.9736    
## ftypeSUV      0.844818   0.164007    5.15  4.2e-07 ***
## ftypesport    0.112602   0.175649    0.64   0.5219    
## ftypeminivan  0.200317   0.214683    0.93   0.3514    
## fdriverear    0.177462   0.117085    1.52   0.1305    
## fdrive4x4     0.239530   0.118021    2.03   0.0431 *  
## fcyl[6,8)     0.281606   0.141894    1.98   0.0479 *  
## fcyl[8,12]   -0.004098   0.274750   -0.01   0.9881    
## engine.size   0.470259   0.117939    3.99  8.1e-05 ***
## horsepower    0.004388   0.001038    4.23  3.0e-05 ***
## weight        0.004572   0.000348   13.12  < 2e-16 ***
## length       -0.005002   0.002774   -1.80   0.0722 .  
## width        -0.011962   0.009478   -1.26   0.2077    
## wheel.base   -0.017872   0.005462   -3.27   0.0012 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.716 on 369 degrees of freedom
## Multiple R-squared:  0.884,  Adjusted R-squared:  0.88 
## F-statistic:  202 on 14 and 369 DF,  p-value: <2e-16

The model could be sequentially built by adding one variable at a time, which can be monitored by anova:

anova(fit_add)
## Analysis of Variance Table
## 
## Response: consumption
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## ftype         4    605   151.3  295.16 < 2e-16 ***
## fdrive        2    225   112.7  219.89 < 2e-16 ***
## fcyl          2    411   205.6  401.11 < 2e-16 ***
## engine.size   1     72    72.1  140.70 < 2e-16 ***
## horsepower    1     42    41.7   81.28 < 2e-16 ***
## weight        1     68    67.9  132.50 < 2e-16 ***
## length        1     18    18.1   35.23 6.8e-09 ***
## width         1      1     1.0    2.03  0.1554    
## wheel.base    1      5     5.5   10.71  0.0012 ** 
## Residuals   369    189     0.5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, as mentioned above the individual tests are heavily dependent on the order of the variables despite reaching the same model in the end:

anova(lm(consumption ~ width + wheel.base + length + ftype + engine.size + fcyl + horsepower + weight + fdrive, data = data))
## Analysis of Variance Table
## 
## Response: consumption
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## width         1    702     702 1369.70 < 2e-16 ***
## wheel.base    1      2       2    2.95 0.08682 .  
## length        1     48      48   92.88 < 2e-16 ***
## ftype         4    273      68  133.02 < 2e-16 ***
## engine.size   1    248     248  483.75 < 2e-16 ***
## fcyl          2      9       4    8.61 0.00022 ***
## horsepower    1     51      51   98.58 < 2e-16 ***
## weight        1    114     114  222.76 < 2e-16 ***
## fdrive        2      3       1    2.58 0.07734 .  
## Residuals   369    189       1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The submodel test compares two models where both or just the larger ones need to hold. So when an important variable is missing the conclusions may be invalid.

Hence, we need a different kind of test to determine the significance of a variable. The function drop1 compares the original large model with many different submodels, which are created by dropping one term without breaking the hierarchy:

drop1(fit_add, test = "F")
## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##             Df Sum of Sq RSS    AIC F value  Pr(>F)    
## <none>                   189 -241.9                    
## ftype        4      15.1 204 -220.4    7.36 1.0e-05 ***
## fdrive       2       2.6 192 -240.5    2.58  0.0773 .  
## fcyl         2       7.1 196 -231.7    6.94  0.0011 ** 
## engine.size  1       8.2 197 -227.7   15.90 8.1e-05 ***
## horsepower   1       9.2 198 -225.7   17.89 3.0e-05 ***
## weight       1      88.3 278  -96.8  172.21 < 2e-16 ***
## length       1       1.7 191 -240.5    3.25  0.0722 .  
## width        1       0.8 190 -242.2    1.59  0.2077    
## wheel.base   1       5.5 195 -232.9   10.71  0.0012 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we know which variables are significant while adjusting for the effect of other remaining variables. The insignificant ones seem to be redundant, thus, we can remove them from the model by update function. Now all remaining variables are significant:

fit1 <- update(fit_add, .~. - fdrive - length - width)
anova(fit1, fit_add)      # fit_add is 
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##   Res.Df RSS Df Sum of Sq    F Pr(>F)   
## 1    373 198                            
## 2    369 189  4      8.86 4.32  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit1, test = "F")
## Single term deletions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
##             Df Sum of Sq RSS    AIC F value  Pr(>F)    
## <none>                   198 -232.3                    
## ftype        4      21.8 220 -200.1   10.29 6.5e-08 ***
## fcyl         2       6.9 205 -223.1    6.54  0.0016 ** 
## engine.size  1       4.8 203 -225.0    9.13  0.0027 ** 
## horsepower   1      13.7 212 -208.7   25.76 6.1e-07 ***
## weight       1     114.7 313  -58.9  215.98 < 2e-16 ***
## wheel.base   1      26.3 224 -186.4   49.58 9.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, by removing multiple variables at once we might have eliminated too many variables. It can happen that one of the eliminated variables may still significantly improve a model. As we can see in anova output above fit_add is a significant improvement of fit1 despite all eliminated variables being insignificant in fit_add. Similarly as we can check dropping one variable, we can also inspect, how model improves upon adding one more term with add1 function. The scope parameter should be a vector of terms that we wish to add to the current model. Or alternatively set scope as a larger model, where only terms that do not appear in fit1 will be added - one at a time:

# add1(fit1, scope = c("fdrive", "length", "width"), test = "F")
add1(fit1, scope = fit_add, test = "F")
## Single term additions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
##        Df Sum of Sq RSS  AIC F value Pr(>F)   
## <none>              198 -232                  
## fdrive  2      4.82 193 -238    4.63 0.0104 * 
## length  1      5.34 193 -241   10.31 0.0014 **
## width   1      3.22 195 -237    6.14 0.0136 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems like adding length back could improve the model:

fit2 <- update(fit1, .~. + length)
anova(fit2, fit_add)                    # fit2 comparable to fit_add
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##   Res.Df RSS Df Sum of Sq    F Pr(>F)  
## 1    372 193                           
## 2    369 189  3      3.51 2.29  0.078 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit2, test = "F")                 # all variables are significant
## Single term deletions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
##             Df Sum of Sq RSS    AIC F value  Pr(>F)    
## <none>                   193 -240.8                    
## ftype        4      14.7 207 -220.6    7.08 1.7e-05 ***
## fcyl         2       7.5 200 -230.2    7.21 0.00085 ***
## engine.size  1       7.4 200 -228.3   14.27 0.00018 ***
## horsepower   1      10.6 203 -222.3   20.41 8.4e-06 ***
## weight       1     118.0 311  -59.3  227.84 < 2e-16 ***
## wheel.base   1       6.1 199 -230.8   11.85 0.00064 ***
## length       1       5.3 198 -232.3   10.31 0.00144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit2, scope = fit_add, test = "F") # none other can improve the model
## Single term additions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
##        Df Sum of Sq RSS  AIC F value Pr(>F)  
## <none>              193 -241                 
## fdrive  2     2.698 190 -242    2.63  0.074 .
## width   1     0.872 192 -240    1.69  0.195  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After adding just length we end with model fit2 that

Considering two-way interactions

So far we have neglected an option for the effects of variables to be modified by other variables: different lines for different groups, etc. We will start with a model with all possible variables and eliminate the insignificant terms on our way to final model. Here, we will limit ourselves to two-way interactions that are usually sufficient for modelling:

fit_int <- lm(consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base)^2, data = data)
anova(fit_add, fit_int)                     # two-way interaction improve additive model
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df RSS Df Sum of Sq   F  Pr(>F)    
## 1    369 189                             
## 2    288 113 81      76.2 2.4 5.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(d1fit_int <- drop1(fit_int, test = "F"))   # however, majority seem irrelevant in the full model
## Single term deletions
## 
## Model:
## consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##                        Df Sum of Sq RSS  AIC F value Pr(>F)  
## <none>                              113 -278                 
## ftype:fdrive            7      0.77 114 -289    0.28  0.961  
## ftype:fcyl              7      1.92 115 -285    0.70  0.673  
## ftype:engine.size       4      3.22 116 -275    2.05  0.087 .
## ftype:horsepower        4      1.15 114 -282    0.73  0.570  
## ftype:weight            4      1.32 114 -281    0.84  0.498  
## ftype:length            4      0.41 113 -284    0.26  0.903  
## ftype:width             4      3.77 117 -273    2.40  0.050 .
## ftype:wheel.base        4      1.60 115 -280    1.02  0.399  
## fdrive:fcyl             4      1.92 115 -279    1.22  0.302  
## fdrive:engine.size      2      0.36 113 -280    0.45  0.636  
## fdrive:horsepower       2      0.34 113 -280    0.43  0.651  
## fdrive:weight           2      0.09 113 -281    0.12  0.890  
## fdrive:length           2      1.20 114 -278    1.53  0.219  
## fdrive:width            2      1.05 114 -278    1.34  0.263  
## fdrive:wheel.base       2      1.81 115 -276    2.30  0.102  
## fcyl:engine.size        2      1.89 115 -275    2.40  0.092 .
## fcyl:horsepower         2      2.27 115 -274    2.90  0.057 .
## fcyl:weight             2      0.78 114 -279    0.99  0.374  
## fcyl:length             2      0.90 114 -279    1.15  0.318  
## fcyl:width              2      2.09 115 -275    2.66  0.071 .
## fcyl:wheel.base         2      0.28 113 -281    0.35  0.703  
## engine.size:horsepower  1      0.62 114 -278    1.57  0.211  
## engine.size:weight      1      1.71 115 -274    4.37  0.038 *
## engine.size:length      1      0.04 113 -280    0.11  0.736  
## engine.size:width       1      0.00 113 -280    0.01  0.922  
## engine.size:wheel.base  1      0.08 113 -279    0.21  0.647  
## horsepower:weight       1      0.06 113 -280    0.15  0.699  
## horsepower:length       1      0.01 113 -280    0.03  0.858  
## horsepower:width        1      0.69 114 -277    1.75  0.186  
## horsepower:wheel.base   1      0.65 114 -278    1.66  0.198  
## weight:length           1      1.13 114 -276    2.88  0.091 .
## weight:width            1      0.06 113 -280    0.16  0.690  
## weight:wheel.base       1      0.00 113 -280    0.00  0.956  
## length:width            1      0.00 113 -280    0.01  0.915  
## length:wheel.base       1      0.14 113 -279    0.35  0.552  
## width:wheel.base        1      0.66 114 -277    1.69  0.194  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(a1fit_int <- add1(fit_add, scope = fit_int, test = "F"))  # a dozen improves the additive model
## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##                        Df Sum of Sq RSS  AIC F value  Pr(>F)    
## <none>                              189 -242                    
## ftype:fdrive            7      3.72 185 -236    1.04 0.40420    
## ftype:fcyl              7     12.10 177 -253    3.53 0.00110 ** 
## ftype:engine.size       4     11.56 178 -258    5.94 0.00012 ***
## ftype:horsepower        4      3.33 186 -241    1.63 0.16501    
## ftype:weight            4      5.42 184 -245    2.69 0.03088 *  
## ftype:length            4      4.08 185 -242    2.01 0.09235 .  
## ftype:width             4      3.04 186 -240    1.49 0.20408    
## ftype:wheel.base        4      2.39 187 -239    1.17 0.32368    
## fdrive:fcyl             4      5.89 183 -246    2.93 0.02085 *  
## fdrive:engine.size      2      3.78 185 -246    3.74 0.02468 *  
## fdrive:horsepower       2      1.53 188 -241    1.50 0.22539    
## fdrive:weight           2      3.19 186 -244    3.14 0.04432 *  
## fdrive:length           2      2.77 186 -244    2.73 0.06649 .  
## fdrive:width            2      1.97 187 -242    1.93 0.14618    
## fdrive:wheel.base       2      1.55 188 -241    1.51 0.22162    
## fcyl:engine.size        2      5.12 184 -248    5.10 0.00653 ** 
## fcyl:horsepower         2      7.04 182 -252    7.09 0.00095 ***
## fcyl:weight             2      6.44 183 -251    6.46 0.00174 ** 
## fcyl:length             2      0.79 188 -240    0.77 0.46398    
## fcyl:width              2      5.62 184 -250    5.62 0.00393 ** 
## fcyl:wheel.base         2      0.21 189 -238    0.21 0.81308    
## engine.size:horsepower  1      0.62 189 -241    1.21 0.27261    
## engine.size:weight      1      3.45 186 -247    6.85 0.00925 ** 
## engine.size:length      1      0.32 189 -240    0.63 0.42915    
## engine.size:width       1      1.33 188 -243    2.60 0.10760    
## engine.size:wheel.base  1      0.56 189 -241    1.10 0.29555    
## horsepower:weight       1      0.02 189 -240    0.03 0.85696    
## horsepower:length       1      2.70 186 -245    5.33 0.02149 *  
## horsepower:width        1      0.76 188 -241    1.48 0.22458    
## horsepower:wheel.base   1      0.02 189 -240    0.03 0.85637    
## weight:length           1      0.59 189 -241    1.15 0.28358    
## weight:width            1      1.12 188 -242    2.18 0.14030    
## weight:wheel.base       1      0.31 189 -240    0.60 0.43840    
## length:width            1      0.03 189 -240    0.06 0.80406    
## length:wheel.base       1      0.07 189 -240    0.13 0.71576    
## width:wheel.base        1      0.55 189 -241    1.08 0.29992    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We start slowly by eliminating those interaction terms that have \(p\)-value in drop1 lower than \(0.5\), which is not a mistake, we do not want to eliminate too much variables at the same time.

(insignif_d1 <- rownames(d1fit_int)[-1][d1fit_int$`Pr(>F)`[-1] > 0.5])
##  [1] "ftype:fdrive"           "ftype:fcyl"             "ftype:horsepower"       "ftype:length"          
##  [5] "fdrive:engine.size"     "fdrive:horsepower"      "fdrive:weight"          "fcyl:wheel.base"       
##  [9] "engine.size:length"     "engine.size:width"      "engine.size:wheel.base" "horsepower:weight"     
## [13] "horsepower:length"      "weight:width"           "weight:wheel.base"      "length:width"          
## [17] "length:wheel.base"
fit3 <- update(fit_int, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
(d1fit3 <- drop1(fit3, test = "F"))         # many remaining are significant
## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##                        Df Sum of Sq RSS  AIC F value  Pr(>F)    
## <none>                              125 -318                    
## ftype:engine.size       4      7.08 132 -305    4.64 0.00117 ** 
## ftype:weight            4      4.59 129 -312    3.01 0.01846 *  
## ftype:width             4      9.08 134 -299    5.95 0.00012 ***
## ftype:wheel.base        4      5.42 130 -310    3.55 0.00746 ** 
## fdrive:fcyl             4      2.22 127 -319    1.45 0.21624    
## fdrive:length           2      4.11 129 -309    5.39 0.00500 ** 
## fdrive:width            2      0.58 125 -320    0.76 0.46738    
## fdrive:wheel.base       2      5.75 130 -305    7.54 0.00063 ***
## fcyl:engine.size        2      3.25 128 -312    4.26 0.01491 *  
## fcyl:horsepower         2      2.19 127 -315    2.87 0.05822 .  
## fcyl:weight             2      2.41 127 -315    3.15 0.04399 *  
## fcyl:length             2      4.23 129 -309    5.55 0.00426 ** 
## fcyl:width              2      7.48 132 -300    9.81 7.3e-05 ***
## engine.size:horsepower  1      2.79 128 -311    7.32 0.00718 ** 
## engine.size:weight      1      3.12 128 -310    8.18 0.00452 ** 
## horsepower:width        1      4.04 129 -308   10.59 0.00125 ** 
## horsepower:wheel.base   1      0.03 125 -320    0.08 0.77590    
## weight:length           1      4.34 129 -307   11.39 0.00083 ***
## width:wheel.base        1      5.52 130 -303   14.49 0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit3, scope = fit_int, test = "F")     # none should be added back
## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##                        Df Sum of Sq RSS  AIC F value Pr(>F)  
## <none>                              125 -318                 
## ftype:fdrive            7      2.44 122 -312    0.91  0.497  
## ftype:fcyl              7      2.56 122 -312    0.96  0.461  
## ftype:horsepower        4      3.20 122 -320    2.13  0.077 .
## ftype:length            4      0.63 124 -312    0.41  0.803  
## fdrive:engine.size      2      1.08 124 -317    1.42  0.244  
## fdrive:horsepower       2      2.20 122 -321    2.92  0.056 .
## fdrive:weight           2      0.11 125 -314    0.14  0.867  
## fcyl:wheel.base         2      0.97 124 -317    1.27  0.283  
## engine.size:length      1      0.09 125 -316    0.25  0.619  
## engine.size:width       1      0.07 125 -316    0.18  0.669  
## engine.size:wheel.base  1      0.36 124 -317    0.94  0.332  
## horsepower:weight       1      0.69 124 -318    1.82  0.178  
## horsepower:length       1      0.02 125 -316    0.06  0.810  
## weight:width            1      0.12 125 -316    0.31  0.579  
## weight:wheel.base       1      0.00 125 -316    0.00  0.984  
## length:width            1      0.04 125 -316    0.10  0.754  
## length:wheel.base       1      0.73 124 -318    1.92  0.167  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit3, fit_int)                        # comparable to all two-way interaction model
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    327 125                         
## 2    288 113 39      11.7 0.76   0.85
anova(fit_add, fit3)                        # significantly improves additive model
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1    369 189                              
## 2    327 125 42      64.5 4.03 3.9e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we eliminate at \(0.05\) level for \(p\)-value in drop1:

(insignif_d1 <- rownames(d1fit3)[-1][d1fit3$`Pr(>F)`[-1] > 0.05])
## [1] "fdrive:fcyl"           "fdrive:width"          "fcyl:horsepower"       "horsepower:wheel.base"
fit4 <- update(fit3, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
drop1(fit4, test = "F")                     # all remaining are significant
## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##                        Df Sum of Sq RSS  AIC F value  Pr(>F)    
## <none>                              129 -322                    
## ftype:engine.size       4      8.72 138 -305    5.67 0.00020 ***
## ftype:weight            4      4.95 134 -316    3.22 0.01300 *  
## ftype:width             4     10.69 140 -300    6.95 2.2e-05 ***
## ftype:wheel.base        4      5.48 135 -314    3.56 0.00734 ** 
## fdrive:length           2      4.07 133 -314    5.29 0.00549 ** 
## fdrive:wheel.base       2      5.32 135 -311    6.91 0.00114 ** 
## fcyl:engine.size        2      2.65 132 -318    3.44 0.03325 *  
## fcyl:weight             2      3.58 133 -316    4.66 0.01011 *  
## fcyl:length             2      4.92 134 -312    6.39 0.00188 ** 
## fcyl:width              2      8.76 138 -301   11.39 1.6e-05 ***
## engine.size:horsepower  1      1.55 131 -320    4.02 0.04576 *  
## engine.size:weight      1      3.49 133 -314    9.07 0.00280 ** 
## horsepower:width        1      6.56 136 -305   17.06 4.6e-05 ***
## weight:length           1      3.43 133 -314    8.92 0.00303 ** 
## width:wheel.base        1      5.74 135 -307   14.92 0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(fit4, scope = fit_int, test = "F")     # none interaction should be added back
## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##                        Df Sum of Sq RSS  AIC F value Pr(>F)
## <none>                              129 -322               
## ftype:fdrive            7      1.39 128 -312    0.51   0.83
## ftype:fcyl              7      3.26 126 -318    1.22   0.29
## ftype:horsepower        4      1.94 127 -320    1.26   0.28
## ftype:length            4      0.78 128 -316    0.50   0.73
## fdrive:fcyl             4      1.91 127 -320    1.24   0.29
## fdrive:engine.size      2      0.16 129 -319    0.21   0.81
## fdrive:horsepower       2      0.33 129 -319    0.42   0.66
## fdrive:weight           2      0.17 129 -319    0.23   0.80
## fdrive:width            2      0.55 129 -320    0.72   0.49
## fcyl:horsepower         2      1.62 128 -323    2.12   0.12
## fcyl:wheel.base         2      1.19 128 -322    1.55   0.21
## engine.size:length      1      0.00 129 -320    0.00   1.00
## engine.size:width       1      0.00 129 -320    0.01   0.94
## engine.size:wheel.base  1      0.33 129 -321    0.86   0.35
## horsepower:weight       1      0.27 129 -321    0.71   0.40
## horsepower:length       1      0.07 129 -320    0.19   0.67
## horsepower:wheel.base   1      0.00 129 -320    0.00   0.95
## weight:width            1      0.11 129 -320    0.27   0.60
## weight:wheel.base       1      0.02 129 -320    0.05   0.83
## length:width            1      0.06 129 -320    0.14   0.71
## length:wheel.base       1      0.96 128 -323    2.52   0.11
anova(fit4, fit_int)                        # comparable to all two-way interaction model, spares 48 coefs
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    336 129                         
## 2    288 113 48      16.2 0.86   0.73
anova(fit_add, fit4)                        # significantly improves additive model
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1    369 189                              
## 2    336 129 33      59.9 4.72 4.6e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit4, fit3)                           # comparable to fit3, spares 9 coefs
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    336 129                         
## 2    327 125  9      4.56 1.33   0.22

The final model seems to be fit4 where every variable appears in some interaction, even width and fdrive which were excluded in model fit2 when no interactions were considered:

anova(fit2, fit4)
## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1    372 193                              
## 2    336 129 36      63.4 4.58 2.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

General exceptions from eliminating variables:

  • compulsory variables we are obliged to adjust the effects for (age, sex in medical studies),
  • variables crucial for answering the research question (treatment group in medical studies).

Elimination despite significance:

  • adjusting for multiple testing with lower significance threshold to keep only the strong effects to keep model parsimonious,
  • to avoid multicollinearity: highly correlated explanatory variables. It does not hurt estimation of \(\mathbb{X} \boldsymbol{\beta}\), but the precision of the estimates of regression coefficients \(\boldsymbol{\beta}\))

Our model was fitted on many highly correlated variables, see the exploration above. We compute the so called variance inflation factor (VIF) that reveals which variables may cause inflation in the estimation of variance of coefficients. It compares the current estimates of variance with the variability of estimates under orthogonal (uncorrelated) regressors.

car::vif(fit4, type = "predictor")
##                  GVIF Df GVIF^(1/(2*Df))                         Interacts With
## ftype       1.233e+11 26           1.634 engine.size, weight, width, wheel.base
## fdrive      1.058e+07  8           2.748                     length, wheel.base
## fcyl        1.809e+09 16           1.947     engine.size, weight, length, width
## engine.size 6.213e+27 23           4.020        ftype, fcyl, horsepower, weight
## horsepower  1.680e+07  5           5.279                     engine.size, width
## weight      6.364e+24 25           3.134       ftype, fcyl, engine.size, length
## length      1.789e+18 13           5.035                   fdrive, fcyl, weight
## width       1.709e+20 21           3.032    ftype, fcyl, horsepower, wheel.base
## wheel.base  4.004e+19 19           3.280                   ftype, fdrive, width
##                                                Other Predictors
## ftype                          fdrive, fcyl, horsepower, length
## fdrive      ftype, fcyl, engine.size, horsepower, weight, width
## fcyl                      ftype, fdrive, horsepower, wheel.base
## engine.size                   fdrive, length, width, wheel.base
## horsepower      ftype, fdrive, fcyl, weight, length, wheel.base
## weight                    fdrive, horsepower, width, wheel.base
## length        ftype, engine.size, horsepower, width, wheel.base
## width                       fdrive, engine.size, weight, length
## wheel.base        fcyl, engine.size, horsepower, weight, length

Individual work