F-test on submodel

Download this R markdown as: R, Rmd.

Outline of this lab session:

F-test on submodel,
anova, drop1, add1, vif.

Loading the data and libraries

library("mffSM")         # Cars2004nh
library("colorspace")    # rainbow_hcl() colors
library("psych")         # pairs.panels()
library("car")           # vif()

We will work again with the dataset peat and Cars2004nh from previous exercises.

peat <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/peat.csv", 
                 header = TRUE, stringsAsFactors = TRUE)
data(Cars2004nh, package = "mffSM")

Theory

Given response \(\mathbf{Y}\) and available set of covariates \(\mathbb{Z}\) we consider two different model matrices \(\mathbb{X}^{0}\) and \(\mathbb{X}^1\) of ranks \(r_0\) and \(r_1\). These correspond to two different models: \[ \mathrm{M}_0: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right), \qquad \mathrm{M}_1: \mathbb{Y} \mid \mathbb{Z} \sim \left(\mathbb{X}^1 \boldsymbol{\beta}^1, \, \sigma_1^2 \mathbb{I}_n\right) \] Model \(\mathrm{M}_0\) is considered to be a submodel of model \(\mathrm{M}_1\) if the linear subspace of \(\mathbb{R}^n\) generated by the columns of \(\mathbb{X}^0\) (\(\mathcal{M}(\mathbb{X}^0)\)) is a subspace of the space generated by the columns of \(\mathbb{X}^1\). In particular, \(\mathcal{M}(\mathbb{X}^0) \subset \mathcal{M}(\mathbb{X}^1)\).

Model \(\mathrm{M}_0\) holds means \(\mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^0) \subset \mathcal{M}(\mathbb{X}^1)\). Which means that the larger model \(\mathrm{M}^1\) also holds.
Model \(\mathrm{M}_0\) does not hold but \(\mathrm{M}_1\) holds means that \(\mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^1) \setminus \mathcal{M}(\mathbb{X}^0)\).

Let us label the quantities computed from models \(\mathrm{M}_0\) and \(\mathrm{M}_1\) with a subscript or superscript \(0\) or \(1\), respectively. If submodel \(\mathrm{M}_0\) holds, then

\(\widehat{\mathbf{Y}}^0\) is the best linear unbiased estimator of \(\mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right]\),
submodel residual mean square \(\frac{\mathsf{RSS}_0}{n-r_0}\) is unbiased estimator of \(\sigma_0^2\),
random vector \(\mathbf{D} = \widehat{\mathbf{Y}}^1 - \widehat{\mathbf{Y}}^0 = \widehat{\mathbf{U}}^0 - \widehat{\mathbf{U}}^1\) satisfies \(\|\mathbf{D}\|^2 = \mathsf{RSS}_0 - \mathsf{RSS}_1\),
vectors \(\widehat{\mathbf{Y}}^{0}\), \(\mathbf{D}\) and \(\mathbf{U}^1\) are uncorrelated.

If additionally, a normal linear model is assumed, \(\mathbb{Y} \mid \mathbb{Z} \sim \mathsf{N}\left(\mathbb{X}^0 \boldsymbol{\beta}^0, \, \sigma_0^2 \mathbb{I}_n\right)\), then

\(\|\mathbf{D}\|^2 = \mathsf{RSS}_0 - \mathsf{RSS}_1\) and \(\|\mathbf{U}^1\|^2 = \mathsf{RSS}_1\) are independent and

\[ F = \frac{\frac{\mathsf{RSS}_0 - \mathsf{RSS}_1}{r_1 - r_0}}{\frac{\mathsf{RSS}_1}{n - r_1}} \sim \mathsf{F}_{r_1 - r_0, n - r_1}. \]

This result can used for testing the following statistical hypotheses: \[\begin{align*} \mathrm{H}_0:& \text{ model } \mathrm{M}_0 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^0), \\ \mathrm{H}_1:& \text{ model } \mathrm{M}_1 \text{ holds, i.e. } \mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right] \in \mathcal{M}(\mathbb{X}^1) \setminus \mathcal{M}(\mathbb{X}^0). \end{align*}\] This aims at answering the following questions:

Is model \(\mathrm{M}_1\) significantly better than model \(\mathrm{M}_0\)?
Does larger regression space provide a significantly better expression for \(\mathsf{E} \left[\mathbf{Y} \mid \mathbb{Z}\right]\) over the smaller regression space?

\(F\)-test on submodel on significance level \(\alpha\):

test statistic \(F\) with observed value \(f\),
reject \(\mathrm{H}_0\) if \(f \geq \mathcal{F}_{r_1 - r_0, n-r_1}(1-\alpha)\) (\(\mathcal{F}\) quantile function of \(\mathsf{F}\) distribution),
\(p\)-value computed as \(p = 1- \mathsf{CDF}^{\mathsf{F}}_{r_1 - r_0, n-r_1}(f)\) (cumulative distribution function of \(\mathsf{F}\) distribution).

F-test on a submodel

Consider the peat data from previous exercises.

We will compute the \(F\)-test for comparison of the additive (parallel lines) vs. interaction (group-specific lines) model. Clearly, additive model is a submodel of the interaction model.

fit_add <- lm(N ~ depth + group, data = peat)
fit_int <- lm(N ~ depth * group, data = peat)

(RSS0 <- sum(residuals(fit_add)^2))

## [1] 5.566937

(RSS1 <- sum(residuals(fit_int)^2))

## [1] 2.568338

(D2 <- RSS0 - RSS1)

## [1] 2.998599

(df_dif <- fit_add$df.residual - fit_int$df.residual)

## [1] 3

(Fstat <- D2/df_dif / (RSS1/fit_int$df.residual))

## [1] 54.4845

(Fcritical <- qf(0.95, df1 = df_dif, df2 = fit_int$df.residual))

## [1] 2.669256

(rejectH0 <- (Fstat > Fcritical))

## [1] TRUE

(pval <- pf(Fstat, df1 = df_dif, df2 = fit_int$df.residual, lower.tail = FALSE))

## [1] 2.128525e-23

Beyond any doubt we proved that the interaction model significantly improves the additive model. The submodel \(F\)-test is already implemented by anova function.

anova(fit_add, fit_int)

## Analysis of Variance Table
## 
## Model 1: N ~ depth + group
## Model 2: N ~ depth * group
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    143 5.5669                                  
## 2    140 2.5683  3    2.9986 54.484 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This function also excepts only a single model and provides intermediate results for model construction. In this case it sequentially compares the following models:

N ~ 1 (intercept-only model),
N ~ depth (adds one coefficient),
N ~ depth + group (adds three coefficients),
N ~ depth + group + depth:group (adds three coefficients).

anova(fit_int)

## Analysis of Variance Table
## 
## Response: N
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## depth         1 0.6263 0.62628  34.138  3.44e-08 ***
## group         3 3.3386 1.11287  60.663 < 2.2e-16 ***
## depth:group   3 2.9986 0.99953  54.484 < 2.2e-16 ***
## Residuals   140 2.5683 0.01835                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the tests provided by anova this way depend on the order of the terms.

anova(lm(N ~ group * depth, data = peat))

## Analysis of Variance Table
## 
## Response: N
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## group         3 3.2724 1.09079  59.459 < 2.2e-16 ***
## depth         1 0.6925 0.69251  37.749 7.877e-09 ***
## group:depth   3 2.9986 0.99953  54.484 < 2.2e-16 ***
## Residuals   140 2.5683 0.01835                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Also note that when compared to the intercept-only model we get the \(F\)-test provided in the basic summary of the model:

summary(fit_int)

## 
## Call:
## lm(formula = N ~ depth * group, data = peat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36829 -0.08469 -0.00789  0.06112  0.37713 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.846956   0.051193  16.544  < 2e-16 ***
## depth              0.005048   0.005724   0.882    0.379    
## groupCB-VJJ        0.322113   0.064417   5.000 1.68e-06 ***
## groupVJJ          -0.064041   0.075374  -0.850    0.397    
## groupVJJ-CB        0.075906   0.064417   1.178    0.241    
## depth:groupCB-VJJ -0.032600   0.007389  -4.412 2.03e-05 ***
## depth:groupVJJ     0.043942   0.008312   5.287 4.68e-07 ***
## depth:groupVJJ-CB  0.041591   0.007389   5.629 9.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1354 on 140 degrees of freedom
## Multiple R-squared:  0.7306, Adjusted R-squared:  0.7171 
## F-statistic: 54.23 on 7 and 140 DF,  p-value: < 2.2e-16

anova(lm(N ~ 1, data = peat), fit_int)

## Analysis of Variance Table
## 
## Model 1: N ~ 1
## Model 2: N ~ depth * group
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    147 9.5318                                  
## 2    140 2.5683  7    6.9635 54.226 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building hierarchically well-structured model

In the rest of the exercise we will work with Cars2004nh dataset from mffSM package. It contains information about \(n=\) 425 new cars of the year 2004. We will work with only a subset of variables suitable for predicting the consumption of the vehicle:

y <- "consumption"
x <- c("ftype", "fdrive", "engine.size", "ncylinder", "horsepower", "weight", "length", "width", "wheel.base")
head(Cars2004nh[, c(y, x)])

##   consumption    ftype fdrive engine.size ncylinder horsepower weight length width wheel.base
## 1        7.65 personal  front         1.6         4        103   1075    424   168        249
## 2        7.65 personal  front         1.6         4        103   1065    389   168        249
## 3        7.70 personal  front         2.2         4        140   1187    465   175        264
## 4        7.70 personal  front         2.2         4        140   1214    465   173        264
## 5        7.70 personal  front         2.2         4        140   1187    465   175        264
## 6        7.30 personal  front         2.0         4        132   1171    442   170        267

summary(Cars2004nh[, c(y, x)])

##   consumption         ftype       fdrive     engine.size      ncylinder        horsepower        weight    
##  Min.   : 5.65   personal:242   front:223   Min.   :1.300   Min.   :-1.000   Min.   :100.0   Min.   : 923  
##  1st Qu.: 9.65   wagon   : 30   rear :110   1st Qu.:2.400   1st Qu.: 4.000   1st Qu.:165.0   1st Qu.:1412  
##  Median :10.70   SUV     : 60   4x4  : 92   Median :3.000   Median : 6.000   Median :210.0   Median :1577  
##  Mean   :10.75   pickup  : 24               Mean   :3.208   Mean   : 5.791   Mean   :216.8   Mean   :1626  
##  3rd Qu.:11.65   sport   : 49               3rd Qu.:3.900   3rd Qu.: 6.000   3rd Qu.:255.0   3rd Qu.:1804  
##  Max.   :21.55   minivan : 20               Max.   :8.300   Max.   :12.000   Max.   :500.0   Max.   :3261  
##  NA's   :14                                                                                  NA's   :2     
##      length          width         wheel.base   
##  Min.   :363.0   Min.   :163.0   Min.   :226.0  
##  1st Qu.:450.0   1st Qu.:175.0   1st Qu.:262.0  
##  Median :472.0   Median :180.0   Median :272.0  
##  Mean   :470.6   Mean   :181.1   Mean   :274.9  
##  3rd Qu.:490.0   3rd Qu.:185.0   3rd Qu.:284.0  
##  Max.   :577.0   Max.   :206.0   Max.   :366.0  
##  NA's   :26      NA's   :28      NA's   :2

# psych::pairs.panels(Cars2004nh[, c(y,"weight", "iweight", "lweight")])
# keep only 
data <- na.omit(Cars2004nh[, c(y, x)])

# exploration
table(data$ncylinder)

## 
##   4   5   6   8  12 
## 121   7 179  75   2

data$fcyl <- cut(data$ncylinder, breaks = c(4, 6, 8, 12), right = FALSE, include.lowest = TRUE)
summary(data$fcyl)

##  [4,6)  [6,8) [8,12] 
##    128    179     77

data$ftype <- factor(data$ftype, levels = setdiff(levels(data$ftype), "pickup")) # pickups not represented in data

# consumption vs categorical variables
xcat <- c("ftype", "fdrive", "fcyl")
par(mfrow = c(1, length(xcat)), mar = c(4,4,0.5,0.5))
for(x in xcat){
  plot(as.formula(paste("consumption ~ ", x)), data = data)
}

# consumption vs numeric variables
xnum <- c("engine.size", "horsepower", "weight", "length", "width", "wheel.base")
psych::pairs.panels(data[, c(y,xnum)])

The ultimate goal is to find as simple as possible (for interpretation) model for explaining consumption of a vehicle given other characteristics. First, we fit an additive model with 3 categorical and 6 numerical regressors.

fit_add <- lm(consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base, data = data)
summary(fit_add)

## 
## Call:
## lm(formula = consumption ~ ftype + fdrive + fcyl + engine.size + 
##     horsepower + weight + length + width + wheel.base, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0291 -0.4651 -0.0348  0.4332  3.5840 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8605120  1.3260663   7.436 7.33e-13 ***
## ftypewagon    0.0048205  0.1455164   0.033  0.97359    
## ftypeSUV      0.8448182  0.1640069   5.151 4.22e-07 ***
## ftypesport    0.1126023  0.1756491   0.641  0.52188    
## ftypeminivan  0.2003173  0.2146830   0.933  0.35139    
## fdriverear    0.1774620  0.1170848   1.516  0.13046    
## fdrive4x4     0.2395297  0.1180205   2.030  0.04312 *  
## fcyl[6,8)     0.2816061  0.1418941   1.985  0.04793 *  
## fcyl[8,12]   -0.0040983  0.2747498  -0.015  0.98811    
## engine.size   0.4702588  0.1179395   3.987 8.06e-05 ***
## horsepower    0.0043882  0.0010376   4.229 2.96e-05 ***
## weight        0.0045723  0.0003484  13.123  < 2e-16 ***
## length       -0.0050020  0.0027741  -1.803  0.07219 .  
## width        -0.0119624  0.0094784  -1.262  0.20772    
## wheel.base   -0.0178717  0.0054620  -3.272  0.00117 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.716 on 369 degrees of freedom
## Multiple R-squared:  0.8845, Adjusted R-squared:  0.8801 
## F-statistic: 201.8 on 14 and 369 DF,  p-value: < 2.2e-16

The model could be sequentially built by adding one variable at a time, which can be monitored by anova:

anova(fit_add)

## Analysis of Variance Table
## 
## Response: consumption
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## ftype         4 605.26 151.314 295.1589 < 2.2e-16 ***
## fdrive        2 225.46 112.728 219.8913 < 2.2e-16 ***
## fcyl          2 411.26 205.631 401.1115 < 2.2e-16 ***
## engine.size   1  72.13  72.132 140.7035 < 2.2e-16 ***
## horsepower    1  41.67  41.666  81.2757 < 2.2e-16 ***
## weight        1  67.92  67.925 132.4966 < 2.2e-16 ***
## length        1  18.06  18.060  35.2279 6.766e-09 ***
## width         1   1.04   1.039   2.0266  0.155412    
## wheel.base    1   5.49   5.488  10.7059  0.001169 ** 
## Residuals   369 189.17   0.513                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, as mentioned above the individual tests are heavily dependent on the order of the variables despite reaching the same model in the end:

anova(lm(consumption ~ width + wheel.base + length + ftype + engine.size + fcyl + horsepower + weight + fdrive, data = data))

## Analysis of Variance Table
## 
## Response: consumption
##              Df Sum Sq Mean Sq   F value    Pr(>F)    
## width         1 702.18  702.18 1369.7021 < 2.2e-16 ***
## wheel.base    1   1.51    1.51    2.9481 0.0868198 .  
## length        1  47.62   47.62   92.8803 < 2.2e-16 ***
## ftype         4 272.77   68.19  133.0208 < 2.2e-16 ***
## engine.size   1 248.00  248.00  483.7547 < 2.2e-16 ***
## fcyl          2   8.83    4.41    8.6086 0.0002218 ***
## horsepower    1  50.54   50.54   98.5763 < 2.2e-16 ***
## weight        1 114.20  114.20  222.7604 < 2.2e-16 ***
## fdrive        2   2.64    1.32    2.5773 0.0773435 .  
## Residuals   369 189.17    0.51                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The submodel test compares two models where both or just the larger ones need to hold. So when an important variable is missing the conclusions may be invalid.

Hence, we need a different kind of test to determine the significance of a variable. The function drop1 compares the original large model with many different submodels, which are created by dropping one term without breaking the hierarchy:

drop1(fit_add, test = "F")

## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##             Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                   189.17 -241.873                       
## ftype        4    15.091 204.26 -220.400   7.3591 1.036e-05 ***
## fdrive       2     2.643 191.81 -240.545   2.5773  0.077344 .  
## fcyl         2     7.115 196.28 -231.694   6.9395  0.001100 ** 
## engine.size  1     8.150 197.32 -227.674  15.8985 8.059e-05 ***
## horsepower   1     9.169 198.34 -225.697  17.8855 2.962e-05 ***
## weight       1    88.285 277.45  -96.795 172.2114 < 2.2e-16 ***
## length       1     1.667 190.84 -240.504   3.2511  0.072192 .  
## width        1     0.817 189.99 -242.219   1.5928  0.207721    
## wheel.base   1     5.488 194.66 -232.890  10.7059  0.001169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we know which variables are significant while adjusting for the effect of other remaining variables. The insignificant ones seem to be redundant, thus, we can remove them from the model by update function. Now all remaining variables are significant:

fit1 <- update(fit_add, .~. - fdrive - length - width)
anova(fit1, fit_add)

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    373 198.03                                
## 2    369 189.17  4    8.8561 4.3188 0.001996 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(fit1, test = "F")

## Single term deletions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
##             Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                   198.03 -232.303                       
## ftype        4    21.843 219.87 -200.124  10.2859 6.542e-08 ***
## fcyl         2     6.949 204.97 -223.059   6.5447  0.001608 ** 
## engine.size  1     4.850 202.88 -225.012   9.1348  0.002681 ** 
## horsepower   1    13.679 211.70 -208.655  25.7649 6.099e-07 ***
## weight       1   114.666 312.69  -58.884 215.9842 < 2.2e-16 ***
## wheel.base   1    26.324 224.35 -186.376  49.5842 9.197e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, by removing multiple variables at once we might have eliminated too many variables. It can happen that one of the eliminated variables may still significantly improve a model. As we can see in anova output above fit_add is a significant improvement of fit1 despite all eliminated variables being insignificant in fit_add. Similarly as we can check dropping one variable, we can also inspect, how model improves upon adding one more term with add1 function. The scope parameter should be a vector of terms that we wish to add to the current model. Or alternatively set scope as a larger model, where only terms that do not appear in fit1 will be added - one at a time:

# add1(fit1, scope = c("fdrive", "length", "width"), test = "F")
add1(fit1, scope = fit_add, test = "F")

## Single term additions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              198.03 -232.30                    
## fdrive  2    4.8187 193.21 -237.76  4.6265 0.010361 * 
## length  1    5.3414 192.68 -240.80 10.3123 0.001437 **
## width   1    3.2180 194.81 -236.59  6.1450 0.013621 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems like adding length back could improve the model:

fit2 <- update(fit1, .~. + length)
anova(fit2, fit_add)                    # fit2 comparable to fit_add

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    372 192.68                              
## 2    369 189.17  3    3.5147 2.2853 0.07848 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(fit2, test = "F")                 # all variables are significant

## Single term deletions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
##             Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                   192.68 -240.803                       
## ftype        4    14.670 207.35 -220.627   7.0805 1.674e-05 ***
## fcyl         2     7.466 200.15 -230.206   7.2066 0.0008498 ***
## engine.size  1     7.392 200.07 -228.348  14.2704 0.0001843 ***
## horsepower   1    10.572 203.25 -222.293  20.4100 8.406e-06 ***
## weight       1   118.015 310.70  -59.339 227.8421 < 2.2e-16 ***
## wheel.base   1     6.139 198.82 -230.760  11.8523 0.0006414 ***
## length       1     5.341 198.03 -232.303  10.3123 0.0014365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

add1(fit2, scope = fit_add, test = "F") # none other can improve the model

## Single term additions
## 
## Model:
## consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
##        Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>              192.68 -240.80                  
## fdrive  2   2.69812 189.99 -242.22  2.6273 0.07362 .
## width   1   0.87213 191.81 -240.54  1.6869 0.19482  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After adding just length we end with model fit2 that

is comparable (\(p > 0.05\)) to the original additive model with all variables,
all variables are significant (\(p < 0.05\) in drop1 output),
other variables (fdrive and width) do not improve the model (\(p > 0.05\) in add1 output).

Considering two-way interactions

So far we have neglected an option for the effects of variables to be modified by other variables: different lines for different groups, etc. We will start with a model with all possible variables and eliminate the insignificant terms on our way to final model. Here, we will limit ourselves to two-way interactions that are usually sufficient for modelling:

fit_int <- lm(consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + weight + length + width + wheel.base)^2, data = data)
anova(fit_add, fit_int)                     # two-way interaction improve additive model

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    369 189.17                                  
## 2    288 113.02 81    76.153 2.3958 5.722e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(d1fit_int <- drop1(fit_int, test = "F"))   # however, majority seem irrelevant in the full model

## Single term deletions
## 
## Model:
## consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##                        Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>                              113.02 -277.68                  
## ftype:fdrive            7    0.7716 113.79 -289.06  0.2809 0.96113  
## ftype:fcyl              7    1.9191 114.94 -285.21  0.6986 0.67322  
## ftype:engine.size       4    3.2190 116.23 -274.89  2.0508 0.08740 .
## ftype:horsepower        4    1.1508 114.17 -281.79  0.7331 0.56998  
## ftype:weight            4    1.3244 114.34 -281.20  0.8438 0.49838  
## ftype:length            4    0.4079 113.42 -284.29  0.2599 0.90348  
## ftype:width             4    3.7659 116.78 -273.09  2.3992 0.05031 .
## ftype:wheel.base        4    1.5952 114.61 -280.29  1.0163 0.39919  
## fdrive:fcyl             4    1.9176 114.93 -279.21  1.2217 0.30170  
## fdrive:engine.size      2    0.3558 113.37 -280.47  0.4534 0.63591  
## fdrive:horsepower       2    0.3370 113.35 -280.53  0.4294 0.65130  
## fdrive:weight           2    0.0911 113.11 -281.37  0.1161 0.89042  
## fdrive:length           2    1.1987 114.22 -277.62  1.5273 0.21887  
## fdrive:width            2    1.0536 114.07 -278.11  1.3424 0.26284  
## fdrive:wheel.base       2    1.8076 114.82 -275.58  2.3031 0.10178  
## fcyl:engine.size        2    1.8858 114.90 -275.32  2.4028 0.09228 .
## fcyl:horsepower         2    2.2727 115.29 -274.03  2.8958 0.05687 .
## fcyl:weight             2    0.7753 113.79 -279.05  0.9879 0.37362  
## fcyl:length             2    0.9034 113.92 -278.62  1.1511 0.31773  
## fcyl:width              2    2.0903 115.11 -274.64  2.6634 0.07143 .
## fcyl:wheel.base         2    0.2770 113.29 -280.74  0.3529 0.70295  
## engine.size:horsepower  1    0.6172 113.63 -277.58  1.5727 0.21083  
## engine.size:weight      1    1.7137 114.73 -273.90  4.3669 0.03752 *
## engine.size:length      1    0.0448 113.06 -279.52  0.1142 0.73562  
## engine.size:width       1    0.0038 113.02 -279.66  0.0096 0.92191  
## engine.size:wheel.base  1    0.0825 113.10 -279.39  0.2102 0.64692  
## horsepower:weight       1    0.0590 113.08 -279.48  0.1503 0.69856  
## horsepower:length       1    0.0126 113.03 -279.63  0.0320 0.85810  
## horsepower:width        1    0.6883 113.70 -277.34  1.7541 0.18641  
## horsepower:wheel.base   1    0.6519 113.67 -277.47  1.6612 0.19847  
## weight:length           1    1.1320 114.15 -275.85  2.8846 0.09051 .
## weight:width            1    0.0624 113.08 -279.46  0.1590 0.69034  
## weight:wheel.base       1    0.0012 113.02 -279.67  0.0030 0.95629  
## length:width            1    0.0044 113.02 -279.66  0.0113 0.91533  
## length:wheel.base       1    0.1389 113.16 -279.20  0.3539 0.55239  
## width:wheel.base        1    0.6650 113.68 -277.42  1.6945 0.19405  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(a1fit_int <- add1(fit_add, scope = fit_int, test = "F"))  # a dozen improves the additive model

## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
##                        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                              189.17 -241.87                      
## ftype:fdrive            7    3.7211 185.45 -235.50  1.0377 0.4042049    
## ftype:fcyl              7   12.0978 177.07 -253.25  3.5332 0.0010994 ** 
## ftype:engine.size       4   11.5556 177.61 -258.08  5.9368 0.0001227 ***
## ftype:horsepower        4    3.3282 185.84 -240.69  1.6342 0.1650099    
## ftype:weight            4    5.4219 183.75 -245.04  2.6926 0.0308789 *  
## ftype:length            4    4.0795 185.09 -242.24  2.0112 0.0923456 .  
## ftype:width             4    3.0427 186.13 -240.10  1.4917 0.2040791    
## ftype:wheel.base        4    2.3947 186.77 -238.76  1.1699 0.3236844    
## fdrive:fcyl             4    5.8866 183.28 -246.01  2.9307 0.0208522 *  
## fdrive:engine.size      2    3.7778 185.39 -245.62  3.7393 0.0246823 *  
## fdrive:horsepower       2    1.5298 187.64 -240.99  1.4960 0.2253856    
## fdrive:weight           2    3.1855 185.98 -244.39  3.1430 0.0443178 *  
## fdrive:length           2    2.7738 186.40 -243.54  2.7308 0.0664945 .  
## fdrive:width            2    1.9720 187.20 -241.90  1.9330 0.1461778    
## fdrive:wheel.base       2    1.5470 187.62 -241.03  1.5130 0.2216173    
## fcyl:engine.size        2    5.1166 184.05 -248.40  5.1013 0.0065280 ** 
## fcyl:horsepower         2    7.0358 182.13 -252.43  7.0886 0.0009537 ***
## fcyl:weight             2    6.4373 182.73 -251.17  6.4644 0.0017413 ** 
## fcyl:length             2    0.7900 188.38 -239.48  0.7695 0.4639793    
## fcyl:width              2    5.6239 183.54 -249.46  5.6225 0.0039340 ** 
## fcyl:wheel.base         2    0.2132 188.96 -238.31  0.2070 0.8130828    
## engine.size:horsepower  1    0.6185 188.55 -241.13  1.2072 0.2726124    
## engine.size:weight      1    3.4547 185.71 -246.95  6.8456 0.0092521 ** 
## engine.size:length      1    0.3215 188.85 -240.53  0.6265 0.4291480    
## engine.size:width       1    1.3281 187.84 -242.58  2.6018 0.1075978    
## engine.size:wheel.base  1    0.5624 188.61 -241.02  1.0973 0.2955463    
## horsepower:weight       1    0.0167 189.15 -239.91  0.0325 0.8569586    
## horsepower:length       1    2.7018 186.47 -245.40  5.3322 0.0214888 *  
## horsepower:width        1    0.7577 188.41 -241.41  1.4799 0.2245767    
## horsepower:wheel.base   1    0.0169 189.15 -239.91  0.0328 0.8563657    
## weight:length           1    0.5910 188.58 -241.07  1.1532 0.2835803    
## weight:width            1    1.1161 188.05 -242.15  2.1841 0.1402996    
## weight:wheel.base       1    0.3088 188.86 -240.50  0.6018 0.4384019    
## length:width            1    0.0317 189.14 -239.94  0.0616 0.8040616    
## length:wheel.base       1    0.0682 189.10 -240.01  0.1328 0.7157639    
## width:wheel.base        1    0.5523 188.62 -241.00  1.0776 0.2999184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We start slowly by eliminating those interaction terms that have \(p\)-value in drop1 lower than \(0.5\), which is not a mistake, we do not want to eliminate too much variables at the same time.

(insignif_d1 <- rownames(d1fit_int)[-1][d1fit_int$`Pr(>F)`[-1] > 0.5])

##  [1] "ftype:fdrive"           "ftype:fcyl"             "ftype:horsepower"       "ftype:length"          
##  [5] "fdrive:engine.size"     "fdrive:horsepower"      "fdrive:weight"          "fcyl:wheel.base"       
##  [9] "engine.size:length"     "engine.size:width"      "engine.size:wheel.base" "horsepower:weight"     
## [13] "horsepower:length"      "weight:width"           "weight:wheel.base"      "length:width"          
## [17] "length:wheel.base"

fit3 <- update(fit_int, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
(d1fit3 <- drop1(fit3, test = "F"))         # many remaining are significant

## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##                        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                              124.69 -317.93                      
## ftype:engine.size       4    7.0802 131.77 -304.72  4.6420 0.0011719 ** 
## ftype:weight            4    4.5891 129.28 -312.05  3.0088 0.0184637 *  
## ftype:width             4    9.0813 133.77 -298.93  5.9540 0.0001233 ***
## ftype:wheel.base        4    5.4174 130.11 -309.60  3.5518 0.0074576 ** 
## fdrive:fcyl             4    2.2165 126.91 -319.16  1.4532 0.2162380    
## fdrive:length           2    4.1071 128.80 -309.48  5.3854 0.0049985 ** 
## fdrive:width            2    0.5814 125.27 -320.14  0.7624 0.4673780    
## fdrive:wheel.base       2    5.7513 130.44 -304.61  7.5414 0.0006282 ***
## fcyl:engine.size        2    3.2488 127.94 -312.05  4.2600 0.0149141 *  
## fcyl:horsepower         2    2.1875 126.88 -315.25  2.8683 0.0582247 .  
## fcyl:weight             2    2.4051 127.09 -314.59  3.1537 0.0439944 *  
## fcyl:length             2    4.2328 128.92 -309.11  5.5502 0.0042616 ** 
## fcyl:width              2    7.4819 132.17 -299.55  9.8107 7.281e-05 ***
## engine.size:horsepower  1    2.7908 127.48 -311.43  7.3188 0.0071812 ** 
## engine.size:weight      1    3.1177 127.81 -310.45  8.1762 0.0045169 ** 
## horsepower:width        1    4.0397 128.73 -307.68 10.5942 0.0012532 ** 
## horsepower:wheel.base   1    0.0310 124.72 -319.83  0.0812 0.7759010    
## weight:length           1    4.3449 129.03 -306.78 11.3945 0.0008253 ***
## width:wheel.base        1    5.5246 130.21 -303.28 14.4883 0.0001684 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

add1(fit3, scope = fit_int, test = "F")     # none should be added back

## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##                        Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>                              124.69 -317.93                  
## ftype:fdrive            7    2.4399 122.25 -311.52  0.9124 0.49695  
## ftype:fcyl              7    2.5637 122.13 -311.91  0.9596 0.46081  
## ftype:horsepower        4    3.2015 121.49 -319.92  2.1280 0.07712 .
## ftype:length            4    0.6264 124.06 -311.86  0.4077 0.80305  
## fdrive:engine.size      2    1.0774 123.61 -317.26  1.4163 0.24410  
## fdrive:horsepower       2    2.1975 122.49 -320.76  2.9153 0.05561 .
## fdrive:weight           2    0.1095 124.58 -314.27  0.1428 0.86694  
## fcyl:wheel.base         2    0.9652 123.72 -316.91  1.2677 0.28286  
## engine.size:length      1    0.0946 124.59 -316.22  0.2476 0.61914  
## engine.size:width       1    0.0701 124.62 -316.14  0.1833 0.66884  
## engine.size:wheel.base  1    0.3601 124.33 -317.04  0.9443 0.33190  
## horsepower:weight       1    0.6940 124.00 -318.07  1.8247 0.17770  
## horsepower:length       1    0.0222 124.67 -316.00  0.0580 0.80989  
## weight:width            1    0.1181 124.57 -316.29  0.3091 0.57861  
## weight:wheel.base       1    0.0001 124.69 -315.93  0.0004 0.98430  
## length:width            1    0.0376 124.65 -316.04  0.0984 0.75395  
## length:wheel.base       1    0.7308 123.96 -318.19  1.9220 0.16659  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit3, fit_int)                        # comparable to all two-way interaction model

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    327 124.69                           
## 2    288 113.02 39    11.674 0.7628 0.8462

anova(fit_add, fit3)                        # significantly improves additive model

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    369 189.17                                  
## 2    327 124.69 42    64.479 4.0261 3.882e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we eliminate at \(0.05\) level for \(p\)-value in drop1:

(insignif_d1 <- rownames(d1fit3)[-1][d1fit3$`Pr(>F)`[-1] > 0.05])

## [1] "fdrive:fcyl"           "fdrive:width"          "fcyl:horsepower"       "horsepower:wheel.base"

fit4 <- update(fit3, as.formula(paste(".~. -", paste(insignif_d1, collapse = " - "))))
drop1(fit4, test = "F")                     # all remaining are significant

## Single term deletions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##                        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                              129.25 -322.12                      
## ftype:engine.size       4    8.7170 137.97 -305.06  5.6651 0.0002010 ***
## ftype:weight            4    4.9527 134.21 -315.69  3.2187 0.0129974 *  
## ftype:width             4   10.6892 139.94 -299.61  6.9468 2.206e-05 ***
## ftype:wheel.base        4    5.4773 134.73 -314.19  3.5596 0.0073391 ** 
## fdrive:length           2    4.0676 133.32 -314.23  5.2869 0.0054866 ** 
## fdrive:wheel.base       2    5.3194 134.57 -310.64  6.9139 0.0011415 ** 
## fcyl:engine.size        2    2.6454 131.90 -318.34  3.4384 0.0332502 *  
## fcyl:weight             2    3.5838 132.84 -315.62  4.6581 0.0101055 *  
## fcyl:length             2    4.9192 134.17 -311.78  6.3938 0.0018825 ** 
## fcyl:width              2    8.7594 138.01 -300.94 11.3852 1.644e-05 ***
## engine.size:horsepower  1    1.5465 130.80 -319.56  4.0202 0.0457595 *  
## engine.size:weight      1    3.4876 132.74 -313.90  9.0661 0.0028012 ** 
## horsepower:width        1    6.5634 135.82 -305.10 17.0617 4.572e-05 ***
## weight:length           1    3.4297 132.68 -314.07  8.9157 0.0030349 ** 
## width:wheel.base        1    5.7406 134.99 -307.44 14.9230 0.0001344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

add1(fit4, scope = fit_int, test = "F")     # none interaction should be added back

## Single term additions
## 
## Model:
## consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##                        Df Sum of Sq    RSS     AIC F value Pr(>F)
## <none>                              129.25 -322.12               
## ftype:fdrive            7    1.3933 127.86 -312.29  0.5122 0.8253
## ftype:fcyl              7    3.2647 125.99 -317.95  1.2179 0.2922
## ftype:horsepower        4    1.9373 127.32 -319.92  1.2630 0.2843
## ftype:length            4    0.7814 128.47 -316.45  0.5049 0.7322
## fdrive:fcyl             4    1.9058 127.35 -319.83  1.2421 0.2928
## fdrive:engine.size      2    0.1635 129.09 -318.61  0.2115 0.8095
## fdrive:horsepower       2    0.3257 128.93 -319.09  0.4218 0.6562
## fdrive:weight           2    0.1742 129.08 -318.64  0.2254 0.7984
## fdrive:width            2    0.5533 128.70 -319.77  0.7179 0.4885
## fcyl:horsepower         2    1.6217 127.63 -322.97  2.1219 0.1214
## fcyl:wheel.base         2    1.1893 128.06 -321.67  1.5509 0.2136
## engine.size:length      1    0.0000 129.25 -320.12  0.0000 0.9952
## engine.size:width       1    0.0022 129.25 -320.13  0.0058 0.9393
## engine.size:wheel.base  1    0.3319 128.92 -321.11  0.8625 0.3537
## horsepower:weight       1    0.2748 128.98 -320.94  0.7138 0.3988
## horsepower:length       1    0.0719 129.18 -320.34  0.1866 0.6661
## horsepower:wheel.base   1    0.0014 129.25 -320.13  0.0035 0.9526
## weight:width            1    0.1055 129.15 -320.44  0.2735 0.6013
## weight:wheel.base       1    0.0189 129.24 -320.18  0.0489 0.8252
## length:width            1    0.0552 129.20 -320.29  0.1431 0.7055
## length:wheel.base       1    0.9640 128.29 -323.00  2.5173 0.1135

anova(fit4, fit_int)                        # comparable to all two-way interaction model, spares 48 coefs

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ (ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base)^2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    336 129.25                           
## 2    288 113.02 48    16.238 0.8621 0.7282

anova(fit_add, fit4)                        # significantly improves additive model

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    369 189.17                                  
## 2    336 129.25 33    59.915 4.7198 4.612e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit4, fit3)                           # comparable to fit3, spares 9 coefs

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:fcyl + 
##     fdrive:length + fdrive:width + fdrive:wheel.base + fcyl:engine.size + 
##     fcyl:horsepower + fcyl:weight + fcyl:length + fcyl:width + 
##     engine.size:horsepower + engine.size:weight + horsepower:width + 
##     horsepower:wheel.base + weight:length + width:wheel.base
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    336 129.25                           
## 2    327 124.69  9     4.564 1.3299 0.2201

The final model seems to be fit4 where every variable appears in some interaction, even width and fdrive which were excluded in model fit2 when no interactions were considered:

anova(fit2, fit4)

## Analysis of Variance Table
## 
## Model 1: consumption ~ ftype + fcyl + engine.size + horsepower + weight + 
##     wheel.base + length
## Model 2: consumption ~ ftype + fdrive + fcyl + engine.size + horsepower + 
##     weight + length + width + wheel.base + ftype:engine.size + 
##     ftype:weight + ftype:width + ftype:wheel.base + fdrive:length + 
##     fdrive:wheel.base + fcyl:engine.size + fcyl:weight + fcyl:length + 
##     fcyl:width + engine.size:horsepower + engine.size:weight + 
##     horsepower:width + weight:length + width:wheel.base
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    372 192.68                                  
## 2    336 129.25 36     63.43 4.5802 2.658e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

General exceptions from eliminating variables:

compulsory variables we are obliged to adjust the effects for (age, sex in medical studies),
variables crucial for answering the research question (treatment group in medical studies).

Elimination despite significance:

adjusting for multiple testing with lower significance threshold to keep only the strong effects to keep model parsimonious,
to avoid multicollinearity: highly correlated explanatory variables. It does not hurt estimation of \(\mathbb{X} \boldsymbol{\beta}\), but the precision of the estimates of regression coefficients \(\boldsymbol{\beta}\))

Our model was fitted on many highly correlated variables, see the exploration above. We compute the so called variance inflation factor (VIF) that reveals which variables may cause inflation in the estimation of variance of coefficients. It compares the current estimates of variance with the variability of estimates under orthogonal (uncorrelated) regressors.

car::vif(fit4, type = "predictor")

##                     GVIF Df GVIF^(1/(2*Df))                         Interacts With
## ftype       1.232958e+11 26        1.634133 engine.size, weight, width, wheel.base
## fdrive      1.058330e+07  8        2.748140                     length, wheel.base
## fcyl        1.808539e+09 16        1.946666     engine.size, weight, length, width
## engine.size 6.213474e+27 23        4.019786        ftype, fcyl, horsepower, weight
## horsepower  1.679671e+07  5        5.278645                     engine.size, width
## weight      6.364075e+24 25        3.133825       ftype, fcyl, engine.size, length
## length      1.789088e+18 13        5.035288                   fdrive, fcyl, weight
## width       1.708979e+20 21        3.032018    ftype, fcyl, horsepower, wheel.base
## wheel.base  4.004078e+19 19        3.279860                   ftype, fdrive, width
##                                                Other Predictors
## ftype                          fdrive, fcyl, horsepower, length
## fdrive      ftype, fcyl, engine.size, horsepower, weight, width
## fcyl                      ftype, fdrive, horsepower, wheel.base
## engine.size                   fdrive, length, width, wheel.base
## horsepower      ftype, fdrive, fcyl, weight, length, wheel.base
## weight                    fdrive, horsepower, width, wheel.base
## length        ftype, engine.size, horsepower, width, wheel.base
## width                       fdrive, engine.size, weight, length
## wheel.base        fcyl, engine.size, horsepower, weight, length

car::vif(fit_add)

##                  GVIF Df GVIF^(1/(2*Df))
## ftype        8.283051  4        1.302488
## fdrive       2.737493  2        1.286288
## fcyl         7.636180  2        1.662337
## engine.size 10.586511  1        3.253692
## horsepower   3.905445  1        1.976220
## weight       9.183397  1        3.030412
## length       6.447956  1        2.539283
## width        4.879748  1        2.209015
## wheel.base   7.199453  1        2.683180

Individual work

Try to find a reasonable model for the data set you were assigned to analyse.
Make sure you can use the final model to answer your research problems.