Fundamentals of Regression | NMFM 334

Summer semester 2024 | Lab session 5 | 26.03.2024



Outline of the lab session no. 5:
  • Simple and multiple linear regression models
  • Regression models with simple interactions
  • Interactions with categorical covariates
  • Hierarchical structure of the model



For the purpose of this lab session we will consider the dataset Dris from the R library mffSM. We will be interested in the dependence of the yield on the magnesium concentration (yield ~ Mg) and on the calcium concentration (yield ~ Ca).

Altogether, there are 368 observations available in the dataset and 6 different covariates (although we will not use all of them). A brief overview of the data can be obtained from the following code:

library("mffSM")

data(Dris, package = "mffSM")

dim(Dris)
## [1] 368   6
head(Dris)
##   yield   N  P   K Ca Mg
## 1  5.47 470 47 320 47 11
## 2  5.63 530 48 357 60 16
## 3  5.63 530 48 310 63 16
## 4  4.84 482 47 357 47 13
## 5  4.84 506 48 294 52 15
## 6  4.21 500 45 283 61 14
summary(Dris)
##      yield             N               P               K        
##  Min.   :1.920   Min.   :268.0   Min.   :32.00   Min.   :200.0  
##  1st Qu.:4.040   1st Qu.:427.0   1st Qu.:43.00   1st Qu.:338.0  
##  Median :4.840   Median :473.0   Median :49.00   Median :377.0  
##  Mean   :4.864   Mean   :469.9   Mean   :48.65   Mean   :375.4  
##  3rd Qu.:5.558   3rd Qu.:518.0   3rd Qu.:54.00   3rd Qu.:407.0  
##  Max.   :8.792   Max.   :657.0   Max.   :72.00   Max.   :580.0  
##        Ca              Mg       
##  Min.   :29.00   Min.   : 8.00  
##  1st Qu.:41.75   1st Qu.:10.00  
##  Median :49.00   Median :11.00  
##  Mean   :51.45   Mean   :11.64  
##  3rd Qu.:59.00   3rd Qu.:13.00  
##  Max.   :95.00   Max.   :22.00

Firstly, we visualize the marginal dependence of the yield on the magnesium concentration and on the calcium concentration.

par(mfrow = c(1,2))

plot(yield ~ Mg, data = Dris, pch = 22, bg = "gray", xlab = "Magnesium concentration", ylab = "Yield")
plot(yield ~ Ca, data = Dris, pch = 22, bg = "gray", xlab = "Calcium concentration", ylab = "Yield")

Note that the points are not equally distributed along the \(x\) axis: there are many more points in the left parts of the scatterplots than in the right parts. This suggests that a logarithmic transformation (of the magnesium and calcium concentrations) could help to spread the data more uniformly along the \(x\) axis.

For a more intuitive interpretation we use the logarithm with base two: a unit increase of the logarithmically transformed covariate corresponds to doubling the original covariate, since \(\log_2(x) + 1 = \log_2(x) + \log_2 2 = \log_2(2x)\).
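
A quick numerical illustration of this interpretation (not part of the original analysis): doubling a concentration, e.g. from 15 to 30, adds exactly one unit on the log2 scale.

log2(2 * 15) - log2(15)
## [1] 1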

### Add transformations of Mg, Ca, N to the dataset 
Dris <- transform(Dris, lMg = log2(Mg), lMg2 = (log2(Mg))^2, lCa = log2(Ca), 
                  lCa2 = (log2(Ca))^2, lN = log2(N), lN2=(log2(N))^2)

par(mfrow = c(1,2))
plot(yield ~ lMg, data = Dris, pch = 22, bg = "gray", xlab = "Magnesium concentration [log scale]", ylab = "Yield")
plot(yield ~ lCa, data = Dris, pch = 22, bg = "gray", xlab = "Calcium concentration [log scale]", ylab = "Yield")



For the following, we will use the transformed data.

1. Simple and multiple linear regression models

Let us start with two simple regression models (one explanatory variable each) and compare them with a multiple regression model (with both explanatory variables used at the same time).

summary(m10 <- lm(yield ~ lMg, data = Dris))
## 
## Call:
## lm(formula = yield ~ lMg, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1194 -0.7412 -0.0741  0.7451  3.9841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4851     0.7790   1.907   0.0574 .  
## lMg           0.9605     0.2208   4.349 1.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared:  0.04915,    Adjusted R-squared:  0.04655 
## F-statistic: 18.92 on 1 and 366 DF,  p-value: 1.772e-05
summary(m01 <- lm(yield ~ lCa, data = Dris))
## 
## Call:
## lm(formula = yield ~ lCa, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9021 -0.8094 -0.0388  0.7211  3.9279 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4561     0.9119   5.983 5.22e-09 ***
## lCa          -0.1049     0.1614  -0.650    0.516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.096 on 366 degrees of freedom
## Multiple R-squared:  0.001153,   Adjusted R-squared:  -0.001576 
## F-statistic: 0.4227 on 1 and 366 DF,  p-value: 0.516

Both marginal models can now be compared with a larger model containing both predictor variables:

### Additive model
summary(m11 <- lm(yield ~ lMg + lCa, data = Dris))
## 
## Call:
## lm(formula = yield ~ lMg + lCa, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9514 -0.7210 -0.0838  0.7700  4.0120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.4026     0.9525   3.572 0.000402 ***
## lMg           1.3998     0.2531   5.530 6.11e-08 ***
## lCa          -0.6140     0.1805  -3.402 0.000742 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 365 degrees of freedom
## Multiple R-squared:  0.07838,    Adjusted R-squared:  0.07333 
## F-statistic: 15.52 on 2 and 365 DF,  p-value: 3.396e-07

Note that the estimated parameters quantifying the effect of magnesium and the effect of calcium differ quite substantially across the models. Which model should be taken as the reference one?



Individual work

Interpret the models fitted above – try to explain the meaning of the estimated parameters. In particular, focus on the following:
  • Interpret the estimated regression coefficients in the models.
  • What is being tested on the rows starting with ‘lMg’ and ‘lCa’?
  • When comparing all three models, which one is, in your opinion, the most suitable one? (One possible formal comparison is sketched below.)
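
A possible formal comparison of the three models (only a sketch; the interpretation and the final choice are left for the discussion):

anova(m10, m11)    # is lCa needed once lMg is in the model?
anova(m01, m11)    # is lMg needed once lCa is in the model?
AIC(m10, m01, m11) # a comparison of all three models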



In the following, we will try slightly more complex (multiple) regression models:

summary(m20 <- lm(yield ~ lMg + lMg2, data = Dris))
## 
## Call:
## lm(formula = yield ~ lMg + lMg2, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1942 -0.7636 -0.0413  0.8089  3.8957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -18.7523     7.6276  -2.458  0.01442 * 
## lMg          12.3814     4.2881   2.887  0.00412 **
## lMg2         -1.6030     0.6011  -2.667  0.00800 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.061 on 365 degrees of freedom
## Multiple R-squared:  0.06732,    Adjusted R-squared:  0.06221 
## F-statistic: 13.17 on 2 and 365 DF,  p-value: 2.994e-06
summary(m02 <- lm(yield ~ lCa + lCa2, data = Dris))
## 
## Call:
## lm(formula = yield ~ lCa + lCa2, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9082 -0.8314 -0.0488  0.6949  3.8889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -4.6108    12.7956  -0.360    0.719
## lCa           3.4343     4.4900   0.765    0.445
## lCa2         -0.3098     0.3928  -0.789    0.431
## 
## Residual standard error: 1.097 on 365 degrees of freedom
## Multiple R-squared:  0.002853,   Adjusted R-squared:  -0.002611 
## F-statistic: 0.5222 on 2 and 365 DF,  p-value: 0.5937

Is there any statistical improvement in the two models above?
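
One way to answer this formally is to test the quadratic terms with F-tests for nested models (a sketch; with a single added term these tests are equivalent to the t-tests of lMg2 and lCa2 in the summaries above):

anova(m10, m20)   # does lMg2 improve the fit beyond lMg?
anova(m01, m02)   # does lCa2 improve the fit beyond lCa?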

And what about the following models?

summary(m21 <- lm(yield ~ lMg + lMg2 + lCa, data = Dris)) 
## 
## Call:
## lm(formula = yield ~ lMg + lMg2 + lCa, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0260 -0.7554 -0.1068  0.7474  3.9233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.9375     7.5351  -2.248 0.025186 *  
## lMg          12.8838     4.2282   3.047 0.002479 ** 
## lMg2         -1.6116     0.5923  -2.721 0.006824 ** 
## lCa          -0.6160     0.1789  -3.444 0.000641 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.045 on 364 degrees of freedom
## Multiple R-squared:  0.09675,    Adjusted R-squared:  0.0893 
## F-statistic:    13 on 3 and 364 DF,  p-value: 4.408e-08
summary(m22 <- lm(yield ~ lMg + lMg2 + lCa + lCa2, data = Dris)) 
## 
## Call:
## lm(formula = yield ~ lMg + lMg2 + lCa + lCa2, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0262 -0.7583 -0.1083  0.7455  3.9182 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -18.26667   13.29798  -1.374  0.17040   
## lMg          12.78321    4.31420   2.963  0.00325 **
## lMg2         -1.59767    0.60418  -2.644  0.00854 **
## lCa          -0.08582    4.37164  -0.020  0.98435   
## lCa2         -0.04638    0.38209  -0.121  0.90345   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.047 on 363 degrees of freedom
## Multiple R-squared:  0.09678,    Adjusted R-squared:  0.08683 
## F-statistic: 9.724 on 4 and 363 DF,  p-value: 1.758e-07

What is now the effect of the magnesium concentration on the expected (estimated) amount of yield? How can this effect be quantified using the different models?

We can actually visualize the effect of the magnesium concentration on the yield (using the different models fitted above). Specifically, we are interested in the effect of increasing lMg by log2(1.1), which corresponds to a 10 % increase of Mg. What will be the corresponding change of the expected yield, i.e., of \(E[yield | lMg, lCa]\)?
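
In a model with both a linear and a quadratic term in \(lMg\), the change of the expected yield when \(lMg\) increases by \(\delta = \log_2(1.1)\) is
\[
\beta_{lMg}\,\delta + \beta_{lMg2}\big[(lMg + \delta)^2 - lMg^2\big]
  = \beta_{lMg}\,\delta + \beta_{lMg2}\,(2\,\delta\, lMg + \delta^2),
\]
which is what the function effect.lMg below computes; any terms in \(lCa\) cancel out because the calcium concentration is kept fixed.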

eps <- 0.1;
delta <- log2(1+eps);
lMg.grid <- seq(min(Dris$lMg), max(Dris$lMg)-delta, length = 1000); 
effect.lMg <- function(b1,b2) b1*delta + b2*(2*delta*lMg.grid + delta^2); 

par(mfrow = c(1,2))

plot(lMg.grid, effect.lMg(coef(m20)['lMg'], coef(m20)['lMg2']), col = "blue", 
               xlab = "log2(Mg)", ylab = "Change of expected yield", 
               main = "The effect of 10%-increase of Mg on the yield", 
               type="l", lwd=2, ylim=range(effect.lMg(coef(m20)['lMg'], 
               coef(m20)['lMg2']), effect.lMg(coef(m21)['lMg'], 
               coef(m21)['lMg2']), effect.lMg(coef(m22)['lMg'], 
               coef(m22)['lMg2'])));
lines(lMg.grid, effect.lMg(coef(m21)['lMg'], coef(m21)['lMg2']), 
                lwd=2, col="red")
lines(lMg.grid, effect.lMg(coef(m22)['lMg'], coef(m22)['lMg2']), 
                lwd=2, col="yellow", lty="dotted") 
abline(h=0, lty="dotted")
legend(3.25, -0.1, col=c("blue", "red", "yellow"), lwd=c(2,2,2), 
                lty=c("solid", "solid", "dotted"), 
                legend=c("m20", "m21", "m22"), cex=1.3)


Mg.grid <- seq(min(Dris$Mg), max(Dris$Mg)-1, length = 1000); 
effect.Mg <- function(b1,b2){
  return(b1*(log2(Mg.grid+1) - log2(Mg.grid)) + 
           b2*((log2(1+Mg.grid))^2 - (log2(Mg.grid))^2))
}

plot(Mg.grid, effect.Mg(coef(m20)['lMg'], coef(m20)['lMg2']), col = "blue",
              xlab = "Mg", ylab = "Change of expected yield",
              main = "The effect of 10%-increase of Mg on the yield",
              type="l", lwd=2, ylim=range(effect.Mg(coef(m20)['lMg'], 
              coef(m20)['lMg2']), effect.Mg(coef(m21)['lMg'], 
              coef(m21)['lMg2']), effect.Mg(coef(m22)['lMg'], 
              coef(m22)['lMg2'])));
lines(Mg.grid, effect.Mg(coef(m21)['lMg'], coef(m21)['lMg2']), 
              lwd=2, col="red")
lines(Mg.grid, effect.Mg(coef(m22)['lMg'], coef(m22)['lMg2']), 
              lwd=2, col="yellow", lty="dotted")
abline(h=0, lty="dotted")
legend(15, 0.3, col=c("blue", "red", "yellow"), lwd=c(2,2,2), 
            lty=c("solid", "solid", "dotted"), legend=c("m20", "m21", "m22"), 
            cex=1.3)
rug(Dris$Mg+runif(nrow(Dris))-0.5);

Considering the models above, how would you (statistically) answer the question whether, given the information about the magnesium concentration (‘Mg’), the amount of yield is independent of the calcium concentration?

How would you test that given Mg (with the quadratic dependence) the amount of ‘yield’ is independent of Ca?

Can we say that the amount of yield is independent of Ca? Can we say that given Mg and Ca the amount of yield is independent of N?
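
These questions can be formalized as F-tests of nested models, for instance as sketched below (the model m22N is introduced here only for illustration):

anova(m10, m11)   # is yield independent of Ca given lMg?
anova(m20, m22)   # is yield independent of Ca given lMg and lMg2?
m22N <- lm(yield ~ lMg + lMg2 + lCa + lCa2 + lN + lN2, data = Dris)
anova(m22, m22N)  # is yield independent of N given the Mg and Ca terms?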



2. Regression models with simple interactions

Interactions in the model allow the effect of one covariate to depend on the value of another covariate (or of multiple covariates). We consider a simple model with one interaction term (denoted as ‘lMg:lCa’):

summary(m1_inter <- lm(yield ~ lMg + lCa + lMg:lCa, data = Dris))
## 
## Call:
## lm(formula = yield ~ lMg + lCa + lMg:lCa, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9833 -0.7326 -0.1275  0.7797  3.9535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -23.653     11.878  -1.991  0.04719 * 
## lMg            9.046      3.356   2.696  0.00735 **
## lCa            4.159      2.096   1.984  0.04802 * 
## lMg:lCa       -1.346      0.589  -2.285  0.02288 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.049 on 364 degrees of freedom
## Multiple R-squared:  0.09141,    Adjusted R-squared:  0.08392 
## F-statistic: 12.21 on 3 and 364 DF,  p-value: 1.254e-07

What is the interpretation of the estimated parameters now? What is the effect of the magnesium concentration and the calcium concentration on the amount of yield? Compare the interaction model above with the model below (a multiple regression model with three explanatory variables, where the last variable \(Z\) is created artificially from the magnesium concentration and the calcium concentration).

Dris$Z <- Dris$lMg * Dris$lCa
summary(m1_Z <- lm(yield ~ lMg + lCa + Z, data = Dris))
## 
## Call:
## lm(formula = yield ~ lMg + lCa + Z, data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9833 -0.7326 -0.1275  0.7797  3.9535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -23.653     11.878  -1.991  0.04719 * 
## lMg            9.046      3.356   2.696  0.00735 **
## lCa            4.159      2.096   1.984  0.04802 * 
## Z             -1.346      0.589  -2.285  0.02288 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.049 on 364 degrees of freedom
## Multiple R-squared:  0.09141,    Adjusted R-squared:  0.08392 
## F-statistic: 12.21 on 3 and 364 DF,  p-value: 1.254e-07

Again, we can quantify the effect of the magnesium concentration on the yield; however, this time we have to properly take into account the underlying calcium concentration. How can this be done?
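
In the interaction model the expected yield is \(\beta_0 + \beta_{lMg}\, lMg + \beta_{lCa}\, lCa + \beta_{lMg:lCa}\, lMg \cdot lCa\), so increasing \(lMg\) by \(\delta = \log_2(1.1)\) (a 10 % increase of Mg) while keeping \(lCa\) fixed changes the expected yield by
\[
\delta\,\big(\beta_{lMg} + \beta_{lMg:lCa}\, lCa\big),
\]
i.e., the effect of the magnesium concentration now depends on the calcium concentration. The first plot below shows this effect as a function of \(\log_2(Ca)\); the second one shows the effect of increasing Mg by one unit, which in addition depends on the value of Mg itself.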

Mg.grid <- seq(min(Dris$Mg), max(Dris$Mg), length = 1000)
Ca.grid <- seq(min(Dris$Ca), max(Dris$Ca), length = 1000)
lCa.grid <- seq(min(Dris$lCa), max(Dris$lCa), length = 1000)
eps <- 0.1;
delta <- log2(1+eps)


par(mfrow = c(1, 2))
plot(lCa.grid, delta*coef(m1_inter)['lMg'] + 
               delta*coef(m1_inter)['lMg:lCa'] * lCa.grid, col = "blue", 
         xlab = "log2(Ca)", ylab = "Change of expected yield", 
         main = "The effect of 10%-increase of Mg on the yield", type="l", lwd=2)

plot(Ca.grid, coef(m1_inter)['lMg'] * (log2(Mg.grid+1)-log2(Mg.grid)) + 
              coef(m1_inter)['lMg:lCa'] * (log2(Mg.grid+1)-
              log2(Mg.grid)) * log2(Ca.grid), col = "blue", 
              xlab = "Ca", ylab = "Change of expected yield", 
              main = "The effect of increase of Mg by +1 on the yield", 
              type="l", lwd=2)



3. Interactions with categorical covariates

Now, we will consider another dataset from the mffSM package. The data come from the American Association of University Professors (AAUP) annual faculty salary survey of American colleges and universities, 1995.

data(AAUP, package = "mffSM")

dim(AAUP)
## [1] 1161   17
head(AAUP)
##    FICE                      name state type salary.prof salary.assoc
## 1  1061 Alaska Pacific University    AK  IIB         454          382
## 2  1063     Univ.Alaska-Fairbanks    AK    I         686          560
## 3  1065     Univ.Alaska-Southeast    AK  IIA         533          494
## 4 11462     Univ.Alaska-Anchorage    AK  IIA         612          507
## 5  1002 Alabama Agri.&Mech. Univ.    AL  IIA         442          369
## 6  1004  University of Montevallo    AL  IIA         441          385
##   salary.assist salary compens.prof compens.assoc compens.assist compens n.prof
## 1           362    382          567           485            471     487      6
## 2           432    508          914           753            572     677     74
## 3           329    415          716           663            442     559      9
## 4           414    498          825           681            557     670    115
## 5           310    350          530           444            376     423     59
## 6           310    388          542           473            383     477     57
##   n.assoc n.assist n.instruct n.faculty
## 1      11        9          4        32
## 2     125      118         40       404
## 3      26       20          9        70
## 4     124      101         21       392
## 5      77      102         24       262
## 6      33       35          2       127
summary(AAUP)
##       FICE           name               state      type      salary.prof    
##  Min.   : 1002   Length:1161        PA     : 85   I  :180   Min.   : 270.0  
##  1st Qu.: 1903   Class :character   NY     : 81   IIA:363   1st Qu.: 440.0  
##  Median : 2668   Mode  :character   CA     : 54   IIB:618   Median : 506.0  
##  Mean   : 3052                      TX     : 54             Mean   : 524.1  
##  3rd Qu.: 3420                      OH     : 53             3rd Qu.: 600.0  
##  Max.   :29269                      IL     : 50             Max.   :1009.0  
##                                     (Other):784             NA's   :68      
##   salary.assoc   salary.assist       salary       compens.prof   
##  Min.   :234.0   Min.   :199.0   Min.   :232.0   Min.   : 319.0  
##  1st Qu.:367.0   1st Qu.:313.0   1st Qu.:352.0   1st Qu.: 547.0  
##  Median :413.0   Median :349.0   Median :407.0   Median : 635.0  
##  Mean   :416.4   Mean   :351.9   Mean   :420.4   Mean   : 653.5  
##  3rd Qu.:461.0   3rd Qu.:388.0   3rd Qu.:475.0   3rd Qu.: 753.0  
##  Max.   :733.0   Max.   :576.0   Max.   :866.0   Max.   :1236.0  
##  NA's   :36      NA's   :24                      NA's   :68      
##  compens.assoc   compens.assist     compens           n.prof     
##  Min.   :292.0   Min.   :246.0   Min.   : 265.0   Min.   :  0.0  
##  1st Qu.:456.0   1st Qu.:389.0   1st Qu.: 436.0   1st Qu.: 18.0  
##  Median :519.0   Median :437.0   Median : 510.0   Median : 40.0  
##  Mean   :523.8   Mean   :442.1   Mean   : 526.7   Mean   : 95.1  
##  3rd Qu.:583.0   3rd Qu.:493.0   3rd Qu.: 600.0   3rd Qu.:105.0  
##  Max.   :909.0   Max.   :717.0   Max.   :1075.0   Max.   :997.0  
##  NA's   :36      NA's   :24                                      
##     n.assoc          n.assist        n.instruct       n.faculty     
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :   7.0  
##  1st Qu.: 19.00   1st Qu.: 21.00   1st Qu.:  2.00   1st Qu.:  68.0  
##  Median : 38.00   Median : 40.00   Median :  6.00   Median : 132.0  
##  Mean   : 72.39   Mean   : 68.63   Mean   : 12.74   Mean   : 257.4  
##  3rd Qu.: 89.00   3rd Qu.: 92.00   3rd Qu.: 14.00   3rd Qu.: 323.0  
##  Max.   :721.00   Max.   :510.00   Max.   :178.00   Max.   :2261.0  
## 
### A data subset containing only relevant variables
aaup <- subset(AAUP, select = c("FICE", "name", "state", "type", "n.prof",
                                "n.assoc", "n.assist", "salary.assoc"))

dim(aaup)
## [1] 1161    8
head(aaup)
##    FICE                      name state type n.prof n.assoc n.assist
## 1  1061 Alaska Pacific University    AK  IIB      6      11        9
## 2  1063     Univ.Alaska-Fairbanks    AK    I     74     125      118
## 3  1065     Univ.Alaska-Southeast    AK  IIA      9      26       20
## 4 11462     Univ.Alaska-Anchorage    AK  IIA    115     124      101
## 5  1002 Alabama Agri.&Mech. Univ.    AL  IIA     59      77      102
## 6  1004  University of Montevallo    AL  IIA     57      33       35
##   salary.assoc
## 1          382
## 2          560
## 3          494
## 4          507
## 5          369
## 6          385
summary(aaup)
##       FICE           name               state      type         n.prof     
##  Min.   : 1002   Length:1161        PA     : 85   I  :180   Min.   :  0.0  
##  1st Qu.: 1903   Class :character   NY     : 81   IIA:363   1st Qu.: 18.0  
##  Median : 2668   Mode  :character   CA     : 54   IIB:618   Median : 40.0  
##  Mean   : 3052                      TX     : 54             Mean   : 95.1  
##  3rd Qu.: 3420                      OH     : 53             3rd Qu.:105.0  
##  Max.   :29269                      IL     : 50             Max.   :997.0  
##                                     (Other):784                            
##     n.assoc          n.assist       salary.assoc  
##  Min.   :  0.00   Min.   :  0.00   Min.   :234.0  
##  1st Qu.: 19.00   1st Qu.: 21.00   1st Qu.:367.0  
##  Median : 38.00   Median : 40.00   Median :413.0  
##  Mean   : 72.39   Mean   : 68.63   Mean   :416.4  
##  3rd Qu.: 89.00   3rd Qu.: 92.00   3rd Qu.:461.0  
##  Max.   :721.00   Max.   :510.00   Max.   :733.0  
##                                    NA's   :36
### keep only the rows with complete information on the variables used below
Data <- subset(aaup, complete.cases(aaup[, c("salary.assoc", "n.prof", "n.assoc",
                                             "n.assist", "type")]))



Individual work

  • Consider the dataset on the annual income of university professors in the USA and perform a simple exploratory analysis (a possible starting point is sketched below).
  • Focus on the average yearly salaries for the different types of universities.
  • In your exploratory analysis, also try to take into account the information about the numbers of university professors (of different ranks) at the given university.
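
A possible starting point (only a sketch, to be extended in your own analysis):

with(aaup, table(type))                                     # number of institutions per type
with(aaup, tapply(salary.assoc, type, mean, na.rm = TRUE))  # average salary of associate professors by type
with(aaup, tapply(n.prof, type, summary))                   # numbers of full professors by type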



### rainbow_hcl() is provided by the colorspace package
COL3 <- rainbow_hcl(3)
plot(salary.assoc ~ type, data = aaup, col = COL3, xlab = "Type of institution", 
                          ylab = "Salary associate professor [USD 100]")

We will fit a few different models that we will compare later. Firstly, a simple additive model:

summary(m1orig <- lm(salary.assoc ~ type + n.prof + n.assoc + n.assist,  data = Data))
## 
## Call:
## lm(formula = salary.assoc ~ type + n.prof + n.assoc + n.assist, 
##     data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -144.657  -39.679   -6.887   33.560  266.641 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 448.49675    8.61850  52.039  < 2e-16 ***
## typeIIA     -19.45462    7.18979  -2.706  0.00692 ** 
## typeIIB     -70.56945    8.20735  -8.598  < 2e-16 ***
## n.prof        0.10370    0.02760   3.758  0.00018 ***
## n.assoc       0.07075    0.05883   1.203  0.22940    
## n.assist     -0.06609    0.06445  -1.025  0.30538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.21 on 1119 degrees of freedom
## Multiple R-squared:  0.3408, Adjusted R-squared:  0.3379 
## F-statistic: 115.7 on 5 and 1119 DF,  p-value: < 2.2e-16

Interpret the estimated parameters in this model and compare their interpretation with that of the estimated parameters in the following model:

Data <- transform(Data, n.prof40 = n.prof - 40,
                  n.assoc40 = n.assoc - 40,
                  n.assist40 = n.assist - 40)

### Simple additive model with transformed variables 
summary(m1 <- lm(salary.assoc ~ type + n.prof40 + n.assoc40 + n.assist40, data = Data))
## 
## Call:
## lm(formula = salary.assoc ~ type + n.prof40 + n.assoc40 + n.assist40, 
##     data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -144.657  -39.679   -6.887   33.560  266.641 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 452.83099    7.50270  60.356  < 2e-16 ***
## typeIIA     -19.45462    7.18979  -2.706  0.00692 ** 
## typeIIB     -70.56945    8.20735  -8.598  < 2e-16 ***
## n.prof40      0.10370    0.02760   3.758  0.00018 ***
## n.assoc40     0.07075    0.05883   1.203  0.22940    
## n.assist40   -0.06609    0.06445  -1.025  0.30538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.21 on 1119 degrees of freedom
## Multiple R-squared:  0.3408, Adjusted R-squared:  0.3379 
## F-statistic: 115.7 on 5 and 1119 DF,  p-value: < 2.2e-16

And, once more, the same model, this time with a different parametrization of the categorical covariate describing the university type:

###  contrast sum ("contr.sum") parametrization for "type"
m1B <- lm(salary.assoc ~ type + n.prof40 + n.assoc40 + n.assist40, data = Data, contrasts = list(type = "contr.sum"))
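
The fitted values are identical to those of m1; only the parametrization of the ‘type’ effects changes. With the sum contrasts the three type effects sum to zero, so the effect of the level omitted from the printout (IIB) can be recovered from the displayed coefficients (a sketch):

all.equal(fitted(m1), fitted(m1B))    # the same fit, a different parametrization
coef(m1B)                             # type effects are labelled type1 and type2
-sum(coef(m1B)[c("type1", "type2")])  # implied effect of type IIB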

Finally, a more complicated model with interactions (in two different parametrizations of the continuous covariates):

summary(m2orig <- lm(salary.assoc ~ type + n.prof*n.assoc + n.assist,  data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ type + n.prof * n.assoc + n.assist, 
##     data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.779  -39.956   -6.683   33.105  280.196 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.946e+02  9.949e+00  39.662  < 2e-16 ***
## typeIIA         1.697e+00  7.237e+00   0.235  0.81461    
## typeIIB        -2.828e+01  8.994e+00  -3.144  0.00171 ** 
## n.prof          3.079e-01  3.376e-02   9.119  < 2e-16 ***
## n.assoc         4.171e-01  6.671e-02   6.252 5.76e-10 ***
## n.assist       -1.339e-01  6.229e-02  -2.150  0.03175 *  
## n.prof:n.assoc -8.444e-04  8.649e-05  -9.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.91 on 1118 degrees of freedom
## Multiple R-squared:  0.3926, Adjusted R-squared:  0.3894 
## F-statistic: 120.5 on 6 and 1118 DF,  p-value: < 2.2e-16
summary(m2 <- lm(salary.assoc ~ type + n.prof40*n.assoc40 + n.assist40, data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ type + n.prof40 * n.assoc40 + n.assist40, 
##     data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.779  -39.956   -6.683   33.105  280.196 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.169e+02  8.091e+00  51.522  < 2e-16 ***
## typeIIA             1.697e+00  7.237e+00   0.235  0.81461    
## typeIIB            -2.828e+01  8.994e+00  -3.144  0.00171 ** 
## n.prof40            2.741e-01  3.173e-02   8.637  < 2e-16 ***
## n.assoc40           3.833e-01  6.494e-02   5.903 4.74e-09 ***
## n.assist40         -1.339e-01  6.229e-02  -2.150  0.03175 *  
## n.prof40:n.assoc40 -8.444e-04  8.649e-05  -9.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.91 on 1118 degrees of freedom
## Multiple R-squared:  0.3926, Adjusted R-squared:  0.3894 
## F-statistic: 120.5 on 6 and 1118 DF,  p-value: < 2.2e-16

Individual work

  • Compare both models and interpret the estimated parameters (one possible check of the correspondence between the two parametrizations is sketched below).
  • Focus on the estimated salaries for different types of university while controlling for specific numbers of the academic personnel at the given university.
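
One possible numerical check of the correspondence between the two parametrizations (a sketch): expanding (n.prof - 40) * (n.assoc - 40) shows that, for instance, the slope of n.prof on the original scale equals the slope of n.prof40 minus 40 times the interaction coefficient.

coef(m2orig)["n.prof"]
coef(m2)["n.prof40"] - 40 * coef(m2)["n.prof40:n.assoc40"]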



4. Hierarchical structure of the model

Finally, we will briefly compare a hierarchically well-formulated model with a non-hierarchical one. Above, we already considered a model for the expected annual salary of university professors in the USA in which the numbers of professors, associate professors and assistant professors were lowered by 40 (in order to ensure a better interpretation of the intercept and of the interaction terms).

First, consider a hierarchically well-formulated model, with and without the corresponding linear transformation of the numbers of professors.

summary(m3orig <- lm(salary.assoc ~ type + n.assoc * n.prof,  data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ type + n.assoc * n.prof, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.782  -40.497   -6.265   33.621  283.148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.918e+02  9.883e+00  39.651  < 2e-16 ***
## typeIIA         1.750e+00  7.248e+00   0.241  0.80924    
## typeIIB        -2.684e+01  8.983e+00  -2.988  0.00287 ** 
## n.assoc         3.340e-01  5.447e-02   6.132  1.2e-09 ***
## n.prof          2.910e-01  3.289e-02   8.848  < 2e-16 ***
## n.assoc:n.prof -8.236e-04  8.609e-05  -9.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56 on 1119 degrees of freedom
## Multiple R-squared:  0.3901, Adjusted R-squared:  0.3874 
## F-statistic: 143.2 on 5 and 1119 DF,  p-value: < 2.2e-16
summary(m3 <- lm(salary.assoc ~ type + n.assoc40 * n.prof40, data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ type + n.assoc40 * n.prof40, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.782  -40.497   -6.265   33.621  283.148 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.155e+02  8.080e+00  51.428  < 2e-16 ***
## typeIIA             1.750e+00  7.248e+00   0.241  0.80924    
## typeIIB            -2.684e+01  8.983e+00  -2.988  0.00287 ** 
## n.assoc40           3.010e-01  5.256e-02   5.728 1.31e-08 ***
## n.prof40            2.581e-01  3.090e-02   8.353  < 2e-16 ***
## n.assoc40:n.prof40 -8.236e-04  8.609e-05  -9.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56 on 1119 degrees of freedom
## Multiple R-squared:  0.3901, Adjusted R-squared:  0.3874 
## F-statistic: 143.2 on 5 and 1119 DF,  p-value: < 2.2e-16


Individual work

Both models above are hierarchically well-formulated models.
  • Compare the estimated parameters and try to reconstruct the parameter estimates of one model from the estimates of the other.
  • Compare the summary statistics of both models (i.e., degrees of freedom, residual sum of squares, \(F\)-statistic, etc.); a quick check is sketched below.
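
Both fits are reparametrizations of the same model, so their goodness-of-fit summaries coincide; a quick numerical check (a sketch):

c(df.m3orig = df.residual(m3orig), df.m3 = df.residual(m3))  # residual degrees of freedom
c(RSS.m3orig = deviance(m3orig), RSS.m3 = deviance(m3))      # residual sums of squares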



Now, we will fit two very similar models, but both will be non-hierarchical. Compare both models and try to understand the effect of the transformation used.

summary(m3orig <- lm(salary.assoc ~  - 1 + type + n.assoc * n.prof - n.prof,  data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ -1 + type + n.assoc * n.prof - n.prof, 
##     data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -141.166  -40.814   -7.942   33.808  288.497 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## typeI           4.312e+02  9.122e+00  47.272  < 2e-16 ***
## typeIIA         4.120e+02  4.869e+00  84.621  < 2e-16 ***
## typeIIB         3.709e+02  2.763e+00 134.221  < 2e-16 ***
## n.assoc         3.933e-01  5.589e-02   7.037 3.41e-12 ***
## n.assoc:n.prof -3.559e-04  7.025e-05  -5.067 4.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.9 on 1120 degrees of freedom
## Multiple R-squared:  0.9813, Adjusted R-squared:  0.9812 
## F-statistic: 1.176e+04 on 5 and 1120 DF,  p-value: < 2.2e-16
summary(m3 <- lm(salary.assoc ~  - 1 + type + n.assoc40 * n.prof40 - n.prof40, data = Data)) 
## 
## Call:
## lm(formula = salary.assoc ~ -1 + type + n.assoc40 * n.prof40 - 
##     n.prof40, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -140.522  -40.753   -7.787   34.716  289.971 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## typeI               4.430e+02  7.602e+00  58.280  < 2e-16 ***
## typeIIA             4.258e+02  3.496e+00 121.800  < 2e-16 ***
## typeIIB             3.866e+02  2.512e+00 153.892  < 2e-16 ***
## n.assoc40           4.055e-01  5.259e-02   7.710 2.77e-14 ***
## n.assoc40:n.prof40 -4.337e-04  7.452e-05  -5.821 7.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.69 on 1120 degrees of freedom
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9814 
## F-statistic: 1.184e+04 on 5 and 1120 DF,  p-value: < 2.2e-16
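
One point worth keeping in mind when reading the two non-hierarchical fits above: the Multiple R-squared of roughly 0.98 is not comparable with the values around 0.39 from the previous models, because for a fit without an intercept summary.lm() relates the residual sum of squares to the raw sum of squares of the response rather than to the sum of squares around its mean. A quick check (a sketch):

RSS <- deviance(m3)
c(R2.about.zero = 1 - RSS / sum(Data$salary.assoc^2),
  R2.about.mean = 1 - RSS / sum((Data$salary.assoc - mean(Data$salary.assoc))^2))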