Základy regrese | NMFM 334

Letný semester 2024 | Cvičenie 8 | 16.04.2024
HTML Markdown – Rmd source file (UTF encoding)

Brief outline of the work for the lab session no.8:

Homoscedastic vs. heteroscedastic data/model
Some tools to deal with non-normally distributed data or heteroscedastic data
Statistical inference in a model without normality (mean vs. variance structure)

We will use a small dataset (47 observations and 8 different covariates) containing some pices of information about fires that occurred in Chicago in 1970 while distinguishing for some information provided in the data—particularly, we will focus on the locality in Chicago (north vs. south) and the proportion of minority citizens. The dataset is loaded from the website and some brief insight is provided below.

 library("mffSM")
 
 chicago <- read.csv("http://www.karlin.mff.cuni.cz/~maciak/NMFM334/chicago.csv", 
                    header=T, stringsAsFactors = TRUE)
        
 chicago <- transform(chicago, fside = factor(side, levels = 0:1, 
                                      labels=c("North", "South")))          
                    
 head(chicago)

 ##   minor fire theft  old insur income side fside
 ## 1  10.0  6.2    29 60.4   0.0 11.744    0 North
 ## 2  22.2  9.5    44 76.5   0.1  9.323    0 North
 ## 3  19.6 10.5    36 73.5   1.2  9.948    0 North
 ## 4  17.3  7.7    37 66.9   0.5 10.656    0 North
 ## 5  24.5  8.6    53 81.4   0.7  9.730    0 North
 ## 6  54.0 34.1    68 52.6   0.3  8.231    0 North

 summary(chicago)

 ##      minor            fire           theft             old       
 ##  Min.   : 1.00   Min.   : 2.00   Min.   :  3.00   Min.   : 2.00  
 ##  1st Qu.: 3.75   1st Qu.: 5.65   1st Qu.: 22.00   1st Qu.:48.60  
 ##  Median :24.50   Median :10.40   Median : 29.00   Median :65.00  
 ##  Mean   :34.99   Mean   :12.28   Mean   : 32.36   Mean   :60.33  
 ##  3rd Qu.:57.65   3rd Qu.:16.05   3rd Qu.: 38.00   3rd Qu.:77.30  
 ##  Max.   :99.70   Max.   :39.70   Max.   :147.00   Max.   :90.10  
 ##      insur            income            side          fside   
 ##  Min.   :0.0000   Min.   : 5.583   Min.   :0.0000   North:25  
 ##  1st Qu.:0.0000   1st Qu.: 8.447   1st Qu.:0.0000   South:22  
 ##  Median :0.4000   Median :10.694   Median :0.0000             
 ##  Mean   :0.6149   Mean   :10.696   Mean   :0.4681             
 ##  3rd Qu.:0.9000   3rd Qu.:11.989   3rd Qu.:1.0000             
 ##  Max.   :2.2000   Max.   :21.480   Max.   :1.0000

The data contain information from Chicago insurance redlining in 47 districts in 1970 where minor stands for the percentage of minority (0 - 100) in a given district, fire stands for the number of fires per 100 households during the reference period, theft denotes the number of reported thefts per 1000 inhabitants, old gives percentage of residential units built before 1939, insur states the number of policies per 100 household, income is the median income (USD 1000) per household, and, finally, side gives the location of the district within Chicago (0 = North, 1 = South)

Indidividual work

How can we manually reconstruct the plots above?

Use the data on fires in Chicago (the dataset loaded above) and perform some basic explanatory analysis while focusing on the number of fires (the dependent variable fire and two explanatory variables—the proportion of the minority (minor) and the location in Chicago—side).
What are you expectations about the form of the regression function when being interested in modeling of the conditional number of fires \(E[Y | minor, side]\), while conditioning on the minority proportion and the locality information?
Are there some issues that you should be concern with? What about the asssumptions on normal regression model or an ordinary regression model (without the assumption of normality)?
What particular graphical tools can be used to visualy verify the assumptions imposed on the model?

1.Expected number of fires given minority and locality

We start by looking for a model of dependence of the number of fires (dependent variable fire) on the percentage of minority (minor). We are also aware of the fact that the form of this dependence may differ in north and south parts of Chicago.

XLAB <- "Minority (%)"
YLAB <- "Fires (#/100 households)"
COLS <- c(North = "darkgreen", South = "red")
BGS <- c(North = "aquamarine", South = "orange")
PCHS <- c(North = 21, South = 23)
plot(fire ~ minor, col = COLS[fside], bg = BGS[fside], pch = PCHS[fside], 
                  xlab = XLAB, ylab = YLAB, data = chicago)
legend(5, 40, legend = c("North", "South"), pch = PCHS, 
             col = COLS, pt.bg = BGS, bty = "n")
lines(with(subset(chicago, fside == "North"), lowess(minor, fire)), 
                                             col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, fire)), 
                                             col = COLS["South"], lwd = 2)

How would you interpret the plot? Do we need interactions in the model? Does the side of Chicago modify the relationship between the number of fires and the percentage of minority?

Lets compare the following three models. Based on the output describe the effect of the percentage of minority on the number of fires. Describe also the effect of the side.

m <- list()
m[["Line"]]  <- lm(fire ~ minor * fside, data = chicago)
m[["Log2"]]   <- lm(fire ~ log2(minor) * fside, data = chicago)
m[["Parab"]] <- lm(fire ~ (minor + I(minor^2)) * fside, data = chicago)

summary(m[["Line"]])

## 
## Call:
## lm(formula = fire ~ minor * fside, data = chicago)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.262  -3.340  -1.474   3.075  18.303 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.19515    1.69212   3.070   0.0037 ** 
## minor             0.32274    0.04896   6.592 5.03e-08 ***
## fsideSouth        1.37454    3.10252   0.443   0.6600    
## minor:fsideSouth -0.20812    0.06589  -3.159   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.535 on 43 degrees of freedom
## Multiple R-squared:  0.5387,    Adjusted R-squared:  0.5065 
## F-statistic: 16.74 on 3 and 43 DF,  p-value: 2.376e-07

summary(m[["Log2"]])

## 
## Call:
## lm(formula = fire ~ log2(minor) * fside, data = chicago)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5972 -4.6595 -0.8746  2.2849 17.3832 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.1727     2.2718  -0.076   0.9397    
## log2(minor)              3.9807     0.5982   6.655 4.08e-08 ***
## fsideSouth               2.9180     4.1885   0.697   0.4898    
## log2(minor):fsideSouth  -2.0180     0.8960  -2.252   0.0295 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 43 degrees of freedom
## Multiple R-squared:  0.5518,    Adjusted R-squared:  0.5206 
## F-statistic: 17.65 on 3 and 43 DF,  p-value: 1.291e-07

summary(m[["Parab"]])

## 
## Call:
## lm(formula = fire ~ (minor + I(minor^2)) * fside, data = chicago)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7633  -3.9179  -0.5716   3.0956  14.3219 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.021900   1.822544   1.109  0.27373    
## minor                  0.754990   0.144585   5.222 5.48e-06 ***
## I(minor^2)            -0.005287   0.001685  -3.137  0.00315 ** 
## fsideSouth             1.721106   3.419367   0.503  0.61742    
## minor:fsideSouth      -0.435921   0.194558  -2.241  0.03053 *  
## I(minor^2):fsideSouth  0.003173   0.002118   1.498  0.14180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.855 on 41 degrees of freedom
## Multiple R-squared:  0.6469,    Adjusted R-squared:  0.6038 
## F-statistic: 15.02 on 5 and 41 DF,  p-value: 2.228e-08

And the corresponding diagnostics:

plotLM(m[["Line"]])

plotLM(m[["Log2"]])

plotLM(m[["Parab"]])

For all three models there can be some issues spotted in the diagnostic plots above (heteroscedasticity, conditional mean specification, maybe normality issues). In addition, there is also a bunch of points scattered around the origin and relatively only a few observations are given for larger proportion of the minority or the larger amount of fires. This suggets that log transformation (of the response and the predictor as well) could help.

2. Response (logarithmic) transformation

Typically, when the residual variance increases with the response expectation, log-transformation of the response often ensures a homoscedastic model (Assumption A3a from the lecture).

Firstly, compare the plot below (on the log scale on \(y\) axes) with the previous plot on the original scale on the \(y\) axes:

LYLAB <- "Log(Fires)"
par(mfrow = c(1, 1))
plot(log(fire) ~ minor, col = COLS[fside], bg = BGS[fside], pch = PCHS[fside], 
                       xlab = XLAB, ylab = LYLAB, data = chicago)
legend(5, 3.5, legend = c("North", "South"), pch = PCHS, 
              col = COLS, pt.bg = BGS, bty = "n")
lines(with(subset(chicago, fside == "North"), lowess(minor, log(fire))), 
                                             col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, log(fire))), 
                                             col = COLS["South"], lwd = 2)

Using the proposed transformation of the response we also change the underlying data. This means, that instead of the model \[ E[Y_i | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \boldsymbol{\beta} \] fitted based on the random sample \(\{(Y_i, \boldsymbol{X}_i);±i = 1, \dots, n\}\) we fit another modelof the form \[ E[\log(Y_i) | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \boldsymbol{\beta} \] which is, however, based on the different random sample \(\{(\log(Y_i), \boldsymbol{X}_i);±i = 1, \dots, n\}\). This also has some crucial consequences. The logarithmic transformation will help to deal with some issues regarding heteroscedastic data but, on the other hand, it will introduce new problemes regarding the model interpretability.

The idea is that the final model should be (always) interpreted with respect to the original data—the information about the number of fires (fire)—not the logarithm of the number of fires (log(fire)).

For some better interpretation we will also use the log transformation of the independent variable – the information about the proportion of minority – minor. What is the interpretation of the following model:

summary(m <- lm(log(fire) ~ log2(minor) * fside, data = chicago))

## 
## Call:
## lm(formula = log(fire) ~ log2(minor) * fside, data = chicago)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75178 -0.33949 -0.07901  0.34564  0.84890 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.01616    0.14635   6.943 1.55e-08 ***
## log2(minor)             0.35961    0.03853   9.332 6.74e-12 ***
## fsideSouth              0.27963    0.26983   1.036   0.3058    
## log2(minor):fsideSouth -0.14411    0.05772  -2.497   0.0164 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4149 on 43 degrees of freedom
## Multiple R-squared:  0.7278,    Adjusted R-squared:  0.7088 
## F-statistic: 38.33 on 3 and 43 DF,  p-value: 3.234e-12

Recall the Jensen inequality and the fact that \[ E[\log(Y_i) | \boldsymbol{X}_i] \neq \log(E[Y_i | \boldsymbol{X}_i]). \] What are the consequences with respect to the interpretation of the parameter estimates provided in the summary output above? What would be a suitable interpretaion that we can use in this situation?

It is easy to visualize the fitted regression function. In addition, we can also add the confidence bands AROUND the regression function and the prediction bands. However, given the previous problems related to the Jensen inequality, only the prediction bands make sense for this model and the confidence (AROUND) bands does not have any good interpretation here.

lpred <- 500
pminor <- seq(1, 100, length = lpred)
pdata <- data.frame(minor = rep(pminor, 2), 
                   fside = factor(rep(c("North", "South"), 
                   each = lpred)))
                   
cifit <-predict(m, newdata = pdata, interval = "confidence")
pfit <-predict(m, newdata = pdata, interval = "prediction")

YLIM <- range(log(chicago[, "fire"]), pfit)
XLIM <- range(chicago[, "minor"])

par(mfrow = c(1, 2))
for (ss in levels(chicago[, "fside"])){
 plot(log(fire) ~ minor, col = COLS[fside], bg = BGS[fside], pch = PCHS[fside],
      xlab = XLAB, ylab = LYLAB, xlim = XLIM, ylim = YLIM,
      data = subset(chicago, fside == ss), main = ss)
 ind <- if (ss == "North") 1:lpred else (lpred + 1):(2 * lpred)
 lines(pminor, cifit[ind, "fit"], col = COLS[ss], lwd = 3)
 lines(pminor, cifit[ind, "lwr"], col = COLS[ss], lwd = 2, lty = 2)
 lines(pminor, cifit[ind, "upr"], col = COLS[ss], lwd = 2, lty = 2)
 lines(pminor, pfit[ind, "lwr"], col = COLS[ss], lwd = 2, lty = 3)
 lines(pminor, pfit[ind, "upr"], col = COLS[ss], lwd = 2, lty = 3)
   legend(x=25, y=1, lwd=c(3,2,2), lty=c(1,2,3), col=COLS[ss], 
                legend=c("estim. E[log(Y)|X=x]",
                         "conf. band around E[log(Y)|X=x]", 
                         "prediction band for log(Y)"))
}

par(mfrow = c(1, 1))

What is the interpretation of the estimated parameters and the specific numbers providsed in the summary output of the model with log-transformed dependent variable?

3. Model improvement (mean vs. variance structure)

Now, we will try to improve the model by adding the information about insurance policies (the independent variable insur). We will fit the following model:

m2 <- lm(log(fire) ~ (log(minor) + insur) * fside, data = chicago)

Note, that the fitted regression planes describing the dependence of log(fire) on log(minor) and insur based on the m2 model are the same as if the model is fitted separately in the North and South side of Chicago.

Hence if we are interested in just a description of those planes—the conditional mean structure (and are lazy to extract the appropriate intercepts and slopes from the m2 model), we can fit separate North/South models and take the coefficients from there.

Compare the output above with the following two models:

m2North <- lm(log(fire) ~ log(minor) + insur, data = chicago, 
                                            subset = (fside == "North"))
m2South <- lm(log(fire) ~ log(minor) + insur, data = chicago, 
                                            subset = (fside == "South"))

Indidividual work

Consider the model above and try to answer the following questions:

Can you reconstruct the (mean) parameter estimates from the m2 model obove using the two separate models for the north and south sides of Chicago? Can you do it vice versa?

Try some different parametrization of the locality. For instance

summary(lm(log(fire) ~ log(minor) * fside, data = chicago, 
                         contrasts = list(fside = contr.sum)))

## 
## Call:
## lm(formula = log(fire) ~ log(minor) * fside, data = chicago, 
##     contrasts = list(fside = contr.sum))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75178 -0.33949 -0.07901  0.34564  0.84890 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.15598    0.13491   8.568 7.54e-11 ***
## log(minor)         0.41486    0.04164   9.963 9.65e-13 ***
## fside1            -0.13982    0.13491  -1.036   0.3058    
## log(minor):fside1  0.10395    0.04164   2.497   0.0164 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4149 on 43 degrees of freedom
## Multiple R-squared:  0.7278,    Adjusted R-squared:  0.7088 
## F-statistic: 38.33 on 3 and 43 DF,  p-value: 3.234e-12

contr.sum(2)

##   [,1]
## 1    1
## 2   -1

What is the interpretation of the estimated parameters in the summary output above?