### -------------------------------------------------------------------------###
### NMSA407 - Linear Regression                                              ###
### Winter Term 2021/2022 | Online                                           ###
###                                                                          ###
### EXERCISE #1  Review                                                      ###
### -------------------------------------------------------------------------###

### There will be one R script (analogous to this one) available for each week 
### of the term. Each script is split into three separate parts: 
### i) Teaching material; ii) Individual work; and iii) Additional material. 
###
### i) The first part - the teaching material - will be walked through in detail
### during the exercise classes. 
### ii) The second part - the individual work - is meant for student's specific 
### interests. Going through the R code in this part is not an obligation but it
### is highly recommended. 
### iii) Finally, the last part - the additional material - can be used to 
### enrich specific knowledge and to get some deeper insight into some specific 
### steps of the first part of the script. 


### -------------------------------------------------------------------------###
###  Block 1: TEACHING MATERIAL                                              ###
### -------------------------------------------------------------------------###
   
### If you use your personal computer an additional package mffSM is needed 
### The package is not available at CRAN but, from can be downloaded from 
### the official course website
### https://www2.karlin.mff.cuni.cz/~komarek/vyuka/2021_22/nmsa407-2021.html
### Use the following command for the instalation: 

### 1) find the working directory of R using
#    getwd()
### 2) download either the .zip or the tar.gz file of the package mffSM
###    to the working directory
### 3) run 
#    install.packages("mffSM_1.1.zip", repos = NULL)
### or
#    install.packages("mffSM_1.1.tar.gz", repos = NULL)    

library("mffSM")
help(package="mffSM")

### data
data(Cars2004nh, package = "mffSM")

dim(Cars2004nh)
str(Cars2004nh)
head(Cars2004nh)
summary(Cars2004nh)

### Excluding cars where the consumption is missing. 
Cars2004nh <- Cars2004nh[!is.na(Cars2004nh$consumption),] 

### Creating a working subsample 
set.seed(201617)
idx <- c(sample((1:nrow(Cars2004nh))[Cars2004nh$ftype=="personal"], size=100), 
         sample((1:nrow(Cars2004nh))[Cars2004nh$ftype!="personal"], size=120))
Cars2004nh <- Cars2004nh[idx,]

### Deriving some additional variables

### fdrive2:  factor which distinguishes whether the drive is 
### front or other (rear or 4x4)
Cars2004nh <- transform(Cars2004nh, fdrive2 = factor(1*(fdrive == "front") + 
                          2*(fdrive != "front"), labels = c("front", "other")))
with(Cars2004nh, table(fdrive, fdrive2))

### cons10: 0/1 variable which is 1 if consumption <= 10 l/100 km, 0 otherwise
### fcons10: factor No/Yes derived from cons10 (0 -> No, 1 -> Yes)
Cars2004nh <- transform(Cars2004nh, cons10  = 1 * (consumption <= 10),
             fcons10 = factor(1 * (consumption <= 10), labels = c("No", "Yes")))

summary(Cars2004nh)


### 1. TWO SAMPLE PROBLEM (continuous variable vs. binary variable):
### Does the consumption depend on the drive 
### (distinguishing only two categories: front and other)?

### Exloratory part: Boxplot - usually the most useful plot for this problem 
### What could be used instead?
plot(consumption ~ fdrive2, data = Cars2004nh, col = rainbow_hcl(2))

### Sample characteristics
with(Cars2004nh, tapply(consumption, fdrive2, summary))
with(Cars2004nh, tapply(consumption, fdrive2, sd, na.rm = TRUE))

X = subset(Cars2004nh, fdrive2 == "front")$consumption
Y = subset(Cars2004nh, fdrive2 == "other")$consumption
  

### Formal statistical test(s) 
### | What is the underlying model in each test? | What are the assumptions? 

### Standard two-sample t-test (assuming equal variances)
t.test(consumption ~ fdrive2, data = Cars2004nh, var.equal = TRUE)

mean(X) - mean(Y)
  
### Welch two-sample t-test (not assuming equal variances)
t.test(consumption ~ fdrive2, data = Cars2004nh)  
  
### Wilcoxon two-sample rank test
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, conf.int = TRUE)

median(X) - median(Y)
XYdiff = outer(X,Y,FUN="-")
median(XYdiff)

### Kolmogorov-Smirnov test
ks.test(subset(Cars2004nh, fdrive2 == "front")$consumption, 
        subset(Cars2004nh, fdrive2 == "other")$consumption, data = Cars2004nh)
  
plot(ecdf(X),main="ECDF")
plot(ecdf(Y),col=2,add=TRUE)

### 2. MORE SAMPLE PROBLEM 
### (continuous variable vs. categorical variable with more than two categories)
### Does the consumption depend on the drive (distinguishing front/rear/4x4)?

### Boxplots
plot(consumption ~ fdrive, data = Cars2004nh, col = rainbow_hcl(3))
with(Cars2004nh, tapply(consumption, fdrive, summary))
with(Cars2004nh, tapply(consumption, fdrive, sd, na.rm = TRUE))

### Formal statistical test(s) 
### | What is the underlying model in each test? | What are the assumptions? 

### Standard ANOVA
summary(aov(consumption ~ fdrive, data = Cars2004nh))  

### ANOVA without assuming equal variances
oneway.test(consumption ~ fdrive, data = Cars2004nh)   

### Kruskal-Wallis Rank Sum Test
kruskal.test(consumption ~ fdrive, data = Cars2004nh)  

### Alternatively: the same test statistics as in the standard ANOVA above
summary(m0<-lm(consumption~fdrive,data=Cars2004nh))
anova(m0)
oneway.test(consumption ~ fdrive, data = Cars2004nh,var.equal=TRUE)   

X = subset(Cars2004nh, fdrive == "front")$consumption
Y = subset(Cars2004nh, fdrive == "rear")$consumption
Z = subset(Cars2004nh, fdrive == "4x4")$consumption
nX = length(X)
nY = length(Y)
nZ = length(Z)
mX = mean(X)
mY = mean(Y)
mZ = mean(Z)
m = mean(c(X,Y,Z))
nX*(m - mX)^2+nY*(m-mY)^2+nZ*(m-mZ)^2
sum((X - mX)^2)+sum((Y-mY)^2)+sum((Z-mZ)^2)
sum((c(X,Y,Z)-m)^2)
nX+nY+nZ

### 3. CONDITIONAL MEAN / REGRESSION 
### (continuous variable vs. continuous variable):
### Does the consumption depend on the length?

### Exploratory part: boxplots are not a reasonable choice here - why? 
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", 
     pch = 23)

### Histograms 
par(mfrow = c(1, 2))
hist(Cars2004nh$consumption, prob = TRUE, main = "", col = "seagreen", 
     xlab = "Consumption [l/100 km]")
hist(Cars2004nh$length, prob = TRUE, main = "", col = "seagreen", 
     xlab = "Length [cm]")
par(mfrow = c(1, 1))

with(Cars2004nh, cor(consumption, length))

with(Cars2004nh, cor(consumption, length, use = "complete.obs"))

with(Cars2004nh, cor(consumption, length, method = "spearman", 
                     use = "complete.obs"))

### Formal statistical test(s) 
### | What is the underlying model in each test? | What are the assumptions? 

with(Cars2004nh, cor.test(consumption, length))

with(Cars2004nh, cor.test(consumption, length, method = "spearman"))

### Is there any link between the test on a row "length"
### in the output from summary(m1) and any of the previous two tests?

m1 <- lm(consumption ~ length, data = Cars2004nh)
summary(m1)

X = Cars2004nh$length
Y = Cars2004nh$consumption
I = (!is.na(X)) & (!is.na(Y))
X = X[I]
Y = Y[I]
n = length(X)
  
### point estimates of regression coefficients
(b<-sum((X-mean(X))*(Y-mean(Y)))/sum((X-mean(X))^2))
(a = mean(Y) - mean(X)*b)
  
### in matrix notation
Xm = cbind(1,X)
solve(t(Xm)%*%Xm)%*%t(Xm)%*%matrix(Y,ncol=1)
  
### residuals and the estimate of residual variance
r = Y - (a+b*X)
summary(r)
(sig = sqrt(sum(r^2)/(n-2)))
n - 2
  
### standard deviations
sqrt(sig^2*solve(t(Xm)%*%Xm))

### multiple R^2
cor(X,Y)^2
  
### Is there any link of the above output to the lines in the plot below?
### When does the correlations coefficient equal to the slope parameter?

plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", 
     pch = 23)
abline(m1, col = "blue", lwd = 2)
abline(h = mean(Cars2004nh$consumption, na.rm = TRUE), col = "darkgreen", 
       lwd = 2)

X = X/sd(X)
Y = Y/sd(Y)
summary(lm(Y~X))
cor(X,Y)
  
X = X - mean(X)
Y = Y - mean(Y)
summary(m2<-lm(Y~X))
cor(X,Y)

plot(Y~X, col = "red3", bg = "orange", pch = 23)
abline(m2, col = "blue", lwd = 2)
abline(h = mean(Y), col = "darkgreen", lwd = 2)  

  
### 4. CONTINGENCY TABLE (categorical covariate vs. categorical covariate)
### Does the fact that the car is economical (in the U.S. sense -> 
### consumption<=10) depend on the drive (distinguishing only front and other)?

### Exploratory part
plot(fcons10 ~ fdrive2, data = Cars2004nh, col = diverge_hcl(2))

(TAB2 <- with(Cars2004nh, table(fdrive2, fcons10)))
round(prop.table(TAB2, margin = 1), 2)

### Formal statistical test(s) 
### | What is the underlying model in each test? | What are the assumptions? 

chisq.test(TAB2)          # Using continuity correction to be more conservative  
chisq.test(TAB2, correct = FALSE); # Not using continuity correction

# Expected observed counts under the null hypothesis 
print(chisq.test(TAB2)$expected)  
  
(T = chisq.test(TAB2,correct=FALSE))
n = sum(TAB2)         # number of observations
R = rowSums(TAB2)/n   # observed proportion in rows
C = colSums(TAB2)/n   # observed proportion in columns
R%*%t(C)*n            # expected counts under independence
(T$observed - T$expected)/sqrt(T$expected)  # residuals
T$residuals
(ts<-sum(T$res^2))    # test statistic
pchisq(ts,df=1,lower.tail=FALSE) # p-value
  

### Are the tests below different from those above (with the same p-value)?
### What else can now be learned from the output?
### Can you reconstruct the test statistics in the test below? 

(x <- TAB2[, "Yes"])
(n <- apply(TAB2, 1, sum))
prop.test(x, n)
prop.test(x, n, correct = FALSE)

p2 = x/n            # proportion estimates in groups
p = sum(x)/sum(n)   # common proportion estimate
(ts<-(diff(p2)/sqrt(p*(1-p)*sum(1/n)))^2) # test statistic
pchisq(ts,df=1,lower.tail=FALSE)
  

### What is tested here?
prop.test(x, n, alternative = "greater")

### QUESTIONS: 
### Would it be possible to find out whether the proportion of economical cars 
### among those with the front drive is by more than 40 percentage points higher
### than the proportion of economical cars among those with the other drive?
### Even if it is not possible to get a p-value by just one call to a function, 
### is it at least possible to decide on a 5% significance level?


### For completeness, there is one case which was not discussed in this script. 
### What about the relationship between some categorical covariate and 
### continuous covariates, in the sense that we intend to predict the value of 
### a categorical covariate based on the observed value of a 
### continuous covariate? Can you guess some usefull ideas/hints? 

### Such a relationship is commonly investigated in practice, and a very broad 
### class of so-called generalized linear models is used to formally assess this
### type of a relationship. Generalized linear models are more advanced models. 
### They are closely discussed in the course 
### "NMST 432 - Advanced Regression Models".
  

### -------------------------------------------------------------------------###
###  Block 2: INDIVIDUAL WORK                                                ###
### -------------------------------------------------------------------------###

### 1. Two-sample problem: 
### Verify the assumptions for each two-sample test used in part 1

### For instance, a formal test for equal variances in two groups
(ts<-var(subset(Cars2004nh, fdrive2 == "front")$consumption)/
    var(subset(Cars2004nh, fdrive2 == "other")$consumption))
(df1<-length(subset(Cars2004nh, fdrive2 == "front")$consumption-1))
(df2<-length(subset(Cars2004nh, fdrive2 == "other")$consumption-1))
p<-pf(ts,df1=df1-1,df2=df2-1)
2*(min(p,1-p))

### Alternatively: 
var.test(subset(Cars2004nh, fdrive2 == "front")$consumption,
         subset(Cars2004nh, fdrive2 == "other")$consumption)

### 2. More-sample problem: 
### - instead of using three group specific means, consider a reformulation by 
###   using the conditional mean given the groups
### - what is the analogy with the regression formulation

### 3. Regression
### the linear function used in the model m1: 

a <- m1$coeff[1] # intercept
b <- m1$coeff[2] # slope 

y <- function(a, b, x){return(a + b * x)} ### two parameters

plot(consumption ~ length, data = Cars2004nh)
abline(m1, col = "black")

### compare with
x <- Cars2004nh$length
lines(y(a, b, x) ~ x, col = "red", lwd = 2)

### what if we use a different function now: 

y <- function(a,b,c,x){return(a * (x == "front") + b * (x == "rear") + 
                                c * (x == "4x4"))} ### three parameters

summary(m1 <- lm(consumption ~ -1 + fdrive, data = Cars2004nh))
(gmeans <- with(Cars2004nh, tapply(consumption, fdrive, mean)))

boxplot(consumption ~ fdrive, data = Cars2004nh, col = rainbow_hcl(3))
points(consumption[fdrive == "front"] ~ 
         runif(length(consumption[fdrive == "front"] ), 0.5, 1.5), 
       data = Cars2004nh, pch = 21, bg = "red", cex = 0.8)
points(consumption[fdrive == "rear"] ~ 
         runif(length(consumption[fdrive == "rear"] ), 1.5, 2.5), 
       data = Cars2004nh, pch = 21, bg = "green", cex = 0.8)
points(consumption[fdrive == "4x4"] ~ 
         runif(length(consumption[fdrive == "4x4"] ), 2.5, 3.5), 
       data = Cars2004nh, pch = 21, bg = "blue", cex = 0.8)

### sample means in three groups
lines(rep(gmeans[1],2) ~ c(0.5, 1.5), lwd = 4, col = "red")
lines(rep(gmeans[2],2) ~ c(1.5, 2.5), lwd = 4, col = "green")
lines(rep(gmeans[3],2) ~ c(2.5, 3.5), lwd = 4, col = "blue")

a <- m1$coeff[1]
b <- m1$coeff[2]
c <- m1$coeff[3]

### group means by the regression model
points(y(a,b,c, fdrive) ~ fdrive, data = Cars2004nh, pch = 22, bg = "black", 
       cex = 2)


### -------------------------------------------------------------------------###
###  Block 3: ADDITIONAL MATERIAL                                            ###
### -------------------------------------------------------------------------###


### More complex insight into the underlying data...  
### Visual assessment of normality of the two samples

COL2 <- rainbow_hcl(2)
X = subset(Cars2004nh, fdrive2 == "front")$consumption
Y = subset(Cars2004nh, fdrive2 == "other")$consumption

par(mfrow = c(1, 2))
hist(X, prob = TRUE, main = "Front", col = COL2[1], 
     xlab = "Consumption [l/100 km]")
curve(dnorm(x,mean=mean(X),sd=sd(X)),add=TRUE,col=2,lwd=2)
hist(Y, prob = TRUE, main = "Other", col = COL2[2], 
     xlab = "Consumption [l/100 km]")
curve(dnorm(x,mean=mean(Y),sd=sd(Y)),add=TRUE,col=2,lwd=2)
par(mfrow = c(1, 1))

### How would you vizualize the sample means and medians which are tested in 
### the two-sample problems? 

### For practical purposes one sided alternative can be tested instead. 
### Which alternative makes more sense? What exactly do we test here?
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2)
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2, 
       var.equal = TRUE)
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", 
            mu = -2, conf.int = TRUE)

### Reconstruct the test statistics and the corresponding p-values. 



### Contingency tables for general categorical covariates

### Does the fact that the car is economical (in the U.S. sense -> 
### consumption<=10) depend on the drive (while distinguishing front/rear/4x4)?

plot(fcons10 ~ fdrive, data = Cars2004nh, col = diverge_hcl(2))

(TAB3 <- with(Cars2004nh, table(fdrive, fcons10)))
prop.table(TAB3, margin = 1)

### What are we testing by the following test?
### What is the model behind this test?
### Does the model seem to be good enough to be useful?
### Which artefacts of the model might be relatively safely ignored and why?
### What is the conclusion?

chisq.test(TAB3)

print(chisq.test(TAB3)$expected)
print(chisq.test(TAB3)$residuals)
print(chisq.test(TAB3)$stdres)


### Does the drive (front/rear/4x4) depend
### on the type of the car (personal/wagon/SUV/pickup/sport/minivan)?

plot(fdrive ~ ftype, data = Cars2004nh, col = diverge_hcl(3))

(TAB4 <- with(Cars2004nh, table(ftype, fdrive)))
prop.table(TAB4, margin = 1)


### What are we testing by the following test?
### What is the model behind this test?
### Does the model seem to be good enough to be useful?
### Which artefacts of the model might be relatively safely ignored and why?
### What is the conclusion?

chisq.test(TAB4)
print(chisq.test(TAB4)$expected)

### Why did we get the warning message? What can be done about it?