### -------------------------------------------------------------------------### ### NMSA407 - Linear Regression ### ### Winter Term 2021/2022 | Online ### ### ### ### EXERCISE #1 Review ### ### -------------------------------------------------------------------------### ### There will be one R script (analogous to this one) available for each week ### of the term. Each script is split into three separate parts: ### i) Teaching material; ii) Individual work; and iii) Additional material. ### ### i) The first part - the teaching material - will be walked through in detail ### during the exercise classes. ### ii) The second part - the individual work - is meant for student's specific ### interests. Going through the R code in this part is not an obligation but it ### is highly recommended. ### iii) Finally, the last part - the additional material - can be used to ### enrich specific knowledge and to get some deeper insight into some specific ### steps of the first part of the script. ### -------------------------------------------------------------------------### ### Block 1: TEACHING MATERIAL ### ### -------------------------------------------------------------------------### ### If you use your personal computer an additional package mffSM is needed ### The package is not available at CRAN but, from can be downloaded from ### the official course website ### https://www2.karlin.mff.cuni.cz/~komarek/vyuka/2021_22/nmsa407-2021.html ### Use the following command for the instalation: ### 1) find the working directory of R using # getwd() ### 2) download either the .zip or the tar.gz file of the package mffSM ### to the working directory ### 3) run # install.packages("mffSM_1.1.zip", repos = NULL) ### or # install.packages("mffSM_1.1.tar.gz", repos = NULL) library("mffSM") help(package="mffSM") ### data data(Cars2004nh, package = "mffSM") dim(Cars2004nh) str(Cars2004nh) head(Cars2004nh) summary(Cars2004nh) ### Excluding cars where the consumption is missing. Cars2004nh <- Cars2004nh[!is.na(Cars2004nh$consumption),] ### Creating a working subsample set.seed(201617) idx <- c(sample((1:nrow(Cars2004nh))[Cars2004nh$ftype=="personal"], size=100), sample((1:nrow(Cars2004nh))[Cars2004nh$ftype!="personal"], size=120)) Cars2004nh <- Cars2004nh[idx,] ### Deriving some additional variables ### fdrive2: factor which distinguishes whether the drive is ### front or other (rear or 4x4) Cars2004nh <- transform(Cars2004nh, fdrive2 = factor(1*(fdrive == "front") + 2*(fdrive != "front"), labels = c("front", "other"))) with(Cars2004nh, table(fdrive, fdrive2)) ### cons10: 0/1 variable which is 1 if consumption <= 10 l/100 km, 0 otherwise ### fcons10: factor No/Yes derived from cons10 (0 -> No, 1 -> Yes) Cars2004nh <- transform(Cars2004nh, cons10 = 1 * (consumption <= 10), fcons10 = factor(1 * (consumption <= 10), labels = c("No", "Yes"))) summary(Cars2004nh) ### 1. TWO SAMPLE PROBLEM (continuous variable vs. binary variable): ### Does the consumption depend on the drive ### (distinguishing only two categories: front and other)? ### Exloratory part: Boxplot - usually the most useful plot for this problem ### What could be used instead? plot(consumption ~ fdrive2, data = Cars2004nh, col = rainbow_hcl(2)) ### Sample characteristics with(Cars2004nh, tapply(consumption, fdrive2, summary)) with(Cars2004nh, tapply(consumption, fdrive2, sd, na.rm = TRUE)) X = subset(Cars2004nh, fdrive2 == "front")$consumption Y = subset(Cars2004nh, fdrive2 == "other")$consumption ### Formal statistical test(s) ### | What is the underlying model in each test? | What are the assumptions? ### Standard two-sample t-test (assuming equal variances) t.test(consumption ~ fdrive2, data = Cars2004nh, var.equal = TRUE) mean(X) - mean(Y) ### Welch two-sample t-test (not assuming equal variances) t.test(consumption ~ fdrive2, data = Cars2004nh) ### Wilcoxon two-sample rank test wilcox.test(consumption ~ fdrive2, data = Cars2004nh, conf.int = TRUE) median(X) - median(Y) XYdiff = outer(X,Y,FUN="-") median(XYdiff) ### Kolmogorov-Smirnov test ks.test(subset(Cars2004nh, fdrive2 == "front")$consumption, subset(Cars2004nh, fdrive2 == "other")$consumption, data = Cars2004nh) plot(ecdf(X),main="ECDF") plot(ecdf(Y),col=2,add=TRUE) ### 2. MORE SAMPLE PROBLEM ### (continuous variable vs. categorical variable with more than two categories) ### Does the consumption depend on the drive (distinguishing front/rear/4x4)? ### Boxplots plot(consumption ~ fdrive, data = Cars2004nh, col = rainbow_hcl(3)) with(Cars2004nh, tapply(consumption, fdrive, summary)) with(Cars2004nh, tapply(consumption, fdrive, sd, na.rm = TRUE)) ### Formal statistical test(s) ### | What is the underlying model in each test? | What are the assumptions? ### Standard ANOVA summary(aov(consumption ~ fdrive, data = Cars2004nh)) ### ANOVA without assuming equal variances oneway.test(consumption ~ fdrive, data = Cars2004nh) ### Kruskal-Wallis Rank Sum Test kruskal.test(consumption ~ fdrive, data = Cars2004nh) ### Alternatively: the same test statistics as in the standard ANOVA above summary(m0<-lm(consumption~fdrive,data=Cars2004nh)) anova(m0) oneway.test(consumption ~ fdrive, data = Cars2004nh,var.equal=TRUE) X = subset(Cars2004nh, fdrive == "front")$consumption Y = subset(Cars2004nh, fdrive == "rear")$consumption Z = subset(Cars2004nh, fdrive == "4x4")$consumption nX = length(X) nY = length(Y) nZ = length(Z) mX = mean(X) mY = mean(Y) mZ = mean(Z) m = mean(c(X,Y,Z)) nX*(m - mX)^2+nY*(m-mY)^2+nZ*(m-mZ)^2 sum((X - mX)^2)+sum((Y-mY)^2)+sum((Z-mZ)^2) sum((c(X,Y,Z)-m)^2) nX+nY+nZ ### 3. CONDITIONAL MEAN / REGRESSION ### (continuous variable vs. continuous variable): ### Does the consumption depend on the length? ### Exploratory part: boxplots are not a reasonable choice here - why? plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23) ### Histograms par(mfrow = c(1, 2)) hist(Cars2004nh$consumption, prob = TRUE, main = "", col = "seagreen", xlab = "Consumption [l/100 km]") hist(Cars2004nh$length, prob = TRUE, main = "", col = "seagreen", xlab = "Length [cm]") par(mfrow = c(1, 1)) with(Cars2004nh, cor(consumption, length)) with(Cars2004nh, cor(consumption, length, use = "complete.obs")) with(Cars2004nh, cor(consumption, length, method = "spearman", use = "complete.obs")) ### Formal statistical test(s) ### | What is the underlying model in each test? | What are the assumptions? with(Cars2004nh, cor.test(consumption, length)) with(Cars2004nh, cor.test(consumption, length, method = "spearman")) ### Is there any link between the test on a row "length" ### in the output from summary(m1) and any of the previous two tests? m1 <- lm(consumption ~ length, data = Cars2004nh) summary(m1) X = Cars2004nh$length Y = Cars2004nh$consumption I = (!is.na(X)) & (!is.na(Y)) X = X[I] Y = Y[I] n = length(X) ### point estimates of regression coefficients (b<-sum((X-mean(X))*(Y-mean(Y)))/sum((X-mean(X))^2)) (a = mean(Y) - mean(X)*b) ### in matrix notation Xm = cbind(1,X) solve(t(Xm)%*%Xm)%*%t(Xm)%*%matrix(Y,ncol=1) ### residuals and the estimate of residual variance r = Y - (a+b*X) summary(r) (sig = sqrt(sum(r^2)/(n-2))) n - 2 ### standard deviations sqrt(sig^2*solve(t(Xm)%*%Xm)) ### multiple R^2 cor(X,Y)^2 ### Is there any link of the above output to the lines in the plot below? ### When does the correlations coefficient equal to the slope parameter? plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23) abline(m1, col = "blue", lwd = 2) abline(h = mean(Cars2004nh$consumption, na.rm = TRUE), col = "darkgreen", lwd = 2) X = X/sd(X) Y = Y/sd(Y) summary(lm(Y~X)) cor(X,Y) X = X - mean(X) Y = Y - mean(Y) summary(m2<-lm(Y~X)) cor(X,Y) plot(Y~X, col = "red3", bg = "orange", pch = 23) abline(m2, col = "blue", lwd = 2) abline(h = mean(Y), col = "darkgreen", lwd = 2) ### 4. CONTINGENCY TABLE (categorical covariate vs. categorical covariate) ### Does the fact that the car is economical (in the U.S. sense -> ### consumption<=10) depend on the drive (distinguishing only front and other)? ### Exploratory part plot(fcons10 ~ fdrive2, data = Cars2004nh, col = diverge_hcl(2)) (TAB2 <- with(Cars2004nh, table(fdrive2, fcons10))) round(prop.table(TAB2, margin = 1), 2) ### Formal statistical test(s) ### | What is the underlying model in each test? | What are the assumptions? chisq.test(TAB2) # Using continuity correction to be more conservative chisq.test(TAB2, correct = FALSE); # Not using continuity correction # Expected observed counts under the null hypothesis print(chisq.test(TAB2)$expected) (T = chisq.test(TAB2,correct=FALSE)) n = sum(TAB2) # number of observations R = rowSums(TAB2)/n # observed proportion in rows C = colSums(TAB2)/n # observed proportion in columns R%*%t(C)*n # expected counts under independence (T$observed - T$expected)/sqrt(T$expected) # residuals T$residuals (ts<-sum(T$res^2)) # test statistic pchisq(ts,df=1,lower.tail=FALSE) # p-value ### Are the tests below different from those above (with the same p-value)? ### What else can now be learned from the output? ### Can you reconstruct the test statistics in the test below? (x <- TAB2[, "Yes"]) (n <- apply(TAB2, 1, sum)) prop.test(x, n) prop.test(x, n, correct = FALSE) p2 = x/n # proportion estimates in groups p = sum(x)/sum(n) # common proportion estimate (ts<-(diff(p2)/sqrt(p*(1-p)*sum(1/n)))^2) # test statistic pchisq(ts,df=1,lower.tail=FALSE) ### What is tested here? prop.test(x, n, alternative = "greater") ### QUESTIONS: ### Would it be possible to find out whether the proportion of economical cars ### among those with the front drive is by more than 40 percentage points higher ### than the proportion of economical cars among those with the other drive? ### Even if it is not possible to get a p-value by just one call to a function, ### is it at least possible to decide on a 5% significance level? ### For completeness, there is one case which was not discussed in this script. ### What about the relationship between some categorical covariate and ### continuous covariates, in the sense that we intend to predict the value of ### a categorical covariate based on the observed value of a ### continuous covariate? Can you guess some usefull ideas/hints? ### Such a relationship is commonly investigated in practice, and a very broad ### class of so-called generalized linear models is used to formally assess this ### type of a relationship. Generalized linear models are more advanced models. ### They are closely discussed in the course ### "NMST 432 - Advanced Regression Models". ### -------------------------------------------------------------------------### ### Block 2: INDIVIDUAL WORK ### ### -------------------------------------------------------------------------### ### 1. Two-sample problem: ### Verify the assumptions for each two-sample test used in part 1 ### For instance, a formal test for equal variances in two groups (ts<-var(subset(Cars2004nh, fdrive2 == "front")$consumption)/ var(subset(Cars2004nh, fdrive2 == "other")$consumption)) (df1<-length(subset(Cars2004nh, fdrive2 == "front")$consumption-1)) (df2<-length(subset(Cars2004nh, fdrive2 == "other")$consumption-1)) p<-pf(ts,df1=df1-1,df2=df2-1) 2*(min(p,1-p)) ### Alternatively: var.test(subset(Cars2004nh, fdrive2 == "front")$consumption, subset(Cars2004nh, fdrive2 == "other")$consumption) ### 2. More-sample problem: ### - instead of using three group specific means, consider a reformulation by ### using the conditional mean given the groups ### - what is the analogy with the regression formulation ### 3. Regression ### the linear function used in the model m1: a <- m1$coeff[1] # intercept b <- m1$coeff[2] # slope y <- function(a, b, x){return(a + b * x)} ### two parameters plot(consumption ~ length, data = Cars2004nh) abline(m1, col = "black") ### compare with x <- Cars2004nh$length lines(y(a, b, x) ~ x, col = "red", lwd = 2) ### what if we use a different function now: y <- function(a,b,c,x){return(a * (x == "front") + b * (x == "rear") + c * (x == "4x4"))} ### three parameters summary(m1 <- lm(consumption ~ -1 + fdrive, data = Cars2004nh)) (gmeans <- with(Cars2004nh, tapply(consumption, fdrive, mean))) boxplot(consumption ~ fdrive, data = Cars2004nh, col = rainbow_hcl(3)) points(consumption[fdrive == "front"] ~ runif(length(consumption[fdrive == "front"] ), 0.5, 1.5), data = Cars2004nh, pch = 21, bg = "red", cex = 0.8) points(consumption[fdrive == "rear"] ~ runif(length(consumption[fdrive == "rear"] ), 1.5, 2.5), data = Cars2004nh, pch = 21, bg = "green", cex = 0.8) points(consumption[fdrive == "4x4"] ~ runif(length(consumption[fdrive == "4x4"] ), 2.5, 3.5), data = Cars2004nh, pch = 21, bg = "blue", cex = 0.8) ### sample means in three groups lines(rep(gmeans[1],2) ~ c(0.5, 1.5), lwd = 4, col = "red") lines(rep(gmeans[2],2) ~ c(1.5, 2.5), lwd = 4, col = "green") lines(rep(gmeans[3],2) ~ c(2.5, 3.5), lwd = 4, col = "blue") a <- m1$coeff[1] b <- m1$coeff[2] c <- m1$coeff[3] ### group means by the regression model points(y(a,b,c, fdrive) ~ fdrive, data = Cars2004nh, pch = 22, bg = "black", cex = 2) ### -------------------------------------------------------------------------### ### Block 3: ADDITIONAL MATERIAL ### ### -------------------------------------------------------------------------### ### More complex insight into the underlying data... ### Visual assessment of normality of the two samples COL2 <- rainbow_hcl(2) X = subset(Cars2004nh, fdrive2 == "front")$consumption Y = subset(Cars2004nh, fdrive2 == "other")$consumption par(mfrow = c(1, 2)) hist(X, prob = TRUE, main = "Front", col = COL2[1], xlab = "Consumption [l/100 km]") curve(dnorm(x,mean=mean(X),sd=sd(X)),add=TRUE,col=2,lwd=2) hist(Y, prob = TRUE, main = "Other", col = COL2[2], xlab = "Consumption [l/100 km]") curve(dnorm(x,mean=mean(Y),sd=sd(Y)),add=TRUE,col=2,lwd=2) par(mfrow = c(1, 1)) ### How would you vizualize the sample means and medians which are tested in ### the two-sample problems? ### For practical purposes one sided alternative can be tested instead. ### Which alternative makes more sense? What exactly do we test here? t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2) t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2, var.equal = TRUE) wilcox.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2, conf.int = TRUE) ### Reconstruct the test statistics and the corresponding p-values. ### Contingency tables for general categorical covariates ### Does the fact that the car is economical (in the U.S. sense -> ### consumption<=10) depend on the drive (while distinguishing front/rear/4x4)? plot(fcons10 ~ fdrive, data = Cars2004nh, col = diverge_hcl(2)) (TAB3 <- with(Cars2004nh, table(fdrive, fcons10))) prop.table(TAB3, margin = 1) ### What are we testing by the following test? ### What is the model behind this test? ### Does the model seem to be good enough to be useful? ### Which artefacts of the model might be relatively safely ignored and why? ### What is the conclusion? chisq.test(TAB3) print(chisq.test(TAB3)$expected) print(chisq.test(TAB3)$residuals) print(chisq.test(TAB3)$stdres) ### Does the drive (front/rear/4x4) depend ### on the type of the car (personal/wagon/SUV/pickup/sport/minivan)? plot(fdrive ~ ftype, data = Cars2004nh, col = diverge_hcl(3)) (TAB4 <- with(Cars2004nh, table(ftype, fdrive))) prop.table(TAB4, margin = 1) ### What are we testing by the following test? ### What is the model behind this test? ### Does the model seem to be good enough to be useful? ### Which artefacts of the model might be relatively safely ignored and why? ### What is the conclusion? chisq.test(TAB4) print(chisq.test(TAB4)$expected) ### Why did we get the warning message? What can be done about it?