#####
##### NMST440: Advanced Aspects of the R Environment
#####
##### ---------------------------------------------------------------------
#####
##### Simple simulations towards properties of an estimator
#####
##### ---------------------------------------------------------------------
#####
##### Arnošt Komárek
##### https://www2.karlin.mff.cuni.cz/~komarek
##### komarek@karlin.mff.cuni.cz
#####
##### ======================================================================
rm(list = ls())
ROOT <- "/home/komarek/teach/mff_2021/nmst440_AdvRko/Tutorial07/"
setwd(ROOT)
##### ==================================================================================================
#####
##### In the following, we deal with the problem of so-called "regression with errors in variables".
##### The task is to estimate a linear model E(Y|X) = beta_0 + beta_1*X_1 + ... + beta_p*X_p, where,
##### however, the regressors are measured with an error. That is, instead of observing X_1, ..., X_p,
##### we observe X_j^* = X_j + eta_j, j = 1, ..., p,
##### where eta_j is a random variable with zero mean and standard deviation sigma_j.
##### To estimate the regression coefficients beta_0, beta_1, ..., beta_p,
##### the method of least squares will be used with the observed ("erroneous") regressors X_j^*.
##### It can be shown (and one could probably expect this) that the resulting LSE of the regression
##### coefficients does not have the usual properties (unbiasedness, ...).
#####
##### The simulation study below shows, on an example of the regression with errors in variables,
##### how to explore (by a simulation study) properties (bias, standard error/mean squared error, consistency)
##### of a certain estimator (LSE of regression coefficients here) under certain conditions.
#####
##### ==================================================================================================
##### =========================================================
#####
##### Regression with errors in variables
#####
##### =========================================================
#### As in the previous script, we must specify the scenarios under which we will examine
#### the behavior of our estimator. The part of the scenario common to all simulation settings
#### will be a normal linear model with p = 1 continuous regressor following a uniform
#### distribution on (-1, 1). The regression coefficients will be equal to 0 (intercept) and 1 (slope),
#### and the residual variance will be equal to 0.2^2. That is, the responses Y_i are generated as follows:
####
#### Y_i = beta_0 + beta_1*X_i + epsilon_i,
#### where beta_0 = 0, beta_1 = 1,
#### X_i ~ Unif(-1, 1),
#### epsilon_i ~ Norm(0, sigma_eps^2), sigma_eps = 0.2, i = 1, ..., n,
#### X_1, ..., X_n, epsilon_1, ..., epsilon_n are assumed to be mutually independent.
####
#### Nevertheless, for estimation of beta_0 and beta_1, values of X_1, ..., X_n are assumed NOT TO BE AVAILABLE.
#### Instead, their "erroneous" counterparts X_1^*, ..., X_n^* are used for the estimation. As a part
#### of the simulation scenario, we must specify how those are generated. Here, we will assume
#### that they come from a normal distribution whose mean is equal to the true value of the regressor
#### and whose standard deviation is equal to sigma_x = 0.1. That is, we will use the following data generating mechanism:
####
#### X_i^* = X_i + eta_i, eta_i ~ Norm(0, sigma_x^2), sigma_x = 0.1, i = 1, ..., n,
#### eta_1, ..., eta_n independent.
####
#### We then explore the behavior of hat(beta) = the LSE based on the data (Y_i, X_i^*), i = 1, ..., n,
#### for sample sizes n = 50, 100, 500, 1000, 10000.
### Sample sizes and a simulation length
### ++++++++++++++++++++++++++++++++++++++++++++++++++++++
n <- c(50, 100, 500, 1000, 10000) ## sample sizes
M <- 1000 ## simulation length,
## run everything also with M = 10000
### Parameters of a true model
### ++++++++++++++++++++++++++++++
beta0 <- 0
beta1 <- 1
sigma.x <- 0.1
sigma.eps <- 0.2
### One dataset for illustration (n = 100)
### +++++++++++++++++++++++++++++++++++++++
###
### This piece of code also shows how to generate one dataset
### and how to obtain the needed LSE
### (the step to be repeated M times for each sample size).
nnow <- 100
x.true <- runif(nnow, -1, 1)
y.obs <- beta0 + beta1 * x.true + rnorm(nnow, mean = 0, sd = sigma.eps)
x.obs <- x.true + rnorm(nnow, mean = 0, sd = sigma.x)
summary(fit <- lm(y.obs ~ x.obs))
### Plot just for illustration
plot(y.obs ~ x.obs, pch = 23, col = "red3", bg = "orange", xlab = "x", ylab = "y")
abline(a = beta0, b = beta1, col = "red3", lwd = 2)
abline(fit, col = "blue2", lwd = 2)
legend("topleft", legend = c("Fitted", "True"), col = c("blue2", "red3"), lty = 1, lwd = 2)
### NOTE: For just one dataset, the fitted line is almost surely different from the true one,
### even if there were no errors in the regressors. The purpose of the simulation study below
### is to see (a) whether the "mean" fitted line differs from the true one (unbiasedness),
### (b) whether the "difference" between the fitted line and the true one becomes smaller
### as the sample size n increases (consistency).
### Main simulation loop
### ++++++++++++++++++++++++
#### The code below is an analogue of the code from the script in which we explored
#### properties of a statistical test. The only difference is that we are now
#### interested in the distribution of the estimator(s), hence we must store the estimates.
#### Next to the estimated regression coefficients, the estimated residual standard
#### deviation will also be kept.
be0 <- be1 <- sig.eps <- matrix(nrow = M, ncol = length(n)) ## matrices to store calculated estimates
colnames(be0) <- colnames(be1) <- colnames(sig.eps) <- n
# j <- 2; m <- 1
for (j in 1:length(n)){
  set.seed(19730911)    ## to be able to reproduce the results
  cat(" *** Calculating sample size n = ", n[j], " ***\n")
  for (m in 1:M){
    x.true <- runif(n[j], -1, 1)
    y.obs <- beta0 + beta1 * x.true + rnorm(n[j], mean = 0, sd = sigma.eps)
    x.obs <- x.true + rnorm(n[j], mean = 0, sd = sigma.x)
    fit <- lm(y.obs ~ x.obs)
    be0[m, j] <- coef(fit)[1]
    be1[m, j] <- coef(fit)[2]
    sig.eps[m, j] <- summary(fit)[["sigma"]]
  }
}
## What are those numbers?
be1[, 1] ## = (pseudo)random sample from the distribution of the LSE of the slope for the first considered sample size
be0[1:10,]
be1[1:10,]
sig.eps[1:10,]
### Again, it is wise to store the results of a perhaps time-consuming simulation
### (the SimResult subdirectory must exist beforehand)
save(list = c("be0", "be1", "sig.eps"), file = paste(ROOT, "SimResult/sim_regres.RData", sep = ""))
#
### And load the generated samples before processing them
(load(paste(ROOT, "SimResult/sim_regres.RData", sep = "")))
(M <- nrow(be1))
### Bias of estimators
### +++++++++++++++++++++++
###
### bias = expectation(estimator - true value),
### so with a (pseudo)random sample, the bias is estimated by the sample mean
(bias.be0 <- apply(be0 - beta0, 2, mean))
## estimated intercept seems to be unbiased
(bias.be1 <- apply(be1 - beta1, 2, mean))
## but the slope seems to be estimated with a bias of about -0.029
## and this bias does not change with the sample size
(bias.sig.eps2 <- apply(sig.eps^2 - sigma.eps^2, 2, mean))
## the residual variance also seems to be estimated with a bias,
## of about 0.0097
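### The biases seen above can be checked against the classical attenuation
### result for least squares with errors in variables (a sketch; the true
### parameter values are repeated so that the chunk is self-contained):

```r
## True scenario parameters (already set above; repeated for self-containedness)
beta1 <- 1; sigma.x <- 0.1

## Attenuation factor: the naive LSE of the slope converges to lambda * beta1
varX <- 1 / 3                              ## variance of Unif(-1, 1)
lambda <- varX / (varX + sigma.x^2)
(bias.be1.theor <- beta1 * (lambda - 1))   ## about -0.0291, independent of n

## The residual variance is inflated accordingly:
## the limit of hat(sigma)^2 is sigma_eps^2 + beta1^2 * lambda * sigma_x^2
(bias.sig2.theor <- beta1^2 * lambda * sigma.x^2)   ## about 0.0097
```

### Both values match the simulated biases observed above.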
### Standard errors of estimators (estimated by the simulation study)
### (does not make much sense for biased estimators!)
### ++++++++++++++++++++++++++++++++++++++++++++++++++++
(se.be0 <- apply(be0, 2, sd))
## seems to decrease with the sample size,
## and since this estimator also looked unbiased, it seems to be consistent as well
(se.be1 <- apply(be1, 2, sd))
## also seems to be decreasing; nevertheless, since we found some bias,
## the estimator might be "consistent" around a value different
## from the true slope
(se.sig.eps2 <- apply(sig.eps^2, 2, sd))
## also seems to be decreasing; nevertheless, since we found some bias,
## the estimator might be "consistent" around a value different
## from the true residual variance
### Mean squared errors of estimators
### = the relevant quantity for evaluating the quality of an estimator
###   which is perhaps biased
### ++++++++++++++++++++++++++++++++++++++++++++++++++++
(mse.be0 <- apply((be0 - beta0)^2, 2, mean))
(mse.be1 <- apply((be1 - beta1)^2, 2, mean))
(mse.sig.eps2 <- apply((sig.eps^2 - sigma.eps^2)^2, 2, mean))
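### As a sanity check, the mean squared error decomposes as
### MSE = bias^2 + variance; since sd() uses the (M - 1) denominator, the
### exact identity on simulated values involves the factor (M - 1)/M.
### A small self-contained illustration (the vector of "estimates" below is
### hypothetical, chosen only to demonstrate the identity):

```r
est  <- c(0.95, 0.98, 1.02, 0.97, 1.05)  ## hypothetical estimates of the true value 1
true <- 1
bias <- mean(est - true)
mse  <- mean((est - true)^2)
M0   <- length(est)
## mse equals bias^2 + (M0 - 1) / M0 * sd(est)^2 exactly (algebraic identity)
all.equal(mse, bias^2 + (M0 - 1) / M0 * sd(est)^2)   ## TRUE
```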
### Usually, the square root of the MSE is reported
### (it is on the same scale as the standard error)
### ++++++++++++++++++++++++++++++++++++++++++++++
(smse.be0 <- sqrt(mse.be0))
(smse.be1 <- sqrt(mse.be1))
(smse.sig.eps2 <- sqrt(mse.sig.eps2))
### For possible calculation of confidence intervals
### or derivation of statistical tests, it is necessary to know something
### about the distribution of the estimator.
### Distribution of estimators (estimated by simulation)
### ++++++++++++++++++++++++++++++++++++++++++++++++++++++
nnow <- n[2] ## Choose one n to be plotted
par(mfrow = c(2, 2))
hist(be0[, paste(nnow)], prob = TRUE, xlab = expression(hat(beta)[0]), ylab = "Density", col = "burlywood", main = paste("n =", nnow))
abline(v = beta0, col = "red", lwd = 3)
hist(be1[, paste(nnow)], prob = TRUE, xlab = expression(hat(beta)[1]), ylab = "Density", col = "burlywood", main = paste("n =", nnow))
abline(v = beta1, col = "red", lwd = 3)
hist(sig.eps[, paste(nnow)], prob = TRUE, xlab = expression(hat(sigma)), ylab = "Density", col = "burlywood3", main = paste("n =", nnow))
abline(v = sigma.eps, col = "red", lwd = 3)
hist(sig.eps[, paste(nnow)]^2, prob = TRUE, xlab = expression(hat(sigma)^2), ylab = "Density", col = "burlywood3", main = paste("n =", nnow))
abline(v = sigma.eps^2, col = "red", lwd = 3)
par(mfrow = c(1, 1))
### Possible presentation of simulation results
### (boxplots of obtained values of estimates)
### ++++++++++++++++++++++++++++++++++++++++++++++
gr <- factor(rep(1:length(n), each = M), labels = n)
gr[1:10]
table(gr)
WHAT <- "be1" ## Which parameter will be drawn
trueVal <- beta1
#WHAT <- "sig.eps" ## Which parameter will be drawn
#trueVal <- sigma.eps
get(WHAT)[1:10,] ## Matrix with "WHAT" estimates
plot(as.numeric(get(WHAT)) ~ gr, col = "burlywood", xlab = "Sample size", ylab = "Estimate")
abline(h = trueVal, col = "red", lwd = 2)
#### We can easily see (a) bias (are the boxplots centered around the true value or not?),
#### (b) consistency (if unbiased): is the variability decreasing with the sample size?
##### =========================================================
#####
##### IMPORTANT TECHNICAL REMARKS
#####
##### =========================================================
### When running a simulation, the estimation/testing procedure
### may sometimes fail for some simulated datasets.
### If this does not happen "too often", we can ignore it
### from the inferential point of view, but the code must
### be adjusted so that the simulation does not stop whenever
### a "bad" dataset is simulated.
### For example, the following may happen (especially with a low sample size)
### with the ordinal data from the nmst440-simul1.R script:
y1 <- c(5, 5, 5, 5, 5)
y2 <- c(5, 5, 5, 5, 5)
## tt <- t.test(y1, y2, var.equal = TRUE)   ## uncomment to see the error!
### Both samples have a sample variance equal to zero, hence the T-statistic
### tries to divide by zero and t.test() stops with an error
### ("data are essentially constant").
### If this happens inside the loop, everything is stopped.
### Solution: use the try() function to check whether everything went OK.
tt <- try(t.test(y1, y2, var.equal = TRUE))
print(tt)
class(tt)
## Or, with the error message suppressed:
tt <- try(t.test(y1, y2, var.equal = TRUE), silent = TRUE)
print(tt)
class(tt)
## Result with "good" data
y1 <- c(5, 4, 5, 5, 5)
tt <- try(t.test(y1, y2, var.equal = TRUE))
print(tt) ## i.e., tt is now the usual object returned by the t.test function
class(tt)
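### An alternative to try() is tryCatch(), which lets us return a chosen value
### (here NULL) when an error occurs (a sketch; fresh variable names are used
### so as not to clobber y1, y2, tt above):

```r
y1c <- c(5, 5, 5, 5, 5)                      ## constant samples: t.test() fails
y2c <- c(5, 5, 5, 5, 5)
tt2 <- tryCatch(t.test(y1c, y2c, var.equal = TRUE),
                error = function(e) NULL)    ## return NULL instead of stopping
is.null(tt2)                                 ## TRUE: the failure can be tested
                                             ## with is.null() inside a loop
```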
### The simulation loop is then better implemented using a 'while' loop:
M <- 1000
tstat <- numeric(M)
Mdone <- Mwrong <- 0
set.seed(20170420)
while (Mdone < M && Mwrong < 100){   ### some protection against infinitely many errors is indeed needed
  y1 <- sample(x = 1:5, size = 5, replace = TRUE, prob = c(0.50, 0.30, 0.10, 0.07, 0.03))  ## random sample, group 1
  y2 <- sample(x = 1:5, size = 5, replace = TRUE, prob = c(0.50, 0.30, 0.10, 0.07, 0.03))  ## random sample, group 2
  tt <- try(t.test(y1, y2, var.equal = TRUE), silent = TRUE)   ## two-sample t-test, equal variances
  if (inherits(tt, "try-error")){
    Mwrong <- Mwrong + 1
    next
  }
  Mdone <- Mdone + 1
  tstat[Mdone] <- tt$statistic
}
print(c(Mdone, Mwrong))
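### The stored t-statistics can then be used, e.g., to estimate the empirical
### size of the test at the nominal 5% level via
### mean(abs(tstat) > qt(0.975, df = 8)), with df = 5 + 5 - 2 = 8.
### A self-contained illustration of that computation with statistics drawn
### directly from the null t(8) distribution (only to show the calculation;
### the loop above uses the ordinal data instead):

```r
set.seed(1)
tstat.demo <- rt(2000, df = 8)     ## statistics sampled from the null t(8)
(size.demo <- mean(abs(tstat.demo) > qt(0.975, df = 8)))   ## close to 0.05
```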