#####
##### NMST440: Advanced Aspects of the R Environment
#####
##### ---------------------------------------------------------------------
#####
##### Simple simulations towards properties of a statistical test
#####
##### ---------------------------------------------------------------------
#####
##### Arnošt Komárek
##### https://www2.karlin.mff.cuni.cz/~komarek
##### komarek@karlin.mff.cuni.cz
#####
##### ======================================================================
rm(list = ls())
ROOT <- "/home/komarek/teach/mff_2021/nmst440_AdvRko/Tutorial07/"
setwd(ROOT)
##### ==================================================================================================
##### In the following, we will explore behavior (statistical properties)
##### of two tests designed to compare two independent samples:
##### two-sample t-test and Wilcoxon (rank-sum) test.
#####
##### We will look at probability of type I error and power of the test under certain alternatives.
##### Both tests are derived under certain assumptions (normality for the t-test,
##### continuous distribution and shift in location alternatives for the Wilcoxon test).
##### Under those assumptions, the probability of type I error is indeed equal to
##### the significance level (alpha), and formulas also exist for calculating the power.
#####
##### The main task now is to explore the properties in a situation when the assumptions
##### do not hold. As a motivation, we will consider a "typical" problem from the social sciences
##### with ordinal data coming from a questionnaire related to some question
##### with possible answers of the type
##### 'strongly disagree' - 'disagree' - 'neutral' - 'agree' - 'strongly agree'.
#####
##### A typical problem is to compare two groups (e.g., students of mathematics and students of informatics)
##### with respect to such a question. One possible approach is to use a chi-square test of independence
##### (with factor 1: question, factor 2: group indicator). Nevertheless, the ordinality of the response
##### would then be ignored. Another approach is to use some two-sample test for numeric data.
##### In (semi)statistical courses, (social science) students are advised not to use the t-test
##### with the explanation "data are not normal". Students are usually taught to use the "non-parametric"
##### alternative, i.e., the Wilcoxon test. Nevertheless, if (semi)teachers of statistics were aware
##### of the Central Limit Theorem (CLT), they would rather recommend the t-test.
##### In principle, if (semi)teachers of statistics knew something about statistics,
##### they would never recommend the Wilcoxon test since a crucial assumption behind it
##### (if its non-asymptotic version is to be used) is a continuous distribution, which guarantees
##### that the observations can almost surely be uniquely ranked. Luckily for (semi)teachers of statistics,
##### they usually deal with reasonably sized data where even the Wilcoxon test is used
##### in its asymptotic version (based on "mean ranks"), so finally, such a Wilcoxon test
##### is more or less the same as the t-test (as we will see below).
#####
##### ===================================================================================================
##### The principle of a simulation study is to set some SCENARIOs under which data will be generated
##### (distribution of data, sample size, ...). Then, for each scenario, we repeat the following two steps M times:
#####
##### 1) generate data using the scenario (computer based random number generator is usually needed to do so)
##### --> Data[m]
##### 2) calculate a value of the test statistic we are interested in using Data[m]
##### --> Statistic[m] (will be denoted T[m])
#####
##### At the end, T[1], T[2], ..., T[M] is a RANDOM sample from the distribution of the statistic T
##### under the assumption that the data follow the SCENARIO. Methods based on the empirical distribution function of T[1], T[2], ..., T[M]
##### can now be used to estimate the needed properties of the distribution of T under the SCENARIO, see below for more details.
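##### The two steps above can be sketched generically in R. A minimal standalone
##### illustration (not part of the study below): the "scenario" is n = 30 standard
##### normal observations and the statistic is simply the sample mean.

```r
set.seed(1)
Mdemo <- 1000
Tsim <- replicate(Mdemo, {
  y <- rnorm(30)   ## step 1: generate Data[m] under the scenario
  mean(y)          ## step 2: compute Statistic[m] = T[m]
})
length(Tsim)              ## M values = a random sample from the distribution of T
c(mean(Tsim), sd(Tsim))   ## empirical summaries of that distribution
```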
##### =========================================================
#####
##### Two-sample t-test and Wilcoxon test use for ordinal data
#####
##### =========================================================
#### SCENARIO says that we are comparing two independent equally sized groups.
#### Data in each group come from a discrete distribution with values 1, 2, 3, 4, 5
#### and certain probabilities (vector 'p' below).
### We will consider equal sample sizes in both groups.
### ++++++++++++++++++++++++++++++++++++++++++++++++++++++
n <- c(10, 20, 50, 100, 500) ## sample size in one group (five possible values will be examined)
M <- 1000 ## simulation length,
## run everything also with M = 10000
### Reference distribution
### ++++++++++++++++++++++++++++
val <- 1:5 ## range of values
p <- c(0.50, 0.30, 0.15, 0.03, 0.02) ## reference distribution
#p <- rep(0.20, 5)
plot(val, p, type = "h", xlab = "Value", ylab = "Probability", col = "red3", lwd = 3)
### To illustrate: as can be seen, the distribution is quite skewed.
### Will this skewness be a problem for the performance of the t-test?
(EY <- sum(p*val)) ## true mean of data distribution
(varY <- sum(p*(val - EY)^2)) ## true variance of data distribution
(sdY <- sqrt(varY)) ## true standard deviation of data distribution
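### A quick sanity check (an aside, not part of the study): the theoretical moments
### can be compared with empirical moments of a single large sample.

```r
val <- 1:5; p <- c(0.50, 0.30, 0.15, 0.03, 0.02)   ## as above
EY <- sum(p * val); varY <- sum(p * (val - EY)^2)  ## as above
set.seed(1)
ybig <- sample(x = val, size = 1e6, replace = TRUE, prob = p)
c(EY, mean(ybig))    ## theoretical vs. empirical mean
c(varY, var(ybig))   ## theoretical vs. empirical variance
```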
##### --------------------------------------------------
##### Simulation towards the true significance level
##### and towards distribution of the test statistic
##### under the NULL hypothesis
##### --------------------------------------------------
#### Under the null hypothesis, both samples that we compare
#### must come from the same distribution.
### Main simulation loop
### ++++++++++++++++++++++++
tstat <- wstat <- matrix(nrow = M, ncol = length(n)) ## matrix to store calculated test statistics
colnames(tstat) <- colnames(wstat) <- n
print(tstat[1:10,])
print(wstat[1:10,])
### Index j: sample size (scenario)
### Index m: dataset generated under specific scenario
### When "learning" or when developing the code, run everything
### for a particular value of j and m. Do not run the whole for loop
### directly.
# j <- 2; m <- 1 ## Run the inner part of the for loop for example for this
## combination of j and m and try to understand each row of the code,
## possibly by manual checking of the values that appear on a specific row.
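### Before running the loop, a tiny standalone illustration of what set.seed
### (used right below) does: the same seed reproduces exactly the same
### pseudorandom stream.

```r
set.seed(123); a <- sample(1:5, size = 3, replace = TRUE)
set.seed(123); b <- sample(1:5, size = 3, replace = TRUE)
identical(a, b)   ## TRUE: equal seeds give equal "random" data
```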
for (j in 1:length(n)){
set.seed(19730911) ## to be able to reproduce the results
### A computer random number generator is not really random; it is usually called pseudorandom.
### Starting from some initial value called the 'seed', a deterministic algorithm determines what will be generated.
### The algorithm ensures that, in the long run, the distribution of the generated numbers is sufficiently close
### to the distribution from which the algorithm is supposed to generate.
### If not set by the user, the 'seed' is usually taken from the system clock.
### Nevertheless, it is often useful (and sometimes even required) that results can be exactly
### reproduced in the future. To ensure that, the 'seed' must be set manually; in R, this is done by the 'set.seed' command.
###
### In the context of simulation studies, the 'seed' can be set either before starting the whole simulation study (in front of the whole for loop),
### or before starting a particular scenario (like here). The latter option is better as it makes it easy to reproduce the calculations for just one
### selected scenario.
EW <- n[j] * n[j] / 2 ## H0 mean of Mann-Whitney (shifted Wilcoxon) statistic
for (m in 1:M){
### Calling set.seed() here would be wrong. As a consequence, exactly the same data would be generated M times,
### and the generated random sample from the distribution of the test statistic would wrongly have zero variability...
### Generate data
y1 <- sample(x = val, size = n[j], replace = TRUE, prob = p) ## random sample, group 1
y2 <- sample(x = val, size = n[j], replace = TRUE, prob = p) ## random sample, group 2
### t-test
tt <- t.test(y1, y2, var.equal = TRUE) ## two-sample t-test, equal-variances
tstat[m, j] <- tt$statistic ## keep a value of the statistic of the t-test
### Wilcoxon test: we keep a value of the statistic which should
### be asymptotically normal. Note that the "statistic" element of the list returned
### by the 'wilcox.test' function must be standardized to get a statistic which is asymptotically normal.
### Some needed expressions are calculated by hand.
R <- rank(c(y1, y2))
nTies <- table(R)
varCorrectWithTies <- sum(nTies^3 - nTies) / (2*n[j] * (2*n[j] - 1))
varW <- (n[j] * n[j] / 12) * (2*n[j] + 1 - varCorrectWithTies) ## H0 variance of Mann-Whitney statistic corrected for ties
wt <- wilcox.test(y1, y2, exact = FALSE, correct = TRUE) ## asymptotic Wilcoxon test, continuity corrections
wstat[m, j] <- (wt$statistic - EW - 0.5*sign(wt$statistic - EW)) / sqrt(varW) ## statistic which should be asymptotically normal
}
}
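### A quick cross-check of the standardization above (one dataset, n = 20 per group):
### the hand-standardized statistic should reproduce the p-value reported by
### wilcox.test, since the function uses the same tie and continuity corrections.

```r
val <- 1:5; p <- c(0.50, 0.30, 0.15, 0.03, 0.02)   ## as above
set.seed(1)
z1 <- sample(x = val, size = 20, replace = TRUE, prob = p)
z2 <- sample(x = val, size = 20, replace = TRUE, prob = p)
wt <- wilcox.test(z1, z2, exact = FALSE, correct = TRUE)
EW <- 20 * 20 / 2
nTies <- table(rank(c(z1, z2)))
varW <- (20 * 20 / 12) * (40 + 1 - sum(nTies^3 - nTies) / (40 * 39))
w <- as.numeric((wt$statistic - EW - 0.5 * sign(wt$statistic - EW)) / sqrt(varW))
c(byHand = 2 * pnorm(-abs(w)), wilcox = wt$p.value)   ## should (nearly) coincide
```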
### What is this?
tstat[, 1]
### -> a random sample from the distribution of the t-test statistic under the scenario where the two samples
### have equal distributions, for the first considered sample size (n = 10)
tstat[1:10,] ## What are those numbers?
wstat[1:10,]
### -> first 10 values of each random sample from the distribution of the test statistic under each scenario
### (different sample sizes here)
### It is wise to store generated values, especially if the above for loop takes long
## (which is usually the case with "more serious" simulation studies)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
### Below, results are stored in a file sim_tw_ordinal.RData
### in a subdirectory 'SimResult' (must exist) of the working directory.
save(list = c("tstat", "wstat"), file = paste(ROOT, "/SimResult/sim_tw_ordinal.RData", sep = ""))
### Later, the generated values can be loaded
### and further processed to get results to be "published" (in a diploma thesis, ...)
(load(paste(ROOT, "/SimResult/sim_tw_ordinal.RData", sep = "")))
(M <- nrow(tstat))
### True significance levels of the t-tests (their MC estimates)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
### Probability of type I error = true significance level must always be related
### to the assumed significance level of the test, or better said, to
### a decision rule that determines whether H0 is rejected or not. For classical tests,
### decision rule = critical region, and the critical region is often given by just one critical value
### (which depends on the assumed significance level).
###
### In the code below, we estimate the true significance level (true probability of type I error)
### of the t-test and the Wilcoxon test applied to ordinal data generated under the above scenarios,
### assuming the tests were performed at the (nominal) level alpha = 0.05.
(Ct <- qt(0.975, df = 2*n - 2)) ## What are those numbers? Yes, critical values
Ctmat <- matrix(rep(Ct, each = M), nrow = M)
Ctmat[1:10,] ## Just 'technical' matrix where in each column, critical value
## related to each of considered sample sizes is repeated.
#
(nReject.t <- apply(abs(tstat) > Ctmat, 2, sum)) ## Number of rejections by t-test for each sample size.
(alpha.t <- nReject.t / M) ## True significance level estimated by simulation.
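### As an aside, the replicated 'Ctmat' matrix can be avoided: sweep() compares
### each column to its own critical value directly (toy illustration with
### made-up numbers, not the simulation results):

```r
A <- matrix(c(1, 3, 2, 4), nrow = 2)    ## toy 2 x 2 matrix of |statistics|
thr <- c(2, 3)                          ## per-column critical values
colMeans(sweep(A, 2, thr, FUN = ">"))   ## rejection proportion per column
```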
### In principle, alpha.t is an estimated probability from a Bernoulli (0/1) sample.
### Hence, we can calculate the standard error (= estimated standard deviation) of this estimate
### using the classical formula (for Bernoulli samples):
(MCE.alpha.t <- sqrt(alpha.t * (1 - alpha.t) / M))
## In this context, this standard error is called the 'Monte Carlo error' (MCE).
## As can be seen, by increasing 'M' we can (easily) decrease the MCE as much as we want,
## depending on the complexity of the simulation study and the power of our computer.
### The CLT can be used to get 95% "confidence intervals" for the true probabilities of type I error:
alpha.t.Low <- alpha.t - qnorm(0.975) * MCE.alpha.t
alpha.t.Upp <- alpha.t + qnorm(0.975) * MCE.alpha.t
ALPHA.t <- data.frame(Estimate = alpha.t, Lower = alpha.t.Low, Upper = alpha.t.Upp)
print(ALPHA.t)
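### To see how 'M' drives the Monte Carlo error, the MCE formula can be evaluated
### directly (here for a hypothetical true level of 0.05):

```r
Mtry <- c(100, 1000, 10000, 100000)
round(sqrt(0.05 * 0.95 / Mtry), 5)   ## MCE decreases like 1/sqrt(M):
                                     ## to halve it, quadruple M
```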
### As we can see, the probability of type I error of the t-test is pretty OK,
### even though the distribution of the data is discrete and skewed. And this seems to hold
### even for the quite small sample size n = 10.
### But this should be no surprise if we know something about the CLT (including
### some knowledge of the speed of its convergence - note that all higher-order moments
### are pretty OK with discrete data unless we send some of the 'cell' probabilities to 0).
### True significance levels of the Wilcoxon tests (their MC estimates)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(Cw <- qnorm(0.975)) ## What is this number?
#
(nReject.w <- apply(abs(wstat) > Cw, 2, sum) ) ## Number of rejections by Wilcoxon test for each sample size.
(alpha.w <- nReject.w / M) ## True significance level calculated by simulation.
#
(MCE.alpha.w <- sqrt(alpha.w * (1 - alpha.w) / M)) ## Monte Carlo error
#
alpha.w.Low <- alpha.w - qnorm(0.975) * MCE.alpha.w
alpha.w.Upp <- alpha.w + qnorm(0.975) * MCE.alpha.w
ALPHA.w <- data.frame(Estimate = alpha.w, Lower = alpha.w.Low, Upper = alpha.w.Upp)
print(ALPHA.w)
### True distribution of the t-test statistic compared to the assumed one (Student t)
### ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tgrid <- seq(-5, 5, length = 500) ### grid for plotting
### Histogram (= estimate of the true density) vs. density of the t-distribution (red)
### and a kernel estimate (blue) of the density (= another estimate of the true density)
### --> one possible way to visualize the results and to check how closely the true distribution
### of the test statistic corresponds to its assumed distribution (Student t in this case)
layout(matrix(c(0,1,1,2,2,0, 3,3,4,4,5,5), nrow = 2, byrow = TRUE))
for (j in 1:length(n)){
ft <- dt(tgrid, df = 2*n[j] - 2)
kern <- density(tstat[, j])
hist(tstat[, j], prob = TRUE, col = "burlywood", xlab = "T", ylab = "Density", main = paste("n =", n[j]), ylim = c(0, 0.45), xlim = range(tstat))
lines(kern$x, kern$y, col = "blue3", lwd = 2)
lines(tgrid, ft, col = "red3", lwd = 2)
}
par(mfrow = c(1, 1))
### Empirical (= estimate of the true) vs. assumed CDF
### --> this is perhaps even better (easier comparison of the true distribution to the assumed one)
layout(matrix(c(0,1,1,2,2,0, 3,3,4,4,5,5), nrow = 2, byrow = TRUE))
for (j in 1:length(n)){
Ft <- pt(tgrid, df = 2*n[j] - 2)
plot(ecdf(tstat[, j]), do.points = FALSE, col = "blue", xlab = "T", ylab = "CDF", main = paste("n =", n[j]), ylim = c(0, 1), xlim = range(tstat))
lines(tgrid, Ft, col = "red3", lwd = 2)
}
par(mfrow = c(1, 1))
### the blue line (estimated true CDF) is practically hidden by the assumed CDF (= CDF of the t-distribution)
### True distribution of the Wilcoxon test statistic compared to the assumed one (normal)
### ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
wgrid <- seq(-4, 4, length = 500)
### Histogram vs. density (red) and kernel estimate (blue) of the density
layout(matrix(c(0,1,1,2,2,0, 3,3,4,4,5,5), nrow = 2, byrow = TRUE))
for (j in 1:length(n)){
fw <- dnorm(wgrid)
kern <- density(wstat[, j])
hist(wstat[, j], prob = TRUE, col = "burlywood", xlab = "W (standardized)", ylab = "Density", main = paste("n =", n[j]), ylim = c(0, 0.45), xlim = range(wstat))
lines(kern$x, kern$y, col = "blue3", lwd = 2)
lines(wgrid, fw, col = "red3", lwd = 2)
}
par(mfrow = c(1, 1))
### Empirical vs. theoretical CDF
layout(matrix(c(0,1,1,2,2,0, 3,3,4,4,5,5), nrow = 2, byrow = TRUE))
for (j in 1:length(n)){
Fw <- pnorm(wgrid)
plot(ecdf(wstat[, j]), do.points = FALSE, col = "blue", xlab = "W (standardized)", ylab = "CDF", main = paste("n =", n[j]), ylim = c(0, 1), xlim = range(wstat))
lines(wgrid, Fw, col = "red3", lwd = 2)
}
par(mfrow = c(1, 1))
##### =============================================================================================
#####
##### Simulation based critical values: It may sometimes happen that we deal with a test for which
##### we have no idea about a distribution of the test statistic under the null hypothesis.
##### Behavior of critical values can then be again explored by a simulation study (performed under
##### scenarios that correspond to the null hypothesis). We only need to know in advance how to define
##### a critical region. For example here, let the critical region be defined to cover those values
##### of the test statistic T (or W) that are "large" in absolute value. That is, the critical region
##### is {t: |t| > c(alpha)}, where c(alpha) is a constant which ensures that probability of type I error
##### is alpha. That is, we need P(|T| > c(alpha); H0) = alpha. Now, with simulation, we have
##### a sample T[1], ..., T[M] from the distribution of T under the null hypothesis. So how do we estimate
##### c(alpha)? Yes, by the 1 - alpha sample quantile of the sample |T[1]|, ..., |T[M]|.
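##### A toy check of this idea in a case where the answer is known: for |Z| with
##### Z standard normal, the simulated 0.95 quantile should approach qnorm(0.975).

```r
set.seed(1)
Z <- rnorm(100000)
quantile(abs(Z), 0.95)   ## simulation-based critical value
qnorm(0.975)             ## exact value it should approach (about 1.96)
```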
### Simulation based critical values of the tests (5% sig. level) for ordinal data (under our scenario!)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(Ct.simul <- apply(abs(tstat), 2, quantile, 0.95))
print(Ct) ## assumed one for comparison
(Cw.simul <- apply(abs(wstat), 2, quantile, 0.95))
print(Cw) ## assumed one for comparison
##### ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#####
##### *** TIME TO HAVE A BREAK? ***
#####
##### ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
##### ===================================================================================================================
##### Once we have a test which provides the correct probability of type I error, the next property to explore
##### is its power. In principle, a test is good for nothing if its power is poor, irrespective of the fact
##### that the type I error is properly controlled.
#####
##### When power is calculated, the alternative hypothesis must always be fully specified. In fact,
##### the alternative determines distribution of data (SCENARIO) for which we calculate probability (= power)
##### that the test rejects the null hypothesis. In other words, the same steps as above must now be conducted.
##### The only difference is that the data must be generated using a scenario that corresponds to the alternative hypothesis.
##### ===================================================================================================================
##### --------------------------------------------------
##### Simulation towards the power
##### and towards distribution of the test statistic
##### under the alternative hypothesis
##### --------------------------------------------------
### Distribution in the second group (different from the distribution in the first group)
### ++++++++++++++++++++++++++++++++++
p2 <- c(0.02, 0.03, 0.15, 0.30, 0.50)
#p2 <- rep(0.20, 5)
### Compare the two distributions (our scenarios)
### +++++++++++++++++++++++++++++++++++++++++++++++
par(mfrow = c(1, 2))
plot(val, p, type = "h", xlab = "Value", ylab = "Probability", col = "red3", lwd = 3, main = "Group 1")
plot(val, p2, type = "h", xlab = "Value", ylab = "Probability", col = "blue2", lwd = 3, main = "Group 2")
par(mfrow = c(1, 1))
(EY <- c(sum(p*val), sum(p2*val))) ## true means in the two groups
(varY <- c(sum(p*(val - EY[1])^2), sum(p2*(val - EY[2])^2))) ## true variances
(sdY <- sqrt(varY)) ## true standard deviations
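### Under the CLT, the power of the two-sided level-0.05 t-test can be roughly
### approximated from these true moments (a benchmark only; it uses the normal
### approximation and ignores the estimation of the variance):

```r
## (restating quantities from above so this chunk is standalone)
val <- 1:5; n <- c(10, 20, 50, 100, 500)
p  <- c(0.50, 0.30, 0.15, 0.03, 0.02)
p2 <- c(0.02, 0.03, 0.15, 0.30, 0.50)
EY <- c(sum(p * val), sum(p2 * val))
varY <- c(sum(p * (val - EY[1])^2), sum(p2 * (val - EY[2])^2))
delta <- EY[2] - EY[1]                      ## true difference in means
seDiff <- sqrt(varY[1] / n + varY[2] / n)   ## sd of the difference of sample means
round(pnorm(delta / seDiff - qnorm(0.975)), 4)   ## approximate power for each n
```

### With such different distributions, the approximation already suggests
### power close to 1 even for n = 10, which the simulation below confirms.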
### Main simulation loop
### ++++++++++++++++++++++++
###
### This is the same as in the previous case.
### The only difference is the way in which the data are generated.
tstat2 <- wstat2 <- matrix(nrow = M, ncol = length(n)) ## matrix to store calculated test statistics
colnames(tstat2) <- colnames(wstat2) <- n
#j <- 2; m <- 1
for (j in 1:length(n)){
set.seed(19730911) ## to be able to reproduce the results
EW <- n[j] * n[j] / 2 ## H0 mean of Mann-Whitney (shifted Wilcoxon) statistic
for (m in 1:M){
y1 <- sample(x = val, size = n[j], replace = TRUE, prob = p) ## random sample, group 1
y2 <- sample(x = val, size = n[j], replace = TRUE, prob = p2) ## random sample, group 2
tt <- t.test(y1, y2, var.equal = TRUE) ## two-sample t-test, equal-variances
tstat2[m, j] <- tt$statistic
R <- rank(c(y1, y2))
nTies <- table(R)
varCorrectWithTies <- sum(nTies^3 - nTies) / (2*n[j] * (2*n[j] - 1))
varW <- (n[j] * n[j] / 12) * (2*n[j] + 1 - varCorrectWithTies) ## H0 variance of Mann-Whitney statistic corrected for ties
wt <- wilcox.test(y1, y2, exact = FALSE, correct = TRUE) ## asymptotic Wilcoxon test, continuity corrections
wstat2[m, j] <- (wt$statistic - EW - 0.5*sign(wt$statistic - EW)) / sqrt(varW)
}
}
tstat2[,1] ### random sample from the distribution of the t-test statistic with sample size n = 10,
### under the condition that the two samples are distributed as specified above
tstat2[1:10,] ## What are those numbers?
wstat2[1:10,]
save(list = c("tstat2", "wstat2"), file = paste(ROOT, "/SimResult/sim_tw2_ordinal.RData", sep = ""))
#
(load(paste(ROOT, "/SimResult/sim_tw2_ordinal.RData", sep = "")))
(M <- nrow(tstat2))
### Powers of the t-tests (their MC estimates)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(Ct <- qt(0.975, df = 2*n - 2)) ## Critical values that determine when the test rejects
Ctmat <- matrix(rep(Ct, each = M), nrow = M) ## 'technical' matrix which replicates the critical values
Ctmat[1:10,]
#
(nReject.t2 <- apply(abs(tstat2) > Ctmat, 2, sum) ) ## Number of rejections by t-test for each sample size.
(beta.t <- nReject.t2 / M) ## Power calculated by simulation.
## With the two quite different distributions being compared, the power is practically 1
## even for the smallest sample size.
(MCE.beta.t <- sqrt(beta.t * (1 - beta.t) / M)) ## Monte Carlo error
### Powers of the Wilcoxon tests (their MC estimates)
### +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(Cw <- qnorm(0.975)) ## What is this number?
#
(nReject.w2 <- apply(abs(wstat2) > Cw, 2, sum) ) ## Number of rejections by Wilcoxon test for each sample size.
(beta.w <- nReject.w2 / M) ## Power calculated by simulation.
(MCE.beta.w <- sqrt(beta.w * (1 - beta.w) / M)) ## Monte Carlo error