1. Wishart distribution
In general, the Wishart distribution is a distribution of random (symmetric and positive semi-definite) matrices. For a normally distributed random sample \(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n}\) drawn from the multivariate distribution \(N_{p}(\boldsymbol{0}, \Sigma)\), with the zero mean vector \(\boldsymbol{0} \in \mathbb{R}^{p}\) and a variance-covariance matrix \(\Sigma\), we have that \[
\mathbb{X}^{\top}\mathbb{X} \sim W_{p}(\Sigma, n),
\] for \(\mathbb{X} = (\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n})^{\top}\). The Wishart distribution can be considered a multivariate generalization of the univariate \(\chi^2\) distribution (for a sample of size \(n \in \mathbb{N}\) drawn from \(N(0,1)\) the corresponding distribution of \(\mathbb{X}^\top\mathbb{X}\) is \(W_{1}(1, n) \equiv \chi_{n}^2\)).
Try by yourself
- use the help session in R to find out more about the classical commands for the Wishart distribution (in particular rWishart());
- some other commands and additional extensions are available in packages which need to be downloaded and installed separately, e.g. the function Wishart() in the library ‘MCMCpack’, Wishart() in the library ‘mixAK’, or rwishart() in the library ‘dlm’, among others;
- try to use some of the commands mentioned above.
A simple random sample from the Wishart distribution \(W_{1}(\Sigma = 1, n = 10)\) (equivalently, from the \(\chi^2\) distribution with \(n = 10\) degrees of freedom) can be obtained as
S <- as.matrix(1)                   ### 1 x 1 scale matrix (Sigma = 1)
sample <- rWishart(10, df = 10, S)  ### ten draws from W_1(1, 10)
For a large number of simulated values we can compare the two distributions using histograms:
set.seed(1234)
sampleWishart <- rWishart(5000, df = 10, S)  ### 5000 draws from W_1(1, 10)
sampleChiSq <- rchisq(5000, df = 10)         ### 5000 draws from chi^2 with 10 df
par(mfrow = c(1,2))
hist(sampleWishart, col = "lightblue", main = "Wishart Distribution", xlim = c(0, 40), breaks = 30, freq = F)
lines(density(sampleWishart), col = "red", lwd = 2)
hist(sampleChiSq, col = "lightblue", main = expression(paste(chi^2, " Distribution")), xlim = c(0, 40), breaks = 30, freq = F)
lines(density(sampleChiSq), col = "red", lwd = 2)
Try by yourself
- consider the Wishart distribution for some \(p \in \mathbb{N}\) such that \(p > 1\);
- generate a sample from the Wishart distribution \(W_{p}\) with some reasonable parameters \(\Sigma\) and \(df\);
- consider some general matrix \(A \in \mathbb{R}^{p \times q}\) for some \(q \in \mathbb{N}\) such that \(p \neq q\) and verify (e.g. by histograms) that \(A^\top \mathbb{M} A \sim W_{q}(A^\top\Sigma A, n)\), where \(\mathbb{M} \sim W_{p}(\Sigma, n)\) (a possible sketch follows the list).
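A possible sketch for the last point, with hypothetical choices \(p = 3\), \(q = 2\), \(n = 10\), \(\Sigma = \mathbb{I}\), and one specific matrix \(A\), compares the histograms of one fixed element of the transformed matrices with the corresponding element of matrices drawn directly from \(W_{q}(A^\top\Sigma A, n)\):
set.seed(1234)
p <- 3; q <- 2; n <- 10; B <- 5000
Sigma <- diag(p)
A <- matrix(c(1, 0, 1, 0, 1, 1), nrow = p, ncol = q)  ### some arbitrary p x q matrix
M <- rWishart(B, df = n, Sigma = Sigma)               ### p x p x B array of W_p(Sigma, n) draws
transformed <- apply(M, 3, function(m) (t(A) %*% m %*% A)[1, 1])
direct <- rWishart(B, df = n, Sigma = t(A) %*% Sigma %*% A)[1, 1, ]
par(mfrow = c(1, 2))
hist(transformed, col = "lightblue", breaks = 30, freq = F, main = "Element [1,1] of t(A) M A")
hist(direct, col = "lightblue", breaks = 30, freq = F, main = "Element [1,1] of W_q(t(A) Sigma A, n)")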
2. Hotelling’s \(\boldsymbol{T^2}\) Distribution
In a similar manner as the classical \(t\)-distribution is defined in the univariate case (i.e. a standard normal \(N(0,1)\) variable divided by the square root of a \(\chi^2\) variable normalized by its degrees of freedom) we define its multivariate generalization, the Hotelling’s \(T^{2}\) distribution, via \[
n \boldsymbol{Y}^{\top} \mathbb{M}^{-1} \boldsymbol{Y} \sim T^{2}(p, n),
\] where \(p \in \mathbb{N}\) is the dimension of \(\boldsymbol{Y} \sim N_{p}(\boldsymbol{0}, \mathbb{I})\) and \(n \in \mathbb{N}\) is the degrees-of-freedom parameter of the Wishart distribution of \(\mathbb{M} \sim W_{p}(\mathbb{I}, n)\).
A special case for \(p = 1\) gives the standard Fisher distribution with one and \(n\) degrees of freedom (equivalently, the square of the \(t\)-distribution with \(n\) degrees of freedom). The Hotelling’s \(T^2\) distribution with parameters \(p, n \in \mathbb{N}\) can be identified with the Fisher distribution using the following expression \[
T^{2}(p, n) \equiv \frac{n p}{n - p + 1}F_{p, n - p + 1}.
\]
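This identity can also be checked empirically. A small simulation sketch, with hypothetical choices \(p = 3\) and \(n = 10\), compares simulated values of \(n \boldsymbol{Y}^\top \mathbb{M}^{-1} \boldsymbol{Y}\) with draws from the rescaled \(F\)-distribution:
set.seed(1234)
p <- 3; n <- 10; B <- 5000
T2sample <- replicate(B, {
  Y <- rnorm(p)                                   ### Y ~ N_p(0, I)
  M <- rWishart(1, df = n, Sigma = diag(p))[,,1]  ### M ~ W_p(I, n)
  n * t(Y) %*% solve(M) %*% Y
})
Fsample <- (n * p / (n - p + 1)) * rf(B, df1 = p, df2 = n - p + 1)
qqplot(T2sample, Fsample)  ### the points should roughly follow the diagonal
abline(0, 1, col = "red")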
The effect of different parameter settings can be visualized, using this \(F\)-representation, in the following figure:
### 1000 draws from the F-distribution for each combination of df1 and df2
samples <- NULL
labs <- NULL
for (df1 in c(1, 10, 100)) {        ### numerator degrees of freedom
  for (df2 in c(10, 100, 1000)) {   ### denominator degrees of freedom
    samples <- cbind(samples, rf(n = 1000, df1 = df1, df2 = df2))
    labs <- c(labs, paste("F(", df1, ", ", df2, ")", sep = ""))
  }
}
par(mfrow = c(3,3))
for (i in 1:9){
  hist(samples[,i], xlab = labs[i], col = "yellow", main = "", freq = F)
  lines(density(samples[,i]), col = "red")
}
Moreover, using the relationship between the Hotelling’s \(T^2\) distribution and the Fisher’s \(F\)-distribution, we can use the \(F\)-distribution to obtain the corresponding critical values whenever these are needed for statistical tests.
For some multivariate random sample \(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n} \sim N_{p}(\boldsymbol{\mu}, \Sigma)\) for some unknown mean vector \(\boldsymbol{\mu} \in \mathbb{R}^{p}\) and some variance-covariance matrix \(\Sigma\), we have that \[
(n - 1)\Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}\Big)^\top \mathcal{S}^{-1} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}\Big) \sim T^2(p, n - 1),
\] which can also be equivalently expressed as \[
\frac{n - p}{p} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}\Big)^\top \mathcal{S}^{-1} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}\Big) \sim F_{p, n - p}.
\] This can now be used to construct confidence regions for the unknown mean vector \(\boldsymbol{\mu}\) or to test hypotheses about the true value of the parameter vector \(\boldsymbol{\mu} = (\mu_{1}, \dots, \mu_{p})^{\top}\). However, rather than constructing a confidence region for \(\boldsymbol{\mu}\) (which can be impractical even for slightly larger values of \(p\)), one usually focuses on constructing confidence intervals for the elements of \(\boldsymbol{\mu}\) such that the mutual coverage is under control (usually we require a simultaneous coverage of \((1 - \alpha)\times 100~\%\) for some small \(\alpha \in (0,1)\)).
For a hypothesis test
\[
H_{0}: \boldsymbol{\mu} = \boldsymbol{\mu}_{0} \quad \textrm{against} \quad H_{1}: \boldsymbol{\mu} \neq \boldsymbol{\mu}_{0},
\]
for some given \(\boldsymbol{\mu}_{0} \in \mathbb{R}^{p}\),
we can use the following test statistic:
\[
(n - 1)\Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}_{0}\Big)^\top \mathcal{S}^{-1} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}_{0}\Big),
\]
which, under the null hypothesis, follows the \(T^2(p, n - 1)\) distribution. Equivalently we also have that
\[
\frac{n - p}{p} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}_{0}\Big)^\top \mathcal{S}^{-1} \Big(\overline{\boldsymbol{X}} - \boldsymbol{\mu}_{0}\Big)
\]
follows the Fisher \(F\)-distribution with \(p\) and \(n - p\) degrees of freedom.
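For illustration, the test can also be carried out by hand. The following sketch uses a simulated sample with hypothetical values \(n = 50\), \(p = 3\), and \(\boldsymbol{\mu}_{0} = \boldsymbol{0}\); the sample variance-covariance matrix \(\mathcal{S}\) is taken with the divisor \(n\), which matches the formulas above:
library(MASS)  ### for mvrnorm()
set.seed(42)
n <- 50; p <- 3
mu0 <- rep(0, p)                                  ### hypothesized mean vector
X <- mvrnorm(n, mu = rep(0, p), Sigma = diag(p))  ### sample from N_p(0, I)
xbar <- colMeans(X)
S <- ((n - 1) / n) * cov(X)                       ### variance-covariance estimate with divisor n
Fstat <- drop(((n - p) / p) * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0))
c(statistic = Fstat, critical = qf(0.95, p, n - p),
  pvalue = pf(Fstat, p, n - p, lower.tail = FALSE))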
In an analogous way, one can also construct a two-sample Hotelling test to compare two population means.
Try by yourself
- download and install the R libraries ‘Hotelling’ and ‘DescTools’;
- for one-sample Hotelling tests use the command HotellingsT2() from the ‘DescTools’ package;
- for two-sample Hotelling tests use the command hotelling.test() from the ‘Hotelling’ package;
- use the help session to find out more about the commands hotelling.stat() and hotelling.test(), which can be used to test the null hypothesis about the true value of the mean vector.
library("Hotelling")
There is a dataset called container.df available in the package ‘Hotelling’. The data represent concentration measurements of nine different elements (see the help session ?container.df for further details).
data(container.df)
summary(container.df)
## gp Ti Al Fe
## Min. :1.0 Min. :0.02800 Min. :0.6590 Min. :0.02700
## 1st Qu.:1.0 1st Qu.:0.03175 1st Qu.:0.7405 1st Qu.:0.02975
## Median :1.5 Median :0.03300 Median :0.7850 Median :0.06950
## Mean :1.5 Mean :0.03350 Mean :0.8047 Mean :0.07665
## 3rd Qu.:2.0 3rd Qu.:0.03700 3rd Qu.:0.8730 3rd Qu.:0.12200
## Max. :2.0 Max. :0.03900 Max. :0.9410 Max. :0.16800
## Mn Mg Ca Ba
## Min. :0.0020 Min. :0.0900 Min. :5.511 Min. :0.00900
## 1st Qu.:0.0020 1st Qu.:0.1130 1st Qu.:6.422 1st Qu.:0.01200
## Median :0.0030 Median :0.1595 Median :6.989 Median :0.01550
## Mean :0.0034 Mean :0.1759 Mean :7.001 Mean :0.01615
## 3rd Qu.:0.0050 3rd Qu.:0.2362 3rd Qu.:7.578 3rd Qu.:0.02000
## Max. :0.0050 Max. :0.2700 Max. :7.780 Max. :0.02300
## Sr Zr
## Min. :0.0130 Min. :0.00900
## 1st Qu.:0.0150 1st Qu.:0.01000
## Median :0.0160 Median :0.01150
## Mean :0.0162 Mean :0.01125
## 3rd Qu.:0.0170 3rd Qu.:0.01200
## Max. :0.0190 Max. :0.01400
The estimates for the mean vector and variance-covariance matrix can be obtained as
apply(container.df[,2:10], 2, mean) ### mean vector estimate
## Ti Al Fe Mn Mg Ca Ba Sr Zr
## 0.03350 0.80470 0.07665 0.00340 0.17590 7.00055 0.01615 0.01620 0.01125
cov(container.df[,2:10]) ### variance-covariance estimate
## Ti Al Fe Mn Mg
## Ti 1.194737e-05 2.289474e-04 1.527632e-04 4.421053e-06 1.975789e-04
## Al 2.289474e-04 7.775589e-03 3.868363e-03 1.112842e-04 5.884863e-03
## Fe 1.527632e-04 3.868363e-03 2.465608e-03 7.072632e-05 3.294963e-03
## Mn 4.421053e-06 1.112842e-04 7.072632e-05 2.147368e-06 9.583158e-05
## Mg 1.975789e-04 5.884863e-03 3.294963e-03 9.583158e-05 4.784832e-03
## Ca 6.710526e-06 2.654805e-03 2.015711e-04 1.297895e-05 -4.767895e-05
## Ba 1.355263e-05 3.768895e-04 2.209500e-04 6.410526e-06 3.091211e-04
## Sr -2.105263e-07 1.484211e-06 -1.876842e-05 -6.631579e-07 -1.371579e-05
## Zr 1.921053e-06 6.907895e-05 2.451316e-05 6.842105e-07 4.423684e-05
## Ca Ba Sr Zr
## Ti 6.710526e-06 1.355263e-05 -2.105263e-07 1.921053e-06
## Al 2.654805e-03 3.768895e-04 1.484211e-06 6.907895e-05
## Fe 2.015711e-04 2.209500e-04 -1.876842e-05 2.451316e-05
## Mn 1.297895e-05 6.410526e-06 -6.631579e-07 6.842105e-07
## Mg -4.767895e-05 3.091211e-04 -1.371579e-05 4.423684e-05
## Ca 4.574089e-01 5.564921e-04 -3.695368e-04 -1.588158e-05
## Ba 5.564921e-04 2.139737e-05 -1.821053e-06 2.644737e-06
## Sr -3.695368e-04 -1.821053e-06 2.484211e-06 1.210526e-06
## Zr -1.588158e-05 2.644737e-06 1.210526e-06 1.671053e-06
and we can test whether the mean vectors of concentrations in the first container (gp == 1) and the second container (gp == 2) are the same or not. Try to perform a one-sample Hotelling test of whether the unknown vector of mean parameters equals some prespecified vector \(\boldsymbol{\mu}_{0} \in \mathbb{R}^{p}\); a possible sketch follows.
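One possible sketch, assuming the one-sample command from the ‘DescTools’ package mentioned above (exported as HotellingsT2Test() in current versions of the package) and a hypothetical \(\boldsymbol{\mu}_{0}\) given by the rounded sample means:
library("DescTools")
mu0 <- round(apply(container.df[, 2:10], 2, mean), 2)  ### hypothetical mu_0
HotellingsT2Test(container.df[, 2:10], mu = mu0)       ### one-sample Hotelling test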
For the two-sample problem we can use the first covariate in the data to define the two populations and test whether the two mean vectors are equal or not. The corresponding Hotelling test statistic is computed as
split.data <- split(container.df[,-1], container.df$gp)  ### split observations by group
x <- split.data[[1]]
y <- split.data[[2]]
statistic <- hotelling.stat(x, y)
and the corresponding test is performed by
hotellingTest <- hotelling.test(.~gp, data = container.df)
hotellingTest
## Test stat: 126.11
## Numerator df: 9
## Denominator df: 10
## P-value: 4.233e-09
Questions
- What is the interpretation of the test results?
- Try to perform simple two-sample \(t\)-test(s). What are the conclusions now?
- What would be the corresponding simultaneous confidence intervals for the elements of the mean vector in this example?
The corresponding simultaneous confidence intervals for all possible linear combinations \(\boldsymbol{a}^\top \boldsymbol{\mu}\) of the mean vector \(\boldsymbol{\mu} \in \mathbb{R}^{p}\) are given by
\[
P\Big(\forall \boldsymbol{a} \in \mathbb{R}^{p};~ \boldsymbol{a}^\top \boldsymbol{\mu} \in \big(
\boldsymbol{a}^\top\overline{\boldsymbol{X}} - \sqrt{K_{\alpha} \boldsymbol{a}^\top \mathcal{S} \boldsymbol{a}}, \boldsymbol{a}^\top\overline{\boldsymbol{X}} + \sqrt{K_{\alpha} \boldsymbol{a}^\top \mathcal{S} \boldsymbol{a}} \big)
\Big) = 1 - \alpha,
\] where \(K_{\alpha}\) is the corresponding quantile obtained from the Fisher \(F\) distribution (\(K_{\alpha} = \frac{p}{n - p} F_{p, n - p}(1 - \alpha)\)) and \(\mathcal{S}\) is the sample variance-covariance matrix.
Fquant <- (9/(20 - 9)) * qf(1 - 0.05, 9, 11)  ### K_alpha with p = 9, n = 20
meanVec <- apply(container.df[,2:10], 2, mean)
### element-wise intervals: the rows of diag(9) are the canonical vectors a = e_i
### (note that cov() returns the estimate with the divisor n - 1)
LB <- diag(9) %*% meanVec - sqrt(Fquant * diag(diag(9) %*% cov(container.df[,2:10]) %*% diag(9)))
UB <- diag(9) %*% meanVec + sqrt(Fquant * diag(diag(9) %*% cov(container.df[,2:10]) %*% diag(9)))
confInt <- data.frame(LB, UB, meanVec, row.names = names(container.df)[2:10])
names(confInt) <- c("Lower Boundary", "Upper Boundary", "Mean Estimate")
confInt
## Lower Boundary Upper Boundary Mean Estimate
## Ti 0.0281791989 0.038820801 0.03350
## Al 0.6689600906 0.940439909 0.80470
## Fe 0.0002131292 0.153086871 0.07665
## Mn 0.0011442333 0.005655767 0.00340
## Mg 0.0694184851 0.282381515 0.17590
## Ca 5.9594482060 8.041651794 7.00055
## Ba 0.0090293265 0.023270674 0.01615
## Sr 0.0137737525 0.018626247 0.01620
## Zr 0.0092600784 0.013239922 0.01125
Try by yourself
- Consider some other linear combination for the simultaneous confidence intervals (a sketch for one such combination follows the list).
- Define the necessary linear combinations in order to obtain all pair-wise mean comparisons and their corresponding \(95~\%\) simultaneous confidence intervals.
- What approach should be used to construct a confidence region (ellipsoid)?
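For the first point, a sketch of the simultaneous interval for one hypothetical linear combination, here \(\boldsymbol{a}\) picking the difference between the Fe and Ti mean concentrations (with cov() used in the same way as in the code above), could read:
X <- container.df[, 2:10]
n <- nrow(X); p <- ncol(X)
a <- rep(0, p); a[3] <- 1; a[1] <- -1  ### a corresponds to mu(Fe) - mu(Ti)
Kalpha <- (p / (n - p)) * qf(1 - 0.05, p, n - p)
est <- sum(a * colMeans(X))
half <- drop(sqrt(Kalpha * t(a) %*% cov(X) %*% a))
c("Lower Boundary" = est - half, "Mean Estimate" = est, "Upper Boundary" = est + half)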