NMST539, LS 2015/16

Cvičenie 2 (týždeň 3)

(Mnohorozmeré normálne rozdelenie)

(R package mvtnorm)


To begin, we need to make sure that the corresponding library is loaded in the R working environment (use commad library("mvtnorm")). If the library can not be loaded we need to install it first by using command install.packages("mvtnorm").

library("mvtnorm")

Using the loaded library one can easily generate random values from some multivariate normal distribution.


Comment


Once we are about to use some random generator in R (or any other software) it is good to predefine the initial values for the generator (in R there is a command called set.seed() which can be used for this). By doing so one makes sure that the results are all reconstructable.


Let us start with some very simple (two-dimensional only) example. We will generate a sample from the distribution

\(N_{2}\Big(\boldsymbol{\mu} = \Big(\begin{array}{c} 2 \\ 3\end{array}\Big), \Sigma = \Big( \begin{array}{cc} 10^2 & 6^2 \\6^2 & 6^2 \end{array} \Big) \Big)\)

where \(\Sigma\) is the variance-covariance matrix. Remember, that in case of a uniform random generator (command rnorm()) one needs to specify standard error instead. The sample of size \(n = 1000\) can be generated by

set.seed(1111)
s1 <- rmvnorm(1000, c(2, 3), matrix(c(10^2, 6^2, 6^2, 6^2),2,2))

And we can plot the generated random sample into a scatterplot using the following command:

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Covariate 1", ylab="Covariate 2", bg = "lightblue")

Using now sample mean estimates and sample variance-covariance matrix we can easily draw some ‘depth’ contours in the figure. The other option is to use theoretical values as we know them as well however, for the following we will rather stick with the sample only and we use sample mean vector and sample variance-covariance matrix.

center <- apply(s1, 2, mean) ## sample mean vector
sigma <- cov(s1)             ## sample covariance matrix

The two dimensional normal density function is given by

\(\large{f(\boldsymbol{x}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp\Big\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \Big\},}\)

where \(\boldsymbol{x} = (x_1, x_2)^{\top}\) and \(\boldsymbol{\mu} = (\mu_{1}, \mu_2)^{\top}\). Thus, we also need the inverse matrix for the sample variance-covariance matrix \(\hat{\Sigma}\) to be able to draw the corresponding contours.

The fuction called ellipse() in the following code is responsible for calculating the value of the two dimensional normal density function when plugging in the appropriate sample estimates.

sigma.inv = solve(sigma, matrix(c(1,0,0,1),2,2))
ellipse <- function(s,t) {u<-c(s,t)-center; e <- (-1) * (u %*% sigma.inv %*% u) / 2; exp(e)/(2 * pi * sqrt(det(sigma.inv)))}

And now we can draw contours

n <- 60
x <- (0:(n-1))*2 - 50
y <- (0:(n-1))*2 - 50
z <- mapply(ellipse, as.vector(rep(x,n)), as.vector(outer(rep(0,n), y, `+`)))
plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Covariate 1", ylab="Covariate 2", bg = "lightgray")
contour(x,y,matrix(z,n,n), levels=(0:15), col = terrain.colors(16), add=TRUE, lwd = 2)

Try by yourselves


  1. Try various different settings for the parameters defining the two dimensional normal distribution used above and see what happens with the distribution of the points, rotations of the structure and the corresponding countours.

  2. Use two different samples each simulated from the two dimensional normal distribution with different parameters and draw the samples with the corresponding countours into one plot only. Again try to use different parameter settings to see what happens.

  3. Explain what is the natural interpretation of the parameters used in the models.

Comment


There is also a predefined function in R called ellipse() which can be found in the R package mixtools. Install the library using install.packages(mixtools) and try to following piece of code:

rm(list = ls())
library(mixtools)
library(mvtnorm) 
set.seed(1111)
p <- rmvnorm(1000, c(250000, 20000), matrix(c(100000^2, 22000^2, 22000^2, 6000^2),2,2))
plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Packets", ylab="Flows")
ellipse(mu=colMeans(p), sigma=cov(p), alpha = .05, npoints = 250, col="red") 

How is the the elipse actually calculated?

alpha <- 0.05
angles <- seq(0, 2*pi, length = 200)
cpoints <- cbind(cos(angles), sin(angles))

q <- sqrt(qchisq(1 - alpha, df = 2))
mu <- colMeans(p)
sigma <- cov(p)
sigmaEig <- eigen(sigma)
sigmaSqr <- sigmaEig$vectors %*% diag(sqrt(sigmaEig$values)) %*% t(sigmaEig$vectors)
Epoints <-  q * (cpoints %*% sigmaSqr)
Epoints[,1] <- Epoints[,1] + mu[1]
Epoints[,2] <- Epoints[,2] + mu[2]

plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Packets", ylab="Flows")
points(Epoints[,1], Epoints[,2], col = "red", pch = "*")



Another option for simulating multivariate normal sample is available in the library MASS by using command mvrnorm():

library(MASS)
bivn1 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn2 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1.5, 1.5, 1.5, 1.5), 2))

# now we do a kernel density estimate
bivn.kde1 <- kde2d(bivn1[,1], bivn1[,2], n = 50)
bivn.kde2 <- kde2d(bivn2[,1], bivn2[,2], n = 50)

# fancy perspective
persp(bivn.kde1, phi = 45, theta = 30, shade = .1, border = NA)
par(new=TRUE)
persp(bivn.kde2, phi = 45, theta = 30, shade = .1, border = NA)

Two dimensional normal from one dimensional

Finally, we can generate a ‘two dimensional normal’ from one dimensional by using some reasonable transformation but one can not directly control for the covariance of the two covariates being generated.

library(ggplot2)
rm(list = ls())
set.seed(1111)
n <- 1000
sample1 <- rnorm(n, mean=2, sd = 2)
sample2 <- 0.2 * sample1 + rnorm(n)
data <- data.frame(x = sample1, y = sample2, group="A")
sample3 <- rnorm(n, mean=0, sd = 1)
sample4 <- 3*sample3 - 1 + rnorm(n)
data <- rbind(data, data.frame(x = sample3, y = sample4, group="B"))
ggplot(data, aes(x=x, y=y,colour=group)) + geom_point(size=1.5, alpha=.6) 



Domáca úloha

(Deadline: piaty týždeň / 21.03.2016)

Na webovej stránke Doc. Hlávku

http://www1.karlin.mff.cuni.cz/~hlavka/teac.html

je k dispozícii niekoľko dátových súborov, ktoré do Rka stačí načítať pomocou príkazu load(nazov_suboru.rda).

  • vyberte si jeden dátový súbor, u ktorého je možne predpokladať aspoň dve premenné, ktoré by mohli mať normálne rozdelenie (napr. odhadnúť pomocou histogramu) načítajte ho v Rku;
  • do již vytvoreného Rnw súboru, pridajte aspoň jeden obrázok, ktorý bude obsahovať scatterplot oboch proměnných a do obrázku taktiež vytvorte príslušné kontúry;
  • príslušný obrázok dostatočne okomentujte, aby bolo jasné, čo je v ňom zobrazené;