NMST539 | Lab Session 2

Multivariate Normal Distribution

LS 2017 | Thursday 02/03/17

Rmd file (UTF8 coding)

For the second NMST539 lab session we will explore some tools, available within the R software installation, that are meant for working with the multivariate normal distribution.

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)


R package ‘mvtnorm’

To begin, we need to make sure that the library for the multivariate normal distribution (mvtnorm) is loaded in the R working environment (use the command library("mvtnorm")). If the library cannot be loaded, we need to install it first by running the command install.packages("mvtnorm").

library("mvtnorm")

We can now use the loaded library to easily generate random values from a multivariate normal distribution with a pre-specified mean vector and variance-covariance matrix \(\Sigma\).

For more details about the library (including the manual document) see the library website on R cran:
https://cran.r-project.org/web/packages/mvtnorm/index.html



Comment


It is useful to keep in mind that every time we are about to use a random number generator in R (or any other software) it is good to predefine the initial state of the generator (in R the command set.seed() does the job). By doing so one makes sure that the obtained results are reproducible and can be easily verified.

For instance, try the command:

(sample <- rnorm(10))
##  [1] -0.7465914  1.5746143 -0.0272531  0.4167468  1.2033093  0.7232020
##  [7]  0.4029413  0.7172862  2.1043320  0.9213117

and compare it with the same command, this time with the initial state of the generator set by the set.seed(1234) command (any numerical value can be used as the argument):

set.seed(1234)
(sample <- rnorm(10))
##  [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
##  [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378
Is it clear what the difference between these two outputs is?
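A minimal check of this behaviour (a sketch, using base R only): resetting the seed before each call reproduces the very same sample.

```r
set.seed(1234)
a <- rnorm(10)        ## first sample after setting the seed
set.seed(1234)
b <- rnorm(10)        ## resetting the seed gives the identical sample
identical(a, b)       ## TRUE
```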



Let us start with a simple (two-dimensional only) example: we will generate a random sample of size \(n \in \mathbb{N}\) from the two-dimensional normal distribution

\(N_{2}\Big(\boldsymbol{\mu} = \Big(\begin{array}{c}2\\3\end{array}\Big), \Sigma = \Big(\begin{array}{cc}10^2 & 6^2\\ 6^2 & 6^2\end{array}\Big) \Big)\).

Remember that in case of the one-dimensional random generator for the normal distribution in R (command rnorm()) one needs to specify the standard deviation instead of the variance (compare the help sessions ?rnorm and ?rmvnorm).
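The difference in parametrization can be illustrated with a short sketch (the sample size 100000 and the variance 4 are arbitrary choices; both estimated variances should come out close to 4):

```r
library("mvtnorm")
set.seed(1234)
## rnorm() is parametrized by the standard deviation ...
x1 <- rnorm(100000, mean = 2, sd = 2)               ## sd = 2, i.e. variance 4
## ... while rmvnorm() is parametrized by the (co)variance matrix
x2 <- rmvnorm(100000, mean = 2, sigma = matrix(4))  ## variance 4 directly
c(var(x1), var(as.vector(x2)))                      ## both close to 4
```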

The sample of size \(n=1000\) from the given two-dimensional normal distribution can be generated by the following command:

set.seed(1234)
s1 <- rmvnorm(1000, c(2, 3), matrix(c(10^2, 6^2, 6^2, 6^2),2,2))

And we can plot the generated random sample as a scatterplot using, for instance, the command

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Margin No. 1", ylab="Margin No. 2", bg = "lightblue")

Using the sample mean estimates and the sample variance-covariance matrix we can easily draw some ‘depth’ contours in the figure. Another option, of course, is to use the theoretical values, as we know the distribution we generated from (which is usually not the case in real-life situations). In the following, however, we will use the sample and calculate the sample mean vector and the sample variance-covariance matrix.

center <- apply(s1, 2, mean)   ## sample mean vector
sigma <- cov(s1)               ## sample variance-covariance matrix

The two dimensional normal density function is given by

\(f(\boldsymbol{x}) = f(x_{1}, x_{2}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp \left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\}\),

where \(\boldsymbol{x} = (x_{1}, x_{2})^\top \in \mathbb{R}^2\) and \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2})^\top\) is the mean vector.
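As a sanity check, the explicit formula can be compared with the dmvnorm() function from the mvtnorm package (a sketch; the evaluation point x = (0, 0) and the parameters from the simulation above are arbitrary choices):

```r
library("mvtnorm")
mu <- c(2, 3)
Sigma <- matrix(c(10^2, 6^2, 6^2, 6^2), 2, 2)
x <- c(0, 0)
u <- x - mu
## density evaluated via the explicit formula
f.manual <- exp(-(u %*% solve(Sigma) %*% u) / 2) / (2 * pi * sqrt(det(Sigma)))
## density evaluated by the package function
f.pkg <- dmvnorm(x, mean = mu, sigma = Sigma)
all.equal(as.numeric(f.manual), f.pkg)   ## TRUE
```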

Thus, we also need the inverse of the sample variance-covariance matrix \(\widehat{\Sigma}\) to be able to draw the corresponding contours. The R function ellipse() in the following code evaluates the two-dimensional normal density, with the sample estimates plugged in, at the given margin values.

sigma.inv <- solve(sigma)    ## inverse of the sample variance-covariance matrix

ellipse <- function(s, t){
                    u <- c(s, t) - center
                    e <- (-1) * (u %*% sigma.inv %*% u) / 2
                    exp(e) / (2 * pi * sqrt(det(sigma)))}  ## normalizing constant 1 / (2 pi |Sigma|^{1/2})

This can be used as follows:

n <- 60
x <- (0:(n-1))*2 - 50
y <- (0:(n-1))*2 - 50

z <- mapply(ellipse, rep(x, times = n), rep(y, each = n))

And the final plot is produced by the following couple of the commands:

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Margin No. 1", ylab="Margin No. 2", bg = "lightgray")
contour(x, y, matrix(z, n, n), levels = pretty(range(z), 15), col = terrain.colors(16), add = TRUE, lwd = 1)
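An equivalent and somewhat more compact alternative is to evaluate the density on the whole grid directly with dmvnorm() from the mvtnorm package; the sketch below regenerates the sample so that it runs on its own:

```r
library("mvtnorm")
set.seed(1234)
s1 <- rmvnorm(1000, c(2, 3), matrix(c(10^2, 6^2, 6^2, 6^2), 2, 2))
center <- colMeans(s1)   ## sample mean vector
sigma  <- cov(s1)        ## sample variance-covariance matrix

n <- 60
x <- (0:(n - 1)) * 2 - 50
y <- (0:(n - 1)) * 2 - 50
grid <- as.matrix(expand.grid(x = x, y = y))
z <- dmvnorm(grid, mean = center, sigma = sigma)  ## density on the whole grid at once

plot(s1, pch = 21, xlim = c(-35, 35), ylim = c(-35, 35),
     xlab = "Margin No. 1", ylab = "Margin No. 2", bg = "lightgray")
contour(x, y, matrix(z, n, n), levels = pretty(range(z), 15),
        col = terrain.colors(16), add = TRUE, lwd = 1)
```

Here expand.grid() varies x fastest, which matches the column-wise filling of matrix(z, n, n) expected by contour().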



Do by Yourselves


  • Try various settings of the parameters defining the two-dimensional normal distribution (the mean vector and the variance-covariance matrix) used above and try to understand what happens with the distribution of the points, the rotation of the structure, and the corresponding contours when changing these parameters.

  • Use two different samples, each simulated from a two-dimensional normal distribution with different parameters, and draw the samples with the corresponding contours into one plot only. Again, try different parameter settings to see what happens.

  • Try to give a reasonable and sensible interpretation (in a way that non-statisticians or non-mathematicians would understand) of the parameters used in the models.



Another option which can be used to obtain the same results is a function predefined in the R package mixtools. Install the library using install.packages("mixtools") and load it by running the command

library("mixtools")

Within the mixtools library there is another ellipse() function defined (use the help session ?ellipse to see how it works).

rm(list = ls())
library(mixtools)
library(mvtnorm) 

set.seed(1234)
p <- rmvnorm(1000, c(250000, 20000), matrix(c(100000^2, 22000^2, 22000^2, 6000^2),2,2))
plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Margin No. 1", ylab="Margin No. 2")
ellipse(mu=colMeans(p), sigma=cov(p), alpha = .05, npoints = 250, col="red") 



Do by Yourselves


Can you reconstruct the contour from the figure above? How is the red ellipse calculated there?

alpha <- 0.05
angles <- seq(0, 2*pi, length = 200)
cpoints <- cbind(cos(angles), sin(angles))

q <- sqrt(qchisq(1 - alpha, df = 2))
mu <- colMeans(p)
sigma <- cov(p)
sigmaEig <- eigen(sigma)
sigmaSqr <- sigmaEig$vectors %*% diag(sqrt(sigmaEig$values)) %*% t(sigmaEig$vectors)
Epoints <-  q * (cpoints %*% sigmaSqr)
Epoints[,1] <- Epoints[,1] + mu[1]
Epoints[,2] <- Epoints[,2] + mu[2]

plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Margin No.1", ylab="Margin No.2")
points(Epoints[,1], Epoints[,2], col = "red", pch = "*")
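The construction can also be verified numerically: for a bivariate normal sample, roughly 95% of the points should fall inside the \(\alpha = 0.05\) ellipse, which is equivalent to the squared Mahalanobis distance being below the corresponding \(\chi^2_2\) quantile. A standalone sketch with an arbitrarily chosen covariance matrix (base R only):

```r
set.seed(1234)
## a bivariate normal sample generated via the Cholesky factor
Sigma <- matrix(c(4, 1.5, 1.5, 1), 2, 2)
p <- matrix(rnorm(2000), ncol = 2) %*% chol(Sigma)
## squared Mahalanobis distances of the points from the sample mean
d2 <- mahalanobis(p, center = colMeans(p), cov = cov(p))
## fraction of points inside the alpha = 0.05 ellipse - should be near 0.95
mean(d2 <= qchisq(0.95, df = 2))
```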



Alternative Approaches with the MASS Function mvrnorm()

Another option for simulating a general multivariate normal sample is the command mvrnorm() available within the library MASS (again, use the help session ?mvrnorm for more details). The library needs to be loaded by the standard command

library("MASS")

Next, we can generate random samples from a two-dimensional normal distribution and we can even calculate the corresponding multivariate kernel density estimates.

### two independent samples, each from a two-dimensional normal distribution
bivn1 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
### note: the second Sigma is singular (correlation 1), i.e. a degenerate normal
bivn2 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1.5, 1.5, 1.5, 1.5), 2))

### two dimensional kernel density estimate
bivn.kde1 <- kde2d(bivn1[,1], bivn1[,2], n = 50)
bivn.kde2 <- kde2d(bivn2[,1], bivn2[,2], n = 50)

par(mfrow = c(1,2))
persp(bivn.kde1, phi = 45, theta = 30, shade = .1, border = NA)
#par(new=TRUE)
persp(bivn.kde2, phi = 45, theta = 30, shade = .1, border = NA)



Comment


Remember that once we know the joint distribution function, all marginals are easily obtainable. On the other hand, knowing all marginals is still not enough to reconstruct the whole joint (multivariate) distribution. Using this idea we can generate samples with normally distributed marginals, BUT we cannot directly control the overall dependence structure - the covariance matrix among the marginals.

library(ggplot2)
rm(list = ls())
set.seed(1234)

n <- 1000
sample1 <- rnorm(n, mean=2, sd = 2)
sample2 <- 0.2 * sample1 + rnorm(n)
data <- data.frame(x = sample1, y = sample2, group="A")

sample3 <- rnorm(n, mean=0, sd = 1)
sample4 <- 3*sample3 - 1 + rnorm(n)
data <- rbind(data, data.frame(x = sample3, y = sample4, group="B"))
ggplot(data, aes(x=x, y=y,colour=group)) + geom_point(size=1.5, alpha=.6) 
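The dependence induced by this construction can be derived by hand: with sample2 = 0.2 * sample1 + eps, Var(sample1) = 2^2 = 4 and Var(eps) = 1, we get Cov(sample1, sample2) = 0.2 * 4 = 0.8 and Var(sample2) = 0.2^2 * 4 + 1 = 1.16. A standalone sketch (the larger sample size is only for numerical stability):

```r
set.seed(1234)
n <- 100000
s1 <- rnorm(n, mean = 2, sd = 2)
s2 <- 0.2 * s1 + rnorm(n)
cov(s1, s2)   ## theoretical value: 0.2 * 2^2 = 0.8
var(s2)       ## theoretical value: 0.2^2 * 4 + 1 = 1.16
```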





Homework Assignment

(Deadline: third lab session / 21.03.2017)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 
  • Choose one dataset in which at least two covariates can be considered continuous with an (approximately) normal distribution (e.g. use a histogram to help you decide) and load the data into the R environment.
  • Use your Rnw file (which you have already prepared for the previous homework assignment) and add some figures to your report. At least one figure should contain a scatterplot of the two covariates (see the previous item), including the contours calculated from the sample mean vector and the sample variance-covariance matrix.
  • Provide some reasonable interpretation of the figures and all elements included in them.