NMST539 | Cvičenie 2

Multivariate Normal Distribution

LS 2017/2018 | Monday 26/02/2018

Rmd file (UTF8 coding)

Outline of the second NMST539 lab session:

  • Multivariate normal distribution: theory and practical utilization with R;
  • R package ‘mvtnorm’;
  • R command ‘mvrnorm()’;
  • Illustrations and examples;

For the second lab session we use some useful tools available within the standard R software installation and the R package ‘mvtnorm’, which is meant for work with the multivariate normal distribution. Some additional options and practical R functions targeting multivariate distributions will be discussed as well.

The R-software is available for download (GNU licence, free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many), which is, however, not necessary for the proper functioning of the R software: RStudio.

Some (rather brief) manuals and introduction tutorials for R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)



1. R package ‘mvtnorm’

Firstly, we need to make sure that the library for the multivariate normal distribution (R package mvtnorm) is installed and properly loaded in the R working environment (use the command library("mvtnorm")). If the library cannot be loaded, you need to install it first. The installation is performed fully automatically by typing the command install.packages("mvtnorm") in the R command line.

library("mvtnorm")
## Warning: package 'mvtnorm' was built under R version 3.2.5

We can now use the loaded library to easily generate random values from a multivariate normal distribution with a pre-specified mean vector and variance-covariance matrix \(\Sigma\) (a symmetric and positive definite matrix).

For more details about the library (including a PDF manual document) see the library website on R cran:
https://cran.r-project.org/web/packages/mvtnorm/index.html



Comment


It is useful to always keep in mind that every time we are about to use a random generator in R (or any other software), it is good practice to predefine the initial value of the generator (in R, the command set.seed() does the job). By doing so, one makes sure that the obtained results are always reconstructable and can be easily verified.

For some specific purposes (master thesis, scientific paper, simulations, etc.) this is even a strictly required step.

For instance, try the following command (and compare the results among each other):

(sample <- rnorm(10))
##  [1]  2.3636139  0.9461329 -0.9355034  1.6337718  0.6346859  0.1925017
##  [7] -1.4251919  1.2437541  2.2649117 -0.7457070

and compare it with the same command, this time with the generator initialized by the set.seed(1234) command (any numerical value can be used as the input parameter; however, it must be the same for everyone):

set.seed(1234)
(sample <- rnorm(10))
##  [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
##  [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378

Is it clear what the difference between these two outputs is?
Producing results which are easily reconstructable and verifiable is the key issue here.
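The point can be verified directly: resetting the generator with the same seed reproduces the very same sample, while a different seed gives (in general) a different one. A minimal sketch using base R only:

```r
set.seed(1234)
sample.a <- rnorm(10)

set.seed(1234)                  ## the same seed again ...
sample.b <- rnorm(10)           ## ... gives exactly the same sample
identical(sample.a, sample.b)   ## TRUE

set.seed(4321)                  ## a different seed gives a different sample
sample.c <- rnorm(10)
identical(sample.a, sample.c)   ## FALSE
```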





Example

Let us start with a simple (two-dimensional) example: we are going to generate a random sample of size \(n \in \mathbb{N}\) from the two-dimensional normal distribution

\(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_n \sim N_{2}\Big(\boldsymbol{\mu} = \Big(\begin{array}{c}2\\3\end{array}\Big), \Sigma = \Big(\begin{array}{cc}10^2 & 6^2\\ 6^2 & 6^2\end{array}\Big) \Big)\).

for \(\boldsymbol{X}_{i} = (X_{i 1}, X_{i 2})^\top\).

Remember that the one-dimensional random generator for the normal distribution in R (command rnorm()) requires the standard deviation instead of the variance (compare the help sessions ?rnorm and ?rmvnorm).
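The difference can be sketched numerically; mvrnorm() from the MASS package is used here only so that the sketch relies on a package shipped with R – both it and rmvnorm() expect the variance-covariance matrix, while rnorm() expects the standard deviation:

```r
library("MASS")
set.seed(1234)

n <- 100000
x.uni <- rnorm(n, mean = 2, sd = 10)   ## standard deviation 10, i.e. variance 100
x.mvt <- mvrnorm(n, mu = c(2, 3),
                 Sigma = matrix(c(10^2, 6^2, 6^2, 6^2), 2, 2))  ## variance 10^2 on the diagonal

var(x.uni)        ## close to 100
var(x.mvt[, 1])   ## close to 100 as well -- the same marginal variance
```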

  • What is the marginal distribution of the elements – random variables \(X_{i 1}\) and \(X_{i 2}\)?
  • Can we use the standard one-dimensional random generator rnorm() to simulate both variables individually and later combine them into a vector, so as to obtain the multivariate (two-dimensional) normal distribution?
  • Verify by yourself that the variance-covariance matrix \(\Sigma\) is a symmetric and positive definite matrix.
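For the last point, symmetry and positive definiteness of the matrix \(\Sigma\) used above can be checked numerically in base R – a symmetric matrix is positive definite if and only if all its eigenvalues are strictly positive (equivalently, its Cholesky decomposition exists):

```r
Sigma <- matrix(c(10^2, 6^2, 6^2, 6^2), 2, 2)

isSymmetric(Sigma)             ## TRUE  -- Sigma equals its transpose
eigen(Sigma)$values            ## both eigenvalues are strictly positive
all(eigen(Sigma)$values > 0)   ## TRUE  => positive definite
chol(Sigma)                    ## succeeds only for a positive definite matrix
```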

The sample of size \(n=1000\) from the given two-dimensional normal distribution can be generated by the following command:

set.seed(1234)
s1 <- rmvnorm(1000, c(2, 3), matrix(c(10^2, 6^2, 6^2, 6^2),2,2))

And we can plot the generated random sample into a nice scatterplot using, for instance, the standard R command plot():

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Marginal No. 1", ylab="Marginal No. 2", bg = "lightblue")

Using now the mean estimates and the sample variance-covariance matrix, we can easily draw some ‘depth’ contours in the figure. Another option, of course, is to use the theoretical expression for the multivariate normal density, as we know the distribution we generated from (which is usually not the case in real-life situations). However, in the following we rather use the generated sample and calculate the sample mean vector and the sample variance-covariance matrix.

center <- apply(s1, 2, mean)   ## sample mean vector
sigma <- cov(s1)               ## sample variance-covariance matrix

The two dimensional normal density function is defined as

\(f(\boldsymbol{x}) = f(x_{1}, x_{2}) = \frac{1}{2 \pi |\Sigma|^{1/2}} \exp \left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\}\),

where \(\boldsymbol{x} = (x_{1}, x_{2})^\top \in \mathbb{R}^2\), \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2})^\top\) is the mean vector, and \(\Sigma\) is some given variance-covariance matrix (which is symmetric and positive definite).
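As a cross-check of the formula, it can be implemented directly (dnorm2() below is a hypothetical helper name): for a diagonal \(\Sigma\) the quadratic form separates, so the bivariate density must factorize into the product of two univariate normal densities evaluated by dnorm():

```r
## the bivariate normal density implemented directly from the formula above
dnorm2 <- function(x, mu, Sigma){
  d <- x - mu
  drop(exp(-(t(d) %*% solve(Sigma) %*% d) / 2) / (2 * pi * sqrt(det(Sigma))))
}

## with a diagonal Sigma the two marginals are independent
mu    <- c(2, 3)
Sigma <- diag(c(2^2, 3^2))   ## independent marginals N(2, 2^2) and N(3, 3^2)
x0    <- c(1, -1)

dnorm2(x0, mu, Sigma)
dnorm(x0[1], mean = 2, sd = 2) * dnorm(x0[2], mean = 3, sd = 3)   ## the same value
```

For a general (non-diagonal) \(\Sigma\) the same function agrees with dmvnorm() from the mvtnorm package.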

Thus, we also need the inverse of the sample variance-covariance matrix \(\widehat{\Sigma}\) to be able to draw the corresponding contours. The R function ellipse() in the following code calculates the value of the two-dimensional normal density when plugging in the appropriate sample estimates and the margin values.

sigma.inv <- solve(sigma)   ## inverse of the sample variance-covariance matrix

ellipse <- function(s, t){
                    u <- c(s, t) - center                  ## centered point
                    e <- (-1) * (u %*% sigma.inv %*% u) / 2
                    exp(e) / (2 * pi * sqrt(det(sigma)))}  ## bivariate normal density value

This can be now applied in the following way:

n <- 60
x <- (0:(n-1))*2 - 50
y <- (0:(n-1))*2 - 50

z <- mapply(ellipse, rep(x, times = n), rep(y, each = n))

And the final plot is produced by the following couple of commands:

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Margin No. 1", ylab="Margin No. 2", bg = "lightgray")
contour(x, y, matrix(z,n,n), levels = seq(0.0002, 0.0032, by = 0.0002), col = terrain.colors(16), add=TRUE, lwd = 1)



Do by Yourselves


  • Try various settings of the parameters defining the two-dimensional normal distribution used above (the mean vector and the variance-covariance matrix) and try to understand what happens with the distribution of the points, the rotation of the structure, and the corresponding contours when the values of the parameters change.

  • Simulate two different samples, each from a two-dimensional normal distribution with different parameters, and draw both samples with the corresponding contours into one plot. Again, try different parameter settings to see what happens.

  • Try to give a reasonable and sensible interpretation of the parameters used in the models (in a way that non-statisticians or non-mathematicians would understand).

  • Can you think of some reasonable and meaningful interpretation of the plotted contours in the figure above? Why are they called depth contours? Is there some straightforward relationship with quantiles?

  • Think of some linear combination of the random elements \(X_{i1}\) and \(X_{i2}\) in the sample you generated from the two-dimensional normal distribution. What is the distribution of this linear combination? Use some graphical tools to verify your conclusion.
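As a hint for the last point: any linear combination \(a^\top \boldsymbol{X}\) of a multivariate normal vector is again normal, with mean \(a^\top \boldsymbol{\mu}\) and variance \(a^\top \Sigma a\). A sketch of the graphical check (mvrnorm() from MASS is used so that the example is self-contained; rmvnorm() gives an analogous sample):

```r
library("MASS")
set.seed(1234)

mu    <- c(2, 3)
Sigma <- matrix(c(10^2, 6^2, 6^2, 6^2), 2, 2)
s1 <- mvrnorm(1000, mu = mu, Sigma = Sigma)

a  <- c(1, -2)          ## an arbitrary linear combination a1*Xi1 + a2*Xi2
lc <- drop(s1 %*% a)    ## its sample counterpart

mu.lc  <- sum(a * mu)                  ## theoretical mean: 2 - 6 = -4
var.lc <- drop(t(a) %*% Sigma %*% a)   ## theoretical variance: 100 - 2*2*36 + 4*36 = 100

## histogram of the linear combination with the theoretical normal density
hist(lc, freq = FALSE, breaks = 30, main = "Linear combination X1 - 2*X2")
curve(dnorm(x, mean = mu.lc, sd = sqrt(var.lc)), add = TRUE, col = "red", lwd = 2)
```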




2. R package ‘mixtools’

Another option, which can be used to obtain the same results, is a function predefined in the R package mixtools. Install the library using install.packages("mixtools") and load it by running the command

library("mixtools")

Within the mixtools library there is another ellipse() function defined (use the help session ?ellipse to see how it works).

rm(list = ls())
library(mixtools)
library(mvtnorm) 

set.seed(1234)
p <- rmvnorm(1000, c(250000, 20000), matrix(c(100000^2, 22000^2, 22000^2, 6000^2),2,2))
plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Margin No. 1", ylab="Margin No. 2")
ellipse(mu=colMeans(p), sigma=cov(p), alpha = .05, npoints = 250, col="red") 



Do by Yourselves


Can you reconstruct the contour from the figure above? How is the red ellipse calculated there?

alpha <- 0.05
angles <- seq(0, 2*pi, length = 200)
cpoints <- cbind(cos(angles), sin(angles))   ## points on the unit circle

q <- sqrt(qchisq(1 - alpha, df = 2))         ## scaling: (1 - alpha) chi-squared quantile
mu <- colMeans(p)                            ## sample mean vector
sigma <- cov(p)                              ## sample variance-covariance matrix
sigmaEig <- eigen(sigma)
sigmaSqr <- sigmaEig$vectors %*% diag(sqrt(sigmaEig$values)) %*% t(sigmaEig$vectors)  ## matrix square root of sigma
Epoints <-  q * (cpoints %*% sigmaSqr)       ## unit circle mapped onto the ellipse
Epoints[,1] <- Epoints[,1] + mu[1]           ## shift by the sample mean
Epoints[,2] <- Epoints[,2] + mu[2]

plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Margin No.1", ylab="Margin No.2")
points(Epoints[,1], Epoints[,2], col = "red", pch = "*")
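The red curve is exactly the set of points whose squared Mahalanobis distance from the mean equals \(\chi^2_{2}(0.95)\), so roughly 95 % of the sample points should fall inside it. A sketch of this check (mvrnorm() from MASS regenerates a comparable sample; the rmvnorm() sample from above works the same way):

```r
library("MASS")
set.seed(1234)
p <- mvrnorm(1000, mu = c(250000, 20000),
             Sigma = matrix(c(100000^2, 22000^2, 22000^2, 6000^2), 2, 2))

## squared Mahalanobis distances of the points from the sample mean
md2 <- mahalanobis(p, center = colMeans(p), cov = cov(p))

## proportion of points inside the alpha = 0.05 ellipse
mean(md2 <= qchisq(0.95, df = 2))   ## close to 0.95
```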



3. Alternative Approaches with the command mvrnorm()

Another option for simulating a general multivariate normal sample is available within the library MASS via the command mvrnorm() (again, use the help session ?mvrnorm to find out more details about this command). The library is loaded by the standard command

library("MASS")

Next, we generate some random samples from two-dimensional normal distributions and, for visualisation purposes, we calculate the corresponding multivariate kernel density estimates (the theoretical details are omitted; they are not important here).

### two independent samples, each from a two-dimensional normal distribution
bivn1 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn2 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1.5, 1.5, 1.5, 1.5), 2))  ## note: a singular Sigma -- perfectly correlated marginals

### two dimensional kernel density estimate
bivn.kde1 <- kde2d(bivn1[,1], bivn1[,2], n = 50)
bivn.kde2 <- kde2d(bivn2[,1], bivn2[,2], n = 50)

par(mfrow = c(1,2))
persp(bivn.kde1, phi = 45, theta = 30, shade = .1, border = NA)
#par(new=TRUE)
persp(bivn.kde2, phi = 45, theta = 30, shade = .1, border = NA)
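A quick sanity check of the estimates: kde2d() returns the estimated density on a regular grid, so a simple Riemann sum over the grid should be close to one (a small part of the kernel mass may fall outside the grid range):

```r
library("MASS")
set.seed(1234)

bivn1 <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde1 <- kde2d(bivn1[, 1], bivn1[, 2], n = 50)

## Riemann sum of the estimated density over the grid
dx <- diff(bivn.kde1$x[1:2])   ## grid step in the first coordinate
dy <- diff(bivn.kde1$y[1:2])   ## grid step in the second coordinate
sum(bivn.kde1$z) * dx * dy     ## should be close to 1
```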



Comment


Remember that once we know the joint distribution function, all marginals are easily obtainable – one just needs to integrate out the remaining components. On the other hand, knowing all marginals is still not enough to reconstruct the whole joint (multivariate) distribution. Following this idea, we can use the next principle to generate a multivariate normal sample from univariate ones, BUT without specifying it explicitly we do not control the overall dependence structure – the covariance matrix among the marginals.

library(ggplot2)
rm(list = ls())
set.seed(1234)

n <- 1000
sample1 <- rnorm(n, mean=2, sd = 2)
sample2 <- 0.2 * sample1 + rnorm(n)
data <- data.frame(x = sample1, y = sample2, group="A")

sample3 <- rnorm(n, mean=0, sd = 1)
sample4 <- 3*sample3 - 1 + rnorm(n)
data <- rbind(data, data.frame(x = sample3, y = sample4, group="B"))
ggplot(data, aes(x=x, y=y,colour=group)) + geom_point(size=1.5, alpha=.6) 
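The first construction above does induce a particular covariance structure, and it can be derived by hand: for \(Y = 0.2 X + \varepsilon\) with \(X \sim N(2, 2^2)\) and \(\varepsilon \sim N(0,1)\) independent, cov(X, Y) = 0.2 · var(X) = 0.8 and var(Y) = 0.2² · 4 + 1 = 1.16. A sketch comparing this with the sample estimate (sample size increased for a precise comparison):

```r
set.seed(1234)
n <- 100000
x <- rnorm(n, mean = 2, sd = 2)   ## X ~ N(2, 2^2)
y <- 0.2 * x + rnorm(n)           ## Y = 0.2 X + eps, eps ~ N(0, 1) independent of X

## theoretical variance-covariance matrix of (X, Y)
Sigma.theory <- matrix(c(4, 0.8, 0.8, 1.16), 2, 2)
Sigma.theory

cov(cbind(x, y))   ## the sample estimate is close to the theoretical matrix
```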





Homework Assignment

(Deadline: third lab session / 05.03.2018)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 

  • Choose one dataset with at least two covariates which can be considered continuous (at least approximately; for instance, use a histogram to help you decide which variables to consider) with some unknown multivariate normal distribution, and load the data into the R environment.
  • Use your Rnw file (which you have already prepared for the previous homework assignment) and add some figures into your report. There should be at least one figure containing a scatterplot of both covariates, including the contours calculated using the sample mean vector and the sample variance-covariance matrix.
  • Provide some reasonable interpretation of the figures and all other elements included in the plot.