NMST539 | Cvičenie 2

Multivariate Normal Distribution

LS 2017 | Monday 26/02/18

Rmd file (UTF8 coding)

Outline of the second NMST539 lab session:

  • Multivariate normal distribution: theory and practical utilization with R;
  • R package ‘mvtnorm’;
  • R command ‘mvrnorm()’;
  • Illustrations and examples;

For the second lab session we utilize some useful tools, available within the standard R software instalation and the R package ‘mvtnorm’, meant for use with the multivariate normal distribution. Some additional options and practical functions in R targeting multivariate distributions will be discussed too.

The R-software is available for download (GNU licence, free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many) however, not nesessary for a proper functioning of the R software: RStudio.

Some (rather brief) manuals and introduction tutorials for R (in Czech or English):

  • Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
  • Komárek, A.: Základy práce s R. (PDF súbor)
  • Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)



1. R package ‘mvtnorm’

Firstly, we need to make sure that the library for the multivariate normal distribution (R package mvtnorm) is isntalled and properly loaded in the R working environment (use commad library("mvtnorm")). If the library can not be loaded you need to install it first. Installation is performed fully automatically by typing the command install.packages("mvtnorm") in the R command line.

library("mvtnorm")
## Warning: package 'mvtnorm' was built under R version 3.2.5

We can now use the loaded library to easily generate random values from some multivariate normal distribution with some pre-specified variance-covariance matrix \(\Sigma\) (a symmetric and positive definite matrix).

For more details about the library (including a PDF manual document) see the library website on R cran:
https://cran.r-project.org/web/packages/mvtnorm/index.html



Comment


It is useful to always keep in mind that everytime we are about to use some random generator in R (or any other software) it is good to predefine the initial values for the generator (in R there is a command called set.seed() to do the job). By doing so one makes sure that the obtained results are always all reconstructable and they can be easily verified.

For some specific purposes (master thesis, scientific paper, simulations, etc.) this is even a strictly required step.

For instance try (among each other) the command:

(sample <- rnorm(10))
##  [1]  2.3636139  0.9461329 -0.9355034  1.6337718  0.6346859  0.1925017
##  [7] -1.4251919  1.2437541  2.2649117 -0.7457070

and compare it with the same command, however, with the initial setting of the generator done by the set.seed(1234) command (any numerical value can be used as an input parameter, however, it must be the same for all):

set.seed(1234)
(sample <- rnorm(10))
##  [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559
##  [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378

Is it clear, what is the difference between these two outputs?
Producing results which are easily reconstructable and can be verified is the key issue here.





Example

Let us start with a simple (two-dimensional) example: we are going to generate a random sample of size \(n \in \mathbb{N}\) from the two-dimensional normal distribution

\(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_n \sim N_{2}\Big(\boldsymbol{\mu} = \Big(\begin{array}{c}2\\3\end{array}\Big), \Sigma = \Big(\begin{array}{cc}10^2 & 6^2\\ 6^2 & 6^2\end{array}\Big) \Big)\).

for \(\boldsymbol{X}_{i} = (X_{i 1}, X_{i 2})^\top\).

Remember, that in case of a one-dimensional random generator for the normal distribution in R (command rnorm()) one needs to specify the standard error instead of the variance (compare the help sessions ?rnorm and ?rmnorm).

  • What is the marginal distribution of the elements – random variables \(X_{i 1}\) and \(X_{i 2}\)?
  • Can we use the standard – one dimensional random generator rnorm() to simulate both variables individually and, later, to combine them into the vector to form the multivariate – two dimensional normal distribution?
  • Verify by yourself that the variance-covariance matrix \(\Sigma\) is a symmetric and positive definite matrix.

The sample of size \(n=1000\) from the given two-dimensional normal distribution can be generated by the following command:

set.seed(1234)
s1 <- rmvnorm(1000, c(2, 3), matrix(c(10^2, 6^2, 6^2, 6^2),2,2))

And we can plot the generated random sample into a nice scatterplot using, for instance, the standard R command plot():

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Marginal No. 1", ylab="Marginal No. 2", bg = "lightblue")

Using now the mean estimates and the sample variance-covariance matrix we can easily draw some ‘depth’ contours in the figure. Another option of course, is to use the theoretical expression for the mutivariate normal density, as we know the distribution we generated from (which is usually not the case in real life situations). However, in the following, we will rather use the generated sample and we calculate the sample mean vector and the sample variance-covariance matrix.

center <- apply(s1, 2, mean)   ## sample mean vector
sigma <- cov(s1)               ## sample variance-covariance matrix

The two dimensional normal density function is defined as

\(f(\boldsymbol{x}) = f(x_{1}, x_{2}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp \left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\}\),

where \(\boldsymbol{x} = (x_{1}, x_{2})^\top \in \mathbb{R}^2\), \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2})^\top\) is the mean vector, and \(\Sigma\) is some given variance-covariance matrix (which is symmetric and positive definite).

Thus, we also need the inverse matrix for the sample variance-covariance matrix \(\widehat{\Sigma}\) to be able to draw the corresponding contours. The R fuction ellipse() in the following code is responsible for calculating the value of the two dimensional normal density function when plugging in the appropriate sample estimates and the margin values.

sigma.inv = solve(sigma, matrix(c(1,0,0,1),2,2))

ellipse <- function(s,t){
                    u<-c(s,t)-center; 
                    e <- (-1) * (u %*% sigma.inv %*% u) / 2; 
                    exp(e)/(2 * pi * sqrt(det(sigma.inv)))}

This can be now applied in the following way:

n <- 60
x <- (0:(n-1))*2 - 50
y <- (0:(n-1))*2 - 50

z <- mapply(ellipse, as.vector(rep(x,n)), as.vector(outer(rep(0,n), y, `+`)))

And the final plot is produced by the following couple of the commands:

plot(s1, pch=21, xlim=c(-35, 35), ylim=c(-35,35), xlab="Margin No. 1", ylab="Margin No. 2", bg = "lightgray")
contour(x,y,matrix(z,n,n), levels=(0:15), col = terrain.colors(16), add=TRUE, lwd = 1)