NMST539 | Lab Session 3 (Individual work)

Multivariate Normal Distribution

(marginal and conditional distributions)

LS 2017 | Instad of Thursday 07/03/17

Rmd file (UTF8 coding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
Komárek, A.: Základy práce s R. (PDF súbor)
Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

Conditional Normal Distribution

Let us consider a two-dimensional normal distribution of some random vector \(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big)\). The distribution takes the following (general) form:

\(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big) \sim N_{2}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \mu_{1} \\ \mu_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \sigma_{1}^{2} & \sigma_{12} \\\sigma_{21} & \sigma_{2}^{2} \end{array} \right) \right)\),

where \(\boldsymbol{\mu} \in \mathbb{R}^2\) is the vector of the expected values and \(\Sigma\) is the variance-covariance matrix, which is a positive definite and symmetric. The correspoding density function (of the two dimensional normal distrubution) is given by the expression

\(\large{f(\boldsymbol{x}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp\Big\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \Big\},}\)

for \(\boldsymbol{x} = (x_{1}, x_{2})^{\top} \in \mathbb{R}^{2}\).

This density can be used to derive the marginal distrubution of random variables \(X_{1}\) and \(X_{2}\) or the conditional distrubution of \(X_{1}\) given \(X_{2}\) (or \(X_{2}\) given \(X_{1}\) respectively). In the following we will do both.

For the marginal density of \(X_{1}\) we have \(f(x_{1}) = \int_{\mathbb{R}} f(x_{1}, x_{2}) \mbox{d}x_{2}\) and analogously also for the marginal density of \(X_{2}\) (where we need to integrate the join density wrt the first covariate).

For a simple example with a two dimensional normal distribution the conditional distribution distribution of \(X_{2}\) given \(X_{1} = x_{1}\) is normal again and it holds that

\((X_{2} | X_{1 } = x_{1}) \sim N\Big(\mu_{2} + \frac{\sigma_{21}(x_1 - \mu_1)}{\sigma_{1}^2}, \sigma_{2}^2 - \frac{\sigma_{12}\sigma_{21}}{\sigma_{1}^2}\Big).\)

Now we can apply the formulas given above to obtain the marginal and conditional distributions. Firstly, the library mvtnorm needs to be installed on R. The library is loaded in by running the command:

library("mvtnorm")

Let us consider a simple example with two dimensional normal distribution with the zero mean vector \(\boldsymbol{\mu} = (0,0)^\top\) and the variance-covariance matrix \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\0.8& 1\end{array} \right)\). We would like to calculate the conditional distribution of \(X_{2}\) given \(X_{1} = 0.7\).

Sigma <- matrix(c(1,.8,.8,1), nrow=2) ## variance-covariance matrix

x <- seq(-3,3,0.01)
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))

abline(v=.7, lwd=2, lty=2, col = "red")
text(0.75, -2, labels=expression(x[1]==0.7), col = "red", pos = 4)

### conditional distribution of X2 | X1 = 0.7
y <- dnorm(x, mean =  0.8 * 0.7, sd = sqrt(1 - 0.8^2))
lines(y-abs(min(x)),x,lty=2,lwd=2, col = "red")

### marginals
m1 <- m2 <- dnorm(x, 0, 1)
lines(x, m1 - abs(min(x)), lty = 1, lwd = 1)
lines(m2 - abs(min(x)), x, lty = 1, lwd = 1)

And we can add some other conditional distribution of (e.g. for \(X_{2} | X_{1} = -1\)) by adding additional lines of an R code:

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=2, lty=2, col = "blue")

### conditional distribution of X2 | X1 = - 1
y2 <- dnorm(x, mean = 0.8 * (- 1), sd = sqrt(1 - 0.8^2))
lines(-y2 + max(x),x,lty=2,lwd=2, col = "blue")

Another option in R on who to obtain the same output:

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=1, lty=1, col = "blue")
lines(y2 - 1 ,x,lty=2,lwd=2, col = "blue")

And even more generalized version for different values of \(x_{1} \in \mathbb{R}\):

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
condN <- function(x, cx){dnorm(x, mean = 0.8 * cx, sd = sqrt(1 - 0.8^2))}
for (i in seq(-2, 2, by = 0.25)){col <- colors()[grep("red",colors())][4*i + 9]; 
                                 lines(condN(x, i) + i, x, lwd = 2, col = col);
                                 abline(v = i, lwd=1, lty=1, col = col)}

To Do by Yourself

Consider the conditional distribution we obtained above. Can you numerically reconstruct (obtain) the values which are used as the parameters in the normal distribution?
Recall the general formula for a conditional distribution of \(X_{2}\) given \(X_{1} = x_{2}\) if the join distribution of \(\Big(\begin{array}{c}X_{1}\\ X_{2}\end{array}\Big)\) is normal and \(X_{1}\) is \(p\)-dimensional and \(X_{2}\) is \(q\)-dimensional.
Use the R software to make an the following example: consider at least a four dimensional normal distribution and calculate the conditional distribution of \(X_{1}\) given the remaining elements of the random vector, such that \(X_{1}\) consists of the first two elements only. Make a plot for the conditional two-dimensional normal distribution which you obtained.

Comment

For a general multivariate normal distribution denoted as

\(\Big(\begin{array}{x}\boldsymbol{X}_{1}\\\boldsymbol{X}_{2}\end{array}\Big) \sim N_{p + q}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{21} \\\Sigma_{12} & \Sigma_{22} \end{array} \right) \right)\)

for some random vectors \(\boldsymbol{X}_{1} \in \mathbb{R}^{p}\) and \(\boldsymbol{X}_{2} \in \mathbb{R}^{q}\) for \(p, q \in \mathbb{N}\) and the corresponding expectation vectors \(\boldsymbol{\mu}_{1}\) and \(\boldsymbol{\mu}_{2}\) and the variance-covariance matrix \(\Sigma\) the conditional distribution of \(\boldsymbol{X}_{1}\) given \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) is defined by the following expression:

\(\boldsymbol{X}_{1} | \boldsymbol{X}_{2} = \boldsymbol{x}_{2} \sim N_{p}\left(\boldsymbol{\mu}_{1} + \Sigma_{12}\Sigma_{22}^{-1}(\boldsymbol{x}_{2} - \boldsymbol{\mu}_{2} ), \Sigma = \Sigma_{11} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)\).

Conditional Multivariate Normal with ‘condMVNorm()’ package

There is a nice (and quite brief) R library available from the R cran mirrors which can be used to calculate the conditional distribution of some random vector \(\boldsymbol{X}_{1}\) given another random vector (value) \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) where the joint distribution of \((\boldsymbol{X}_{1}^{\top}, \boldsymbol{X}_{2}^{\top})^{\top}\) is multivariate normal (with the corresponding dimension). The library can be installed by using command install.packages("condMVNorm").

library("condMVNorm")

Functions dcmvnorm() and condMVN can be directly used to obtain the conditional distribution of interest (see the help session for more details - ?dcmvnorm and ?condMVN).

We apply these functions to the previous example with a two dimensional normal distribution with a zero mean vector and \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array} \right)\). We want to obtain the conditional distribution of \(X_{2}\) given \(X_1 = 0.7\).

condDist <- condMVN(mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)

and also

dcmvnorm(0, mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)

## [1] 0.4301297

which should be compared with

dnorm(0, condDist$condMean, sqrt(condDist$condVar))

## [1] 0.4301297

Comment

Keep in minds that some functions in R require standard deviation to be specified when working with the normal distribution (e.g. functions dnorm(), pnorm(), qnorm(), etc.) while others need the variance-covariance matrix (variance in one dimension) to be provided instead.

Conditional Normal Distribution Example

In this example we consider a dataset mtcars which is standardly available in the R environment (for a more detailed description see the help session: ?mtcars).

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

attach(mtcars)

Let us assume that covariates ‘mpg’ (miles per gallon), ‘hp’ (horse power) and ‘disp’ (displacement) are all normaly distributed and the triple \((X_{1}, X_{2}, X_{3})^{\top} = (mpg, hp, disp)^{\top}\) follows a three dimensional normal distribution with some unknown mean vector \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2}, \mu_{3})^{\top} \in \mathbb{R}^{3}\) and some positive definite and symmetric variance-covariance matrix \(\Sigma\).

What is the (estimated) distribution if the miles per gallon (‘mpg’) given the horse power and displacement, where \((X_{2}, X_{3})^{\top} = (120,120)^{\top}\)?

The corresponding estimates can be obtained by

mu <- apply(cbind(mpg, hp, disp), 2, mean) 
Sigma <- cov(cbind(mpg, hp, disp))

Thus, the corresponding conditional mean and variance are obtained as

condMean <- mu[1] + Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% (c(120, 120) - mu[2:3])
condSigma <- Sigma[1,1] - Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% Sigma[2:3, 1]

Which gives us the following distribution function:

condMean

##          [,1]
## [1,] 24.11354

condSigma

##         [,1]
## [1,] 9.14495

plot(dnorm(x <- seq(0, 35, length = 500), mean = condMean, sd = sqrt(condSigma)) ~ x, 
                                           xlab = "Miles per gallon", ylab = "Conditional Density", 
                                           type = "l", lwd = 2, col = "red", 
                                           xlim = c(0, 35), ylim = c(0,0.18))
lines(dnorm(x, mean(mpg), sd(mpg)) ~ x, lwd = 1)
legend(0, 0.175, legend = c("Conditional distribution: MPG | HP = 120 & Disp = 120", 
                           "Estimated Normal Distribution of MPG"), col = c("red", "black"), lty = c(1,1))

Homework Assignment

(Deadline: third lab session / 21.03.2016)

Use the command below to instal the SMSdata packages with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Chose one dataset of your choise from the list of all available datasets in the package:

data(package = "SMSdata")

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma)

Choose one dataset where there are at least two covariates which can be considered to be continous ones wich (approximatelly) some normal distribution (e.g. use a histogram plot to help yourself to decide) and load the data into the R environment.
Use your Rnw file (which you have already prepared for the previous homework assignment) and add some figures into your report. At least one figure should contours of the two-dimensional distribution you are considering. Assuming the two-variate join normal distribution for your two covariates calculate also a conditional distribution when conditioning one covariate by another one and both marginal distribtions. Draw the corresponding distributions into the firgure.
Provide some reasonable interpretation of the figures and all elements included in them.