NMST539 | Lab Session 3

Multivariate Normal Distribution

(marginal and conditional distributions)

LS 2017 | Monday 05/03/18

Rmd file (UTF8 coding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
Komárek, A.: Základy práce s R. (PDF súbor)
Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

1. Conditional Normal Distribution

Let us consider a two-dimensional normal distribution of some random vector \(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big)\). The corresponding distribution is usually denoted as

\(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big) \sim N_{2}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \mu_{1} \\ \mu_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \sigma_{1}^{2} & \sigma_{12} \\\sigma_{21} & \sigma_{2}^{2} \end{array} \right) \right)\),

where \(\boldsymbol{\mu} \in \mathbb{R}^2\) is the vector of the expected values and \(\Sigma\) is the variance-covariance matrix, which is a positive definite and symmetric, thus \(\sigma_{12} = \sigma_{21}\). The correspoding density function (of the two dimensional normal distrubution) is given by the expression

\(\large{f(\boldsymbol{x}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp\Big\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \Big\},}\)

for an arbitrary \(\boldsymbol{x} = (x_{1}, x_{2})^{\top} \in \mathbb{R}^{2}\).

This density can be used to derive the marginal distrubution of the random variables \(X_{1}\) and \(X_{2}\) or the conditional distrubution of \(X_{1}\) given \(X_{2}\) (or \(X_{2}\) given \(X_{1}\) respectively). In the following we will do both.

For the marginal density of \(X_{1}\) we need to obtain \(f(x_{1}) = \int_{\mathbb{R}} f(x_{1}, x_{2}) \mbox{d}x_{2}\) and analogously also for the marginal density of \(X_{2}\), where integrate the join density wrt the first covariate instead. Both marginals are again normaly distributed and it holds that \(X_{1} \sim N(\mu_1, \sigma_1^2)~~~\) and \(~~~X_{2} \sim N(\mu_2, \sigma_2^2)\).
For a simple example with a two dimensional normal distribution the conditional distribution distribution of \(X_{2}\) given \(X_{1} = x_{1}\) is, again, normal and it holds that (analogously also for the distribution of \(X_{1}\) given \(X_{2}\))
\((X_{2} | X_{1 } = x_{1}) \sim N\Big(\mu_{2} + \frac{\sigma_{21}(x_1 - \mu_1)}{\sigma_{1}^2}, \sigma_{2}^2 - \frac{\sigma_{12}\sigma_{21}}{\sigma_{1}^2}\Big).\)

Now we can apply the formulas given above to obtain the marginal and conditional distributions. We will use the R library mvtnorm (which needs to be firstly installed on R). The library is loaded into the working environment by running the command

library("mvtnorm")

Let us consider a simple example with two dimensional normal distribution with the zero mean vector \(\boldsymbol{\mu} = (0,0)^\top\), and the variance-covariance matrix \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\0.8& 1\end{array} \right)\). We would like to calculate the conditional distribution of \(X_{2}\) given \(X_{1} = 0.7\).

Do by Yourselves

Is there any linear relationship between the covariates \(X_1\) and \(X_2\)? Can you quantitatively express how strong this relationship is?

In terms of the linear regression modelling approach: imagine you obtain a sample from the given two dimensional normal distribution and you fit a simple regression line to the data. Do you have some expectation about the parameter estimates you obtain when fitting the linear regression model?
Try the following:
```
n <- 100
sample <- rmvnorm(n, c(0, 0), matrix(c(1, 0.8, 0.8, 1),2,2))
summary(lm(sample[,1] ~ sample[,2]))
```

Use the following piece of the R code to obtain a comparison between the joint distribution, marginal distribution and the conditional distribution of \(X_{2}\) given \(X_{1} = 0.7\). Derive the theoretical expressions for the marginal distributions and the conditional distribution.

Sigma <- matrix(c(1,.8,.8,1), nrow=2) ## variance-covariance matrix

x <- seq(-3,3,0.01)
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}), col = "blue")

abline(v=.7, lwd=2, lty=2, col = "red")
text(0.75, -2, labels=expression(x[1]==0.7), col = "red", pos = 4)

### conditional distribution of X2 | X1 = 0.7
y <- dnorm(x, mean =  0.8 * 0.7, sd = sqrt(1 - 0.8^2))
lines(y-abs(min(x)),x,lty=2,lwd=2, col = "red")

### marginals
m1 <- m2 <- dnorm(x, 0, 1)
lines(x, m1 - abs(min(x)), lty = 1, lwd = 2, col = "gray30")
lines(m2 - abs(min(x)), x, lty = 1, lwd = 2, col = "gray30")

The conditional distribution can be obtained for any value \(X_{1} = x_1\), for instance, we obtain the conditional distribution of \(X_{2} | X_{1} = -1\)):

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}), col = "blue")
abline(v=-1, lwd=2, lty=2, col = "red")

### conditional distribution of X2 | X1 = - 1
y2 <- dnorm(x, mean = 0.8 * (- 1), sd = sqrt(1 - 0.8^2))
lines(-y2 + max(x),x,lty=2,lwd=2, col = "red")

Another option in R to obtain the same output:

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}), col = "blue")
abline(v=-1, lwd=1, lty=1, col = "red")
lines(y2 - 1 ,x,lty=2,lwd=2, col = "red")

And even more generalized version for a set of different values of \(x_{1} \in \mathbb{R}\):

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}), col= "blue")
condN <- function(x, cx){dnorm(x, mean = 0.8 * cx, sd = sqrt(1 - 0.8^2))}
for (i in seq(-2, 2, by = 0.25)){col <- colors()[grep("red",colors())][4*i + 9]; 
                                 lines(condN(x, i) + i, x, lwd = 2, col = col);
                                 abline(v = i, lwd=1, lty=1, col = col)}

To Do by Yourself

Consider the conditional distribution we obtained above. Can you numerically reconstruct (obtain) the values which are used as the parameters in the normal distribution?

Recall the general formula for a conditional distribution of \(X_{2}\) given \(X_{1} = x_{2}\) if the join distribution of \((X_{1}^\top, X_{2}^\top)^\top\) is normal and \(X_{1}\) is \(p\)-dimensional random vector for \(p > 1\) and \(X_{2}\) is \(q\)-dimensional random vector with \(q > 1\).

Recal a normal linear model \(Y_{i} = \alpha + \beta X_i + \varepsilon_i\), for \(i = 1, \dots, n\). What is the distribution of \(Y\) given \(X = x\)? Do you see the relationship of the model with the countour plot above?

Consider a multivariate example for at least a four dimensional normal distribution and calculate the conditional distribution of the first two elements \((X_{1}, X_{2})\) given the remaining elements. Create a contour plot for the conditional two-dimensional normal distribution with various values for the condition.

Comment

For a general multivariate normal distribution denoted as

\(\Big(\begin{array}{x}\boldsymbol{X}_{1}\\\boldsymbol{X}_{2}\end{array}\Big) \sim N_{p + q}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{21} \\\Sigma_{12} & \Sigma_{22} \end{array} \right) \right)\)

for some random vectors \(\boldsymbol{X}_{1} \in \mathbb{R}^{p}\) and \(\boldsymbol{X}_{2} \in \mathbb{R}^{q}\) for \(p, q \in \mathbb{N}\) and the corresponding expectation vectors \(\boldsymbol{\mu}_{1}\) and \(\boldsymbol{\mu}_{2}\) and the variance-covariance matrix \(\Sigma\) the conditional distribution of \(\boldsymbol{X}_{1}\) given \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) is defined by the following expression:

\((\boldsymbol{X}_{1} | \boldsymbol{X}_{2} = \boldsymbol{x}_{2}) \sim N_{p}\left(\boldsymbol{\mu}_{1} + \Sigma_{12}\Sigma_{22}^{-1}(\boldsymbol{x}_{2} - \boldsymbol{\mu}_{2} ), \Sigma = \Sigma_{11} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)\).

2. Conditional Multivariate Normal with ‘condMVNorm()’ package

There is a nice (and quite brief) R library available from the R cran mirrors which can be used to calculate the conditional distribution of some random vector \(\boldsymbol{X}_{1}\) given another random vector (value) \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\), where the joint distribution of \((\boldsymbol{X}_{1}^{\top}, \boldsymbol{X}_{2}^{\top})^{\top}\) is multivariate normal (with the corresponding dimension). The library can be installed by using command install.packages("condMVNorm").

library("condMVNorm")

Functions dcmvnorm() and condMVN can be directly used to obtain the conditional distribution of interest (see the help session and use ?dcmvnorm and ?condMVN for more details).

We apply these functions to the previous example with a two dimensional normal distribution with a zero mean vector and the variance-covariance matrix \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array} \right)\). We want to obtain the conditional distribution of \(X_{2}\) given \(X_1 = 0.7\).

condDist <- condMVN(mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)

and also

dcmvnorm(0, mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)

## [1] 0.4301297

which should be compared with

dnorm(0, condDist$condMean, sqrt(condDist$condVar))

## [1] 0.4301297

Comment

Keep in minds that some functions in R require standard deviation to be specified when working with the normal distribution (e.g. functions dnorm(), pnorm(), qnorm(), etc.) while others need the variance-covariance matrix (variance in one dimension) to be provided instead.

3. Conditional Normal Distribution Example

In this example we consider the dataset mtcars which is standardly available in the R environment (for a more detailed description of the dataset see the help session ?mtcars).

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

attach(mtcars)

Let us assume that covariates ‘mpg’ (miles per gallon), ‘hp’ (horse power) and ‘disp’ (displacement) are all normaly distributed and the triple \((X_{1}, X_{2}, X_{3})^{\top} = (mpg, hp, disp)^{\top}\) follows a three dimensional normal distribution with some unknown mean vector \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2}, \mu_{3})^{\top} \in \mathbb{R}^{3}\) and some positive definite and symmetric variance-covariance matrix \(\Sigma\).

Use the R software to obtain the corresponding estimates for the mean vector \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2}, \mu_{3})^{\top} \in \mathbb{R}^{3}\) and the variance-covariance matrix \(\Sigma\).

What is the (estimated) distribution if the miles per gallon (‘mpg’) given the horse power and displacement, where \((X_{2}, X_{3})^{\top} = (120,120)^{\top}\)?

The corresponding estimates can be obtained by

mu <- apply(cbind(mpg, hp, disp), 2, mean) 
Sigma <- cov(cbind(mpg, hp, disp))

Thus, the corresponding conditional mean and variance are obtained as

condMean <- mu[1] + Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% (c(120, 120) - mu[2:3])
condSigma <- Sigma[1,1] - Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% Sigma[2:3, 1]

Which gives us the following distribution function:

condMean

##          [,1]
## [1,] 24.11354

condSigma

##         [,1]
## [1,] 9.14495

plot(dnorm(x <- seq(0, 35, length = 500), mean = condMean, sd = sqrt(condSigma)) ~ x, 
                                           xlab = "Miles per gallon", ylab = "Conditional Density", 
                                           type = "l", lwd = 2, col = "red", 
                                           xlim = c(0, 35), ylim = c(0,0.18))
lines(dnorm(x, mean(mpg), sd(mpg)) ~ x, lwd = 1)
legend(0, 0.175, legend = c("Conditional distribution: MPG | HP = 120 & Disp = 120", 
                           "Estimated Normal Distribution of MPG"), col = c("red", "black"), lty = c(1,1))

To Do by Yourself

Consider a simple linear regression model where the miles per gallon information is used as a dependent covariate and the horse power and the displacement value are both used as explanatory variables. Estimate the model and interpret the estimated coefficient.
Estimate the expected value of the consuption (miles per gallon) for a car with the horse power 120 and the displacement 120 too.
Compare the conditional distribution you obtained above with the prediction obtained by the linear regression model. Discuss the results.

Theoretical Exercises

Exercise 1
Consider the random vector \(\boldsymbol{X}\) with the two dimensional normal distribution \(N_{2} \left( \left( \begin{array}{c}1\\2\end{array} \right), \left( \begin{array}{cc} 2 & 1\\1 & 2\end{array} \right) \right)\) and the conditional random vector \(\boldsymbol{Y} | \boldsymbol{X}\), such that \[ \boldsymbol{Y} | \boldsymbol{X} \sim N_{2}\left( \left( \begin{array}{c} X_{1}\\ X_{1} + X_{2} \end{array} \right), \left( \begin{array}{cc} 1 & 0\\ 0 & 1 \end{array} \right) \right). \]
Determine the distribution of \(Y_2 | Y_{1}\) and the distribution of \(\boldsymbol{W} = \boldsymbol{X} - \boldsymbol{Y}\).

Exercise 2
Consider a random vector \(\boldsymbol{X} \sim N_{2}(\boldsymbol{\mu}, \Sigma)\), for \(\boldsymbol{\mu} = (2,2)^\top\) and \(\Sigma = \mathbb{I}_2\) (a unit matrix). Consider also matrices \(\mathbb{A} = (1, 1)^\top\) and \(\mathbb{B} = (1, -1)^\top\). Show, that the random variables \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\) are independent.

Exercise 3
Consider a random vector \((X, Y, Z)^\top \sim N_3(\boldsymbol{\mu}, \Sigma)\). Compute the values for \(\boldsymbol{\mu}\) and \(\Sigma\), if you know, that
- \(Y | Z \sim N(-Z, 1)\);
- \(\mu_{Z|Y} = -\frac{1}{3} - \frac{1}{3}Y\);
- \(X | (Y, Z) \sim N(2 + 2Y + 3Z, 1)\);
Determine the conditional distribution of \(X|Y\) and also \(X | Y + Z\).

Exercise 4
Assume, you know the following:
- \(Z \sim N(0,1)\);
- \(Y | Z \sim N(1 + Z, 1)\);
- \(X | (Y, Z) \sim N(1 - Y, 1)\);
Find the distribution of the random vector \((X, Y, Z)^\top\) and the conditional distribution of \(Y | (X, Z)\). Consider the linear transformations \(U = 1 + Z\) and \(V = 1 - Y\) and find the joint distribution of the random vector \((U, V)^\top\). Calculate the value of \(E(Y | U = 2)\).

Homework Assignment

(Deadline: lab session no. 4 / 12.03.2018)

Use the command below to instal the SMSdata packages with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Chose one dataset of your choise from the list of all available datasets in the package:

data(package = "SMSdata")

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma)

Choose one dataset where there are at least two covariates which can be considered to be continous ones wich (approximatelly) some normal distribution (e.g. use a histogram plot to help yourself to decide) and load the data into the R environment.
Use your Rnw file (which you have already prepared for the previous homework assignment) and add some figures into your report. At least one figure should contours of the two-dimensional distribution you are considering. Assuming the two-variate join normal distribution for your two covariates calculate also a conditional distribution when conditioning one covariate by another one and both marginal distribtions. Draw the corresponding distributions into the firgure.
Provide some reasonable interpretation of the figures and all elements included in them.