NMST539 | Lab Session 3 (Individual work)
Multivariate Normal Distribution
(marginal and conditional distributions)
LS 2017 | Instad of Thursday 07/03/17
The R-software is available for download from the website: https://www.r-project.org
A user-friendly interface (one of many): RStudio.
Manuals and introduction into R (in Czech or English):
-
Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
-
Komárek, A.: Základy práce s R. (PDF súbor)
-
Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
-
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)
Conditional Normal Distribution
Let us consider a two-dimensional normal distribution of some random vector \(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big)\). The distribution takes the following (general) form:
\(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big) \sim N_{2}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \mu_{1} \\ \mu_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \sigma_{1}^{2} & \sigma_{12} \\\sigma_{21} & \sigma_{2}^{2} \end{array} \right) \right)\),
where \(\boldsymbol{\mu} \in \mathbb{R}^2\) is the vector of the expected values and \(\Sigma\) is the variance-covariance matrix, which is a positive definite and symmetric. The correspoding density function (of the two dimensional normal distrubution) is given by the expression
\(\large{f(\boldsymbol{x}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp\Big\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \Big\},}\)
for \(\boldsymbol{x} = (x_{1}, x_{2})^{\top} \in \mathbb{R}^{2}\).
This density can be used to derive the marginal distrubution of random variables \(X_{1}\) and \(X_{2}\) or the conditional distrubution of \(X_{1}\) given \(X_{2}\) (or \(X_{2}\) given \(X_{1}\) respectively). In the following we will do both.
For the marginal density of \(X_{1}\) we have \(f(x_{1}) = \int_{\mathbb{R}} f(x_{1}, x_{2}) \mbox{d}x_{2}\) and analogously also for the marginal density of \(X_{2}\) (where we need to integrate the join density wrt the first covariate).
For a simple example with a two dimensional normal distribution the conditional distribution distribution of \(X_{2}\) given \(X_{1} = x_{1}\) is normal again and it holds that
\((X_{2} | X_{1 } = x_{1}) \sim N\Big(\mu_{2} + \frac{\sigma_{21}(x_1 - \mu_1)}{\sigma_{1}^2}, \sigma_{2}^2 - \frac{\sigma_{12}\sigma_{21}}{\sigma_{1}^2}\Big).\)
Now we can apply the formulas given above to obtain the marginal and conditional distributions. Firstly, the library mvtnorm needs to be installed on R. The library is loaded in by running the command:
library("mvtnorm")
Let us consider a simple example with two dimensional normal distribution with the zero mean vector \(\boldsymbol{\mu} = (0,0)^\top\) and the variance-covariance matrix \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\0.8& 1\end{array} \right)\). We would like to calculate the conditional distribution of \(X_{2}\) given \(X_{1} = 0.7\).
Sigma <- matrix(c(1,.8,.8,1), nrow=2) ## variance-covariance matrix
x <- seq(-3,3,0.01)
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=.7, lwd=2, lty=2, col = "red")
text(0.75, -2, labels=expression(x[1]==0.7), col = "red", pos = 4)
### conditional distribution of X2 | X1 = 0.7
y <- dnorm(x, mean = 0.8 * 0.7, sd = sqrt(1 - 0.8^2))
lines(y-abs(min(x)),x,lty=2,lwd=2, col = "red")
### marginals
m1 <- m2 <- dnorm(x, 0, 1)
lines(x, m1 - abs(min(x)), lty = 1, lwd = 1)
lines(m2 - abs(min(x)), x, lty = 1, lwd = 1)
And we can add some other conditional distribution of (e.g. for \(X_{2} | X_{1} = -1\)) by adding additional lines of an R code:
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=2, lty=2, col = "blue")
### conditional distribution of X2 | X1 = - 1
y2 <- dnorm(x, mean = 0.8 * (- 1), sd = sqrt(1 - 0.8^2))
lines(-y2 + max(x),x,lty=2,lwd=2, col = "blue")
Another option in R on who to obtain the same output:
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=1, lty=1, col = "blue")
lines(y2 - 1 ,x,lty=2,lwd=2, col = "blue")
And even more generalized version for different values of \(x_{1} \in \mathbb{R}\):
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
condN <- function(x, cx){dnorm(x, mean = 0.8 * cx, sd = sqrt(1 - 0.8^2))}
for (i in seq(-2, 2, by = 0.25)){col <- colors()[grep("red",colors())][4*i + 9];
lines(condN(x, i) + i, x, lwd = 2, col = col);
abline(v = i, lwd=1, lty=1, col = col)}
To Do by Yourself
-
Consider the conditional distribution we obtained above. Can you numerically reconstruct (obtain) the values which are used as the parameters in the normal distribution?
-
Recall the general formula for a conditional distribution of \(X_{2}\) given \(X_{1} = x_{2}\) if the join distribution of \(\Big(\begin{array}{c}X_{1}\\ X_{2}\end{array}\Big)\) is normal and \(X_{1}\) is \(p\)-dimensional and \(X_{2}\) is \(q\)-dimensional.
-
Use the R software to make an the following example: consider at least a four dimensional normal distribution and calculate the conditional distribution of \(X_{1}\) given the remaining elements of the random vector, such that \(X_{1}\) consists of the first two elements only. Make a plot for the conditional two-dimensional normal distribution which you obtained.
Conditional Multivariate Normal with ‘condMVNorm()’ package
There is a nice (and quite brief) R library available from the R cran mirrors which can be used to calculate the conditional distribution of some random vector \(\boldsymbol{X}_{1}\) given another random vector (value) \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) where the joint distribution of \((\boldsymbol{X}_{1}^{\top}, \boldsymbol{X}_{2}^{\top})^{\top}\) is multivariate normal (with the corresponding dimension). The library can be installed by using command install.packages("condMVNorm") .
library("condMVNorm")
Functions dcmvnorm() and condMVN can be directly used to obtain the conditional distribution of interest (see the help session for more details - ?dcmvnorm and ?condMVN ).
We apply these functions to the previous example with a two dimensional normal distribution with a zero mean vector and \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array} \right)\). We want to obtain the conditional distribution of \(X_{2}\) given \(X_1 = 0.7\).
condDist <- condMVN(mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)
and also
dcmvnorm(0, mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)
## [1] 0.4301297
which should be compared with
dnorm(0, condDist$condMean, sqrt(condDist$condVar))
## [1] 0.4301297
Conditional Normal Distribution Example
In this example we consider a dataset mtcars which is standardly available in the R environment (for a more detailed description see the help session: ?mtcars ).
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
attach(mtcars)
Let us assume that covariates ‘mpg’ (miles per gallon), ‘hp’ (horse power) and ‘disp’ (displacement) are all normaly distributed and the triple \((X_{1}, X_{2}, X_{3})^{\top} = (mpg, hp, disp)^{\top}\) follows a three dimensional normal distribution with some unknown mean vector \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2}, \mu_{3})^{\top} \in \mathbb{R}^{3}\) and some positive definite and symmetric variance-covariance matrix \(\Sigma\).
What is the (estimated) distribution if the miles per gallon (‘mpg’) given the horse power and displacement, where \((X_{2}, X_{3})^{\top} = (120,120)^{\top}\)?
The corresponding estimates can be obtained by
mu <- apply(cbind(mpg, hp, disp), 2, mean)
Sigma <- cov(cbind(mpg, hp, disp))
Thus, the corresponding conditional mean and variance are obtained as
condMean <- mu[1] + Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% (c(120, 120) - mu[2:3])
condSigma <- Sigma[1,1] - Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% Sigma[2:3, 1]
Which gives us the following distribution function:
condMean
## [,1]
## [1,] 24.11354
condSigma
## [,1]
## [1,] 9.14495
plot(dnorm(x <- seq(0, 35, length = 500), mean = condMean, sd = sqrt(condSigma)) ~ x,
xlab = "Miles per gallon", ylab = "Conditional Density",
type = "l", lwd = 2, col = "red",
xlim = c(0, 35), ylim = c(0,0.18))
lines(dnorm(x, mean(mpg), sd(mpg)) ~ x, lwd = 1)
legend(0, 0.175, legend = c("Conditional distribution: MPG | HP = 120 & Disp = 120",
"Estimated Normal Distribution of MPG"), col = c("red", "black"), lty = c(1,1))
Homework Assignment
(Deadline: third lab session / 21.03.2016)
Use the command below to instal the SMSdata packages with various multivariate datasets:
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
Chose one dataset of your choise from the list of all available datasets in the package:
data(package = "SMSdata")
There are 21 different datasets and you can load each of them by typing and running data(dataset_name) , for instance:
data(plasma)
-
Choose one dataset where there are at least two covariates which can be considered to be continous ones wich (approximatelly) some normal distribution (e.g. use a histogram plot to help yourself to decide) and load the data into the R environment.
-
Use your
Rnw file (which you have already prepared for the previous homework assignment) and add some figures into your report. At least one figure should contours of the two-dimensional distribution you are considering. Assuming the two-variate join normal distribution for your two covariates calculate also a conditional distribution when conditioning one covariate by another one and both marginal distribtions. Draw the corresponding distributions into the firgure.
-
Provide some reasonable interpretation of the figures and all elements included in them.
|
Comment
Keep in minds that some functions in R require standard deviation to be specified when working with the normal distribution (e.g. functions
dnorm()
,pnorm()
,qnorm()
, etc.) while others need the variance-covariance matrix (variance in one dimension) to be provided instead.