NMST539, LS 2015/16

Cvičenie 3 (týždeň 4)

Multivariate normal distribution)

(marginal and conditional distributions)


Let us firstly recall the two-dimensional normal distribution of some random vector \(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big)\) denoted as

\(\Big(\begin{array}{x}X_{1}\\X_{2}\end{array}\Big) \sim N_{2}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \mu_{1} \\ \mu_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \sigma_{1}^{2} & \sigma_{21} \\\sigma_{12} & \sigma_{2}^{2} \end{array} \right) \right)\)

where \(\boldsymbol{\mu}\) is the expectation vector and \(\Sigma\) is the variance-covariance matrix, which is a positive definite and symmetric. The correspoding density function (of the two dimensional normal distrubution) is given by the expression

\(\large{f(\boldsymbol{x}) = \frac{1}{2 \pi |\Sigma|^{1/2}} exp\Big\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \Big\},}\)

for \(\boldsymbol{x} = (x_{1}, x_{2})^{\top} \in \mathbb{R}^{2}\). It is commonly useful to be able to derive marginal distrubution of random variables \(X_{1}\) and \(X_{2}\) or the conditional distrubution of \(X_{1}\) given \(X_{2}\) (or \(X_{2}\) given \(X_{1}\) respectively).

library("mvtnorm")

For a simple example with a two dimensional normal distribution we draw marginal densities and a conditional density of \(X_{2}\) given \(X_{1} = 0.7\).

Sigma <- matrix(c(1,.8,.8,1), nrow=2) ## variance-covariance matrix

x <- seq(-3,3,0.01)
contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))

abline(v=.7, lwd=2, lty=2, col = "red")
text(0.75, -2, labels=expression(x[1]==0.7), col = "red", pos = 4)

### conditional distribution of X2 | X1 = 0.7
y <- dnorm(x, mean =  0.8 * 0.7, sd = sqrt(1 - 0.8^2))
lines(y-abs(min(x)),x,lty=2,lwd=2, col = "red")

### marginals
m1 <- m2 <- dnorm(x, 0, 1)
lines(x, m1 - abs(min(x)), lty = 1, lwd = 1)
lines(m2 - abs(min(x)), x, lty = 1, lwd = 1)

And we can add some other conditianal distribution of (e.g. for \(X_{2} | X_{1} = -1\)) by adding additional lines of an R code:

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=2, lty=2, col = "blue")

### conditional distribution of X2 | X1 = - 1
y2 <- dnorm(x, mean = 0.8 * (- 1), sd = sqrt(1 - 0.8^2))
lines(-y2 + max(x),x,lty=2,lwd=2, col = "blue")

Another option:

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
abline(v=-1, lwd=1, lty=1, col = "blue")
lines(y2 - 1 ,x,lty=2,lwd=2, col = "blue")

And even more generalized version for different values of \(x_{1} \in \mathbb{R}\):

contour(x,x,outer(x,x,function(x,y){dmvnorm(cbind(x,y),sigma=Sigma)}))
condN <- function(x, cx){dnorm(x, mean = 0.8 * cx, sd = sqrt(1 - 0.8^2))}
for (i in seq(-2, 2, by = 0.25)){col <- colors()[grep("red",colors())][4*i + 9]; 
                                 lines(condN(x, i) + i, x, lwd = 2, col = col);
                                 abline(v = i, lwd=1, lty=1, col = col)}


Comment


For a general multivariate normal distribution denoted as

\(\Big(\begin{array}{x}\boldsymbol{X}_{1}\\\boldsymbol{X}_{2}\end{array}\Big) \sim N_{p + q}\left(\boldsymbol{\mu} = \Big(\begin{array}{c} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2}\end{array}\Big), \Sigma = \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{21} \\\Sigma_{12} & \Sigma_{22} \end{array} \right) \right)\)

for some random vectors \(\boldsymbol{X}_{1} \in \mathbb{R}^{p}\) and \(\boldsymbol{X}_{2} \in \mathbb{R}^{q}\) for \(p, q \in \mathbb{N}\) and the corresponding expectation vectors \(\boldsymbol{\mu}_{1}\) and \(\boldsymbol{\mu}_{2}\) and the variance-covariance matrix \(\Sigma\) the conditional distribution of \(\boldsymbol{X}_{1}\) given \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) is defined by the following expression:

\(\boldsymbol{X}_{1} | \boldsymbol{X}_{2} = \boldsymbol{x}_{2} \sim N_{p}\left(\boldsymbol{\mu}_{1} + \Sigma_{12}\Sigma_{22}^{-1}(\boldsymbol{x}_{2} - \boldsymbol{\mu}_{2} ), \Sigma = \Sigma_{11} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)\).


Conditional multivariate normal with ‘condMVNorm()’

There is a nice (and quite brief) R library available from the R cran mirrors which can be used to calculate the conditional distribution of some random vector \(\boldsymbol{X}_{1}\) given another random vector (value) \(\boldsymbol{X}_{2} = \boldsymbol{x}_{2}\) where the joint distribution of \((\boldsymbol{X}_{1}^{\top}, \boldsymbol{X}_{2}^{\top})^{\top}\) is multivariate normal (with the corresponding dimension). The library can be installed by using command install.packages("condMVNorm").

library("condMVNorm")
## Warning: package 'condMVNorm' was built under R version 3.1.2

Functions dcmvnorm() and condMVN can be directly used to obtain the conditional distribution of interest (see the help session for more details - ?dcmvnorm and ?condMVN).

We apply these functions to the previous example with a two dimensional normal distribution with a zero mean vector and \(\Sigma = \left( \begin{array}{cc} 1 & 0.8 \\ 0.8 & 1 \end{array} \right)\). We want to obtain the conditional distribution of \(X_{2}\) given \(X_1 = 0.7\).

condDist <- condMVN(mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)

and also

dcmvnorm(0, mean = c(0,0), sigma = Sigma, dependent.ind = 2, given.ind = 1, X.given = 0.7)
## [1] 0.4301297

which should be compared with

dnorm(0, condDist$condMean, sqrt(condDist$condVar))
## [1] 0.4301297


Comment


Keep in minds that some functions in R require standard deviation to be specified when working with the normal distribution (e.g. functions dnorm(), pnorm(), qnorm(), etc.) while others need the variance-covariance matrix (variance in one dimension) to be provided instead.



Conditional normal distribution example

In this example we consider a dataset mtcars which is standardly available in the R environment (for a more detailed description see the help session: ?mtcars).

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
attach(mtcars)

Let us assume that covariates ‘mpg’ (miles per gallon), ‘hp’ (horse power) and ‘disp’ (displacement) are all normaly distributed and the triple \((X_{1}, X_{2}, X_{3})^{\top} = (mpg, hp, disp)^{\top}\) follows a three dimensional normal distribution with some unknown mean vector \(\boldsymbol{\mu} = (\mu_{1}, \mu_{2}, \mu_{3})^{\top} \in \mathbb{R}^{3}\) and some positive definite and symmetric variance-covariance matrix \(\Sigma\).

What is the (estimated) distribution if the miles per gallon (‘mpg’) given the horse power and displacement, where \((X_{2}, X_{3})^{\top} = (120,120)^{\top}\)?

The corresponding estimates can be obtained by

mu <- apply(cbind(mpg, hp, disp), 2, mean) 
Sigma <- cov(cbind(mpg, hp, disp))

Thus, the corresponding conditional mean and variance are obtained as

condMean <- mu[1] + Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% (c(120, 120) - mu[2:3])
condSigma <- Sigma[1,1] - Sigma[1, 2:3] %*% solve(Sigma[2:3,2:3]) %*% Sigma[2:3, 1]

Which gives us the following distribution function:

condMean
##          [,1]
## [1,] 24.11354
condSigma
##         [,1]
## [1,] 9.14495
plot(dnorm(x <- seq(0, 35, length = 500), mean = condMean, sd = sqrt(condSigma)) ~ x, 
                                           xlab = "Miles per gallon", ylab = "Conditional Density", 
                                           type = "l", lwd = 2, col = "red", 
                                           xlim = c(0, 35), ylim = c(0,0.18))
lines(dnorm(x, mean(mpg), sd(mpg)) ~ x, lwd = 1)
legend(0, 0.175, legend = c("Conditional distribution: MPG | HP = 120 & Disp = 120", 
                           "Estimated Normal Distribution of MPG"), col = c("red", "black"), lty = c(1,1))

Domáca úloha

(Deadline: piaty týždeň / 21.03.2016)

Na webovej stránke Doc. Hlávku

http://www1.karlin.mff.cuni.cz/~hlavka/teac.html

je k dispozícii niekoľko dátových súborov, ktoré do Rka stačí načítať pomocou príkazu load(nazov_suboru.rda).

  • vyberte si jeden dátový súbor, u ktorého je možne predpokladať aspoň dve premenné, ktoré by mohli mať normálne rozdelenie (napr. odhadnúť pomocou histogramu) načítajte ho v Rku;
  • do již vytvoreného Rnw súboru, pridajte aspoň jeden obrázok podobný s tými, ktoré sme vytvárali na začiatku cvičenia;
  • príslušný obrázok by mal obsahovať kontúry, marginálne hustory, podmienenú hustotu a vhodný popis;