NMST539 | Lab Session 5

Maximum Likelihood Estimation

(theoretical examples with applications in statistical tests)

LS 2017/2018 | Monday 19/03/18

Rmd file (UTF8 coding)

Maximum likelihood theory is a powerful tool for the estimation of an unknown population parameter \(\theta \in \Omega\) (or a vector of unknown parameters \(\boldsymbol{\theta} \in \Omega\), respectively). The main idea is to define the so-called likelihood function: for a random sample (in general, of some dimension \(p \in \mathbb{N}\)) \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_{n}\) drawn from a distribution family given by the density function \(f(\boldsymbol{x}, \boldsymbol{\theta})\), we define the likelihood function as

\[ L(\boldsymbol{\theta}, \mathcal{X}) = \prod_{i = 1}^n f(\boldsymbol{X}_i, \boldsymbol{\theta}), \] which is a function of the unknown vector of parameters \(\boldsymbol{\theta} \in \Omega\) given the data \(\mathcal{X} = (\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n})^\top\). The corresponding estimate \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) is defined such that the likelihood function \(L(\boldsymbol{\theta}, \mathcal{X})\) is maximized, i.e.,

\[ \widehat{\boldsymbol{\theta}} = \mathop{Argmax}_{\boldsymbol{\theta} \in \Omega} L(\boldsymbol{\theta}, \mathcal{X}). \]

The maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) of the unknown vector of parameters \(\boldsymbol{\theta} \in \Omega\) has, under some regularity assumptions, a number of appealing statistical properties; in particular, it holds that

\[ \sqrt{n} (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \sim N_{p}(\boldsymbol{0}, \mathbb{I}_{p}^{-1}(\boldsymbol{\theta})), \quad \textrm{for} \quad n \to \infty, \] where \(\mathbb{I}_{p}(\boldsymbol{\theta})\) stands for the Fisher information matrix about the parameter vector \(\boldsymbol{\theta} \in \Omega\). Thus, the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) is asymptotically unbiased with an asymptotically normal distribution (\(p\)-dimensional in general) and, moreover, it is efficient (in the sense that the asymptotic variance-covariance matrix \(\mathbb{I}_{p}^{-1}(\boldsymbol{\theta})\) attains the Rao-Cramer lower bound).
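A small numerical illustration of this asymptotic behaviour is sketched below for a simple one-parameter example (assumed here: an exponential distribution with rate \(\lambda\), where the maximum likelihood estimate is \(\widehat{\lambda} = 1/\overline{X}_n\) and the Fisher information equals \(1/\lambda^2\); the particular values of \(\lambda\), \(n\), and the number of replications are arbitrary choices).

## a minimal sketch: sqrt(n) * (lambdaHat - lambda) should be roughly N(0, lambda^2)
set.seed(539)
lambda <- 2; n <- 500; B <- 2000
lambdaHat <- replicate(B, 1 / mean(rexp(n, rate = lambda)))   ## MLE in each replication
z <- sqrt(n) * (lambdaHat - lambda)                           ## standardised estimates
c(empirical.var = var(z), asymptotic.var = lambda^2)          ## the two should be close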

To Do by Yourself


For a multidimensional parameter \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)^\top \in \Omega\) it is quite convenient to use matrix notation for the calculations and, consequently, to use matrix derivatives accordingly. Let us recall some standard formulae for matrix derivatives.

Verify by yourself (using \(2 \times 2\) and \(3 \times 3\) symmetric and asymmetric matrices) that the formulae below for the matrix derivatives are correct (a small numerical check in R for one of them is sketched after the list).

  • For some general square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial |\mathcal{X}|}{\partial \mathcal{X}} = \big(Adj(\mathcal{X})\big)^\top;\]

  • For some symmetric square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial |\mathcal{X}|}{\partial \mathcal{X}} = 2\, Adj(\mathcal{X}) - Diag\big(Adj(\mathcal{X})\big);\]

  • For some general square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathcal{X} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = \mathbb{A}^\top;\]

  • For some square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), where both \(\mathcal{X}\) and \(\mathbb{A}\) are symmetric, we have: \[\frac{\partial tr(\mathcal{X} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = 2 \mathbb{A} - Diag(\mathbb{A});\]

  • For some general square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathcal{X}^{-1} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X}^{-1})}{\partial \mathcal{X}} = - \big(\mathcal{X}^{-1}\mathbb{A} \mathcal{X}^{-1}\big)^\top;\]

  • For some matrices \(\mathbb{A}\) and \(\mathbb{B}\) of the types \(q \times p\) and \(p \times q\) and a square matrix \(\mathcal{X}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathbb{A}\mathcal{X} \mathbb{B})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{B} \mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = (\mathbb{B}\mathbb{A})^\top = \mathbb{A}^\top\mathbb{B}^\top;\]

  • For some general square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial \ln |\mathcal{X}|}{\partial \mathcal{X}} = \frac{1}{det(\mathcal{X})} \big(Adj(\mathcal{X})\big)^\top = (\mathcal{X}^{-1})^\top \equiv \mathcal{X}^{-\top};\]

  • For some symmetric square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial \ln |\mathcal{X}|}{\partial \mathcal{X}} = 2 \mathcal{X}^{-1} - Diag(\mathcal{X}^{-1});\]
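Such a check can also be done numerically. The sketch below (an arbitrary \(3 \times 3\) example, using the very first formula above) compares an entry-wise central finite-difference approximation of \(\partial |\mathcal{X}| / \partial \mathcal{X}\) with the claimed closed form, recalling that \(Adj(\mathcal{X}) = det(\mathcal{X}) \cdot \mathcal{X}^{-1}\) for an invertible matrix.

set.seed(539)
p <- 3
X <- matrix(rnorm(p^2), p, p)          ## a general (asymmetric) invertible matrix
eps <- 1e-6

numDeriv <- matrix(0, p, p)            ## entry-wise finite-difference derivative of det(X)
for (i in 1:p) {
  for (j in 1:p) {
    Xp <- X; Xm <- X
    Xp[i, j] <- Xp[i, j] + eps
    Xm[i, j] <- Xm[i, j] - eps
    numDeriv[i, j] <- (det(Xp) - det(Xm)) / (2 * eps)
  }
}

max(abs(numDeriv - det(X) * t(solve(X))))   ## compare with (Adj(X))^T; should be close to zero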




The formulae above are useful for deriving the maximum likelihood estimates for various random vector distributions. In particular, we focus on estimating the mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and the variance-covariance matrix \(\Sigma \in \mathbb{R}^{p \times p}\) (which is symmetric and positive definite) of a general multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\).

1. Maximum Likelihood Estimation

Example 1

Consider a two-dimensional random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_{n}\), with \(\boldsymbol{X}_{i} = (X_{i 1}, X_{i 2})^\top\), drawn from the bivariate distribution given by the density function \[ f_{\boldsymbol{\theta}}(x_{1}, x_{2}) = \frac{1}{\theta_1\theta_2} \cdot \exp\Big\{ -\Big( \frac{x_1}{\theta_1} + \frac{x_2}{\theta_2} \Big) \Big\} \cdot \mathbb{I}_{\{x_1 > 0\}} \mathbb{I}_{\{x_2 > 0\}}, \] where \(\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (0, \infty) \times (0, \infty)\). Find the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) of the parameter vector \(\boldsymbol{\theta}\) and compute the Rao-Cramer lower bound.
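Since the density factorizes into two independent exponential margins with means \(\theta_1\) and \(\theta_2\), the analytical result can be checked numerically. The sketch below (with arbitrarily chosen true parameters and sample size) maximizes the log-likelihood with a general-purpose optimizer and compares the outcome with the coordinate-wise sample means, which solve the score equations.

set.seed(539)
n     <- 1000
theta <- c(2, 0.5)                            ## true (theta1, theta2)
x     <- cbind(rexp(n, rate = 1 / theta[1]),  ## X_{i1} ~ Exp with mean theta1
               rexp(n, rate = 1 / theta[2]))  ## X_{i2} ~ Exp with mean theta2

negLogLik <- function(par)                    ## par = (theta1, theta2), both positive
  n * log(par[1]) + n * log(par[2]) + sum(x[, 1]) / par[1] + sum(x[, 2]) / par[2]

optim(c(1, 1), negLogLik, method = "L-BFGS-B", lower = c(1e-6, 1e-6))$par
colMeans(x)                                   ## the coordinate-wise sample means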

Example 2
Consider a random sample from the bivariate distribution given by the density function \[ f_{\boldsymbol{\theta}}(x_1, x_2) = \frac{1}{\theta_1^2 \theta_2} \frac{1}{x_2} \exp\Big\{ - \Big( \frac{x_1}{\theta_1 x_2} + \frac{x_2}{\theta_1 \theta_2} \Big) \Big\}, \] for \(x_1, x_2 > 0\) and zero otherwise. Find the maximum likelihood estimate of the vector of parameters \(\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (0, \infty) \times (0, \infty)\) and calculate the Rao-Cramer lower bound for \(\widehat{\boldsymbol{\theta}}\).
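Although the exercise asks for an analytical derivation, a numerical sanity check is again possible. The density factorizes as \(X_2\) being exponential with mean \(\theta_1\theta_2\) and \(X_1 \mid X_2\) being exponential with mean \(\theta_1 X_2\), which makes simulation from the model easy; the true parameter values, the sample size, and the starting point of the optimizer in the sketch below are arbitrary choices.

set.seed(539)
n     <- 5000
theta <- c(1.5, 2)                                 ## true (theta1, theta2)
x2    <- rexp(n, rate = 1 / (theta[1] * theta[2]))
x1    <- rexp(n, rate = 1 / (theta[1] * x2))       ## conditionally on x2

negLogLik <- function(par) {                       ## minus the log-likelihood
  t1 <- par[1]; t2 <- par[2]
  2 * n * log(t1) + n * log(t2) + sum(log(x2)) +
    sum(x1 / (t1 * x2)) + sum(x2 / (t1 * t2))
}

optim(c(1, 1), negLogLik, method = "L-BFGS-B", lower = c(1e-6, 1e-6))$par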

Example 3
Let the random vector \(\boldsymbol{X}\) follow the multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), where \(\boldsymbol{\mu} \in \mathbb{R}^p\) and \(\Sigma = Diag\{\sigma_{11}, \dots, \sigma_{pp}\}\), for some \(\sigma_{jj} > 0\), for any \(j = 1, \dots, p\). Let \(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n}\) be a random sample drawn from this distribution. Use the given sample to find the maximum likelihood estimates of the mean vector \(\boldsymbol{\mu}\) and the variance matrix \(\Sigma\). Derive also the Rao-Cramer lower bound for the parameter vector \(\boldsymbol{\theta} = (\mu_1, \dots, \mu_p, \sigma_{11}, \dots, \sigma_{pp})^\top\).

Example 4
Consider a multivariate random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n\) from the multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), for a mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and some symmetric and positive definite matrix \(\Sigma \in \mathbb{R}^{p \times p}\). Find the maximum likelihood estimates of the unknown mean vector \(\boldsymbol{\mu}\) and the variance-covariance matrix \(\Sigma\).
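The derived estimates can be checked empirically: the well-known closed forms are the sample mean vector and the \(1/n\)-scaled empirical covariance matrix, i.e. \(\widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (\boldsymbol{X}_i - \overline{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \overline{\boldsymbol{X}}_n)^\top\). A short sketch (with arbitrary parameter values and sample size):

library("mvtnorm")                            ## for rmvnorm()
set.seed(539)
n     <- 500
mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
X     <- rmvnorm(n, mu, Sigma)

muHat    <- colMeans(X)                       ## MLE of the mean vector
SigmaHat <- (n - 1) / n * cov(X)              ## MLE of the covariance matrix (1/n scaling)
muHat
SigmaHat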



2. Hypothesis tests

Example 1
Simulate a random sample from the bivariate normal distribution \(N_{2}\Big(\Big(\begin{array}{c}1\\2\end{array}\Big), \Big( \begin{array}{cc}1 & 0.5\\0.5 & 2 \end{array} \Big)\Big)\), and test the null hypothesis \[ H_{0}: 2\mu_{1} - \mu_2 = 0.\]

Consider two situations: either the variance-covariance matrix \(\Sigma\) is known, or \(\Sigma\) is unknown. Compare the results.

library("mvtnorm")
n <- 100
mu <-  c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 2),2,2)

set.seed(1234)
sample <- rmvnorm(n, mu, Sigma)

The null hypothesis can also be expressed as \[ H_0: \mathbb{A}\boldsymbol{\mu} = a, \] where \(\boldsymbol{\mu} = (\mu_1, \mu_2)^\top\) (equal to \((1, 2)^\top\) in the simulation above, so that \(H_0\) indeed holds), \(\mathbb{A} = (2, -1)\), and \(a = 0\). Under the null hypothesis, we have that \[ T_{n} = n(\mathbb{A}\overline{\boldsymbol{X}}_{n} - a)^\top(\mathbb{A}\Sigma \mathbb{A}^\top)^{-1} (\mathbb{A} \overline{\boldsymbol{X}}_{n} - a) \sim \chi_{1}^{2}. \] Therefore, the null hypothesis is rejected for large values of the test statistic \(T_{n}\), namely if \(T_{n}\) exceeds the critical value given by the \((1 - \alpha)\) quantile of the \(\chi^2\) distribution with 1 degree of freedom, for some given significance level \(\alpha \in (0, 1)\).

The value of the test statistic is

a <- 0
A <- c(2, -1)                                 ## the hypothesis 2*mu_1 - mu_2 = 0 written as A mu = a
## since A is a plain vector, A %*% Sigma %*% A evaluates to the scalar A Sigma A^T
Tvalue <- n * t(A %*% colMeans(sample) - a) %*% solve(A %*% Sigma %*% A) %*% (A %*% colMeans(sample) - a)

which should be compared with the corresponding quantile \(\chi_1^2(0.95) = 3.8415\). The null hypothesis is, therefore, not rejected (the value of the test statistic \(T_n\) is roughly 0.37).

However, if the variance-covariance matrix \(\Sigma\) is not known, the test statistic \(T_{n}\) cannot be calculated and, instead of \(\Sigma\), the corresponding estimate – the sample variance-covariance matrix – has to be plugged in. Then we obtain that

\[ \tilde{T}_{n} = (n - 1)(\mathbb{A}\overline{\boldsymbol{X}}_{n} - a)^\top(\mathbb{A}\mathcal{S} \mathbb{A}^\top)^{-1} (\mathbb{A} \overline{\boldsymbol{X}}_{n} - a) \sim T_{1, n - 1}^2. \]

TTvalue <- (n - 1) * t(A %*% colMeans(sample) - a) %*% solve(A %*% cov(sample) %*% A) %*% (A %*% colMeans(sample) - a)   ## Sigma replaced by the sample covariance matrix

which should now be compared with the corresponding quantile of the Fisher F distribution, \(F_{1, 99}(0.95) = 3.9371\). Both tests perform quite similarly, which is common especially for larger sample sizes. On the other hand, for small \(n \in \mathbb{N}\) the results of the two tests can differ considerably, because the second test also has to account for the uncertainty coming from the estimation of the variance matrix.

To Do by Yourself


Design a small simulation study to investigate the empirical performance of the two tests mentioned above.
  • Firstly, simulate the data under the null hypothesis and perform both tests. Find out what the empirical probability of the type I error is (i.e., the overall proportion of rejections of the null hypothesis when the null hypothesis holds).
  • Secondly, simulate data under the alternative hypothesis and investigate the empirical power of the tests (i.e., the empirical probability of rejecting the null hypothesis when it does not hold). Use different sample sizes and different alternatives to get some meaningful insight into the performance of both tests (a minimal simulation sketch is given below).
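A minimal sketch of such a simulation study follows (hypothetical settings: \(n = 20\) observations, \(B = 1000\) replications, and one alternative under which \(2\mu_1 - \mu_2 = 0.5\); all of these are arbitrary choices to be varied).

library("mvtnorm")
set.seed(539)
B     <- 1000
n     <- 20
A     <- c(2, -1); a <- 0
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)

oneRun <- function(mu) {
  X  <- rmvnorm(n, mu, Sigma)
  d  <- A %*% colMeans(X) - a
  T1 <- n * t(d) %*% solve(A %*% Sigma %*% A) %*% d          ## Sigma known
  T2 <- (n - 1) * t(d) %*% solve(A %*% cov(X) %*% A) %*% d   ## Sigma estimated
  c(T1 > qchisq(0.95, 1), T2 > qf(0.95, 1, n - 1))           ## reject (TRUE) or not (FALSE)
}

## empirical level (data generated under H0) and empirical power (under one alternative)
rowMeans(replicate(B, oneRun(c(1, 2))))       ## H0 holds: 2*1 - 2 = 0
rowMeans(replicate(B, oneRun(c(1.25, 2))))    ## H1 holds: 2*1.25 - 2 = 0.5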



Example 2
Consider a random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n\) from the multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), for some mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and some symmetric and positive definite matrix \(\Sigma\). Consider the test of the null hypothesis \[ H_{0}: \boldsymbol{\mu} = \boldsymbol{\mu}_{0} \] against the general alternative \[ H_{1}: \boldsymbol{\mu} \neq \boldsymbol{\mu}_{0}. \] Show that the maximum of the log-likelihood function under the null hypothesis is equal to \(\ell_{0} = \ell\big(\mathcal{X}, \boldsymbol{\mu}_0, \mathcal{S} + (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_0)(\overline{\boldsymbol{X}}_{n} - \boldsymbol{\mu}_0)^\top\big)\). Show that the likelihood ratio test statistic is equal to \[ - 2 \log \lambda = 2 (\ell_1 - \ell_0) = n \log\Big(1 + (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) ^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0})\Big). \]


Note that the test statistic only depends on the quantity \((\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) ^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0})\) and, moreover, under the null hypothesis \(H_0\) it holds that \[ (n - 1)(\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) ^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) \sim T_{p, n - 1}^2. \]
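A brief numerical illustration is sketched below (assuming the simulated bivariate sample sample and the sample size n from the previous section are still in the workspace, taking \(\boldsymbol{\mu}_0\) equal to the true mean used in the simulation, and approximating \(\mathcal{S}\) by cov() as in the code above): both the likelihood ratio statistic and the Hotelling-type statistic are monotone functions of the same quadratic form, so the two tests are equivalent.

p   <- 2
mu0 <- c(1, 2)                                     ## mu_0 equal to the true mean, so H_0 holds
d   <- colMeans(sample) - mu0
Q   <- t(d) %*% solve(cov(sample)) %*% d           ## the common quadratic form
lrt <- n * log(1 + Q)                              ## -2 log(lambda)
hot <- (n - 1) * Q                                 ## ~ T^2_{p, n-1} under H_0
c(lrt = lrt, hotelling = hot,
  critical = (n - 1) * p / (n - p) * qf(0.95, p, n - p))   ## T^2 critical value via the F distribution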

To Do by Yourself


  • Consider a bivariate random sample of size \(n = 10\) from the distribution \(N_{2}(\boldsymbol{\mu}, \Sigma)\), for an unknown mean vector \(\boldsymbol{\mu} \in \mathbb{R}^2\) and the known variance-covariance matrix \(\Sigma = \Big( \begin{array}{cc}2 & -1\\ -1 & 2 \end{array}\Big)\). Assume that the sample mean vector \(\overline{\boldsymbol{X}}_{n}\) equals \((1, \frac{1}{2})^\top\). For the significance level \(\alpha = 0.05\) find the tests (critical regions and the corresponding decision rules) for the following null and alternative hypotheses (a small starting sketch for the first case, with \(\Sigma\) known, is given after this list):

    • \[ H_0: \boldsymbol{\mu} = \Big(\begin{array}{c} 2\\ 2/3\end{array}\Big) \quad \quad \quad H_1: \boldsymbol{\mu} \neq \Big(\begin{array}{c} 2\\ 2/3\end{array}\Big) \]
    • \[ H_0: \mu_1 + \mu_2 = \frac{7}{2} \quad \quad \quad H_1: \mu_1 + \mu_2 \neq \frac{7}{2} \]
    • \[ H_0: \mu_1 - \mu_2 = \frac{1}{2} \quad \quad \quad H_1: \mu_1 - \mu_2 \neq \frac{1}{2} \]

  • Repeat the previous tests, but now assume that the variance-covariance matrix is unknown and that the corresponding empirical estimate – the sample variance-covariance matrix \(\mathcal{S}\) – is equal to \(\Big( \begin{array}{cc}2 & -1\\ -1 & 2 \end{array}\Big)\). Compare the results of the tests with the scenarios where \(\Sigma\) is known.
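A starting-point sketch for the first hypothesis with \(\Sigma\) known: everything is determined by the given summary statistics, so no raw data are needed (under \(H_0\) the statistic below follows the \(\chi^2\) distribution with 2 degrees of freedom).

n     <- 10
xbar  <- c(1, 1/2)
Sigma <- matrix(c(2, -1, -1, 2), 2, 2)
mu0   <- c(2, 2/3)

T0 <- n * t(xbar - mu0) %*% solve(Sigma) %*% (xbar - mu0)    ## ~ chi^2_2 under H_0 (Sigma known)
c(statistic = T0, critical = qchisq(0.95, 2))                ## reject H_0 if the statistic exceeds the critical value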