NMST539 | Lab Session 6
Application of PCA and Introduction to LASSO
LS 2017 | Tuesday 11/04/2017
Rmd file (UTF-8 encoding)

The R software is available for download from the website: https://www.r-project.org. A user-friendly interface (one of many): RStudio. Manuals and introductions to R (in Czech or English):
Linear Regression on PCs

We will again start with the data sample representing 65 river localities in the Czech Republic. For each locality there are 17 biological metrics recorded (the status of the biological life within each locality, expressed in terms of some well-defined and internationally recognized metrics, i.e. indexes). In addition to the previous lab session, we also consider a second dataset with chemical measurements at the same localities (concentrations of 7 different chemical substances). The data file can be obtained from the following address:
There is one locality not recorded in the chemical dataset. Therefore, we need to synchronize both datasets before running any analysis. This can be done with the following R code:
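A minimal sketch of one way to synchronize the two datasets; the object names `bio` and `chem` and the key column `locality` are stand-ins here (the actual variable names in the course files may differ), and the data are synthetic:

```r
# Hypothetical stand-ins for the two course datasets: 'bio' has 65
# localities, 'chem' is missing one of them.
bio  <- data.frame(locality = paste0("loc", 1:65), metric1 = rnorm(65))
chem <- data.frame(locality = paste0("loc", 2:65), P = rnorm(64))

# Keep only the localities present in both datasets.
common <- intersect(bio$locality, chem$locality)
bio2   <- bio[bio$locality %in% common, ]
chem2  <- chem[chem$locality %in% common, ]

dim(bio2)   # both synchronized datasets now have the same rows
dim(chem2)
```

Alternatively, `merge(bio, chem, by = "locality")` joins the two datasets into a single data frame in one step.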
To recall the correlation structure in the data, we can visualize the correlation matrix using the library ‘corrplot’ and the R command
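A sketch of the visualization (assuming the ‘corrplot’ package is installed); the data frame below is a synthetic stand-in for the biological metrics:

```r
# Synthetic stand-in for the bio metrics (5 columns instead of 17).
set.seed(1)
bio2 <- as.data.frame(matrix(rnorm(64 * 5), ncol = 5,
                             dimnames = list(NULL, paste0("metric", 1:5))))

M <- cor(bio2)            # sample correlation matrix of the metrics

# Visualize the correlation structure (guarded in case the package
# is not installed).
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(M, method = "ellipse")
}
```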
Reminder
Instead of using the bio metrics directly, we apply principal component analysis and try to explain the dependent variable using a few principal components. The principal components are obtained by
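For instance, with the SVD-based function `prcomp()` from base R (shown here on synthetic stand-in data, since the course data file is loaded separately):

```r
# Synthetic stand-in for the bio metrics matrix (65 localities, 4 metrics).
set.seed(2)
X <- matrix(rnorm(65 * 4), ncol = 4)

pca <- prcomp(X, scale. = TRUE)   # standardize the variables first
summary(pca)                      # proportions of explained variability
head(pca$x)                       # the principal component scores
```

With `scale. = TRUE` the PCA is performed on the correlation matrix, which is advisable when the metrics are measured on different scales.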
And compare the following:
or with the results of the EVD-based approach:
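The SVD-based `prcomp()` and the eigenvalue decomposition of the sample covariance matrix give the same spectrum (the eigenvectors may differ in sign), as this small synthetic check illustrates:

```r
set.seed(3)
X <- matrix(rnorm(65 * 4), ncol = 4)

pca <- prcomp(X)          # SVD-based principal components
evd <- eigen(cov(X))      # EVD of the sample covariance matrix

# The squared sdev values from prcomp() are the sample eigenvalues.
cbind(prcomp = pca$sdev^2, evd = evd$values)
```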
The principal components are mutually orthogonal (so this approach removes any multicollinearity) and are ordered by decreasing variability. We can directly use the principal components to fit a classical linear regression model. Logically, it makes no sense to use e.g. the second PC while omitting the first one (in general, the hierarchy should be such that all lower-order PCs are automatically included in the model). With respect to interpretability, it also makes no sense to put too many PCs into the model. Let us, for example, model the amount of phosphorus using the first principal component.
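A sketch of such a regression; the phosphorus response `P` is synthetic here (in the lab it comes from the chemical dataset):

```r
set.seed(4)
X   <- matrix(rnorm(64 * 5), ncol = 5)   # stand-in for the bio metrics
pca <- prcomp(X, scale. = TRUE)

# Synthetic response driven by the first PC, standing in for phosphorus.
P <- 2 * pca$x[, 1] + rnorm(64, sd = 0.5)

# Regression of the response on the first principal component only.
fit1 <- lm(P ~ pca$x[, 1])
summary(fit1)
```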
We can even perform a statistical test to decide whether the first principal component explains some specific amount of variability. For a random variable \(U\) with the Wishart distribution \(m^{-1} W_{p}(\Sigma, m)\) it holds, asymptotically, that \[ \sqrt{m}(\ell - \lambda) \sim N_{p}(\boldsymbol{0}, 2 \Lambda^2), \] where \(\lambda = (\lambda_{1}, \dots, \lambda_{p})^\top\) and \(\ell = (\ell_1, \dots, \ell_{p})^\top\) are the corresponding eigenvalues from the eigenvalue decompositions of \(\Sigma = \Gamma \Lambda \Gamma^\top\) and \(U = G L G^\top\). The proportion of variability explained by the first principal component (the theoretical quantity \(\lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\)) is estimated by \(\ell_{1} / (\ell_{1} + \dots + \ell_{p})\). In our case, this equals the following:
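The estimated proportion is easy to compute from the `prcomp()` output, since the squared `sdev` values are the sample eigenvalues \(\ell_{j}\) (synthetic stand-in data again):

```r
set.seed(5)
X <- matrix(rnorm(65 * 4), ncol = 4)

ell  <- prcomp(X, scale. = TRUE)$sdev^2   # sample eigenvalues
prop <- ell[1] / sum(ell)                 # ell_1 / (ell_1 + ... + ell_p)
prop                                      # matches the summary() output
```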
which is indeed the value we can see in the output. This is a nonlinear transformation of the vector \(\ell = (\ell_1, \dots, \ell_p)\), but it can be shown that \[ T_{1} \equiv \sqrt{n - 1} \Bigg(\frac{\ell_{1}}{\ell_{1} + \dots + \ell_{p}} - \frac{\lambda_{1}}{\lambda_{1} + \dots + \lambda_{p}}\Bigg) \mathop{\sim}^{as.} N(0, \omega^2), \] where \[ \omega^2 = \frac{2 tr(\Lambda^2)}{(tr(\Lambda))^2} \Big( \psi^2 - 2 \psi\beta + \beta\Big), \] for \(\psi = \lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\) and \(\beta = \lambda_{1}^2 / (\lambda_{1}^2 + \dots + \lambda_{p}^2)\). For testing the null hypothesis \(H_{0}: \psi = \psi_{0}\), the test statistic \(T_{1}(\psi_{0})/\omega\) follows, under the null hypothesis, the standard normal distribution \(N(0,1)\). For instance, we can test the null hypothesis that the proportion of variability explained when using only one principal component equals \(85~\%\).

Improving the model

Alternatively, we can improve the overall model performance by adding the second principal component
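A sketch of this asymptotic test, plugging the sample eigenvalues into the formula for \(\omega^2\) above with \(\psi_{0} = 0.85\) (synthetic stand-in data; in the lab the eigenvalues come from the bio metrics):

```r
set.seed(6)
X <- matrix(rnorm(65 * 4), ncol = 4)
n <- nrow(X)

ell <- prcomp(X, scale. = TRUE)$sdev^2    # sample eigenvalues ell_j

psi.hat  <- ell[1] / sum(ell)             # estimate of psi
beta.hat <- ell[1]^2 / sum(ell^2)         # estimate of beta

# Plug-in estimate of omega^2 from the formula above.
omega2 <- 2 * sum(ell^2) / sum(ell)^2 *
          (psi.hat^2 - 2 * psi.hat * beta.hat + beta.hat)

psi0 <- 0.85                              # H0: psi = 0.85
T1   <- sqrt(n - 1) * (psi.hat - psi0) / sqrt(omega2)

2 * pnorm(-abs(T1))                       # asymptotic two-sided p-value
```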
Pros and Cons
Dimensionality Reduction & Variable Selection via LASSO

Another approach to selecting significant variables is introduced by regularized regression based on LASSO. LASSO stands for Least Absolute Shrinkage and Selection Operator, and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \] with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization problem defined as \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \] where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. Obviously, the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above; thus, the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. For large values of \(\lambda\), most elements of the vector of regression parameters are set exactly to zero. Such a solution is called sparse (the vector of parameter estimates contains only a few non-zero elements). On the other hand, for \(\lambda \to 0\) the first term in the minimization problem dominates, and for \(\lambda = 0\) the whole problem reduces to classical least squares regression. The main advantage of the LASSO regularization approach is that it can also handle heavily overparametrized models (situations where \(p \gg n\)).
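The shrinkage-to-zero behaviour can be illustrated without any package: for an orthonormal design, the minimizer of the penalized criterion above has the closed form \(\widehat{\beta}_{j}(\lambda) = \mathrm{sign}(b_{j}) \max(|b_{j}| - \lambda/2,\, 0)\), the soft-thresholded least squares estimate \(b_{j}\). A small self-contained sketch (the coefficient values are hypothetical):

```r
# Soft-thresholding operator: the LASSO solution for an orthonormal
# design, applied to the ordinary least squares coefficients b.
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

b <- c(3, 1.2, -0.4, 0.1)    # hypothetical OLS coefficients

soft(b, lambda = 0)          # lambda = 0: the least squares solution
soft(b, lambda = 1)          # moderate shrinkage, small ones set to zero
soft(b, lambda = 10)         # heavy penalty: fully sparse solution
```

For a general design matrix there is no such closed form, which is exactly why dedicated algorithms (see below) are needed.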
Such problems commonly occur in genetic data, sociological surveys, econometrics, and basically all other areas… Unlike the classical least squares problem, where the vector of parameter estimates is given explicitly as the solution of the normal equations, there is in general no explicit solution for the LASSO regression. But…

LARS-LASSO Algorithm

A very elegant solution of the LASSO fitting problem was obtained once it was realized that the solution paths are piecewise linear. Given this, one only needs to calculate the knot positions; the rest of the path is just linear interpolation. Using the LARS-LASSO algorithm proposed by Efron et al. (2004), we can do this very efficiently — actually, in only \(p\) linear steps! The whole algorithm is based on a geometric interpretation of the covariances between the current response estimate and the covariates not yet included in the model.
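The algorithm is implemented in the ‘lars’ package (assumed installed here; the data are synthetic). `lars()` returns the whole piecewise-linear coefficient path, with one knot per step:

```r
set.seed(8)
n <- 50; p <- 6
X <- matrix(rnorm(n * p), ncol = p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)   # only covariates 1 and 3 matter

# Guarded in case the 'lars' package is not installed.
if (requireNamespace("lars", quietly = TRUE)) {
  fit <- lars::lars(X, y, type = "lasso")
  print(fit)    # the sequence in which variables enter the model
  plot(fit)     # the piecewise-linear solution paths
}
```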
In the source code loaded above there are two functions for R:
And the graphical output can be obtained using this code:
Pros and Cons