NMST539, LS 2015/16
Exercise 6 (Week 7)
Application of PCA and Introduction to LASSO

Linear Regression on PCs
There is one locality not recorded in the chemical dataset. Therefore, we need to synchronize both datasets before performing any analysis.
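A minimal sketch of such a synchronization, assuming two data frames `bio` and `chem` with a common identifier column `locality` (the column name is hypothetical):

```r
## Keep only the localities present in both datasets;
## 'bio', 'chem' and the column 'locality' are assumed names.
common <- intersect(bio$locality, chem$locality)
bio  <- bio[bio$locality %in% common, ]
chem <- chem[chem$locality %in% common, ]
dim(bio); dim(chem)   # both datasets now cover the same localities
```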
To recall the correlation structure in the data, we can visualize the correlation matrix using the following piece of R code (with three covariates from the chemical dataset added at the beginning):
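One possible sketch, assuming the `corrplot` package and that the first three columns of `chem` are the chemical covariates of interest:

```r
## Visualize the joint correlation structure of the chemical
## covariates and the bio metrics; column selection is an assumption.
library(corrplot)
dataMat <- cbind(chem[, 1:3], bio)
corrplot(cor(dataMat), method = "ellipse")
```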
Reminder
Instead of using the bio metrics directly, we rather apply principal component analysis and try to explain the dependent variable using some (not many) principal components. The principal components are obtained by
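In R the standard tool is `prcomp()`, which computes the components via the SVD of the centered (and here also scaled) data matrix:

```r
## PCA of the bio metrics; 'bio' is the assumed name of the data frame.
pca <- prcomp(bio, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each component
head(pca$x)    # the principal component scores
```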
And compare the following:
or with the results when applying EVD based approach:
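The EVD-based computation can be sketched as follows; the scores agree with those from `prcomp()` up to the signs of the eigenvectors:

```r
## PCs via the eigendecomposition of the correlation matrix;
## 'bio' is the assumed data frame of bio metrics.
X   <- scale(bio)            # center and standardize
evd <- eigen(cov(X))
evd$values                   # variances of the individual PCs
scores <- X %*% evd$vectors  # PC scores (equal to pca$x up to sign)
```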
The principal components are mutually orthogonal (so this approach removes any multicollinearity) and are ordered by decreasing variability. We can directly use the principal components to fit a classical linear regression model. Logically, there is no sense in using, e.g., the second PC while omitting the first one (in general, the hierarchy should be such that all lower-order PCs are automatically included in the model). With respect to interpretability, there is also no sense in considering too many PCs in the model. Let us, for example, model the amount of phosphorus using the first principal component.
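A sketch of such a model; `chem$Phosphor` is an assumed name for the phosphorus column in the chemical dataset:

```r
## Regress phosphorus on the first principal component of the bio metrics.
pc1 <- prcomp(bio, scale. = TRUE)$x[, 1]
m1  <- lm(chem$Phosphor ~ pc1)   # column name is an assumption
summary(m1)
```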
Or we can also improve the overall model performance by adding the second principal component
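Keeping the hierarchy (the first PC stays in the model), this can be sketched as:

```r
## Add the second PC on top of the first one;
## 'chem$Phosphor' is an assumed column name.
pcs <- prcomp(bio, scale. = TRUE)$x
m1  <- lm(chem$Phosphor ~ pcs[, 1])
m2  <- lm(chem$Phosphor ~ pcs[, 1] + pcs[, 2])
anova(m1, m2)   # does the second PC significantly improve the fit?
```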
Pros and Cons
Dimensionality Reduction & Variable Selection via LASSO

Another approach to selecting significant variables is introduced by regularized regression based on the LASSO. LASSO stands for the Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \] with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization defined as \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \] where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. Obviously, the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above; thus, the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. In fact, as \(\lambda \to \infty\), most elements of the vector of regression parameters are set exactly to zero. Such a solution is called a sparse solution (the vector of parameter estimates is sparse: it contains only a few non-zero elements). On the other hand, for \(\lambda \to 0\) the first term in the minimization problem is dominant, and for \(\lambda = 0\) the whole problem reduces to classical least squares regression. The main advantage of the LASSO regularized regression approach is that it can also handle heavily overparametrized models (situations where \(p \gg n\)).
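A minimal sketch of a LASSO fit using the `glmnet` package (the argument `alpha = 1` selects the pure \(L_{1}\) penalty); `bio` as the design matrix and `chem$Phosphor` as the response are assumptions:

```r
## LASSO regression of phosphorus on the bio metrics.
library(glmnet)
X <- as.matrix(bio)
y <- chem$Phosphor                 # assumed response column
fit <- glmnet(X, y, alpha = 1)     # whole solution path over lambda
coef(fit, s = 0.1)                 # sparse coefficients at lambda = 0.1
cv  <- cv.glmnet(X, y)             # lambda chosen by cross-validation
coef(cv, s = "lambda.min")
```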
Such problems commonly occur in genetic data, sociology surveys, econometrics and basically all other areas… Unlike the classical least squares problem, where the vector of parameter estimates is given explicitly as the solution of the normal equations, no explicit solution is generally available for the LASSO regression. But…

LARS-LASSO Algorithm

A very elegant solution to the LASSO regression fitting problem was obtained once it was realized that the solution paths are piecewise linear. Given this, one only needs to calculate the positions of the knot points; the rest of the path is just linear interpolation. Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do that very efficiently, in fact in only \(p\) linear steps! The whole algorithm is based on a geometric interpretation of the covariances between the current response estimate and the covariates not yet included in the model.
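The algorithm of Efron et al. (2004) is implemented in the `lars` package; a sketch with the assumed data objects `bio` and `chem$Phosphor`:

```r
## LARS-LASSO solution path; knot points are computed exactly,
## the path between them is linear.
library(lars)
path <- lars(as.matrix(bio), chem$Phosphor, type = "lasso")
path$beta   # coefficient values at the knot points of the path
```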
In the source code loaded above there are two functions for R:
And the graphical output can be obtained using this code:
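If the `lars` package is used instead of the custom functions, a comparable picture of the whole piecewise-linear coefficient path can be obtained directly (data objects are again assumptions):

```r
## Plot the LASSO coefficient paths; vertical lines mark the knot points.
library(lars)
path <- lars(as.matrix(bio), chem$Phosphor, type = "lasso")
plot(path)
```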
Pros and Cons
Homework (optional)

On the webpage of Doc. Hlávka, http://www1.karlin.mff.cuni.cz/~hlavka/teac.html, several data files are available; they can be loaded into R using the command
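A generic sketch; the file name below is purely illustrative, substitute the actual file from the page:

```r
## Data files can typically be read directly from the URL;
## "somefile.txt" is a hypothetical placeholder name.
data <- read.table("http://www1.karlin.mff.cuni.cz/~hlavka/somefile.txt",
                   header = TRUE)
head(data)
```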