NMST539 | Lab Session 6

Application of PCA and Introduction to LASSO

LS 2017 | Tuesday 11/04/2017

Rmd file (UTF8 encoding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

Linear Regression on PCs

We will again start with the data sample representing 65 river localities in the Czech Republic. For each locality there are 17 biological metrics recorded (the status of the biological life within each locality expressed in terms of some well-defined and internationally recognized metrics, i.e., indices). In addition to the previous lab session we also consider a second data set with chemical measurements at the same localities (concentrations of 7 different chemical substances).

The data files can be obtained from the following addresses:

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

There is one locality not recorded in the chemical dataset. Therefore, we need to synchronize both datasets before any analysis. This can be done with the following piece of R code:

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(chemData[,c(1,6,7,8)], bioData[ind,-c(1,11,12,14)])
attach(data)

To recall the correlation structure in the data we can visualize the correlation matrix using the library ‘corrplot’ and the R command corrplot() (with three covariates from the chemical dataset added at the beginning):

library(corrplot)
PCdata <- data[,-1]
corrplot(cor(PCdata), method="ellipse")



Reminder


  • There are two standard commands for obtaining principal components in R: prcomp() and princomp(). Both are available in the standard R installation (library ‘stats’). The first one uses the singular value decomposition (SVD) of the data matrix, while the second one uses the spectral decomposition (eigenvalue decomposition) of the variance-covariance (correlation) matrix instead. For additional details see the help pages of both functions (type ?prcomp and ?princomp).


Instead of using the bio metrics directly, we apply principal component analysis and try to explain the dependent variable using a few (not many) principal components. The principal components are obtained by

pc1 <- prcomp(PCdata)          # PCA via the SVD of the (centered) data matrix
varValues <- (pc1$sdev)^2      # variances of the principal components
pComps <- pc1$x                # principal component scores

Now compare the following:

diag(varValues)          # diagonal matrix of the PC variances
round(var(pComps), 4)    # empirical covariance matrix of the PC scores

or with the results of the EVD-based approach (note that princomp() uses the divisor n instead of n − 1, so the variances differ slightly):

pc2 <- princomp(PCdata)        # PCA via the EVD of the covariance matrix
diag(pc2$sdev^2)               # diagonal matrix of the PC variances
pc2$scores                     # principal component scores

The principal components are mutually orthogonal (so this approach also removes any multicollinearity) and they are ordered by decreasing variance.
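A quick sanity check of both properties, using the prcomp() fit from above (a small illustrative sketch only):

max(abs(cor(pComps)[upper.tri(cor(pComps))]))   # off-diagonal correlations of the scores are numerically zero
all(diff(pc1$sdev) <= 0)                        # standard deviations are ordered from largest to smallest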

We can use the principal components directly to fit a classical linear regression model. Logically, it makes little sense to use, e.g., the second PC while omitting the first one (in general, the hierarchy should be such that all lower-order PCs are automatically included in the model). With respect to interpretability, it also makes no sense to include too many PCs in the model.

Let us, for example, regress the total amount of phosphorus (Pcelk) on the first principal component.

summary(m1 <- lm(Pcelk ~ pComps[,1]))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.085628 -0.044760 -0.007637  0.032229  0.140544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.115148   0.006739  17.088  < 2e-16 ***
## pComps[, 1] 0.002262   0.000295   7.665 1.51e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05391 on 62 degrees of freedom
## Multiple R-squared:  0.4866, Adjusted R-squared:  0.4783 
## F-statistic: 58.76 on 1 and 62 DF,  p-value: 1.508e-10
summary(m2 <- lm(Pcelk ~ pComps[,1] + I((pComps[,1])^2)))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1] + I((pComps[, 1])^2))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.082154 -0.035624 -0.009357  0.027346  0.141047 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.008e-01  7.974e-03  12.638  < 2e-16 ***
## pComps[, 1]        2.919e-03  3.548e-04   8.226  1.8e-11 ***
## I((pComps[, 1])^2) 2.756e-05  9.251e-06   2.980  0.00414 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05078 on 61 degrees of freedom
## Multiple R-squared:  0.5518, Adjusted R-squared:  0.5371 
## F-statistic: 37.55 on 2 and 61 DF,  p-value: 2.344e-11

We can even perform a statistical test to decide whether the first principal component explains some specific proportion of the overall variability. For a random variable \(U\) with the Wishart distribution \(m^{-1} W_{p}(\Sigma, m)\) it holds asymptotically (as \(m \to \infty\)) that \[ \sqrt{m}(\ell - \lambda) \sim N_{p}(\boldsymbol{0}, 2 \Lambda^2), \] where \(\lambda = (\lambda_{1}, \dots, \lambda_{p})^\top\) and \(\ell = (\ell_1, \dots, \ell_{p})^\top\) are the corresponding eigenvalues from the eigenvalue decompositions \(\Sigma = \Gamma \Lambda \Gamma^\top\) and \(U = G L G^\top\).

The proportion of variability explained by the first principal component (the theoretical quantity \(\lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\)) is estimated by \(\ell_{1} / (\ell_{1} + \dots + \ell_{p})\). In our case, this equals the following:

(ell <- (pc2$sdev^2)[1] / sum(pc2$sdev^2))

which is indeed the proportion of explained variance reported by summary(pc2).

This is a nonlinear transformation of the vector \(\ell = (\ell_1, \dots, \ell_p)\), but it can be shown (e.g., using the delta method) that

\[ T_{1} \equiv \sqrt{n - 1} \Bigg(\frac{\ell_{1}}{\ell_{1} + \dots + \ell_{p}} - \frac{\lambda_{1}}{\lambda_{1} + \dots + \lambda_{p}}\Bigg) \mathop{\sim}^{as.} N(0, \omega^2), \] where \[ \omega^2 = \frac{2 tr(\Lambda^2)}{(tr(\Lambda))^2} \Big( \psi^2 - 2 \psi\beta + \beta\Big), \]

for \(\psi = \lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\) and \(\beta = \lambda_{1}^2 / (\lambda_{1}^2 + \dots + \lambda_{p}^2)\). For testing the null hypothesis \(H_{0}: \psi = \psi_{0}\), the test statistic \(T_{1}(\psi_{0})/\omega\) follows under the null hypothesis the standard normal distribution \(N(0,1)\).

For instance, we can test the null hypothesis that the proportion of variability explained by the first principal component equals \(85~\%\).
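A minimal sketch of such a test follows, plugging the sample eigenvalues into \(\psi\), \(\beta\), and \(\omega^2\) (the object names below are illustrative and not part of the lab code):

n       <- nrow(PCdata)                               # number of localities used
l       <- pc2$sdev^2                                 # sample eigenvalues
psiHat  <- l[1] / sum(l)                              # estimated proportion of explained variability
betaHat <- l[1]^2 / sum(l^2)
omega2  <- 2 * sum(l^2) / (sum(l))^2 * (psiHat^2 - 2 * psiHat * betaHat + betaHat)
T1       <- sqrt(n - 1) * (psiHat - 0.85)             # T1(psi0) from the formula above, psi0 = 0.85
testStat <- T1 / sqrt(omega2)                         # asymptotically N(0,1) under the null hypothesis
2 * pnorm(-abs(testStat))                             # two-sided p-value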

Improving the model

Alternatively, we can try to improve the overall model performance by adding the second principal component:

summary(m3 <- lm(Pcelk ~ pComps[,1] + pComps[,2]))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1] + pComps[, 2])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08134 -0.03734 -0.01089  0.03179  0.14616 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1151484  0.0065001  17.715  < 2e-16 ***
## pComps[, 1]  0.0022616  0.0002846   7.947 5.43e-11 ***
## pComps[, 2] -0.0019035  0.0008018  -2.374   0.0208 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.052 on 61 degrees of freedom
## Multiple R-squared:   0.53,  Adjusted R-squared:  0.5146 
## F-statistic: 34.39 on 2 and 61 DF,  p-value: 9.98e-11
plot(Pcelk ~ pComps[,1], pch = 21, bg = "yellow", main = "", xlab = "First Principal Component", ylab = "Total Phosphorus")
xseq <- seq(min(pComps[,1]), max(pComps[,1]), length = 200)                          # grid over the first PC
lines(m2$coeff[1] + m2$coeff[2] * xseq + m2$coeff[3] * xseq^2 ~ xseq, col = "red")   # quadratic fit m2
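Whether the second principal component brings a significant improvement over the model m1 can also be checked with a partial F-test for the two nested models fitted above (a brief additional check):

anova(m1, m3)   # F-test comparing the models with one and two principal components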

par(mfrow = c(1,4))
biplot(pc1, scale = 0.5, choices = 1:2, cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-1,1), ylim = c(-0.5, 0.5), cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-0.5,0.5), ylim = c(-0.5, 0.5), cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-0.2,0.2), ylim = c(-0.2, 0.2), cex = 0.8, xlabs = rep(".", nrow(PCdata)))

Pros and Cons


  • The main advantage of using principal components for regression is (typically) an enormous reduction of dimensionality.
  • In addition, one also gets rid of any existing multicollinearity issue (if there is one).
  • The main disadvantage of any (linear) regression approach based on principal components is the rather challenging interpretation of the final model.


Dimensionality Reduction & Variable Selection via LASSO

Another approach to selecting relevant variables is regularized regression based on the LASSO. LASSO stands for Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing

\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \]

with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) (for \(p < n\)), one considers an alternative minimization problem defined as

\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \] where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. Obviously, the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above becomes, and thus the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. For large values of \(\lambda\) (as \(\lambda \to \infty\)) most elements of the vector of regression parameters are set exactly to zero. Such a solution is called sparse (the vector of parameter estimates contains only a few non-zero elements).

On the other hand, for \(\lambda \to 0\) the first term in the minimization problem is dominant and thus, for \(\lambda = 0\) the whole problem reduces to a classical least squares regression.

The main advantage of the LASSO regularized regression approach is that it also allows for heavily overparametrized models (situations where \(p \gg n\)). Such problems commonly occur in genetic data, sociological surveys, econometrics, and many other areas.
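For illustration, here is a minimal sketch of a LASSO fit using the ‘glmnet’ package (glmnet is not part of this lab's materials and is assumed to be installed; the response and the covariates mirror the choice used later in this session):

library(glmnet)
X <- as.matrix(data[, 5:18])                    # bio metrics as covariates
Y <- Pcelk                                      # total phosphorus as the response
lassoFit <- glmnet(X, Y, alpha = 1)             # alpha = 1 corresponds to the LASSO penalty
plot(lassoFit, xvar = "lambda", label = TRUE)   # coefficient paths as functions of log(lambda)
coef(lassoFit, s = 0.01)                        # sparse coefficient vector for lambda = 0.01 (illustrative value)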

Unlike the classical least squares problem, where the vector of parameter estimates is given as the solution of the normal equations and is thus explicit, no explicit solution is generally available for the LASSO regression. But…

LARS-LASSO Algorithm

A very elegant solution to the LASSO fitting problem was found once it was realized that the solution paths (the estimates as functions of the regularization parameter \(\lambda\)) are piecewise linear. Given this, one only needs to calculate the positions of the knot points; the rest of the path is obtained by linear interpolation.

Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do this in a very efficient way. Actually, in only \(p\) steps!

The whole algorithm is based on a geometric interpretation of the covariances between the current response estimate and the covariates not yet included in the model.
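For reference only (not part of the lab materials), the ‘lars’ package implements this algorithm directly; a minimal sketch, assuming the package is installed:

library(lars)
larsFit <- lars(as.matrix(data[, 5:18]), Pcelk, type = "lasso")   # LARS-LASSO solution path
plot(larsFit)                                                     # piecewise-linear coefficient paths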

source("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/larsLASSO.r")

The source code loaded above provides two R functions: solutionPath() and solutionPathPlot().


  • solutionPath(X, Y, noJumps) - calculates the complete solution paths for the first ‘noJumps’ parameters entering the model. The value of ‘noJumps’ is an integer between one and the second dimension of \(X\) (the total number of predictors). The matrix \(X\) is a classical regression matrix of dimension \(n \times p\); however, even situations with \(p \gg n\) are allowed. Finally, \(\boldsymbol{Y}\) is the response vector. The result is a list with two elements: ‘betaVector’, which contains the augmented parameter estimates in its rows, and ‘corrMatrix’, which is a matrix with the current correlations between the current response estimate and the independent covariates.
  • solutionPathPlot(solutionVector, corrMatrix, xSeq) - a function which creates two plots: the complete solution path until ‘noJumps’ nonzero predictors enter the model, and a plot of the decrease of the current correlations at each step.


Y <- Pcelk                              # response: total phosphorus
X <- as.matrix(data[,5:18])             # bio metrics as the regression matrix
LARSsolution <- solutionPath(X, Y, 10)  # solution path for the first 10 predictors entering the model

betaVector <- LARSsolution[[1]]         # parameter estimates along the solution path
corrMatrix <- LARSsolution[[2]]         # current correlations at each step

And the graphical output can be obtained using this code:

par(mfrow = c(2,1))
solutionPathPlot(betaVector, corrMatrix, Y, data[,1])

Pros and Cons


  • The main advantage of the LASSO regularized regression approach is that it can handle very high-dimensional problems effectively. Indeed, with the LASSO regularized regression there is no problem solving regression problems with \(p \gg n\); however, the number of nonzero parameters in the final model is at most \(n\). Irrelevant regressors are set to zero.
  • Instead of one set of estimates one can (with essentially the same computational effort) obtain the whole path of solutions. This makes the LASSO selection and estimation a very effective tool for high-dimensional problems.
  • Probably the main disadvantage is that the final estimates are shrunk towards zero. Different alternatives have been proposed to correct for this bias. The most popular approach is to use the LASSO for variable selection only and afterwards to fit a classical linear regression on the selected covariates only (a sketch of this two-stage approach is given after this list).
  • Another disadvantage is the lack of standard inference for these models (see, e.g., post-selection inference in the literature).
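A minimal sketch of the two-stage approach mentioned above, again relying on the ‘glmnet’ package (an assumption of this sketch; the lab's own code does not use it), with the regularization parameter chosen by cross-validation:

library(glmnet)
cvFit    <- cv.glmnet(X, Y, alpha = 1)                             # lambda chosen by cross-validation
betaHat  <- drop(as.matrix(coef(cvFit, s = "lambda.min")))         # coefficients at the selected lambda
selected <- setdiff(names(betaHat)[betaHat != 0], "(Intercept)")   # covariates kept by the LASSO

summary(lm(Y ~ ., data = data.frame(Y = Y, X[, selected, drop = FALSE])))   # second stage: OLS on the selected covariates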