NMST539 | Lab Session 7

Linear Regression on PCA and LASSO

LS 2017/2018 | Monday 09/04/18

Rmd file (UTF-8 encoding)

The R software is available for download from the official website: https://www.r-project.org

A user-friendly interface (one of many): RStudio

Some useful manuals and tutorials for R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

1. Linear Regression on PCs

We will again start with the data sample representing 65 river localities in the Czech Republic. For each locality there are 17 biological metrics recorded (the status of the biological life within each locality, expressed in terms of some well-defined and internationally recognized metrics and indices). In addition to the previous lab session we also consider a second data set with chemical measurements at the same localities (concentrations of 7 different chemical substances).

The data files can be obtained using the following R commands:

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

There is one locality that is not recorded in the chemical dataset. Therefore, we need to synchronize both datasets before running any analysis. This can be done with the following R code:

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(chemData[,c(1,6,7,8)], bioData[ind,-c(1,11,12,14)])
attach(data)

To recall the correlation structure in the data we can again visualize the correlation matrix using the library ‘corrplot’ and the R command corrplot() (with three covariates from the chemical dataset added at the beginning):

library(corrplot)
PCdata <- data[,-1]
corrplot(cor(PCdata), method="ellipse")



Reminder


  • There are two standard commands for obtaining the principal components in R: prcomp() and princomp(). Both are available under the standard R installation (library ‘stats’). The first one uses the singular value decomposition (SVD) of the (centered) data matrix, while the second one uses the spectral decomposition (eigenvalue decomposition, EVD) of the variance-covariance (or correlation) matrix. For additional details see the help pages of both functions (type ?prcomp and ?princomp); a short numerical check of the SVD/EVD equivalence is given after this list.
  • Remember that both the variance-covariance matrix and the correlation matrix can be used to obtain the principal components. In general, the results are not the same – especially when the scales of the covariates in the data differ. We can easily compare the results of the principal component analysis run on the covariance and on the correlation matrix. For this purpose we consider the chemical measurements for the river localities in the Czech Republic.

    PCAcov <- prcomp(chemData[,-c(1,11,12,14)], scale = F)
    PCAcor <- prcomp(chemData[,-c(1,11,12,14)], scale = T)
    
    par(mfrow = c(1,2))
    biplot(PCAcov)
    biplot(PCAcor)

    par(mfrow = c(1,2))
    biplot(PCAcov, xlim = c(-0.1, 0.1), ylim = c(-0.1, 0.1))
    biplot(PCAcor)
  • In both cases it seems that the most dominant covariate is the conductivity (Kond). The second principal component is largely identified with the oxygen covariate (X.O2).
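A short numerical check of the SVD/EVD equivalence mentioned above (a sketch, reusing the covariance-based fit PCAcov; the variances of the principal components returned by prcomp() should coincide with the eigenvalues of the sample covariance matrix):

# PC variances from the SVD-based prcomp() fit versus the eigenvalues of the
# sample covariance matrix (EVD); all.equal() should return TRUE
all.equal(PCAcov$sdev^2, eigen(cov(chemData[,-c(1,11,12,14)]))$values)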



Instead of using the biological metrics directly, we rather apply the principal component analysis and try to explain the dependent variable using a few (not many) principal components. The principal components are obtained by

pc1 <- prcomp(PCdata)
varValues <- (pc1$sdev)^2
pComps <- pc1$x

Now compare the following:

diag(varValues)
round(var(pComps),4)

with the results obtained when applying the EVD-based approach:

pc2 <- princomp(PCdata)
diag(pc2$sdev^2)
pc2$scores
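Note (a small check, assuming the objects pc1 and pc2 from above): princomp() estimates the covariance matrix with the divisor n, whereas prcomp() uses n - 1, so the two sets of PC variances differ exactly by the factor n/(n - 1):

n <- nrow(PCdata)
# after rescaling by n/(n - 1) the two sets of variances coincide (returns TRUE)
all.equal(pc1$sdev^2, unname(pc2$sdev^2) * n / (n - 1))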

The principal components are mutually orthogonal (in most cases we want to get rid of any multicollinearity in the model, which is now achieved by applying the PCA approach) and they are ordered by decreasing variance.

We can directly use the principal components to fit a classical linear regression model. Logically, it makes little sense to use, e.g., the second PC while omitting the first one (in general, the hierarchy should be such that all lower-order PCs are automatically included in the model). With respect to interpretability, it also makes little sense to include too many PCs in the model.

Let us, for instance, model the total amount of phosphorus (Pcelk) using the first principal component.

summary(m1 <- lm(Pcelk ~ pComps[,1]))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.085628 -0.044760 -0.007637  0.032229  0.140544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.115148   0.006739  17.088  < 2e-16 ***
## pComps[, 1] 0.002262   0.000295   7.665 1.51e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05391 on 62 degrees of freedom
## Multiple R-squared:  0.4866, Adjusted R-squared:  0.4783 
## F-statistic: 58.76 on 1 and 62 DF,  p-value: 1.508e-10
summary(m2 <- lm(Pcelk ~ pComps[,1] + I((pComps[,1])^2)))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1] + I((pComps[, 1])^2))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.082154 -0.035624 -0.009357  0.027346  0.141047 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.008e-01  7.974e-03  12.638  < 2e-16 ***
## pComps[, 1]        2.919e-03  3.548e-04   8.226  1.8e-11 ***
## I((pComps[, 1])^2) 2.756e-05  9.251e-06   2.980  0.00414 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05078 on 61 degrees of freedom
## Multiple R-squared:  0.5518, Adjusted R-squared:  0.5371 
## F-statistic: 37.55 on 2 and 61 DF,  p-value: 2.344e-11

We can even perform a statistical test to decide whether the amount of explained variability brought by the first principal component is statistically relevant or not. For a random variable \(U\) with the Wishart distribution \(m^{-1} W_{p}(\Sigma, m)\) it holds, asymptotically as \(m \to \infty\), that \[ \sqrt{m}(\ell - \lambda) \sim N_{p}(\boldsymbol{0}, 2 \Lambda^2), \] where \(\lambda = (\lambda_{1}, \dots, \lambda_{p})^\top\) and \(\ell = (\ell_1, \dots, \ell_{p})^\top\) are the corresponding eigenvalues from the eigenvalue decompositions \(\Sigma = \Gamma \Lambda \Gamma^\top\) and \(U = G L G^\top\).

The proportion of the variability explained by the first principal component (the theoretical quantity \(\lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\)) is estimated by \(\ell_{1} / (\ell_{1} + \dots + \ell_{p})\). In our case, this equals the following:

print(ell <- (pc2$sdev^2)[1] / sum(pc2$sdev^2))
##    Comp.1 
## 0.8207071

This is a nonlinear transformation of the vector \(\ell = (\ell_1, \dots, \ell_p)^\top\), but it can be shown that

\[ T_{1} \equiv \sqrt{n - 1} \Bigg(\frac{\ell_{1}}{\ell_{1} + \dots + \ell_{p}} - \frac{\lambda_{1}}{\lambda_{1} + \dots + \lambda_{p}}\Bigg) \mathop{\sim}^{as.} N(0, \omega^2), \] where \[ \omega^2 = \frac{2 tr(\Sigma^2)}{(tr(\Sigma))^2} \Big( \psi^2 - 2 \psi\beta + \beta\Big), \]

for \(\psi = \lambda_{1} / (\lambda_{1} + \dots + \lambda_{p})\) and \(\beta = \lambda_{1}^2 / (\lambda_{1}^2 + \dots + \lambda_{p}^2)\). For testing the null hypothesis \(H_{0}: \psi = \psi_{0}\), the test statistic \(T_{1}(\psi_{0})/\omega\) asymptotically follows, under the null hypothesis, the standard normal distribution \(N(0,1)\); in practice, \(\omega\) is estimated by plugging the empirical eigenvalues into the expressions above.

For instance, we can test the null hypothesis that the proportion of the variability explained by the first principal component equals \(85~\%\).
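A minimal sketch of this test in R (assuming the object pc2 from above; the empirical eigenvalues are plugged into \(\psi\), \(\beta\) and \(\omega^2\)):

ell  <- unname(pc2$sdev^2)        # empirical eigenvalues (covariance-based PCA)
n    <- nrow(PCdata)              # sample size
psi0 <- 0.85                      # hypothesized proportion of explained variability

psiHat  <- ell[1] / sum(ell)      # estimated proportion explained by the first PC
betaHat <- ell[1]^2 / sum(ell^2)
omega2  <- 2 * sum(ell^2) / (sum(ell))^2 * (psiHat^2 - 2 * psiHat * betaHat + betaHat)

stat <- sqrt(n - 1) * (psiHat - psi0) / sqrt(omega2)   # test statistic
2 * (1 - pnorm(abs(stat)))                             # asymptotic two-sided p-value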

  • This can be further generalized to the proportion of variability explained by the first \(q\) principal components. The only difference is in the definition of \(\psi\) and \(\beta\), where we would have \(\psi_q = (\lambda_{1} + \dots + \lambda_{q}) / (\lambda_{1} + \dots + \lambda_{p})\) and \(\beta_q = (\lambda_{1}^2 + \dots + \lambda_{q}^2) / (\lambda_{1}^2 + \dots + \lambda_{p}^2)\).
  • Such a testing procedure can be used effectively to decide whether a given number of principal components is enough to cover the minimum proportion of variability that needs to be explained.



Improving the model

Alternatively, we can try to improve the overall model performance by adding the second principal component:

summary(m3 <- lm(Pcelk ~ pComps[,1] + pComps[,2]))
## 
## Call:
## lm(formula = Pcelk ~ pComps[, 1] + pComps[, 2])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08134 -0.03734 -0.01089  0.03179  0.14616 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1151484  0.0065001  17.715  < 2e-16 ***
## pComps[, 1]  0.0022616  0.0002846   7.947 5.43e-11 ***
## pComps[, 2] -0.0019035  0.0008018  -2.374   0.0208 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.052 on 61 degrees of freedom
## Multiple R-squared:   0.53,  Adjusted R-squared:  0.5146 
## F-statistic: 34.39 on 2 and 61 DF,  p-value: 9.98e-11
plot(Pcelk ~ pComps[,1], pch = 21, bg = "yellow", main = "", xlab = "First Principal Component", ylab = "Total Phosphor")
xseq <- seq(min(pComps[,1]), max(pComps[,1]), length = 200)
lines(m2$coeff[1] + m2$coeff[2] * xseq + m2$coeff[3] * xseq^2 ~ xseq, col = "red")

par(mfrow = c(1,4))
biplot(pc1, scale = 0.5, choices = 1:2, cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-1,1), ylim = c(-0.5, 0.5), cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-0.5,0.5), ylim = c(-0.5, 0.5), cex = 0.8, xlabs = rep(".", nrow(PCdata)))
biplot(pc1, scale = 0.5, choices = 1:2, xlim = c(-0.2,0.2), ylim = c(-0.2, 0.2), cex = 0.8, xlabs = rep(".", nrow(PCdata)))

Pros and Cons


  • The main advantage of using the principal components for the regression modelling is (typically) an enormous reduction of dimensionality.
  • In addition, one also gets rid of a possible multicollinearity issue (if there is one).
  • The main disadvantage of any (linear) regression approach based on principal components is a very challenging interpretation of the final model. Each covariate in the model is defined as a linear combination of, typically, all original covariates.


2. Dimensionality Reduction & Variable Selection via LASSO

Another approach to a significant dimensionality reduction and, at the same time, variable selection is provided by regularized regression based on the LASSO penalty. LASSO stands for the Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing

\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \]

with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization problem defined as

\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \] where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. Clearly, the larger the value of the regularization parameter \(\lambda\), the more dominant the second term in the minimization problem above, and thus the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. For large values of \(\lambda\) (as \(\lambda \to \infty\)) most elements of the vector of regression parameters are set exactly to zero. Such a solution is called a sparse solution (the vector of parameter estimates is sparse, i.e., it contains only a few non-zero elements).

On the other hand, for \(\lambda \to 0\) the first term in the minimization problem is dominant and thus, for \(\lambda = 0\) the whole problem reduces to a classical least squares regression.
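To make the criterion concrete, here is a minimal (naive) sketch of the LASSO objective from the display above; it is for illustration only, as the dedicated implementations used below work with an equivalent, differently scaled criterion:

# the penalized least squares criterion: ||Y - X beta||_2^2 + lambda * ||beta||_1
lassoObjective <- function(beta, X, Y, lambda) {
  sum((Y - X %*% beta)^2) + lambda * sum(abs(beta))
}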

The main advantage of the LASSO regularized regression approach is that it also allows for heavily overparametrized models (situations where \(p \gg n\)). Such problems commonly occur in genetic data, sociological surveys, econometrics and many other areas.
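A small simulated sketch of the \(p \gg n\) setting (using the glmnet package which is introduced in Section 4 below; the design, coefficients and seed are chosen purely for illustration):

library(glmnet)
set.seed(539)
nSim <- 50; pSim <- 200                      # many more covariates than observations
Xsim <- matrix(rnorm(nSim * pSim), nrow = nSim, ncol = pSim)
Ysim <- Xsim[,1] - 2 * Xsim[,2] + 0.5 * Xsim[,3] + rnorm(nSim)   # only 3 relevant covariates
simFit <- glmnet(Xsim, Ysim)
plot(simFit)                                 # the whole LASSO solution path, despite p >> n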

Unlike the classical least squares problem, where the vector of the parameter estimates is given as a solution of the normal equations and is thus explicit, no explicit solution is generally available for the LASSO regression. But…

3. LARS-LASSO Algorithm

A very elegant solution to the LASSO fitting problem follows from the observation that the solution paths are piecewise linear in the regularization parameter. Given this, one only needs to calculate the knot positions; between the knots the path is obtained by linear interpolation.

Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do that in a very efficient way: actually, in roughly \(p\) steps only!

The whole algorithm is based on a geometric interpretation of the covariances (correlations) between the current response estimate and the covariates not yet included in the model.

source("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/larsLASSO.r")

The source code loaded above provides two R functions: solutionPath() and solutionPathPlot().


  • solutionPath(X, Y, noJumps) - calculates the complete solution paths for the first ‘noJumps’ parameters entering the model. The value of ‘noJumps’ is an integer between one and the second dimension of \(X\) (the total number of predictors). The matrix \(X\) stands for a classical regression matrix of the type \(n \times p\); however, even situations with \(p \gg n\) are allowed. Finally, \(\boldsymbol{Y}\) is the response vector. As a result one obtains a list with two elements: ‘betaVector’ with the augmented parameter estimates in each row, and ‘corrMatrix’, which is a matrix with the actual correlations between the current response estimate and the independent covariates.
  • solutionPathPlot(solutionVector, corrMatrix, xSeq) - a function which creates two plots: the complete solution path until ‘noJumps’ nonzero predictors enter the model, and a plot of the decrease of the actual correlations at each step.


Y <- Pcelk
X <- as.matrix(data[,5:18])
LARSsolution<- solutionPath(X, Y, 10)

betaVector <- LARSsolution[[1]]
corrMatrix <- LARSsolution[[2]]

The graphical output can be obtained using the following code:

par(mfrow = c(2,1))
solutionPathPlot(betaVector, corrMatrix, Y, data[,1])

Pros and Cons


  • The main advantage of the LASSO regularized regression approach is that it allows for very high dimensions to be handled effectively. Indeed, with the LASSO regularized regression there is no problem to solve regression problems with \(p \gg n\). However, the number of nonzero parameters at the end (in the final model) is \(n\) at most. Irrelevant regressors are set to zero.
  • Instead of one set of estimates one can (with the same time efficiency) obtain the whole path of solutions. This makes the LASSO selection and estimation a very effective tool for high dimensional problems.
  • Probably the main disadvantage is that the final estimates are shrunk towards zero. Different alternatives were proposed to correct for this bias. The most popular approach is to use the LASSO for the variable selection only and afterwards to perform a classical linear regression on the selected covariates only (see the short sketch after this list).
  • Another disadvantage is the lack of standard inference tools for these models (see, e.g., post-selection inference in the literature).
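A minimal sketch of the two-stage (post-LASSO) idea from the list above, i.e. LASSO for selection followed by ordinary least squares on the selected covariates; it assumes the objects X, Y and the data frame ‘data’ defined earlier and uses the glmnet package introduced in the next section (the choice s = 0.01 is arbitrary):

library(glmnet)
lassoFit  <- glmnet(X, Y)
betaHat   <- as.matrix(coef(lassoFit, s = 0.01))      # LASSO estimates at a fixed lambda
selected  <- setdiff(rownames(betaHat)[betaHat[,1] != 0], "(Intercept)")
postLasso <- lm(reformulate(selected, response = "Pcelk"), data = data)
summary(postLasso)                                    # classical LS fit on the selected covariates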



4. LASSO Methods in R

There are a few useful libraries in R which implement the LASSO shrinkage method and the LARS algorithm. In addition, various modifications of the LASSO estimation approach have been proposed in the literature. Even though the LASSO is a very effective variable selection tool, it still suffers from some issues and disadvantages. Most of the proposed modifications aim to improve the LASSO performance, especially with respect to the selection performance and consistency (for instance, to achieve the oracle property or sign consistency).

The most convenient package, which offers, besides the LASSO regression modeling, also the ridge regression and the elastic net algorithms, is the library glmnet (use the command install.packages("glmnet") for installation and library("glmnet") for loading it in the R environment).

library("glmnet")
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
fit1 <- glmnet(X, Y)
plot(fit1)

coef(fit1,s=0.01)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  0.0089669122
## SaprInd      0.0702733515
## Lital       -0.0014795288
## RETI         .           
## EPTAbu       .           
## Marg         .           
## Metaritr    -0.0002197004
## JepAbu       .           
## Epiritral    .           
## Hyporitral   .           
## Bindex       .           
## PosAbu       .           
## Spasaci      .           
## MDS_1        .           
## MDS_2        .
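The value s = 0.01 above is an arbitrary choice; in practice the regularization parameter is usually chosen by cross-validation, for instance via cv.glmnet() (a short sketch, assuming X and Y as defined above):

set.seed(539)                    # reproducible cross-validation folds
cvFit <- cv.glmnet(X, Y)         # 10-fold cross-validation over the lambda path
plot(cvFit)                      # CV error as a function of log(lambda)
coef(cvFit, s = "lambda.min")    # coefficients at the lambda minimizing the CV error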

Another library, which offers the LARS fitting approach for various methods (for instance, LASSO, Least Angle Regression, Forward Stagewise, or stepwise regression), is the package lars (run the command install.packages("lars") to install the library and library("lars") to load it in the R environment).

See the next examples for a brief illustration:

library("lars")

object1 <- lars(X,Y, type = "lasso")
plot(object1)

object2 <- lars(X,Y, type = "lar")
plot(object2)

object3 <- lars(X,Y, type = "forward.stagewise")
plot(object3)

object4 <- lars(X,Y, type = "stepwise")
plot(object4)
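The amount of shrinkage for the lars fits above can also be chosen by cross-validation, for instance via cv.lars() (a short sketch, assuming X and Y as above):

set.seed(539)
# 10-fold cross-validation for the LASSO path; the plot shows the cross-validated
# mean squared error against the fraction of the final L1 norm
cvLars <- cv.lars(X, Y, K = 10, type = "lasso")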