NMST539, LS 2015/16
Exercise 6 (Week 7)
Application of PCA and Introduction to LASSO

Linear Regression on PCs
There is one locality not recorded in the chemical dataset. Therefore, we need to synchronize both datasets before performing any analysis.
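A minimal sketch of such a synchronization, assuming two data frames `bio` and `chem` with a common identifier column `locality` (the column name is hypothetical):

```r
## Keep only the localities present in both datasets;
## 'bio', 'chem' and the column 'locality' are assumed names.
common <- intersect(bio$locality, chem$locality)
bio  <- bio[bio$locality %in% common, ]
chem <- chem[chem$locality %in% common, ]
dim(bio); dim(chem)   # both datasets now cover the same localities
```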
To recall the correlation structure in the data, we can visualize the correlation matrix using the following piece of R code (with three covariates from the chemical dataset added at the beginning):
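One possible sketch, assuming the `corrplot` package and that the first three columns of `chem` are the chemical covariates of interest:

```r
## Visualize the joint correlation structure of the chemical
## covariates and the bio metrics; column selection is an assumption.
library(corrplot)
dataMat <- cbind(chem[, 1:3], bio)
corrplot(cor(dataMat), method = "ellipse")
```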
Reminder
Instead of using the bio metrics directly, we rather apply principal component analysis and try to explain the dependent variable using some (not many) principal components. The principal components are obtained by
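In R the standard tool is `prcomp()`, which computes the components via the SVD of the centered (and here also scaled) data matrix:

```r
## PCA of the bio metrics; 'bio' is the assumed name of the data frame.
pca <- prcomp(bio, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each component
head(pca$x)    # the principal component scores
```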
And compare the following:
or with the results when applying EVD based approach:
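The EVD-based computation can be sketched as follows; the scores agree with those from `prcomp()` up to the signs of the eigenvectors:

```r
## PCs via the eigendecomposition of the correlation matrix;
## 'bio' is the assumed data frame of bio metrics.
X   <- scale(bio)            # center and standardize
evd <- eigen(cov(X))
evd$values                   # variances of the individual PCs
scores <- X %*% evd$vectors  # PC scores (equal to pca$x up to sign)
```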
The principal components are mutually orthogonal (so this approach removes any multicollinearity) and are ordered by decreasing variability. We can directly use the principal components to fit a classical linear regression model. Logically, there is no sense in using, e.g., the second PC while omitting the first one (in general, the hierarchy should be such that all lower-order PCs are automatically included in the model). With respect to interpretability, there is also no sense in considering too many PCs in the model. Let us, for example, model the amount of phosphorus using the first principal component.
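A sketch of such a model; `chem$Phosphor` is an assumed name for the phosphorus column in the chemical dataset:

```r
## Regress phosphorus on the first principal component of the bio metrics.
pc1 <- prcomp(bio, scale. = TRUE)$x[, 1]
m1  <- lm(chem$Phosphor ~ pc1)   # column name is an assumption
summary(m1)
```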
Or we can also improve the overall model performance by adding the second principal component
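Keeping the hierarchy (the first PC stays in the model), this can be sketched as:

```r
## Add the second PC on top of the first one;
## 'chem$Phosphor' is an assumed column name.
pcs <- prcomp(bio, scale. = TRUE)$x
m1  <- lm(chem$Phosphor ~ pcs[, 1])
m2  <- lm(chem$Phosphor ~ pcs[, 1] + pcs[, 2])
anova(m1, m2)   # does the second PC significantly improve the fit?
```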
Pros and Cons
Dimensionality Reduction & Variable Selection via LASSO

Another approach to selecting significant variables is introduced by regularized regression based on the LASSO. LASSO stands for the Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \] with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization defined as \[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \] where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. Obviously, the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above; thus, the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. In fact, as \(\lambda \to \infty\), most elements of the vector of regression parameters are set exactly to zero. Such a solution is called a sparse solution (the vector of parameter estimates is sparse: it contains only a few non-zero elements). On the other hand, for \(\lambda \to 0\) the first term in the minimization problem is dominant, and for \(\lambda = 0\) the whole problem reduces to classical least squares regression. The main advantage of the LASSO regularized regression approach is that it can also handle heavily overparametrized models (situations where \(p \gg n\)).
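A minimal sketch of a LASSO fit using the `glmnet` package (the argument `alpha = 1` selects the pure \(L_{1}\) penalty); `bio` as the design matrix and `chem$Phosphor` as the response are assumptions:

```r
## LASSO regression of phosphorus on the bio metrics.
library(glmnet)
X <- as.matrix(bio)
y <- chem$Phosphor                 # assumed response column
fit <- glmnet(X, y, alpha = 1)     # whole solution path over lambda
coef(fit, s = 0.1)                 # sparse coefficients at lambda = 0.1
cv  <- cv.glmnet(X, y)             # lambda chosen by cross-validation
coef(cv, s = "lambda.min")
```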
Such problems commonly occur in genetic data, sociology surveys, econometrics and basically all other areas… Unlike the classical least squares problem, where the vector of parameter estimates is given explicitly as the solution of the normal equations, no explicit solution is generally available for the LASSO regression. But…

LARS-LASSO Algorithm

A very elegant solution to the LASSO regression fitting problem was obtained once it was realized that the solution paths are piecewise linear. Given this, one only needs to calculate the positions of the knot points; the rest of the path is just linear interpolation. Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do that very efficiently, in fact in only \(p\) linear steps! The whole algorithm is based on a geometric interpretation of the covariances between the current response estimate and the covariates not yet included in the model.
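The algorithm of Efron et al. (2004) is implemented in the `lars` package; a sketch with the assumed data objects `bio` and `chem$Phosphor`:

```r
## LARS-LASSO solution path; knot points are computed exactly,
## the path between them is linear.
library(lars)
path <- lars(as.matrix(bio), chem$Phosphor, type = "lasso")
path$beta   # coefficient values at the knot points of the path
```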
In the source code loaded above there are two functions for R:
And the graphical output can be obtained using this code:
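If the `lars` package is used instead of the custom functions, a comparable picture of the whole piecewise-linear coefficient path can be obtained directly (data objects are again assumptions):

```r
## Plot the LASSO coefficient paths; vertical lines mark the knot points.
library(lars)
path <- lars(as.matrix(bio), chem$Phosphor, type = "lasso")
plot(path)
```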
Pros and Cons
Homework (optional)

On the webpage of Doc. Hlávka, http://www1.karlin.mff.cuni.cz/~hlavka/teac.html, several data files are available; they can be loaded into R using the command
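A generic sketch; the file name below is purely illustrative, substitute the actual file from the page:

```r
## Data files can typically be read directly from the URL;
## "somefile.txt" is a hypothetical placeholder name.
data <- read.table("http://www1.karlin.mff.cuni.cz/~hlavka/somefile.txt",
                   header = TRUE)
head(data)
```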