NMST539 | Lab Session 6

Application of PCA and Introduction to LASSO

Summer semester 2017 | Tuesday 11/04/2017

Rmd file (UTF8 encoding)

The R software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

Linear Regression on PCs

We will again start with the data sample representing 65 river localities in the Czech Republic. For each locality, 17 biological metrics are recorded (the status of biological life at the locality, expressed in terms of well-defined and internationally recognized metrics, or indexes). In contrast to the previous lab session, we also consider a second dataset of chemical measurements taken at the same localities (concentrations of 7 different chemical substances).

The two data files can be loaded directly from the following addresses:

rm(list = ls())  # clear the workspace
# header = TRUE: the first row of each file holds the column names
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = TRUE)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = TRUE)

One locality is not recorded in the chemical dataset. Therefore, we need to synchronize the two datasets before any analysis, so that each row refers to the same locality in both. This can be done with the following R code:

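# position in bioData of each locality listed in chemData (first column is the matching key)
ind <- match(chemData[,1], bioData[,1])
# key and three chemical covariates, followed by the selected biological metrics in matching row order
data <- data.frame(chemData[,c(1,6,7,8)], bioData[ind,-c(1,11,12,14)])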
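To see what match() does in the synchronization step, here is a minimal sketch on two toy data frames. The names bioToy, chemToy, indToy and merged, and all the values, are made up for illustration and are not part of the real data.

```r
# Toy stand-ins for the two datasets (hypothetical values, not the real data)
bioToy  <- data.frame(id = c("A", "B", "C", "D"), metric = c(1.2, 3.4, 5.6, 7.8))
chemToy <- data.frame(id = c("C", "A", "D"), conc = c(10, 20, 30))

# For every row of chemToy, find the position of the same id in bioToy
indToy <- match(chemToy$id, bioToy$id)
indToy

# Reorder bioToy by these positions so that both tables line up row by row;
# locality "B", which is missing from chemToy, is dropped
merged <- data.frame(chemToy, metric = bioToy$metric[indToy])
merged

# Sanity check: every chemical locality has a biological counterpart
stopifnot(!anyNA(indToy))
```

If some identifier in the first argument had no counterpart in the second, match() would return NA at that position, so a check such as anyNA(ind) on the real data is a cheap safeguard.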

To recall the correlation structure in the data, we can visualize the correlation matrix using the library ‘corrplot’ and the R command corrplot(). The three covariates from the chemical dataset come first, followed by the biological metrics:

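library(corrplot)  # install.packages("corrplot") if the package is not installed
PCdata <- data[,-1]  # drop the first column (the key used for matching)
corrplot(cor(PCdata), method="ellipse")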
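Besides the plot, it can help to list the strongly correlated pairs numerically. The following is only a sketch on simulated data: the data frame toy, the threshold 0.8 and the other names are illustrative assumptions. The same steps could be tried with cor(PCdata) in place of cor(toy).

```r
# Simulated data (hypothetical): x2 is almost a copy of x1, x3 is independent noise
set.seed(1)
x1  <- rnorm(50)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(50, sd = 0.1), x3 = rnorm(50))

R <- cor(toy)

# Keep only the lower triangle so that each pair of variables appears once
R[upper.tri(R, diag = TRUE)] <- NA

# Row/column positions of the pairs whose absolute correlation exceeds 0.8
# (which() treats the NA entries as FALSE)
pairs <- which(abs(R) > 0.8, arr.ind = TRUE)

strong <- data.frame(var1 = rownames(R)[pairs[, "row"]],
                     var2 = colnames(R)[pairs[, "col"]],
                     r    = R[pairs])
strong
```

Groups of highly correlated covariates like these are what makes a regression on a few principal components attractive compared with a regression on the original variables.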