NMST539 | Lab Session 5

Principal Component Analysis

(application in R)

LS 2017 | Tuesday 04/04/2017

Rmd file (UTF8 encoding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
  • Komárek, A.: Základy práce s R. (PDF súbor)
  • Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

The principal component analysis (PCA) - is a multivariate exploratory method based on the variance-covariance matrix of some random vector \(\boldsymbol{X} \in \mathbb{R}^{p}\). Of course, instead of working with the vector itself and the theoretical variance-convariance matrix we use a random sample drawn from the same distribution and the sample verion of the varinance covariance matrix.

The main idea is to consider \(\ell \leq p\) principal components instead of the whole \(p \in \mathbb{N}\) dimensional random vector (dimensionality reduction). The principel components are orthogonal linear combinations of the original covariates, and, in addtion, they are ordered with respect to a decreasing variance. In most cases \(\ell \ll p\) (a significant reduction of dimensionality).



Biometrics on some river localities (Czech Republic)

We will start with a data sample which represents 65 river localities in the Czech Republic. For each locality there are 17 biological metric recordid (the status of the biological life within each locality expressed in terms of some well defined and internationally recognized metrics - indexes).

The data file can be obtained from the following address:

data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
attach(data)

One can use a fancy way to visualize the correlation structure within the data (using the library ‘corrplot’ which needs to be installed by install.packages("corrplot") firstly).

library(corrplot)
PCdata <- data[,-c(1,11,12,14)]
corrplot(cor(PCdata), method="ellipse")

Another nice alternative can be obtained using the qgraph() command from the library ‘qgraph’ (again, the library needs to be installed by the command install.packages(qgraph)). However, it does not work with the most recent R version (v.3.2.4).



Two R functions: prcomp() vs. princomp()

There are various libraries PCA available for the R environment: standard commands are prcomp() and princomp(), which are both available under the classical R installation (library ‘stats’). While the first one is using the singular value decomposition of the data matrix (SVD) the second one uses the spectral decomposition of the variance-covariance (correlation) matrix instead. For additional details see the help sessions for both functions (type ?prcomp() and ?princomp()).

We can compare two results obtained by the two commands mentioned above:

PCdata <- data[,-c(1,11,12,14)]
summary(prcomp(PCdata))
## Importance of components:
##                            PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     23.0661 8.1033 4.50909 3.42042 2.17783 1.82627
## Proportion of Variance  0.8273 0.1021 0.03162 0.01819 0.00738 0.00519
## Cumulative Proportion   0.8273 0.9294 0.96104 0.97923 0.98661 0.99180
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.65944 1.37629 0.60761 0.42677 0.22385 0.13114
## Proportion of Variance 0.00428 0.00295 0.00057 0.00028 0.00008 0.00003
## Cumulative Proportion  0.99608 0.99902 0.99960 0.99988 0.99996 0.99998
##                           PC13    PC14
## Standard deviation     0.09806 0.02037
## Proportion of Variance 0.00001 0.00000
## Cumulative Proportion  1.00000 1.00000
summary(princomp(PCdata))
## Importance of components:
##                            Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     22.8879807 8.0407225 4.4742688 3.39400594
## Proportion of Variance  0.8273203 0.1021054 0.0316157 0.01819215
## Cumulative Proportion   0.8273203 0.9294257 0.9610414 0.97923357
##                             Comp.5      Comp.6      Comp.7     Comp.8
## Standard deviation     2.161016463 1.812162414 1.646620645 1.36566514
## Proportion of Variance 0.007375218 0.005186244 0.004281992 0.00294542
## Cumulative Proportion  0.986608787 0.991795031 0.996077023 0.99902244
##                             Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     0.602917613 0.4234770324 2.221225e-01 1.301262e-01
## Proportion of Variance 0.000574083 0.0002832164 7.791911e-05 2.674164e-05
## Cumulative Proportion  0.999596525 0.9998797419 9.999577e-01 9.999844e-01
##                             Comp.13      Comp.14
## Standard deviation     9.730192e-02 2.021332e-02
## Proportion of Variance 1.495208e-05 6.452592e-07
## Cumulative Proportion  9.999994e-01 1.000000e+00

The results are, however, not invariant with respect to transformations.

PCdata <- scale(PCdata, center = TRUE, scale = TRUE)
summary(prcomp(PCdata))
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.9922 1.3774 1.2203 0.83936 0.57503 0.38570
## Proportion of Variance 0.6395 0.1355 0.1064 0.05032 0.02362 0.01063
## Cumulative Proportion  0.6395 0.7750 0.8814 0.93170 0.95532 0.96594
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.37509 0.31946 0.25953 0.24952 0.21451 0.19375
## Proportion of Variance 0.01005 0.00729 0.00481 0.00445 0.00329 0.00268
## Cumulative Proportion  0.97599 0.98328 0.98809 0.99254 0.99583 0.99851
##                           PC13    PC14
## Standard deviation     0.11730 0.08427
## Proportion of Variance 0.00098 0.00051
## Cumulative Proportion  0.99949 1.00000
summary(princomp(PCdata))
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.9690582 1.3667152 1.2108763 0.83287359 0.57059435
## Proportion of Variance 0.6395033 0.1355069 0.1063665 0.05032265 0.02361893
## Cumulative Proportion  0.6395033 0.7750102 0.8813767 0.93169932 0.95531825
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.38272054 0.37219820 0.316989404 0.257526895
## Proportion of Variance 0.01062598 0.01004972 0.007289451 0.004811168
## Cumulative Proportion  0.96594423 0.97599395 0.983283402 0.988094570
##                           Comp.10     Comp.11     Comp.12      Comp.13
## Standard deviation     0.24759625 0.212856261 0.192254476 0.1163912359
## Proportion of Variance 0.00444727 0.003286837 0.002681379 0.0009827565
## Cumulative Proportion  0.99254184 0.995828677 0.998510056 0.9994928124
##                             Comp.14
## Standard deviation     0.0836145071
## Proportion of Variance 0.0005071876
## Cumulative Proportion  1.0000000000

Compare the standard deviation values with the square roots of the eigen values of the sample variance covariance matrix.

sqrt(eigen(var(PCdata))$values)
##  [1] 2.99216408 1.37735123 1.22029957 0.83935520 0.57503484 0.38569896
##  [7] 0.37509473 0.31945628 0.25953103 0.24952310 0.21451275 0.19375064
## [13] 0.11729702 0.08426521



Try by yourself


A simple (visual example) on how the PC analysis actually works can be obtained from the next illustration.
  • Firstly, install the library library("jpeg") and load it by

    library("jpeg")
  • Download a sample immage from the following address: http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/foto.jpg and save it to your working environment;
  • Load the picture into the R software by using the followinng command (with the right working directory):

    foto <- readJPEG("foto.jpg")
  • Decompose the picture into RGB scale and apply the PC analysis to all three:

    r <- foto[,,1]
    g <- foto[,,2]
    b <- foto[,,3]
    
    foto.r.pca <- prcomp(r, center = FALSE)
    foto.g.pca <- prcomp(g, center = FALSE)
    foto.b.pca <- prcomp(b, center = FALSE)
    
    rgb.pca <- list(foto.r.pca, foto.g.pca, foto.b.pca)
  • Now, we will reconstruct the original picture using 3, 10, 20, 50, 100 and 500 components (out of 682 covariates together).

    for (i in c(3,10,20,50,100,500)) {
      pca.img <- sapply(rgb.pca, function(j) {
        compressed.img <- j$x[,1:i] %*% t(j$rotation[,1:i])
      }, simplify = 'array')
      
      writeJPEG(pca.img, paste("foto_", i, "_components.jpg", sep = ""))
    }
  • Compare the compressed images with the original photo and see how much dimensionality reduction (loss of information in each compressed fiture respectively) is present in each figure;