NMST539, LS 2015/16

Cvičenie 5 (týždeň 6)

Principal Components

(Principal Component Analysis (PCA) in R )


PCA - a multivariate method based on the variance-covariance matrix of some random vector \(\boldsymbol{X} \in \mathbb{R}^{p}\). Instead of considering the \(p\)-dimensional random vector \(\boldsymbol{X}\) one only considers \(\ell \leq p\) principal components which are orthogonal and in addtion ordered with respect to a decreasing variance. In most cases \(\ell \ll p\) (a significant reduction of dimensionality).

Biometrics on some river localities (Czech Republic)

data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
attach(data)

One can use a fancy way to visualize the correlation structure within the data (using the library ‘corrplot’ which needs to be installed by install.packages("corrplot") firstly).

library(corrplot)
PCdata <- data[,-c(1,11,12,14)]
corrplot(cor(PCdata), method="ellipse")

Another nice alternative can be obtained using the qgraph() command from the library ‘qgraph’ (again, the library needs to be installed by the command install.packages(qgraph)). However, it does not work with the most recent R version (v.3.2.4).



Two R functions: prcomp() vs. princomp()

There are many libraries available for the R environment. Standard commands are prcomp() and princomp() which are both available under the classical R installation (library ‘stats’). While the first one is using the singular value decomposition of the data matrix (SVD) the second one uses the spectral decomposition of the variance-covariance (correlation) matrix instead. For additional details see the help sessions for both functions (type ?prcomp() and ?princomp()).

we can compare two results obtained by the two commands mentioned above:

PCdata <- data[,-c(1,11,12,14)]
summary(prcomp(PCdata))
## Importance of components:
##                            PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     23.0661 8.1033 4.50909 3.42042 2.17783 1.82627
## Proportion of Variance  0.8273 0.1021 0.03162 0.01819 0.00738 0.00519
## Cumulative Proportion   0.8273 0.9294 0.96104 0.97923 0.98661 0.99180
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.65944 1.37629 0.60761 0.42677 0.22385 0.13114
## Proportion of Variance 0.00428 0.00295 0.00057 0.00028 0.00008 0.00003
## Cumulative Proportion  0.99608 0.99902 0.99960 0.99988 0.99996 0.99998
##                           PC13    PC14
## Standard deviation     0.09806 0.02037
## Proportion of Variance 0.00001 0.00000
## Cumulative Proportion  1.00000 1.00000
summary(princomp(PCdata))
## Importance of components:
##                            Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     22.8879807 8.0407225 4.4742688 3.39400594
## Proportion of Variance  0.8273203 0.1021054 0.0316157 0.01819215
## Cumulative Proportion   0.8273203 0.9294257 0.9610414 0.97923357
##                             Comp.5      Comp.6      Comp.7     Comp.8
## Standard deviation     2.161016463 1.812162414 1.646620645 1.36566514
## Proportion of Variance 0.007375218 0.005186244 0.004281992 0.00294542
## Cumulative Proportion  0.986608787 0.991795031 0.996077023 0.99902244
##                             Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     0.602917613 0.4234770324 2.221225e-01 1.301262e-01
## Proportion of Variance 0.000574083 0.0002832164 7.791911e-05 2.674164e-05
## Cumulative Proportion  0.999596525 0.9998797419 9.999577e-01 9.999844e-01
##                             Comp.13      Comp.14
## Standard deviation     9.730192e-02 2.021332e-02
## Proportion of Variance 1.495208e-05 6.452592e-07
## Cumulative Proportion  9.999994e-01 1.000000e+00

The results are, however, not invariant with respect to transformations.

PCdata <- scale(PCdata, center = TRUE, scale = TRUE)
summary(prcomp(PCdata))
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.9922 1.3774 1.2203 0.83936 0.57503 0.38570
## Proportion of Variance 0.6395 0.1355 0.1064 0.05032 0.02362 0.01063
## Cumulative Proportion  0.6395 0.7750 0.8814 0.93170 0.95532 0.96594
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.37509 0.31946 0.25953 0.24952 0.21451 0.19375
## Proportion of Variance 0.01005 0.00729 0.00481 0.00445 0.00329 0.00268
## Cumulative Proportion  0.97599 0.98328 0.98809 0.99254 0.99583 0.99851
##                           PC13    PC14
## Standard deviation     0.11730 0.08427
## Proportion of Variance 0.00098 0.00051
## Cumulative Proportion  0.99949 1.00000
summary(princomp(PCdata))
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.9690582 1.3667152 1.2108763 0.83287359 0.57059435
## Proportion of Variance 0.6395033 0.1355069 0.1063665 0.05032265 0.02361893
## Cumulative Proportion  0.6395033 0.7750102 0.8813767 0.93169932 0.95531825
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.38272054 0.37219820 0.316989404 0.257526895
## Proportion of Variance 0.01062598 0.01004972 0.007289451 0.004811168
## Cumulative Proportion  0.96594423 0.97599395 0.983283402 0.988094570
##                           Comp.10     Comp.11     Comp.12      Comp.13
## Standard deviation     0.24759625 0.212856261 0.192254476 0.1163912359
## Proportion of Variance 0.00444727 0.003286837 0.002681379 0.0009827565
## Cumulative Proportion  0.99254184 0.995828677 0.998510056 0.9994928124
##                             Comp.14
## Standard deviation     0.0836145071
## Proportion of Variance 0.0005071876
## Cumulative Proportion  1.0000000000

Compare the standard deviation values with the square roots of the eigen values of the sample variance covariance matrix.

sqrt(eigen(var(PCdata))$values)
##  [1] 2.99216408 1.37735123 1.22029957 0.83935520 0.57503484 0.38569896
##  [7] 0.37509473 0.31945628 0.25953103 0.24952310 0.21451275 0.19375064
## [13] 0.11729702 0.08426521



Various graphical representations (visualization)

First of all, it is usefull to know the amount of variability (the proportion of the total variability) which is explained by some first principal components.

pc <- princomp(PCdata)
par(mfrow = c(1,2))
plot(pc, type = "l", col = "red", main = "")
plot(pc, col = "red", main = "")

summary(pc)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.9690582 1.3667152 1.2108763 0.83287359 0.57059435
## Proportion of Variance 0.6395033 0.1355069 0.1063665 0.05032265 0.02361893
## Cumulative Proportion  0.6395033 0.7750102 0.8813767 0.93169932 0.95531825
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.38272054 0.37219820 0.316989404 0.257526895
## Proportion of Variance 0.01062598 0.01004972 0.007289451 0.004811168
## Cumulative Proportion  0.96594423 0.97599395 0.983283402 0.988094570
##                           Comp.10     Comp.11     Comp.12      Comp.13
## Standard deviation     0.24759625 0.212856261 0.192254476 0.1163912359
## Proportion of Variance 0.00444727 0.003286837 0.002681379 0.0009827565
## Cumulative Proportion  0.99254184 0.995828677 0.998510056 0.9994928124
##                             Comp.14
## Standard deviation     0.0836145071
## Proportion of Variance 0.0005071876
## Cumulative Proportion  1.0000000000



Try by yourself


  • Use the output above and plot a figure where one can clearly see the proportions of explained variability for each component.
  • Similarly, from the output above, create a figure with cumulative proportions of the explaied variability.
  • Apply a similar approach (PCA) to the iris dataset available in the R environment (use data(iris)).


There are also other options to visualize the results. A standard graphical tool is available through the commnad biplot() which is available in the standard R installation. This fuction automatically projects the data on the first two PCs. Other principle components can be also used in the figure (parameter ‘choices = 1:2’).

biplot(pc, scale = 0.5, choices = 1:2, cex = 0.8)

Or one can bring it all to a more detailed version by

biplot(pc, scale = 0.5, choices = 1:2, xlim = c(-1,1), ylim = c(-1, 1), cex = 0.8)

A more fancy version can by obtained by installing the ggbiplot package from the Github repository (use command install_github("ggbiplot", "vqv")). After loading in the library one can use the ggbiplot() command (and additional parameters).

library("ggbiplot")
ggbiplot(princomp(PCdata), elipse = T, groups = Ntaxon, circle = T)

Another type of plot can be obtained using the library ‘ggfortify’ (use install.packages('ggfortify', dep = T) to install the library) and the command autoplot().

library("ggfortify")
autoplot(prcomp(PCdata), data = data,  colour = 'Ntaxon', loadings = T)

Last but not least, is a very complex command fviz_pca() available in the library ‘factoextra’ (use install.packages("factoextra") for installation).

library("factoextra")
par(mfrow = c(1,3))
fviz_pca_ind(pc)

fviz_pca_var(pc)

fviz_pca_biplot(pc)

Try by yourself


There are many more packages and libraries available for the principal component analysis in R. You can use google to find some others not mentioned in this text.
  • Use some two dimensional artificial datasets with small amount of observations (e.g. 3 - 5) and perform the PC analysis. Using some appropriate visualization techniques try to understand the algebra machinery behind.


Domáca úloha

(Deadline: siedmy týždeň / 04.04.2016)

Na webovej stránke Doc. Hlávku

http://www1.karlin.mff.cuni.cz/~hlavka/teac.html

je k dispozícii niekoľko dátových súborov, ktoré do Rka stačí načítať pomocou príkazu load(nazov_suboru.rda).

  • Vyberte si jeden dátový súbor a aplikujte metodu PCA.
  • Udělejte alespoň jeden graficky výstup a naležite jej okomentujte.