NMST539, LS 2015/16

Cvičenie 11 (týždeň 12)

Canonical Corelations in R


Canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which operates on the same principle as the principal component analysis. The main purpose of the canonical correlation approach is the exploration of sample correlations between two sets of quantitative variables observed on the same experimental units. On the other hand PCA method deals with one data set only and it tries to to reduce the overall dimensionality of the dataset using some linear combination of the initial variables.

In the R statistical software there is function cancor() available under the standard instalation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).

In the following we will consider the same dataset we have already worked with before. The data consists of two datasets where each dataset represents measurements in some specific (64-65) river localities in the Czech republic. The first dataset contains measurements of biologial metrics for each locality (17 different metrics and taxons) and the secon dataset contains measurements on chemical concentrations and values at the same localities (7 covariates recored).

The idea is to somehow relate both dataset in order to explain what bio metrics can correlate which which chemical concentrations. One way of answering this is to apply the canonical correlation approach explained below.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

head(bioData)
##   Kod_Canoco SaprInd    Lital    RETI   EPTAbu    Marg Metaritr   JepAbu
## 1   BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2   BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3   BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4   BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5   BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813  5.95577
## 6   BlatTova 2.90209 16.04854 0.30351  8.64187 5.51661  7.98030  4.71039
##   Epiritral Hyporitral Ntaxon Nceled  Bindex EPTTax  PosAbu  Spasaci
## 1   7.68869   22.52242     18     40 0.64865     23 0.33939 28.29352
## 2   6.96659   19.74992     12     25 0.37838     16 0.00000 20.84029
## 3   5.92213   18.63735      9     25 0.16216     12 1.47017 20.88415
## 4  20.37168   18.17237     16     40 0.72000     30 8.07504 28.62651
## 5  12.08359   19.09771     21     36 0.54167     21 0.54567 22.02324
## 6   5.23533   14.60371     15     27 0.09091      5 0.00000 16.78539
##         MDS_1      MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265  1.6051440
## 4  0.55758620 -0.4424091
## 5  0.40888510 -0.2172988
## 6 -0.76699300  0.5244099
head(chemData)
##   Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1   BecvChor     20.9  107  1.3 33.7  0.09  1.55 0.077
## 2   BecvOsek     21.8  105  1.2 37.0  0.06  2.32 0.063
## 3   BecvTrou     22.1  106  1.4 43.9  0.07  2.49 0.080
## 4   BelaBosk     16.4  104  1.1 22.1  0.02  2.05 0.029
## 5   BilyPoto     18.9  104  1.9 30.4  0.10  4.48 0.131
## 6   BrezJaro     18.4   84  2.5 83.8  0.30  3.63 0.237



Again, there is one locality which is missing the dataset of chemical measurements. We identify this locality and we do not consider it for further analysis. Both datasets are alligned with respect to he locality names - the first column in each dataset.

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(bioData[ind, ], chemData[, 2:8])
X <- data[,2:9]
Y <- data[,19:25]

A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is avialable within the R package ‘CCA’ (Canonical Correlation Analysis). The packages needs to be firstly installed (install.packages("CCA")) and later one can use functions matcor() and img.matcor() to graphically display the corration matrix

\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{align}\Sigma_{XX} & \Sigma_{XY}\\\Sigma_{YX} & \Sigma_{YY}\end{align}\right)\].

library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)

Next we can directly use cancor() function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use function cc() (from the R package ‘CCA’). Note, the the results are equivalent although not identical.

cc1 <- cancor(X, Y)  ### function from standard R instalation
cc2 <- cc(X, Y)      ### function for the R package 'CCA'

The canonical correlations are the same for both approaches (R commands used)

cc1$cor  ### function from standard R instalation
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
cc2$cor  ### function for the R package 'CCA'
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277

which can be also visualized by an appropriate picture (e.g.):

par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")

However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:

cc1$xcoef  ### function from standard R instalation
##                   [,1]         [,2]         [,3]         [,4]         [,5]
## SaprInd    0.162796345  0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital     -0.003800701  0.001797172 -0.030965801 -0.007214761  0.017517637
## RETI       0.324209881 -0.055178843  0.693084411 -1.283570658 -2.507634813
## EPTAbu    -0.002149837 -0.005077822 -0.009027554  0.008345945  0.005867073
## Marg      -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr  -0.012221662  0.001551030  0.043003790  0.012178243 -0.003013795
## JepAbu     0.005168862  0.006308244  0.004048909 -0.009547646 -0.019794148
## Epiritral  0.010432924  0.023391927 -0.013976612 -0.021306313  0.018722542
##                   [,6]        [,7]         [,8]
## SaprInd   -0.098864028  0.06897063 -0.355850803
## Lital      0.004465308 -0.01327305  0.008943729
## RETI       3.047758363  1.14371548  0.285767761
## EPTAbu    -0.005401486  0.01277878 -0.026098676
## Marg      -0.029969261  0.01102501 -0.002459819
## Metaritr  -0.027620185 -0.04963561 -0.056954743
## JepAbu    -0.005346595 -0.00253416  0.030139906
## Epiritral -0.022760565  0.04740075  0.064087475
cc2$xcoef  ### function for the R package 'CCA'
##                  [,1]        [,2]        [,3]         [,4]         [,5]
## SaprInd   -1.29215593 -0.71208465 -1.28375231  -3.90455380  -1.48327854
## Lital      0.03016713 -0.01426461 -0.24578343  -0.05726539   0.13904193
## RETI      -2.57333615  0.43796849  5.50118696 -10.18802625 -19.90373428
## EPTAbu     0.01706380  0.04030396 -0.07165399   0.06624389   0.04656845
## Marg       0.09455561  0.30083342 -0.07672439  -0.52294063  -0.20351082
## Metaritr   0.09700644 -0.01231092  0.34133200   0.09666181  -0.02392126
## JepAbu    -0.04102657 -0.05007014  0.03213722  -0.07578209  -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592  -0.16911362   0.14860557
##                  [,6]        [,7]
## SaprInd   -0.78470889  0.54743742
## Lital      0.03544229 -0.10535158
## RETI      24.19083206  9.07796019
## EPTAbu    -0.04287297  0.10142843
## Marg      -0.23787363  0.08750829
## Metaritr  -0.21922842 -0.39397046
## JepAbu    -0.04243728 -0.02011427
## Epiritral -0.18065638  0.37623175
cc1$ycoef
##                  [,1]         [,2]         [,3]         [,4]         [,5]
## Tepl_max  0.002728344 -0.031587883 -0.022171030  0.015570969 -0.026678176
## X.O2      0.001145406 -0.004145546  0.005890339 -0.012452124 -0.009542703
## BSK5      0.015042715  0.054146231 -0.027996459  0.124516411 -0.026809781
## Kond      0.002545019  0.001425917 -0.004281998  0.001343352  0.001807199
## N.NH4     0.307216652  0.432469799  1.131659010  0.087843642 -0.818023849
## N.NO3    -0.010468525 -0.030934549  0.053453456  0.031622860  0.022842863
## Pcelk     0.412696634 -0.979367658  0.196048060 -3.071986907  0.807813337
##                  [,6]          [,7]
## Tepl_max  0.024294508  0.0072354037
## X.O2     -0.021080517 -0.0001162876
## BSK5     -0.122653042 -0.1252271666
## Kond     -0.003536074  0.0038599055
## N.NH4     0.123847014  0.5657683896
## N.NO3    -0.011571285  0.0045994691
## Pcelk     1.447735216 -1.1452167044
cc2$ycoef
##                  [,1]        [,2]        [,3]         [,4]        [,5]
## Tepl_max -0.021655562  0.25072104 -0.17597710   0.12359074 -0.21175146
## X.O2     -0.009091381  0.03290426  0.04675311  -0.09883567 -0.07574285
## BSK5     -0.119397849 -0.42977239 -0.22221500   0.98831837 -0.21279604
## Kond     -0.020200461 -0.01131786 -0.03398730   0.01066252  0.01434420
## N.NH4    -2.438456583 -3.43262261  8.98226493   0.69723729 -6.49286301
## N.NO3     0.083091342  0.24553537  0.42427365   0.25099867  0.18130960
## Pcelk    -3.275677981  7.77348979  1.55608323 -24.38314016  6.41181959
##                 [,6]          [,7]
## Tepl_max  0.19283168  0.0574292361
## X.O2     -0.16732142 -0.0009230045
## BSK5     -0.97352834 -0.9939598209
## Kond     -0.02806672  0.0306370503
## N.NH4     0.98300520  4.4906473759
## N.NO3    -0.09184423  0.0365071542
## Pcelk    11.49104203 -9.0898757913

Using the ‘CCA’ package one can also take an advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display maximized correlations between transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).

plt.cc(cc2, var.label = TRUE, ind.names = data[,1])

Questions


  • How would you interpret the left hand side of the figure?
  • What is the relationship between the subfigure on the left hand side and the subfigure on the right hand side?
  • What would be your conclussion about the two available dataset and their mutual relationship?



A usefull extension to the classical canonical correlation approach is the regularized version - so called regularized canonical correlation analysis. It is mean for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{n}\) (usually \(p \gg n\)). There is R function rcc() available in the R package ‘CCA’.

Another option how to perform canonical correlation in R is to install an additional R package - package ‘vegan’ (use install.packages("vegan") for instalation) and to apply function cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.

library(vegan)
cc3 <- cca(X, Y)

And the corresponding graphical summary given by plot() function.

plot(cc3, scaling = 1)

To Do


  • Use this markdown and consider (maybe) the whole dataset on biologial measurements. Compare how much the results changes.
  • Consider a lower dimensional profile and try to refit the whole canonical correlation analysis on a lower dimensional datasets (e.g. two covariates for bio data and two covariates for chemical data).
  • Try to manually reconstruct the results from the above by using the correponding canonical correlation theory form the lecture.