NMST539 | Lab Session 10

Canonical Corelation Analysis

(application in R)

LS 2017 | Tuesday 16/05/2017

Rmd file (UTF8 encoding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
Komárek, A.: Základy práce s R. (PDF súbor)
Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

Canonical Correlation Analysis in R

The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike the PCA where one tries to explore the common variability within the given data set, the canonical correlation is used to explore the mutual variability (correlations) between two different sets of quantitative variables observed on the same experimental units. Thus, the PCA method deals with one dataset only and it tries to to reduce the overall dimensionality of the dataset using some linear combination of the initial variables with the maximum variace possible while the CCA approach searhes for some linear combinations of the initial variables in order to achieve maximum correlations instead.

For instance, for two multivariate random vectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) we assume that the common variance-covariance matrix of \((\boldsymbol{X}, \boldsymbol{Y})^\top\) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \]

If \(\boldsymbol{X}\) is a \(p\) dimensional random vector and \(\boldsymbol{Y}\) is a \(q\) dimensional random vecton, then the corresponding correlation of any linear combination of \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) in terms of \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) can be expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}. \]

Therefore, the canonical correlations analysis searches for such vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), that the previous correlation is as large as possible. However, the canonical correlation analysis does not search for just one pair of the canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but instead, it defines all pairs (up to \(min(p, q)\)), such that all canonical covariates within each dataset are uncorrelated among each other.

In the R statistical software there is function cancor() available under the standard instalation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).

In the following we will consider the same dataset we have already worked with before. The data consists of two datasets where each dataset represents measurements in some specific (64-65) river localities in the Czech republic. The first dataset contains measurements of biologial metrics for each locality (17 different metrics and taxons) and the secon dataset contains measurements on chemical concentrations and values at the same localities (7 covariates recored).

The idea is to somehow relate both dataset in order to explain what bio metrics can correlate which which chemical concentrations. One way of answering this is to apply the canonical correlation approach explained below.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

head(bioData)

##   Kod_Canoco SaprInd    Lital    RETI   EPTAbu    Marg Metaritr   JepAbu
## 1   BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2   BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3   BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4   BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5   BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813  5.95577
## 6   BlatTova 2.90209 16.04854 0.30351  8.64187 5.51661  7.98030  4.71039
##   Epiritral Hyporitral Ntaxon Nceled  Bindex EPTTax  PosAbu  Spasaci
## 1   7.68869   22.52242     18     40 0.64865     23 0.33939 28.29352
## 2   6.96659   19.74992     12     25 0.37838     16 0.00000 20.84029
## 3   5.92213   18.63735      9     25 0.16216     12 1.47017 20.88415
## 4  20.37168   18.17237     16     40 0.72000     30 8.07504 28.62651
## 5  12.08359   19.09771     21     36 0.54167     21 0.54567 22.02324
## 6   5.23533   14.60371     15     27 0.09091      5 0.00000 16.78539
##         MDS_1      MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265  1.6051440
## 4  0.55758620 -0.4424091
## 5  0.40888510 -0.2172988
## 6 -0.76699300  0.5244099

head(chemData)

##   Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1   BecvChor     20.9  107  1.3 33.7  0.09  1.55 0.077
## 2   BecvOsek     21.8  105  1.2 37.0  0.06  2.32 0.063
## 3   BecvTrou     22.1  106  1.4 43.9  0.07  2.49 0.080
## 4   BelaBosk     16.4  104  1.1 22.1  0.02  2.05 0.029
## 5   BilyPoto     18.9  104  1.9 30.4  0.10  4.48 0.131
## 6   BrezJaro     18.4   84  2.5 83.8  0.30  3.63 0.237

Again, there is one locality which is missing the dataset of chemical measurements. We identify this locality and we do not consider it for further analysis. Both datasets are alligned with respect to he locality names - the first column in each dataset.

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(bioData[ind, ], chemData[, 2:8])

X <- data[,2:9]
Y <- data[,19:25]

A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is avialable within the R package ‘CCA’ (Canonical Correlation Analysis). The packages needs to be firstly installed (install.packages("CCA")) and later one can use functions matcor() and img.matcor() to graphically display the corration matrix

\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{align}\Sigma_{XX} & \Sigma_{XY}\\\Sigma_{YX} & \Sigma_{YY}\end{align}\right)\].

library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)

Next we can directly use cancor() function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use function cc() (from the R package ‘CCA’). Note, the the results are equivalent although not identical.

cc1 <- cancor(X, Y)  ### function from standard R instalation
cc2 <- cc(X, Y)      ### function for the R package 'CCA'

The canonical correlations are the same for both approaches (R commands used)

cc1$cor  ### function from standard R instalation

## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277

cc2$cor  ### function for the R package 'CCA'

## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277

The canonical covariates can be easily reconstructred as follows:

CanX <- as.matrix(X) %*% cc1$xcoef
CanY <- as.matrix(Y) %*% cc1$ycoef

cor(CanX, CanY)

##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  8.749611e-01 -5.641494e-17 -2.837332e-16  2.876720e-16 -2.755136e-16
## [2,]  1.969775e-18  7.049659e-01  6.045359e-17  3.672712e-16  2.833397e-16
## [3,]  2.055352e-15 -6.475529e-16  6.457969e-01 -8.881386e-16 -4.425705e-17
## [4,]  5.598600e-16 -4.649601e-16 -7.193021e-16  4.875075e-01 -1.535789e-16
## [5,] -6.529418e-16  3.505207e-16  1.774706e-16 -7.011448e-18  2.420530e-01
## [6,]  1.389060e-16 -3.532212e-17 -3.754766e-16 -4.455948e-16  5.118790e-16
## [7,] -8.693158e-16  4.868860e-16 -2.229543e-16  2.685094e-19  3.028397e-17
## [8,] -1.861723e-16  3.088850e-16 -3.129291e-16 -4.233271e-16 -4.250917e-16
##               [,6]          [,7]
## [1,]  2.598782e-17  8.001412e-17
## [2,] -4.062772e-17 -1.190124e-16
## [3,]  5.531278e-16  1.034287e-16
## [4,] -2.964996e-16 -6.147935e-17
## [5,] -2.659521e-16  1.510269e-16
## [6,]  1.156383e-01  8.425013e-18
## [7,] -4.480597e-16  4.481277e-02
## [8,]  6.107525e-16  2.362837e-16

Which corresponds with the output provided by the function cancor() and cc()above (see also the picture below for a visual comparison).

par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc2$cor, main = "Canonical correlations for 'cc()'", col = "gray")

However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:

cc1$xcoef  ### function from standard R instalation

##                   [,1]         [,2]         [,3]         [,4]         [,5]
## SaprInd    0.162796345  0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital     -0.003800701  0.001797172 -0.030965801 -0.007214761  0.017517637
## RETI       0.324209881 -0.055178843  0.693084411 -1.283570658 -2.507634813
## EPTAbu    -0.002149837 -0.005077822 -0.009027554  0.008345945  0.005867073
## Marg      -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr  -0.012221662  0.001551030  0.043003790  0.012178243 -0.003013795
## JepAbu     0.005168862  0.006308244  0.004048909 -0.009547646 -0.019794148
## Epiritral  0.010432924  0.023391927 -0.013976612 -0.021306313  0.018722542
##                   [,6]        [,7]         [,8]
## SaprInd   -0.098864028  0.06897063 -0.355850803
## Lital      0.004465308 -0.01327305  0.008943729
## RETI       3.047758363  1.14371548  0.285767761
## EPTAbu    -0.005401486  0.01277878 -0.026098676
## Marg      -0.029969261  0.01102501 -0.002459819
## Metaritr  -0.027620185 -0.04963561 -0.056954743
## JepAbu    -0.005346595 -0.00253416  0.030139906
## Epiritral -0.022760565  0.04740075  0.064087475

cc2$xcoef  ### function for the R package 'CCA'

##                  [,1]        [,2]        [,3]         [,4]         [,5]
## SaprInd   -1.29215593 -0.71208465 -1.28375231  -3.90455380  -1.48327854
## Lital      0.03016713 -0.01426461 -0.24578343  -0.05726539   0.13904193
## RETI      -2.57333615  0.43796849  5.50118696 -10.18802625 -19.90373428
## EPTAbu     0.01706380  0.04030396 -0.07165399   0.06624389   0.04656845
## Marg       0.09455561  0.30083342 -0.07672439  -0.52294063  -0.20351082
## Metaritr   0.09700644 -0.01231092  0.34133200   0.09666181  -0.02392126
## JepAbu    -0.04102657 -0.05007014  0.03213722  -0.07578209  -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592  -0.16911362   0.14860557
##                  [,6]        [,7]
## SaprInd   -0.78470889  0.54743742
## Lital      0.03544229 -0.10535158
## RETI      24.19083206  9.07796019
## EPTAbu    -0.04287297  0.10142843
## Marg      -0.23787363  0.08750829
## Metaritr  -0.21922842 -0.39397046
## JepAbu    -0.04243728 -0.02011427
## Epiritral -0.18065638  0.37623175

cc1$ycoef

##                  [,1]         [,2]         [,3]         [,4]         [,5]
## Tepl_max  0.002728344 -0.031587883 -0.022171030  0.015570969 -0.026678176
## X.O2      0.001145406 -0.004145546  0.005890339 -0.012452124 -0.009542703
## BSK5      0.015042715  0.054146231 -0.027996459  0.124516411 -0.026809781
## Kond      0.002545019  0.001425917 -0.004281998  0.001343352  0.001807199
## N.NH4     0.307216652  0.432469799  1.131659010  0.087843642 -0.818023849
## N.NO3    -0.010468525 -0.030934549  0.053453456  0.031622860  0.022842863
## Pcelk     0.412696634 -0.979367658  0.196048060 -3.071986907  0.807813337
##                  [,6]          [,7]
## Tepl_max  0.024294508  0.0072354037
## X.O2     -0.021080517 -0.0001162876
## BSK5     -0.122653042 -0.1252271666
## Kond     -0.003536074  0.0038599055
## N.NH4     0.123847014  0.5657683896
## N.NO3    -0.011571285  0.0045994691
## Pcelk     1.447735216 -1.1452167044

cc2$ycoef

##                  [,1]        [,2]        [,3]         [,4]        [,5]
## Tepl_max -0.021655562  0.25072104 -0.17597710   0.12359074 -0.21175146
## X.O2     -0.009091381  0.03290426  0.04675311  -0.09883567 -0.07574285
## BSK5     -0.119397849 -0.42977239 -0.22221500   0.98831837 -0.21279604
## Kond     -0.020200461 -0.01131786 -0.03398730   0.01066252  0.01434420
## N.NH4    -2.438456583 -3.43262261  8.98226493   0.69723729 -6.49286301
## N.NO3     0.083091342  0.24553537  0.42427365   0.25099867  0.18130960
## Pcelk    -3.275677981  7.77348979  1.55608323 -24.38314016  6.41181959
##                 [,6]          [,7]
## Tepl_max  0.19283168  0.0574292361
## X.O2     -0.16732142 -0.0009230045
## BSK5     -0.97352834 -0.9939598209
## Kond     -0.02806672  0.0306370503
## N.NH4     0.98300520  4.4906473759
## N.NO3    -0.09184423  0.0365071542
## Pcelk    11.49104203 -9.0898757913

Using the ‘CCA’ package one can also take an advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display maximized correlations between transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).

plt.cc(cc2, var.label = TRUE, ind.names = data[,1])

Questions

How would you interpret the left hand side of the figure?
What is the relationship between the subfigure on the left hand side and the subfigure on the right hand side?
What would be your conclussion about the two available dataset and their mutual relationship?

A usefull extension to the classical canonical correlation approach is the regularized version - so called regularized canonical correlation analysis. It is mean for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{n}\) (usually \(p \gg n\)). There is R function rcc() available in the R package ‘CCA’.

Another option how to perform canonical correlation in R is to install an additional R package - package ‘vegan’ (use install.packages("vegan") for instalation) and to apply function cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.

library(vegan)
cc3 <- cca(X, Y)

And the corresponding graphical summary given by plot() function.

plot(cc3, scaling = 1)

To Do

Use this markdown and consider (maybe) the whole dataset on biologial measurements. Compare how much the results changes.
Consider a lower dimensional profile and try to refit the whole canonical correlation analysis on a lower dimensional datasets (e.g. two covariates for bio data and two covariates for chemical data).
Try to manually reconstruct the results from the above by using the correponding canonical correlation theory form the lecture.

Homework Assignment

(Deadline: last Tuesday / 23.05.2007)

Use the command below to instal the SMSdata packages with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Chose one dataset of your choise from the list of all available datasets in the package:

data(package = "SMSdata")

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma)

Choose one dataset and use some judgement to split the data into two disjoint parts. Alternatively, you could also consider two different datasets however, they should be recorded on the same units. Apply the canonical correlation analysi.
Finilize your report and make it ready for the final submission. The submission deadline is Monday, 22/05/2017, 23:59

. The report should be submitted by emial to (at least) one of the following addresses:

maciak AT karlin.mff.cuni.cz
maciak AT ualberta.ca