NMST539 | Lab Session 11

Discriminant Analysis & Canonical Corelation Analysis

LS 2017/2018 | Monday 14/05/18

Rmd file (UTF8 coding)

The R-software is available for download from the official website: https://www.r-project.org

A user-friendly interface (one of many): RStudio

Some useful manuals and tutorials for R (in Czech or English):

  • Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

1. Discriminant analysis in R

Discriminant analysis is a (not purely a multivariate) statistical method which was designed for allocating various objects into similar (homogenous) groups. The idea is to use some available information (given covariates) in order to create some classification rule which can be used later to assign some new object into one of the existing groups. From this point of view it can be considered as a classification tool. To remind the difference between clustering and clussification, let us recall the following:

  • Classification works with data for which the groups are known and we try to learn what differentiates them in order to be able to assign future observations into the given groups.
  • Clustering has no prior information on what are the underlying groups – the groups are unknown and we try to identify the groups by considering mutual similarities/differencies in the given observations.

There are many different approaches how to define some reasonable descrimination rule. Such rules can be based, for instance, on the prior probabilities (Bayes rule) and the likelihood approach. For a normal model and homoscedastic error the discrimination rule simplifies to a linear form (linear discriminant analysis) while for the heteroscedastic errors the discrimination rule has a quadratic form instead. We mostly focuss on a linear discriminant analysis (LDA) which is a statistical method meant to find a linear combination of features (usually a set of continuous random variables) that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier. This method is implemented in the R software implemented in the function lda() which is available in the packages ‘MASS’. An straightforward extention is a quadratic discrimination rule wich is performed with the R command ‘qda()’ (also in package ‘MASS’).

Linear discriminant analysis is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable instead. On the other hand, the logistic regression approach is more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

If the normality assumptions is too much restrictive, we can use Fisher’s approach which finds the linear discrimination in a way that between class and within class variability are optimized. This method does not even require homoscedastic errors. Suppose that two groups of observations have their means denoted as \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu_2}\) and the corresponding variance-covariances matrices are \(\Sigma_1\) and \(\Sigma_2\). The Fisher’s linear separation measure between for these two groups (the corresponding distributions respectively) is defined as a ratio of the variance between the groups to the variance within the groupds, which is \[ S = \frac{\sigma_{between}^2}{\sigma_{within}^2} = \frac{(\boldsymbol{w} \boldsymbol{\mu}_2 - \boldsymbol{w} \boldsymbol{\mu}_1)^2}{\boldsymbol{w}^\top (\Sigma_{1} + \Sigma_2) \boldsymbol{w} }. \]

This measure is, in some sense, a measure of the signal-to-noise ratio for the class labeling and it can be shown that the maximum separation occurs for \[ \boldsymbol{w} = (\Sigma_1 + \Sigma_2)^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1). \]



In the following, let us consider the mtcars data set. We use the subset of the continuous covariates and we search for a linear classification rule to classify the given cars by the number of cylinders (thus, we have three groups – cars with 4 cylinders, 6 cylinders, and 8 cylinders).

library(MASS)
lda1 <- lda(cyl ~ mpg + hp + qsec + drat + disp, data = mtcars)

pred <- predict(lda1)
ldahist(data = pred$x[,1], g=mtcars$cyl)

From the theoretical point of view the linear discriminant analysis is a linear partition of the \(n\) dimensional sample space. In the example above we have five dimensional sample space \(R^5\) with the random vector \[ \boldsymbol{X} = (X_1, \dots, X_5)^\top = (mpg, hp, qsec, drat, disp)^\top. \]

The linear discriminant analysis provides a specific partition, which can be visualized using the R function partimat() from the library klaR.

library("klaR")
partimat(as.factor(cyl) ~ mpg + hp + qsec + drat + disp, data = mtcars, method = "lda")

which can be further compared with another plot where we only distinguish for different groups

pairs(mtcars[c("mpg", "hp", "qsec", "drat", "disp")], pch = 22, bg = c("red", "yellow", "green")[unclass(as.factor(mtcars$cyl))])




To Do


  • For a better graphical insight into the LDA results we can consider a simplification where we only use one continuous covariate as a predictor.

    r lda2 <- lda(cyl ~ mpg, data = mtcars) lda2cv <- lda(cyl ~ mpg, data = mtcars, CV = T) The linear discriminant rule is given by the scaling factor. The corresponding transformations can be obtained as follows:

    lda_scores <- as.numeric(lda2$scaling) * mtcars$mpg
    par(mfrow = c(1,2))
    plot(lda_scores ~ mtcars$mpg, pch = 21, bg = mtcars$cyl, ylab ="LDA Scores", xlab = "Miles per Gallon")
    
    lda2cv <- cbind(lda2cv$posterior, mtcars$mpg, mtcars$cyl)
    lda2cv <- lda2cv[order(mtcars$mpg), ]
    
    plot(lda2cv[,1] ~ lda2cv[,4], pch = 21, bg = lda2cv[,5], ylab = "Posterior Probability", xlab = "Miles Per Gallon")
    lines(lda2cv[,1] ~ lda2cv[,4], col = "blue")
    points(lda2cv[,2] ~ lda2cv[,4], pch = 22, bg = lda2cv[,5])
    lines(lda2cv[,2] ~ lda2cv[,4], col = "violet")
    points(lda2cv[,3] ~ lda2cv[,4], pch = 23, bg = lda2cv[,5])
    lines(lda2cv[,3] ~ lda2cv[,4], col = "gray")
  • Consider an analogous situation for more than just one continuous predictor. Use some graphical tools to visualize the results.
  • Use some additional datapoints (feel free to make up some) and try to use the discrimination rules to assign the new points to the existing groups.
  • Consider the heteroscedastic version of the model and try to apply the quadratic discriminant function (command ‘qda()’ in package ‘MASS’).

  • Consider, for instance, an alternative classification perfomed within the framework of a logistic model.



2. Correspondance analysis in R

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.

3. Canonical Correlation Analysis in R

The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike the PCA where one tries to explore the common variability within the given data set, the canonical correlation is used to explore the mutual variability (correlations) between two different sets of quantitative variables observed on the same experimental units. Thus, the PCA method deals with one dataset only and it tries to to reduce the overall dimensionality of the dataset using some linear combination of the initial variables with the maximum variace possible while the CCA approach searhes for some linear combinations of the initial variables in order to achieve maximum correlations instead.

For instance, for two multivariate random vectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) we assume that the common variance-covariance matrix of \((\boldsymbol{X}, \boldsymbol{Y})^\top\) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \]

If \(\boldsymbol{X}\) is a \(p\) dimensional random vector and \(\boldsymbol{Y}\) is a \(q\) dimensional random vecton, then the corresponding correlation of any linear combination of \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) in terms of \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) can be expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}. \]

Therefore, the canonical correlations analysis searches for such vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), that the previous correlation is as large as possible. However, the canonical correlation analysis does not search for just one pair of the canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but instead, it defines all pairs (up to \(min(p, q)\)), such that all canonical covariates within each dataset are uncorrelated among each other.

In the R statistical software there is function cancor() available under the standard instalation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).

In the following we will consider the same dataset we have already worked with before. The data consists of two datasets where each dataset represents measurements in some specific (64-65) river localities in the Czech republic. The first dataset contains measurements of biologial metrics for each locality (17 different metrics and taxons) and the secon dataset contains measurements on chemical concentrations and values at the same localities (7 covariates recored).

The idea is to somehow relate both dataset in order to explain what bio metrics can correlate which which chemical concentrations. One way of answering this is to apply the canonical correlation approach explained below.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

head(bioData)
##   Kod_Canoco SaprInd    Lital    RETI   EPTAbu    Marg Metaritr   JepAbu
## 1   BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2   BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3   BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4   BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5   BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813  5.95577
## 6   BlatTova 2.90209 16.04854 0.30351  8.64187 5.51661  7.98030  4.71039
##   Epiritral Hyporitral Ntaxon Nceled  Bindex EPTTax  PosAbu  Spasaci
## 1   7.68869   22.52242     18     40 0.64865     23 0.33939 28.29352
## 2   6.96659   19.74992     12     25 0.37838     16 0.00000 20.84029
## 3   5.92213   18.63735      9     25 0.16216     12 1.47017 20.88415
## 4  20.37168   18.17237     16     40 0.72000     30 8.07504 28.62651
## 5  12.08359   19.09771     21     36 0.54167     21 0.54567 22.02324
## 6   5.23533   14.60371     15     27 0.09091      5 0.00000 16.78539
##         MDS_1      MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265  1.6051440
## 4  0.55758620 -0.4424091
## 5  0.40888510 -0.2172988
## 6 -0.76699300  0.5244099
head(chemData)
##   Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1   BecvChor     20.9  107  1.3 33.7  0.09  1.55 0.077
## 2   BecvOsek     21.8  105  1.2 37.0  0.06  2.32 0.063
## 3   BecvTrou     22.1  106  1.4 43.9  0.07  2.49 0.080
## 4   BelaBosk     16.4  104  1.1 22.1  0.02  2.05 0.029
## 5   BilyPoto     18.9  104  1.9 30.4  0.10  4.48 0.131
## 6   BrezJaro     18.4   84  2.5 83.8  0.30  3.63 0.237



Again, there is one locality which is missing the dataset of chemical measurements. We identify this locality and we do not consider it for further analysis. Both datasets are alligned with respect to he locality names - the first column in each dataset.

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(bioData[ind, ], chemData[, 2:8])
X <- data[,2:9]
Y <- data[,19:25]

A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is avialable within the R package ‘CCA’ (Canonical Correlation Analysis). The packages needs to be firstly installed (install.packages("CCA")) and later one can use functions matcor() and img.matcor() to graphically display the corration matrix

\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{align}\Sigma_{XX} & \Sigma_{XY}\\\Sigma_{YX} & \Sigma_{YY}\end{align}\right)\].

library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)

Next we can directly use cancor() function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use function cc() (from the R package ‘CCA’). Note, the the results are equivalent although not identical.

cc1 <- cancor(X, Y)  ### function from standard R instalation
cc2 <- cc(X, Y)      ### function for the R package 'CCA'

The canonical correlations are the same for both approaches (R commands used)

cc1$cor  ### function from standard R instalation
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
cc2$cor  ### function for the R package 'CCA'
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277

The canonical covariates can be easily reconstructred as follows:

CanX <- as.matrix(X) %*% cc1$xcoef
CanY <- as.matrix(Y) %*% cc1$ycoef

cor(CanX, CanY)
##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  8.749611e-01 -5.641494e-17 -2.837332e-16  2.876720e-16 -2.755136e-16
## [2,]  1.969775e-18  7.049659e-01  6.045359e-17  3.672712e-16  2.833397e-16
## [3,]  2.055352e-15 -6.475529e-16  6.457969e-01 -8.881386e-16 -4.425705e-17
## [4,]  5.598600e-16 -4.649601e-16 -7.193021e-16  4.875075e-01 -1.535789e-16
## [5,] -6.529418e-16  3.505207e-16  1.774706e-16 -7.011448e-18  2.420530e-01
## [6,]  1.389060e-16 -3.532212e-17 -3.754766e-16 -4.455948e-16  5.118790e-16
## [7,] -8.693158e-16  4.868860e-16 -2.229543e-16  2.685094e-19  3.028397e-17
## [8,] -1.861723e-16  3.088850e-16 -3.129291e-16 -4.233271e-16 -4.250917e-16
##               [,6]          [,7]
## [1,]  2.598782e-17  8.001412e-17
## [2,] -4.062772e-17 -1.190124e-16
## [3,]  5.531278e-16  1.034287e-16
## [4,] -2.964996e-16 -6.147935e-17
## [5,] -2.659521e-16  1.510269e-16
## [6,]  1.156383e-01  8.425013e-18
## [7,] -4.480597e-16  4.481277e-02
## [8,]  6.107525e-16  2.362837e-16

Which corresponds with the output provided by the function cancor() and cc()above (see also the picture below for a visual comparison).

par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc2$cor, main = "Canonical correlations for 'cc()'", col = "gray")

However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:

cc1$xcoef  ### function from standard R instalation
##                   [,1]         [,2]         [,3]         [,4]         [,5]
## SaprInd    0.162796345  0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital     -0.003800701  0.001797172 -0.030965801 -0.007214761  0.017517637
## RETI       0.324209881 -0.055178843  0.693084411 -1.283570658 -2.507634813
## EPTAbu    -0.002149837 -0.005077822 -0.009027554  0.008345945  0.005867073
## Marg      -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr  -0.012221662  0.001551030  0.043003790  0.012178243 -0.003013795
## JepAbu     0.005168862  0.006308244  0.004048909 -0.009547646 -0.019794148
## Epiritral  0.010432924  0.023391927 -0.013976612 -0.021306313  0.018722542
##                   [,6]        [,7]         [,8]
## SaprInd   -0.098864028  0.06897063 -0.355850803
## Lital      0.004465308 -0.01327305  0.008943729
## RETI       3.047758363  1.14371548  0.285767761
## EPTAbu    -0.005401486  0.01277878 -0.026098676
## Marg      -0.029969261  0.01102501 -0.002459819
## Metaritr  -0.027620185 -0.04963561 -0.056954743
## JepAbu    -0.005346595 -0.00253416  0.030139906
## Epiritral -0.022760565  0.04740075  0.064087475
cc2$xcoef  ### function for the R package 'CCA'
##                  [,1]        [,2]        [,3]         [,4]         [,5]
## SaprInd   -1.29215593 -0.71208465 -1.28375231  -3.90455380  -1.48327854
## Lital      0.03016713 -0.01426461 -0.24578343  -0.05726539   0.13904193
## RETI      -2.57333615  0.43796849  5.50118696 -10.18802625 -19.90373428
## EPTAbu     0.01706380  0.04030396 -0.07165399   0.06624389   0.04656845
## Marg       0.09455561  0.30083342 -0.07672439  -0.52294063  -0.20351082
## Metaritr   0.09700644 -0.01231092  0.34133200   0.09666181  -0.02392126
## JepAbu    -0.04102657 -0.05007014  0.03213722  -0.07578209  -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592  -0.16911362   0.14860557
##                  [,6]        [,7]
## SaprInd   -0.78470889  0.54743742
## Lital      0.03544229 -0.10535158
## RETI      24.19083206  9.07796019
## EPTAbu    -0.04287297  0.10142843
## Marg      -0.23787363  0.08750829
## Metaritr  -0.21922842 -0.39397046
## JepAbu    -0.04243728 -0.02011427
## Epiritral -0.18065638  0.37623175
cc1$ycoef
##                  [,1]         [,2]         [,3]         [,4]         [,5]
## Tepl_max  0.002728344 -0.031587883 -0.022171030  0.015570969 -0.026678176
## X.O2      0.001145406 -0.004145546  0.005890339 -0.012452124 -0.009542703
## BSK5      0.015042715  0.054146231 -0.027996459  0.124516411 -0.026809781
## Kond      0.002545019  0.001425917 -0.004281998  0.001343352  0.001807199
## N.NH4     0.307216652  0.432469799  1.131659010  0.087843642 -0.818023849
## N.NO3    -0.010468525 -0.030934549  0.053453456  0.031622860  0.022842863
## Pcelk     0.412696634 -0.979367658  0.196048060 -3.071986907  0.807813337
##                  [,6]          [,7]
## Tepl_max  0.024294508  0.0072354037
## X.O2     -0.021080517 -0.0001162876
## BSK5     -0.122653042 -0.1252271666
## Kond     -0.003536074  0.0038599055
## N.NH4     0.123847014  0.5657683896
## N.NO3    -0.011571285  0.0045994691
## Pcelk     1.447735216 -1.1452167044
cc2$ycoef
##                  [,1]        [,2]        [,3]         [,4]        [,5]
## Tepl_max -0.021655562  0.25072104 -0.17597710   0.12359074 -0.21175146
## X.O2     -0.009091381  0.03290426  0.04675311  -0.09883567 -0.07574285
## BSK5     -0.119397849 -0.42977239 -0.22221500   0.98831837 -0.21279604
## Kond     -0.020200461 -0.01131786 -0.03398730   0.01066252  0.01434420
## N.NH4    -2.438456583 -3.43262261  8.98226493   0.69723729 -6.49286301
## N.NO3     0.083091342  0.24553537  0.42427365   0.25099867  0.18130960
## Pcelk    -3.275677981  7.77348979  1.55608323 -24.38314016  6.41181959
##                 [,6]          [,7]
## Tepl_max  0.19283168  0.0574292361
## X.O2     -0.16732142 -0.0009230045
## BSK5     -0.97352834 -0.9939598209
## Kond     -0.02806672  0.0306370503
## N.NH4     0.98300520  4.4906473759
## N.NO3    -0.09184423  0.0365071542
## Pcelk    11.49104203 -9.0898757913

Using the ‘CCA’ package one can also take an advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display maximized correlations between transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).

plt.cc(cc2, var.label = TRUE, ind.names = data[,1])

Questions


  • How would you interpret the left hand side of the figure?
  • What is the relationship between the subfigure on the left hand side and the subfigure on the right hand side?
  • What would be your conclussion about the two available dataset and their mutual relationship?



A usefull extension to the classical canonical correlation approach is the regularized version - so called regularized canonical correlation analysis. It is meant for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{n}\) (usually \(p \gg n\)). There is R function rcc() available in the R package ‘CCA’.

Another option how to perform canonical correlation in R is to install some additional R packages – for instance, package ‘vegan’ (use install.packages("vegan") for instalation) and to apply function cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.

To Do


  • Use this markdown and consider (maybe) the whole dataset on biologial measurements. Compare how much the results changes.
  • Consider a lower dimensional profile and try to refit the whole canonical correlation analysis on a lower dimensional datasets (e.g. two covariates for bio data and two covariates for chemical data).
  • Try to manually reconstruct the results from the above by using the correponding canonical correlation theory form the lecture.



Note


  • Canonical Correspondance Analysis
    An analogous method that the canonical correlation analysis, however, mainly meant for discrete data.

  • Redundancy Analysis
  • A similar method than the canonical correlation analysis, however, not symetric. The idea is tu use linear combinations of one dataset to maximize the corelation with the original covariates in the second dataset. Thus, it depends on which data set is used as the orginal one.








Homework Assignment

(Final Submission Deadline: Tuesday 28.05.2018)

Use the command below to instal the SMSdata packages with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Chose one dataset of your choise from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 
  • Choose one dataset and use some judgement to split the data into two disjoint parts. Alternatively, you could also consider two different datasets however, they should be recorded on the same units. Apply the canonical correlation analysi.
  • Finilize your report and make it ready for the final submission. The submission deadline is Tuesday, 28/05/2017, 23:59
  • . The report should be submitted by emial to (at least) one of the following addresses:
    • maciak AT karlin.mff.cuni.cz
    • maciak AT ualberta.ca
  • Each student who wants to receive the course credit for this term must submit his/her PDF report (and individual evaluation of all questios and tasks assigned during the term) not later than the deadline provided above.
  • If you need to obtain the course credit before, it is necessary to submit your PDF report at leats two days before the day you need to have the credit assigned.