NMST539 | Lab Session 9

Multidimensional Scaling with Applications in R

LS 2017/2018 | Monday 23/04/18

Rmd file (UTF8 coding)

The R-software is available for download from the official website: https://www.r-project.org

A user-friendly interface (one of many): RStudio

Some useful manuals and tutorials for R (in Czech or English):

Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

1. Multidimensional Scaling (MDS)

The multidimensional scaling technique is another common method in statistics used for high dimensional data visualization and dimensionality reduction. However, unlike the principal components analysis (and somehow factor analysis as well) the multidimensional scaling rather focuses on visualization of common similarities/dissimilarities in observations rather than visualizing the major variability directions in the data (the overall covariance structure respectively).

The main idea of the multidimensional scaling technique is to identify meaningful underlying dimensions that allow us to detect existing similarities and dissimilarities between the observed data points. There are of course many different approaches how to define similarities and dissimilarities respectively. If we choose to measure the simularities and dissimilarities among the given variables in a sense of a classical correlation matrix then we obtain a classical factor analysis approach. On the other hand, if we choose to measure similarities/dissimilarities using a standard Euclidian distance then we end up with a principal component analysis. Many other options are, however, possible.

The starting point for the MDS analysis (algoritm) is so called a matrix of distances (respectively a similarity/dissimilarity matrix) between all pairs of observations while the underlying dimensionality of the data under investigation is, in general, unknown. Hence, the main task is to find a set of points in some low dimension (usually as low as possible) which would reasonably represent the configuration of the original data.

In some situations the multidimensional scaling approach can be also performed for a similarity/dissimilarity matrix wich is not based on a typical distance (see a nonmetric MDS approach below) but some discrete ranks for instance, or a monotone function of these ranks.

In the statistical software R one can use a standard function ‘dist()’ to calculate similarities/dissimilarities (see the help session for further details) wich is available under the standard R instalation. Such matrix can be consequently used for the MDS algorithm which represents the observations in (usually) a lower dimensional plane in a way that the original distances are preserved as well as possible.

We will again start with the dataset which respresents different river localities in the Czech republic. The life diversity is measured by a set of 17 various bio metrics and we are interested in identifying similar localities in the given dataset.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)

Dmatrix <- dist(bioData[,2:18]) 
dim(as.matrix(Dmatrix))

## [1] 65 65

Do by Yourself

What is the right interpretation of the values in the \(\mathbb{D}\) matrix?
By default, the R function dist() uses the Euclidian distance to calculate the dissimilarities among the given observations. How exactly are these values calculated?
Try to reconstruct the values (at least some of them) in the \(\mathbb{D}\) matrix calculated by the R function dist().
```
sqrt(sum((bioData[1,2:18] - bioData[2,2:18])^2))
```
```
## [1] 20.55428
```
And compare it directly with the values in the \(D\) matrix:
```
as.matrix(Dmatrix)[1:2,1:2]
```
```
##          1        2
## 1  0.00000 20.55428
## 2 20.55428  0.00000
```

Multdimensional scaling with the Euclidian metric

Let the distance matrix \(\mathbb{D} = (d_{ij})_{i, j = 1}^{n}\) is defined via the Euclidean metric \[ d_{ij}^2 = \sum_{k = 1}^{p} (x_{i k} - x_{j k})^2, \] for some two \(p\)-dimensional points \(\boldsymbol{x}_i = (x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^p\) and \(\boldsymbol{x}_j = (x_{j1}, \dots, x_{jp})^\top \in \mathbb{R}^p\). Let us define the following quantities: \[ a_{ij} = - \frac{d_{ij}^2}{2}, \quad\quad a_{i \bullet} = \frac{1}{n}\sum_{j = 1}^{n} a_{ij}, \quad\quad a_{\bullet j} = \frac{1}{n}\sum_{i = 1}^{n} a_{ij}, \quad\quad a_{\bullet \bullet} = \frac{1}{n^2} \sum_{i = 1}^n \sum_{j = 1}^n a_{ij}. \] Using the notation above prove that the following expressions hold:

For matrix \(\mathbb{B} = (b_{ij})_{i,j = 1}^n\), where \(b_{ij} = a_{ij} - a_{i \bullet} -a_{\bullet j} + a_{\bullet \bullet}\), it holds that \(\mathbb{B} = \mathcal{X}\mathcal{X}^\top\), where \(\mathcal{X}\) is the original \(n \times p\) type data frame \(\mathcal{X} = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)^\top\), where \(\boldsymbol{x}_i = (x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^p\).

Show that \(b_{ii}= a_{\bullet \bullet} - 2a_{i \bullet}\), for \(i = 1, \dots, n\), and \(b_{ij} = a_{ij} - a_{i \bullet} -a_{\bullet j} + a_{\bullet \bullet}\), for \(i \neq j\).

Show that \(\mathbb{B} = -\frac{1}{2} \mathcal{H} \mathbb{D}^2 \mathcal{H}\), where \(\mathcal{H}\) is the centering matrix defined as \(\mathcal{H} = (\mathbb{I}_n - \frac{1}{n} \boldsymbol{1}_n^\top\boldsymbol{1})\).

Various measures of similarities and dissimilarities in R

There are various distance definitions which can be used to calculate mutual similarities/dissimilarities between the pairs of observations. Considering the R function dist() one can take an advantage of the following distances:

Euclidean distance - used by default;
Maximum distance - equivalent with a supremum (maximum) norm;
Manhattan distance - absolute distance between the two vectors (1 norm aka \(L_1\));
Canberra distance - a weighted version of the Manhattan distance;
Binary distance - a proportion of non-zero elements in \(\boldsymbol{x}−\boldsymbol{y}\);
Minkowski distance - classical \(L_p\)-norm (default choice is \(p=2\) which reduces to Euclidean distance);

Let us get back to the dataset with the bio metrics in different localities in the Czech republic. If we stick with the Euclidian metric (default option) for calculating distances (similarities respectively dissimilarities) between different localities (with respect to 17 available bio metrics) we already have the corresponding matrix stored in the R object Dmatrix.

A standard function in the R environment which performs a multidimensional scaling of the dataset is cmdscale(). It is available under the standard R installation.

MDS1 <- cmdscale(Dmatrix, k = 2)
dim(MDS1)

## [1] 65  2

What do the results above respresent? What is the corresponding interpretation?

We can use some scatterplot options to get some insight into the data however, only a very limited information can be given by the means of some 2D scatterplot figures. For instance, see the following plots:

par(mfrow = c(2,2))
cols <- rep("blue", dim(bioData)[1])
cols[which(bioData[,1] == "Olse")] <- "red"
cols[which(bioData[,1] == "DesnSudk")] <- "green"
cols[which(bioData[,1] == "DyjeVran")] <- "green"


plot(bioData[,2], bioData[,3], xlab="SaprInd", ylab="Lital",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,3], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,2], bioData[,11], xlab="SaprInd", ylab="Ntaxon",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,11], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,2], bioData[,17], xlab="SaprInd", ylab="MDS_1",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,17], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,17], bioData[,18], xlab="MDS_1", ylab="MDS_2",  type="n", main = "Czech River Localities")
text(bioData[,17], bioData[,18], labels = bioData[,1], cex=.5, col = cols)

Plotting the original data with using two dimensions only (but still preserving original distances as well as possible) can be done by the following command:

plot(MDS1[,1], MDS1[,2], xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS1[,1], MDS1[,2], labels = bioData[,1], cex=.7, col = cols)

Try by yourself

Use the same dataset on different localities in the Czech republic and calculate the similarity/dissimilarity matrix using some other distance option available in dist(). Plot the results and compare them.
Load in also the dataset with chemical characteristics on the same localities and try to perform a multidimensional scaling on the chemical dataset too (remember that one locality is missing in the chemical data set). Use the following command to load in the corresponding dataset with chemical measurements:
```
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
```

chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
ind <- match(chemData[,1], bioData[,1])
bioData0 <- bioData[ind, ]

MDS3 <- cmdscale(dist(bioData0[,2:18]), k = 2)
MDS4 <- cmdscale(dist(chemData[,2:8]), k = 2)

plot(c(MDS3[,1], MDS4[,1]), c(MDS3[,2], MDS4[,2]), xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS3[,1], MDS3[,2], labels = bioData[,1], cex=.7)
text(MDS4[,1], MDS4[,2], labels = bioData[,1], cex=.7, col = "red")

legend(50, -30, legend = c("Bio Metric Measurements", "Chemical Concentrations"), col = c("black", "red"), pch = 15)

Questions

Is the representation given in the figure above unique?
How much representative is the figure above?
What kind of information can be effectivelly extracted from this figure?

2. Classical MDS vs. Nonmetric MDS

The key difference between these two approaches is that the first one uses the matrix of similarities/dissimilarities being defined in a sence of classical distance while the second one only uses some matrix of ordered ranks of similarities/dissimilarities (an arbitrary monotone function of distances).

For nonmetric MDS one can use the R function isoMDS() which is available in the library ‘MASS’. The function performs the Kruskal’s Non-metric Multidimensional Scaling.

library(MASS)
MDS2 <- isoMDS(Dmatrix, k=2)

## initial  value 7.466727 
## iter   5 value 6.387632
## iter  10 value 6.118217
## iter  15 value 5.959384
## final  value 5.817192 
## converged

The results can be again plotted using an analogous set of commands:

plot(MDS2$points[,1], MDS2$points[,2], xlab="Coordinate 1", ylab="Coordinate 2", type="n", main = "Czech River Localities")
text(MDS2$points[,1], MDS2$points[,2], labels = bioData[,1], cex=.7)

Questions

Which are the suitable situations for using the nonmetric MDS?
In which situations would you prefer the classical MDS (based on a regular distance/metric) instead?

Beside the two function already mentioned above (cmdscale() and isoMDS()) there are many more options available in different R packages (see, for instance, smacofSym() in the R package ‘smacof’, wcmdscale() in package ‘vegan’ or pco() in package ‘ecodist’).

Homework Assignment

(voluntarily)

Use the command below to instal the SMSdata packages with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Chose one dataset of your choise from the list of all available datasets in the package:

data(package = "SMSdata")

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma)

Take some dataset according to your choice and apply the MDS appraoch for at least two different similarity measures which you find reasonable;
Discuss the obtained results and draw some main (headline) conclussion;