NMST539 | Lab Session 6

Multidimensional Scaling & Clustering

LS 2017 | Tuesday 25/04/2017

Rmd file (UTF8 encoding)

The R software is available for download from https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is another common statistical method for visualizing high-dimensional data. Unlike principal component analysis (and, to some extent, factor analysis), it focuses on visualizing similarities and dissimilarities between observations rather than the directions of greatest variability in the data.

The main idea of multidimensional scaling is to identify meaningful underlying dimensions that allow us to detect similarities and dissimilarities between the observed data points. There are, of course, many ways to define similarity and dissimilarity. If we measure them between variables by means of a classical correlation matrix (between the available covariates), the approach is closely related to classical factor analysis. If, on the other hand, we measure them between observations using the standard Euclidean distance, classical MDS yields the same configuration as principal component analysis (up to the sign of each axis). Many other options are possible.
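The link between Euclidean distances and principal components can be checked numerically. The following is a minimal sketch, assuming the built-in USArrests data as a stand-in (it is not the lab dataset) and using prcomp() on centred, unscaled data:

```r
# Stand-in data (an assumption for illustration; not the lab dataset).
X <- as.matrix(USArrests)

# Classical MDS on Euclidean distances, keeping two coordinates.
mds <- cmdscale(dist(X), k = 2)

# Principal component scores from the centred, unscaled data.
pca <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]

# The two configurations should agree up to the sign of each axis,
# so the absolute values are compared.
max_diff <- max(abs(abs(mds) - abs(pca)))
print(max_diff)
```

The difference should be at the level of numerical rounding error. With any other distance in dist(), the two configurations will generally differ.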

The starting point of the MDS algorithm is the so-called distance matrix (or, more generally, a similarity/dissimilarity matrix) between all pairs of observations. The distances are calculated from the available covariates, and various definitions of distance can be applied.

In some situations, multidimensional scaling can also be performed on a similarity/dissimilarity matrix which is not based on a proper distance (see the nonmetric MDS approach below).

In R, the standard function dist() calculates the dissimilarities (see its help page for further details); it is available in the standard R installation. The resulting matrix can then be passed to the MDS algorithm, which represents the observations in a (usually) lower-dimensional space in such a way that the original distances are preserved as well as possible.
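To see what dist() returns, it may help to try it on a tiny example first. The sketch below uses a hypothetical toy matrix X with four observations and two variables (the values and names are made up for illustration):

```r
# Hypothetical toy data: 4 observations (rows) on 2 variables (columns).
X <- matrix(c(0, 0,
              3, 4,
              6, 8,
              0, 5), ncol = 2, byrow = TRUE,
            dimnames = list(paste0("obs", 1:4), c("v1", "v2")))

D <- dist(X)            # object of class "dist": stores the lower triangle only
Dfull <- as.matrix(D)   # full symmetric 4 x 4 matrix with a zero diagonal

print(D)
print(dim(Dfull))
```

The "dist" object stores only the n(n-1)/2 pairwise distances; as.matrix() expands it into the full symmetric matrix, which is often easier to inspect.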

We again start with the dataset which represents different river localities in the Czech Republic. The diversity of life is measured by a set of 17 bio metrics, and we are interested in identifying similar localities in the dataset.

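rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = TRUE)
# Euclidean distances between localities, based on the 17 bio metrics (columns 2 to 18)
Dmatrix <- dist(bioData[,2:18])
dim(as.matrix(Dmatrix))
## [1] 65 65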

  • What is the right interpretation of the values in the \(D\) matrix?
  • By default, the R function dist() uses the Euclidean distance to calculate dissimilarities between observations. How exactly are these values calculated?
  • Try to reconstruct the values (at least some of them) in the \(D\) matrix calculated by the R function dist().

    sqrt(sum((bioData[1,2:18] - bioData[2,2:18])^2))
    ## [1] 20.55428
    as.matrix(Dmatrix)[1:2, 1:2]
    ##          1        2
    ## 1  0.00000 20.55428
    ## 2 20.55428  0.00000

Measures of Similarities and Dissimilarities in R

There are various definitions of distance which can be used to calculate mutual dissimilarities between pairs of observations. The R function dist() offers the following:

  • Euclidean distance - used by default;
  • Maximum distance - equivalent with a supremum (maximum) norm;
  • Manhattan distance - the sum of absolute coordinate differences between the two vectors (the 1-norm, also known as \(L_1\));
  • Canberra distance - a weighted version of the Manhattan distance;
  • Binary distance - the vectors are treated as binary (non-zero elements are "on", zero elements are "off") and the distance is the proportion of positions in which exactly one of them is on among the positions in which at least one is on;
  • Minkowski distance - the classical \(L_{p}\)-norm (the default is \(p = 2\), which reduces to the Euclidean distance).

Consider these distances for calculating the similarity/dissimilarity matrix \(D\) and try to recover the values in this matrix manually.
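As a starting point for the manual calculation, the sketch below recomputes several of these distances for two hypothetical vectors x and y (made-up values) and compares them with dist(). The Canberra distance is left out here because dist() treats terms with a zero numerator and denominator as missing, which needs extra care when the vectors share zeros.

```r
# Two hypothetical vectors used only for illustration.
x <- c(1, 0, 3, 5)
y <- c(4, 0, 1, 0)
xy <- rbind(x, y)

# Distances computed directly from their definitions.
manual <- c(
  euclidean = sqrt(sum((x - y)^2)),
  maximum   = max(abs(x - y)),
  manhattan = sum(abs(x - y)),
  minkowski = sum(abs(x - y)^3)^(1/3),                       # p = 3
  binary    = sum(xor(x != 0, y != 0)) / sum(x != 0 | y != 0)
)

# The same distances as returned by dist().
builtin <- c(
  euclidean = as.numeric(dist(xy, method = "euclidean")),
  maximum   = as.numeric(dist(xy, method = "maximum")),
  manhattan = as.numeric(dist(xy, method = "manhattan")),
  minkowski = as.numeric(dist(xy, method = "minkowski", p = 3)),
  binary    = as.numeric(dist(xy, method = "binary"))
)

print(rbind(manual, builtin))
```

The two rows of the printed table should coincide. The same pattern can be applied to pairs of rows of bioData[,2:18].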

Multidimensional Scaling

We again start with the dataset of bio metrics from different localities in the Czech Republic. If we stick with the Euclidean metric for calculating distances (dissimilarities) between the localities (with respect to the 17 available bio metrics), the corresponding matrix is already stored in the R object Dmatrix.

A standard function in the R environment which performs a multidimensional scaling of the dataset is cmdscale(). It is available under the standard R installation.

MDS1 <- cmdscale(Dmatrix, k = 2)
dim(MDS1)
## [1] 65  2

What does the result represent?
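One way to approach this question is to reproduce the classical MDS solution by hand. The following is a minimal sketch of the usual construction (double centring of the squared distances followed by an eigendecomposition), again assuming the built-in USArrests data as a stand-in for the lab dataset:

```r
# Stand-in data (an assumption for illustration; not the lab dataset).
D <- as.matrix(dist(USArrests))
n <- nrow(D)

# Double-centre the squared distances: B = -1/2 * J D^2 J, with J = I - 11'/n.
J <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * J %*% D^2 %*% J

# The leading eigenvectors, scaled by the square roots of their eigenvalues,
# give the coordinates of the observations in k dimensions.
e <- eigen(B, symmetric = TRUE)
k <- 2
Y <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

# Compare with cmdscale(); the columns may differ in sign.
ref <- cmdscale(dist(USArrests), k = k)
max_diff <- max(abs(abs(Y) - abs(ref)))
print(max_diff)
```

Each row of the result holds the coordinates of one observation in the new k-dimensional space; the axes themselves have no direct interpretation in terms of the original variables.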

The original data can be plotted in two dimensions only (while still preserving the original distances as well as possible) with the following commands:

plot(MDS1[,1], MDS1[,2], xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS1[,1], MDS1[,2], labels = bioData[,1], cex=.7)
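How well the two-dimensional configuration preserves the original distances can be quantified. A minimal sketch, assuming the built-in USArrests data as a stand-in: with eig = TRUE, cmdscale() returns a list which also contains the goodness-of-fit component GOF, and the original distances can be compared directly with those between the fitted points.

```r
# Stand-in data (an assumption for illustration; not the lab dataset).
d <- dist(USArrests)

# With eig = TRUE the result is a list (points, eig, GOF, ...) rather than a matrix.
fit <- cmdscale(d, k = 2, eig = TRUE)
print(fit$GOF)

# Correlation between the original distances and the distances
# between the points in the two-dimensional configuration.
r <- cor(as.numeric(d), as.numeric(dist(fit$points)))
print(r)
```

Values of GOF and r close to 1 suggest that little is lost by reducing to two coordinates. The same call with Dmatrix in place of d would give the corresponding figures for the river localities.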

To Do

  • Use the same dataset on different localities in the Czech Republic and calculate the similarity/dissimilarity matrix using some other distance option available in dist(). Plot the results and compare them.
  • Also load the dataset with chemical characteristics of the same localities and perform a multidimensional scaling on the chemical dataset too (remember that one locality is missing in the chemical dataset). Use the following command to load the dataset with chemical measurements: chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = TRUE)

chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = TRUE)
# Keep only the localities that also appear in the chemical data, in the same order
ind <- match(chemData[,1], bioData[,1])
bioData0 <- bioData[ind, ]

MDS3 <- cmdscale(dist(bioData0[,2:18]), k = 2)
MDS4 <- cmdscale(dist(chemData[,2:8]), k = 2)

plot(c(MDS3[,1], MDS4[,1]), c(MDS3[,2], MDS4[,2]), xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
# Label the points with the matched localities (bioData0 and chemData, not the full bioData)
text(MDS3[,1], MDS3[,2], labels = bioData0[,1], cex=.7)
text(MDS4[,1], MDS4[,2], labels = chemData[,1], cex=.7, col = "red")

legend(50, -30, legend = c("Bio Metric Measurements", "Chemical Concentrations"), col = c("black", "red"), pch = 15)