NMST539, Summer Semester 2015/16

Lab Session 8 (Week 9)

Multidimensional Scaling & Clustering


Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is another common statistical method for visualizing high-dimensional data. However, unlike principal component analysis (and, to some extent, factor analysis), multidimensional scaling focuses on visualizing similarities and dissimilarities between observations rather than the directions of major variability in the data.

The main idea of multidimensional scaling is to identify meaningful underlying dimensions that allow us to detect existing similarities and dissimilarities between the observed data points. There are, of course, many different ways to define similarities and dissimilarities. If we measure similarities between variables by means of a classical correlation matrix (between the available covariates), we obtain an approach close to classical factor analysis. On the other hand, if we measure dissimilarities between observations using the standard Euclidean distance, classical MDS yields the same configuration as principal component analysis (up to the sign of each axis). Many other options are, however, possible.
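The link between classical MDS on Euclidean distances and principal component analysis can be checked numerically. The following is a minimal sketch on simulated toy data (the matrix X is an assumption made for illustration, not the river dataset); it uses cmdscale(), the classical MDS function from base R, and prcomp() for the principal components.

```r
# Toy data (an assumption for illustration): 10 observations, 4 variables
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Classical MDS on Euclidean distances, keeping two coordinates
mds <- cmdscale(dist(X), k = 2)

# Scores on the first two principal components (centering only, no scaling)
pca <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]

# The two configurations agree up to the sign of each axis,
# so the absolute values are compared
maxDiff <- max(abs(abs(mds) - abs(pca)))
maxDiff
```

The reported difference should be numerically zero; the equivalence holds only for the Euclidean distance and does not carry over to the other distance options.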

The starting point of the MDS analysis (algorithm) is a so-called distance matrix (or, more generally, a similarity/dissimilarity matrix) between all pairs of observations. The distances are calculated with respect to the available covariates, and various definitions of distance can be applied.

In some situations, multidimensional scaling can also be performed on a similarity/dissimilarity matrix which is not based on a proper distance (see the nonmetric MDS approach below).

In the statistical software R one can use the standard function ‘dist()’ to calculate dissimilarities (see the help page for further details); it is available in the standard R installation. The resulting matrix can then be passed to the MDS algorithm, which represents the observations in a (usually) lower-dimensional space in such a way that the original distances are preserved as well as possible.

We will again start with the dataset which represents different river localities in the Czech Republic. The biological diversity is measured by a set of 17 different bio metrics and we are interested in identifying similar localities in the dataset.

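rm(list = ls())
# Column 1 holds the locality label, columns 2-18 the 17 bio metrics
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = TRUE)
# Euclidean distances (the default) between all pairs of localities
Dmatrix <- dist(bioData[,2:18])
dim(as.matrix(Dmatrix))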
## [1] 65 65



  • What is the right interpretation of the values in the \(D\) matrix?
  • By default, the R function dist() uses the Euclidean distance to calculate dissimilarities between observations. How exactly are these values calculated?
  • Try to reconstruct the values (at least some of them) in the \(D\) matrix calculated by the R function dist().

    sqrt(sum((bioData[1,2:18] - bioData[2,2:18])^2))
    ## [1] 20.55428
    as.matrix(Dmatrix)[1:2,1:2]
    ##          1        2
    ## 1  0.00000 20.55428
    ## 2 20.55428  0.00000
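The pairwise check above can be extended to the whole matrix at once. A minimal sketch, assuming a small simulated matrix X in place of the river data: the squared Euclidean distances follow from the inner products via \(d_{ij}^2 = \|\boldsymbol{x}_i\|^2 + \|\boldsymbol{x}_j\|^2 - 2\boldsymbol{x}_i^\top\boldsymbol{x}_j\).

```r
# Toy data (an assumption for illustration): 5 observations, 3 variables
set.seed(2)
X <- matrix(rnorm(15), nrow = 5, ncol = 3)

G  <- X %*% t(X)                     # matrix of inner products between rows
sq <- diag(G)                        # squared norms of the rows
D2 <- outer(sq, sq, "+") - 2 * G     # squared Euclidean distances
Dmanual <- sqrt(pmax(D2, 0))         # pmax() guards against tiny negative rounding errors

# Largest deviation from the matrix returned by dist()
maxDiff <- max(abs(Dmanual - as.matrix(dist(X))))
maxDiff
```

The deviation should be negligible (of the order of rounding errors).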



Measures of Similarities and Dissimilarities in R


There are various distance definitions which can be used to calculate mutual dissimilarities between pairs of observations. With the R function dist() one can choose from the following distances:

  • Euclidean distance - used by default;
  • Maximum distance - the supremum (maximum) norm, i.e. the largest absolute coordinate-wise difference;
  • Manhattan distance - the sum of the absolute coordinate-wise differences (the \(L_{1}\)-norm);
  • Canberra distance - a weighted version of the Manhattan distance, \(\sum_{i} |x_i - y_i| / (|x_i| + |y_i|)\);
  • Binary distance - non-zero elements are regarded as "on" and zero elements as "off"; the distance is the proportion of positions in which exactly one of \(\boldsymbol{x}\) and \(\boldsymbol{y}\) is on, among the positions in which at least one is on;
  • Minkowski distance - classical \(L_{p}\)-norm (default choice is \(p = 2\) which reduces to Euclidean distance).

Consider these distances for calculating the dissimilarity matrix \(D\) and try to recover the values of this matrix manually.
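A minimal sketch of such a manual check, on two made-up vectors x and y (assumptions chosen so that no position is zero in both vectors, which keeps the Canberra distance free of omitted terms). Each distance is computed by hand and compared with the value returned by dist().

```r
# Two made-up observations (an assumption for illustration)
x <- c(1, 0, 3, 5)
y <- c(2, 4, 0, 1)
xy <- rbind(x, y)

manual <- c(
  euclidean = sqrt(sum((x - y)^2)),
  maximum   = max(abs(x - y)),
  manhattan = sum(abs(x - y)),
  canberra  = sum(abs(x - y) / (abs(x) + abs(y))),
  # exactly one of the two is non-zero, among positions where at least one is non-zero
  binary    = sum(xor(x != 0, y != 0)) / sum(x != 0 | y != 0),
  minkowski = sum(abs(x - y)^3)^(1/3)            # p = 3
)

# The same distances from dist(); p is only used by the Minkowski method
fromDist <- sapply(names(manual),
                   function(m) as.numeric(dist(xy, method = m, p = 3)))

cbind(manual, fromDist)
```

The two columns should coincide for every method.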

Multidimensional Scaling

We again start with the dataset of bio metrics from different localities in the Czech Republic. If we stick with the Euclidean metric for calculating distances (dissimilarities) between the localities (with respect to the 17 available bio metrics), we already have the corresponding matrix stored in the R object Dmatrix.

The classical R function which performs multidimensional scaling is cmdscale(). It is available in the standard R installation.

MDS1 <- cmdscale(Dmatrix, k = 2)
dim(MDS1)
## [1] 65  2

What does the result represent?
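One way to approach this question is to reproduce the output of cmdscale() by hand. The sketch below assumes simulated toy data (not the river dataset) and follows the usual classical scaling recipe: double-center the squared distances and take the leading eigenvectors, scaled by the square roots of the eigenvalues.

```r
# Toy data (an assumption for illustration): 12 observations, 5 variables
set.seed(3)
X <- matrix(rnorm(60), nrow = 12, ncol = 5)
D <- as.matrix(dist(X))
n <- nrow(D)

J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B <- -0.5 * J %*% D^2 %*% J          # double-centered squared distances

e <- eigen(B, symmetric = TRUE)
k <- 2
# Coordinates: leading eigenvectors scaled by the square roots of the eigenvalues
coords <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

# Agreement with cmdscale() up to the sign of each axis
maxDiff <- max(abs(abs(coords) - abs(cmdscale(dist(X), k = k))))
maxDiff
```

Each row of the result is therefore a locality's position in a k-dimensional space whose Euclidean distances approximate the original ones.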

The original data can be plotted using two dimensions only (while still preserving the original distances as well as possible) with the following commands:

plot(MDS1[,1], MDS1[,2], xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS1[,1], MDS1[,2], labels = bioData[,1], cex=.7)



To Do


  • Use the same dataset on different localities in the Czech Republic and calculate the dissimilarity matrix using some other distance option available in dist(). Plot the results and compare them.
  • Also load the dataset with chemical characteristics of the same localities and try to perform multidimensional scaling on the chemical dataset too (remember that one locality is missing in the chemical dataset). Use the following command to load the corresponding dataset with chemical measurements: chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = TRUE)

chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = TRUE)
# Keep only the localities present in the chemical data, in the same order
ind <- match(chemData[,1], bioData[,1])
bioData0 <- bioData[ind, ]

MDS3 <- cmdscale(dist(bioData0[,2:18]), k = 2)
MDS4 <- cmdscale(dist(chemData[,2:8]), k = 2)

plot(c(MDS3[,1], MDS4[,1]), c(MDS3[,2], MDS4[,2]), xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
# Labels must come from the matched data so that they line up with the rows of MDS3 and MDS4
text(MDS3[,1], MDS3[,2], labels = bioData0[,1], cex=.7)
text(MDS4[,1], MDS4[,2], labels = chemData[,1], cex=.7, col = "red")

legend(50, -30, legend = c("Bio Metric Measurements", "Chemical Concentrations"), col = c("black", "red"), pch = 15)

Question


  • How representative is the figure, and how much relevant information can be extracted from it?
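One possible way to quantify how well a two-dimensional configuration represents the original distances is the goodness-of-fit value returned by cmdscale() when called with eig = TRUE. The sketch below uses the built-in eurodist dataset (road distances between European cities) as a stand-in, which is an assumption made so that the example runs without the course data.

```r
# Built-in distance object, used here only as a stand-in for Dmatrix
fit <- cmdscale(eurodist, k = 2, eig = TRUE)

# Goodness of fit: share of the eigenvalues captured by the first two axes
# (two variants, differing in how negative eigenvalues are treated)
fit$GOF

# The second variant recomputed by hand from the eigenvalues
ev <- fit$eig
share <- sum(ev[1:2]) / sum(pmax(ev, 0))
share

# Correlation between the original and the fitted pairwise distances
fitCor <- cor(as.vector(eurodist), as.vector(dist(fit$points)))
fitCor
```

Values close to one suggest that little information is lost in two dimensions. Note that such a measure describes each configuration separately; it does not make two configurations computed from differently scaled variables directly comparable in one plot.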



Classical MDS vs. Nonmetric MDS

The key difference between these two approaches is that the first one uses a matrix of dissimilarities defined in the sense of a classical distance, while the second one only uses the rank order of the dissimilarities (and is therefore unaffected by an arbitrary monotone transformation of the distances).

For nonmetric MDS one can use the R function isoMDS() which is available in the library ‘MASS’. The function performs Kruskal’s non-metric multidimensional scaling.

library(MASS)
MDS2 <- isoMDS(Dmatrix, k=2)
## initial  value 7.466727 
## iter   5 value 6.387632
## iter  10 value 6.118217
## iter  15 value 5.959384
## final  value 5.817192 
## converged

The results can again be plotted using an analogous set of commands:

plot(MDS2$points[,1], MDS2$points[,2], xlab="Coordinate 1", ylab="Coordinate 2", type="n", main = "Czech River Localities")
text(MDS2$points[,1], MDS2$points[,2], labels = bioData[,1], cex=.7)
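The quality of a nonmetric fit is commonly inspected with the stress value and a Shepard diagram, which plots the distances in the fitted configuration against the original dissimilarities. A minimal sketch, assuming simulated toy data instead of the river dataset and using Shepard() from the ‘MASS’ library:

```r
library(MASS)

# Toy data (an assumption for illustration): 20 observations, 5 variables
set.seed(4)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)
D <- dist(X)

fit <- isoMDS(D, k = 2, trace = FALSE)
fit$stress                            # final stress (in percent)

# Shepard diagram: original dissimilarities vs. distances in the configuration
sh <- Shepard(D, fit$points)
plot(sh, pch = ".", xlab = "Dissimilarity", ylab = "Distance in the configuration")
lines(sh$x, sh$yf, type = "S")        # fitted monotone (isotonic) regression
```

The closer the points lie to the monotone step function, the better the rank order of the dissimilarities is preserved.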



Question


  • In which situations is nonmetric MDS suitable, and in which situations should classical MDS (based on a regular distance/metric) be used?

Another useful graphical tool (especially for a small number of observations) is a graph in two dimensions constructed using a multidimensional scaling approach. It is available in the R library ‘igraph’ (use install.packages("igraph") for installation).

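library(igraph)
# make_full_graph() and layout_with_mds() are the current names of graph.full() and layout.mds()
graphEdges <- make_full_graph(nrow(bioData))
V(graphEdges)$label <- bioData[,1]
mdsLayout <- layout_with_mds(graphEdges, dist = as.matrix(Dmatrix))
plot(graphEdges, layout = mdsLayout, vertex.size = 3)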

Besides the two functions already mentioned above (cmdscale() and isoMDS()), there are many more available in different R packages (see e.g. smacofSym() in the package ‘smacof’, wcmdscale() in the package ‘vegan’, or pco() in the package ‘ecodist’, among many others).




Homework (optional)

On the website of Doc. Hlávka

http://www1.karlin.mff.cuni.cz/~hlavka/teac.html

there are several datasets available, which can be loaded into R simply by using the command load("nazov_suboru.rda"), where nazov_suboru.rda stands for the name of the file.

  • Choose one dataset and apply the MDS method to it.
  • Use at least two different distances to calculate the distance matrix and compare the results obtained in two dimensions.