NMST539  Lab Session 6
Multidimensional Scaling & Clustering
LS 2017  Tuesday 25/04/2017
Rmd file (UTF-8 encoding)

The R software is available for download from the website: https://www.r-project.org
A user-friendly interface (one of many): RStudio.
Manuals and introduction into R (in Czech or English):
Multidimensional Scaling (MDS)

Multidimensional scaling is another common statistical method for visualizing high-dimensional data. However, unlike principal component analysis (and, to some extent, factor analysis), multidimensional scaling focuses on visualizing similarities and dissimilarities between observations rather than the directions of major variability in the data. The main idea of multidimensional scaling is to identify meaningful underlying dimensions that allow us to detect existing similarities and dissimilarities between the observed data points.

There are, of course, many different ways to define similarities and dissimilarities. If we choose to measure similarities and dissimilarities between variables in the sense of a classical correlation matrix (between the available covariates), we obtain the classical factor analysis approach. On the other hand, if we choose to measure similarities/dissimilarities using the standard Euclidean distance, we end up with principal component analysis. Many other options are, however, possible.

The starting point of the MDS analysis (algorithm) is a so-called matrix of distances (or a similarity/dissimilarity matrix) between all pairs of observations. The distances are calculated with respect to the available covariates, and various definitions of distance can be applied. In some situations multidimensional scaling can also be performed on a similarity/dissimilarity matrix which is not based on a typical distance (see the non-metric MDS approach below).

In the statistical software R one can use the standard function ‘dist()’ to calculate similarities/dissimilarities (see the help page for further details); it is available in the standard R installation.
Such a matrix can then be used as the input of the MDS algorithm, which represents the observations in a (usually) lower-dimensional space in such a way that the original distances are preserved as well as possible. We will again start with the dataset which represents different river localities in the Czech Republic. The diversity of life is measured by a set of 17 bio metrics and we are interested in identifying similar localities in the dataset.
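The link between Euclidean distances and principal components mentioned above can be checked numerically. The following is only a minimal sketch of the classical (Torgerson) MDS computation done by hand; it assumes the built-in `USArrests` data as a stand-in for the river localities, and all object names (`X`, `D2`, `J`, `B`, `mds_xy`, `pca_xy`) are illustrative choices.

```r
# Toy data (an assumption): built-in USArrests, standardized column-wise
X <- scale(as.matrix(USArrests))
n <- nrow(X)

# Squared Euclidean distances between all pairs of observations
D2 <- as.matrix(dist(X))^2

# Double centering: B = -1/2 * J %*% D2 %*% J, where J = I - (1/n) * 11'
J <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * J %*% D2 %*% J

# Coordinates: two leading eigenvectors scaled by the square roots
# of the corresponding eigenvalues
e <- eigen(B, symmetric = TRUE)
mds_xy <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))

# Scores on the first two principal components of the same data
pca_xy <- prcomp(X)$x[, 1:2]

# Largest discrepancy once the arbitrary sign of each axis is ignored
max(abs(abs(mds_xy) - abs(pca_xy)))
```

The two configurations should coincide up to the sign of each axis, which is why the comparison is made on absolute values.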
Measures of Similarities and Dissimilarities in R

There are various distance definitions which can be used to calculate mutual similarities/dissimilarities between pairs of observations.
Considering the R function
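As a minimal sketch of how different distance definitions are selected in `dist()`, the example below uses a tiny toy matrix (an assumption for illustration, not the river dataset); the object names `X`, `d_euc`, `d_man` and `d_max` are likewise hypothetical.

```r
# Toy data (an assumption): three observations measured on two variables
X <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# The 'method' argument of dist() selects the distance definition
d_euc <- dist(X, method = "euclidean")  # square root of summed squared differences
d_man <- dist(X, method = "manhattan")  # sum of absolute differences
d_max <- dist(X, method = "maximum")    # largest absolute coordinate difference

# dist() stores only the lower triangle; as.matrix() expands it
# into the full symmetric matrix with a zero diagonal
as.matrix(d_euc)
```

Other values of `method` documented for `dist()` include "canberra", "binary" and "minkowski".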
Multidimensional Scaling

We again start with the dataset of bio metrics in different localities in the Czech Republic. If we stick with the Euclidean metric for calculating distances (similarities or dissimilarities) between different localities (with respect to the 17 available bio metrics), we already have the corresponding matrix stored in the R object

A standard function in the R environment which performs multidimensional scaling of the dataset is
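One base R function for classical (metric) MDS is `cmdscale()` from the stats package; that this is the function meant here is an assumption. The sketch below applies it to the built-in `USArrests` data, used as a stand-in because the river dataset is not available here; the names `X`, `D` and `fit` are illustrative.

```r
# Toy data (an assumption): built-in USArrests instead of the river localities
X <- scale(USArrests)
D <- dist(X)  # Euclidean distances between all pairs of observations

# Classical MDS into k = 2 dimensions; eig = TRUE additionally returns
# the eigenvalues and goodness-of-fit measures
fit <- cmdscale(D, k = 2, eig = TRUE)

head(fit$points)  # matrix of coordinates, one row per observation
fit$GOF           # two goodness-of-fit values for the k-dimensional solution
```

With `eig = FALSE` (the default), `cmdscale()` returns only the matrix of coordinates.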
What does the result represent? Plotting the original data using only two dimensions (while still preserving the original distances as well as possible) can be done by the following command:
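A minimal sketch of such a plot, again assuming the built-in `USArrests` data and `cmdscale()` in place of the river dataset and the original command; the names `X`, `D`, `xy` and `r` are illustrative.

```r
# Toy data (an assumption): built-in USArrests instead of the river localities
X <- scale(USArrests)
D <- dist(X)
xy <- cmdscale(D, k = 2)  # one row of 2-D coordinates per observation

# Empty plot frame first, then observation labels at the MDS coordinates
plot(xy, type = "n", xlab = "MDS dimension 1", ylab = "MDS dimension 2")
text(xy, labels = rownames(X), cex = 0.7)

# Rough check of how well the original distances are preserved:
# correlation between original and two-dimensional distances
r <- cor(as.vector(D), as.vector(dist(xy)))
r
```

Each row of `xy` is one observation placed in the plane, so observations that are close in the plot are similar with respect to the original variables, to the extent that two dimensions can reproduce the original distances.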
To Do
