NMST539 | Lab Session 10

Cluster Analysis and Discriminant Analysis

(application in R)

LS 2017 | Tuesday 09/05/2017

Rmd file (UTF8 encoding)

The R software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

1. Cluster Analysis in R

Cluster analysis is a statistical method for grouping objects into disjoint sets according to their similarity or dissimilarity. The similarity/dissimilarity is usually measured by some proximity or distance measure (see the function dist() in R, which was also used for the multidimensional scaling technique). Cluster analysis is not one specific method; it is rather a whole family of tools and algorithms that can be used to solve the grouping problem.
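As a small illustration of how the choice of the distance measure matters, the following sketch computes distances among three cars from the built-in mtcars data (described later in this section) under three of the metrics supported by dist(). The selection of rows and columns and the object names are arbitrary and serve only for readability.

```r
# Illustrative sketch: the same objects under different distance measures.
X <- mtcars[1:3, c("mpg", "hp", "wt")]   # small subset, chosen only for readability

D_euc <- dist(X, method = "euclidean")   # square root of the sum of squared differences
D_man <- dist(X, method = "manhattan")   # sum of absolute differences
D_max <- dist(X, method = "maximum")     # largest absolute coordinate difference

# dist() stores only the lower triangle; as.matrix() gives the full symmetric matrix
round(as.matrix(D_euc), 2)
round(as.matrix(D_man), 2)
round(as.matrix(D_max), 2)
```

For any pair of objects the maximum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance, so the three matrices differ in scale and possibly in the ordering of the pairs.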

With respect to how the groups are constructed, we distinguish two types of clustering algorithms: partitioning algorithms, where the assignment of objects to groups can change during the algorithm, and hierarchical algorithms, where the assignment of an object, once made, is kept fixed for the rest of the algorithm. The result of a clustering algorithm is mainly affected by the choice of the distance/proximity matrix and by the clustering method used. Among others, the clustering approaches include the following:
  • Partitioning clustering
  • Hierarchical clustering
  • Centroid-based clustering
  • Distribution-based clustering
  • Density-based clustering
  • and many others…
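To make the difference between the first two approaches concrete, here is a minimal sketch that applies kmeans() (a partitioning, centroid-based method) and hclust() (a hierarchical method), both from base R, to the built-in mtcars data introduced below. The standardisation by scale(), the number of groups, and the seed are assumptions made for this illustration only.

```r
# Illustrative sketch: partitioning (k-means) versus hierarchical clustering.
X <- scale(mtcars)                          # standardisation is an assumption made here

set.seed(1234)                              # k-means starts from random centres
KM <- kmeans(X, centers = 3, nstart = 20)   # assignments may change between iterations
HC <- hclust(dist(X))                       # merges, once made, are never undone

# Cross-tabulate the two 3-group solutions
# (group labels are arbitrary, so only the pattern of agreement matters)
table(kmeans = KM$cluster, hclust = cutree(HC, k = 3))
```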

Let us consider the data on the fuel consumption of automobiles in the United States in the 1970s (the dataset ‘mtcars’ available in R, extracted from the 1974 Motor Trend US magazine):

rm(list = ls())   # start with a clean workspace
data <- mtcars    # built-in dataset
head(data)        # print the first six rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The data consist of 32 observations (32 different cars available on the US market at that time) and 11 variables (some of them continuous, some of them categorical). For additional information on this dataset, see the R help (type ?mtcars).

Hierarchical Clustering in R

Let us consider all variables from the dataset. We will start by calculating the distance matrix. In the following we use the default option, the Euclidean distance.

The mtcars dataset is small (\(n = 32\)), so all observations can be displayed in the dendrograms which we will construct later. For a larger dataset it may be better, for the sake of readability, to work with a random subsample only; in that case remember to set the seed (command set.seed()) in order to get the same subsample each time you run the code.

D <- dist(mtcars) ### Euclidean distance (the default)
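The variables in mtcars are measured on very different scales (for instance, disp takes values in the hundreds while wt stays below 6), so the raw Euclidean distance is driven mostly by the variables with the largest variance. The sketch below compares the raw distances with distances computed after standardisation by scale(). The standardised version is shown only for comparison; the rest of this section continues with the raw distance matrix D.

```r
# Illustrative sketch: effect of standardisation on the Euclidean distance.
D_raw    <- dist(mtcars)          # raw units: large-valued columns dominate
D_scaled <- dist(scale(mtcars))   # each column centred and scaled to unit variance

# Column variances show which variables drive the raw distances
round(sapply(mtcars, var), 2)

# Correlation between the two sets of pairwise distances
cor(as.vector(D_raw), as.vector(D_scaled))
```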

Now we can apply the function hclust() (available in the standard R installation) to run hierarchical clustering based on the distance matrix \(D\). See the R help for more details (type ?hclust).

HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")

A somewhat fancier dendrogram is offered by the package ‘sparcl’, which needs to be installed first (use install.packages('sparcl')). After loading the library, the corresponding command is ColorDendrogram() (see its help page for more details).

To cut the dendrogram into a prespecified number of groups and to mark these groups in the plot, we can use the following R code:

plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
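The vector returned by cutree() can be examined directly. The following sketch recreates the objects from the text so that it runs on its own, and then prints the group sizes, the group means of a few variables, and the members of one group. The choice of the variables mpg, hp and wt is arbitrary.

```r
# Illustrative sketch: inspecting the groups returned by cutree().
HC1    <- hclust(dist(mtcars))      # same objects as in the text, recreated here
groups <- cutree(HC1, k = 3)        # named integer vector: car name -> group label

table(groups)                       # group sizes

# Group-wise means of a few variables (the selection is arbitrary)
aggregate(mtcars[, c("mpg", "hp", "wt")], by = list(group = groups), FUN = mean)

# Cars assigned to the first group
names(groups)[groups == 1]
```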

Different agglomerative approaches are obtained with different settings of the ‘method’ parameter. The following options are available:

  • method = 'single'
  • method = 'complete'
  • method = 'average'
  • method = 'median'
  • method = 'centroid'
  • method = 'mcquitty'
  • method = 'ward.D'
  • method = 'ward.D2'

The default setting for the hclust() function is method = 'complete', which is also printed below each dendrogram. The appearance of the dendrogram can be improved further, for example by aligning all labels at the same level with the option hang = -1:

plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
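The effect of the ‘method’ parameter discussed above can be explored with a short loop. The sketch below uses an arbitrary subset of four linkage methods, reports the sizes of the three groups obtained under each of them, and computes the cophenetic correlation (the correlation between the original distances and the distances implied by the dendrogram) using cophenetic() from base R. The object names are illustrative.

```r
# Illustrative sketch: the same distance matrix under several linkage methods.
D <- dist(mtcars)
linkages <- c("single", "complete", "average", "ward.D")   # subset chosen for illustration

# Sizes of the groups in the 3-group solution, one column per method
sizes <- sapply(linkages, function(m) {
  as.vector(table(cutree(hclust(D, method = m), k = 3)))
})
sizes

# Cophenetic correlation: how well each dendrogram preserves the original distances
coph <- sapply(linkages, function(m) {
  cor(as.vector(D), as.vector(cophenetic(hclust(D, method = m))))
})
round(coph, 3)
```

A higher cophenetic correlation means that the tree distorts the original distances less; it is one possible criterion for comparing linkage methods, not a rule for choosing among them.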