NMST539 | Lab Session 10

Cluster Analysis and Discriminant Analysis

(application in R)

LS 2017 | Tuesday 09/05/2017

Rmd file (UTF8 encoding)

The R-software is available for download from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (PDF file)
  • Komárek, A.: Základy práce s R. (PDF file)
  • Kulich, M.: Velmi stručný úvod do R. (PDF file)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)

1. Cluster Analysis in R

Cluster analysis is a statistical method designed for distinguishing various objects and grouping them into disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is usually measured with respect to some proximity or distance measure (see the function dist() in the R environment, used similarly as in the multidimensional scaling technique). However, cluster analysis is not just one specific method/approach but rather a whole set of various tools and algorithms which can be used to solve the grouping problem.

Considering the construction process we distinguish between two types of clustering algorithms: partitioning algorithms, where the assignment of objects to the groups can change during the run of the algorithm, and hierarchical algorithms, where an assignment, once made, is kept fixed. The result of a clustering algorithm is mainly affected by the choice of the distance/proximity measure and by the clustering method used in the algorithm. Among others, the clustering approaches include the following (a small toy illustration of the role of the distance measure follows the list):
  • Partitioning clustering
  • Hierarchical clustering
  • Centroid-based clustering
  • Distribution-based clustering
  • Density-based clustering
  • and many others…
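To fix ideas, the choice of the distance measure in dist() already changes the notion of similarity. A minimal sketch on a made-up toy matrix with two points:

X <- matrix(c(0, 0, 3, 4), ncol = 2, byrow = TRUE)  ### two points in the plane
dist(X, method = "euclidean")   ### sqrt(3^2 + 4^2) = 5
dist(X, method = "manhattan")   ### |3| + |4| = 7
dist(X, method = "maximum")     ### max(3, 4) = 4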

Let us consider the data on fuel consumption of automobiles on the US market (the dataset ‘mtcars’ available in R, extracted from the 1974 Motor Trend US magazine):

rm(list = ls())
data <- mtcars
attach(data)
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1



The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them continuous, some of them categorical). For additional information on this dataset use the help session in R (type ?mtcars).



Hierarchical Clustering in R

Let us consider all covariates from the dataset. We will start by calculating the distance matrix. In the following we use the default option, the Euclidean distance.

In order to improve the readability of the dendrograms (which we will construct later) one typically does not plot a very large dataset in full but only a smaller random subsample (remember to set the seed with the command set.seed() in order to get the same subsample each time you run the code). With only \(n = 32\) cars, the mtcars data are small enough to be used in full.
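For completeness, a minimal sketch of drawing such a reproducible random subsample (the seed 539 and the subsample size 20 are arbitrary choices):

set.seed(539)                                   ### fix the seed for reproducibility
carsSub <- mtcars[sample(nrow(mtcars), 20), ]   ### random subsample of 20 cars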

D <- dist(mtcars) ### Euclidean distance (the default option)

Now we can apply the function hclust() (available in the standard R installation) to run a hierarchical clustering approach based on the proximity matrix \(D\). See the help session in R for more details (type ?hclust).

HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")

A little more fancy version is available in the R package ‘sparcl’, which needs to be installed first (use install.packages('sparcl') for the installation). After loading the library the corresponding command is ColorDendrogram() (see the help session for more details).

In order to mark a prespecified number of groups with respect to the proximity measure we can use the following R code:

plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

Different agglomerative approaches are obtained with different settings of the ‘method’ parameter (a short comparison sketch follows the list). The following options are available:

  • method = "single"
  • method = "complete"
  • method = "average"
  • method = "median"
  • method = "centroid"
  • method = "mcquitty"
  • method = "ward.D"
The default setting for the hclust() function is method = "complete", which is also displayed on each dendrogram figure. One can of course improve the overall impression by considering more fancy versions of dendrograms.

plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

and even more sophisticated alternatives, e.g.:

cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")

# function to get color labels
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

The same idea can also be used to define the colors of the branches with respect to some additional covariate (for instance, the covariate gear).

local({
  ### leaf-coloring function: each leaf edge is colored according to the
  ### value of an additional covariate for the corresponding observation
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      i <<- i + 1
      attr(n, "edgePar") <-
        c(a$nodePar, list(col = mycols[a$label], lab.font = i %% 3))
    }
    n
  }
  mycols <- mtcars$gear               ### gear values (3, 4, 5) used as color codes
  names(mycols) <- rownames(mtcars)   ### match colors to leaves by car name
  i <- 0
})

x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)




To Do


  • Consider different types of distances for the function dist() and apply the agglomerative clustering approach while selecting different methods for the cluster calculations (various settings of the ‘method =’ parameter when calling the hclust() function); a starter sketch follows this list.
  • Use some dendrogram pictures and compare the results obtained from different distances and different clustering approaches.
  • How many clusters seem to be a reasonable choice and what would be the natural interpretation of them?
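A possible starting point for the first task (the Manhattan distance with average linkage is just one of many combinations to try):

D2 <- dist(mtcars, method = "manhattan")   ### a different distance measure
HCavg <- hclust(D2, method = "average")    ### a different linkage method
plot(HCavg, hang = -1, xlab = "Observations", ylab = "Proximity measure")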



Now we apply the same approach to a much bigger dataset (for this purpose we consider the data from the geographical questionnaire in the Czech Republic and Slovakia), distinguishing three clusters and using colors to distinguish the three grades (grade1, grade2 and grade3). The idea is to see whether the clusters somehow correspond with the three available grades or rather not…

dataGeo <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/geoData2017.csv", header = T, sep = " ", dec = ".")
n <- dim(dataGeo)[1]
gColor <- rep("lightblue", n)
gColor[dataGeo[,2] == "grade2"] <- "blue"
gColor[dataGeo[,2] == "grade3"] <- "darkred"

local({
  ### the same leaf-coloring function as above, now with grade-based colors
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      i <<- i + 1
      attr(n, "edgePar") <-
        c(a$nodePar, list(col = mycols[a$label], lab.font = i %% 3))
    }
    n
  }
  mycols <- gColor                     ### grade-based colors defined above
  names(mycols) <- rownames(dataGeo)   ### match colors to leaves by row name
  i <- 0
})

Dall <- dist(dataGeo[, 6:27])
HC2 <- hclust(Dall)

x.clust.dend2 <- dendrapply(as.dendrogram(HC2), colLab)
plot(x.clust.dend2)

groups2 <- cutree(HC2, k=3)
rect.hclust(HC2, k=3, border="red")
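To check the visual impression more formally, one can cross-tabulate the cluster assignments against the grades. A minimal sketch, assuming (as in the code above) that the grade indicator is stored in the second column of dataGeo:

table(groups2, dataGeo[, 2])   ### rows: clusters, columns: grades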

More sophisticated dendrograms can be obtained with some additional libraries (e.g. the library ‘ape’, which can be installed by install.packages("ape"), or the library ‘sparcl’, which can be installed by running install.packages("sparcl")).

library(sparcl)
ColorDendrogram(HC1, y = cutHC1, labels = names(cutHC1), main = "Mtcars dataset", 
    branchlength = 80)
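Similarly, a hedged sketch with the ‘ape’ package, which converts the hclust tree into a phylogenetic tree object and offers alternative layouts (the fan layout and the reuse of the cluster labels cutHC1 for the coloring are just one possible choice):

library(ape)
plot(as.phylo(HC1), type = "fan", tip.color = labelColors[cutHC1])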



Non-hierarchical Clustering in R

The most common approach for non-hierarchical clustering is the so-called K-means algorithm. The idea behind the algorithm is to partition the observations into some prespecified number of groups (\(k\) groups, thus the name of the method) such that the sum of squared distances from the points to the assigned cluster centres is minimized.

In the R software the function which performs the non-hierarchical K-means clustering is kmeans(), available in the standard R installation (see the help session for more details).

Kmeans <- kmeans(mtcars, centers = 3)
colVector <- as.numeric(Kmeans$cluster)

plot(mtcars$mpg ~ mtcars$hp, bg = colVector, xlab = "Horse Power", ylab = "Miles per Gallon", pch = 21, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,4], col = 1:3, pch = 8, cex = 2)
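The number of clusters has to be prespecified in kmeans(). A minimal sketch of the common "elbow" heuristic, plotting the total within-cluster sum of squares against k (the scaling, the range 1:8, and nstart = 25 are arbitrary but reasonable choices here, since the covariates are measured on very different scales and the algorithm starts from random centers):

set.seed(539)   ### K-means uses random initial centers
wss <- sapply(1:8, function(k)
    kmeans(scale(mtcars), centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")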

Comment


  • Some other methodologies and approaches are available in different R packages and various R functions.



2. Introduction to Discriminant Analysis in R

Discriminant analysis is a (not purely multivariate) statistical method designed for allocating various objects into similar (homogeneous) groups. The idea is to use the available information (given covariates) in order to create a classification rule which can later be used to assign a new object to one of the existing groups.

There are many different approaches to defining a discrimination rule. The most classical approaches include the linear discriminant function (implemented in R in the function lda() available in the package ‘MASS’) and the quadratic discriminant function (function qda() in the package ‘MASS’).

Use the help sessions in R for both functions to find out more about their implementation.

The following is a nice R Markdown document prepared by G. Martos (hosted on RStudio Pubs):

http://rstudio-pubs-static.s3.amazonaws.com/35817_2552e05f1d4e4db8ba87b334101a43da.html

For instance, we can use some continuous covariates to try to classify the available cars by the number of cylinders.

library(MASS)
lda1 <- lda(cyl ~ mpg + hp + qsec + drat + disp, data = mtcars)

pred <- predict(lda1)
ldahist(data = pred$x[, 1], g = mtcars$cyl)
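The classification performance can be inspected via a confusion table, and a quadratic rule is obtained analogously with qda(). A minimal sketch (for qda() fewer covariates are used so that each cylinder group contains more observations than variables):

table(mtcars$cyl, predict(lda1)$class)   ### true vs. predicted classes (LDA)

qda1 <- qda(cyl ~ mpg + hp + qsec, data = mtcars)
table(mtcars$cyl, predict(qda1)$class)   ### true vs. predicted classes (QDA)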




To Do


  • Use the R Markdown document available on the given website and apply some linear and quadratic discriminant functions to some dataset.
  • Provide some graphical outputs and discuss obtained results.
  • Use some additional datapoints (feel free to make up some) and try to use the discrimination rules to assign the new points to the existing groups.









Homework Assignment

(Deadline: sixth lab session / 11.04.2017)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset of your choice from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 

  • Choose one dataset and apply the cluster analysis;
  • Discuss the optimal choice of the number of clusters and interpret the results;