
NMST539 | Lab Session 10

Cluster Analysis & Applications in R

LS 2017/2018 | Monday 30/04/18

Rmd file (UTF8 coding)

The R-software is available for download from the official website: https://www.r-project.org

A user-friendly interface (one of many): RStudio

Some useful manuals and tutorials for R (in Czech or English):

  • Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

1. Cluster Analysis in R

The cluster analysis is a statistical method designed for distinguishing various objects, respectively grouping them into some disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is usually measured with respect to some proximity measure or some distance (a metric, respectively) – see the R function dist() and the corresponding help session. However, the cluster analysis is not just one specific method/approach but rather a whole set of various tools and algorithms, which can be used to solve a general grouping problem (assigning objects into some disjoint groups).
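
As a quick illustration of how the choice of the metric changes the proximity matrix, a minimal sketch on a small toy matrix (the method names are the ones accepted by dist()):

### three observations in two dimensions
toy <- matrix(c(0, 0,
                3, 4,
                6, 8), ncol = 2, byrow = TRUE)

dist(toy)                         ### Euclidean distance (the default)
dist(toy, method = "manhattan")   ### sum of the absolute coordinate differences
dist(toy, method = "maximum")     ### largest coordinate-wise difference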

From the theoretical point of view we distinguish between two types of clustering algorithms: partitioning algorithms, where the assignments of objects into given groups can change along the algorithmic path, and hierarchical algorithms, where the assignment of an object is kept fixed once it has been performed.

The result of the clustering algorithm is mainly affected by the choice of the distance/proximity matrix and the clustering method used for the object assignment (choice of some specific clustering algorithm). There are many different clustering approaches available in statistical software and many modifications of existing algorithms are still being proposed. The most commonly known are the following:

  • Partitioning clustering
  • Hierarchical clustering
  • Centroid-based clustering
  • Distribution-based clustering
  • Density-based clustering
  • and many others…

Let us consider the data on the fuel consumption of automobiles on the US market in the 1970s (the dataset ‘mtcars’ available in R):

rm(list = ls())
data <- mtcars
attach(data)
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1



The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them continuous, some categorical). For additional information on this dataset use the help session in R (type ?mtcars).



Note


Clustering techniques are largely exploratory tools and the clustering approaches can also be considered as dimensionality reduction techniques. Thus, it is common to choose the number of clusters in a way that makes the most sense. By defining clusters we try to derive a reduced representation of the original data – a representation via clusters.

On the other hand, clustering should not be confused with classification.
  • Classification works, in general, with data for which the groups are known and we try to learn what differentiates these groups in order to assign future observations.
  • Clustering starts with no prior information about the underlying groups – the groups are unknown and we try to identify them by considering mutual similarities/differences among the given observations.



Hierarchical Clustering in R

Hierarchical agglomerative clustering methods generate a classification (clusters) in a bottom-up manner, by a series of agglomerations in which small clusters, initially containing individual observations, are fused together to form progressively larger clusters. Agglomerative techniques are computationally simple because the distances between clusters can be easily obtained from the distance matrix \(\mathbb{D}\) computed from the original observations.

Hierarchical divisive clustering methods generate a classification in a top-down manner, by progressively sub-dividing the single cluster which represents the entire dataset. Hierarchical methods tend to be very demanding of computational resources since a complete hierarchy of partitions has to be built up rather than just a single partition.
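
A divisive counterpart is available, for instance, in the function diana() from the recommended package ‘cluster’; a minimal sketch (assuming the ‘cluster’ package is installed):

library(cluster)
HCdiv <- diana(mtcars, metric = "euclidean")   ### DIvisive ANAlysis clustering
plot(HCdiv, which.plots = 2,                   ### the second plot is the dendrogram
     main = "Divisive hierarchical clustering (DIANA)")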

One general problem with these methods is how to choose which clusters or partitions to extract from the hierarchy, since displaying the full hierarchy is not really practical for datasets with more than a few hundred observations.

Let us consider all covariates from the dataset. We start by calculating the distance matrix. In the following we use the default option – the Euclidean distance.

D <- dist(mtcars) ### the Euclidean distance by default

Now, we can apply the function hclust() (available in the standard R installation) to run a hierarchical (agglomerative) clustering approach based on the proximity matrix \(\mathbb{D}\). See the help session in R for more details (type ?hclust).

HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")

The default clustering method in hclust() is the complete linkage approach, which uses the furthest neighbour to create groups where all points are relatively close to each other. Different agglomerative approaches are obtained by different settings of the ‘method’ parameter; a short comparison sketch follows the list below. The following options are available:

  • method = "single" (nearest neighbour approach)
  • method = "complete" (furthest neighbour approach)
  • method = "average" (compromise between the single and complete linkage)
  • method = "median" (robust version of the average method)
  • method = "centroid" (approach based on the geometric distance between cluster centroids)
  • method = "mcquitty" (weighted version of the average method)
  • method = "ward.D" (joins groups that do not increase too much a given measure of heterogeneity)
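
To see how the linkage choice affects the resulting hierarchy, a short comparison sketch (the same distance matrix \(\mathbb{D}\), two different linkage methods):

par(mfrow = c(1, 2))
plot(hclust(D, method = "single"),  main = "Single linkage",  xlab = "", sub = "")
plot(hclust(D, method = "average"), main = "Average linkage", xlab = "", sub = "")
par(mfrow = c(1, 1))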



A somewhat fancier version available in R is the package ‘sparcl’, which needs to be installed first (use install.packages('sparcl') for the installation). After loading the library the corresponding command is ColorDendrogram() (see the help session for more details).
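
A minimal sketch of the ‘sparcl’ alternative (assuming the package is installed; the leaf colors are taken from a cutree() assignment with three groups and the branchlength value is just a tuning choice for this particular dendrogram):

library(sparcl)
groups3 <- cutree(HC1, k = 3)                  ### three groups from the hierarchy
ColorDendrogram(HC1, y = groups3,
                main = "Colored dendrogram (sparcl)",
                branchlength = 50)             ### length of the colored branch tips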

In order to mark a prespecified number of groups with respect to the proximity measure we can use the following R code:

plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

One can also improve the overall impression of the final dendrogram by considering an aligned version with some additional parameter settings.

plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

and even more sophisticated alternatives, e.g.:

cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")

# function to get color labels
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

The same idea can also be used to define the colors of the branches with respect to some additional covariate (for instance, the covariate gear).

local({
  colLab <<- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      ## color the branch leading to each leaf by the gear value of that car
      attr(n, "edgePar") <- c(a$edgePar, list(col = mycols[a$label]))
    }
    n
  }
  mycols <- mtcars$gear                  ### gear takes values 3, 4, 5 (valid color codes)
  names(mycols) <- rownames(mtcars)      ### index the colors by car name, not by leaf order
})

x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)




To Do


  • Consider the standard clustering function in R (function hclust()) and use various options for calculating the distance matrix \(\mathbb{D}\).
  • Consider different types of distances for the function dist() and apply the agglomerative clustering approach while selecting different methods for the cluster calculations (various settings of the ‘method =’ parameter when calling the hclust() function).
  • Use some dendrogram plots and compare the results obtained from different distances and different clustering approaches.
  • How many clusters seem to be a reasonable choice and what would be the natural interpretation of them?



Non-hierarchical Clustering in R

The most common approach for non-hierarchical clustering is the so-called K-means algorithm. The idea behind the algorithm is to partition the observations into some prespecified number of groups (\(K\) groups – thus the name of the method) such that the sum of squares from the points to the assigned cluster centres is minimized.
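
Formally, writing \(k_i \in \{1, \dots, K\}\) for the cluster assignment of observation \(\boldsymbol{x}_i\) and \(\mu_k\) for the cluster centres (the notation used in the next paragraph), the algorithm minimizes the within-cluster sum of squares \[ \sum_{k = 1}^{K} \; \sum_{i:\, k_i = k} \|\boldsymbol{x}_i - \mu_k\|^2 \] over both the assignments and the cluster centres.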

Suppose there are \(K\) possible mean values and each observed value \(\boldsymbol{x}_i\), for \(i = 1,\dots, n\), is associated with one of them. Theoretically, we can use the formulation \[ E[\boldsymbol{x}_i | k_i] = \mu_{k_{i}}, \] where \(k_{i} \in \{1, \dots, K\}\). In other words, if \(k_i = 1\), then the corresponding observation \(\boldsymbol{x}_i\) belongs to cluster 1. Each mean parameter \(\mu_{k}\), for \(k = 1, \dots, K\), has an associated probability or "weight" in the overall "mixture". For a new observation with unknown cluster assignment we can write \[ E[\boldsymbol{x}] = P(k = 1) \mu_1 + \dots + P(k = K) \mu_{K}. \]

In the R software the function which performs the non-hierarchical K-means clustering is kmeans(), available in the standard R installation (see the help session for more details). The goal of the algorithm is to find \(K\) clusters (the number of clusters is specified in advance) such that the within-cluster variability is small. However, we do not know the cluster centroids (K means) \(\mu_k\).

This resembles the classical chicken-and-egg problem: if we knew the right cluster assignment \(k_i\) for some observation \(\boldsymbol{x}_i\), we could easily estimate the corresponding group mean \(\mu_{k}\). On the other hand, if we knew the corresponding cluster means (centroids) \(\mu_k\), for \(k = 1, \dots, K\), we could easily determine the corresponding assignment \(k_i\). The solution is to iterate between choosing the assignments \(k_i\) and estimating the group means \(\mu_k\).
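
A minimal sketch of this alternating scheme (for illustration only – this is not the implementation used by kmeans(), and the helper name naiveKmeans is hypothetical):

naiveKmeans <- function(X, K, iter = 20) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), K), , drop = FALSE]        ### random initial centroids
  for (it in 1:iter) {
    ### assignment step: each observation goes to its nearest centroid
    d2 <- sapply(1:K, function(k) colSums((t(X) - centers[k, ])^2))
    assignment <- max.col(-d2)
    ### update step: each centroid becomes the mean of its cluster
    ### (empty clusters are not handled here)
    for (k in 1:K) centers[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers)
}

naiveKmeans(scale(mtcars), K = 3)$cluster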



Kmeans <- kmeans(mtcars, centers = 3)
colVector <- as.numeric(Kmeans$cluster)

plot(mtcars$mpg ~ mtcars$hp, bg = colVector, xlab = "Horse Power", ylab = "Miles per Gallon", pch = 21, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,4], col = 1:3, pch = 8, cex = 2)
text((mtcars$mpg + 1) ~ mtcars$hp,labels=rownames(mtcars), col=colVector, cex = 0.5)



plot(mtcars$mpg ~ mtcars$disp, bg = colVector, xlab = "Displacement", ylab = "Miles per Gallon", pch = 19 + cyl/2, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,3], col = "black", pch = 21, cex = 2, bg = 1:3)
text((mtcars$mpg + 1) ~ mtcars$disp,labels=rownames(mtcars), col=colVector, cex = 0.5)
legend(420, 34, legend = c("4 Cylinders", "6 Cylinders", "8 Cylinders"), pch = c(21, 22, 23), bg = "grey")



Note


  • The scale of the data matters: if you replace some covariate \(X_{j}\) with \(2 X_{j}\), the corresponding dimension counts twice as much in determining the distance from the centre (and will have more influence on the cluster membership) – see the small check below.
  • The recommended approach is to work with standardized data. The cluster centres \(\mu_{k j}\), for \(k = 1, \dots, K\) and covariates \(j = 1, \dots, p\), are then interpreted in terms of standard deviations from the marginal averages.
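
A small check of the scaling effect (doubling a single covariate, here disp, changes the pairwise distances in a non-uniform way and can therefore change the clusters):

D1 <- dist(mtcars)
mtcars2 <- mtcars
mtcars2$disp <- 2 * mtcars2$disp              ### double one covariate
D2 <- dist(mtcars2)
range(as.vector(D2) / as.vector(D1))          ### the distances do not change proportionally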



Do by yourself


  • Consider the example above, however, standardize the data first. Run the non-hierarchical K-means algorithm to obtain three clusters and compare the results with the output presented above.



Another example relates to the protein consumption in Europe. The corresponding data can be loaded (and properly scaled) by the following commands:

protein <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/protein.csv", header = T, row.names = 1)
sprotein <- scale(protein)        ### standardize all covariates
subset <- sprotein[, 1:2]         ### red and white meat consumption only
plot(subset, pch = 19, col = 4, cex = 0.2)
text(subset, labels = rownames(protein))

We are only interested in the overall consumption of red and white meat in European countries and we want to find countries which are similar with respect to the meat consumption. Let us consider three clusters.

grpMeat <- kmeans(subset, centers = 3, nstart = 10)

plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)

In all the exercises above we chose the number of clusters – the value \(K\) – a priori. However, if no prior knowledge about \(K\) is available in advance, how should the number of clusters be chosen? There are some recommendations one can use:

  • Most often the clustering methods are exploratory tools, therefore it is recommended to choose \(K\) such that it makes the most sense with respect to the final interpretation.
  • Another approach is to apply a data-based model building technique: build models for a sequence \(K_1 < K_2 < \dots < K_M\) and use some selection tool to choose the best model.
  • We can use some information criterion (BIC, AIC) or a cross-validation technique to evaluate the within sum of squares in the given clusters (see the short sketch after this list). The corresponding degrees of freedom are \(K \times p\), where \(K\) is the number of clusters and \(p \in \mathbb{N}\) is the number of dimensions (thus, one mean parameter for each cluster and each dimension).
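
A common informal check is to plot the total within sum of squares against a sequence of values of \(K\) and to look for an "elbow"; a minimal sketch using the meat-consumption subset defined above:

Kmax <- 8
wss <- sapply(1:Kmax, function(K) kmeans(subset, centers = K, nstart = 10)$tot.withinss)
plot(1:Kmax, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within sum of squares")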



### between sum of squares computed by hand (compare with grpMeat$betweenss below)
mean_overall <- apply(subset, 2, mean)          ### overall (marginal) means
means_between <- (t(means) - mean_overall)^2    ### squared distances of the cluster centres
sum(t(means_between) * grpMeat$size)            ### weighted by the cluster sizes
## [1] 35.96535

The within sum of squares can be visualized by connecting each observation with its assigned cluster centre:

grpMeat <- kmeans(subset, centers = 3, nstart = 10)

plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)

for(i in (1:nrow(protein))){  
    if (grpMeat$cluster[i]==1){
        lines(c(subset[i,1],means[1,1]), c(subset[i,2],means[1,2]),col=2)
    }
    if (grpMeat$cluster[i]==2){
        lines(c(subset[i,1],means[2,1]), c(subset[i,2],means[2,2]),col=3)
    }
    if (grpMeat$cluster[i]==3){
        lines(c(subset[i,1],means[3,1]), c(subset[i,2],means[3,2]),col=4)
    }}

And the between sum of squares can be visualized by connecting the cluster centres with the overall mean:

plot(subset,pch=19,col=grpMeat$cluster+1,cex=0.2)

means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)
points(mean_overall[1],mean_overall[2],pch=4,lwd=3)

lines(c(mean_overall[1],means[3,1]), c(mean_overall[2],means[3,2]), col=4)
lines(c(mean_overall[1],means[2,1]), c(mean_overall[2],means[2,2]), col=3)
lines(c(mean_overall[1],means[1,1]), c(mean_overall[2],means[1,2]), col=2)

text(c(10,12,9),c(10,8,6),grpMeat$size)

The sums of squares are (similarly to the linear regression setting) decomposed into three major parts: the within sum of squares, which can be obtained as

grpMeat$tot.withinss
## [1] 12.03465

the between sum of squares, given as

grpMeat$betweenss
## [1] 35.96535

and, finally, the total sum of squares, which is defined as the sum of the previous two sums of squares:

grpMeat$tot.withinss+grpMeat$betweenss
## [1] 48



Do by yourself


  • Use different starting values for the K-means algorithm and compare the results you obtain.
  • Some other methodologies and approaches are available in different R packages and various R functions.



2. Mixture models

The fundamental model of clustering is a mixture model: observations are randomly drawn from \(K\) different populations. Each population has different average characteristics. For the car dataset we can consider, for instance, the distribution of the consumption (miles per gallon) given the number of cylinders.

The estimated mixture distribution is the following:

mean1 <- mean(mpg[cyl == 4])
mean2 <- mean(mpg[cyl == 6])
mean3 <- mean(mpg[cyl == 8])

prop <- table(cyl)/dim(mtcars)[1]

### density of a normal mixture with unit-variance components
mixture <- function(x, prop, means){
  mixValue <- 0
  for (i in 1:length(prop)){
    mixValue <- mixValue + prop[i] * dnorm(x, means[i])   ### weight * N(mean, 1) density
  }
  return(mixValue)
}

xSeq <- seq(10, 34, length = 100)
density <- mixture(xSeq, prop, c(mean1, mean2, mean3))
plot(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")

The mixture is composed of three components (normal distributions with unit variance and the corresponding mean parameters).

plot(density ~ xSeq, col = "red", type = "l", lwd = 3, xlab = "Miles per Gallon", ylab = "Density")
d1 <- density(rnorm(10000, mean1, 1))
d2 <- density(rnorm(10000, mean2, 1))
d3 <- density(rnorm(10000, mean3, 1))
lines(d1$x, prop[1] * d1$y, col = "black", lwd = 2)
lines(d2$x, prop[2] * d2$y, col = "violet", lwd = 2)
lines(d3$x, prop[3] * d3$y, col = "gray", lwd = 2)



The joint density (mixture distribution) has multiple modes – one for each cluster mean \(\mu_k\), for \(k = 1, 2, 3\). However, for two centroids which are close enough, it is possible that only one mode will be present in the final mixture.
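
A quick illustration of the last remark (a toy one-dimensional mixture with equal weights; with means one unit apart the density has a single mode, with means four units apart it has two):

xGrid <- seq(-4, 8, length = 200)
mixClose <- 0.5 * dnorm(xGrid, mean = 0) + 0.5 * dnorm(xGrid, mean = 1)   ### close centroids
mixFar   <- 0.5 * dnorm(xGrid, mean = 0) + 0.5 * dnorm(xGrid, mean = 4)   ### well separated
plot(xGrid, mixFar, type = "l", col = "red", xlab = "x", ylab = "Density")
lines(xGrid, mixClose, col = "blue")          ### unimodal despite two components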



Do by yourself


  • How can the mixture model be used for clustering? How would you define the decision algorithm if you need to assign some new observation into one of the three clusters?
  • Try to improve the mixture distribution, for instance, by considering heterogeneous variances. Compare the results under homoscedasticity and heteroscedasticity.
  • The mixture can be further improved by considering distributions other than the normal one. On the other hand, the normal distribution is quite convenient from the theoretical point of view.
  • Consider a simulated example with at least 4 clusters in the mixture distribution and try to find centroid positions such that there are fewer modes in the final mixture than the total number of mean parameters.









Homework Assignment

(Deadline: lab session 14.05.2018)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset of your choice from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 
  • Choose one dataset and apply the cluster analysis;
  • Discuss the optimal choice of the number of clusters and try to interpret the results;