The R software is available for download from the official website: https://www.r-project.org
A user-friendly interface (one of many): RStudio
Some useful manuals and tutorials for R (in Czech or English):
Cluster analysis is a statistical method designed for distinguishing various objects, i.e. grouping them into disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is usually measured with respect to some proximity measure or some distance (metric, respectively) – see the R function dist()
and the corresponding help session. However, cluster analysis is not just one specific method/approach; it is rather a whole set of various tools and algorithms which can be used to solve a general grouping problem (assigning variables into some disjoint groups).
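For a small illustration of the dist() function itself (a sketch with a toy matrix, not part of the original example): it computes all pairwise distances between the rows of a data matrix and its 'method' argument selects the metric.
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)   ### three points in the plane
dist(x)                          ### Euclidean distance (the default)
dist(x, method = "manhattan")    ### Manhattan (city-block) distance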
From the theoretical point of view we distinguish between two types of clustering algorithms: partitioning algorithms, where the assignments of objects into the groups can change along the algorithmic path, and hierarchical algorithms, where the assignment of any object is kept fixed once it has been performed.
The result of the clustering algorithm is mainly affected by the choice of the distance/proximity matrix and the clustering method used for the object assignment (the choice of some specific clustering algorithm). There are many different clustering approaches available in various statistical software packages and many modifications of existing algorithms are still being proposed. The most commonly known are the hierarchical (agglomerative and divisive) algorithms and the partitioning algorithms such as the K-means algorithm.
Let us consider the data on the fuel consumption of automobiles on the US market in the 1970s (the dataset ‘mtcars’ available in R):
rm(list = ls())      ### clear the workspace
data <- mtcars
attach(data)         ### make the covariates directly accessible by their names
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them are continuous, some of them are categorical). For additional information on this dataset use the help session in R (type ?mtcars).
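The structure of the covariates can be quickly checked, for instance, as follows (note that all columns of mtcars are stored as numeric, even the ones that are in fact categorical):
str(data)
summary(data[, c("cyl", "vs", "am", "gear", "carb")])   ### the discrete/categorical covariates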
Clustering techniques are largely exploratory tools and the clustering approaches can also be considered as dimensionality reduction techniques. Thus, it is common to choose the number of clusters in a way that makes the most sense. By defining clusters we try to derive a reduced representation of the original data – a representation via clusters.
On the other hand, clustering should not be confused with classification. Hierarchical agglomerative clustering methods generate a classification (clusters) in a bottom-up manner, by a series of agglomerations in which small clusters, initially containing individual observations, are fused together to form progressively larger clusters. Agglomerative techniques are computationally simple because the distances between clusters can be easily calculated from the distance matrix \(\mathbb{D}\) computed from the original observations.
Hierarchical divisive clustering methods generate a classification in a top-down manner, by progressively sub-dividing the single cluster which represents an entire dataset. Hierarchical methods tend to be very demanding of computational resources since a complete hierarchy of partitions has to be built-up rather than just a single partition.
One problem with these methods, in general, is how to choose which clusters or partitions to extract from the hierarchy, since a display of the full hierarchy is not really appropriate for datasets with more than a few hundred observations.
Let us consider all covariates from the dataset. We will start by calculating the distance matrix. In the following we use the default option – the Euclidean distance.
D <- dist(mtcars) ### the Euclidean distance is used by default
Now, we can apply the function hclust()
(available under the standard R installation) to run a hierarchical (agglomerative) clustering approach based on the proximity matrix \(\mathbb{D}\). See the help session in R for more details (type ?hclust).
HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
The default clustering method in hclust()
is the complete linkage approach, which uses the furthest neighbour to create groups where all points are relatively close to each other. Different agglomerative approaches are obtained with different settings of the ‘method’ parameter; the available options are ‘ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’, and ‘centroid’ (a comparison of a few of them is sketched below).
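For illustration (a small sketch only), a few of these linkage methods can be compared on the same distance matrix \(\mathbb{D}\):
HC_single  <- hclust(D, method = "single")     ### nearest-neighbour (single) linkage
HC_average <- hclust(D, method = "average")    ### average linkage (UPGMA)
HC_ward    <- hclust(D, method = "ward.D2")    ### Ward's minimum variance method
par(mfrow = c(1, 3))
plot(HC_single,  main = "Single linkage",  xlab = "Observations", sub = "")
plot(HC_average, main = "Average linkage", xlab = "Observations", sub = "")
plot(HC_ward,    main = "Ward's method",   xlab = "Observations", sub = "")
par(mfrow = c(1, 1))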
A little more fancy version available in R is the package called ‘sparcl’, which needs to be installed first (use install.packages('sparcl')
for the installation). After loading the library the corresponding command is ColorDendrogram()
(see the help session for more details).
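A minimal usage sketch (assuming the package installs and loads correctly; the value of the branchlength parameter below is only a guess and may need to be adjusted to the scale of the mtcars distances):
library(sparcl)
groups3 <- cutree(HC1, k = 3)     ### three clusters used for the coloring
ColorDendrogram(HC1, y = groups3, branchlength = 80, main = "Dendrogram with colored leaves")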
In order to specify groups (prespecified number) with respect to the proximity measure we can use the following R code:
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
One can also improve the overall impression of the final dendrogram by considering an aligned version with some additional parameter settings.
plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
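The cluster assignment stored in the vector groups can also be inspected directly, for instance:
table(groups)                   ### sizes of the three clusters
names(groups[groups == 1])      ### cars assigned to the first cluster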
Even more sophisticated alternatives are also possible, e.g.:
cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")
# function to get color labels
colLab <- function(n) {
if (is.leaf(n)) {
a <- attributes(n)
labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
}
n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")
The same idea can also be used to define the colors of the branches with respect to some additional covariate (for instance, the covariate gear).
local({
  colLab <<- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      ## color the edge of the leaf according to the gear value of the corresponding car
      attr(n, "edgePar") <- c(a$edgePar, list(col = mycols[a$label]))
    }
    n
  }
  mycols <- mtcars$gear                ### 3, 4 or 5 forward gears used as color codes
  names(mycols) <- rownames(mtcars)    ### so that the leaves can be matched by their labels
})
x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)
Use the hierarchical clustering function hclust()
and try various options for calculating the distance matrix \(\mathbb{D}\) with the function dist().
Apply the agglomerative clustering approach while selecting different methods for the cluster calculations (various settings of the ‘method =’ parameter when calling the hclust()
function).
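One possible combination (a sketch only; any other metric/linkage pair can be tried in the same way):
D_manhattan  <- dist(mtcars, method = "manhattan")        ### a different distance matrix
HC_manhattan <- hclust(D_manhattan, method = "average")   ### a different linkage method
plot(HC_manhattan, xlab = "Observations", ylab = "Proximity measure", hang = -1)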
The most common approach for non-hierarchical clustering is the so-called K-means algorithm. The idea behind the algorithm is to partition the observations into some prespecified number of groups (\(K\) groups – hence the name of the method) such that the sum of squares from the points to the assigned cluster centres is minimized.
Suppose there are \(K\) possible means for the observed values \(\boldsymbol{x}_i\), for \(i = 1,\dots, n\). Theoretically, we can use the formulation \[ E[\boldsymbol{x}_i | k_i] = \mu_{k_{i}}, \] where \(k_{i} \in \{1, \dots, K\}\). In other words, if \(k_i = 1\), then the corresponding observation \(\boldsymbol{x}_i\) belongs to cluster 1. Each mean parameter \(\mu_{k}\), for \(k = 1, \dots, K\), has an associated probability or ‘weight’ in the overall ‘mixture’. For some new observation with an unknown cluster assignment we can write \[ E[\boldsymbol{x}] = P(k = 1) \mu_1 + \dots + P(k = K) \mu_{K}. \]
In the R software the function which performs the non-hierarchical K-means clustering is kmeans(),
available under the standard R installation (see the help session for more details). The goal of the algorithm is to find \(K\) clusters (the number of clusters is given in advance) so that the within-cluster variability is small. However, we do not know the cluster centroids (K-means) \(\mu_k\).
This resembles a classical chicken-and-egg problem: if we knew the right cluster assignment \(k_i\) for some observation \(\boldsymbol{x}_i\), we could easily estimate the corresponding group mean \(\mu_{k}\). On the other hand, if we knew the corresponding cluster means (centroids) \(\mu_k\), for \(k = 1, \dots, K\), we could easily calculate the corresponding assignment \(k_i\). The solution to this problem is to iterate between choosing the assignment \(k_i\) and estimating the group mean \(\mu_k\).
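The iteration itself can be sketched by hand (a minimal illustration only: the function simpleKmeans() below is not a standard R function and it does not handle empty clusters; in practice the built-in kmeans() shown next is used):
simpleKmeans <- function(X, K, iter = 20) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), K), , drop = FALSE]          ### random initial centroids
  for (step in 1:iter) {
    ## step 1: assign every observation to its closest centroid
    d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    assignment <- apply(d, 1, which.min)
    ## step 2: re-estimate every centroid as the mean of its cluster
    for (k in 1:K) centers[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers)
}
simpleKmeans(mtcars[, c("mpg", "hp")], K = 3)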
Kmeans <- kmeans(mtcars, centers = 3)       ### K-means clustering with K = 3 clusters
colVector <- as.numeric(Kmeans$cluster)     ### cluster assignments used as plotting colors
plot(mtcars$mpg ~ mtcars$hp, bg = colVector, xlab = "Horse Power", ylab = "Miles per Gallon", pch = 21, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,4], col = 1:3, pch = 8, cex = 2)
text((mtcars$mpg + 1) ~ mtcars$hp,labels=rownames(mtcars), col=colVector, cex = 0.5)
plot(mtcars$mpg ~ mtcars$disp, bg = colVector, xlab = "Displacement", ylab = "Miles per Gallon", pch = 19 + cyl/2, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,3], col = "black", pch = 21, cex = 2, bg = 1:3)
text((mtcars$mpg + 1) ~ mtcars$disp,labels=rownames(mtcars), col=colVector, cex = 0.5)
legend(420, 34, legend = c("4 Cylinders", "6 Cylinders", "8 Cylinders"), pch = c(21, 22, 23), bg = "grey")
Another example relates to the protein consumption in Europe. The corresponding data can be loaded (and properly scaled) using the following commands:
protein <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/protein.csv", header = T, row.names=1)
sprotein <- scale(protein)
subset<-sprotein[,(1:2)]
plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein))
We are only interested in the overall consumption of red and white meat in European countries and we want to find countries which are similar with respect to the meat consumption. Let us consider three clusters.
grpMeat <- kmeans(subset, centers = 3, nstart = 10)
plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)
In all the exercises above we chose the number of clusters – the value \(K\) – a priori. However, if no prior knowledge about \(K\) is available in advance, how should the number of clusters be chosen? There are some recommendations to use; one common possibility is to inspect the decomposition of the total sum of squares (an ‘elbow’ type heuristic is sketched further below). The between sum of squares can be computed manually as follows:
mean_overall <- apply(subset, 2, mean)        ### overall mean of the (scaled) data
means_between <- (t(means) - mean_overall)^2  ### squared deviations of the centroids from the overall mean
sum(t(means_between) * grpMeat$size)          ### weighted by the cluster sizes: the between sum of squares
## [1] 35.96535
For the within sum of squares we obtain:
grpMeat <- kmeans(subset, centers = 3, nstart = 10)
plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)
for(i in (1:nrow(protein))){
if (grpMeat$cluster[i]==1){
lines(c(subset[i,1],means[1,1]), c(subset[i,2],means[1,2]),col=2)
}
if (grpMeat$cluster[i]==2){
lines(c(subset[i,1],means[2,1]), c(subset[i,2],means[2,2]),col=3)
}
if (grpMeat$cluster[i]==3){
lines(c(subset[i,1],means[3,1]), c(subset[i,2],means[3,2]),col=4)
}}
And for the between sum of squares we have:
plot(subset,pch=19,col=grpMeat$cluster+1,cex=0.2)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)
points(mean_overall[1],mean_overall[2],pch=4,lwd=3)
lines(c(mean_overall[1],means[3,1]), c(mean_overall[2],means[3,2]), col=4)
lines(c(mean_overall[1],means[2,1]), c(mean_overall[2],means[2,2]), col=3)
lines(c(mean_overall[1],means[1,1]), c(mean_overall[2],means[1,2]), col=2)
text(c(10,12,9),c(10,8,6),grpMeat$size)
The sums of squares are (similarly as in the linear regression approach) decomposed into the following parts: the within sum of squares, which can be obtained as
grpMeat$tot.withinss
## [1] 12.03465
the between sum of squares, given as
grpMeat$betweenss
## [1] 35.96535
and, finally, the total sum of squares, which is defined as the sum of the previous two sums of squares:
grpMeat$tot.withinss+grpMeat$betweenss
## [1] 48
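The total within sum of squares is also a common tool for choosing the number of clusters: one can compute the K-means solution for several values of \(K\) and look for an ‘elbow’ in the resulting curve (a sketch only, using the scaled meat-consumption data from above):
withinSS <- sapply(1:10, function(K) kmeans(subset, centers = K, nstart = 10)$tot.withinss)
plot(1:10, withinSS, type = "b", xlab = "Number of clusters K", ylab = "Total within sum of squares")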
The fundamental model of clustering is a mixture model: the observations are randomly drawn from \(K\) different populations. Each population has different average characteristics. For the dataset with the cars we can consider, for instance, the distribution of the fuel consumption (miles per gallon) given the number of cylinders.
The estimated mixture distribution is the following:
mean1 <- mean(mpg[cyl == 4])        ### average miles per gallon for the 4-cylinder cars
mean2 <- mean(mpg[cyl == 6])        ### average miles per gallon for the 6-cylinder cars
mean3 <- mean(mpg[cyl == 8])        ### average miles per gallon for the 8-cylinder cars
prop <- table(cyl)/dim(mtcars)[1]   ### mixture weights: proportions of the three groups
mixture <- function(x, prop, means){
mixValue <- 0
for (i in 1:length(prop)){
mixValue <- mixValue + prop[i] * dnorm(x, means[i])
}
return(mixValue)
}
xSeq <- seq(10, 34, length = 100)
density <- mixture(xSeq, prop, c(mean1, mean2, mean3))
plot(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")
The mixture is composed of three components (normal distributions with unit variance and the corresponding mean parameters).
plot(density ~ xSeq, col = "red", type = "l", lwd = 3, xlab = "Miles per Gallon", ylab = "Density")
d1 <- density(rnorm(10000, mean1, 1))
d2 <- density(rnorm(10000, mean2, 1))
d3 <- density(rnorm(10000, mean3, 1))
lines(d1$x, prop[1] * d1$y, col = "black", lwd = 2)
lines(d2$x, prop[2] * d2$y, col = "violet", lwd = 2)
lines(d3$x, prop[3] * d3$y, col = "gray", lwd = 2)
The joint density (mixture distribution) has multiple modes – one for each cluster mean \(\mu_k\), for \(k = 1, 2, 3\). However, for two centroids which are close enough, it is possible that only one mode will be present in the final mixture (see the small illustration below).
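A small artificial illustration of the last remark (equal weights and unit variances are assumed):
xSeq2 <- seq(-4, 10, length = 200)
closeMeans   <- 0.5 * dnorm(xSeq2, 0) + 0.5 * dnorm(xSeq2, 1)   ### means 0 and 1: only one mode
distantMeans <- 0.5 * dnorm(xSeq2, 0) + 0.5 * dnorm(xSeq2, 6)   ### means 0 and 6: two modes
par(mfrow = c(1, 2))
plot(xSeq2, closeMeans,   type = "l", xlab = "x", ylab = "Density", main = "Means 0 and 1")
plot(xSeq2, distantMeans, type = "l", xlab = "x", ylab = "Density", main = "Means 0 and 6")
par(mfrow = c(1, 1))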
Use the command below to install the SMSdata
package with various multivariate datasets:
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
Choose one dataset from the list of all available datasets in the package:
data(package = "SMSdata")
There are 21 different datasets and you can load each of them by typing and running data(dataset_name)
, for instance:
data(plasma)