1. Cluster Analysis in R
Cluster analysis is a statistical method designed for distinguishing various objects, i.e., grouping them into disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is usually measured by some proximity or distance measure (see the function dist() in the R environment, used similarly in the multidimensional scaling technique). However, cluster analysis is not just one specific method/approach; it is rather a whole set of various tools and algorithms which can be used to solve the grouping problem.
Considering the construction process, we distinguish between two types of clustering algorithms: partitioning algorithms, where the assignment of objects to the groups can change during the algorithm, and hierarchical algorithms, where an assignment of objects, once made, is kept fixed for the rest of the algorithm. The result of a clustering algorithm is mainly affected by the choice of the distance/proximity matrix and the clustering method used in the algorithm. Among others, the clustering approaches include the following:

Partitioning clustering

Hierarchical clustering

Centroid-based clustering

Distribution-based clustering

Density-based clustering

and many others…
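As an illustration of the partitioning type of algorithm, one can apply the standard k-means algorithm (function kmeans() available in base R) to the 'mtcars' dataset used below; note that the number of clusters must be specified in advance (the choice of three clusters here is arbitrary, for illustration only).

```r
### k-means: a partitioning algorithm (assignments of objects to
### clusters can change from one iteration to the next)
data <- mtcars
set.seed(1234)                  # for reproducible cluster assignments
KM <- kmeans(data, centers = 3) # three clusters (an arbitrary choice)
KM$cluster                      # cluster membership of each car
KM$size                         # number of cars in each cluster
```
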
Let us consider the data on the fuel consumption of automobiles on the US market in the mid-1970s (the dataset ‘mtcars’ available in R):
rm(list = ls())
data <- mtcars
attach(data)
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them are continuous, some of them are categorical). For additional information on this dataset use the help session in R (type ?mtcars).
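Note that the covariates are measured on very different scales (e.g., ‘disp’ in cubic inches vs. ‘am’ as a 0/1 indicator), so the variables with the largest variance dominate the Euclidean distance. A common remedy, sketched below, is to standardize the data with scale() before computing the distance matrix; in the following we nevertheless work with the raw data.

```r
### optional: standardize the covariates before clustering
data_std <- scale(mtcars)      # zero mean, unit variance for each column
D_std <- dist(data_std)        # Euclidean distance on the standardized data
round(colMeans(data_std), 10)  # column means are (numerically) zero
```
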
Hierarchical Clustering in R
Let us consider all covariates from the dataset. We will start by calculating the distance matrix. In the following we are using the default option, the Euclidean distance.
With a large dataset the dendrograms (which we will construct later) become hard to read; in such a case one can work with a smaller random subsample instead (remember to set the seed with the command set.seed() in order to get the same subsample each time you run the code). The ‘mtcars’ dataset, with only 32 observations, is small enough to be used as a whole.
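For illustration, a random subsample of 15 cars (an arbitrary size) could be drawn as follows:

```r
set.seed(1234)                             # fixed seed => same subsample every run
idx <- sample(1:nrow(mtcars), size = 15)   # 15 randomly chosen row indices
data_sub <- mtcars[idx, ]                  # the corresponding subsample
dim(data_sub)
```
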
D <- dist(mtcars) ### for the Euclidean distance
Now, we can apply function hclust()
(available in the standard R installation) to run a hierarchical clustering approach based on the proximity matrix \(D\). See the help session in R for more details (type ?hclust).
HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
A somewhat fancier version is available in the R package ‘sparcl’, which needs to be installed first (use install.packages('sparcl') for the installation). After loading the library, the corresponding command is ColorDendrogram() (see the help session for more details).
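A minimal sketch of its use follows (assuming the ‘sparcl’ package is installed; the colouring by the three groups from cutree() and the branchlength value are arbitrary choices for illustration):

```r
HC1 <- hclust(dist(mtcars))
groups <- cutree(HC1, k = 3)   # colour the leaves by these three clusters

if (requireNamespace("sparcl", quietly = TRUE)) {
  ### coloured dendrogram; branchlength controls the length of the
  ### coloured branch segments (value chosen by trial and error)
  sparcl::ColorDendrogram(HC1, y = groups,
                          labels = rownames(mtcars),
                          branchlength = 50)
}
```
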
In order to highlight a prespecified number of groups with respect to the proximity measure we can use the following R code:
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
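The resulting vector ‘groups’ contains the cluster label of each car and can be inspected, for instance, as follows:

```r
D <- dist(mtcars)
HC1 <- hclust(D)
groups <- cutree(HC1, k = 3)     # cut the tree into three clusters

table(groups)                    # number of cars in each cluster
rownames(mtcars)[groups == 1]    # cars assigned to the first cluster
```
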
Different agglomerative approaches are possible with different settings of the ‘method’ parameter. The following options are available:

method = "single"

method = "complete"

method = "average"

method = "median"

method = "centroid"

method = "mcquitty"

method = "ward.D"
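The agglomeration methods can produce quite different trees. One quick (and informal) way to compare them is the cophenetic correlation, i.e., the correlation between the original distances and the distances implied by the dendrogram (function cophenetic() in base R):

```r
D <- dist(mtcars)
methods <- c("single", "complete", "average", "ward.D")

### cophenetic correlation for each agglomeration method
sapply(methods, function(m) {
  HC <- hclust(D, method = m)
  cor(D, cophenetic(HC))    # closer to 1 => tree preserves distances better
})
```
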
The default setting for the hclust() function is method = "complete", which is also displayed in each dendrogram figure. One can of course improve the overall impression by considering fancier versions of dendrograms.
plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
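One further possibility, using only base R, is to convert the hclust object into a dendrogram object, which offers additional plotting options such as a horizontal layout:

```r
HC1 <- hclust(dist(mtcars))
dend <- as.dendrogram(HC1)   # dendrogram objects allow more plotting options

par(mar = c(4, 1, 1, 8))     # wider right margin for the car names
plot(dend, horiz = TRUE, xlab = "Proximity measure")
```
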