1. Cluster Analysis in R (briefly)
Cluster analysis is a statistical method for grouping objects into disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is measured with respect to some proximity/distance measure (see the function dist()
in the R environment - standard installation). However, cluster analysis is not just one specific method/approach but rather a whole set of different tools (algorithms) which can be used to solve the grouping problem.
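As a quick illustration of dist(), the following minimal sketch (on three made-up two-dimensional points, not the geography data) computes the pairwise Euclidean distances:

```r
# Pairwise Euclidean distances between three 2-D points
X <- rbind(c(0, 0), c(3, 4), c(0, 4))
D <- dist(X)            # default: method = "euclidean"
round(as.matrix(D), 2)  # distance between points 1 and 2 is sqrt(3^2 + 4^2) = 5
```

The result is a ‘dist’ object (a lower-triangular proximity matrix), which is exactly the input expected by hclust() below.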
Different approaches to clustering problems include
- Hierarchical clustering
- Centroid-based clustering
- Distribution-based clustering
- Density-based clustering
- and many others…
First, we load the dataset:
rm(list = ls())
data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/geoDataEN.csv", header = T)
attach(data)
head(data)
## country grade gender geoMark popularity Q1 Q2 Q3 Q4 Q5 Q6
## 1 sr grade1 female 1 1 100.00000 100 100 100 100 100
## 2 sr grade1 male 3 3 0.00000 0 0 100 75 50
## 3 cr grade1 male 2 1 100.00000 100 0 100 75 50
## 4 sr grade1 male 3 3 66.66667 0 0 50 0 50
## 5 cr grade1 male 2 2 66.66667 0 100 50 0 100
## 6 cr grade1 female 1 1 100.00000 0 100 100 50 100
## Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21
## 1 100 100 100.00000 100.00000 100 100 100 100 100 75 100 100 0 0 0
## 2 100 0 100.00000 100.00000 0 0 100 0 0 0 0 100 0 0 0
## 3 100 100 71.42857 100.00000 0 50 0 0 100 75 100 100 0 0 0
## 4 0 0 100.00000 0.00000 0 100 100 50 0 0 0 100 0 0 0
## 5 0 0 100.00000 66.66667 0 100 100 50 0 0 0 50 0 0 0
## 6 0 75 100.00000 100.00000 100 100 100 0 0 0 0 100 0 0 0
## Overall
## 1 84.52
## 2 34.52
## 3 58.16
## 4 29.37
## 5 37.30
## 6 53.57
The dataset consists of 803 individual (independent) observations, with 27 different covariates recorded for each student. A brief description is given below:
- country - two-level factor distinguishing the Czech Republic (cr) and Slovakia (sr);
- grade - three-level factor distinguishing 11, 15 and 18 year old students;
- gender - two-level factor for gender (female / male student);
- geoMark - geography grade;
- popularity - level of popularity given by each student (from 1 - liked to 3 - disliked);
- Q1 - Q21 - percentage gain for each of the 21 geography questions;
- Overall - overall percentage gain;
Hierarchical Clustering in R
Let us consider the results from all 21 questions, including the overall summary, to calculate the proximity matrix \(D\). Different distance definitions can be used to achieve this. In the following we use the default setting - the Euclidean distance.
However, in order to improve the readability of the dendrograms (which we will construct later) we will not consider the whole dataset (\(n = 803\) students) but rather a subsample only. A random subsample is created using the following piece of R code (remember to set the seed - command set.seed()
- in order to get the same data set each time you run the command).
set.seed(1982)
id <- 1:803
sample <- sample(1:803, 40, replace = F)
rowNames <- paste(gender, grade,id, sep = "_")
gradeGroup <- data[sample, 2]
Qresults <- data.frame(data[sample,6:27], row.names = rowNames[sample])
D <- dist(Qresults) ### for the Euclidean distance
Now we can apply the function hclust()
(available in the standard R installation) to run a hierarchical clustering approach based on the proximity matrix \(D\). See the help session in R for more details (type ?hclust
).
HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
A fancier version is available in the R package ‘sparcl’, which needs to be installed first (use install.packages('sparcl')
for installation). After loading the library into the R environment the corresponding command is ColorDendrogram()
(see the help session for more details).
In order to split the observations into a prespecified number of groups with respect to the proximity measure we can use the following R code:
plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
Different agglomerative approaches are available via different settings of the ‘method’ parameter. The following options are possible:
- method = 'single'
- method = 'complete'
- method = 'average'
- method = 'median'
- method = 'centroid'
- method = 'mcquitty'
- method = 'ward.D'
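Each of these options can be passed directly to hclust(). A minimal sketch, using simulated data as a stand-in for the Qresults subsample (so it runs without downloading the dataset):

```r
# Compare linkage methods on simulated data (stand-in for Qresults)
set.seed(1982)
X <- matrix(rnorm(40 * 5), ncol = 5)  # 40 observations, 5 variables
D <- dist(X)
HCsingle <- hclust(D, method = "single")
HCward   <- hclust(D, method = "ward.D")
HCsingle$method  # the chosen linkage is stored in the fitted object
# cluster memberships obtained from different linkages typically differ:
table(single = cutree(HCsingle, k = 3), ward = cutree(HCward, k = 3))
```

Single linkage tends to produce long chained clusters, while Ward's method favours compact, similarly sized groups; comparing the two cutree() partitions makes this visible.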
The default setting for the hclust()
function is method = 'complete', which is also displayed in the title of each dendrogram figure. One can of course improve the overall impression by considering fancier versions of the dendrograms.
plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
and even more sophisticated alternatives, e.g.:
cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")
# function to get color labels
colLab <- function(n) {
if (is.leaf(n)) {
a <- attributes(n)
labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
}
n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")
The same idea can also be used to color the branches with respect to some additional covariate (e.g. the grade of the students - the covariate gradeGroup
created at the beginning).
local({
colLab <<- function(n) {
if(is.leaf(n)) {
a <- attributes(n)
i <<- i+1
attr(n, "edgePar") <-
c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
}
n
}
mycols <- gradeGroup
i <- 0
})
x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)
To Do
- Consider different types of distances for the function
dist()
and apply the agglomerative clustering approach while selecting different methods for the cluster calculations (various settings of the ‘method =’ parameter when calling the hclust()
function).
- Use the dendrogram pictures to compare results from different distances and different clustering approaches.
Now we apply the same approach to a much bigger dataset (the first \(n = 200\) observations), distinguishing three clusters and using colors to distinguish the different grades. The idea is to see whether the clusters somehow correspond to the three available grades (as we perhaps had the impression above) or rather not…
n <- 200 ### number of observations to be used
gColor <- rep("lightblue", n)
gColor[data[1:n,2] == "grade2"] <- "blue"
gColor[data[1:n,2] == "grade3"] <- "darkred"
local({
colLab <<- function(n) {
if(is.leaf(n)) {
a <- attributes(n)
i <<- i+1
attr(n, "edgePar") <-
c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
}
n
}
mycols <- gColor
i <- 0
})
Dall <- dist(data[1:n,6:27])
HC2 <- hclust(Dall)
x.clust.dend2 <- dendrapply(as.dendrogram(HC2), colLab)
plot(x.clust.dend2)
groups2 <- cutree(HC2, k=3)
rect.hclust(HC2, k=3, border="red")
More sophisticated dendrograms can be obtained with some additional libraries (e.g. the library ‘ape’, which can be installed by install.packages("ape")
, or the library ‘sparcl’, which can be installed by running install.packages("sparcl")
).
library(sparcl)
ColorDendrogram(HC1, y = cutHC1, labels = names(cutHC1), main = "Geography Data",
branchlength = 80)
Non-hierarchical Clustering in R
The most common approach for non-hierarchical clustering is the so-called K-means algorithm. The idea behind the algorithm is to partition the observations into some prespecified number of groups (\(k\) groups - hence the name of the method) such that the sum of squares from the points to the assigned cluster centres is minimized.
In the R software the function which performs the non-hierarchical K-means clustering is kmeans()
, available in the standard R installation (see the help session for more details).
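A minimal sketch of kmeans(), again on simulated data as a stand-in for the question scores (the nstart value is an illustrative choice):

```r
# K-means with k = 3 on simulated data (stand-in for the question scores)
set.seed(1982)
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
KM <- kmeans(X, centers = 3, nstart = 20)  # multiple restarts guard against poor local optima
KM$size          # cluster sizes
KM$tot.withinss  # minimized total within-cluster sum of squares
```

Unlike hclust(), the number of clusters must be fixed in advance, and the result depends on the random initialization - hence set.seed() and the nstart argument.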
2. Introduction to discriminant analysis in R
Discriminant analysis is a (not purely multivariate) statistical method designed for allocating various objects into similar (homogeneous) groups. The idea is to use some available information (given covariates) to create a classification rule which can later be used to assign a new object to one of the existing groups.
There are many different approaches to defining a discrimination rule. The most classical approaches include the linear discriminant function (in the R software implemented in the lda()
function available in the package ‘MASS’) and the quadratic discriminant function (command qda() in the package ‘MASS’).
Use the help sessions in R for both functions to find out more about their implementation in R.
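A minimal sketch of lda(), using the built-in iris data rather than the geography dataset (the new flower at the end is entirely made up):

```r
# Linear discriminant analysis on the built-in iris data
library(MASS)
fit  <- lda(Species ~ ., data = iris)   # classification rule from all four measurements
pred <- predict(fit, iris)              # assign each observation to a group
mean(pred$class == iris$Species)        # apparent (in-sample, optimistic) accuracy
# a new (made-up) flower can be classified with the same rule:
newx <- data.frame(Sepal.Length = 5, Sepal.Width = 3.5,
                   Petal.Length = 1.4, Petal.Width = 0.2)
predict(fit, newx)$class
```

Replacing lda() with qda() in the same code fits the quadratic rule instead; note that the in-sample accuracy overstates how well the rule would classify genuinely new observations.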
The following is a nice R Markdown document, available on RStudio Pubs, prepared by G. Martos:
http://rstudio-pubs-static.s3.amazonaws.com/35817_2552e05f1d4e4db8ba87b334101a43da.html
To Do
- Use the R Markdown available on the given website and apply some linear and quadratic discriminant functions to some dataset.
- Provide some graphical outputs and discuss the obtained results.
- Use some additional data points (feel free to make up some) and try to use the discrimination rules to assign the new points to the existing groups.