NMST539, LS 2015/16

Exercise 10 (Week 11)

Cluster Analysis and Discriminant Analysis


1. Cluster Analysis in R (briefly)

Cluster analysis is a statistical method for grouping objects into disjoint sets with respect to their similarity/dissimilarity. The similarity/dissimilarity is measured with respect to some proximity/distance measure (see the function dist() in the R environment, part of the standard installation). However, cluster analysis is not just one specific method/approach but rather a whole set of different tools (algorithms) which can be used to solve the grouping problem.

Different approaches to clustering problems include
  • Hierarchical clustering
  • Centroid-based clustering
  • Distribution-based clustering
  • Density-based clustering
  • and many others…
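To illustrate the distance computation itself, here is a tiny sketch with made-up points; the metric names are standard values of the method argument of dist():

```r
# Three made-up points in 2D, just to illustrate the dist() interface
X <- matrix(c(0, 0,
              3, 4,
              0, 4), ncol = 2, byrow = TRUE)
dist(X)                       # Euclidean distance (the default)
dist(X, method = "manhattan") # city-block (Manhattan) distance
```

The result is an object of class "dist", a compact lower-triangular representation of the proximity matrix, which is exactly the input expected by hclust() below.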

Firstly, we load the dataset:

rm(list = ls())
data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/geoDataEN.csv", header = T)
attach(data)
head(data)
##   country  grade gender geoMark popularity        Q1  Q2  Q3  Q4  Q5  Q6
## 1      sr grade1 female       1          1 100.00000 100 100 100 100 100
## 2      sr grade1   male       3          3   0.00000   0   0 100  75  50
## 3      cr grade1   male       2          1 100.00000 100   0 100  75  50
## 4      sr grade1   male       3          3  66.66667   0   0  50   0  50
## 5      cr grade1   male       2          2  66.66667   0 100  50   0 100
## 6      cr grade1 female       1          1 100.00000   0 100 100  50 100
##    Q7  Q8        Q9       Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21
## 1 100 100 100.00000 100.00000 100 100 100 100 100  75 100 100   0   0   0
## 2 100   0 100.00000 100.00000   0   0 100   0   0   0   0 100   0   0   0
## 3 100 100  71.42857 100.00000   0  50   0   0 100  75 100 100   0   0   0
## 4   0   0 100.00000   0.00000   0 100 100  50   0   0   0 100   0   0   0
## 5   0   0 100.00000  66.66667   0 100 100  50   0   0   0  50   0   0   0
## 6   0  75 100.00000 100.00000 100 100 100   0   0   0   0 100   0   0   0
##   Overall
## 1   84.52
## 2   34.52
## 3   58.16
## 4   29.37
## 5   37.30
## 6   53.57



The dataset consists of 803 individual (independent) observations with 27 different covariates recorded. A brief description is given below:

  • country - two-level factor distinguishing the Czech Republic (cr) and Slovakia (sr);
  • grade - three-level factor distinguishing 11, 15, and 18 year old students;
  • gender - two-level factor for gender (female/male student);
  • geoMark - geography grade;
  • popularity - level of popularity reported by each student (from 1 - liked to 3 - disliked);
  • Q1 - Q21 - percentage score for each of the 21 geography questions;
  • Overall - overall percentage score;



Hierarchical Clustering in R

Let us consider the results from all 21 questions, including the overall summary, to calculate the proximity matrix \(D\). Different distance definitions can be used to achieve this. In the following we use the default setting - the Euclidean distance.

However, in order to improve the readability of the dendrograms (which we will construct later) we will not consider the whole dataset (\(n = 803\) students) but only a subsample. A random subsample is created using the following piece of R code (remember to set the seed with set.seed() in order to get the same subsample each time you run the code).

set.seed(1982)
id <- 1:803
sample <- sample(1:803, 40, replace = F)
rowNames <- paste(gender, grade, id, sep = "_")
gradeGroup <- data[sample, 2]
Qresults <- data.frame(data[sample, 6:27], row.names = rowNames[sample])
D <- dist(Qresults) ### the Euclidean distance (default)

and we can apply the function hclust() (available under the standard R installation) to run a hierarchical clustering approach based on the proximity matrix \(D\). See the help session in R for more details (type ?hclust).

HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")

A fancier version is available in the R package ‘sparcl’, which needs to be installed first (use install.packages("sparcl")). After loading the library into the R environment the corresponding command is ColorDendrogram() (see the help session for more details).

In order to split the observations into a prespecified number of groups with respect to the proximity measure we can use the following R code:

plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

Different agglomerative approaches are possible with different settings of the ‘method’ parameter. The following options are available:

  • method = "single"
  • method = "complete"
  • method = "average"
  • method = "median"
  • method = "centroid"
  • method = "mcquitty"
  • method = "ward.D"
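The effect of the linkage choice can be seen by clustering the same distance matrix with several methods and plotting the dendrograms side by side. A sketch on small made-up data follows; with the geography data one would pass the matrix D computed above instead of d:

```r
# Compare several agglomerative linkage methods on toy data
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)   # 10 made-up observations in 2D
d <- dist(x)                       # Euclidean distances

par(mfrow = c(2, 2))
for (m in c("single", "complete", "average", "ward.D")) {
    hc <- hclust(d, method = m)
    plot(hc, main = paste("Linkage:", m), xlab = "", sub = "")
}
par(mfrow = c(1, 1))
```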

The default setting of the hclust() function is method = "complete", which is also displayed on each dendrogram figure. One can of course improve the overall impression by considering fancier versions of dendrograms.

plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

and even more sophisticated alternatives, e.g.:

cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")

# function to get color labels
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

The same idea can also be used to color the branches with respect to some additional covariate (e.g. the grade of the students - the covariate gradeGroup created at the beginning).

local({
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      i <<- i+1
      attr(n, "edgePar") <-
        c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
    }
    n
  }
  mycols <- gradeGroup
  i <- 0
})

x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)




To Do


  • Consider different types of distances for the function dist() and apply the agglomerative clustering approach while selecting different methods for the cluster calculations (various settings of the ‘method =’ parameter when calling the hclust() function).
  • Draw the dendrograms and compare the results for different distances and different clustering approaches.



Now we apply the same approach to a much bigger dataset (the first \(n = 200\) observations), distinguishing three clusters and using colors to distinguish the different grades. The idea is to see whether the clusters somehow correspond with the three available grades (as we perhaps had the impression above) or not…

n <- 200 ### number of observations to be used
gColor <- rep("lightblue", n)
gColor[data[1:n,2] == "grade2"] <- "blue"
gColor[data[1:n,2] == "grade3"] <- "darkred"

local({
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      i <<- i+1
      attr(n, "edgePar") <-
        c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
    }
    n
  }
  mycols <- gColor
  i <- 0
})

Dall <- dist(data[1:n,6:27])
HC2 <- hclust(Dall)

x.clust.dend2 <- dendrapply(as.dendrogram(HC2), colLab)
plot(x.clust.dend2)

groups2 <- cutree(HC2, k=3)
rect.hclust(HC2, k=3, border="red")

More sophisticated dendrograms can be obtained with some additional libraries (e.g. the library ‘ape’, which can be installed by install.packages("ape"), or the library ‘sparcl’, which can be installed by running install.packages("sparcl")).

library(sparcl)
ColorDendrogram(HC1, y = cutHC1, labels = names(cutHC1), main = "Geography Data", 
    branchlength = 80)



Non-hierarchical Clustering in R

The most common approach for non-hierarchical clustering is the so-called K-means algorithm. The idea behind the algorithm is to partition the observations into a prespecified number of groups (\(k\) groups - hence the name of the method) such that the sum of squares from the points to the assigned cluster centres is minimized.

In the R software the function which performs the non-hierarchical K-means clustering is kmeans(), available under the standard R installation (see the help session for more details).
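A minimal sketch of the kmeans() call on the built-in iris measurements (used here only so the example is self-contained); with the geography data one could analogously call kmeans(data[, 6:27], centers = 3):

```r
# K-means with K = 3 clusters on the built-in iris data
set.seed(1982)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$size                          # number of observations in each cluster
table(km$cluster, iris$Species)  # compare the clusters with the true species
```

The nstart parameter runs the algorithm from several random initializations and keeps the best solution, which is advisable since K-means only finds a local minimum of the within-cluster sum of squares.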

Comment


  • Many other methodologies and approaches are available in many different R packages and various R functions.



2. Introduction to discriminant analysis in R

Discriminant analysis is a (not purely multivariate) statistical method designed for allocating various objects into similar (homogeneous) groups. The idea is to use the available information (given covariates) to create a classification rule which can later be used to assign a new object to one of the existing groups.

There are many different approaches to defining a discrimination rule. The most classical approaches include the linear discriminant function (implemented in R in the lda() function available in the package ‘MASS’) and the quadratic discriminant function (the function qda() in the package ‘MASS’).

Use the help sessions in R for both functions to find out more about their implementation in R.
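As a hedged sketch of both functions, the following uses the built-in iris data (so the example is self-contained); with the geography data one could discriminate, e.g., by the grade factor:

```r
# Linear and quadratic discriminant rules from the 'MASS' package
library(MASS)

fit.lda <- lda(Species ~ ., data = iris)   # linear discriminant function
fit.qda <- qda(Species ~ ., data = iris)   # quadratic discriminant function

# classify a new (made-up) observation with both rules
newObs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                     Petal.Length = 4.5, Petal.Width = 1.5)
predict(fit.lda, newObs)$class
predict(fit.qda, newObs)$class

# resubstitution accuracy of the linear rule
mean(predict(fit.lda)$class == iris$Species)
```

Note that the resubstitution accuracy is optimistic; cross-validation (e.g. the CV = TRUE argument of lda()) gives a more honest error estimate.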

The following is a nice R Markdown document available via RStudio, prepared by G. Martos:

http://rstudio-pubs-static.s3.amazonaws.com/35817_2552e05f1d4e4db8ba87b334101a43da.html



To Do


  • Use the R Markdown available on the given website and apply linear and quadratic discriminant functions to some dataset.
  • Provide some graphical outputs and discuss the obtained results.
  • Use some additional data points (feel free to make some up) and try to use the discrimination rules to assign the new points to the existing groups.



Homework Assignment (voluntary)

(Deadline: 16.05.2016)

On the webpage of Doc. Hlávka

http://www1.karlin.mff.cuni.cz/~hlavka/teac.html

several data files are available which can be loaded into R simply with the command load("nazov_suboru.rda").

  • Choose one data file which contains both categorical and continuous variables. Use one of the categorical variables to define groups, then apply discriminant analysis and construct a discriminant function which can be used to classify new observations into one of the existing groups.