NMST539 | Lab Session 11
Discriminant Analysis and Similar Approaches in R
Summer Term 2019/2020 | April 13th - April 17th, 2020 (week 9)
The R-software is available for download from the official website: https://www.r-project.org
A user-friendly interface (one of many): RStudio
Some useful manuals and tutorials for R (in Czech or English):
-
Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
-
Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
-
Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
-
Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
-
De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)
1. Discriminant analysis in R
Discriminant analysis is a (not purely multivariate) statistical method designed for allocating objects into homogeneous groups. The idea is to use the available information (the given covariates; if there are many of them, we can also speak about a multivariate method, similarly as, for instance, in regression analysis) in order to construct a classification rule which can later be used to assign a new object to one of the existing groups. From this point of view it can be considered a classification tool. To recall the difference between clustering and classification:
-
Classification works with data for which the groups are known and we try to learn what differentiates them in order to be able to assign future observations into the given groups.
-
Clustering has no prior information on what the underlying groups are – the groups are unknown and we try to identify them by considering mutual similarities/differences among the given observations.
There are many different approaches to defining a reasonable discrimination rule. Such rules can be based, for instance, on the prior probabilities (the Bayes rule), on the likelihood approach, or on some other methods (including, for instance, some expert judgement). For a normal model with homoscedastic error terms the discrimination rule simplifies to a linear form (linear discriminant analysis), while for heteroscedastic errors the discrimination rule has a quadratic form instead. We mostly focus on linear discriminant analysis (LDA), which is a statistical method meant to find a linear combination of features (usually a set of continuous random variables) that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier. This method is implemented in the R software as the function lda(), which is available in the package ‘MASS’. A straightforward extension is the quadratic discrimination rule, which is performed with the R command qda() (also in the package ‘MASS’).
The linear discriminant analysis is closely related to the analysis of variance (ANOVA) and to regression analysis, which both also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable. The logistic regression approach is, on the other hand, more similar to LDA than ANOVA is, as it also explains a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.
If the normality assumption is too restrictive, we can use Fisher’s approach, which finds the linear discrimination in such a way that the between-class variability and the within-class variability are optimized. This method does not even require homoscedastic errors. Suppose that two groups of observations have means \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\) and the corresponding variance-covariance matrices are \(\Sigma_1\) and \(\Sigma_2\). Fisher’s linear separation measure between these two groups (the corresponding distributions, respectively) is defined as the ratio of the variance between the groups to the variance within the groups, that is \[
S = \frac{\sigma_{between}^2}{\sigma_{within}^2} = \frac{(\boldsymbol{w}^\top \boldsymbol{\mu}_2 - \boldsymbol{w}^\top \boldsymbol{\mu}_1)^2}{\boldsymbol{w}^\top (\Sigma_{1} + \Sigma_2) \boldsymbol{w}}.
\]
This measure is, in some sense, a measure of the signal-to-noise ratio for the class labeling and it can be shown that the maximum separation occurs for \[
\boldsymbol{w} = (\Sigma_1 + \Sigma_2)^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1).
\]
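The optimal direction \(\boldsymbol{w}\) can be computed directly from the sample means and the sample covariance matrices of the two groups. A minimal sketch on simulated data (all object names below are newly introduced for this illustration only):
set.seed(539)
### two simulated groups in three dimensions, differing only in their mean vectors
X1 <- matrix(rnorm(100 * 3, mean = 0), ncol = 3)
X2 <- matrix(rnorm(100 * 3, mean = 1), ncol = 3)
### Fisher's direction maximizing the separation measure S
w <- solve(cov(X1) + cov(X2)) %*% (colMeans(X2) - colMeans(X1))
w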
In the following, let us consider the mtcars dataset. We use a subset of the continuous covariates and search for a linear classification rule that classifies the given cars by the number of cylinders (thus, we have three groups: cars with 4, 6, and 8 cylinders).
library(MASS)
### linear discriminant rule for the three cylinder groups
lda1 <- lda(cyl ~ mpg + hp + qsec + drat + disp, data = mtcars)
### predicted classes, posterior probabilities and discriminant scores
pred <- predict(lda1)
### histograms of the first discriminant scores within each group
ldahist(data = pred$x[,1], g = mtcars$cyl)
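The predicted group membership can be compared with the observed number of cylinders, for instance via a simple classification table (a resubstitution comparison, hence optimistic); only the objects created above are used:
table(predicted = pred$class, observed = mtcars$cyl)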

From the theoretical point of view the linear discriminant analysis provides a linear partition of the \(n\)-dimensional sample space. In the example above we have a five-dimensional sample space \(\mathbb{R}^5\) with the random vector \[
\boldsymbol{X} = (X_1, \dots, X_5)^\top = (mpg, hp, qsec, drat, disp)^\top.
\]
The linear discriminant analysis provides a specific partition, which can be visualized using the R function partimat() from the library klaR. The particular partition is plotted in terms of 10 two-dimensional plots, where for each plot a specific (and unique) combination of two variables is considered.
library("klaR")
partimat(as.factor(cyl) ~ mpg + hp + qsec + drat + disp, data = mtcars, method = "lda")

The plots above can further be compared with another set of plots where we only distinguish the different groups:
pairs(mtcars[c("mpg", "hp", "qsec", "drat", "disp")], pch = 22,
      bg = c("red", "yellow", "green")[unclass(as.factor(mtcars$cyl))])

To Do
-
For a better graphical insight into the LDA results we can consider a simplification where we only use one continuous covariate as a predictor.
lda2 <- lda(cyl ~ mpg, data = mtcars)
lda2cv <- lda(cyl ~ mpg, data = mtcars, CV = TRUE)
The linear discriminant rule is given by the scaling factor. The corresponding transformations can be obtained as follows:
### discriminant scores obtained from the estimated scaling factor
lda_scores <- as.numeric(lda2$scaling) * mtcars$mpg
par(mfrow = c(1,2))
plot(lda_scores ~ mtcars$mpg, pch = 21, bg = mtcars$cyl, ylab = "LDA Scores", xlab = "Miles per Gallon")
### leave-one-out posterior probabilities together with mpg and the true group labels
lda2cv <- cbind(lda2cv$posterior, mtcars$mpg, mtcars$cyl)
lda2cv <- lda2cv[order(mtcars$mpg), ]
plot(lda2cv[,1] ~ lda2cv[,4], pch = 21, bg = lda2cv[,5], ylab = "Posterior Probability", xlab = "Miles Per Gallon")
lines(lda2cv[,1] ~ lda2cv[,4], col = "blue")
points(lda2cv[,2] ~ lda2cv[,4], pch = 22, bg = lda2cv[,5])
lines(lda2cv[,2] ~ lda2cv[,4], col = "violet")
points(lda2cv[,3] ~ lda2cv[,4], pch = 23, bg = lda2cv[,5])
lines(lda2cv[,3] ~ lda2cv[,4], col = "gray")
-
In the plots above we can see both the actual group representation and the given linear discrimination rule. The given data points belong to one of three specific groups and the assignment is visualized by different colors. However, for a new observation (without the group assignment information) we only need the consumption information (miles per gallon), and using the estimated posterior probabilities for the three groups we can perform the corresponding assignment.
-
Consider an analogous situation for more than just one continuous predictor. Use some graphical tools to visualize the results.
-
Use some additional data points (feel free to make up some) and try to use the discrimination rules to assign the new points to the existing groups (a small sketch is given after this list).
-
Consider the heteroscedastic version of the model and try to apply the quadratic discriminant function (command ‘qda()’ in package ‘MASS’).
-
Consider, for instance, an alternative classification performed within the framework of a logistic model.
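For the items on new data points and on the quadratic rule, here is a minimal sketch of how new (made-up) observations could be assigned by the fitted rules; the data frame newCars, the object qda1, and the covariate values are introduced here for illustration only:
### two hypothetical cars described by the same five covariates as above
newCars <- data.frame(mpg = c(14, 28), hp = c(230, 70), qsec = c(16.5, 19.5),
                      drat = c(3.1, 4.1), disp = c(360, 95))
predict(lda1, newdata = newCars)$class   ### assignment by the linear rule
### quadratic rule; fewer covariates are used so that each group has enough observations
qda1 <- qda(cyl ~ mpg + hp, data = mtcars)
predict(qda1, newdata = newCars)$class   ### assignment by the quadratic rule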
2. Alternative approaches for analogous results
Mowers data:
Let us consider an artificial and balanced dataset where three covariates are given for each subject (householder): the income of the householder (considered a continuous covariate); the size of the lot (again a continuous covariate); and the type of the grass mower (a binary covariate representing either a manual mower or a motorized tractor mower).
The idea is that wealthier householders with a large lot should have a tendency to buy the more expensive motorized tractor mowers, whereas householders with a low income or a small backyard should rather buy a small manual mower. Using the information about the income and the lot size we want to find a classification rule for the mower type. Four different methods are used: a generalized linear model, linear discriminant analysis, quadratic discriminant analysis, and a linear regression model.
The data can be loaded in by the command
Mowers <- read.table("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/mowers.txt", sep = "", header = T)
### 2D grid expansion
mowers.gr = expand.grid(seq(20,120,len=1000),seq(14,24,len=1000))
names(mowers.gr) = c("income","lot")
And the results of the four methods are the following:
par(mfrow=c(2,2))
cols <- c("red", "blue")
attach(Mowers)
mowe.ii = unique(mowers.gr$income)
mowe.ll = unique(mowers.gr$lot)
plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Generalized Linear Model")
mow.nn = glm(factor(riding)~income+lot,family=binomial,data=Mowers)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0.5),1000,1000),
nlev=1,drawlabels=FALSE,add=T)
plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Linear Discriminant Analysis")
mow.nn = lda(factor(riding)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
nlev=1,drawlabels=FALSE,add=T)
plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Quadratic Discriminant Analysis")
mow.nn = qda(factor(riding)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
nlev=1,drawlabels=FALSE,add=T)
plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Linear Regression Model")
mow.nn = lm(I(2*(riding-1)-1)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0),1000,1000),
nlev=1,drawlabels=FALSE,add=T)

Note that the LDA approach and the standard linear regression perform equivalently (under the assumptions that there are two groups only, the groups are labeled with minus one and one, and, finally, the data are balanced – the same number of observations in each group, 12 in our particular example), while the GLM is slightly different. The quadratic discriminant analysis performs very differently, which is, however, expected from the underlying theoretical background.
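The equivalence can also be checked numerically. A small sketch (the objects fit.lm and fit.lda are introduced here only for this comparison): the slope coefficients of the linear model should be proportional to the LDA scaling vector, so their elementwise ratio should be (roughly) constant.
### refit both models and compare the estimated directions
fit.lm <- lm(I(2*(riding - 1) - 1) ~ income + lot, data = Mowers)
fit.lda <- lda(factor(riding) ~ income + lot, data = Mowers)
coef(fit.lm)[-1] / as.numeric(fit.lda$scaling)   ### (roughly) constant ratio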
Another, more popular and recent approach is to use a (deep) neural network to perform this classification. Fitting a neural network is a highly non-linear, non-convex optimization problem. Unfortunately, the results heavily depend on the initial configuration: compare, for instance, the following six outputs (the same neural network and the same dataset are used in all six cases).
To run the commands below, one first needs to install the corresponding package by install.packages("nnet").
require(nnet)
## Loading required package: nnet
par(mfrow=c(3,2))
### fit the same network six times, each time with a different random initialization
for (k in 1:6) {
mow.nn = nnet(factor(riding)~income+lot,size=50,data=Mowers)
plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Neural Network")
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="class")),1000,1000),
nlev=1,drawlabels=FALSE,add=T)
}
## # weights: 201
## initial value 16.703385
## iter 10 value 14.911405
## iter 20 value 13.842053
## iter 30 value 10.729652
## iter 40 value 1.528541
## iter 50 value 0.006520
## final value 0.000078
## converged
## # weights: 201
## initial value 16.978673
## iter 10 value 15.146512
## iter 20 value 6.557024
## iter 30 value 5.953321
## iter 40 value 5.774542
## iter 50 value 5.732956
## iter 60 value 5.597748
## iter 70 value 3.039931
## iter 80 value 2.935625
## iter 90 value 2.826205
## iter 100 value 2.786252
## final value 2.786252
## stopped after 100 iterations
## # weights: 201
## initial value 18.738950
## iter 10 value 15.469498
## iter 20 value 13.611634
## iter 30 value 13.064297
## iter 40 value 9.177095
## iter 50 value 2.775699
## iter 60 value 0.023799
## final value 0.000065
## converged
## # weights: 201
## initial value 17.574015
## iter 10 value 15.372480
## iter 20 value 14.420627
## iter 30 value 11.458143
## iter 40 value 4.739300
## iter 50 value 3.328623
## iter 60 value 1.815343
## iter 70 value 1.401203
## iter 80 value 1.042757
## iter 90 value 0.008876
## iter 100 value 0.001329
## final value 0.001329
## stopped after 100 iterations
## # weights: 201
## initial value 23.186127
## iter 10 value 15.780740
## iter 20 value 14.594349
## iter 30 value 13.002454
## iter 40 value 6.229950
## iter 50 value 5.986465
## iter 60 value 5.745564
## iter 70 value 5.689259
## iter 80 value 5.640800
## iter 90 value 5.260440
## iter 100 value 2.842388
## final value 2.842388
## stopped after 100 iterations
## # weights: 201
## initial value 20.176142
## iter 10 value 15.439204
## iter 20 value 14.663280
## iter 30 value 13.518560
## iter 40 value 9.880790
## iter 50 value 4.637804
## iter 60 value 2.292573
## iter 70 value 0.000984
## final value 0.000054
## converged

The disadvantage of the neural network approach is obvious in the plots above. The same data are used and the same methodological framework is applied, but each time a very different result is obtained. This is due to the fact that the neural network heavily depends on its initial (random) configuration.
Therefore, in order to obtain meaningful results one usually runs the neural network several times and averages the final result over all runs.
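A minimal sketch of such averaging, assuming we simply average the predicted class probabilities over ten independent fits (the object probs, the seed, and the plot title are introduced here for illustration only):
### ten independent fits of the same network, class probabilities evaluated on the grid
set.seed(539)
probs <- replicate(10, {
  fit <- nnet(factor(riding) ~ income + lot, size = 50, data = Mowers, trace = FALSE)
  as.numeric(predict(fit, newdata = mowers.gr, type = "raw"))
})
### averaged decision boundary (estimated probability of the second group above one half)
par(mfrow = c(1,1))
plot(income, lot, pch = 21 + riding, bg = cols[riding], main = "Averaged Neural Network")
contour(mowe.ii, mowe.ll, matrix(as.numeric(rowMeans(probs) > 0.5), 1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE)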
3. Correspondence analysis in R
LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.
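Discriminant correspondence analysis itself is not pursued here. As a rough starting point only, an ordinary correspondence analysis of a two-way contingency table can be computed with the function corresp() from the package MASS (already loaded above); the table of cylinders versus gears from mtcars is chosen purely for illustration:
### correspondence analysis of the cylinders-by-gears contingency table
### (the table is converted to a data frame of counts before calling corresp)
tab <- as.data.frame.matrix(table(mtcars$cyl, mtcars$gear))
corresp(tab, nf = 2)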
Homework Assignment
(Deadline: Week 10)
Use the command below to install the SMSdata package with various multivariate datasets:
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
Choose one dataset from the list of all available datasets in the package:
data(package = "SMSdata")
There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:
data(plasma)
The dataset should contain at least one discrete covariate which can be used to identify groups. In addition, at least two other continuous covariates should be present in order to train the discrimination rules.
-
Consider one discrete covariate and one continuous covariate and apply the linear and quadratic discrimination rule.
-
Is it possible to improve the performance of the linear discrimination rule by adding some other covariates?
-
Can you propose an alternative ‘discrimination approach’ which would be based, for instance, on a (generalized) linear regression model?