<table width = 900 border = 0><tr><td>
# <b>NMST539 | Lab Session 7</b>

## <b>Factor Analysis and Applications in R</b>
### <b><font color = "#880000">Summer Term 2019/2020 | April 6th - April 10th, 2020 (week 8)</font></b>
###### <a href = "individual_week8.Rmd">Rmd file (UTF8 coding)</a>

**************


The R-software is available for download from the official website: <a href = "https://www.r-project.org">https://www.r-project.org</a>

A user-friendly interface (one of many): <a href = "https://www.rstudio.com">RStudio</a><br><br>

<b>Some useful manuals and tutorials for <i>R</i>
(in Czech or English):</b><br>
    
<ul>
<li>Bína, V., Komárek, A. a Komárková, L.: <i>Jak na jazyk R.</i> (in Czech) | (<a href = "NMFM301/Rmanual2.pdf">PDF manual</a>)</li>
<li>Komárek, A.: <i>Základy práce s R.</i> (in Czech) | (<a href = "http://msekce.karlin.mff.cuni.cz/~komarek/vyuka/dataRko/Rmanual1.pdf">PDF manual</a>)</li>
<li>Kulich, M.: <i>Velmi stručný úvod do R.</i> (in Czech) | (<a href = "NMFM301/uvodr.pdf" >PDF manual</a>)</li>
<li>Venables, W.N., Smith, D.M., and the R Core Team: <i>An Introduction to R.</i> |  (<a href = "https://cran.r-project.org/doc/manuals/R-intro.pdf">PDF manual</a>)</li>
<li>De Vries, A. a Meys, J.: <i>R for Dummies.</i> (ISBN-13: 978-1119055808) |   (<a href = "http://www.ievbras.ru/ecostat/Kiril/R/Biblio/R_eng/R%20dummies.pdf">PDF book</a>)</li>    
</ul>
**************

  


### <font color = "#880000"><b>1. Factor Analysis</b></font>

The factor analysis is an effective statistical method meant for dimensionaly reduction. It is suitable especially in situations 
where one needs to use the data for some further analysis and statistical modeling (e.g. regression).
The factor analysis is used to describe the overall correlation structure among some set of variables by using a potentially lower number of variables which are called factors. These factors are, however, unobserved -- latent random variables.

The factor analysis  searches for similar covariates with respect to their mutuall correlation. All variables with mutually high correlation are represented with a factor (or a linear combination of factors) instead. 

In some sense the factor analysis approach can be considered to be a generalization of the classical principal component analysis (PCA) with one great advantage at the end -- much more convenient interpretation of the factors. However, in the principal component analysis the primary interest lies in variance while the primary interest in the factor analysis is the covariance/correlation. 

In the statistical software R there is a function called `factanal()` which is available under the standard R instalation. 
Beside that, there are many more additinal functions and packages which can be easily downloaded and installed in R (e.g. `Factanal()` function in the 'FAiR' package; `fa.promax()` function in the 'psych' package;). 

For our purposes we mainly use the standard function `factanal()`. Let us again recall the biological metrics data from the Czech republic. The data represent 65 different river localities in the Czech Republic where on each locality there are various biological metrics assessed (17 metrics in total). We will use this dataset for some futher illustrations. 

```{r}
rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
```
<br><br>

The covariance/correlation structure, which is of the main interest here (and will be later assessed using the factor analysis method), can be either estimated using a standard variance covariance matrix (command `var(bioData[,2:18])`) or the correlations can be considered instead -- they can be visualized using the `corrplot()` function from the 'corrplot' package instead. 

```{r, message=F, fig.width=10, fig.height=8}
library(corrplot)
corrplot(cor(bioData[,2:18]), method="ellipse")
```

Recall, that unlike the covariance which is not scale invariant the already correlation is. Compare the outcomes for the following commands: 

```{r, eval = F}
cov(bioData[,2:18])
cov(scale(bioData[,2:18]))
cor(bioData[,2:18])
cor(scale(bioData[,2:18]))
```

Can you justfify why the last three commands give the same results? 

<br><br>

The idea behind the `factanal()` function is to consider a $p$-dimensional (random) vector $\boldsymbol{X}$ and to express this vector using a set of $k$ factors  where we require want that $k \ll p$ (dimensionaly reduction). The model which is fitted by the factor analysis approach when appling the `factanal()' function in R is the following one:

$$
\boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}, 
$$

where $\Lambda$ is a $p\times k$ dimensional (nonrandom) matrix of so called <b>factor loadings</b>, $\boldsymbol{F}$ is a $k$-dimensional (random) vector represending $k$ factors (common factors or scores respectively) and $\boldsymbol{e}$ is an approximation error (or specific factors).  

The form of the expression above closely reminds the model of the linear regression, where 

$$
\boldsymbol{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},
$$
where $\boldsymbol{Y}$ is the vector of dependent covariates, $\mathbb{X}$ is the given model matrix of the type $n \times p$ and $\boldsymbol{\varepsilon}$ is a random vector -- the errror terms. 

However, the tricky part in the factor decomposition  $\boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}$ is that beside the random vector $\boldsymbol{X}$ no other quantity is directly observed. The problem, as such, is unsolvable in a direct and unique way unless we impose some additional restrictions. 
This is done by specifying a varinace covariance structure among all quantities which appear in the expression above: <br><br>

<ul>
<li>The (random) common factors $\boldsymbol{F}$ are required to be uncorrelated, with a unit variance, such that $Var (\boldsymbol{F}) = \mathbb{I}_{k};$</li>
<li>The specific factors (error terms) $\boldsymbol{e}$ are independent with some variance, thus $Var (\boldsymbol{e}) = \boldsymbol{\Psi}$, where $Var (e_{i}) = \psi_{ii}$;</li>
<li>The correlation matrix of $\boldsymbol{X}$ is decomposed as $\Sigma = \Lambda \Lambda^{T} + \boldsymbol{\Psi}$.</li>

<li>The mutual covariance between the common factors $\boldsymbol{F}$ and the specific factors $\boldsymbol{e}$ is zero, thus $Cov(\boldsymbol{F}, \boldsymbol{e}) = \boldsymbol{0}$;</li>
</ul>

<br><br>

A common technique used in the factor analysis approach is to assume that the matrix $\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda$ is dianogal. This introduces additional equations needed to solve the problem. 

<ul>
<li>from the factor decomposition itself, thus the equation $\Sigma  = \Lambda \Lambda^\top + \boldsymbol{\Psi}$, we have $p (p + 1)/2$ equations in total;</li>

<li>the total number of unknown parameters to find is $p \times k$ parameters from the matrix $\Lambda$ and, in addition $p$ parameters from the diagonal matrix $\boldsymbol{\Psi}$;</li>

<li>the additional requirement on the diagonality of $\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda$ introduces additional $k(k + 1)/2$ equations;</li>

<li>therefore, the overall degrees of freedom of the factor model can be obtained as 
$$
\frac{(p - k)^2}{2} - \frac{k + p}{2};
$$</li>
</ul>

The degrees of freedom value as obtained by the equation above can be used to get an idea on what is the overall (upper) number of factors which can be effectively estimated from the underlying model.<br><br>





#### <font color = "#880000">Note</font>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />
<ul>
<li>Factor analysis is not unique -- any rotation of $X$ gives the same results;</li>
<li>Factor analysis is invariant with respect to the scale;</li>
</ul>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />

<br><br>

Since the factors are not unique with respect to the rotation, it is usefull to provide some rotation which makes sense. 
The automatic proceduce which tries to find factors in a way that original covariates can be splitted into disjoint sets is called the <b>varimax</b> procedure. <br>
To use the varimax procedure for detecting the right rotation we can use the additional parameter 'rotation="varimax"' when calling the function `factanal()` in R. 


```{r}
fa1 <- factanal(bioData[,2:18], factors = 3, rotation="varimax")
print(fa1, digits=2, cutoff=.6, sort=TRUE)
```

The blank spaces for some covariates and some factors are not trully missing values but they rather 
represent some correlation which is below a pre-defined threshold limit (too small correlation between the given covariate 
and the given latent factor). If there is no reported correlation for some specific covariate, maybe the number of latent 
factors is not enough (no factor is able to sufficiently correlate with the given covariate). Alternatively, one can also try 
to lower the threshold value instead. <br><br>

Is such factor representation enough? Compare the model with some others for the increasing number of factors used: 

```{r}
fa2 <- factanal(bioData[,2:18], factors = 4, rotation="varimax")
fa3 <- factanal(bioData[,2:18], factors = 5, rotation="varimax")
fa4 <- factanal(bioData[,2:18], factors = 9, rotation="varimax")
```

There are mainly two different approaches on how to define the number of factors which should be used. 
For instance, it is easy to see that for each element $X_{j}$ from the random vector $\boldsymbol{X} = (X_{1}, \dots, X_{p})^\top$
it holds that

$$
var(X_{j}) =h_{j}^2 + \psi_{j}, \qquad \textrm{where} \qquad h_{j}^2 = \sum_{\ell = 1}^{k} q_{j \ell}^2,
 $$
 
 and $\boldsymbol{\psi} = Diag\{\psi_{1}, \dots, \psi_{p}\}$, and $\Lambda = \{q_{i j}\}_{i, j = 1}^{p,k}$. The quantity $h_{j}^2$ is the overall variability of $X_{j}$
 which is explained by the common factors in $F$ (also called the communality) while the second quantity, $\psi_{j}$, is the variability which is left (also called the specific variability).<br><br>
 
It is obvious, that for $p$ factors we obtain that $var(X_{j}) = \sum_{\ell = 1}^{p} q_{j \ell}^2$ and the specific fariability is zero, thus $\psi_j  = 0$, for any $j = 1, \dots, p$.<br><br>

Therefore, to decide on how many factors are needed, one can base the decision on a comparison of the communality and the specific variability. <br><br>

<b>Alternative approaches to judge the right number of factors:</b>
<ul>
<li><b>Statistical pre-analysis</b><br>
One usually runs, for instance, a principal components analysis to determine how many factors should be enough (in order to have some reasonable proportion of the overal variability explained);<br><br></li>

<li><b>Expert judgement</b><br>
Sometimes (under some optimal interpretation options) the estimated factors nicely correspond with some latent variables which are not observed, however, can be indentified by some expert judgement. <br><br></li>

<li><b>Maximum Likelihood Test</b><br>
Using the likelihood approach we can also test the null hypothesis $H_{0}$ that $\Sigma = \Lambda\Lambda^\top + \boldsymbol{\psi}$ against a general altternative $H_{1}$ 
specifing no restrictions on the variance covariance matrix $\Sigma$. The likelihood ratio test is given by the test statistic
$$
T = - 2 log \Bigg[\frac{\textrm{maximum likelihood under $H_{0}$}}{\textrm{maximum likelihood under $H_{1}$}}\Bigg].
$$
However, this approach assumes that the data follow the normal distribution with some specific parameters. 
<br><br></li>
</ul>

<br><br>


#### <font color = "#880000">Note</font>
<ul>
<li>Another option on how to determine the optimal number of factor to be extracted in the analysis is 
to use the tools from the library 'nFactors' (the library can be installed by `install.packages("nFactors")`);</li>
</ul>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />


```{r, warning = F, message = F, fig.width=10, fig.height=6}
library(nFactors)
ev <- eigen(cor(bioData[,2:18])) 
ap <- parallel(subject=nrow(bioData[,2:18]),var=ncol(bioData[,2:18]), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)

plotnScree(nS)
```

For different qualitative criteria are used to propoes the right number of factors to be used. 

```{r, fig.width=10, fig.height=6}
### colors
COLS <- c("red", "green", "blue")
M <- apply(abs(fa1$loadings[,1:3]), 1, max)
cols <- COLS[matrix(rep(c(1,2,3), 17), nrow = 17, byrow = T)[match(M, abs(fa1$loadings[,1:3]))]]

load <- fa1$loadings[,1:2] 
plot(load,type="n") 
text(load,labels=names(bioData[,2:18]),cex=.7, col = cols)
```

Can you explain why there is one covariate (<i>Saprind</i>) clearly out of the ren buch of other related covariates? Remember, that there are three factors used in total and the plot obove only uses two of these factors. How would you interpret the fact, that the covariate <i>MDS_2</i>
is in between the red ones? <br><br>

```{r, fig.width=10, fig.height=6}
### colors
par(mfrow = c(1,2))

load1 <- fa1$loadings[,c(1,3)] 
plot(load1,type="n") 
text(load1,labels=names(bioData[,2:18]),cex=.7, col = cols)


load2 <- fa1$loadings[,2:3] 
plot(load2,type="n") 
text(load2,labels=names(bioData[,2:18]),cex=.7, col = cols)
```

Does it make more sense now? Remember, that the original data are with the dimension $p = 17$! <br><br>

Alternatively, we can also try the library `FactoMineR` and the corresponding function `PCA()` from this library to obtain the complete factor map (covariate map with the mutual correlation). 

```{r, fig.width = 10}
library(FactoMineR)
par(mfrow = c(1,2))
result <- PCA(bioData[,2:18])
```

<br><br>

#### <font color = "#880000">Note</font>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />
<ul>
<li>The `factanal()` function in R uses scaled variables as the starting point for the factor analysis -- the variance-covariance matrix of $\boldsymbol{X}$ is replaced by the corresponding correlation matrix with ones on its diagonal. Remember, that the factor analysis is, unlike the principal component analysis, scale invariant as we are focussing on the covariance structure (the off diagonal elements of the variance-covariance matrix) rather than the variance structure (diagonal elements of the variance-covariance matrix). </li>
</ul>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />
<br>


Now, we have an idea that maybe three factors should be ok. In orther to get the corresponding score values (those are the  values in $\boldsymbol{F}$ in the expression above given for each observations - in our case $n = 65$) from the `factanal()` function we need an additional parameter to be specified: 

```{r}
fa1 <- factanal(bioData[,2:18], factors = 3, scores = "regression")
fa1$scores
```

and the corresponding loadings (matrix $\Lambda$ in the expression above) is obtained as

```{r}
fa1$loadings
```

<br><br>

<hr size="3" width="100%" noshade style="color:#440000" align="left" />
#### <font color = "#880000">Try by yourself</font>
<ul>
<li>Use the factor analysis approach and consider various number of factors used to explain the covariance structure. </li>

<li>Obtain the corresponding regression scores and reconstruct the original covariates using the factor values and the given factor loadings. 
Compare the quality of this approximation. </li>

<li>Use some reasonable error measure and compare the approximation error with respect to the various number of factors.</li>
</ul>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />

<br><br>

### <font color = "#880000"><b>2. Application in Regression or SEM</b></font>
Structural Equation Models (SEM) are statistical modelling techniqes showing great potential especially in causal dependencies between endogenous and exogenous variables, and the measurement model showing the relations between latent variables and their indicators (which is also the case of the latent factors $F$ in the factor analysis). <br><br>

It this situations is common to firstly draw a schematic diagram with the assumed underlying structure. Later, the diagram is used to specify the final model. <br><br>

An example of such diagram is below. 

<center>
<img src = "sem.jpg", width = 900>
</center>

<br><br>


In the R software there is package 'sem' available for the structural equation modelling. 

```{r, warning=F,message=F}
library("sem")
```

and a simple example using the dataset 'PoliticalDemocracy' from the package `library("lavaan")`: 

```{r, warning=F,message=F}
library("lavaan")
data(PoliticalDemocracy)
PoliticalDemocracy
```

with the corresponding model structure, which we want to fit: 

<center>
<img src = "lavaan.png", width = 900>
</center>

Firstly, we define the model:

```{r}
model <- '
  # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'
```


And now, we can fit the corresponding SEM model: 

```{r}
fit <- sem(model, data=PoliticalDemocracy)
summary(fit, standardized=TRUE)
```

<br><br>

<hr size="3" width="100%" noshade style="color:#440000" align="left" />
#### <font color = "#880000">Try by yourself</font>
<ul>
<li>Consider the dataset `HolzingerSwineford1939` from the R package `library("lavaan")` which can be loadid in by the command
```{r}
data(HolzingerSwineford1939)
```
</li>

<li>Apply the factor analysis approach for the covariates $X_{1}, \dots, X_9$ and consider three different factors when explaining the underlying covariance structure 
in amoung these covariats:

```{r}
fa_Hol <- factanal(HolzingerSwineford1939[,7:15], factors = 3, rotation="varimax")
print(fa_Hol, digits=2, cutoff=.45, sort=TRUE)
```
</li>

<li>What is the interpretation of the results from above? Could you propose a linear regresion model (or SEM alternatively) to get some quantitative results?</li>

<li>The confirmatory factor analysis can do the work for you... The R function `cfa()` from the R packages "lavaan" is very similar to the `sem()` function used before. In fact, these two functions are almost identical. 

```{r}
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- cfa(HS.model, data=HolzingerSwineford1939)
summary(fit, fit.measures=TRUE)
``` 

</li>

<li>What would be the corresponding diagram capturing the dependence structure from the model above?</li>

</ul>
<hr size="3" width="100%" noshade style="color:#440000" align="left" />


<br><br><br>

## <font color = "#880000"><b>Homework Assignment</b></font>
### (Deadline: Lab Session 8 / 17.04.2020)

Use the command below to instal the `SMSdata` packages with various multivariate datasets: 

```{r, eval = F}
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
```

Chose one dataset of your choise from the list of all available datasets in the package:

```{r, eval = F}
data(package = "SMSdata") 
```
 
 There are 21 different datasets and you can load each of them by typing and running 
`data(dataset_name)`, for instance:

```{r, eval = F}
data(plasma) 
```

<ul>
<li>Consider the same dataset you applied the principal component analysis to and apply the factor analysis now; </li>

<li>Compare the outcomes of both, PCA and FA;</li>

<li>Chose one covariate as the dependent covariate and try to use the regression modelling approach to model the expectation of this 
covariate using the factors obtained from the factor analysis. Try to find some reasonable interpretation of the final model.</li>
</ul>

<hr size="3" width="100%" noshade style="color:#440000" align="left" />

</td></tr></table>
<br><br><br>