NMST539 | Lab Session 8

Factor Analysis and Applications in R

LS 2017/2018 | Monday 16/04/18

Rmd file (UTF8 coding)

The R-software is available for download from the official website: https://www.r-project.org

A user-friendly interface (one of many): RStudio

Some useful manuals and tutorials for R (in Czech or English):

  • Bína, V., Komárek, A., and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

1. Factor Analysis

Factor analysis is an effective statistical method for dimensionality reduction. It is suitable especially in situations where one needs to use the data for some further analysis and statistical modeling (e.g. regression). Factor analysis describes the overall correlation structure among a set of variables using a potentially lower number of variables, which are called factors. These factors are, however, unobserved – latent random variables.

The factor analysis approach searches for groups of covariates that are similar with respect to their mutual correlation. All variables with mutually high correlations are then represented by a factor (or a linear combination of factors) instead.

In some sense the factor analysis approach can be considered a generalization of the classical principal component analysis (PCA), with one great advantage at the end – a much more convenient interpretation of the factors.

In the statistical software R the function factanal() is available under the standard R installation. Besides that, there are many more additional functions and packages which can be downloaded and installed in R (e.g. the Factanal() function in the ‘FAiR’ package or the fa.promax() function in the ‘psych’ package).

For our purposes we mainly use the standard function factanal(). Let us again recall the biological metrics data from the Czech Republic. The data represent 65 different river localities in the Czech Republic; at each locality various biological metrics were assessed (17 metrics in total).

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)



The correlation structure (which will be later assessed using the factor analysis approach) can be either estimated using a standard variance-covariance matrix (command var(bioData[,2:18])) or it can be visualized using the corrplot() function from the ‘corrplot’ package instead.

library(corrplot)
corrplot(cor(bioData[,2:18]), method="ellipse")



The idea behind the factanal() function is to consider a \(p\)-dimensional (random) vector \(\boldsymbol{X}\) and to express this vector using a set of \(k\) factors where we require that \(k \ll p\) (dimensionality reduction). The model fitted by the factor analysis approach when applying the factanal() function in R is the following:

\[ \boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}, \]

where \(\Lambda\) is a \(p\times k\) dimensional (nonrandom) matrix of so-called factor loadings, \(\boldsymbol{F}\) is a \(k\)-dimensional (random) vector representing the \(k\) factors (common factors, or scores respectively), and \(\boldsymbol{e}\) is an approximation error (the specific factors).

The form of the expression above closely resembles the linear regression model, where

\[ \boldsymbol{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \] where \(\boldsymbol{Y}\) is the vector of dependent variables, \(\mathbb{X}\) is the given model matrix of type \(n \times p\), and \(\boldsymbol{\varepsilon}\) is a random vector – the error terms.

However, the tricky part in the factor decomposition \(\boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}\) is that besides the random vector \(\boldsymbol{X}\) no other quantity is directly observed. The problem, as such, cannot be solved in a direct and unique way unless we impose some additional restrictions. This is done by specifying a variance-covariance structure among all quantities which appear in the expression above:

  • The (random) common factors \(\boldsymbol{F}\) are required to be uncorrelated, with unit variances, such that \(Var (\boldsymbol{F}) = \mathbb{I}_{k}\);
  • The specific factors (error terms) \(\boldsymbol{e}\) are uncorrelated, each with its own variance, thus \(Var (\boldsymbol{e}) = \boldsymbol{\Psi}\) is a diagonal matrix with \(Var (e_{i}) = \psi_{ii}\);
  • The mutual covariance between the common factors \(\boldsymbol{F}\) and the specific factors \(\boldsymbol{e}\) is zero, thus \(Cov(\boldsymbol{F}, \boldsymbol{e}) = \boldsymbol{0}\);
  • The variance-covariance (correlation) matrix of \(\boldsymbol{X}\) is then decomposed as \(\Sigma = \Lambda \Lambda^{T} + \boldsymbol{\Psi}\).
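
To see what these assumptions imply, here is a minimal simulation sketch (all dimensions and numerical values are chosen arbitrarily, for illustration only): data generated from the factor model reproduce the stated decomposition of the variance-covariance matrix.

set.seed(1)
p <- 5; k <- 2; n <- 100000
Lambda <- matrix(runif(p * k, -1, 1), p, k)        # arbitrary loadings (illustration only)
Psi    <- diag(runif(p, 0.1, 0.5))                 # diagonal matrix of specific variances
Fmat   <- matrix(rnorm(n * k), n, k)               # common factors with Var(F) = I_k
e      <- matrix(rnorm(n * p), n, p) %*% sqrt(Psi) # specific factors with Var(e) = Psi
X      <- Fmat %*% t(Lambda) + e                   # each row follows X = Lambda F + e
max(abs(var(X) - (Lambda %*% t(Lambda) + Psi)))    # close to zero for large n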



A common technique used in the factor analysis approach is to assume that the matrix \(\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda\) is diagonal. This introduces the additional constraints needed to solve the problem:

  • from the factor decomposition itself, i.e. the equation \(\Sigma = \Lambda \Lambda^\top + \boldsymbol{\Psi}\), we have \(p (p + 1)/2\) equations in total;
  • the total number of unknown parameters is \(p \times k\) parameters from the matrix \(\Lambda\) plus, in addition, \(p\) parameters from the diagonal matrix \(\boldsymbol{\Psi}\);
  • the additional requirement of diagonality of \(\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda\) imposes \(k(k - 1)/2\) further constraints (one for each off-diagonal element of a symmetric \(k \times k\) matrix);
  • therefore, the overall degrees of freedom of the factor model can be obtained as \[ \frac{(p - k)^2}{2} - \frac{k + p}{2}. \]

The degrees-of-freedom value obtained from the equation above gives an idea of the (upper) number of factors which can be effectively estimated from the model.
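
For illustration, we can evaluate this formula for our data (p = 17 metrics) and k = 3 factors; the value agrees with the 88 degrees of freedom reported by the factanal() output below.

p <- 17; k <- 3
(p - k)^2 / 2 - (p + k) / 2   # overall degrees of freedom of the factor model (88)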

Note


  • Factor analysis is not unique – any orthogonal rotation of the factors (with the loadings rotated accordingly) yields the same fit;
  • Factor analysis is invariant with respect to the scale of the data;



Since the factors are not unique with respect to rotation, it is useful to choose a rotation which makes sense. The automatic procedure which tries to find the factors in such a way that the original covariates can be split into disjoint sets is called the varimax procedure. To apply the varimax procedure for finding a suitable rotation we can use the additional parameter ‘rotation="varimax"’ when calling the function factanal() in R.

fa1 <- factanal(bioData[,2:18], factors = 3, rotation="varimax")
print(fa1, digits=2, cutoff=.6, sort=TRUE)
## 
## Call:
## factanal(x = bioData[, 2:18], factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##    SaprInd      Lital       RETI     EPTAbu       Marg   Metaritr 
##       0.15       0.09       0.05       0.05       0.03       0.02 
##     JepAbu  Epiritral Hyporitral     Ntaxon     Nceled     Bindex 
##       0.10       0.07       0.39       0.21       0.09       0.19 
##     EPTTax     PosAbu    Spasaci      MDS_1      MDS_2 
##       0.10       0.35       0.11       0.13       0.45 
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.79                  
## Lital       0.94                  
## RETI        0.94                  
## EPTAbu      0.90                  
## Metaritr    0.96                  
## JepAbu      0.68            0.65  
## Epiritral   0.89                  
## Hyporitral  0.71                  
## PosAbu      0.79                  
## Spasaci     0.94                  
## MDS_1       0.91                  
## Marg                0.98          
## Ntaxon              0.65          
## Nceled              0.94          
## Bindex              0.79          
## EPTTax              0.83          
## MDS_2                       0.71  
## 
##                Factor1 Factor2 Factor3
## SS loadings       8.96    4.05    1.41
## Proportion Var    0.53    0.24    0.08
## Cumulative Var    0.53    0.77    0.85
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 414.98 on 88 degrees of freedom.
## The p-value is 6.88e-44
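
As a side remark, the same rotated solution can be obtained after the fact – a sketch, assuming we first fit the model without any rotation and then rotate the estimated loadings using the varimax() function from the standard stats package (the result agrees with the output above up to sign changes and the ordering of the factors):

fa0 <- factanal(bioData[,2:18], factors = 3, rotation = "none")
varimax(loadings(fa0))$loadings   # cf. the varimax loadings printed above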

Is such a factor representation sufficient? Compare the model with others obtained for an increasing number of factors (a comparison sketch follows the code below):

fa2 <- factanal(bioData[,2:18], factors = 4, rotation="varimax")
fa3 <- factanal(bioData[,2:18], factors = 5, rotation="varimax")
fa4 <- factanal(bioData[,2:18], factors = 9, rotation="varimax")
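
To compare the fits, one can extract from each fitted object the likelihood ratio test statistic, its degrees of freedom, and the p-value of the hypothesis that the given number of factors is sufficient:

sapply(list(k3 = fa1, k4 = fa2, k5 = fa3, k9 = fa4),
       function(f) c(statistic = unname(f$STATISTIC),  # LR test statistic
                     df        = f$dof,                # degrees of freedom
                     p.value   = unname(f$PVAL)))      # p-value of 'k factors suffice'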

There are several different approaches to choosing the number of factors to be used. For instance, it is easy to see that for each element \(X_{j}\) of the random vector \(\boldsymbol{X} = (X_{1}, \dots, X_{p})^\top\) it holds that

\[ var(X_{j}) =h_{j}^2 + \psi_{j}, \qquad \textrm{where} \qquad h_{j}^2 = \sum_{\ell = 1}^{k} q_{j \ell}^2, \]

and \(\boldsymbol{\Psi} = \textrm{Diag}\{\psi_{1}, \dots, \psi_{p}\}\), and \(\Lambda = \{q_{j \ell}\}_{j, \ell = 1}^{p,k}\). The quantity \(h_{j}^2\) is the part of the overall variability of \(X_{j}\) which is explained by the common factors in \(\boldsymbol{F}\) (also called the communality), while the second quantity, \(\psi_{j}\), is the variability which is left over (also called the specific variability).

It is obvious that for \(k = p\) factors we obtain \(var(X_{j}) = \sum_{\ell = 1}^{p} q_{j \ell}^2\) and the specific variability is zero, thus \(\psi_j = 0\), for any \(j = 1, \dots, p\).

Therefore, to decide how many factors are needed, one can base the decision on a comparison of the communality and the specific variability.
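
For the three-factor model above, this comparison can be read off the fitted object directly (the total column should be approximately one for every variable, since factanal() works on the correlation scale):

h2 <- rowSums(fa1$loadings^2)             # communalities h_j^2
cbind(communality = h2,
      specific    = fa1$uniquenesses,     # specific variabilities psi_j
      total       = h2 + fa1$uniquenesses)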

Alternative approaches to judge the right number of factors:
  • Statistical pre-analysis
    One usually runs e.g. a principal component analysis to determine how many factors should be enough (in order to have some reasonable proportion of the overall variability explained);

  • Expert judgement
    Sometimes (given suitable interpretation options) the estimated factors nicely correspond to latent variables which are not observed but can be identified by expert judgement.

  • Maximum Likelihood Test
    Using the likelihood approach we can also test the null hypothesis \(H_{0}\) that \(\Sigma = \Lambda\Lambda^\top + \boldsymbol{\Psi}\) against a general alternative \(H_{1}\) specifying no restrictions on the variance-covariance matrix \(\Sigma\). The likelihood ratio test is given by the test statistic \[ T = - 2 \log \Bigg[\frac{\textrm{maximum likelihood under $H_{0}$}}{\textrm{maximum likelihood under $H_{1}$}}\Bigg]. \] However, this approach assumes that the data follow a normal distribution with some specific parameters.



Note

  • Another option for determining the optimal number of factors to be extracted in the analysis is to use the tools from the library ‘nFactors’ (the library can be installed by install.packages("nFactors"));

library(nFactors)
ev <- eigen(cor(bioData[,2:18]))   # eigenvalues of the sample correlation matrix
ap <- parallel(subject=nrow(bioData[,2:18]), var=ncol(bioData[,2:18]), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)   # non-graphical scree test solutions

plotnScree(nS)   # scree plot with the suggested numbers of factors

load <- fa1$loadings[,1:2]   # loadings on the first two factors
plot(load, type="n")         # empty plot of the loading space
text(load, labels=names(bioData[,2:18]), cex=.7)   # variable names placed at their loadings

Alternatively, using the library FactoMineR and the function PCA() we can obtain the complete factor map (a map of the covariates reflecting their mutual correlations).

library(FactoMineR)
result <- PCA(bioData[,2:18])



Note


  • The factanal() function in R uses the scaled variables as the starting point for the factor analysis – the variance-covariance matrix of \(\boldsymbol{X}\) is replaced by the corresponding correlation matrix with ones on its diagonal. Remember that factor analysis is, unlike principal component analysis, scale invariant, as we focus on the covariance structure (the off-diagonal elements of the variance-covariance matrix) rather than the variance structure (the diagonal elements). See the quick check below.
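
A quick empirical check of the scale invariance (a sketch; both calls should report essentially identical loadings because factanal() internally works with the correlation matrix):

fa_raw    <- factanal(bioData[,2:18], factors = 3)
fa_scaled <- factanal(scale(bioData[,2:18]), factors = 3)
all.equal(unclass(fa_raw$loadings), unclass(fa_scaled$loadings))   # TRUE up to numerical noise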


Now we have an idea that three factors should perhaps be enough. In order to get the corresponding score values (the values of \(\boldsymbol{F}\) in the expression above, one for each of the \(n = 65\) observations) from the factanal() function, an additional parameter needs to be specified:

fa1 <- factanal(bioData[,2:18], factors = 3, scores = "regression")
fa1$scores
##           Factor1     Factor2      Factor3
##  [1,] -0.12449662  0.49318396  0.019226166
##  [2,] -0.33849930 -0.39230318  0.119550138
##  [3,] -0.22044557 -0.38580832  2.320568152
##  [4,]  0.82566318  1.20827606 -0.940891112
##  [5,] -0.13932079  0.89310776 -0.532355972
##  [6,] -1.01455286 -0.65600386 -0.375228204
##  [7,] -0.82343873 -0.62783914 -0.701274829
##  [8,] -0.98728818 -0.13782643 -0.433312546
##  [9,]  0.22418720  0.88632819  0.795206993
## [10,]  0.52292639  1.06261823  0.295526977
## [11,]  0.29559045  1.25010730 -0.529517247
## [12,] -0.50645369 -0.62925864 -0.373956123
## [13,] -0.93565845 -1.35045706 -0.637327245
## [14,] -0.95854352 -0.20085958  1.192711934
## [15,]  0.42955162 -0.44374222 -1.634492803
## [16,]  1.52669131  0.90258023 -0.647334942
## [17,] -1.06358562 -0.98644111 -0.614191669
## [18,] -0.17714883  0.29705426 -0.273963895
## [19,] -1.17296128  0.15482177  0.740586340
## [20,] -0.78876331  1.64768437  1.135756173
## [21,] -0.68592393  1.86494676  0.622906404
## [22,]  0.03378541  0.51636339 -0.492220100
## [23,] -0.98102088  0.67287079  0.410026081
## [24,]  1.72144792  0.93062040 -1.616832483
## [25,] -0.76259043 -1.35110582 -0.851702050
## [26,] -0.91691053 -1.09563523 -0.287746278
## [27,] -0.28310899 -0.08405725 -0.554590635
## [28,]  1.01603809  0.64177486 -1.212012177
## [29,]  0.02890208  0.83487030  0.220552649
## [30,]  2.69443073 -1.16900885  0.606585871
## [31,] -0.47082315 -0.18615905  2.222434682
## [32,]  1.95090263 -1.29573724  1.303667307
## [33,]  0.28558011  1.04412770  0.364248451
## [34,] -0.18909235  0.97786635  0.090795648
## [35,] -0.48400090 -0.41407901 -0.459370022
## [36,] -0.86086894 -1.09247855 -0.356923055
## [37,] -0.83806848 -0.92300383 -0.815412673
## [38,] -0.91971280 -0.62821762 -0.098009986
## [39,] -0.87317702 -0.29454466 -0.077311663
## [40,]  0.96502394 -1.12911936  4.327243670
## [41,] -0.63704689  0.20145093  0.197835134
## [42,] -0.19959392  1.30938359  0.667994311
## [43,]  3.12996486 -1.62033577  0.044354682
## [44,]  1.20880189  0.64341909 -0.149924195
## [45,]  0.06135524 -0.02822842 -0.557196522
## [46,] -0.86225779 -1.61696484 -1.552055930
## [47,] -0.61532943 -0.45390266 -0.419664831
## [48,] -0.14042956  0.55569570 -0.113699662
## [49,] -0.76637088  0.22722578 -0.178657943
## [50,]  0.92871029  1.82419989 -0.448391534
## [51,]  1.52938447  1.37527251 -1.001758261
## [52,] -0.06380299 -2.13721348  0.714719247
## [53,] -0.48147547  0.55671249  1.028660417
## [54,] -0.01773075 -0.25909705 -0.961153139
## [55,]  0.44621941  0.62532458 -0.231747386
## [56,] -0.48055946  0.70571466  0.712530342
## [57,] -0.57932009 -0.16888336  0.267753908
## [58,]  0.41294090  0.99097762 -0.110134319
## [59,] -0.54167359  0.01836471  0.755755149
## [60,] -0.08414099  0.39657781  0.326141043
## [61,] -0.92105906 -1.35457565 -0.853275093
## [62,]  2.50167388 -1.84503963 -0.802830680
## [63,] -0.84252936 -0.72197336 -0.191797001
## [64,]  1.79639695 -1.48006272  0.583465441
## [65,]  0.21360641  1.45044090  0.001460895

and the corresponding loadings (the matrix \(\Lambda\) in the expression above) are obtained as

fa1$loadings
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.794  -0.457  -0.109 
## Lital       0.939   0.164         
## RETI        0.941   0.203  -0.142 
## EPTAbu      0.903           0.357 
## Marg                0.984         
## Metaritr    0.959   0.193  -0.150 
## JepAbu      0.684  -0.115   0.646 
## Epiritral   0.886   0.182  -0.340 
## Hyporitral  0.710   0.184   0.275 
## Ntaxon     -0.572   0.654  -0.190 
## Nceled      0.102   0.936  -0.151 
## Bindex      0.421   0.789         
## EPTTax      0.456   0.834         
## PosAbu      0.785          -0.159 
## Spasaci     0.939                 
## MDS_1       0.907   0.190   0.113 
## MDS_2      -0.165  -0.129   0.714 
## 
##                Factor1 Factor2 Factor3
## SS loadings      8.961   4.053   1.409
## Proportion Var   0.527   0.238   0.083
## Cumulative Var   0.527   0.766   0.848




Try by yourself

  • Use the factor analysis approach and consider various numbers of factors to explain the variance-covariance structure.
  • Obtain the corresponding regression scores and reconstruct the original covariates using the factor values and the given factor loadings.
  • Use some reasonable error measure and compare the approximation error for the various numbers of factors (a possible starting point is sketched below).
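
A possible starting point for the last two items (a minimal sketch, assuming the three-factor fit fa1 with regression scores from above): reconstruct the standardized covariates from the scores and loadings and measure the approximation error.

Z    <- scale(bioData[,2:18])            # factanal() works on the standardized scale
Xhat <- fa1$scores %*% t(fa1$loadings)   # rank-3 reconstruction of Z
mean((Z - Xhat)^2)                       # overall mean squared approximation error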



2. Application in Regression or SEM

Structural Equation Models (SEM) are statistical modelling techniques with great potential: they combine a structural part, describing the causal dependencies between endogenous and exogenous variables, with a measurement model, describing the relations between latent variables and their observed indicators (which is also the case for the latent factors \(F\) in the factor analysis).

In this situation it is common to first draw a schematic diagram of the assumed underlying structure. Later, the diagram is used to specify the final model.

An example of such diagram is below.



In the R software, the package ‘sem’ is available for structural equation modelling.

library("sem")

and a simple example using the dataset ‘PoliticalDemocracy’ from the ‘lavaan’ package:

library("lavaan")
data(PoliticalDemocracy)
PoliticalDemocracy
##       y1       y2        y3        y4        y5       y6       y7
## 1   2.50 0.000000  3.333333  0.000000  1.250000 0.000000 3.726360
## 2   1.25 0.000000  3.333333  0.000000  6.250000 1.100000 6.666666
## 3   7.50 8.800000  9.999998  9.199991  8.750000 8.094061 9.999998
## 4   8.90 8.800000  9.999998  9.199991  8.907948 8.127979 9.999998
## 5  10.00 3.333333  9.999998  6.666666  7.500000 3.333333 9.999998
## 6   7.50 3.333333  6.666666  6.666666  6.250000 1.100000 6.666666
## 7   7.50 3.333333  6.666666  6.666666  5.000000 2.233333 8.271257
## 8   7.50 2.233333  9.999998  1.496333  6.250000 3.333333 9.999998
## 9   2.50 3.333333  3.333333  3.333333  6.250000 3.333333 3.333333
## 10 10.00 6.666666  9.999998  8.899991  8.750000 6.666666 9.999998
## 11  7.50 3.333333  9.999998  6.666666  8.750000 3.333333 9.999998
## 12  7.50 3.333333  6.666666  6.666666  8.750000 3.333333 6.666666
## 13  7.50 3.333333  9.999998  6.666666  7.500000 3.333333 6.666666
## 14  7.50 7.766664  9.999998  6.666666  7.500000 0.000000 9.999998
## 15  7.50 9.999998  3.333333 10.000000  7.500000 6.666666 9.999998
## 16  7.50 9.999998  9.999998  7.766666  7.500000 1.100000 6.666666
## 17  2.50 3.333333  6.666666  6.666666  5.000000 1.100000 6.666666
## 18  1.25 0.000000  3.333333  3.333333  1.250000 3.333333 3.333333
## 19 10.00 9.999998  9.999998 10.000000  8.750000 9.999998 9.999998
## 20  7.50 3.333299  3.333333  6.666666  7.500000 2.233299 6.666666
## 21 10.00 9.999998  9.999998 10.000000 10.000000 9.999998 9.999998
## 22  1.25 0.000000  0.000000  0.000000  2.500000 0.000000 0.000000
## 23  2.50 0.000000  3.333333  3.333333  2.500000 0.000000 3.333333
## 24  7.50 6.666666  9.999998 10.000000  7.500000 6.666666 9.999998
## 25  8.50 9.999998  6.666666  6.666666  8.750000 9.999998 7.351018
## 26  6.10 0.000000  5.400000  3.333333  0.000000 0.000000 4.696028
## 27  3.30 0.000000  6.666666  3.333333  6.250000 0.000000 6.666666
## 28  2.90 3.333333  6.666666  3.333333  2.385559 0.000000 3.177568
## 29  9.20 0.000000  9.900000  3.333333  7.609660 0.000000 8.118828
## 30  6.90 0.000000  6.666666  3.333333  4.226033 0.000000 0.000000
## 31  2.90 0.000000  3.333333  3.333333  5.000000 0.000000 3.333333
## 32  2.00 0.000000  0.000000  0.000000  0.000000 0.000000 0.000000
## 33  5.00 0.000000  3.333333  3.333333  5.000000 0.000000 3.333333
## 34  5.00 0.000000  9.999998  3.333333  0.000000 0.000000 3.333333
## 35  4.10 9.999998  4.700000  6.666666  3.750000 0.000000 7.827667
## 36  6.30 9.999998  9.999998  6.666666  6.250000 2.233333 6.666666
## 37  5.20 4.999998  6.600000  3.333333  3.633403 1.100000 3.314128
## 38  5.00 3.333333  6.400000  6.666666  2.844997 0.000000 4.429657
## 39  3.10 4.999998  4.200000  5.000000  3.750000 0.000000 6.164304
## 40  4.10 9.999998  6.666666  3.333333  5.000000 0.000000 4.938089
## 41  5.00 9.999998  6.666666  1.666666  5.000000 0.000000 6.666666
## 42  5.00 7.700000  6.666666  8.399997  6.250000 4.358243 9.999998
## 43  5.00 6.200000  9.999998  6.060997  5.000000 2.782771 6.666666
## 44  5.60 4.900000  0.000000  0.000000  6.555647 4.055463 6.666666
## 45  5.70 4.800000  0.000000  0.000000  6.555647 4.055463 0.000000
## 46  7.50 9.999998  7.900000  6.666666  3.750000 9.999998 7.631891
## 47  2.50 0.000000  6.666666  3.333333  2.500000 0.000000 0.000000
## 48  8.90 9.999998  9.700000  6.666666  5.000000 9.999998 9.556024
## 49  7.60 0.000000 10.000000  0.000000  5.000000 1.100000 6.666666
## 50  7.80 9.999998  6.666666  6.666666  5.000000 3.333333 6.666666
## 51  2.50 0.000000  6.666666  3.333333  5.000000 0.000000 6.666666
## 52  3.80 0.000000  5.100000  0.000000  3.750000 0.000000 6.666666
## 53  5.00 3.333333  3.333333  2.233333  5.000000 3.333333 6.666666
## 54  6.25 3.333333  9.999998  2.955702  6.250000 5.566663 9.999998
## 55  1.25 0.000000  3.333333  0.000000  2.500000 0.000000 0.000000
## 56  1.25 0.000000  4.700000  0.736999  2.500000 0.000000 3.333333
## 57  1.25 0.000000  6.666666  0.000000  2.500000 0.000000 5.228375
## 58  7.50 7.766664  9.999998  6.666666  7.500000 3.333333 9.999998
## 59  2.50 0.000000  6.666666  4.433333  5.000000 0.000000 6.666666
## 60  7.50 9.999998  9.999998 10.000000  8.750000 9.999998 9.999998
## 61  1.25 0.000000  0.000000  0.000000  1.250000 0.000000 0.000000
## 62  1.25 0.000000  0.000000  0.000000  0.000000 0.000000 0.000000
## 63  2.50 0.000000  0.000000  0.000000  0.000000 0.000000 6.666666
## 64  6.25 2.233299  6.666666  2.970332  3.750000 3.333299 6.666666
## 65  7.50 9.999998  9.999998 10.000000  7.500000 9.999998 9.999998
## 66  5.00 0.000000  6.100000  0.000000  5.000000 3.333333 9.999998
## 67  7.50 9.999998  9.999998 10.000000  3.750000 9.999998 9.999998
## 68  4.90 2.233333  9.999998  0.000000  5.000000 0.000000 3.621989
## 69  5.00 0.000000  8.200000  0.000000  5.000000 0.000000 0.000000
## 70  2.90 3.333333  6.666666  3.333333  2.500000 3.333333 6.666666
## 71  5.40 9.999998  6.666666  3.333333  3.750000 6.666666 6.666666
## 72  7.50 8.800000  9.999998  6.066666  7.500000 6.666666 9.999998
## 73  7.50 7.000000  9.999998  6.852998  7.500000 6.348340 6.666666
## 74 10.00 6.666666  9.999998 10.000000 10.000000 6.666666 9.999998
## 75  3.75 3.333333  0.000000  0.000000  1.250000 3.333333 0.000000
##           y8       x1       x2       x3
## 1   3.333333 4.442651 3.637586 2.557615
## 2   0.736999 5.384495 5.062595 3.568079
## 3   8.211809 5.961005 6.255750 5.224433
## 4   4.615086 6.285998 7.567863 6.267495
## 5   6.666666 5.863631 6.818924 4.573679
## 6   0.368500 5.533389 5.135798 3.892270
## 7   1.485166 5.308268 5.075174 3.316213
## 8   6.666666 5.347108 4.852030 4.263183
## 9   3.333333 5.521461 5.241747 4.115168
## 10 10.000000 5.828946 5.370638 4.446216
## 11  6.666666 5.916202 6.423247 3.791545
## 12  6.666666 5.398163 6.246107 4.535708
## 13 10.000000 6.622736 7.872074 4.906154
## 14  0.000000 5.204007 5.225747 4.561047
## 15 10.000000 5.509388 6.202536 4.586286
## 16  6.666666 5.262690 5.820083 3.948911
## 17  0.368500 4.700480 5.023881 4.394491
## 18  3.333333 5.209486 4.465908 4.510268
## 19 10.000000 5.916202 6.732211 5.829084
## 20  2.948164 6.523562 6.992096 6.424591
## 21 10.000000 6.238325 6.746412 5.741711
## 22  0.000000 5.976351 6.712956 5.948168
## 23  3.333333 5.631212 5.937536 5.686755
## 24  7.766666 6.033086 6.093570 4.611429
## 25  6.666666 6.196444 6.704414 5.475261
## 26  3.333333 4.248495 2.708050 1.740830
## 27  3.333333 5.141664 4.564348 2.255134
## 28  1.116666 4.174387 3.688879 3.046927
## 29  3.333333 4.382027 2.890372 1.711279
## 30  0.000000 4.290459 1.609438 1.001674
## 31  3.333333 4.934474 4.234107 1.418971
## 32  0.000000 3.850148 1.945910 2.345229
## 33  3.333333 5.181784 4.394449 3.167167
## 34  0.744370 5.062595 4.595120 3.834970
## 35  6.666666 4.691348 4.143135 2.255134
## 36  2.955702 4.248495 3.367296 3.217506
## 37  3.333333 5.564520 5.236442 2.677633
## 38  1.485166 4.727388 3.610918 1.418971
## 39  3.333333 4.143135 2.302585 1.418971
## 40  2.233333 4.317488 4.955827 4.249888
## 41  0.368500 5.141664 4.430817 3.046927
## 42  4.141377 4.488636 3.465736 2.013579
## 43  4.974739 4.615121 4.941642 2.255134
## 44  3.821796 3.850148 2.397895 1.740830
## 45  0.000000 3.970292 2.397895 1.050741
## 46  6.666666 3.784190 3.091042 2.113313
## 47  0.000000 3.806662 2.079442 2.137561
## 48  6.666666 4.532599 3.610918 1.587802
## 49  1.099999 5.117994 4.934474 3.834970
## 50  6.666666 5.049856 5.111988 4.381490
## 51  3.333333 5.393628 5.638355 4.169451
## 52  1.485166 4.477337 3.931826 2.474671
## 53  5.566663 5.257495 5.840642 5.001796
## 54  6.666666 5.379897 5.505332 3.299937
## 55  0.000000 5.298317 6.274762 4.381490
## 56  3.333333 4.859812 5.669881 3.537416
## 57  0.000000 4.969813 5.564520 4.510268
## 58  6.666666 6.011267 6.253829 5.001796
## 59  1.485166 5.075174 5.252273 5.350708
## 60 10.000000 6.736967 7.125283 6.330518
## 61  0.000000 5.225747 5.451038 3.167167
## 62  0.000000 4.025352 1.791759 2.657972
## 63  2.948164 4.234107 2.708050 2.474671
## 64  3.333333 4.644391 5.564520 3.046927
## 65 10.000000 4.418841 4.941642 3.380653
## 66  3.333333 4.262680 4.219508 4.368462
## 67 10.000000 4.875197 4.700480 3.834970
## 68  3.333333 4.189655 1.386294 1.418971
## 69  0.000000 4.521789 4.127134 2.113313
## 70  3.333333 4.653960 3.555348 1.881917
## 71  1.485166 4.477337 3.091042 1.987909
## 72  6.666666 5.337538 5.631212 3.491004
## 73  7.508044 6.129050 6.403574 5.001796
## 74 10.000000 5.003946 4.962845 3.976994
## 75  0.000000 4.488636 4.897840 2.867566

with the corresponding model structure, which we want to fit:

Firstly, we define the model:

model <- '
  # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

where the operator =~ defines a latent variable through its observed indicators, ~ specifies a regression, and the double tilde (variable ~~ variable) specifies a residual variance or covariance between two observed variables;

and finally, we can fit the model:

fit <- sem(model, data=PoliticalDemocracy)
summary(fit, standardized=TRUE)
## lavaan (0.5-23.1097) converged normally after  68 iterations
## 
##   Number of observations                            75
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               38.125
##   Degrees of freedom                                35
##   P-value (Chi-square)                           0.329
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ind60 =~                                                              
##     x1                1.000                               0.670    0.920
##     x2                2.180    0.139   15.742    0.000    1.460    0.973
##     x3                1.819    0.152   11.967    0.000    1.218    0.872
##   dem60 =~                                                              
##     y1                1.000                               2.223    0.850
##     y2                1.257    0.182    6.889    0.000    2.794    0.717
##     y3                1.058    0.151    6.987    0.000    2.351    0.722
##     y4                1.265    0.145    8.722    0.000    2.812    0.846
##   dem65 =~                                                              
##     y5                1.000                               2.103    0.808
##     y6                1.186    0.169    7.024    0.000    2.493    0.746
##     y7                1.280    0.160    8.002    0.000    2.691    0.824
##     y8                1.266    0.158    8.007    0.000    2.662    0.828
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   dem60 ~                                                               
##     ind60             1.483    0.399    3.715    0.000    0.447    0.447
##   dem65 ~                                                               
##     ind60             0.572    0.221    2.586    0.010    0.182    0.182
##     dem60             0.837    0.098    8.514    0.000    0.885    0.885
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .y1 ~~                                                                 
##    .y5                0.624    0.358    1.741    0.082    0.624    0.296
##  .y2 ~~                                                                 
##    .y4                1.313    0.702    1.871    0.061    1.313    0.273
##    .y6                2.153    0.734    2.934    0.003    2.153    0.356
##  .y3 ~~                                                                 
##    .y7                0.795    0.608    1.308    0.191    0.795    0.191
##  .y4 ~~                                                                 
##    .y8                0.348    0.442    0.787    0.431    0.348    0.109
##  .y6 ~~                                                                 
##    .y8                1.356    0.568    2.386    0.017    1.356    0.338
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x1                0.082    0.019    4.184    0.000    0.082    0.154
##    .x2                0.120    0.070    1.718    0.086    0.120    0.053
##    .x3                0.467    0.090    5.177    0.000    0.467    0.239
##    .y1                1.891    0.444    4.256    0.000    1.891    0.277
##    .y2                7.373    1.374    5.366    0.000    7.373    0.486
##    .y3                5.067    0.952    5.324    0.000    5.067    0.478
##    .y4                3.148    0.739    4.261    0.000    3.148    0.285
##    .y5                2.351    0.480    4.895    0.000    2.351    0.347
##    .y6                4.954    0.914    5.419    0.000    4.954    0.443
##    .y7                3.431    0.713    4.814    0.000    3.431    0.322
##    .y8                3.254    0.695    4.685    0.000    3.254    0.315
##     ind60             0.448    0.087    5.173    0.000    1.000    1.000
##    .dem60             3.956    0.921    4.295    0.000    0.800    0.800
##    .dem65             0.172    0.215    0.803    0.422    0.039    0.039




Try by yourself

  • Consider the dataset HolzingerSwineford1939 from the R package library("lavaan"), which can be loaded by the command

    data(HolzingerSwineford1939)
  • Apply the factor analysis approach to the covariates \(X_{1}, \dots, X_9\) and consider three different factors when explaining the underlying covariance structure among these covariates:

    fa_Hol <- factanal(HolzingerSwineford1939[,7:15], factors = 3, rotation="varimax")
    print(fa_Hol, digits=2, cutoff=.45, sort=TRUE)
    ## 
    ## Call:
    ## factanal(x = HolzingerSwineford1939[, 7:15], factors = 3, rotation = "varimax")
    ## 
    ## Uniquenesses:
    ##   x1   x2   x3   x4   x5   x6   x7   x8   x9 
    ## 0.51 0.75 0.54 0.28 0.24 0.31 0.50 0.47 0.54 
    ## 
    ## Loadings:
    ##    Factor1 Factor2 Factor3
    ## x4  0.83                  
    ## x5  0.86                  
    ## x6  0.80                  
    ## x1          0.62          
    ## x3          0.66          
    ## x7                  0.70  
    ## x8                  0.71  
    ## x9                  0.52  
    ## x2          0.49          
    ## 
    ##                Factor1 Factor2 Factor3
    ## SS loadings       2.18    1.34    1.33
    ## Proportion Var    0.24    0.15    0.15
    ## Cumulative Var    0.24    0.39    0.54
    ## 
    ## Test of the hypothesis that 3 factors are sufficient.
    ## The chi square statistic is 22.38 on 12 degrees of freedom.
    ## The p-value is 0.0335
  • What is the interpretation of the results above? Could you propose a linear regression model (or a SEM alternatively) to get some quantitative results?
  • The confirmatory factor analysis can do the work for you… The R function cfa() from the R package “lavaan” is very similar to the sem() function used before. In fact, these two functions are almost identical.

    HS.model <- ' visual  =~ x1 + x2 + x3
                  textual =~ x4 + x5 + x6
                  speed   =~ x7 + x8 + x9 '
    
    fit <- cfa(HS.model, data=HolzingerSwineford1939)
    summary(fit, fit.measures=TRUE)
    ## lavaan (0.5-23.1097) converged normally after  35 iterations
    ## 
    ##   Number of observations                           301
    ## 
    ##   Estimator                                         ML
    ##   Minimum Function Test Statistic               85.306
    ##   Degrees of freedom                                24
    ##   P-value (Chi-square)                           0.000
    ## 
    ## Model test baseline model:
    ## 
    ##   Minimum Function Test Statistic              918.852
    ##   Degrees of freedom                                36
    ##   P-value                                        0.000
    ## 
    ## User model versus baseline model:
    ## 
    ##   Comparative Fit Index (CFI)                    0.931
    ##   Tucker-Lewis Index (TLI)                       0.896
    ## 
    ## Loglikelihood and Information Criteria:
    ## 
    ##   Loglikelihood user model (H0)              -3737.745
    ##   Loglikelihood unrestricted model (H1)      -3695.092
    ## 
    ##   Number of free parameters                         21
    ##   Akaike (AIC)                                7517.490
    ##   Bayesian (BIC)                              7595.339
    ##   Sample-size adjusted Bayesian (BIC)         7528.739
    ## 
    ## Root Mean Square Error of Approximation:
    ## 
    ##   RMSEA                                          0.092
    ##   90 Percent Confidence Interval          0.071  0.114
    ##   P-value RMSEA <= 0.05                          0.001
    ## 
    ## Standardized Root Mean Square Residual:
    ## 
    ##   SRMR                                           0.065
    ## 
    ## Parameter Estimates:
    ## 
    ##   Information                                 Expected
    ##   Standard Errors                             Standard
    ## 
    ## Latent Variables:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##   visual =~                                           
    ##     x1                1.000                           
    ##     x2                0.554    0.100    5.554    0.000
    ##     x3                0.729    0.109    6.685    0.000
    ##   textual =~                                          
    ##     x4                1.000                           
    ##     x5                1.113    0.065   17.014    0.000
    ##     x6                0.926    0.055   16.703    0.000
    ##   speed =~                                            
    ##     x7                1.000                           
    ##     x8                1.180    0.165    7.152    0.000
    ##     x9                1.082    0.151    7.155    0.000
    ## 
    ## Covariances:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##   visual ~~                                           
    ##     textual           0.408    0.074    5.552    0.000
    ##     speed             0.262    0.056    4.660    0.000
    ##   textual ~~                                          
    ##     speed             0.173    0.049    3.518    0.000
    ## 
    ## Variances:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##    .x1                0.549    0.114    4.833    0.000
    ##    .x2                1.134    0.102   11.146    0.000
    ##    .x3                0.844    0.091    9.317    0.000
    ##    .x4                0.371    0.048    7.779    0.000
    ##    .x5                0.446    0.058    7.642    0.000
    ##    .x6                0.356    0.043    8.277    0.000
    ##    .x7                0.799    0.081    9.823    0.000
    ##    .x8                0.488    0.074    6.573    0.000
    ##    .x9                0.566    0.071    8.003    0.000
    ##     visual            0.809    0.145    5.564    0.000
    ##     textual           0.979    0.112    8.737    0.000
    ##     speed             0.384    0.086    4.451    0.000
  • What would be the corresponding diagram capturing the dependence structure from the model above?




Homework Assignment

(Deadline: Lab Session 10 / 30.04.2018)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset of your choice from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 
  • Consider the same dataset you applied the principal component analysis to, and apply the factor analysis now;
  • Compare the outcomes of both PCA and FA;
  • Choose one covariate as the dependent variable and try to use the regression modelling approach to model its expectation using the factors obtained from the factor analysis. Try to find some reasonable interpretation of the final model.