NMST539, LS 2015/16

Cvičenie 7 (týždeň 8)

Factor Analysis with R


Introduction into FA

Factor analysis is an effective statistical method for dimensionaly reduction especially in situation where one needs to use the data for further analysis and calculations (e.g. regression). It is used to describe the overall correlation among all variables however, using a potentially lower number of variables which are called factors. These factors are, however, unobserved random variables.

The factor analysis approach searches for similar covariates with respect to their mutuall correlation. All variables with mutually high correlation are represented with a factor (or a linear combination of factors) instead.

In some sense the factor analysis approach can be considered to be a generalization of the classical principal component analysis (PCA) with one great advantage at the end - much more convenient interpretation of the factors.

In the statistical software R there is function factanal() available under the standard R instalation. Beside that, there are many more additinal functions and packages which can be downloaded and installed in R (e.g. Factanal() function in the ‘FAiR’ package; fa.promax() function in the ‘psych’ package;).

For our purposes we mainly use the standard function factanal(). Let us again recall the biological metrics data from the Czech republic. The data represent 65 different river localities in the Czech Republic where on each locality there are various biological metrics assessed (17 metrics in total).

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)



The correlation structure (wich will be later assessed using the factor analysis approach) can be either estimated using a standard variance covariance matrix (command var(bioData[,2:18])) or it can be visualized using the corrplot() function from the ‘corrplot’ package instead.

library(corrplot)
corrplot(cor(bioData[,2:18]), method="ellipse")

The idea behind the factanal() function is to consider a \(p\)-dimensional (random) vector \(\bolsbymbol{X}\) and to express this vector using a set of \(k\) factors where we require want that \(k \ll p\) (dimensionaly reduction). The model which is fitted by the factor analysis approch by appling the `factanal()’ function in R is the following one:

\[ \boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}, \]

where \(\Lambda\) is a \(p\times k\) dimensional matrix of so called factor loadings, is a \(k\)-dimensional (random) vector represending \(k\) factors (scores respectively) and \(\boldsymbol{e}\) is an approximation error.

The tricky part in this expresssion is that beside the random vector \(\boldsymbol{X}\) no other quantity is directly observed. The problem, as such, is unsolvable unless we pose some additional restrictions. This is done by specifying a varinace covariance structure among all quantities which appear in the expression above:

  • The scores \(\boldsymbol{F}\) are required to be uncorrelated with unit variance (\(Var (\boldsymbol{F}) = \mathbb{I}_{k});\)
  • Error terms \(\boldsymbol{e}\) are independent with some variance (\(Var (\boldsymbol{e}) = \boldsymbol{\Psi}\), where \(var (e_{i}) = \psi_{ii}\));
  • The correlation matrix of \(\boldsymbol{X}\) is decomposed as \(\Sigma = \Lambda \Lambda^{T} + \boldsymbol{\Psi}\).
fa1 <- factanal(bioData[,2:18], factors = 3)
print(fa1, digits=2, cutoff=.6, sort=TRUE)
## 
## Call:
## factanal(x = bioData[, 2:18], factors = 3)
## 
## Uniquenesses:
##    SaprInd      Lital       RETI     EPTAbu       Marg   Metaritr 
##       0.15       0.09       0.05       0.05       0.03       0.02 
##     JepAbu  Epiritral Hyporitral     Ntaxon     Nceled     Bindex 
##       0.10       0.07       0.39       0.21       0.09       0.19 
##     EPTTax     PosAbu    Spasaci      MDS_1      MDS_2 
##       0.10       0.35       0.11       0.13       0.45 
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.79                  
## Lital       0.94                  
## RETI        0.94                  
## EPTAbu      0.90                  
## Metaritr    0.96                  
## JepAbu      0.68            0.65  
## Epiritral   0.89                  
## Hyporitral  0.71                  
## PosAbu      0.79                  
## Spasaci     0.94                  
## MDS_1       0.91                  
## Marg                0.98          
## Ntaxon              0.65          
## Nceled              0.94          
## Bindex              0.79          
## EPTTax              0.83          
## MDS_2                       0.71  
## 
##                Factor1 Factor2 Factor3
## SS loadings       8.96    4.05    1.41
## Proportion Var    0.53    0.24    0.08
## Cumulative Var    0.53    0.77    0.85
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 414.98 on 88 degrees of freedom.
## The p-value is 6.88e-44

Is such factor representation enough? Compare the model with some others for the increasing number of factors used:

fa2 <- factanal(bioData[,2:18], factors = 4)
fa3 <- factanal(bioData[,2:18], factors = 5)
fa4 <- factanal(bioData[,2:18], factors = 9)

There are mainly two different approaches on how to define the number of factors which should be used.

  • Statistical pre-analysis
    One usually runs e.g. a principal components analysis to determin how many factors should be enough (in order to have some reasonable proportion of the overal variability explained);

  • Expert judgement
    Sometimes (under some optimal interpretation options) the estimated factors nice correspond with some latent variables which are not observed, however, can be indentified by some expert judgement.

Note


  • Another option on how to determin the optimal number of factor to be extracted in the analysis is to use the tools from the library ‘nFactors’ (the library can be installed by install.packages("nFactors"));

library(nFactors)
ev <- eigen(cor(bioData[,2:18])) 
ap <- parallel(subject=nrow(bioData[,2:18]),var=ncol(bioData[,2:18]), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)

plotnScree(nS)

load <- fa1$loadings[,1:2] 
plot(load,type="n") 
text(load,labels=names(bioData[,2:18]),cex=.7)

Note


  • The factanal() function in R uses scaled variables as the starting point for the factor analysis - the variance-covariance matrix of \(\boldsymbol{X}\) is replaced by the corresponding correlation matrix with ones on its diagonal.


Now we an idea that maybe three factors should be ok. In orther to get the corresponding score values (those are values of \(\boldsymbol{F}\) in the expression above given for each observations - in our case \(n = 65\)) from the factanal() function we need an additional parameter to be specified:

fa1 <- factanal(bioData[,2:18], factors = 3, scores = "regression")
fa1$scores
##           Factor1     Factor2      Factor3
##  [1,] -0.12449662  0.49318396  0.019226166
##  [2,] -0.33849930 -0.39230318  0.119550138
##  [3,] -0.22044557 -0.38580832  2.320568152
##  [4,]  0.82566318  1.20827606 -0.940891112
##  [5,] -0.13932079  0.89310776 -0.532355972
##  [6,] -1.01455286 -0.65600386 -0.375228204
##  [7,] -0.82343873 -0.62783914 -0.701274829
##  [8,] -0.98728818 -0.13782643 -0.433312546
##  [9,]  0.22418720  0.88632819  0.795206993
## [10,]  0.52292639  1.06261823  0.295526977
## [11,]  0.29559045  1.25010730 -0.529517247
## [12,] -0.50645369 -0.62925864 -0.373956123
## [13,] -0.93565845 -1.35045706 -0.637327245
## [14,] -0.95854352 -0.20085958  1.192711934
## [15,]  0.42955162 -0.44374222 -1.634492803
## [16,]  1.52669131  0.90258023 -0.647334942
## [17,] -1.06358562 -0.98644111 -0.614191669
## [18,] -0.17714883  0.29705426 -0.273963895
## [19,] -1.17296128  0.15482177  0.740586340
## [20,] -0.78876331  1.64768437  1.135756173
## [21,] -0.68592393  1.86494676  0.622906404
## [22,]  0.03378541  0.51636339 -0.492220100
## [23,] -0.98102088  0.67287079  0.410026081
## [24,]  1.72144792  0.93062040 -1.616832483
## [25,] -0.76259043 -1.35110582 -0.851702050
## [26,] -0.91691053 -1.09563523 -0.287746278
## [27,] -0.28310899 -0.08405725 -0.554590635
## [28,]  1.01603809  0.64177486 -1.212012177
## [29,]  0.02890208  0.83487030  0.220552649
## [30,]  2.69443073 -1.16900885  0.606585871
## [31,] -0.47082315 -0.18615905  2.222434682
## [32,]  1.95090263 -1.29573724  1.303667307
## [33,]  0.28558011  1.04412770  0.364248451
## [34,] -0.18909235  0.97786635  0.090795648
## [35,] -0.48400090 -0.41407901 -0.459370022
## [36,] -0.86086894 -1.09247855 -0.356923055
## [37,] -0.83806848 -0.92300383 -0.815412673
## [38,] -0.91971280 -0.62821762 -0.098009986
## [39,] -0.87317702 -0.29454466 -0.077311663
## [40,]  0.96502394 -1.12911936  4.327243670
## [41,] -0.63704689  0.20145093  0.197835134
## [42,] -0.19959392  1.30938359  0.667994311
## [43,]  3.12996486 -1.62033577  0.044354682
## [44,]  1.20880189  0.64341909 -0.149924195
## [45,]  0.06135524 -0.02822842 -0.557196522
## [46,] -0.86225779 -1.61696484 -1.552055930
## [47,] -0.61532943 -0.45390266 -0.419664831
## [48,] -0.14042956  0.55569570 -0.113699662
## [49,] -0.76637088  0.22722578 -0.178657943
## [50,]  0.92871029  1.82419989 -0.448391534
## [51,]  1.52938447  1.37527251 -1.001758261
## [52,] -0.06380299 -2.13721348  0.714719247
## [53,] -0.48147547  0.55671249  1.028660417
## [54,] -0.01773075 -0.25909705 -0.961153139
## [55,]  0.44621941  0.62532458 -0.231747386
## [56,] -0.48055946  0.70571466  0.712530342
## [57,] -0.57932009 -0.16888336  0.267753908
## [58,]  0.41294090  0.99097762 -0.110134319
## [59,] -0.54167359  0.01836471  0.755755149
## [60,] -0.08414099  0.39657781  0.326141043
## [61,] -0.92105906 -1.35457565 -0.853275093
## [62,]  2.50167388 -1.84503963 -0.802830680
## [63,] -0.84252936 -0.72197336 -0.191797001
## [64,]  1.79639695 -1.48006272  0.583465441
## [65,]  0.21360641  1.45044090  0.001460895

and the corresponding loadings (matrix \(\Lambda\) in the expression above) is obtained as

fa1$loadings
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.794  -0.457  -0.109 
## Lital       0.939   0.164         
## RETI        0.941   0.203  -0.142 
## EPTAbu      0.903           0.357 
## Marg                0.984         
## Metaritr    0.959   0.193  -0.150 
## JepAbu      0.684  -0.115   0.646 
## Epiritral   0.886   0.182  -0.340 
## Hyporitral  0.710   0.184   0.275 
## Ntaxon     -0.572   0.654  -0.190 
## Nceled      0.102   0.936  -0.151 
## Bindex      0.421   0.789         
## EPTTax      0.456   0.834         
## PosAbu      0.785          -0.159 
## Spasaci     0.939                 
## MDS_1       0.907   0.190   0.113 
## MDS_2      -0.165  -0.129   0.714 
## 
##                Factor1 Factor2 Factor3
## SS loadings      8.961   4.053   1.409
## Proportion Var   0.527   0.238   0.083
## Cumulative Var   0.527   0.766   0.848

Application in Regression or SEM

In the following, we can try to use the factor scores for the linear regression modelling (e.g. to model a relationship between some chemical covariate and the biologial factors). Another typical use of factor loadings is in structural equation modelling approaches.

chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)