Exploratory Factor Analysis in R: A Practical Guide

This guide walks you through exploratory factor analysis (EFA) in R, from preparing your data to running the analysis and interpreting the results.


Key takeaways from this article

  • EFA is an exploratory technique that tries to find the best factor model that fits the data without any prior assumptions or constraints.
  • CFA is a confirmatory technique that tests whether a predefined factor model fits the data with some specified assumptions or constraints.
  • EFA and CFA have different purposes and applications and can complement each other in factor analysis.
  • To perform EFA and CFA in R, you need to use the psych and lavaan packages, which provide various functions for factor analysis and latent variable analysis.
  • To interpret the results of EFA and CFA, you need to look at the factor loadings, factor scores, fit indices, and other statistics that indicate how well the factor model represents the data and what each factor means.

Functions used in this tutorial and their descriptions

  • psych::fa.parallel(): Performs parallel analysis and provides scree plots and other statistics for determining the number of factors to extract
  • psych::fa(): Performs EFA with various options for rotation and extraction methods
  • psych::factor.scores(): Computes factor scores and standard errors for each observation and each factor
  • psych::describe(): Provides descriptive statistics for each variable or factor
  • lavaan::cfa(): Performs CFA with various options for estimation and specification methods
  • lavaan::summary(): Provides summary statistics and fit indices for a CFA model

What is Exploratory Factor Analysis?

EFA is a statistical method that aims to identify the underlying structure of a set of variables. It assumes that each variable is influenced by one or more factors that are not directly observable. The factors can be considered common sources of variation that affect the variables.
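
Formally, EFA rests on the common factor model. In standard notation (a reference formula, not used elsewhere in this article), each observed variable is a weighted combination of the latent factors plus a unique error term:

$$x = \Lambda f + \varepsilon, \qquad \operatorname{Cov}(x) = \Lambda \Phi \Lambda^{\top} + \Psi,$$

where x is the vector of observed variables, Λ the matrix of factor loadings, f the latent factors with correlation matrix Φ, and Ψ the diagonal matrix of unique (error) variances.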

For example, suppose you have a dataset that contains ten variables related to the performance of cars, such as miles per gallon, horsepower, weight, etc. You might wonder if some underlying factors can explain why some cars perform better than others. EFA can help you answer this question by determining how many factors are needed to account for the variation in the data and how each variable is related to each factor.

EFA differs from principal component analysis (PCA), another dimensionality reduction technique. The comparison below summarizes the pros and cons of PCA, EFA, and CFA:

PCA

  Pros:
  • Creates new variables with maximum variance.
  • No assumptions or constraints.
  • Useful for data compression, visualization, etc.

  Cons:
  • Components may not be meaningful or interpretable.
  • Uses all variance and ignores measurement error.
  • No statistical model or hypothesis test.

EFA

  Pros:
  • Finds latent factors that explain the data structure.
  • Allows for meaningful and interpretable factors.
  • Uses common variance and accounts for measurement error.

  Cons:
  • Subjective and arbitrary decisions for the number and rotation of factors.
  • Assumes normal and linear factors and variables.
  • No specific hypotheses or model comparison.

CFA

  Pros:
  • Tests validity and reliability of a factor model.
  • Estimates parameters and fit indices with confidence and significance.
  • Allows for specific hypotheses or model comparison.

  Cons:
  • Requires prior knowledge and specification of the factor model.
  • Assumes normal and linear factors and variables.
  • Sensitive to sample size, outliers, missing values, etc.

How to Perform EFA in R?

You must first install R and RStudio properly and load the psych package, which provides various functions for psychological research and data analysis.

Read more about how to install RStudio and libraries in RStudio.

You can install it from CRAN using the following command:

install.packages("psych")
Then, you can load it using:
library(psych)

Load Your Data Set

Next, you need to prepare your data for analysis. The data should be in:

  • Matrix or
  • Data frame format,

where each row represents an observation (e.g., a car) and each column represents a variable (e.g., miles per gallon). The data should also be numeric and continuous, as EFA cannot handle categorical or ordinal variables, and it should be checked for missing values and outliers, as they can affect the results.

For this tutorial, we will use the mtcars dataset, which is built into R. It contains 32 observations and 11 variables related to various aspects of car performance. You can view the first six rows of the data using:

data(mtcars)
head(mtcars)
The output is:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

As you can see, some variables are categorical rather than continuous, even though they are stored as numbers: cyl, vs, am, gear, and carb indicate the number of cylinders, engine type, transmission type, number of gears, and number of carburettors, respectively.

We will exclude these variables from the EFA, as they are unsuitable for this technique. We will also exclude mpg, the dependent variable in our analysis, because we are interested in determining the factors that affect the miles per gallon of the cars.

To select only the numeric and continuous variables, we can use the following command:

mtcars_num <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]
It creates a new data frame called mtcars_num that contains only the five variables we want to use for analysis. You can check the structure of the data using:
str(mtcars_num)
The output is:

'data.frame': 32 obs. of  5 variables:
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...

As you can see, the data frame has 32 observations and five variables, all of which are numeric. Next, we need to check for missing values in the data. We can use the is.na() function to identify any NA values in the data frame, and then use the any() function to see if there are any missing values at all.

The command is:

any(is.na(mtcars_num))
The output is:

[1] FALSE

It means that there are no missing values in our data frame. If there were missing values, we would need to deal with them before performing the analysis. One way to deal with missing values is to remove them from the data using the na.omit() function.

It creates a new data frame containing only the complete cases, i.e., the observations with no missing values in any variable. The command is:

mtcars_num <- na.omit(mtcars_num)
It will overwrite the original data frame with a new one containing no missing values. You can check the number of observations in the new data frame using:
nrow(mtcars_num)
The output is:

[1] 32

Next, we need to check for outliers in the data. Outliers are extreme values that deviate significantly from the rest of the data and can affect the results by inflating or deflating the variance and correlation estimates. 

One way to detect outliers is to use boxplots, which show the distribution of each variable and highlight any potential outliers as dots beyond the whiskers of the box. We can use the boxplot() function to create boxplots for each variable in the data frame. The command is:

boxplot(mtcars_num)
It will create a plot like this:

Figure: Outlier detection using boxplots for the mtcars_num variables.

As you can see, there are outliers in some of the variables, such as disp, hp, and qsec. Before performing the analysis, we must decide whether to keep or remove these outliers from the data. There is no definitive rule for dealing with outliers, as it depends on the context and purpose of the analysis. Some outliers might be valid and meaningful observations that reflect real variation in the data, while others might be errors or anomalies that should be excluded or corrected.
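
To see which observations drive those dots, here is a minimal sketch using the default 1.5 × IQR whisker rule of boxplot.stats():

# List the cars flagged as potential outliers for each variable
outlier_cars <- lapply(mtcars_num, function(x) {
  rownames(mtcars_num)[x %in% boxplot.stats(x)$out]
})
outlier_cars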

For this tutorial, we will keep all the outliers in the data, as they may represent genuine features of car performance that we want to explore further. However, you should be aware that this might affect the results of EFA and make them less reliable or generalizable.

How to Determine the Number of Factors to Extract?

One of the most essential decisions is how many factors to extract from the data. It determines how many latent variables or constructs we assume to underlie our observed variables. Extracting too many factors might result in overfitting or redundancy, while extracting too few factors might result in underfitting or losing information.
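
Several criteria are in common use, including the Kaiser criterion (retain factors whose eigenvalues exceed 1), the scree plot, and parallel analysis. As a quick first check, here is a minimal sketch of the Kaiser criterion computed directly from the correlation matrix:

ev <- eigen(cor(mtcars_num))$values  # eigenvalues of the correlation matrix
ev
sum(ev > 1)  # Kaiser criterion: suggested number of factors under this heuristic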

To apply these criteria in R, we will use the fa.parallel() function from the psych package, which performs parallel analysis and provides scree plots and other statistics for determining the number of factors to extract. The command is:

fa.parallel(mtcars_num)
It will create a plot like this:
Figure: Parallel analysis scree plot produced by fa.parallel().

The plot shows three lines:

  1. Observed eigenvalues,
  2. Simulated eigenvalues,
  3. Their difference.

The plot also indicates:

  1. The number of factors suggested by parallel analysis (PA), and
  2. The number of factors suggested by minimum rank factor analysis (MRFA).

According to parallel analysis, we should retain only one factor from our data, as it is the only one whose eigenvalue is larger than that of the corresponding random factor.

  • According to MRFA, we should retain two factors, as this solution ranks best among all possible factor solutions.
  • The scree plot also shows a clear elbow at the second factor, suggesting that additional factors would explain only a little more variance.

Also, the first factor explains about 79% of the total variance in the data, which is a very high proportion.

Based on these results, one or two factors best represent our data. However, we should also consider the interpretability and meaningfulness of the factors, as well as the theoretical and practical relevance of our analysis. 

For example, we should retain two factors if they correspond to some meaningful dimensions of car performance, such as power and efficiency. Alternatively, we could retain only one factor if it captures the overall quality or performance of the cars.

We will retain two factors from our data for this tutorial, which might provide more insight and information than retaining only one factor. However, you should be aware that this is a subjective and arbitrary decision, and you might get different results or interpretations if you choose a different number of factors.

How to Rotate the Factors?

Once we have decided on the number of factors to extract, we need to rotate them to make them more interpretable and meaningful. Rotation changes the orientation or direction of the factors without changing their explanatory power or fit to the data. It helps us identify which variables load highly on which factors and what each factor represents or measures.

To illustrate how to use these methods in R, we will use the fa() function from the psych package, which performs EFA with various options for rotation and extraction methods. The command is:

fa(mtcars_num, nfactors = 2, rotate = "varimax")
It will perform EFA with two factors and varimax rotation on our data frame. 

The output is:

Factor Analysis using method =  minres
Call: fa(r = mtcars_num, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
       MR1   MR2   h2    u2 com
disp  0.89  0.40 0.96 0.038 1.4
hp    0.57  0.70 0.82 0.180 1.9
drat -0.75 -0.05 0.57 0.428 1.0
wt    0.94  0.15 0.90 0.100 1.0
qsec -0.04 -0.97 0.95 0.049 1.0

                       MR1  MR2
SS loadings           2.58 1.63
Proportion Var        0.52 0.33
Cumulative Var        0.52 0.84
Proportion Explained  0.61 0.39
Cumulative Proportion 0.61 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

df null model =  10  with the objective function =  4.54 with Chi-Square =  129.53
df of  the model are 1  and the objective function was  0.05

The root mean square of the residuals (RMSR) is  0.01
The df corrected root mean square of the residuals is  0.03

The harmonic n.obs is  32 with the empirical chi-square  0.08  with prob <  0.78
The total n.obs was  32  with Likelihood Chi Square =  1.46  with prob <  0.23

Tucker Lewis Index of factoring reliability =  0.959
RMSEA index =  0.116  and the 90 % confidence intervals are  0 0.513
BIC =  -2
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                   MR1  MR2
Correlation of (regression) scores with factors   0.98 0.98
Multiple R square of scores with factors          0.97 0.95
Minimum correlation of possible factor scores     0.93 0.91

The output shows the factor loadings, which are the correlations between each variable and each factor. The loadings are standardized, ranging from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. The loadings can be interpreted as the weights or coefficients of each variable in the linear combination that forms each factor.

The output also shows the communality (h2), the proportion of variance in each variable explained by the factors. The communality ranges from 0 to 1, where 0 indicates that the factors explain none of the variance and 1 indicates that they explain all of it. The communality is calculated as the sum of the squared loadings for each variable.

The output also shows the uniqueness (u2), the proportion of variance in each variable that is not explained by the factors. The uniqueness ranges from 0 to 1, where 0 indicates that the factors explain all of the variance and 1 indicates that they explain none of it. The uniqueness equals one minus the communality for each variable.

The output also shows the complexity (com), which measures how many factors influence each variable. The complexity ranges from 1 to n, where n is the number of factors; a value of 1 indicates that only one factor influences the variable, and n indicates that all factors influence it equally. The complexity is Hoffman's index, calculated for each variable as the squared sum of its squared loadings divided by the sum of its loadings raised to the fourth power.
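
These quantities can be recomputed by hand from the loading matrix. A small sketch, assuming the fitted object from the fa() call above (re-created here as efa_fit):

efa_fit <- fa(mtcars_num, nfactors = 2, rotate = "varimax")
L <- unclass(efa_fit$loadings)        # 5 x 2 matrix of standardized loadings

h2  <- rowSums(L^2)                   # communality: sum of squared loadings
u2  <- 1 - h2                         # uniqueness: one minus communality
com <- rowSums(L^2)^2 / rowSums(L^4)  # Hoffman's complexity index

round(cbind(h2, u2, com), 2)          # matches the h2, u2, and com columns above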

The output also shows some statistics for each factor, such as:

  • SS loadings: The sum of squared loadings for each factor, which measures how much of the variables' variance each factor explains.
  • Proportion Var: The proportion of variance in the variables explained by each factor, calculated as the SS loadings divided by the number of variables.
  • Cumulative Var: The cumulative proportion of variance explained by each factor and all previous factors, i.e., the running sum of Proportion Var.
  • Proportion Explained: The share of the total explained variance attributable to each factor, calculated as the factor's SS loadings divided by the total SS loadings across all factors.
  • Cumulative Proportion: The running sum of Proportion Explained across the factors.

The output also tests the hypothesis that two factors are sufficient to represent the data. This test compares the fit of the two-factor model with that of a null model that assumes no factors. The test statistic follows a chi-square distribution and measures how well the observed correlation matrix matches the correlation matrix predicted by the factor model.

The p-value measures the likelihood of obtaining a test statistic as extreme as, or more extreme than, the one observed if the null hypothesis were true. A low p-value (usually less than 0.05) indicates that we can reject the hypothesis that two factors are sufficient and conclude that more factors are needed. A high p-value (usually greater than 0.05) indicates that we cannot reject the hypothesis, so two factors appear adequate.

In our output, the likelihood chi-square is 1.46 with a p-value of 0.23, so we cannot reject the hypothesis that two factors are sufficient: the two-factor model appears to fit the data adequately. However, with only 32 observations the test has little power, so we should also rely on the other criteria reported above, such as the RMSR and the Tucker Lewis Index, to evaluate the adequacy and validity of our factor model.

How to Interpret Factor Loadings and Factor Scores?

After rotating the factors, we need to interpret what they mean and what they measure. One way to do this is to look at their factor loadings, which indicate how strongly each variable is related to each factor. We can use some rules of thumb to decide which loadings are significant:

  • Loadings greater than or equal to 0.4 are considered high and indicate a strong relationship between a variable and a factor.
  • Loadings between 0.3 and 0.4 are considered moderate and indicate a moderate relationship between a variable and a factor.
  • Loadings between 0.2 and 0.3 are considered low and indicate a weak relationship between a variable and a factor.
  • Loadings less than or equal to 0.2 are considered negligible and indicate no relationship between a variable and a factor.

Based on these rules, we can label our factors based on their highest loading variables:

  • Factor 1 (MR1): This factor has high loadings on disp, wt, and hp, and a high negative loading on drat, variables related to the engine size, weight, and power of the cars. We can label this factor the Power Factor, which measures how powerful the cars are.
  • Factor 2 (MR2): This factor has a high loading on hp and a strong negative loading on qsec, the quarter-mile time of the cars. We can label this factor the Efficiency Factor, which measures how quickly and efficiently the cars perform.

We can also compute and interpret the factor scores, which are the values of each factor for each observation. Factor scores are standardized, meaning they have a mean of zero and a standard deviation of one. Factor scores can be used to compare and rank the observations based on their performance on each factor. 

For example, a high factor score on the Power Factor indicates that a car is more powerful than average, while a low factor score on the Efficiency Factor indicates that a car is less efficient than average.

We can use the factor.scores() function from the psych package to compute the factor scores for our data frame. The command is:

factor.scores(mtcars_num, fa(mtcars_num, nfactors = 2, rotate = "varimax"))
This will create a list containing the factor scores, the scoring weights, and several other diagnostics.

The output is:

$scores
                             MR1         MR2
Mazda RX4           -0.921154717  0.72669627
Mazda RX4 Wag       -0.724071306  0.42084921
Datsun 710          -0.900927019 -0.40335627
Hornet 4 Drive       0.504574526 -0.94200922
Hornet Sportabout    0.705788882  0.41235880
Valiant              0.576065524 -1.38917535
Duster 360           0.530475076  1.15356784
Merc 240D           -0.101532849 -1.25188206
Merc 230             0.411191502 -2.68340966
Merc 280            -0.234006650 -0.25346328
Merc 280C           -0.115802867 -0.56645912
Merc 450SE           0.515443129  0.19994421
Merc 450SL           0.439164549  0.11390241
Merc 450SLC          0.534978812 -0.09745121
Cadillac Fleetwood   2.185399740 -0.16994270
Lincoln Continental  2.140290855 -0.06981115
Chrysler Imperial    1.903364677  0.19210167
Fiat 128            -0.968197521 -0.87151872
Honda Civic         -1.446625723 -0.30445611
Toyota Corolla      -1.063672670 -1.06580989
Toyota Corona       -0.492117498 -1.14803618
Dodge Challenger     0.486552001  0.41905302
AMC Javelin          0.428578275  0.23192377
Camaro Z28           0.436120074  1.40512260
Pontiac Firebird     1.086977291  0.36164228
Fiat X1-9           -1.168916242 -0.55997655
Porsche 914-2       -1.316376467  0.64097649
Lotus Europa        -1.575234883  0.56511564
Ford Pantera L      -0.001536175  1.98792379
Ferrari Dino        -1.105950772  1.31987628
Maserati Bora       -0.060029559  2.00200215
Volvo 142E          -0.688811997 -0.37629897

$weights
             MR1         MR2
disp  0.71764710 -0.02455326
hp    0.01758554  0.12542433
drat -0.04483312  0.04090757
wt    0.33290530 -0.05263567
qsec  0.35203908 -0.93217633

$r.scores
             MR1          MR2
MR1 1.000000e+00 9.020562e-17
MR2 1.110223e-16 1.000000e+00

$missing
[1] FALSE

$R2
[1] 0.9826638 0.9766418

The output shows the factor scores for each observation ($scores), the weights used to compute them ($weights), the correlations among the estimated scores ($r.scores), an indicator of missing data ($missing), and the squared multiple correlations of the scores with the factors ($R2).

To interpret the factor scores, we can look at some examples of cars with high or low scores on each factor (the snippet after this list shows how to rank them):

  • Mazda RX4: This car has a low score on the Power Factor (-0.92) and a fairly high score on the Efficiency Factor (0.73). This means it is less powerful but quicker than average.
  • Duster 360: This car scores above average on both factors (0.53 and 1.15). This means it is both more powerful and quicker than average.
  • Merc 240D: This car has a near-average score on the Power Factor (-0.10) and a low score on the Efficiency Factor (-1.25). This means it is about average in power but slower than average.
  • Hornet Sportabout: This car scores moderately above average on both factors (0.71 and 0.41).
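
Because the scores are standardized, sorting them ranks the cars on each factor. A minimal sketch:

scores <- factor.scores(mtcars_num,
                        fa(mtcars_num, nfactors = 2, rotate = "varimax"))$scores

head(scores[order(scores[, "MR1"], decreasing = TRUE), ], 3)  # top 3 on the Power Factor
head(scores[order(scores[, "MR2"], decreasing = TRUE), ], 3)  # top 3 on the Efficiency Factor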

The standard errors of the factor scores indicate how precise or reliable the scores are, based on the sample size and the factor loadings. A small standard error means the factor score is close to its true value, while a large standard error means the factor score is more uncertain or variable.

We can also use the describe() function from the psych package to get descriptive statistics for each factor, such as the mean, standard deviation, minimum, maximum, etc.

describe(factor.scores(mtcars_num, fa(mtcars_num, nfactors = 2, rotate = "varimax"))$scores)

    vars  n mean sd median trimmed  mad   min  max range  skew kurtosis   se
MR1    1 32    0  1  -0.03   -0.07 0.94 -1.58 2.19  3.76  0.42    -0.49 0.18
MR2    2 32    0  1   0.02    0.00 0.83 -2.68 2.00  4.69 -0.16     0.24 0.18


The output shows that the mean and median of both factors are approximately zero and their standard deviations are one, which is expected because the scores are standardized. The minimum and maximum are -1.58 and 2.19 for MR1 and -2.68 and 2.00 for MR2, identifying the lowest- and highest-scoring cars. The skewness and kurtosis of both factors are close to zero, indicating that the scores are approximately normally distributed. The standard error of the mean for both factors is 0.18.

How to Compare EFA with CFA?

EFA is an exploratory technique that tries to find the best factor model that fits the data without any prior assumptions or constraints. CFA is a confirmatory technique that tests whether a predefined factor model fits the data with some specified assumptions or constraints. EFA and CFA have different purposes and applications and can complement each other in factor analysis.

EFA is beneficial for:

  • Exploring the underlying structure of a set of variables without any preconceptions
  • Reducing the dimensionality of a large number of variables into a smaller number of factors
  • Identifying the latent variables or constructs that explain the variation and correlation among the observed variables
  • Generating hypotheses or suggestions for further research or analysis

CFA is useful for:

  • Testing the validity and reliability of a factor model based on theory or previous research
  • Estimating the parameters and fit indices of a factor model with confidence intervals and significance tests
  • Comparing alternative factor models or testing specific hypotheses about the factor structure
  • Confirming or rejecting the results or implications of EFA or other techniques

To perform CFA in R, you need to install and load the lavaan package, which provides various functions for latent variable analysis, including CFA. You can install it from CRAN using the following command:

install.packages("lavaan")
Then, you can load it using:
library(lavaan)
Next, you need to specify your factor model using a special syntax that defines the relationships among the variables and factors. The syntax consists of three parts:
  • The measurement model: This part specifies how each variable is related to each factor using the =~ operator. For example, Power =~ disp means that disp is an indicator that loads on the Power factor.
  • The structural model: This part specifies how the factors are related to one another using the ~ operator. For example, Power ~ Efficiency means that Power is regressed on the Efficiency factor.
  • The residual model: This part specifies how much variance in each variable or factor is not explained by the model, using the ~~ operator. For example, disp ~~ disp means that disp has a residual variance not explained by the Power factor.

This tutorial will use the same two-factor structure obtained from EFA with varimax rotation. We will assume that the factors are uncorrelated, as we used an orthogonal rotation method. We will also allow all variables to have residual variances, since the EFA solution did not fit the data perfectly. The syntax for our CFA model is:

model <- '
# measurement model
Power =~ disp + hp + wt
Efficiency =~ drat + qsec

# structural model
Power ~ 0*Efficiency

# residual model
disp ~~ disp
hp ~~ hp
wt ~~ wt
drat ~~ drat
qsec ~~ qsec
Power ~~ Power
Efficiency ~~ Efficiency
'
This syntax defines our CFA model using descriptive labels for each variable and factor. We could use generic names instead, such as F1, F2, etc., but descriptive labels are more informative and meaningful.

Next, we must fit our CFA model to our data using the cfa() function from the lavaan package. The command is:

fit <- cfa(model, data = mtcars_num)
It will fit our CFA model to our data frame and store the results in a fit object. We can use various functions to inspect and summarize the results of our CFA model.

For example, we can use the summary() function to get a summary of our CFA model, such as parameter estimates, fit indices, etc. The command is:

summary(fit)
The output is:

lavaan 0.6.16 ended normally after 106 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        10

  Number of observations                            32

Model Test User Model:

  Test statistic                                94.717
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate   Std.Err  z-value  P(>|z|)
  Power =~
    disp               1.000
    hp                -0.065       NA
    wt                 1.193       NA
  Efficiency =~
    drat               1.000
    qsec               0.012       NA

Regressions:
                   Estimate   Std.Err  z-value  P(>|z|)
  Power ~
    Efficiency         0.000

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)
   .disp           14802.187       NA
   .hp              4553.559       NA
   .wt              -136.123       NA
   .drat              -7.032       NA
   .qsec               3.092       NA
   .Power             96.411       NA
    Efficiency         7.309       NA

The output shows the summary of our CFA model. Some of the information we can get from it:

  • The estimation method used was maximum likelihood (ML), a standard method for estimating the parameters of a factor model by maximizing the likelihood of the data given the model.
  • The optimization method used was NLMINB, a numerical algorithm for finding the parameter values that minimize a function, in this case the negative log-likelihood of the data given the model.
  • The number of free parameters in the model was 10, estimated from the 32 observations in our data frame.
  • The test statistic for the hypothesis that our model fits the data was 94.717 on 5 degrees of freedom, with a p-value below 0.001. This means we must reject the hypothesis of exact fit: the model does not reproduce the observed covariances well.
  • The first loading of each factor (disp for Power, drat for Efficiency) was fixed to 1.000 for identification, so the factor variances (96.411 and 7.309) were freely estimated rather than standardized to one.
  • The regression of Power on Efficiency was fixed at 0.000, reflecting our assumption that the factors are uncorrelated or orthogonal.
  • Several standard errors are reported as NA, and the residual variances of wt (-136.123) and drat (-7.032) are negative. Negative variance estimates (so-called Heywood cases) indicate an improper solution, which is not surprising given the small sample and the constraints we imposed.

These results show that our CFA model did not fit the data well and did not cleanly confirm our EFA results. We should note some limitations and assumptions of our CFA model:

  • We assumed that our factors were uncorrelated or orthogonal to each other, which might not be realistic or accurate if there is some correlation or dependence among the factors.
  • We assumed that all variables had residual variances, which might not be necessary or appropriate if some variables had no measurement error or were perfectly explained by the factors.
  • We fitted a model with 10 free parameters to only 32 observations, which makes the estimates unstable and contributed to the improper solution.

Therefore, we should try different models or methods to compare and evaluate our CFA results and improve our factor analysis.
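
For example, lavaan can report additional fit indices and compare nested models. A minimal sketch (treat the indices with caution here, since the solution above was improper):

# Additional fit indices for the fitted model
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# Alternative model: factors allowed to correlate (lavaan's default behaviour)
model_oblique <- '
Power =~ disp + hp + wt
Efficiency =~ drat + qsec
'
fit_oblique <- cfa(model_oblique, data = mtcars_num)

# Likelihood-ratio test of the two nested models
anova(fit, fit_oblique)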

Conclusion

In this article, you learned how to perform exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) in R using the psych and lavaan packages. You also learned how to:

  • Prepare your data for EFA and CFA by checking for missing values and outliers.
  • Determine the number of factors to extract using the eigenvalue criterion, scree plots, parallel analysis, etc.
  • Rotate the factors using different methods such as varimax, quartimax, oblimin, promax, etc.
  • Interpret factor loadings and factor scores
  • Compare EFA with CFA and their purposes and applications

We hope you found this article helpful and informative. If you have any questions or feedback, please get in touch with us at info@rstudiodatalab.com or visit our website at https://www.rstudiodatalab.com for more tutorials and resources on data analysis. 

FAQs

What is the R package for exploratory factor analysis?

The R package for exploratory factor analysis is psych, which provides various functions for psychological research and data analysis, including EFA.

How do you interpret the results of a factor analysis in R?

To interpret the results of a factor analysis in R, you need to look at the factor loadings, which indicate how strongly each variable is related to each factor, the factor scores, which indicate the values of each factor for each observation, and the fit indices, which indicate how well the factor model fits the data.

What is the FA function in the R package?

The fa() function is a function from the psych package that performs EFA with various options for rotation and extraction methods.

What is the R type of factor analysis?

R-type factor analysis is factor analysis performed on the correlations among variables (as opposed to Q-type, which factors the correlations among cases). It is a multivariate statistical technique that aims to identify the underlying structure of a set of variables by grouping them into a smaller number of factors.

What is the data for EFA?

The data for EFA should be in a matrix or data frame format, where each row represents an observation and each column represents a variable. The data should also be numeric and continuous, as EFA cannot handle categorical or ordinal variables. The data should also be checked for missing values and outliers, as they can affect the results of EFA.

How many items are needed for exploratory factor analysis?

There is no definitive rule for how many items are needed for exploratory factor analysis, as it depends on various factors such as the number of factors, the sample size, the reliability of the items, etc. However, some general guidelines are to have at least 3 to 5 items per factor, at least 100 to 200 observations, and at least a 5:1 ratio of observations to items.

What is the minimum number of items per factor?

The minimum number of items per factor is usually 3, ensuring that each factor has enough information and variability to be meaningful and interpretable.

How do you explain factor analysis?

Factor analysis is a statistical technique that allows you to reduce the number of variables in a dataset by grouping them into smaller factors. Factors are latent variables that explain the variation and correlation among the observed variables. Factor analysis can help you explore the underlying structure of your data, reduce its dimensionality, and identify the latent constructs or concepts that measure your variables.

What does factor score mean in factor analysis?

Factor score means the value of each factor for each observation. Factor scores are standardized, meaning they have a mean of zero and a standard deviation of one. Factor scores can be used to compare and rank the observations based on their performance on each factor.

How do you interpret a scree plot in factor analysis?

A scree plot represents the eigenvalues in descending order against their factor number. The plot typically shows a sharp decline in the eigenvalues initially, followed by a levelling or gradual decrease. The idea is to find the point where the slope of the plot changes or where the curve bends or elbows. This point indicates the optimal number of factors to retain, as adding more factors would not explain much more variance.
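
In R, a quick way to draw a scree plot is the scree() function from the psych package:

library(psych)
scree(mtcars_num)  # scree plot of eigenvalues for factors and components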

What is a correlation, and how is it related to exploratory factor analysis in R?

Correlation is a measure of how two variables are linearly related. It ranges from -1 to 1, where -1 indicates a perfect negative relationship, 0 indicates no relationship, and 1 indicates a perfect positive relationship. Correlation is related to exploratory factor analysis (EFA) in R because EFA uses the correlation matrix of the variables as the input for finding the underlying factors that explain the variation and correlation among the variables.
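
For example, the correlation matrix that serves as the input for our EFA can be inspected with:

round(cor(mtcars_num), 2)  # the correlation matrix fa() works from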

How do you perform rotation in exploratory factor analysis in R, and why is it important?

Rotation is a process that changes the orientation or direction of the factors without changing their explanatory power or fit to the data. Rotation can help us identify which variables load highly on which factors and what each factor represents or measures. To perform rotation in EFA in R, we can use the psych::fa() function, which provides various options for rotation methods, such as varimax, quartimax, oblimin, promax, etc.
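
For instance, the same two-factor solution can be rotated orthogonally or obliquely (the oblimin option relies on the GPArotation package):

fa(mtcars_num, nfactors = 2, rotate = "varimax")  # orthogonal rotation
fa(mtcars_num, nfactors = 2, rotate = "oblimin")  # oblique rotation (needs GPArotation)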

What is the chi-square statistic, and how is it used to test the hypothesis that a certain number of factors are sufficient to represent the data in EFA in R?

Chi-square statistic measures how well the observed correlation matrix matches the predicted correlation matrix based on the factor model. The higher the chi-square statistic, the worse the fit. The chi-square statistic follows a chi-square distribution, which allows us to calculate a p-value for testing the hypothesis that a certain number of factors are sufficient to represent the data. 

A low p-value (usually less than 0.05) indicates that we can reject the hypothesis and conclude that more factors are needed. A high p-value (usually greater than 0.05) indicates that we cannot reject the hypothesis and conclude that the number of factors is adequate. To perform this test in EFA in R, we can use the psych::fa() function, which provides the chi-square statistic and p-value for testing the hypothesis that a certain number of factors are sufficient.

How do you interpret factor loadings and factor scores in EFA in R?

Factor loadings are the correlations between each variable and each factor. They indicate how strongly each variable is related to each factor. Factor scores are the values of each factor for each observation. They indicate how well each observation performs on each factor. To interpret factor loadings and factor scores in EFA in R, we can use some rules of thumb to decide which loadings are significant, such as loadings greater than or equal to 0.4 are considered high and indicate a strong relationship between a variable and a factor. We can also use the psych::factor.scores() function to compute and compare the factor scores for each observation and each factor.

What is the principal axis method, and how is it different from the maximum likelihood method for extracting factors in EFA in R?

The principal axis method extracts factors from the common (shared) variance of the variables, using estimated communalities, and explicitly allows each variable to have unique variance or measurement error; it makes no distributional assumptions. The maximum likelihood method estimates the factor model by maximizing the likelihood of the data under the assumption of multivariate normality, and it provides fit statistics and significance tests for the solution. To use these methods in EFA in R, we can use the psych::fa() function, which provides various options for extraction methods, such as minres (minimum residual), ml (maximum likelihood), pa (principal axis), etc.
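
Both extraction methods can be requested through the fm argument of psych::fa(), for example:

fa(mtcars_num, nfactors = 2, fm = "pa", rotate = "varimax")  # principal axis factoring
fa(mtcars_num, nfactors = 2, fm = "ml", rotate = "varimax")  # maximum likelihood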


About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
