# Load required packages
# Assumes that the pacman package is already installed; if it's not, install it first
pacman::p_load(tidyverse,
               sjlabelled,
               sjmisc,
               modelsummary,
               marginaleffects,
               lme4,
               psych,
               lavaan,
               lavaanPlot,
               semPlot)
Week 7 Computer Lab Worksheet
Latent variables and structural models
Session aims
Setup
In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.
You can create and work in either an .R script or a .qmd/.Rmd for this week’s exercises (e.g. “Lab_7”), but the .qmd/.Rmd is recommended as it allows you to record longer notes and to render/knit your work to an easily readable output format.
Also install and load the packages that we will use in this session (see the pacman::p_load() call at the top of this worksheet). Apart from packages that we have already used before, there are a number of new packages that we will explore.
Exercise 1: (Exploratory) Factor Analysis
In the first exercises we will use a dataset containing data collected from members of a fitness centre in Norway in 2014 (Mehmetoglu and Mittner 2021) (the description of the methods and analysis below also relies mostly on Chapters 13 and 14 in Mehmetoglu and Mittner (2021)). The club members were asked to indicate how well having an attractive face and being sexy described them as a person using an ordinal scale (1 = very badly to 6 = very well). Using a similar scale (1 = not at all important to 6 = very important), the members were also asked to indicate how important various motivations for working out were:
Name | Label |
---|---|
lweight | How important is following to workout- to loose weight |
calories | How important is following to workout- to burn calories |
cweight | How important is following to workout- to control my weight |
stress | How important is following to workout - to help manage stress |
tension | How important is following to workout - to release tension |
relax | How important is following to workout - to mentally relax |
body | How important is following to workout- to have a good body |
appear | How important is following to workout- to improve my appearance |
attract | How important is following to workout- to look more attractive |
muscle | How important is following to workout- to develop my muscles |
strength | How important is following to workout- to get stronger |
endur | How important is following to workout- to increase my endurance |
face | How well does the following describe you as a person - attractive face |
sexy | How well does the following describe you as a person - sexy |
The workout dataset can be loaded from the course website:
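The exact import command depends on the file format provided on the course website. As a minimal sketch (the file name and URL below are placeholders, not the actual link), a Stata-format copy of the data could be read in with sjlabelled::read_stata():
# Hypothetical example -- replace the URL/file name with the actual link from the course website
workout <- sjlabelled::read_stata("https://example.com/data/workout.dta")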
To inspect and better understand the dataset, apply some of the summary functions we practised in previous weeks to this data (e.g. using modelsummary, gtsummary, summarytools or something else).
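For example, assuming the packages loaded at the start of the session, a quick numeric overview could be obtained with:
datasummary_skim(workout)   # summary table of the numeric variables, from {modelsummary}
# or, if you have {summarytools} installed:
# summarytools::view(summarytools::dfSummary(workout))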
...
For the first exercise, we’ll select a few variables from the dataset and keep only complete cases with non-missing values:
efa_data <- workout |>
select(stress:attract) |>
drop_na()
The six selected variables are the following:
Name | Label |
---|---|
stress | How important is following to workout - to help manage stress |
tension | How important is following to workout - to release tension |
relax | How important is following to workout - to mentally relax |
body | How important is following to workout- to have a good body |
appear | How important is following to workout- to improve my appearance |
attract | How important is following to workout- to look more attractive |
Notice that the selected variables can roughly be seen to refer to two conceptual dimensions (latent factors): mental well-being and physical appearance.
A common motivation for an (exploratory) factor analysis is to reduce a large number of variables down to a meaningful and manageable number of “factors” - or latent variables - that can explain the covariance/correlation among that larger set of observed - measured - variables. Each factor corresponds to a subset of observed variables that are relatively highly correlated. The first steps in a factor analysis are, thus, deciding the number of factors to extract and factor extraction. Factor extraction revolves around inspecting the diagonals of a correlation matrix that includes all the correlations among the observed variables of interest. In a raw correlation matrix, the values on the diagonals are all 1s, because each variable is perfectly correlated with itself. For example, a raw correlation matrix for our sample data would look like this:
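The matrix below can be produced, for example, with base R's cor() function (rounded to two decimals for readability):
cor(efa_data) |> round(2)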
stress tension relax body appear attract
stress 1.00 0.83 0.75 0.02 -0.02 -0.05
tension 0.83 1.00 0.82 0.00 -0.02 -0.08
relax 0.75 0.82 1.00 0.12 0.04 0.00
body 0.02 0.00 0.12 1.00 0.76 0.71
appear -0.02 -0.02 0.04 0.76 1.00 0.86
attract -0.05 -0.08 0.00 0.71 0.86 1.00
In the factor analysis process our aim is to modify these diagonal values to best suit our aims - the various factor extraction methods target these diagonal values (apart from the Maximum Likelihood method, which manipulates the off-diagonal values). Two of the most commonly employed extraction methods are the Principal Axis factor (PA) and the Principal Components Analysis (PCA) methods. The PA method inserts estimates of the common/shared variance (also called communality) in the diagonals of the starting correlation matrix instead of 1s. Communality values are obtained by estimating the squared multiple correlation (smc) of each variable with all the other variables in the matrix (i.e. the multiple R-squared values obtained from regressing each variable on the remaining variables). Manually, we could calculate these communalities like this:
cor_table <- cor(efa_data) # Save the raw correlation matrix as an object
diag(cor_table) <- (1 - 1 / diag(solve(cor_table))) # Calculate and replace the diagonal of the matrix with the estimated communalities
## or, alternatively, using the `smc()` function from the {psych} package:
diag(cor_table) <- smc(efa_data)
cor_table |> round(2) # Print the correlation matrix with two decimals precision
stress tension relax body appear attract
stress 0.70 0.83 0.75 0.02 -0.02 -0.05
tension 0.83 0.79 0.82 0.00 -0.02 -0.08
relax 0.75 0.82 0.71 0.12 0.04 0.00
body 0.02 0.00 0.12 0.61 0.76 0.71
appear -0.02 -0.02 0.04 0.76 0.79 0.86
attract -0.05 -0.08 0.00 0.71 0.86 0.75
The reason why PA uses communalities in the diagonals is that it assumes that some of the variance in the variables is caused by some other unique sources which ideally should be removed from the analysis. Conversely, PCA uses 1s in the starting diagonals in the correlation matrix without sorting out the variance caused by other sources than the factors themselves. In this respect, PCA could be said to analyse variance (1s in the diagonals) whereas PA (and the other extraction methods) analyses covariance (communalities in the diagonals).
The next step in factor extraction is to compute eigenvalues and eigenvectors from this correlation matrix which are then used to compute factor loadings. Eigenvectors are sets of weights (\(w\)) that generate factors with the largest possible eigenvalues, while eigenvalues (\(e\)) are variances captured by these factors. Factor loadings reflect correlations between observed variables and their respective factors, unless factors are correlated. Factors can be assumed correlated when rotating the factor solution using a type of oblique rotation such as “promax” or “oblimin”; otherwise, PA and PCA assume and produce an orthogonal (no correlation between factors) solution initially. As a result, if we square these correlations, the results will show us how much variance of each observed variable is explained by each factor.
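As a brief illustration of this step, we can obtain the eigenvalues of the (communality-adjusted) correlation matrix we created above directly with base R's eigen() function; this is roughly what the extraction routines do internally:
eigen(cor_table)$values |> round(2)   # eigenvalues of the reduced correlation matrix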
The eigenvalues are a common way of deciding on the number of factors to extract. In PCA, factors associated with eigenvalues greater than 1 are generally recommended to be retained for interpretation. The idea here is that the factor retained should explain at least as much variance as one observed variable contributes, a situation represented by 1s in the diagonal. In the case of PA, since we use communalities (which are less than 1) instead of 1s in the diagonal, a common strategy is to choose eigenvalues greater than the average of the initial communalities.
Other strategies for choosing the number of factors to extract are scree tests, parallel analysis and theoretical motivations. The scree test is about examining a plot/curve, produced after factor extraction, containing the eigenvalues on the Y-axis and the factors on the X-axis. The idea of the scree test is that factors along the tail of the curve represent mostly random error variance (trivial eigenvalues), and consequently, the factor solution just before the levelling of the curve should be chosen. Parallel analysis is about estimating the same factor model as the original one using randomly simulated data, resembling the original data in terms of both the number of variables and observations. Eigenvalues obtained from the simulated data are subsequently averaged and contrasted with their counterparts from the original data. A common recommendation is that if the eigenvalue of a factor from the original data proves to be larger than the average (or, alternatively, the 95th percentile) of the eigenvalues from the simulated data, the factor should be retained. Otherwise, the factor would be considered no more substantial than a random factor and thus discarded.
In R, the psych package provides a variety of functions for performing factor analysis. The fa.parallel() function runs a parallel analysis including a scree plot. In the command below we specify that the parallel analysis should be done using the PA extraction method (fm="pa") and that the same method should be used for the eigenvalues (fa="fa"). With the argument SMC="TRUE" we can also request that squared multiple correlations be used in the diagonal of the correlation matrix at the start of the estimation:
parallel_analysis <- fa.parallel(efa_data,
fm="pa", fa="fa", SMC="TRUE")
Parallel analysis suggests that the number of factors = 2 and the number of components = NA
Alternatively, to run a Principal Component Analysis instead, we can write:
fa.parallel(efa_data, fa="pc")
Parallel analysis suggests that the number of factors = NA and the number of components = 2
The scree plot suggests the existence of two obvious factors before the curve levels off. In the code above we have saved the results to an object so that we can also print out more detailed numerical results using the print() function:
print(parallel_analysis)
Call: fa.parallel(x = efa_data, fm = "pa", fa = "fa", SMC = "TRUE")
Parallel analysis suggests that the number of factors = 2 and the number of components = NA
Eigen Values of
eigen values of factors
[1] 2.34 2.28 0.00 -0.07 -0.08 -0.12
eigen values of simulated factors
[1] 0.28 0.15 0.05 -0.03 -0.11 -0.19
eigen values of components
[1] 2.61 2.56 0.32 0.23 0.16 0.12
eigen values of simulated components
[1] NA
We could also compare the average smc() scores to the eigenvalues:
smc(efa_data) |> mean()
[1] 0.7246744
Again, only two of the eigenvalues are greater than the mean of the squared multiple correlations between the variables. Finally, theoretical reasoning also supports the choice of drawing two factors out of this set of six variables, given that, as noted earlier, they appear to roughly map onto motivations relating to either mental well-being or physical appearance.
The actual extraction of the factors/principal components, once the most appropriate number of factors has been decided, can be done using the fa() and principal() functions from psych. These functions also provide various extraction methods and rotation options. It is generally useful to rotate the initial factor solution in order to obtain a more easily interpretable factor solution. An easily interpretable factor solution is associated with an output containing a factor loading (pattern) matrix in which variables have the maximum possible loading (close to 1) on one factor and the minimum possible loading (close to 0) on the remaining factor(s); this is often referred to as a ‘simple structure’. Orthogonal rotation options impose that a 90-degree angle is kept between factor axes while rotating them. An alternative approach, oblique rotation, relaxes this restriction and is often preferred, given that factors (latent variables) measuring behavioural phenomena common in social science research are arguably somewhat correlated, and oblique rotation can take this correlation into account by first estimating the correlation between the factors and then generating a solution.
The most widely used orthogonal rotation technique is “varimax” rotation, which maximizes the variance of the squared loadings for each factor, thus polarizing loadings so that they are either high or low, making it easier to identify factors with specific variables. Rotation does not influence the total variance explained by the factors. However, the total variance explained gets distributed differently among the factors (i.e., eigenvalues change). Suppose that two factors explain 75 per cent variance together, with 48 per cent and 27 per cent of the variance attributable to the first and second factor, respectively. After rotation, the factors would together still explain 75 per cent variance, but now with, say, 44 per cent and 31 per cent of the total variance attributed to the first and second factor, respectively. Relatedly, the total amount of variance of each variable explained by the factors (i.e., communality) would also stay the same but the factor loadings would change after rotation.
The most common oblique rotation technique is “promax”, which starts with an orthogonal rotation (varimax) in which the loadings are, first, raised to a specific power (2, 3 or 4) and then the solution is rotated to allow for correlations among the factors.
Let’s use the fa() function to extract two factors using the PA method and applying a “varimax” rotation. Since our two factors represent two different phenomena (mental well-being and physical appearance), we do not assume a strong correlation between them, and therefore an orthogonal rotation is justified. The fa() function also uses an iterative version of PA (IPA) by default, which we can turn off by specifying max.iter = 0 if we want to. Iterative PA is basically a PA run several times: while PA starts and completes the factor extraction process with one set of estimated communalities (squared multiple correlations), IPA replaces these estimated communalities with new estimates emerging from the factor extraction process each time, until the difference between the two communalities (last inserted and last estimated) is minimized. The fa() function uses the default argument min.err = 0.001, meaning that the estimation iterates until the change in communality is less than 0.001.
factor_model <- fa(efa_data,
nfactors = 2,
fm="pa",
rotate = "varimax")
We can inspect the complete results from the factor analysis by printing the model object:
factor_model
We can also extract more specific information rather than inspecting the entire output. For example, we can check the factor loadings and eigenvalues:
print(factor_model$loadings, digits=4, cutoff=0)
Loadings:
PA1 PA2
stress 0.8662 -0.0287
tension 0.9522 -0.0480
relax 0.8689 0.0465
body 0.0575 0.7990
appear 0.0093 0.9550
attract -0.0382 0.8943
PA1 PA2
SS loadings 2.4168 2.3554
Proportion Var 0.4028 0.3926
Cumulative Var 0.4028 0.7954
We can see in this output how the first three variables (stress, tension and relax) load strongly on the first factor, and the last three variables (body, appear and attract) load strongly on the second factor. The factor loadings table can be read both vertically and horizontally. When we interpret the matrix vertically, we locate the correlations between a factor and all the observed variables. For example, the loading of stress on factor 1 is 0.8662. Squaring this value - 0.8662^2 = 0.7503 - tells us that roughly 75 per cent of the variance of stress is explained by factor 1. When we interpret the table horizontally, we obtain item communalities. For instance, we see that the loadings of tension on factor 1 and factor 2 are 0.9522 and -0.0480, respectively. Squaring and adding these two values together - 0.9522^2 + (-0.0480)^2 = 0.9089 - would show us the proportion (about 90%) of the total variance of tension that is explained by factor 1 and factor 2 in tandem. The remaining 10% represents the unique or unexplained variance.
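These item communalities are also stored in the fitted object, so we can read them off directly; for example:
factor_model$communality |> round(3)   # proportion of each variable's variance explained by the two factors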
We can also visualise these results with the fa.diagram() and fa.plot() functions:
fa.diagram(factor_model, digits = 2)
fa.plot(factor_model)
In most cases the aim of an exploratory factor analysis is to use factors in subsequent analysis. For this, we first need a metric for the two hypothetical constructs. We can either use the estimated factor scores directly or we can generate factors by, for example, taking either the sum or average of the variables expressing each factor. We will choose to take the average to keep the factor metric on the same scale (1−6) as the original observed variables, using the scoreItems() function from psych. Then, we can test the reliability of this summed scale based on Cronbach’s \(\alpha\) coefficient.
# First, we compile a list of the variables:
itemlist <- list(mental=c("stress","tension","relax"),
physical=c("body","appear","attract"))
summateds <- scoreItems(itemlist, efa_data,
min=1, max=6, totals = FALSE)
factordata <- as_data_frame(summateds$scores)
# Merge the factordata variables with the dataset:
efa_data <- bind_cols(efa_data, factordata)
# Check the variables in the dataset:
names(efa_data)
[1] "stress" "tension" "relax" "body" "appear" "attract" "mental"
[8] "physical"
To test the reliability of the factors created, we can use the alpha() function from {psych}:
# We run the alpha check on the two items separately:
mental <- efa_data |> select(stress:relax)
physical <- efa_data |> select(body:attract)
# We can save the output to objects and inspect them
alpha_mental <- alpha(mental)
alpha_physical <- alpha(physical)
# Or we can print out only the overall alpha values directly
alpha(mental)$total$std.alpha
[1] 0.9234774
alpha(physical)$total$std.alpha
[1] 0.9124331
In common practice, an \(\alpha\) level higher than 0.7 is taken as satisfactory, indicating that the items fit together well as a scale. Both of our factors have a high \(\alpha\) value, so we have additional evidence that the variables chosen fit together well on each factor.
EFA practice: Try to redo the analysis above, but this time include all of the “How important is following to workout …” variables from the original workout dataset.
Exercise 2: Latent Path Analysis
Confirmatory Factor Analysis (CFA) and Latent Path Analysis (LPA) are two of the most commonly used Structural Equation Modelling techniques in the social sciences. Confirmatory factor analysis is used to assess a hypothesized latent factor structure containing a set of indicators and one or more latent variables. Latent path analysis is used not only to examine a factor structure but also to test hypothesized structural relationships among the latent variables. The factor structure is concerned with the relationships between indicators and latent variables, whereas the structural relationships concern links between latent variables. The former is referred to as the measurement part, while the latter is named the structural part; together they comprise LPA.
In this exercise we will use the same workout data we started off from in Exercise 1. The analysis presented by Mehmetoglu and Mittner (2021) begins by proposing a number of hypotheses concerning motivations for working out in a fitness club based on evolutionary psychology theories:
- H1: The more attractive one perceives herself/himself, the more the person wants to work out to improve her/his physical appearance (i.e., Attractive → Appearance).
- H2: The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to build up muscles (i.e., Appearance → Muscle).
- H3: The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to lose weight (i.e., Appearance → Weight).
- H4: Perceiving oneself as more attractive will indirectly influence the person to work out more to build up muscles (i.e., Attractive → Appearance → Muscle).
- H5: Perceiving oneself as more attractive will indirectly influence the person to work out more to lose weight (i.e., Attractive → Appearance → Weight).
In an SEM framework it is common to visualise this set of hypotheses in a path diagram. Using the lavaan package, we can write the measurement and structural model equations, fit the structural equation model and plot the model like this:
# First, remove all objects from the Environment, except for the "workout" data
rm(list=setdiff(ls(), "workout"))
# Specify the model structure:
full.lpa.mod <- '
# Measurement model (latent variables)
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
# Structural model (regressions)
Appearance ~ Attractive
Muscle ~ Appearance
Weight ~ Appearance
'
# Estimate the model
est.full.lpa.mod <- sem(full.lpa.mod, data=workout)
# Plot the model
## with lavaanPlot()
lavaanPlot(model = est.full.lpa.mod)
## or semPaths()
semPlot::semPaths(est.full.lpa.mod, title = FALSE, layout = "tree2", nCharNodes = 0)
Estimating an LPA model proceeds in two steps: we first establish a psychometrically sound (i.e., valid and reliable) measurement model and subsequently test the structural model. The measurement part of the LPA model includes the relationship between each of our four latent variables (Attractive, Appearance, Muscle and Weight) and its respective indicators. The measurement model is a standard CFA, and can be fitted separately like this:
meas.lpa.mod <- '
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
'
est.meas.lpa.mod <- cfa(meas.lpa.mod, data = workout)
We can inspect the results using the summary() function, with some additional options included:
summary(est.meas.lpa.mod, fit.measures = TRUE, standardized = TRUE)
Some of the model fit indices (RMSEA > 0.1, TLI < 0.9, etc.) could be improved. The root mean square error of approximation (RMSEA) takes model complexity and sample size into consideration and penalizes models with too many parameters to estimate (i.e., with low degrees of freedom, df), thereby favouring simpler models (i.e., with higher df). As we decrease the df (i.e., include more parameters to estimate in our model), the RMSEA value increases. RMSEA values greater than 0.10 generally indicate poor model fit. The Tucker–Lewis index (TLI) generally ranges from 0 to 1, with higher values desirable; TLI values greater than 0.90 are generally associated with acceptable model fit.
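If we only want to inspect the specific indices mentioned above rather than the full summary, we can extract them with lavaan's fitMeasures() function, for example:
fitMeasures(est.meas.lpa.mod, c("chisq", "df", "pvalue", "rmsea", "tli", "cfi"))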
Mehmetoglu and Mittner (2021) suggest that correlating the error variances of two pairs of indicators (“muscle” and “endur”; “lweight” and “body”) would improve the model fit. In lavaan, this modification is specified by adding the error covariances with the ~~ operator:
meas.lpa.mod2 <- '
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
muscle ~~ endur
lweight ~~ body
'
est.meas.lpa.mod2 <- cfa(meas.lpa.mod2, data = workout)
The revised model is an improvement, although the RMSEA is still slightly above 0.10.
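One way to check this yourself is to extract the fit indices from the revised model and, since the two measurement models are nested, compare them with a chi-square difference test:
fitMeasures(est.meas.lpa.mod2, c("rmsea", "tli", "cfi"))
anova(est.meas.lpa.mod, est.meas.lpa.mod2)   # likelihood-ratio (chi-square difference) test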
The structural part extends the measurement model with the hypothesized relationships among the latent variables.
full.lpa.mod <- '
# Measurement model (latent variables)
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
muscle ~~ endur
lweight ~~ body
# Structural model (regressions)
Appearance ~ Attractive
Muscle ~ Appearance
Weight ~ Appearance
'
est.full.lpa.mod <- sem(full.lpa.mod, data = workout)
We can inspect the full model:
summary(est.full.lpa.mod, fit.measures = TRUE, standardized = TRUE)
Since the theory underlying the model also states that the covariance between the Muscle and the Weight factors should be set to 0, we can incorporate this into the model:
full.lpa.mod2 <- '
# Measurement model (latent variables)
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
muscle ~~ endur
lweight ~~ body
# Set covariance to 0:
Muscle ~~ 0*Weight
# Structural model (regressions)
Appearance ~ Attractive
Muscle ~ Appearance
Weight ~ Appearance
'
est.full.lpa.mod2 <- sem(full.lpa.mod2, data = workout)
summary(est.full.lpa.mod2, fit.measures = TRUE, standardized = TRUE)
The assessment of the structural part is similar to that of a linear regression analysis. First, the sign, significance and size of the path coefficients should be considered. Path coefficients are estimates that help us assess the hypothesized relationships in the structural part. They are typically presented in standardized form, which is equivalent to standardized betas in linear regression. Standardized coefficients generally range between −1 and 1: the closer a path coefficient is to ±1, the stronger the (positive/negative) relationship; the closer it is to 0, the weaker the relationship. Standardized coefficients equal to or less than 0.09 indicate a small effect, coefficients between 0.09 and 0.2 a moderate effect, and coefficients larger than 0.2 a large effect. In our model, all the signs of the coefficients are in the hypothesized direction. That is, Attractive has a large and positive effect on Appearance, and Appearance has a large and positive effect on both Muscle and Weight. Finally, all of the standardized coefficients are statistically significant at the 0.01 level. These findings provide clear support for the first three of our study hypotheses (H1, H2 and H3). To address H4 and H5, we need to estimate the indirect effect of Attractive (via Appearance) on Muscle and Weight. In lavaan, this involves setting labels for the path coefficients (e.g. a, b1, b2) and then using these labels to create new parameters (e.g. ind1 and ind2) with the := operator:
full.lpa.mod3 <- '
#Measurement model (latent variables)
Attractive =~ face + sexy
Appearance =~ body + appear + attract
Muscle =~ muscle + strength + endur
Weight =~ lweight + calories + cweight
muscle ~~ endur
lweight ~~ body
Muscle ~~ 0*Weight #set covariance to 0
#Structural model (regressions)
Appearance ~ a*Attractive
Muscle ~ b1*Appearance
Weight ~ b2*Appearance
#Indirect effects
# of Attraction on Muscle
ind1 := a*b1
# of Attraction on Weight
ind2 := a*b2
'
est.full.lpa.mod3 <- sem(full.lpa.mod3, data = workout)
summary(est.full.lpa.mod3, standardized=TRUE)
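To focus on the indirect effects without scanning the whole summary output, we can also extract just the user-defined parameters, for instance:
parameterEstimates(est.full.lpa.mod3, standardized = TRUE) |>
  filter(op == ":=")   # keep only the parameters defined with the := operator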
From this final model, we find that Attractive has a moderate and positive indirect effect on both Muscle and Weight, and that these indirect effects are statistically significant at the 0.01 level.
Exercise 3: Multilevel path model coefficients via multilevel modelling
In this exercise we look at the analysis presented in Ejrnæs and Jensen (2021), who use cross-national data from 17 countries collected through the European Social Survey (Round 8, 2016) to identify scenarios, or “pathways”, under which EU citizens would vote in a potential referendum for their country to leave the European Union. They begin by positing three theoretical models of “the micro-foundations of ‘hard Euroscepticism’/‘exit scepticism’”: utilitarian, identity and anti-elitist models. Their main method is a path analysis testing these theories, but they also employ multilevel logistic regression, principal component analysis (PCA) and cluster analysis at various points in their analysis. The main analysis is done in Stata (the PCA is done in SPSS).
In this exercise, we will look more closely at the following reported results:
- Figure 4: Multilevel path model
- Appendix 1: Multilevel logistic regression (marginal effects)
We will check how the reported results can be reproduced using multilevel modelling with the lme4 package, as practised in Week 6. In R, lavaan has some limitations when it comes to fitting generalised multilevel SEM models.
First, import the dataset:
# Import the dataset
# Also, drop unused value labels and select to keep only the variables that will be used
ejr <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005/data/Ejrnaes21.dta", drop.labels = TRUE) |>
select(c(vteurmmb,
trstprl, trstlgl, trstplc, trstplt, trstprt,
imueclt, imwbcnt, imbgeco,
eduyrs, hinctnta, country_num,
mnactic, agea, atchctr, gndr))
Data management
Outcome variable
The outcome variable of interest originates from the survey question: “Imagine there were a referendum in [country] tomorrow about membership of the European Union. Would you vote for [country] to remain a member of the European Union or to leave the European Union?”. Responses are coded in the variable vteurmmb, which has five levels (tip: use the sjmisc::frq() function for a labelled frequency table of the variable), including also “Would submit a blank ballot paper”, “Would spoil the ballot paper” and “Would not vote” in addition to “Remain” and “Leave”. We will follow the authors in dichotomising the variable as “1” = Leave vs. “0” = all other non-missing values:
# Recoding "leave" variable as dummy variable (with `dplyr`):
ejr <- ejr |>
mutate(leave = case_match(vteurmmb,
"2" ~ 1,
c("1", "33", "44", "55") ~ 0))
### Notes:
### - "case_match" has superseded the previous "recode()" function in dplyr 1.1.0. If the code above fails, make sure you have dplyr 1.1.0 installed; if not, install it first
### - "vteurmmb" is a factor with 5 labelled levels; to be able to refer to the values when recoding, we need either to treat it as_numeric() or quote the levels using "". The resulting variable will be numeric in either case.
## `sjmisc::rec()` is another useful option, especially with labelled data:
# ejr <- ejr |>
# mutate(leave = rec(vteurmmb,
# rec = "2 = 1 [Leave];
# 1, 33, 44, 55 = 0"))
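After recoding, it is worth checking the distribution of the new variable against the original, for example:
sjmisc::frq(ejr$leave)       # frequency table of the new dummy variable
# sjmisc::frq(ejr$vteurmmb)  # compare with the original five-category variable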
Endogenous variables
In the context of structural equation modelling, variables that are influenced by other variables in a model are called endogenous variables. Conversely, those not influenced by other variables are called exogenous. Additionally, we can differentiate manifest variables - those that we have directly observed and measured - from latent variables - those that we have not directly measured, but instead we estimate from other measured variables. Such latent variables can be, for example, the “factors” or “components” obtained from a factor analysis or PCA. A path analysis is a special version of a structural equation model in which all the variables are manifest (observed).
The two constructs used as endogenous variables in the Figure 4 model (“trust in the political establishment” and “support for multiculturalism”) are treated by the authors as manifest because they are entered into the model as indices created before the model-building; however, the indices are themselves the result of a principal component analysis. Below, we create two similar composite variables:
# Creating two trust-related composite variables
## Select out the variables as data-frames
trust_vars <- ejr |>
select(c(trstprl, trstlgl, trstplc, trstplt, trstprt)) |>
mutate(across(everything(), as_numeric))
multiculti_vars <- ejr |>
select(c(imueclt, imwbcnt, imbgeco)) |>
mutate(across(everything(), as_numeric))
## Create Cronbach's alpha coefficient scores using `alpha()` from the {psych} package
## and add them to the "data" dataframe
ejr <- ejr |>
mutate(trust = alpha(trust_vars, check.keys = TRUE)$scores,
multi_culti = alpha(multiculti_vars, check.keys = TRUE)$scores)
ejr <- ejr |>
mutate(across(c(eduyrs, hinctnta, trust, multi_culti), as_numeric),
across(leave, as_factor))
Variable standardisation
The authors also standardise the variables before modelling them. In R, this can be done in various ways, including using the sjmisc::std() function:
# Standardize variables using "sjmisc::std()"
ejr <- ejr |>
mutate(Zeducation1 = std(eduyrs),
Zincome1 = std(hinctnta),
Zmulti1 = std(multi_culti),
Ztrust1 = std(trust)
) |>
mutate(across(c(Zeducation1, Zincome1, Zmulti1, Ztrust1, country_num), as_numeric))
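We can quickly verify that the standardised variables now have a mean of (approximately) 0 and a standard deviation of 1, for example:
ejr |> select(starts_with("Z")) |> psych::describe()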
Multilevel model fitting
We now have all the variables in a format that can be modelled:
F4_direct <- lme4::glmer(leave ~ Zeducation1 + Zincome1 + Ztrust1 + Zmulti1 + (1|country_num), family = binomial(link = "logit"), data = ejr)
F4_direct_margins <- marginaleffects::marginaleffects(F4_direct, eps = 0.15)
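## NOTE: the modelsummary() table below also refers to two objects, "F4_trust" and "F4_multi",
## which are not defined in this worksheet. A plausible specification -- an assumption based on
## the path diagram and the table columns, not code taken from the original analysis -- is a pair
## of linear mixed models predicting the standardised trust and multiculturalism scores:
F4_trust <- lme4::lmer(Ztrust1 ~ Zeducation1 + Zincome1 + (1|country_num), data = ejr)
F4_multi <- lme4::lmer(Zmulti1 ~ Zeducation1 + Zincome1 + (1|country_num), data = ejr)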
modelsummary::modelsummary(list("Leave" = F4_direct_margins,
"Trust in pol. est." = F4_trust,
"Support multiculturalism" = F4_multi),
coef_rename = c("Zeducation1" = "Education",
"Zincome1" = "Income",
"Zmulti1" = "Support for multiculturalism",
"Ztrust1" = "Trust in political establishment"),
statistic = NULL)
|   | Leave | Trust in pol. est. | Support multiculturalism |
|---|---|---|---|
| Education | −0.012 | 0.064 | 0.210 |
| Income | −0.012 | 0.111 | 0.074 |
| Support for multiculturalism | −0.075 | | |
| Trust in political establishment | −0.061 | | |
| (Intercept) | | −0.011 | 0.008 |
| SD (Intercept country_num) | | 0.372 | 0.353 |
| SD (Observations) | | 0.903 | 0.888 |
| Num.Obs. | 23850 | 23923 | 23865 |
| R2 Marg. | 0.193 | 0.022 | 0.063 |
| R2 Cond. | 0.280 | 0.164 | 0.191 |
| AIC | 19642.0 | 63101.9 | 62184.1 |
| BIC | 19690.5 | 63142.3 | 62224.5 |
| ICC | 0.1 | 0.1 | 0.1 |
| RMSE | 0.36 | 0.90 | 0.89 |
Compare the results in our table to those reported in Ejrnæs and Jensen (2021).