Week 4 Worksheet Notes

Interactions: Estimating, graphing and interpreting interaction effects

Aims

By the end of the session you will learn how to:

fit, visualise and interpret results from regression models that include interaction terms

R packages

library(tidyverse)
library(easystats)
library(gtsummary)
library(ggformula)
library(sjlabelled)
library(ggeffects)
library(marginaleffects)

Introduction

This session explores examples of interactions in regression models. We will look more closely at some of the multiple regression models we have fit in previous labs and ask whether the effects of core explanatory variables could be said to depend on (or are conditioned on) the values of other explanatory variables included in the model. In practice, this will involve including the product of two explanatory variables in the model. Thus, taking a regression model of the generic form \(y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\), we ask whether there is an interaction between variables \(X_1\) and \(X_2\) (i.e. whether the effect of one depends on the values of the other) by fitting a model of the form \(y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \color{green}{\beta_3(X_1 \times X_2)} + \epsilon\), where we are interested in the “interaction effect” represented by \(\beta_3\).

While fitting such interaction models is simple in practice, understanding when they are needed, interpreting, and communicating their results can be challenging, and a growing literature in the social sciences has explored best practices in relation to such models (Berry, Golder, and Milton 2012; Brambor, Clark, and Golder 2006; Clark and Golder 2023; Hainmueller, Mummolo, and Xu 2019). Advancements in software have also made it easier to interpret and present results from interaction models, with some notable R packages entering this space over the past few years; we have already encountered the {marginaleffects} and {ggeffects} packages when visualising results from logistic regression models, and examining results from interaction models presents very similar challenges. In this week’s lab we will use some of these tools to undertake comprehensive analyses of interaction effects.

We have already encountered interactions in all of the application readings we have engaged with so far. However, they mostly involved more complex cases of interactions, whose understanding will first require a more basic knowledge of how interaction effects work and how they can be implemented in practice. Wu (2021) and Dingemans and Van Ingen (2015) make use of “cross-level” interactions, which we will revisit when learning about multilevel models (Week 6). Mitchell (2021) uses higher-level interactions on data aggregated at country level as an additional analysis to their main models, while Österman (2021) operates with a quasi-experimental design in which the interaction effect is taken to divulge a more directly causal effect the data. This latter example is “more complex” only at a conceptual and study design level, and we will replicate it in the lab exercises. However, we begin by looking at a conceptually less involved example where interaction effects are at the core of the analysis, also from the field of social trust research.

Example 1. Akaeda (2023): Are preferences for redistribution associated with education in European countries, and is the association moderated by social trust?

The research question in the title of this exercise relates directly to the research reported by Akaeda (2023). Following a detailed review of the literature on the relationship between attitudes towards redistribution, education, and social- and institutional trust, the author derives a hypothesis they want to investigate: “trust decreases the gap in preferences for redistribution due to education” (p. 296). They break down the hypothesis into two parts, one relating to social trust and the other to institutional trust.

Akaeda (2023) combine data from several waves of the World Values Survey (WVS) and the European Values Study (EVS) to test this research hypothesis. Their combined dataset contains data on “74 countries, 26 years, 259 country-years, and 254,214 individuals”. Their interest is in modelling the hypothesised relationship on a global level, accounting both for various individual explanatory variables as well as macro-societal factors at the level of countries and country-years (i.e. change over time within countries).

Here we will attempt to fit a simpler version of their “Model 3” (reported in Table 2). In terms of data, we focus specifically on “European” countries; for this purpose, we will use data from the latest wave of the European Values Study survey. In terms of variable selection and transformations, we will roughly follow Akaeda (2023), with some adjustments for convenience. Akaeda (2023:297–98) describe their variable selection procedure in these words:

the dependent variable is the score of preferences for redistribution. This score is based on a question … that asks respondents to indicate on a scale from 1 to 10 whether ‘Incomes should be made more equal (1)’ or ‘We need larger income differences as incentives for individual effort (10)’. In accordance with previous research, the scores were reversed such that a higher score indicates stronger support for redistribution;

the level of education is a key independent variable because the association between education and support for redistribution is a main focus of this study. Because previous studies have shed light on the mechanisms of university education related to a conservative view of redistribution, this analysis adopts the dummy for university or higher degree as an independent variable;

as a key moderator variable … social trust as measured by the question, ‘Generally speaking, would you say that most people can be trusted or that you need to be very careful in dealing with people?’, which had the following two potential responses: ‘1. Most people can be trusted’, ‘2. Need to be very careful’. Based on previous research involving social trust, this analysis employs the dummy for social trust (1 = ‘Most people can be trusted’);

the following individual-level controls: gender (1 = female, 0= male), age, age squared, employment status (employed, self-employed, unemployed, retired, other), household income (z scored for country-year units, marital status (married = 1), child (having child = 1), and political orientation (1 = right to 10 = left)

Data preparation

The original data is freely accessible from the European Values Study website in various formats. For the purposes of this lab, we can also access a lightly edited dataset in native .Rds format from the Data page (evs2017.rds). We can either download the dataset to a local data folder and load it from there, or loaded directly from the web:

evs <- datawizard::data_read("https://cgmoreh.github.io/HSS8005-24/data/evs2017.rds")

Reading data...

Variables where all values have associated labels are now converted into
  factors. If this is not intended, use `convert_factors = FALSE`.

210 out of 266 variables were fully labelled and converted into factors.

Following 8 variables are empty:
  v72_DE, v73_DE, v74_DE, v75_DE, v76_DE, v77_DE, v78_DE and v79_DE
  
Use `remove_empty_columns()` to remove them from the data frame.

Once the data is loaded in the workspace environment, we can select the variables that we are interested in, inspect them in a codebook and apply any necessary transformations:

# Select the variables needed:

evs <- evs |> 
  select(v106,      # redistribution pref.
         v243_r,    # education
         v31,       # social trust
         v225,      # sex/gender
         age,
         v244,      # employment status
         v261_ppp,  # household income (corrected)
         v234,      # marital status
         v239_r,    # number of children
         v102,      # political orientation (1=left...10=right)
         )

View the codebook for the selected variables:

data_codebook(evs) |> View()

Recode variables to match Akaeda (2023) as closely as possible:

akaeda <- evs |> 
  data_modify(redistribute = reverse(v106),
              # recode education to a binary; 
              # treat the original variable as a numeric variable in the process to allow referring to numeric codes rather than the long text labels
              educ_univ = recode_values(as_numeric(v243_r),
                                   recode = list("1" = 3,
                                                 "0" = c(1, 2, 4))),
              # inverse to make "most people can be trusted" the second (indicator) category
              trusting = recode_values(as_numeric(v31),
                                       recode = list("1" = 1,
                                                     "0" = 2)),
              # just rename as "female" is already the second category
              female = v225,
              # recode "employment" as a factor/categorical (without treating it as numeric and adding labels later)
              employment = recode_values(v244,
                                         recode = list("Employed" = c("30h a week or more", 
                                                                      "less then 30h a week"), 
                                                       "Self-employed" = "self employed", 
                                                       "Unemployed" = "unemployed", 
                                                       "Retired" = "retired/pensioned", 
                                                       "Other" = c("military service", 
                                                                   "homemaker not otherwise employed", 
                                                                   "student", 
                                                                   "disabled", 
                                                                   "other"))),
              hh_income = v261_ppp,
              # recoding marital status as numeric
              married = recode_values(as_numeric(v234),
                                      recode = list("1" = 1,
                                                    "0" = 2:6)),
              
              child = recode_values(v239_r,
                                    recode = list("0" = 1,
                                                  "1" = 2:6)),
              politics = reverse(v102)
              ) |> 
  set_labels(married, labels = c("Not married", "Maried")) |> 
  set_labels(trusting, labels = c("Not trusting", "Trusting")) |> 
  set_labels(educ_univ, labels = c("No university education", "University education")) |> 
  set_labels(child, labels = c("No children", "Has children")) |> 
  # Treat them as factors again for better reporting of labels
  to_label(married, trusting, educ_univ, child, married)

View the codebook for the recoded dataset:

data_codebook(akaeda) |> View()

Relationships between the main variables of interest

Following Akaeda (2023), the main variables of interest are redistribute, educ_univ and trusting. We can inspect some basic relationships between them using boxplots and cross-tabulations:

# Boxplots:

gf_boxplot(redistribute ~ educ_univ, data = akaeda)

Warning: Removed 1335 rows containing non-finite outside the scale range
(`stat_boxplot()`).

gf_boxplot(redistribute ~ trusting, data = akaeda)

Warning: Removed 1335 rows containing non-finite outside the scale range
(`stat_boxplot()`).

# Cross-tabulations:

# To check the values of a categorical variable by another categorical variable, a contingency table (cross-tabulation) is preferable.
# Native R is not very good at cross-tabulations; there are some better user-written functions, but only relying on packages we have already used so far,
# some options are:

## In base R (table):
### Simple frequencies
table(akaeda$educ_univ, akaeda$trusting)

                         
                          Not trusting Trusting
  No university education        26838    11319
  University education           10153     9416

### Frequencies with added row marginal totals
table(akaeda$educ_univ, akaeda$trusting) |> addmargins(margin = 2)

                         
                          Not trusting Trusting   Sum
  No university education        26838    11319 38157
  University education           10153     9416 19569

### Proportions, by row
table(akaeda$educ_univ, akaeda$trusting) |> prop.table(margin = 1)

                         
                          Not trusting  Trusting
  No university education    0.7033572 0.2966428
  University education       0.5188308 0.4811692

### Proportions, by column with added column marginal totals
table(akaeda$educ_univ, akaeda$trusting) |> prop.table(margin = 2) |> addmargins(margin = 1)

                         
                          Not trusting  Trusting
  No university education    0.7255278 0.5458886
  University education       0.2744722 0.4541114
  Sum                        1.0000000 1.0000000

## In base R (xtabs - formula style):
xtabs(~ educ_univ + trusting, data = akaeda)

                         trusting
educ_univ                 Not trusting Trusting
  No university education        26838    11319
  University education           10153     9416

#### Proportions by column, rounded to 2 decimals, added column marginal totals
xtabs(~ educ_univ + trusting, data = akaeda)|> prop.table(margin = 2) |> round(2) |> addmargins(margin = 1)

                         trusting
educ_univ                 Not trusting Trusting
  No university education         0.73     0.55
  University education            0.27     0.45
  Sum                             1.00     1.00

## In tidyverse (dplyr + tidyr)
### Frequencies
akaeda |> 
  group_by(educ_univ, trusting) |> 
  summarise(n=n()) |> 
  spread(trusting, n)

`summarise()` has grouped output by 'educ_univ'. You can override using the
`.groups` argument.

# A tibble: 3 × 4
# Groups:   educ_univ [3]
  educ_univ               `Not trusting` Trusting `<NA>`
  <fct>                            <int>    <int>  <int>
1 No university education          26838    11319    890
2 University education             10153     9416    434
3 <NA>                               239      118     31

### Proportions
akaeda |> 
  group_by(educ_univ, trusting) |> 
  summarise(n=n()) |> 
  mutate(prop=n/sum(n)) |> 
  subset(select=c("educ_univ","trusting","prop")) |> 
  spread(trusting, prop)

`summarise()` has grouped output by 'educ_univ'. You can override using the
`.groups` argument.

# A tibble: 3 × 4
# Groups:   educ_univ [3]
  educ_univ               `Not trusting` Trusting `<NA>`
  <fct>                            <dbl>    <dbl>  <dbl>
1 No university education          0.687    0.290 0.0228
2 University education             0.508    0.471 0.0217
3 <NA>                             0.616    0.304 0.0799

  # or, for % of education by trust:
  # spread(educ_univ, prop)


## In {datawizard}:
### Both frequencies and % in a listing format rather than crosstabulated
akaeda |> 
  data_group(educ_univ) |>
  data_tabulate(trusting, collapse = TRUE)

# Frequency Table

Variable |                               Group |        Value |     N | Raw % | Valid % | Cumulative %
---------+-------------------------------------+--------------+-------+-------+---------+-------------
trusting | educ_univ (No university education) | Not trusting | 26838 | 68.73 |   70.34 |        70.34
         |                                     |     Trusting | 11319 | 28.99 |   29.66 |       100.00
         |                                     |         <NA> |   890 |  2.28 |    <NA> |         <NA>
---------+-------------------------------------+--------------+-------+-------+---------+-------------
trusting |    educ_univ (University education) | Not trusting | 10153 | 50.76 |   51.88 |        51.88
         |                                     |     Trusting |  9416 | 47.07 |   48.12 |       100.00
         |                                     |         <NA> |   434 |  2.17 |    <NA> |         <NA>
---------+-------------------------------------+--------------+-------+-------+---------+-------------
trusting |                      educ_univ (NA) | Not trusting |   239 | 61.60 |   66.95 |        66.95
         |                                     |     Trusting |   118 | 30.41 |   33.05 |       100.00
         |                                     |         <NA> |    31 |  7.99 |    <NA> |         <NA>
------------------------------------------------------------------------------------------------------

## In {gtsummary}:
akaeda |> 
  gtsummary::tbl_cross(educ_univ, trusting,
                     percent = "row",
                     missing = "no")

FALSE observations with missing data have been removed.

	people can be trusted/can’t be too careful (Q7)		Total
	Not trusting	Trusting	Total
educational level respondent: recoded (Q81)
No university education	26,838 (70%)	11,319 (30%)	38,157 (100%)
University education	10,153 (52%)	9,416 (48%)	19,569 (100%)
Total	36,991 (64%)	20,735 (36%)	57,726 (100%)

Multiple regression model without interaction terms

Our dependent variable (redistribute) can be treated as numeric, so we will fit a linear (ordinary least squares) regression model using the lm() function:

m1 <- lm(redistribute ~ educ_univ + trusting + female + employment + hh_income + married + child + politics, data = akaeda)

Let’s then check the model coefficients (parameters):

model_parameters(m1)

Parameter                        | Coefficient |       SE |         95% CI | t(40050) |      p
----------------------------------------------------------------------------------------------
(Intercept)                      |        4.05 |     0.05 | [ 3.95,  4.15] |    77.35 | < .001
educ univ [University education] |       -0.50 |     0.03 | [-0.56, -0.44] |   -16.51 | < .001
trusting [Trusting]              |        0.30 |     0.03 | [ 0.25,  0.36] |    10.12 | < .001
female [female]                  |        0.14 |     0.03 | [ 0.09,  0.20] |     5.12 | < .001
employment [Other]               |    5.89e-03 |     0.05 | [-0.09,  0.10] |     0.12 | 0.901 
employment [Retired]             |        0.09 |     0.04 | [ 0.02,  0.16] |     2.62 | 0.009 
employment [Self-employed]       |       -0.21 |     0.06 | [-0.33, -0.09] |    -3.54 | < .001
employment [Unemployed]          |       -0.20 |     0.06 | [-0.31, -0.09] |    -3.58 | < .001
hh income                        |        0.03 | 7.73e-03 | [ 0.01,  0.04] |     3.53 | < .001
married [Maried]                 |       -0.16 |     0.03 | [-0.22, -0.09] |    -4.96 | < .001
child [Has children]             |       -0.12 |     0.04 | [-0.19, -0.05] |    -3.30 | < .001
politics                         |        0.24 | 6.01e-03 | [ 0.22,  0.25] |    39.30 | < .001


Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Model with an interaction between education and social trust

We can interact predictor variables in regression models using the : and * operators. With :, we include in the regression model only the interaction term, but not the component variables (i.e. their main effects); to include the component variables as well, we need to add them in as usual with the + operator. To note that we should normally want to include both main effects and and the interaction effects in a regression model. With *, we include both the main effects and the interaction effects, so it provides a shortcut to using :, but it is often useful to combine the two in more complex interaction scenarios.

The specifications below are equivalent:

# Interaction using ":"
# including the constituent terms must be done manually
m2 <- lm(redistribute ~ educ_univ + trusting + educ_univ:trusting + female + employment + hh_income + married + child + politics, data = akaeda)

# Interaction using "*"
# including the constituent terms is done automatically
m2 <- lm(redistribute ~                        educ_univ*trusting + female + employment + hh_income + married + child + politics, data = akaeda)
m2 <- lm(redistribute ~ educ_univ + trusting + educ_univ*trusting + female + employment + hh_income + married + child + politics, data = akaeda)

But the one below is missing the constituent terms and is very likely not what we want:

m2_wrong <- lm(redistribute ~                  educ_univ:trusting + female + employment + hh_income + married + child + politics, data = akaeda)

Out of precaution, it is advisable to use the * operator and to manually include all the constituent terms as well, as in the third version above.

Let’s then check the model coefficients (parameters):

model_parameters(m2)

Parameter                                              | Coefficient |       SE |         95% CI | t(40049) |      p
--------------------------------------------------------------------------------------------------------------------
(Intercept)                                            |        4.13 |     0.05 | [ 4.02,  4.23] |    77.15 | < .001
educ univ [University education]                       |       -0.68 |     0.04 | [-0.76, -0.60] |   -17.06 | < .001
trusting [Trusting]                                    |        0.15 |     0.04 | [ 0.07,  0.22] |     3.86 | < .001
female [female]                                        |        0.14 |     0.03 | [ 0.09,  0.20] |     5.00 | < .001
employment [Other]                                     |   -1.39e-03 |     0.05 | [-0.09,  0.09] |    -0.03 | 0.976 
employment [Retired]                                   |        0.09 |     0.04 | [ 0.02,  0.16] |     2.47 | 0.013 
employment [Self-employed]                             |       -0.21 |     0.06 | [-0.33, -0.10] |    -3.59 | < .001
employment [Unemployed]                                |       -0.21 |     0.06 | [-0.32, -0.10] |    -3.82 | < .001
hh income                                              |        0.02 | 7.75e-03 | [ 0.01,  0.04] |     2.94 | 0.003 
married [Maried]                                       |       -0.15 |     0.03 | [-0.22, -0.09] |    -4.90 | < .001
child [Has children]                                   |       -0.12 |     0.04 | [-0.20, -0.05] |    -3.48 | < .001
politics                                               |        0.24 | 6.01e-03 | [ 0.22,  0.25] |    39.14 | < .001
educ univ [University education] × trusting [Trusting] |        0.41 |     0.06 | [ 0.29,  0.52] |     6.98 | < .001


Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

We can compare our results with those reported in Model 3 (Table 2) of Akaeda (2023), keeping in mind that our data are very different. Nonetheless, we find a similar positive effect for the interaction term (0.41), which in our case is stronger.

Interpreting interactions

As with results from logistic regression models, interpreting interaction results is not straightforward.

First of all, the coefficients associated with the individual variables included in the interaction term can no longer be interpreted in the usual direct way, and we usually no longer interpret the “main effects” of educ_univ and trusting, but only the multiplicative effect of the interaction between them. The interpretation of the numeric results, however, is rather convoluted and prone to lead to misinterpretations. In essence, what we get is the effect of a unit-change in trust among the university-educated compared to the effect of a unit-change in trust among the non-university-educated.

As with interpreting logistic regression, it can be much easier and reliable to interpret a plot of the “marginal effects”. As we already know from Lab 4, that can be achieved using the ggpredict() and the plot() functions, and we would add both constitutive terms of the interaction. Depending on which variable we write first we get either the effect of trust by education, or the effect of education by trust:

Education by trust:

ggpredict(m2, terms = c("educ_univ", "trusting")) |> 
  plot()

Trust by education:

ggpredict(m2, terms = c("trusting", "educ_univ")) |> 
  plot()

The latter plot reproduces Figure 1 from Akaeda (2023). To get an even closer reproduction of that Figure, we could add an additional argument to the plotting function, asking for the point-predictions to be connected with lines:

ggpredict(m2, terms = c("trusting", "educ_univ")) |> 
  plot(connect.lines = TRUE)

Looking at whether and how much the confidence intervals of the predicted coefficients overlap with the point-predictions of the coefficients, we can visually asses the interaction effect. We find that positive differences in the level of social trust have a much steeper effect on redistributive attitudes among the highly educated than it does among the lower educated.

Example 2. Österman (2021): Do educational reforms have different effects by levels of parental education?

This research question is at the core of the study conducted by Österman (2021). We have explored this article and used its data to fit simple binary and multiple regression models in Week 2. We can now take that analysis further and replicate some of the models reported in Table 3 of that article:

As our main interest is not so much on their substantive findings regarding the complex effects of different types of educational reforms, but rather on seeing interaction effects operating in real-life data, we will focus on Models 1-3 (All reforms). The main difference compared to the previous example is conceptual: the variable “Reform” is meant to distinguish between respondents who were affected by a major reform of the secondary education system in their countries during their schooling years, or not. This allows the researcher to posit a “quasi-experimental” scenario, in which being exposed to a “reform” can be interpreted in a more causal way, as in an experimental setting. The “reform”, thus, can be thought of as a “treatment” in an experiment, so the examining its statistical effect can be taken to have a more reliably causal interpretation that when exploring associations between purely “descriptive” variables. Technically, however, the models we fit are no different to those in the previous example.

Data preparation

As a first step, let’s import the osterman dataset that underpins the Österman (2021) article (see the Data page on the course website for information about the datasets available in this course). We can download the data to a local folder and load it from there, or we can load it directly from the web (if it works…):

# Import the data
osterman <- data_read("https://cgmoreh.github.io/HSS8005-24/data/osterman.dta")

It’s always a good idea to inspect the dataset after importing, to identify any issues. One option is to check a codebook, for example:

# `View` the codebook:
data_codebook(osterman) |> View()

Remember that some “tagged” NA values that were imported from the Stata data format need changing (see Week 2 worksheet):

osterman <- sjlabelled::zap_na_tags(osterman)

Model 1 redux: no covariates

We start by fitting a simplified version of Model1 (in Table 3 of Österman (2021)). The reported model contains the variables highlighted in the table (“Reform”, “Female” and “Ethnic minority”) as well as a number of additional covariates for statistical control (fbrneur, mbrneur, fnotbrneur, mnotbrneur, agea, essround, yrbrn, eform_id_num, including “self-interactions” and interactions between some of those control variables: yrbrn*yrbrn, yrbrn*reform_id_num, agea*reform_id_num, agea*agea, agea*agea*reform_id_num). The model is also adjusted for sampling weights (dweight) and the standard errors are clustered in “country-by-birth year cohort” groups (see Österman (2021)[pp. 222] for their reasoning for this modelling choice; we will discuss this aspect of the data in more detail when we consider multilevel models).

Here we will first fit a much simplified model that only includes the few variables highlighted in the table, without any additional control variables, weighting or clustering.

Model 1 doesn’t include an interaction, so it is a simple multiple linear regression model:

osterman_m1 <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d, data = osterman)

# Print model parameters, with digits rounded to three decimal points
model_parameters(osterman_m1, digits = 3)

Parameter   | Coefficient |    SE |           95% CI | t(68792) |      p
------------------------------------------------------------------------
(Intercept) |       5.251 | 0.013 | [ 5.226,  5.276] |  412.067 | < .001
reform1 7   |      -0.008 | 0.015 | [-0.037,  0.020] |   -0.560 | 0.576 
female      |       0.007 | 0.015 | [-0.021,  0.036] |    0.493 | 0.622 
blgetmg d   |      -0.478 | 0.060 | [-0.595, -0.361] |   -8.010 | < .001


Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

We find that without the additional model specifications, our results are quite different from those reported in the article. The “Reform” effect is much weaker and in the opposite direction, losing its statistical significance (p = 0.576). The effect of “Female” is also weaker and statistically not significant. On the other hand, belonging to an “ethnic minority group” has a stronger and clearer effect in this simplified model.

As a next step, we aim to expand the model by including all the specified covariates. This, however, will bring up an R coding challenge which can also serve as an introduction to the concept of polynomial terms.

Problem 1: Polynomials

In the footnotes to Table 3 and Table A.3 in the Appendix, the author tells us:

All models include reform FEs [Fixed Effects], a general quadratic birth year trend, and reform-specific trends for birth year (linear) and age (quadratic). Additional controls: foreign-born parents and ESS round dummies.

The tables in the Appendix (Table A.3) “report all of the coefficients for the control variables, except for the age and birth year controls. Since the latter variables are interacted with all of the reform fixed effects, they are not interpretable as single coefficients”.

In the main text, the author further explains some of the reasoning behind their modelling choices:

One dilemma for the design is that there has been a trend of increasing educational attainment throughout the studied time period, which means that the reform-windows of treated and non-treated cohorts will also pick up the effects of this trend. To counter this, [the list of covariates] includes a general quadratic birth year trend, reform-specific linear controls for birth year and reform-specific quadratic age trends. The quadratic terms are included to allow sufficient flexibility in absorbing possible non-linear trends of increasing education within the reform-window of seven treated and seven untreated birth year cohorts. … The reform-fixed effects are also essential because they imply that only the within-reform-window variation is used to estimate the effects and between-reform differences are factored out, such as pre-reform differences in social trust. (Österman 2021:221–22)

Before we fit the model, some of the concepts in the quotation need unpacking. A quadratic term is a second-degree polynomial term: put simply, it’s the square of the variable concerned. The quadratic of a variable such as \(age\) is therefore \(age^2\), or \(age \times age\). In other words, it is like a variable’s “interaction” with itself. Because of this, there are several ways in which the quadratic terms to be included in a model can be specified:

We could create the quadratic terms as new variables, and include those in the model. Effectively, we would be creating new variables/columns in the dataset and using those in the model. We could do that as shown below:

# Create new quadratic variables by multiplication with themselves and add them to the dataset, saving it as a new data object:

osterman <- osterman |> 
  mutate(agea_quad = agea*agea,                                                         # quadratic age variable
         yrbrn_quad = yrbrn*yrbrn) |>                                                   # quadratic birth-year variable
  sjlabelled::var_labels(agea_quad = "Age (quadratic)",                                 # we can label the variable if we want
                        yrbrn_quad = "Birth year (quadratic)")

# We now have two additional variables in the dataset; we can fit a model using those:

m1_prequad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3
                   fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +     # additional controls for foreign-born parents and ESS Round dummies (we treat `round` as a factor for this)
                   agea + yrbrn + agea_quad + yrbrn_quad +                              # general quadratic birth year trend and quadratic age
                   factor(reform_id_num) +                                              # reform fixed effects dummies
                   yrbrn:factor(reform_id_num) +                                        # reform-specific birth year trend
                   agea:factor(reform_id_num) +  agea_quad:factor(reform_id_num),       # reform-specific quadratic age trend
               data = osterman)                                                         # the new expanded dataset

We can get the same results by creating the quadratic terms directly as part of the modelling function. The one thing we should keep in mind is that if we want to include standard mathematical operations within a formula function, we need to isolate or insulate the operation from R’s formula parsing code using the I() function, which returns the contents of I(...) “as.is”. The model formula would then be:

# Create quadratic terms internally as part of the modelling function:

m1_funquad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +       # additional controls for foreign-born parents and ESS Round dummies (we treat `round` as a factor for this)
                 agea + yrbrn + I(agea^2) + I(yrbrn^2) +                                # general quadratic birth year trend and quadratic age
                 factor(reform_id_num) +                                                # reform fixed effects dummies
                 yrbrn:factor(reform_id_num) +                                          # reform-specific birth year trend
                 agea:factor(reform_id_num) +  I(agea^2):factor(reform_id_num),         # reform-specific quadratic age trend
              data = osterman)                                                          # the original dataset

In the two previous options, the quadratic terms will be correlated with the original variables. To avoid this by relying on so-called orthogonal polynomials we should use the poly() function. We can also fit the same correlated polynomial model as the ones above by adding the raw = TRUE option to the poly() function. In the code below, we will fit the correlated version first, then the orthogonal version (This stackoverflow discussion explains in more detail the difference between the two options):

m1_polyraw <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + 
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + 
                 poly(agea, 2, raw = TRUE) + poly(yrbrn, 2, raw = TRUE) +
                 factor(reform_id_num) +           
                 yrbrn:factor(reform_id_num) + 
                 agea:factor(reform_id_num) + poly(agea, 2, raw = TRUE):factor(reform_id_num),
               data = osterman)

m1_polyorth <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + 
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + 
                 poly(agea, 2) + poly(yrbrn, 2) +
                 factor(reform_id_num) +           
                 yrbrn:factor(reform_id_num) + 
                 agea:factor(reform_id_num) + poly(agea, 2):factor(reform_id_num),
               data = osterman)

For a side-by-side overview comparison of the models we fitted so far, we can use one of the many model summary tabulation functions that exist in various packages. Some popular options are stargazer::stargazer(), jtools::export_summs(), sjPlot::tab_model(), modelsummary::modelsummary() or parameters::compare_models from the {easystats} ecosystem. Below we will use modelsummary(), which produces publication-ready summary tables in HTML format, but which can easily be exported to other formats, such as Microsoft Word or PDF:

# It's cleaner to first make a list of the models we want to summarise; we can even name them:
models <- list(
  "Pre-calculated quadratic" = m1_prequad,
  "Within-function quadratic" = m1_funquad,
  "poly() with raw coding" = m1_polyraw,
  "poly() with default orthogonal coding" = m1_polyorth
)

# modelsummary table with stars for p-values added
modelsummary::modelsummary(models, stars = TRUE)

The results from the modelsummary() are not shown here because it’s a long and ugly table, but it’s useful for perusing to compare the results across the different models. We do notice some differences in the affected variables between the orthogonal-coded version and the others. It’s worth noting, however, that the Stata routine used by the author fitted correlated/raw-coded polynomials, so the orthogonal version from the output above is just for a comparison and we will not use it going forward. We generally want our transformed (polynomial) variables to be correlated with the original variables, as they are in fact measuring the same thing.

For a cleaner table showing only the results included in Table A.3 in the Appendix to Österman (2021), we can use the coef_map or the coef_omit option in modelsummary() and only include m1_funquad, which will be the polynomial fitting routine that we will use going forward:

# It's cleaner to first make a vector of the coefficients we wish to include; we can name the coefficients as they appear in Table A.3; note that we also leave out the Intercept, as in the published table:
coefs <- c("reform1_7" = "Reform",
           "female" = "Female",
           "blgetmg_d" = "Ethnic minority",
           "fbrneur" = "Foreign-born father, Europe",
           "mbrneur" = "Foreign-born mother, Europe",
           "fnotbrneur" = "Foreign-born father, outside Europe",
           "mnotbrneur" = "Foreign-born mother, outside Europe",
           "factor(essround)2" = "ESS Round 2",
           "factor(essround)3" = "ESS Round 3",
           "factor(essround)4" = "ESS Round 4",
           "factor(essround)5" = "ESS Round 5",
           "factor(essround)6" = "ESS Round 6",
           "factor(essround)7" = "ESS Round 7",
           "factor(essround)8" = "ESS Round 8",
           "factor(essround)9" = "ESS Round 9")

# Then we pass the vector to coef_map to select the coefficients to print
modelsummary::modelsummary(list(m1_funquad), stars = TRUE, coef_map = coefs)

	(1)
Reform	0.063*
	(0.027)
Female	0.058***
	(0.013)
Ethnic minority	−0.241***
	(0.054)
Foreign-born father, Europe	−0.111**
	(0.042)
Foreign-born mother, Europe	−0.108*
	(0.044)
Foreign-born father, outside Europe	−0.065
	(0.073)
Foreign-born mother, outside Europe	−0.110
	(0.078)
ESS Round 2	0.059
	(0.045)
ESS Round 3	0.162*
	(0.075)
ESS Round 4	0.243*
	(0.108)
ESS Round 5	0.360*
	(0.144)
ESS Round 6	0.397*
	(0.179)
ESS Round 7	0.449*
	(0.212)
ESS Round 8	0.655**
	(0.246)
ESS Round 9	0.816**
	(0.283)
Num.Obs.	68796
R2	0.200
R2 Adj.	0.198
AIC	268913.8
BIC	270056.1
Log.Lik.	−134331.877
RMSE	1.71
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

We can compare our table to the published one; there will, of course, be some differences, as our model specification is still not the same as that of the original paper. One source of the inconsistencies are the use of “survey weights” in the original article, something we will look into in Week 6.

Model 2

Model 2 includes two additions: an extra predictor for “High Parental Education”, as well as an interaction effect between “Reform” and “High parental education”. As in the previous example, we can fit the model like so:

osterman_m2 <- lm(trustindex3 ~ reform1_7 + paredu_a_high + reform1_7*paredu_a_high + female + blgetmg_d + # main covariates reported in Table 3
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +       # additional controls for foreign-born parents and ESS Round dummies (we treat `round` as a factor for this)
                 agea + yrbrn + I(agea^2) + I(yrbrn^2) +                                # general quadratic birth year trend and quadratic age
                 factor(reform_id_num) +                                                # reform fixed effects dummies
                 yrbrn:factor(reform_id_num) +                                          # reform-specific birth year trend
                 agea:factor(reform_id_num) +  I(agea^2):factor(reform_id_num),         # reform-specific quadratic age trend
              data = osterman)                                                          # the original dataset

We can now select the coefficients we want to display in the table and create a summary table comparing the results from Model 1 and Model 2 side by side:

coefs_m2 <- c("reform1_7" = "Reform",
              "paredu_a_high" = "High parental edu",
              "reform1_7:paredu_a_high" = "Ref x High par edu",
           "femaleFemale" = "Female",
           "blgetmg_d" = "Ethnic minority",
           "fbrneur" = "Foreign-born father, Europe",
           "mbrneur" = "Foreign-born mother, Europe",
           "fnotbrneur" = "Foreign-born father, outside Europe",
           "mnotbrneur" = "Foreign-born mother, outside Europe")

# Print model parameters, with digits rounded to three decimal points
modelsummary::modelsummary(list(m1_funquad, osterman_m2), stars = TRUE, coef_map = coefs_m2)

	(1)	(2)
Reform	0.063*	0.081**
	(0.027)	(0.029)
High parental edu		0.349***
		(0.020)
Ref x High par edu		−0.049+
		(0.029)
Ethnic minority	−0.241***	−0.207***
	(0.054)	(0.056)
Foreign-born father, Europe	−0.111**	−0.106*
	(0.042)	(0.044)
Foreign-born mother, Europe	−0.108*	−0.115*
	(0.044)	(0.046)
Foreign-born father, outside Europe	−0.065	−0.047
	(0.073)	(0.076)
Foreign-born mother, outside Europe	−0.110	−0.135+
	(0.078)	(0.081)
Num.Obs.	68796	64960
R2	0.200	0.209
R2 Adj.	0.198	0.208
AIC	268913.8	252985.1
BIC	270056.1	254138.4
Log.Lik.	−134331.877	−126365.537
RMSE	1.71	1.69
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

While the results of our Model 2 still differ from those reported based on a more complex model, we find that our estimates are much closer now, with all the coefficients pointing in the same direction as in the reported model, including the interaction effect of “Reform” by “High parental education”. Because the two variables involved in this interaction are both simple binary, indicator, variables (coded 0 and 1) it is possible to make (some) sense of these results. Compared to the “main” effects shown in the previous model, the coefficient for “Reform” now refers to the effect of the reforms among individuals with poorly educated parents (coded 0) only. The interaction coefficient shows, by contrast, the difference in the effect of “Reform” for those with highly educated parents (coded 1). In this case, the negative coefficient shows that in the case of those with highly educated parents, the effect of educational “reforms” is smaller.

We can check more closely the meaning of the interaction effect using a marginal plot:

ggpredict(osterman_m2, terms = c("reform1_7", "paredu_a_high")) |> 
  plot(connect.lines = TRUE)

This visualization in now easier to interpret than the numerical output. a steeper effect among those whose parents were highly educated compared to those whose parents had lower education, however, we can also see more clearly that the effect of educational reforms within the two groups is actually in the opposite direction: among those with lower parental education, educational reforms are associated with higher levels of trust, while among those with lower parental education, the effect is in the opposite direction.

Model 3 is different from Model 2 in that it includes interaction terms not only between “Reform” and “High parental education”, but Parental education is also interacted with all the other covariates that we did not include in our models.

References

Akaeda, Naoki. 2023. “Trust and the Educational Gap in the Demand for Redistribution: Evidence from the World Values Survey and the European Value Study.” International Sociology 38(3):290–310. doi: 10.1177/02685809231167834.

Berry, W. D., M. Golder, and D. Milton. 2012. “Improving Tests of Theories Positing Interaction.” Journal of Politics 74(3):653–71. doi: 10.1017/S0022381612000199.

Brambor, T., W. R. Clark, and M. Golder. 2006. “Understanding Interaction Models: Improving Empirical Analyses.” Political Analysis 14(1):63–82. doi: 10.1093/pan/mpi014.

Clark, William Roberts, and Matt Golder. 2023. Interaction Models: Specification and Interpretation. Cambridge: Cambridge University Press.

Dingemans, Ellen, and Erik Van Ingen. 2015. “Does Religion Breed Trust? A Cross-National Study of the Effects of Religious Involvement, Religious Faith, and Religious Context on Social Trust.” Journal for the Scientific Study of Religion 54(4):739–55. doi: 10.1111/jssr.12217.

Hainmueller, J., J. Mummolo, and Y. Q. Xu. 2019. “How Much Should We Trust Estimates from Multiplicative Interaction Models? Simple Tools to Improve Empirical Practice.” Political Analysis 27(2):163–92. doi: 10.1017/pan.2018.46.

Mitchell, Jeffrey. 2021. “Social Trust and Anti-Immigrant Attitudes in Europe: A Longitudinal Multi-Level Analysis.” Frontiers in Sociology 6.

Österman, Marcus. 2021. “Can We Trust Education for Fostering Trust? Quasi-experimental Evidence on the Effect of Education and Tracking on Social Trust.” Social Indicators Research 154(1):211–33. doi: 10.1007/s11205-020-02529-y.

Wu, Cary. 2021. “Education and Social Trust in Global Perspective.” Sociological Perspectives 64(6):1166–86. doi: 10.1177/0731121421990045.