## Install first if not yet installed
library("tidyverse")
library("sjlabelled")
library("easystats")
library("ggformula")
library("marginaleffects")
library("modelsummary")
library("gtsummary")
library("sjPlot")
library("sjmisc")
library("ggeffects")
Week 7 Application Lab
Paths: Graphical models and considerations for causal analysis
Aims
This session will help you think more carefully about the causal claims you can make from a regression-based analysis of observational (i.e. non-experimental) data. We will practice some more elementary methods for slicing data to assess group differences and relate them to the linear regression methods familiar from previous weeks. We will also take a step back from data to think more conceptually about some of the models we have encountered so far, and to begin thinking about the research questions and the theoretical and empirical estimands (quantities of interest) that you would like to investigate on a dataset of your choice. These will be the first steps towards the analysis you will be submitting for assessment in two months’ time.
In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder.
Create a new Quarto markdown file (.qmd) for this session (e.g. “Lab_7.qmd”) and work in it to complete the exercises and report on your final analysis.
Packages
Introduction
Exercise 1: Estimating the persuasive power of the news media
In this exercise we will focus on data from an article by Ladd and Lenz (2009) (follow the doi link in the citation to access the article; you can also download the article from here and the article’s appendix from here). In this article the authors aim to estimate whether (print) media has the power to persuade people to vote differently. In general terms, this is a very interesting question, even though the power of print media has definitely weakened over the past few decades and measuring the effect of alternative media sources would present additional challenges. Nevertheless, the authors take advantage of a unique natural experiment that arguably facilitates tackling this causal question: between the 1992 and 1997 UK general elections, four newspapers (the Sun, Daily Star, Independent, and Financial Times) changed their editorial stance and tone from supporting the Conservative Party to supporting Tony Blair’s Labour Party. Such radical shifts in editorial stance are extremely rare. The data they use come from several waves of the British Election Panel Study between 1992 and 1997 and include variables on voting behaviour in 1992 and 1997, as well as whether the respondent was a reader of one of the “switching” newspapers. These are the main variables useful for tackling the causal effect involved in the research question, but there are several other variables that provide various controls.
Data
We can import the dataset (available in Stata’s .dta format) from the course website. For the purpose of this exercise, we select only three core variables of interest:
# Let's call the data object "news" for simplicity
news <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005-data/LaddLenz.dta") |>
select(tolabor, vote_l_92, vote_l_97) |>
# let's also rename the variable `tolabor` to `reader`, which sounds more intuitive, perhaps
rename(reader = tolabor)
data_codebook(news)
Comparing proportions
The aim of Ladd and Lenz (2009) is to estimate the effect of “reading a switching newspaper” on respondents’ voting behaviour change between the 1992 and 1997 elections. In terms of the variables in our dataset, they aim to estimate vote_l_97 (voted Labour in 1997) from vote_l_92 (prior Labour vote in 1992) as moderated by reader (treatment: indicator of whether the respondent read a switching newspaper). All three variables are binary indicator variables.
We have data consisting of measurements on the same individuals at two different time points (1992 and 1997), which allows us to think in causal terms. But the question could first be broken down into three smaller comparisons exploring:
- the average treatment effect of reader in a cross-sectional design (oblivious of prior vote)
- a before/after comparison of the treated group, comparing their average vote in 1997 to their average vote in 1992
- a differences-in-differences comparison of the average change over time in the treatment group and the average change over time in the control group.
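These three comparisons can be sketched compactly in base R on a toy dataset (the values below are made up for illustration; they are not the lab data):

```r
# Toy data: outcome measured before (y92) and after (y97) for a treated
# (d = 1) and an untreated (d = 0) group; illustrative values only
d   <- c(0, 0, 0, 1, 1, 1)
y92 <- c(0, 1, 0, 0, 1, 1)
y97 <- c(0, 1, 1, 1, 1, 1)

# (1) Cross-sectional difference between groups in 1997 (naive "ATE")
ate_97 <- mean(y97[d == 1]) - mean(y97[d == 0])

# (2) Before/after change within the treated group
before_after <- mean(y97[d == 1]) - mean(y92[d == 1])

# (3) Difference-in-differences: treated change minus untreated change
did <- (mean(y97[d == 1]) - mean(y92[d == 1])) -
       (mean(y97[d == 0]) - mean(y92[d == 0]))
```

We will compute exactly these quantities on the real news data in the subsections below.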
Overall mean vote for Labour in 1997 and 1992
What is the overall proportion of those voting Labour in the 1997 elections in the sample?
mean(news$vote_l_97)
[1] 0.4507219
And in 1992:
mean(news$vote_l_92)
[1] 0.3314501
As with all proportions, these can be read as percentages if multiplied by 100, and we can also round the numbers to one decimal point:
round(mean(news$vote_l_97) * 100, 1)
[1] 45.1
round(mean(news$vote_l_92) * 100, 1)
[1] 33.1
So the difference in the 1997 and 1992 Labour vote is:
mean(news$vote_l_97 - news$vote_l_92)
[1] 0.1192718
Or 11.9 percentage points.
To get a better sense of the data that generates these averages, let’s tabulate the 97-92 Labour vote difference:
data_tabulate(news$vote_l_97 - news$vote_l_92)
outcome: voted labor in 1997 (news$vote_l_97 - news$vote_l_92) <numeric>
# total N=1593 valid N=1593
Value | N | Raw % | Valid % | Cumulative %
------+------+-------+---------+-------------
-1 | 49 | 3.08 | 3.08 | 3.08
0 | 1305 | 81.92 | 81.92 | 85.00
1 | 239 | 15.00 | 15.00 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
It’s important to understand how these data come about.
Think about (look again at) the coding of the variables involved. What do you think the values in the table represent?
- In which category are those who voted Labour in 1992 but did not vote Labour in 1997?
- In which category are those who did not vote Labour in 1992 but voted Labour in 1997?
- What does the 0 category stand for here?
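One way to see mechanically how these values arise is to difference two small binary vectors (a toy illustration, not the lab data):

```r
# Differencing two binary indicators can only yield -1, 0, or 1
v92 <- c(1, 0, 0, 1)  # voted Labour in 1992?
v97 <- c(0, 1, 0, 1)  # voted Labour in 1997?
v97 - v92
# -1: Labour in 1992 only; 1: Labour in 1997 only; 0: no change either way
```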
Conditional averages
What about the proportion of Labour voters among those who did and did not read papers that shifted their editorial support?
# We can first break down the dataset into two:
not_reader <- news |> filter(reader == 0)
reader <- news |> filter(reader == 1)
# Then calculate means within each:
not_reader_97_lab_vote_mean <- mean(not_reader$vote_l_97)
reader_97_lab_vote_mean <- mean(reader$vote_l_97)
# With the result:
not_reader_97_lab_vote_mean
[1] 0.4305355
reader_97_lab_vote_mean
[1] 0.5829384
# The same group means can also be obtained in a single step with dplyr:
news |> group_by(reader) |> summarise(mean(vote_l_97, na.rm = TRUE))
# A tibble: 2 × 2
  reader `mean(vote_l_97, na.rm = TRUE)`
   <dbl>                           <dbl>
1      0                           0.431
2      1                           0.583
Difference between “treatment” groups in 1997: “Average Treatment Effect”
The average treatment effect would be the difference between the two groups in the 1997 vote share:
between_97 <- reader_97_lab_vote_mean - not_reader_97_lab_vote_mean
between_97
[1] 0.1524029
ATE <- between_97
round(ATE * 100, 1)
[1] 15.2
Between-group difference in 1992
not_reader_92_lab_vote_mean <- mean(not_reader$vote_l_92)
reader_92_lab_vote_mean <- mean(reader$vote_l_92)
not_reader_92_lab_vote_mean
[1] 0.3227207
reader_92_lab_vote_mean
[1] 0.3886256
between_92 <- reader_92_lab_vote_mean - not_reader_92_lab_vote_mean
between_92
[1] 0.0659049
round(between_92 * 100, 1)
[1] 6.6
Before/After difference within “treatment” groups
mean(news$vote_l_92[news$reader == 0], na.rm = TRUE)
[1] 0.3227207
mean(news$vote_l_92[news$reader == 1], na.rm = TRUE)
[1] 0.3886256
within_nonreader <- not_reader_97_lab_vote_mean - not_reader_92_lab_vote_mean
within_reader <- reader_97_lab_vote_mean - reader_92_lab_vote_mean
within_nonreader
[1] 0.1078148
within_reader
[1] 0.1943128
Differences-in-Differences
We can get the “difference in differences” effect in several ways:
DiD <- within_reader - within_nonreader
DiD
[1] 0.08649803
between_97 - between_92
[1] 0.08649803
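The two computations agree because the double difference can be grouped either by treatment status or by year. Writing \(\bar{y}_{g,t}\) for the mean Labour vote of group \(g\) (readers \(R\) or non-readers \(N\)) in year \(t\):

$$
\text{DiD} = \underbrace{(\bar{y}_{R,97} - \bar{y}_{R,92})}_{\text{within readers}} - \underbrace{(\bar{y}_{N,97} - \bar{y}_{N,92})}_{\text{within non-readers}} = \underbrace{(\bar{y}_{R,97} - \bar{y}_{N,97})}_{\text{between, 1997}} - \underbrace{(\bar{y}_{R,92} - \bar{y}_{N,92})}_{\text{between, 1992}}
$$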
We can also obtain the DiD estimate with a linear regression model, in which the outcome \(y\) is the within-treatment-group difference and the predictor \(x\) is the “treatment” (reader). We can calculate the \(y\) as part of the regression command, but to include a mathematical operation within the regression formula, we need to “isolate” the mathematical operation using the I() function:
lm(I(vote_l_97 - vote_l_92) ~ reader, data = news) |> get_parameters()
    Parameter   Estimate
1 (Intercept) 0.10781476
2      reader 0.08649803
Look back at the tabulation of the vote_l_97 - vote_l_92 difference from earlier to check again the values taken by \(y\) in this model.
- What does the (Intercept) term represent here?
Linear probability model
The comparison of proportions that we calculated earlier relies on comparing average changes. We can also assess the effects involved using ordinary least squares (OLS) linear regression (a so-called “linear probability model”, given that our outcome is binary). For example, we can check what the overall mean vote for Labour in 1997 was in the sample, and the average treatment effect we calculated earlier:
## the overall mean of Labour vote in 1997
mean_l_97 <- lm(vote_l_97 ~ 1, data = news)
coefficients(mean_l_97)
(Intercept)
0.4507219
In the above, we do not include any predictor \(x\) variables; instead, we specify the number 1 on the right-hand side of the equation to get an “intercept”: the average of the entire sample.
Including the “treatment” variable reader in the model gives us the average treatment effect from earlier:
ATE_reg <- lm(vote_l_97 ~ reader, data = news)
coefficients(ATE_reg)
(Intercept) reader
0.4305355 0.1524029
The “(Intercept)” here absorbs only one variable category - when the value of reader is 0 (i.e. when the respondent is a “non-reader”, or an “untreated” case) - and as such it tells us the proportion of the 1997 Labour vote among those who were not readers of one of the endorsement-shifting newspapers.
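As a quick check of this logic on toy data (made-up values, not the lab dataset): with a single binary predictor, the lm() intercept always equals the outcome mean in the 0 group:

```r
# Toy binary outcome y and binary predictor x
y <- c(0, 1, 1, 0, 1, 1)
x <- c(0, 0, 0, 1, 1, 1)
fit <- lm(y ~ x)

# The intercept equals the mean of y among observations with x == 0,
# and the slope is the difference in means between the two groups
unname(coef(fit)[1])        # same as mean(y[x == 0])
unname(coef(fit)[2])        # same as mean(y[x == 1]) - mean(y[x == 0])
```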
It can often be useful to know how to extract specific coefficients/parameters from regression model results, which allows us to perform further operations on them. Below we use both base-R functions and some equivalent - and simpler - functions from the insight package which is included as part of the easystats meta-package:
## base R: extract intercept coefficient, turn into a percentage and round to 1 decimal
coefficients(ATE_reg)["(Intercept)"]
(Intercept)
0.4305355
coefficients(ATE_reg)["(Intercept)"][[1]]
[1] 0.4305355
round(coefficients(ATE_reg)["(Intercept)"][[1]] * 100, 1)
[1] 43.1
## Extracting the intercept value with base-R and formatting it using easystats {insight}
coefficients(ATE_reg)["(Intercept)"][[1]] |> format_percent(digits = 1)
[1] "43.1%"
## Extracting the intercept value and formatting it using easystats {insight}
get_intercept(ATE_reg) |> format_percent(1)
[1] "43.1%"
format_percent(get_intercept(ATE_reg), 1)
[1] "43.1%"
What happens if we add together the two coefficients from the ATE_reg model?
coefficients(ATE_reg)["(Intercept)"] + coefficients(ATE_reg)["reader"]
(Intercept)
0.5829384
We can also use the get_parameters() function from insight for this:
# Extracting parameter values (including for the intercept) and manipulating them, using {insight}
get_parameters(ATE_reg)
Parameter Estimate
1 (Intercept) 0.4305355
2 reader 0.1524029
get_parameters(ATE_reg)[2]
Estimate
1 0.4305355
2 0.1524029
get_parameters(ATE_reg)[2] |> sum()
[1] 0.5829384
get_parameters(ATE_reg)[[2]]
[1] 0.4305355 0.1524029
get_parameters(ATE_reg)[[2]] |> sum()
[1] 0.5829384
We can also obtain this parameter directly from a regression specification that omits the intercept term from the model. In practice, it is almost never a good idea to omit the intercept term from a model, but this example can provide an easy explanation of what happens if we do omit it. To leave out the intercept from a regression in R, we can replace the value of 1 on the right-hand side of the equation (that value is always added into the regression model by default, even if we do not specify it) with either a 0 or a -1. We can check below whether the two procedures produce equivalent results:
## the mean of Labour vote in 1997 among the "treated"
reader_mean_l_97a <- lm(vote_l_97 ~ 0 + reader, data = news)
reader_mean_l_97b <- lm(vote_l_97 ~ -1 + reader, data = news)
get_parameters(reader_mean_l_97a)
Parameter Estimate
1 reader 0.5829384
get_parameters(reader_mean_l_97b)
Parameter Estimate
1 reader 0.5829384
## Are the two coefficients for `reader` obtained above identical?
### Extract coefficients with base-R
identical(coefficients(reader_mean_l_97a),
coefficients(reader_mean_l_97b)
)
[1] TRUE
### Extract parameters (names and coefficients) with {insight}
identical(get_parameters(reader_mean_l_97a),
get_parameters(reader_mean_l_97b)
)
[1] TRUE
If we treat the reader variable as a factor (categorical) instead of numeric, we can see more clearly what the lack of an intercept means, and how the coefficient of reader changes meaning compared to the model with the intercept included:
reader_mean_l_97c <- lm(vote_l_97 ~ 0 + as.factor(reader), data = news)
## Pretty print parameters, keeping only the coefficients
get_parameters(reader_mean_l_97c)
Parameter Estimate
1 as.factor(reader)0 0.4305355
2 as.factor(reader)1 0.5829384
model_parameters(reader_mean_l_97c, ci = NULL) |> print(digits = 3)
Parameter  | Coefficient
------------------------
reader [0] |       0.431
reader [1] |       0.583
compare_parameters(reader_mean_l_97c, ci = NULL) |> print(digits = 3)
Parameter | reader_mean_l_97c
--------------------------------
reader [0] | 0.431
reader [1] | 0.583
--------------------------------
Observations | 1593
compare_parameters(reader_mean_l_97a,
reader_mean_l_97b,
reader_mean_l_97c,
ci = NULL) |> print(digits = 3)
Parameter | reader_mean_l_97a | reader_mean_l_97b | reader_mean_l_97c
------------------------------------------------------------------------
reader | 0.583 | 0.583 |
reader [0] | | | 0.431
reader [1] | | | 0.583
------------------------------------------------------------------------
Observations | 1593 | 1593 | 1593
Check these regression results against the values for not_reader_97_lab_vote_mean and reader_97_lab_vote_mean obtained previously and attempt an interpretation.
We can now add the third variable - Labour vote in 1992 - into the model:
m1 <- lm(vote_l_97 ~ reader + vote_l_92, data = news)
coefficients(m1)
(Intercept) reader vote_l_92
0.2113743 0.1076466 0.6791049
Questions
What does this simple model tell us?
What about if we also include the interaction effect between the two predictors?
m1_int <- lm(vote_l_97 ~ factor(reader) * factor(vote_l_92), data = news)
summary(m1_int)

Call:
lm(formula = vote_l_97 ~ factor(reader) * factor(vote_l_92),
    data = news)
Residuals:
Min 1Q Median 3Q Max
-0.92683 -0.20513 -0.20513 0.09641 0.79487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20513 0.01235 16.607 < 2e-16 ***
factor(reader)1 0.15921 0.03549 4.486 7.77e-06 ***
factor(vote_l_92)1 0.69846 0.02174 32.124 < 2e-16 ***
factor(reader)1:factor(vote_l_92)1 -0.13597 0.05763 -2.359 0.0184 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3779 on 1589 degrees of freedom
Multiple R-squared: 0.4247, Adjusted R-squared: 0.4236
F-statistic: 390.9 on 3 and 1589 DF, p-value: < 2.2e-16
Question: How do we interpret the coefficients on this simple interaction model?
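To unpack the interaction, note that each of the four group means implied by the model is a sum of coefficients (the values below are copied from the model summary above):

```r
# Coefficients from the interaction model summary above
b0   <-  0.20513   # (Intercept)
b_r  <-  0.15921   # factor(reader)1
b_v  <-  0.69846   # factor(vote_l_92)1
b_rv <- -0.13597   # factor(reader)1:factor(vote_l_92)1

# Implied mean 1997 Labour vote in each of the four groups:
b0                      # non-reader, did not vote Labour in 1992
b0 + b_r                # reader, did not vote Labour in 1992
b0 + b_v                # non-reader, voted Labour in 1992
b0 + b_r + b_v + b_rv   # reader, voted Labour in 1992
```

The negative interaction term thus means that the “reader” gap in 1997 is smaller among 1992 Labour voters than among 1992 non-Labour voters.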
If the interpretation of the interaction model proves difficult (as it should), maybe it’s better to visualise the model by plotting it:
predict_response(m1_int, terms = c("reader", "vote_l_92")) |>
plot(connect_lines = TRUE)
To make the meaning of the plot even more straightforward for our interpretative purposes, we can make some changes to the y-axis to display vertical lines at the probability-levels that we got from the model coefficients, and label them with the name of the regression terms. Check the code below against the information from the model summary:
predict_response(m1_int, terms = c("reader", "vote_l_92")) |>
plot(connect_lines = TRUE, dodge = NULL) +
theme(panel.grid.minor = element_blank(),
axis.title.y = element_blank()) +
expand_limits(x = 0, y = 0) +
scale_y_continuous(breaks = c(0,
0.20513,
0.20513 + 0.15921,
0.20513 + 0.69846,
0.20513 + 0.15921 + 0.69846 - 0.13597,
1),
labels = c("0%",
"92 Not-Lab Not-Reader \n Intercept (20.5%)",
"92 Not-Lab Reader",
"92 Lab Not-Reader",
"92 Lab Reader",
"100%")) +
scale_x_continuous(breaks = c(0, 1))
predict_response(m1_int, terms = c("vote_l_92","reader")) |>
plot(connect_lines = TRUE, dodge = NULL) +
theme(panel.grid.minor = element_blank(),
axis.title.y = element_blank()) +
expand_limits(x = 0, y = 0) +
scale_y_continuous(breaks = c(0,
0.20513,
0.20513 + 0.15921,
0.20513 + 0.69846,
0.20513 + 0.15921 + 0.69846 - 0.13597,
1),
labels = c("0%",
"92 Not-Lab Not-Reader \n Intercept (20.5%)",
"92 Not-Lab Reader",
"92 Lab Not-Reader",
"92 Lab Reader",
"100%")) +
scale_x_continuous(breaks = c(0, 1))