library(tidyverse)
library(easystats)
library(gtsummary)
library(ggformula)
library(sjlabelled)
library(ggeffects)
library(marginaleffects)
Week 3 Worksheet Notes
Categories: Logistic regression and other generalised linear models
Aims
This session introduces binary logistic regression models. These models are the simplest form of a broader class of models called generalised linear models, which are applicable when the outcome (“dependent”, “response”, “explained”, etc.) variable cannot be assumed to follow a Gaussian (i.e. “normal”) distribution, but it instead a bounded or discrete measurement (e.g. think of variables whose values cannot be negative - i.e. have a lower limit of 0 - or fall into discrete categories such as “yes”/“no”, “disagree”/“neither agree nor disagree”/“agree”, or “blue”/“green”/“black”/“brown”/“other”). Binary logistic regression is the simplest case, where the outcome can take only two values (therefore “binary”). However, the logic that underpins it is similar to that of other generalised linear models.
By the end of the session you will learn how to:
- Fit and summarise logistic regression models in
R
- Interpret results from logistic regression models
- Manipulate the regression output to ease interpretation
- Plot and visualise results from logistic regression models to aid interpretation
R packages
Introduction
To introduce logistic regression, we replicate part of the results presented in Wu (2021), specifically a simpler version of their Model 1 in Appendix A. Table 2, with only individual-level variables and pooled data that disregards the hierarchical nature of the dataset. The main problem addressed by Wu (2021) is the relationship between education and social trust at a global level, and for this they partially use data from Waves 5 and 6 of the World Values Survey.
The WVS measures social trust with a binary question:
This variable will serve as the outcome variable.
Previous research had generally concluded that “there is a strong and positive relation between education level and trust” (Wu 2021:1167). However,
“several studies have shown that education might yield differential impacts on trust in different societies. In Sweden, Sven Oskarsson et al. (2017) show that education has little impact on trust. In China, Cary Wu and Zhilei Shi (2020) suggest that education has a negative impact on people’s trust. Several cross-national studies have also shown that the education and trust association can vary from positive to negative depending on the specific institutional contexts … For example, the effect of education is found to be negative in highly corrupt countries such as Serbia, Turkey, Hungary, Slovakia, Bulgaria, Croatia, Kosovo, and Ukraine” (p. 1167).
Wu (2021) investigates how individual-level (micro) factors (such as education level and risk-taking propensity) may have different effects on trust depending on country-level (macro) factors (such as sociopolitical risk). We will focus on the individual (micro) level in this exercise, using measurements of education level and risk-taking propensity as our main explanatory variables. Following Wu (2021:1172), we will also control for a number of demographic variables at the individual level (“gender, age, income, marital status, and occupational status”).
Data
The data required for this analysis can be downloaded freely from the World Values Survey website. We download Wave 5 (2005-2009) and Wave 6 (2010-2014) data. These are the only waves that measured risk-taking propensity using Schwartz’s value item,
“where respondents were asked to report their similarity to a hypothetical individual: “Adventure and taking risks are important to this person; to have an exciting life.” Respondents rated the statement using a 6-point scale (1 = very much like me, 6 = not at all like me)” (Wu 2021:1172).
The latest wave of the WVS does not include this variable. Wu (2021) describe the other variables they use
For the WVS, I measure educational attainment using respondents’ highest education level attained with eight categories, namely, 1 = no formal education or inadequately completed elementary education, 2 = completed (compulsory) elementary education, 3 = incomplete secondary school: technical/vocational, 4 = complete secondary school: technical/vocational, 5 = incomplete secondary: university, 6 = complete secondary: university, 7 = some university without degree, and 8 = university with degree/higher education.
In some analyses, I treat education as a categorical variable. To reduce the number of categories, I recode respondents’ education into Primary, Secondary, Post-secondary, and Tertiary. This is also consistent with the most recent wave of the WVS coding (p. 1171)
I also control for relevant demographic covariates such as gender, age, income, marital status, and occupational status at the individual level (p. 1172)
We can also find the questionnaires and the codebooks on the survey website, which will help identify the relevant variables.
The code below imports and selects the data from the original WVS dataset and creates a data-frame called wvs56
that contains all the variables needed:
## Paths to files --------------------------------------------------------------------------------------------------------------
### Data was downloaded in SPSS (.sav) format from the WVS website (https://www.worldvaluessurvey.org/WVSContents.jsp)
### Downloaded data files were extracted and saved in the folder "raw" within the folder "data"
### Create paths to the data files are created from the RProject root folder; cross-check the created path string variable:
<- here::here("data", "raw", "WV5_Data_Spss_v20180912.sav")
wvs5_path <- here::here("data", "raw", "WV6_Data_sav_v20201117.sav")
wvs6_path
## Select variables ------------------------------------------------------------------------------------------------------------
### Create a vector storing names for the variables that will be used, based on Wu (2021: 1170-1172):
<- c("country", "year", "trust", "risktaker", "education", "sex", "age", "income", "marstat", "employment")
wu_vars
### Create vectors storing the original names of the relevant variables in the WVS5 and WVS6 datasets:
<- c("V2", "V260", "V23", "V86", "V238", "V235", "V237", "V253", "V55", "V241")
wvs5_vars <- c("V2", "V262", "V24", "V76", "V248", "V240", "V242", "V239", "V57", "V229")
wvs6_vars
## Read in the data files -----------------------------------------------------------------------------------------------------
<- read_spss(wvs5_path) |>
wvs5 data_select(select = c(wvs5_vars))
<- read_spss(wvs6_path) |>
wvs6 data_select(select = c(wvs6_vars))
## Read in the data files -----------------------------------------------------------------------------------------------------
### Create codebooks to check variables
### Comparing the codebooks we see that the coding of all the variables is similar
### WVS6 has shorter value labels on some variables
<- wvs5 |>
wvs5_codebook data_codebook()
<- wvs6 |>
wvs6_codebook data_codebook()
## Rename variables -----------------------------------------------------------------------------------------------------------
### It's safe to replace the original var names with more human-readable names; we use `datawizard::data_rename()`
### Replace the value codes of `country` and `year` with the labels; will make merging datasets less error-prone
<- wvs5 |>
wvs5 data_rename(c(wvs5_vars), c(wu_vars)) |>
labels_to_levels(c("country", "year"))
<- wvs6 |>
wvs6 data_rename(c(wvs6_vars), c(wu_vars)) |>
labels_to_levels(c("country", "year"))
## Merge data files -----------------------------------------------------------------------------------------------------------
### Now variable names are the same in both datasets, the two can be joined
### `datawizard::data_merge(..., join = "bind")` is similar to `dplyr::bind_rows()` but also copies value labels from the first-mentioned dataset
### We mention the `wvs6` dataset first to keep its (shorter) version of value labels
<- data_merge(wvs6, wvs5, join = "bind")
wvs56
## Create new `country_year` variable -----------------------------------------------------------------------------------------
<- data_unite(wvs56, new_column = "country_year", c("country", "year"), append = TRUE) wvs56
Recoding variables
There are a few final alterations that need to be made to the variables:
trust
will serve as our response (dependent) variable - the one we want to explain/model. We can see from the codebook that it has two valid categories (“1” and “2”), referring to “Most people can be trusted” and “Need to be very careful”, respectively. Given that it is a binary (dichotomous) variable, we will use binary logistic regression to model it. When we enter a variable into a logistic regression model as dependent variable inR
, the first category/level will be automatically recoded internally to the value “0” and be used as the reference category; the latter category/level will be recoded to “1”, and it will be the indicator category that the model predicts in contrast to the reference category. In our case, that will mean that the category referring to “Need to be very careful” will be treated as the indicator variable by default, so effectively the coefficients of the explanatory (predictor, independent) variables will be estimating the propensity towards “distrust”. If we first reversed the order of the categories in the dependent variable, the output would be estimating “trust”, which is conceptually cleaner and more straightforward to interpret;- Following Wu (2021), we can treat our main explanatory variable -
education
(Highest educational level attained) - as numeric. But in the dataset it is coded as factor, so we will have to either change it to numeric in the dataset, or force it to be treated as numeric byR
when fitting the model. Technically,education
is categorical in nature, but given that the categories are ordinal in their substantive meaning (increasing from no/less education to more) and there are more than just a handful of levels/categories (there are 9), we can technically treat it as a short numeric scale, while keeping in mind that descriptive statistics such as the mean do not make much sense in this case; - Wu (2021) also noted that in some models (including in Model 1, Appendix A. Table 2) they use a recoded categorical version of the education, collapsed over 4 categories: “Primary”, “Secondary”, “Post-secondary”, and “Tertiary” education. We can also create this variable to potentially include it in our models.
We can make the required transformations in a single call to the {datawizard}
function data_modify()
like so:
<- wvs56 |> # (1)
wvs56 data_modify(trusting = recode_values(trust,
recode = list("0" = 2)), # (2)
educ_num = as_numeric(education), # (3)
educ_cat_1 = recode_values(education, # (4)
recode = list("1" = ("1, 2, 3"),
"2" = ("4, 5, 6, 7"),
"3" = ("8"),
"4" = ("9"))), # (4.1)
educ_cat_2 = recode_values(education, # (5)
recode = list("1" = ("1, 2"),
"2" = ("3"),
"3" = ("4, 5"),
"4" = ("6, 7"),
"5" = ("8"),
"6" = ("9")))
|> # (6)
) set_labels(trusting, labels = c("Not trusting", "Trusting")) |> # (7)
set_labels(educ_cat_1, labels = c("Primary",
"Secondary",
"Post-secondary",
"Tertiary")) |>
set_labels(educ_cat_2, labels = c("Less than primary",
"Primary",
"Secondary-vocational",
"Secondary-preparatory",
"Post-secondary",
"Tertiary"))
# Code explanation:
# (1) overwrite the data object with the changes
# (2) reverse the category order of "trust" and save it as a new variable called "trusting" to make the indicator meaning clearer
# (3) change the type of variable "education" to numeric and save it as the new variable "educ_num"
# (4) recode the "education" variable into 4 larger categories and save the new variable as "educ_cat_1"
# (4.1) make sure to close all the brackets under `recode_values()`
# (5) alternative recode of "education" into 6 categories as "educ_cat_2"
# (6) remember to close the brackets under `data_modify()`
# (7) add value labels
Of course, we could do all the changes one-by-one, but because we can achieve everything we wanted using the data_modify()
option, it’s easier to do it all at once. When you are experimenting with recoding, it is probably a good idea to do it one-by-one and check the results at each step to catch out any errors more easily.
Let’s have a look at the recoded dataset; because we did not overwrite any of the original variables, we will have both the old and the newly recoded/renamed variable in the dataset:
<- data_codebook(wvs56) wvs56_codebook
View(wvs56_codebook)
The newly created variables look like this:
# Showing values:
data_tabulate(wvs56, trusting:educ_cat_2)
# Showing value labels:
|> data_select(trusting:educ_cat_2) |> labels_to_levels() |> data_tabulate() wvs56
Describing the relationship between the main variables of interest
Now that the data has been recoded, we can have a closer look at the relationship between our two main variables: trusting and educ_num. Since the education variable is now recorded as numeric, we can make use of a boxplot for visualisation. In Week 2 we practiced building {ggplot2}
plots incrementally, and the boxplot could be coded like this:
ggplot(wvs56) +
aes(trusting, educ_num) +
geom_boxplot()
We could also set the value labels as the values to plot, and we could remove the missing values:
|>
wvs56 labels_to_levels(trusting) |>
remove_missing() |>
ggplot() +
aes(trusting, educ_num) +
geom_boxplot()
While ggplot2::ggplot()
is flexible and useful for expanding the plots and adding various embellishments, the {ggformula}
package simplifies its syntax by making it similar to the formula syntax used in common modelling functions (such as lm()
). All the {ggformula}
commands begin with gf_
followed by the type of the plot. The command for a boxplot would be as simple as:
gf_boxplot(educ_num ~ trusting, data = wvs56)
It also complies with pipe workflows, like for example:
|>
wvs56 labels_to_levels(trusting) |>
remove_missing() |>
gf_boxplot(educ_num ~ trusting)
Looking at the boxplots, we can see a slightly higher median education level among those in the “Trusting” category (those who responded that “Most people can be trusted” in the survey). For a more detailed numerical summary of this association, we can use another function from the {datawizard}
package, means_by_group()
:
|> means_by_group("educ_num", "trusting") wvs56
Here we can see more clearly the difference in average education levels between those with high trust levels and those with low trust levels. This table is equivalent to the results from a linear regression model in which the numeric variable educ_num
is the dependent variable and trusting
is the explanatory (independent) variable. We should keep in mind, however, that while our educ_num
variable is treated as numeric, it is in fact an ordinal categorical variable, and a mean value of 5.88 over trusting
cannot be interpreted meaningfully. If we are interested in only in differences, we can calculate the median educ_num
- and any other statistics - over trusting
manually:
|>
wvs56 labels_to_levels(trusting) |>
remove_missing() |>
data_group(trusting) |>
summarize(mean_educ = mean(educ_num),
median_educ = median(educ_num))
Out interest, however, is in modelling the trusting
variable as a function of education, while also accounting for the other demographic covariates. For this, we turn to a logistic regression model.
Simple bivariate logistic regression model
Fitting a logistic regression model in R
is fairly straightforward. We can use the base R glm()
function and would write the following code:
<- glm(trusting ~ educ_num, family = binomial(link = "logit"), data = wvs56) model_1
In the code above, instead of the linear model function (lm()
) we used the generalised linear model function (glm()
), and because glm()
is a larger family of models, we need to also specify which “family” we are fitting; since our dependent variable is binary variable, we are calling the “binomial” family. We also need to specify a “link function” to be used, but the “logit” is the default link for the binomial distribution family, so we could have ommitted it and simply written glm(..., family = binomial)
.
In all other aspects the formula is the same as the one we already know from linear regression.
Let’s look at the model parameters:
model_parameters(model_1)
It is at the stage of interpretation that logistic regression models become difficult. The coefficients that we obtain in the standard output are similar to those from a linear regression in that they summarise a linear and additive relationship between the independent and dependent variables. The difference is that the dependent variable has been transformed to a logit or logged odds scale. We could say - following the logic of linear regression - that each unit difference in one’s education level is associated with a 0.18-point positive difference in the logged odds/logit value of trusting. But this doesn’t make much sense, because the logit scale is difficult to comprehend.
It would be more intuitive to see the results on a different scale. A simple and commonly used transformation is from log-odds to odds ratios. If we exponentiate the logistic regression log-odds coefficients, we could say that the independent variables relate to the odds of the dependent variable, rather than the logged odds, which is somewhat easier to relate to. The model_parameters()
function can do that for us in exchange of some additional specification:
model_parameters(model_1, exponentiate = TRUE)
The odds ratios that we find in this table are somewhat easier to interpret, but we need to understand how odds work. By having exponentiated the coefficients, we have moved from an additive scale to a multiplicative scale. Basically, what this means is that while on an additive scale the value “0” means no change (because by adding 0 to a number we do not alter that number), on a multiplicative scale that role is taken over by “1” (when we multiply a number by “1”, we keep it as it is). Consequently, a coefficient of “1” on the exponentiated scale means no difference, while values between 0 and 1 represent a negative difference and values greater than 1 a positive difference.
I our case, the OR of 1.10 means that a difference of one degree in education level is associated with a positive difference of 1.10 in the odds of being a trusting person. We could further translate this into a percentage: if we subtract 1 from the exponentiated coefficient and multiply it by 100, we get the percentage difference due to a unit change in the independent variable.
\[ \% \Delta _{odds} = 100\times(OR-1) \]
For values grater than 1 this calculation is straightforward. In our case, this translates to a 10% increase in the odds of being a trusting person.
It doesn’t add much to our understanding to plot the regression results from a model with a single predictor, but knowing how to plot results will be particularly useful once we expand the model with further explanatory variables. To create a plot of the results, we can simply take the output from the model_parameters()
function and pass it on to the plot()
function included in {easystats}
:
parameters(model_1, exponentiate = TRUE) |> plot()
However, the fact that the exponentiated coefficients are measured on a multiplicative scale poses some difficulties and it makes it easy to fall into the trap of interpreting these odds as probabilities.
The issue is that odds can range from 0 to infinity, so the scale is far from standard. Probabilities, on the other hand, have the nice feature of being constrained between 0 and 1, so a percentage change has much more meaning.
There are several packages that help us compute various comparative statistics on the probability scale from logistic models.
Below, we will use the avg_slopes()
function from the {marginaleffects}
package to compute the average difference in the probability of being trusting based on a one-unit difference in education level:
avg_slopes(model_1)
The result indicates the change in the predicted probability that the outcome (trusting
) equals 1. On average, a unit-change in educ_num
changes the predicted probability that the outcome equals 1 by 1.8%.
This “average marginal” effect represents the average slope of the predictor, and as such it can be considered as an “adjusted regression coefficient”. However, because the effect is no longer fixed across the values of the predictor variable, it is still less intuitive to talk about an “average” effect in the case of logistic regression.
To make these numerical results easier to visualise, we can plot the result.There are several options, but probably the most straightforward approach is to estimate the probabilities of being trusting
at each level of the predictor variable (educ_num
), then plot the results. The plot_predictions()
function from the {marginaleffects}
package or the plot()
function of {ggeffects}
can be used here. Below, we’ll use the latter:
ggpredict(model_1, "educ_num") |>
plot()
This code first calculates predictions of the outcome probability for each level of the explanatory variable using the ggpredict()
function (we could save these results as a data-frame), then passes those results on to the plot()
function. To remove the overlapping value labels on the horizontal axis, we could add something like the following to the command:
ggpredict(model_1, "educ_num") |>
plot() +
scale_x_continuous()
This graph makes it a lot clearer how education is associated with trust in our model, and if education is our main variable of interest, we can plot its effect even from a larger model that contains other covariates.
Expanding the logistic regression model
This step does not contain anything new compared to what we already know from multiple linear regression and what we have learnt in the previous step about the logistic regression model.
We can add the other relevant variables using the +
sign, and we can change the variable measuring “education” for the categorical version. We also specify risktaker
and income
as numeric variables:
<- glm(trusting ~ as.numeric(risktaker) + educ_cat_1 + sex + age + as.numeric(income) + marstat + employment, family = binomial, data = wvs56) model_2
Tabulate the model parameters; to make the calculation of the confidence intervals faster, we can add ci_method="wald"
to the command:
model_parameters(model_2, exponentiate = TRUE, ci_method = "wald")
Plot the model parameters on the log-odds scale:
model_parameters(model_2, ci_method = "wald") |> plot()
The plotted results are much more informative in the case of the multiple regression model. We find that higher education, being a female, higher age and higher income brackets all have a positive association with trust, while being other than married and in non-full-time employment show a negative association with trust.
We can plot the exponentiated coefficients to interpret odds instead:
model_parameters(model_2, exponentiate = TRUE, ci_method = "wald") |> plot()
Plot predicted probabilities of trusting based on education, adjusted for the effect of the other covariates in the model:
ggpredict(model_2, c("educ_cat_1")) |>
plot() +
scale_x_continuous()
Given the multiple explanatory variables included in the model, we can further break down the effects. For example, we can plot the effect of education on trust by sex within each marital status.
ggpredict(model_2, c("educ_cat_1", "sex", "marstat")) |>
plot() +
scale_x_continuous()
It looks like the effect of education on trust is stronger among those married, or single or cohabitating than among the separated or widowed, for both men and women.
This is a rather complex finding, and we can explore this model in much more depth, as well as thinking about developing it or specifying it further.