SOC2069 Quantitative Methods
Week 5 Worksheet

Learning outcomes

By the end of the session, you should be familiar with:

  • running logistic regression in JASP

  • the interpretation of logistic regression coefficients

  • sub-setting/filtering data in JASP

  • recoding/dichotomising a variable in JASP

Intro

In this worksheet we will practice another type of regression analysis, one that allows us to model outcome/dependent/response/\(y\) variables that are not measured on a continuous numeric scale but on a categorical measurement scale that allows only two values (e.g. “Yes”/“No”; “Agree”/“Disagree”; “Above average”/“Average and below”; “Trusts”/“Doesn’t trust”; etc.). Such a variable is called binary or binomial (because it can take only two values), or dichotomous (because the two values present a “dichotomy”). Modelling such a variable using the linear regression approach from Week 4 would give inaccurate results, but we can generalise the logic of linear regression to make it applicable to dichotomous dependent variables. In fact, the method we learn about today - (binary/binomial) logistic regression - is the foundational case of a broader class of statistical models called generalised linear models (GLMs). GLMs can be thought of as a two-stage modelling approach: first we model the response variable using a probability distribution - the binomial distribution in our case - and then we model the parameter of that distribution using a collection of predictors and a special form of multiple regression.

This all sounds very technical, and it does involve some complex mathematics, but applying logistic regression in practice will feel very similar to what we have done with linear regression before. The challenge lies in interpreting the statistical results accurately and meaningfully, given that the resulting regression coefficients refer to estimates on the mathematically transformed scale of the distribution’s parameter, rather than the original scale of measurement of the variables as in the case of linear regression. The Advanced topics readings assigned for this week outline these challenges for those who would like to gain a deeper understanding of the mathematics and mechanics of logistic regression; for the rest of us, Connelly et al. (2016) (on the reading list) provides a very approachable introduction to the major challenges, specifically for sociological analyses (see, in particular, the sections from “Parameter estimates in logistic regression models” to “The presentation of logistic regression results”).

In essence, with binary logistic regression we will attempt to predict the probability that an observation falls into one of the two categories of a dichotomous dependent (outcome) variable based on one or more independent (predictor, explanatory) variables. Observations are predicted to fall into whichever one of the two outcome categories is most probable for them given the values they have on the independent variable(s). Logistic modelling is thus often considered a classification method rather than a regression method, especially in recent machine learning parlance.
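To make the two-stage idea more concrete, the binary logistic model can be written down as follows (this is the standard textbook formulation, not anything JASP-specific): the outcome is modelled with a binomial distribution whose probability parameter is \(p_i\), and the logit transformation of \(p_i\) is in turn modelled as a linear function of the predictors:

\[ y_i \sim \text{Binomial}(1, p_i), \qquad \log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} \]

It is this transformed logit (log-odds) scale that the regression coefficients will refer to, which is why interpreting them requires the extra care discussed above.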

Solutions

A JASP file with the solutions to the exercises can be downloaded from HERE.

Exercise 5.1: Predict whether social trust at country level is above average

This exercise is primarily designed to introduce some new data management skills that may be useful in other contexts, and to connect with the data and analysis performed in Week 4. In Week 4 Worksheet - Exercise 3 we fit a multiple linear regression model of “social trust”, measured as the percentage of a country’s population who answered that “people in general can be trusted” (as opposed to “one cannot be too careful”) to the standard binary “social trust” survey question as asked in the World Values Survey (trust_pct). We regressed that outcome on two variables: inequality_s80s20 and Region.

What if our research question asked not about the distribution of trust on a percentage scale, but about the factors associated with whether the level of social trust in a country is above average? In this case, our “quantity of interest” is different and requires a different dependent/outcome variable: one that dichotomises trust_pct into two outcome categories, “below average” and “above average”. We can easily transform our existing measurement of social trust onto such a binary dichotomy because the trust_pct measurement contains more detailed information than the dichotomy needs. Note that in actual research it is not good practice to transform variables in ways that lose information; instead, we would strive to find a modelling method that suits the format of the variable encompassing the most available information. For our didactic purposes, however, we will dichotomise the trust_pct variable to practice this useful data-transformation technique.

Let’s load the Trust & Inequality dataset (trust_inequality.dta) into JASP; it can be downloaded from https://cgmoreh.github.io/SOC2069-QUANT/Data/ if not yet saved to a local/OneDrive folder.

Task 5.1.1: Dichotomise the trust_pct variable

In JASP we can create new variables by switching to the Edit Data menu tab and clicking on the green plus sign above the first empty column at the end of our dataset (there are other ways too, but this is one):

When clicked, a set of fields pops up where we can enter the name we want to give to the new variable (this is required) as well as other details. We will create a variable called trust_d, short for “dichotomised trust scale”. Once we type the desired variable name into the Name field and press Enter, an empty variable is created. We can also add a label (Long name) and description, and for this exercise, under Computed type we will select to compute the values of the new variable “with R code”. This option is more flexible and allows us to “cut” the trust_pct scale into two around the mean/average value.

For this, we need to know the mean value of trust_pct in the dataset. We can request a table of Descriptive Statistics for this, as we have done many times before, and find that the mean value of trust_pct is 25.731. Returning to the dataset window, we enter the following R code into the Compute column definition field:

cut(trust_pct, breaks = c(0, 25.731, 100))

Clicking the Compute column bar at the bottom of the field, the values are computed, and the window should look something like this:

The only thing we need to know about the R code we just entered is that it “cuts” the trust_pct variable into sections around three break-points: the value 0, the mean value (25.731) and the value 100. Because 0 and 100 represent the absolute theoretical extreme points of the trust_pct scale (it is a percentage scale), the only actual “cut” in the dataset is made around the mean value. In R, the c(...) function simply “combines” the values enumerated into a list of numbers; we need it whenever we want to specify more than one value. If we specified only a single number for breaks in the cut(...) function, R would interpret it as the number of sections to cut the trust_pct scale into, rather than as a break-point - and that’s not what we want here.

We still need to make some changes to this newly minted variable. We can click on the Label editor tab, where we will see the current values assigned automatically based on our “cutting” procedure: the first category covers trust_pct values from 0 up to and including 25.7, and the second covers values above 25.7 up to 100. We want easier values and labels to work with, so we’ll recode the first value as 0 and give it the label “Low trust”, and recode the second as 1 and label it “High trust”. Our window should now look something like this:

We can return to the Analyses menu tab now.
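As an aside, it may be useful to see what the whole recode would look like as standalone R code outside JASP. This is only a sketch for reference: it assumes the dataset has been loaded into a data frame that we have named trust_inequality (the name is our own choice; any valid name would do):

```r
# Find the mean of trust_pct (approx. 25.731 in this dataset)
mean_trust <- mean(trust_inequality$trust_pct, na.rm = TRUE)

# cut() returns a categorical variable (a "factor" in R);
# `labels` assigns the category names directly
trust_inequality$trust_d <- cut(trust_inequality$trust_pct,
                                breaks = c(0, mean_trust, 100),
                                labels = c("Low trust", "High trust"))

# Check how many countries fall into each category
table(trust_inequality$trust_d)
```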

Task 5.1.2: A scatter plot of trust_d by inequality_s80s20

We can check how the newly created variable is distributed by the values of our “inequality” measure by creating a scatter plot of their co-variation/co-relation. We have done scatter plots many times now. The result - simplified a bit - would look like this:

As expected, the trust_d variable only takes values of 0 and 1. This variable will be our new dependent variable that we want to model as a function of “inequality” and “Region”, as we have done in Week 4 Worksheet - Exercise 3 with the trust_pct variable.

Task 5.1.3: Linear regression with dichotomous dependent variable

As a quick comparison with Week 4 Worksheet - Exercise 3, we can try to fit a linear regression using the new trust_d variable as dependent variable and inequality_s80s20 and Region as the independent/predictor variables. Set it up in JASP, and the resulting output tables should look like this:

Questions

  • Using the Answers to the Week 4 exercise sheet (you can find the sheet updated with answers on the website!), interpret the coefficients in the Coefficients table. Remember that the interpretation is exactly the same as in that previous exercise, because we have used a linear regression model. However, the scale of the dependent variable has changed, and that will affect the meaning of a one-unit difference in the dependent variable. (For reference, a standalone-R version of this model is sketched below.)
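The same linear model can be written as standalone R code too - again only a sketch, assuming the trust_inequality data frame and the trust_d variable created earlier; the trust_d01 helper name is our own:

```r
# A "linear probability model": treat the two categories as the
# numbers 0 and 1 and fit an ordinary linear regression
trust_inequality$trust_d01 <- as.numeric(trust_inequality$trust_d) - 1  # 0 = Low, 1 = High

lm_fit <- lm(trust_d01 ~ inequality_s80s20 + Region,
             data = trust_inequality)
summary(lm_fit)
```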

Task 5.1.4: Binary logistic regression in JASP

The main shortcoming of the linear model we fit in the previous task is, of course, that because our dependent variable has only two values, the linear model does not make very accurate predictions of the expected values of trust_d, which it assumes to be a continuous numeric variable. It makes predictions of values in between 0 and 1 and, most unfortunately, of values below 0 and above 1, which cannot empirically exist. It is also likely to violate one of the core assumptions of linear regression: that the “errors” (i.e. the differences between the observed values of the dependent variable and those predicted by the model - the distances between the dots and the regression line on a scatter plot, in the case of a single predictor) are “normally distributed”.

To account for these issues, we should choose a modelling method that fits an s-shaped curve instead of a straight line, forcing the model to restrict its estimates to the range between 0 and 1. The logistic regression model is the tool for this, and fitting it in JASP is straightforward, even though interpreting the coefficients is not.
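For those curious about what JASP computes behind the scenes, the equivalent model in standalone R code is a generalised linear model fit with glm() using a binomial family. Again, this is only a sketch, assuming the same trust_inequality data frame as before:

```r
# Binary logistic regression: the binomial family with a logit link
# is what turns linear regression into logistic regression
glm_fit <- glm(trust_d ~ inequality_s80s20 + Region,
               family = binomial(link = "logit"),
               data   = trust_inequality)

summary(glm_fit)     # coefficients on the log-odds (logit) scale
exp(coef(glm_fit))   # the same coefficients expressed as odds ratios
```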

To run a binary logistic regression, navigate through the following menu options:

\[ \text{Regression} \longrightarrow \text{[Classical] Logistic regression} \]

Inside the Logistic regression settings, we select our dependent variable (which this time must have only two distinct values/categories!), our continuous numeric Covariate(s) and any categorically coded Factor(s). Once we do this, we get the results in the outputs window on the right. Before we look at the results, let’s also tick the box for “Odds ratios” under Statistics > Coefficients in the Logistic regression settings:

The resulting output should be:

As with multiple linear regression, our main interest lies in understanding the Coefficients table. Here, the coefficients for all the Regions compared to “Europe and Central Asia” (see the worked solution to Week 4 Worksheet - Exercise 3 for how to interpret coefficients on a categorical predictor) are rather extreme; they also seem rather unreliable, judging by the p column (which we will look at more closely and interpret in Week 6, so we’ll leave it for now). So let’s focus the interpretation on the coefficient for inequality_s80s20.

Questions

  • What is the effect of a one-unit positive change on the inequality_s80s20 scale on the expected value of trust_d?
  • What is the effect of a one-unit positive change on the inequality_s80s20 scale on the odds of being a “high trust” country?
Solutions

As we know from the lecture and the statistics readings for this week, interpreting these numerical outputs from logistic regression models is very difficult, because the scale on which the model estimates the outcome has been transformed in the process of modelling. The coefficients in the Estimate column can be interpreted along the same logic as in linear regression, but the expected difference/change in the dependent/outcome variable is measured on the logit (or log-odds) scale: the natural logarithm of the odds of being in the outcome category coded as 1 (i.e. “high trust”). So a one-unit difference in inequality_s80s20 is associated with a 4.530-unit change in these log-odds values. That’s not very intuitive, and to make interpretation easier, researchers often transform the log-odds scale to an Odds Ratio, which can be read as a multiplicative (percentage) change in the odds of being in the outcome category coded as 1 as opposed to the category coded as 0.

The Odds Ratio may sound more intuitive, but it is equally difficult to grasp. A better solution is to transform the scale further, to plain “probabilities”, which we are intuitively more familiar with because their values sit at equal distances and run between 0 and 1 (just like a proportion, and similar to a percentage scale).
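The three scales mentioned so far - log-odds, odds and probabilities - are connected by simple transformations. Writing \(b\) for a single coefficient on the log-odds scale and \(\eta\) for the full linear predictor (the intercept plus each coefficient multiplied by a chosen value of its predictor), the standard identities are:

\[ \text{OR} = e^{b}, \qquad p = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}} \]

Note that while an odds ratio can be computed coefficient by coefficient, a probability is only defined for a complete combination of predictor values - which is why probabilities from logistic models are usually presented as plots or for selected example cases.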

In JASP we can request visual outputs for the effects of the independent variables on the probability of the outcome by ticking the Conditional estimates plots option under Plots > Inferential plots:

The resulting plots are visually more telling than the numbers in the regression output tables:

We find, for instance, that at around the value of 7 on inequality_s80s20, the predicted probability of being in the “High trust” category drops below 0.25 (25%).
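Such conditional probabilities can also be computed numerically rather than read off the plot. A sketch in R, reusing the hypothetical glm_fit object from the earlier sketch and holding Region at its reference category (“Europe and Central Asia”); the exact value will differ for other regions:

```r
# Predicted probability of being "High trust" for a country with
# inequality_s80s20 = 7 in the reference region
newdata <- data.frame(inequality_s80s20 = 7,
                      Region = "Europe and Central Asia")
predict(glm_fit, newdata = newdata, type = "response")
```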

Exercise 5.2 (take-home): Which of the assignment research questions could be addressed using a logistic regression model?

Let’s look again at the assignment research questions. Some of these questions imply a dependent variable measured on a numeric scale, or at least a longer (e.g. 7-point or more) ordinal scale, in one of the surveys we will use for the assignment (ESS10, WVS7, EVS2017). Other questions imply dependent variables that are more strictly categorical and, as such, cannot be modelled using linear regression. For those, we may be able to apply a logistic regression model.

It may be the case that the categorical variable of choice has more than two categories. While there are specific modelling methods for those types of variables, we are not covering them on this module. Instead, in Week 6 we will practice some further data transformation techniques to easily dichotomise such multi-categorical/multinomial variables.

In this exercise, explore the assignment dataset using the Variable search function on the website to identify any available variables for answering one/some of the questions below, and check how the implied dependent variable was measured:

  1. Are religious people more satisfied with life?
  2. Are older people more likely to see the death penalty as justifiable?
  3. What factors are associated with opinions about future European Union enlargement among Europeans?
  4. Is higher internet use associated with stronger anti-immigrant sentiments?
  5. How does victimisation relate to trust in the police?
  6. What factors are associated with belief in life after death?
  7. Are government/public sector employees more inclined to perceive higher levels of corruption than those working in the private sector?

References

Connelly R, Gayle V and Lambert PS (2016) Modelling key variables in social science research: Introduction to the special section. Methodological Innovations 9: 2059799116637782.