Week 4 Worksheet

Learning outcomes

By the end of the session, you should be familiar with:

running simple and multiple linear regression in JASP
performing a correlation analysis in JASP
model building in JASP
the interpretation of linear regression coefficients

Intro

We continue where we left off last week, building on Week 3 Worksheet - Exercise 2, in which we made a scatter plot of inequality by social trust using the Trust & Inequality (trust_inequality.dta) dataset, which can be downloaded from https://cgmoreh.github.io/SOC2069-QUANT/Data/.

In that exercise we simplified the default output by removing the univariate distributions of the variables displayed on the margins and the regression line cutting through the plot. Now, however, we will focus on understanding what that “regression line” is actually telling us.

Solutions

A JASP file with the solutions to the exercises can be downloaded from HERE.

Exercise 1: From a regression line to regression coefficients

If you did not download it last week, download the Trust & Inequality (trust_inequality.dta) dataset from https://cgmoreh.github.io/SOC2069-QUANT/Data/

Task 1.1: Visualise the relationship

As a first step, create a scatter plot visualising the “relationship” (co-variation, joint distribution, …) between social trust (trust_pct) and inequality (inequality_s80s20). This is Exercise 2 from Week 3 - if you need a reminder of how to do it, check Week 3 Worksheet - Exercise 2 or your saved .jasp file containing your workshop analysis from Week 3.

Task 1.2: Model the relationship

Now let’s dig deeper into the meaning of the regression line by building a simple bivariate linear regression model of social trust as a function of societal inequality (i.e. a model aiming to explain/predict values of social trust in various countries depending on the value of societal inequality in those countries).

To build a linear regression model in JASP, click through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box and the “inequality” variable to the \(\text{Covariates}\) box.

The results from the linear regression model will appear in the outputs window on the right.

Solution

The output should look something like this:

Task 1.3: Interpret the regression model output

In general terms, the coefficient of interest (the one associated with the independent variable) tells us that a one-unit difference/change on the independent variable scale is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient.

But what does this mean substantively in the context of our two variables?

Questions

Using the lecture slides and Chapter 7 (“Linear regression with a single predictor”) from the Introduction to Modern Statistics (IMS), interpret the meaning of the regression coefficient on “inequality”.
Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretation there. [Tip: You’ve already practiced adding notes to the outputs in Week 2, Exercise 3, Point 7]
Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))?

Solution

The tables in the outputs list results from two “models”. One is a model without any explanatory (independent/predictor) variables (M₀), in which case the (Intercept) value is nothing other than the average (mean) value of the outcome (dependent/explained) variable calculated on the sample of cases/observations (i.e. countries here) that were included in the model (i.e. only those that had a valid measurement on both of the variables, so a non-missing value for both the “trust” and the “inequality” variable). This “null model” is given to us by JASP by default, without us asking for it.

The model we actually fit is presented as M₁, and this is what we are interested in. As we see in the Coefficients table, the coefficient associated with the (Intercept) value in M₁ is no longer the mean/average level of trust_pct, but instead 45.384. This value represents the expected score on the dependent variable (trust_pct) when the value of the independent variable (inequality_s80s20) is 0. As will become apparent from the discussion below, a value of 0 for the inequality_s80s20 variable is highly unrealistic, and this is often the case. However, that’s not a problem, because the (Intercept) value is meant to provide a mathematical baseline to the model and we are generally not interested in interpreting it, so we won’t. What we are interested in is the coefficient on our independent (explanatory/predictor) variable.

What is the meaning of the regression coefficient on “inequality”?

In general terms, the coefficient of interest (the one associated with the independent variable) tells us that a one-unit difference/change on the independent variable scale is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient.

But what does this mean substantively in the context of our two variables?

From the Coefficients table we see that the regression coefficient associated with our predictor/independent variable is -3.114. This tells us that a one-unit increase on the “inequality” scale is associated with a -3.114-unit change on the “trust” scale. To give this a more meaningful interpretation, we need to know what exactly a unit change means on each scale, and for that we need to know what our variables measure.

Understanding the “trust” measurement (the trust_pct variable) is more straightforward. We know that it measures the percentage of respondents in each country who answered that “in general, most people can be trusted” to the standard social trust survey question in the World Values Survey. A unit on that scale is therefore one percentage-point.

Understanding the “inequality” scale is much trickier, because that scale measures a ratio. Specifically, it denotes the so-called income quintile share ratio, a commonly employed measurement of inequality in the income distribution in a country. It is calculated as the ratio of total income received by the 20% of the population with the highest income (the top quintile) to that received by the 20% of the population with the lowest income (the bottom quintile) [By the way, a “quintile” (from the Latin word “quintus” for “fifth”) is simply a value that divides a scale into five equal parts - e.g. a scale of all the integers between 0 and 100 divided into five equal parts will contain 20 integers in each part, i.e. 100÷5]. In our dataset, the variable measuring it - inequality_s80s20 - was calculated by taking the ratio of (i.e. dividing) the variables income_top20 and income_bottom20. For example, looking at the first entry, for Albania, we see that income_top20 = 39.6375 and income_bottom20 = 7.9125, so inequality_s80s20 will be 39.6375÷7.9125 = 5.009478673 (or 5.01 rounded to two decimal places).
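The calculation for Albania can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name `s80s20` is an assumption made for this example (it is not part of JASP or of the dataset), and the two input values are the ones quoted above for Albania.

```python
def s80s20(income_top20: float, income_bottom20: float) -> float:
    """Income quintile share ratio: top-20% income share divided by bottom-20% income share."""
    return income_top20 / income_bottom20


# Values quoted in the worksheet for Albania
albania_ratio = s80s20(39.6375, 7.9125)

# Print the ratio rounded to two decimal places
print(round(albania_ratio, 2))
```

Using the full, unrounded income shares matters here: dividing the rounded shares instead would give a slightly different ratio.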

That’s fairly easy to get our heads around, but because its unit of measurement is a ratio it also makes it more difficult to compare one value to another. The best way of approaching it is through some logical examples. Say, for instance, that we want to posit the existence of a perfectly equal society (in terms of income distribution, at least). What would that entail? It would entail that the top-20% earners and the bottom-20% earners own the same share of the total available income. For example, if the top-20% possess 20% of the income distribution and the bottom-20% also possess 20% of the income distribution (with the remaining 60% of income possessed by the middle-60% of earners), then the inequality ratio would be 20÷20=1. The ratio is the same if the “middle-class” is very thin and both the top- and the bottom-20% of earners possess, say, 42% of the income distribution (84% in total): 42÷42=1. To generalise, when two values are equal, their ratio is 1. But what happens if we hypothesise a very unequal society, say one in which the top-20% of the earners share 90% of all income distributed within an economy, while the bottom-20% share only 0.5% of the income distribution? In such a scenario, the inequality ratio would be 90÷0.5=180; and, in theory, the value of this ratio can be limitlessly high (e.g. 99.999÷0.00005=1,999,980; yes, that is indeed almost 2 million!). The larger the gap, the more extreme the ratio value, and the change between values is not “linear”.
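To see why changes on a ratio scale are not “linear”, the sketch below holds the top-20% share fixed and lowers the bottom-20% share in equal steps. All the shares in the list are hypothetical values chosen for illustration (they are not taken from the dataset), and `s80s20` is again just an illustrative helper name.

```python
def s80s20(top_share: float, bottom_share: float) -> float:
    """Ratio of the top-20% income share to the bottom-20% income share."""
    return top_share / bottom_share


# Scenarios discussed in the worksheet text
equal_society = s80s20(20, 20)        # perfectly equal society
very_unequal = s80s20(90, 0.5)        # very unequal society

# Hypothetical illustration: top share fixed at 40%, bottom share falling by 2 points each time
top_share = 40.0
bottom_shares = [10.0, 8.0, 6.0, 4.0, 2.0]
ratios = [s80s20(top_share, b) for b in bottom_shares]

# Change in the ratio between consecutive scenarios
steps = [later - earlier for earlier, later in zip(ratios, ratios[1:])]

for b, r in zip(bottom_shares, ratios):
    print(f"bottom share {b}% -> ratio {r:.2f}")
print("successive increases in the ratio:", [round(s, 2) for s in steps])
```

Each equal 2-point drop in the bottom share produces a larger jump in the ratio than the previous one, which is what “not linear” means in this context.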

However, the above scenarios are empirically unrealistic. In fact, if we look at our actual data, the lowest share of the income scale possessed by the top-20% in any country is 34.56 (Slovakia), while the highest share is 57.3 (Brazil). For income possessed by the bottom-20%, the lowest percentage is 3.45 (Brazil) and the highest value is 10.08 (Ukraine). In terms of inequality_s80s20, we then see the lowest value at 3.52 (Ukraine) and the highest at 16.64 (Brazil). On this more restricted empirical scale (running from 3.52 to 16.64), the difference between the values is more equal, making the changes from one value to another approximately linear.

So, then, what does a one-unit change on this “empirical” scale mean? It means, for example, the difference between an inequality score of 30÷10=3 and one of 24÷6=4; or, approximately, between 40÷7.9=5.06 (Lebanon) and 42.83÷7.08=6.05 (Nigeria). It is, in other words, a rather large unit; the entire scale (running from 3.52 to 16.64) contains just over 13 such units (16.64 - 3.52 = 13.12). By contrast, a percentage scale contains 100 units of 1%-point each. Due to such differences in the scales that are being compared, it is very common to “standardise” the variables used in a regression so that each unit involved is “equal”. On the other hand, the difficulty is then shifted onto interpreting those standardised coefficients on the original meaningful scale of the variables involved. In other words, it’s a challenge either way. In this exercise, we have left the two variables involved in the regression unstandardised, so a “unit” has a different meaning in the case of each variable.

Now, having thought deeply about what the variables involved actually measure, we can interpret the regression coefficient, which is telling us that our simple model predicts that a country with an inequality score 1 point higher than that of another country will have, on average, a trust score 3.114 percentage points lower than that other country. If we look, for example, at the trust scores of Lebanon and Nigeria, we can get a sense of how well our model does at predicting this specific case: 12.68% - 9.92% = 2.76 percentage points. It’s not a perfect prediction, but it’s a much better one than we could have made had we not had any information on “inequality”. Ultimately, our model helps us make a better informed assessment of the variability in trust across countries, even if it will either overestimate or underestimate any given actual value.
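The comparison can be written out as a small Python sketch. The intercept and slope are the M₁ values reported in the Coefficients table; the function name `predict_trust` and the constant names are assumptions introduced only for this illustration. The sketch compares the predicted gap in trust between two inequality scores with the observed gap quoted above.

```python
INTERCEPT = 45.384  # (Intercept) of M1, as reported in the Coefficients table
SLOPE = -3.114      # coefficient on inequality_s80s20


def predict_trust(inequality: float) -> float:
    """Predicted trust_pct for a given inequality_s80s20 value under the simple model."""
    return INTERCEPT + SLOPE * inequality


# Inequality scores quoted in the worksheet for Lebanon (5.06) and Nigeria (6.05)
predicted_gap = predict_trust(6.05) - predict_trust(5.06)

# Observed difference between the two countries' trust scores, as quoted in the worksheet
observed_gap = 12.68 - 9.92

print("predicted gap (percentage points):", round(predicted_gap, 2))
print("observed gap (percentage points):", round(observed_gap, 2))
```

Because the two inequality scores are 0.99 units apart rather than exactly 1, the predicted gap is slightly smaller in size than the coefficient itself.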

Where can you find the coefficient of correlation (\(R\)) in the outputs?

The idea behind the coefficient of correlation (\(R\)) relates somewhat to the discussion above regarding “standardisation”. In fact, the correlation coefficient is calculated by standardising the scales of the two variables involved (trust_pct and inequality_s80s20) and expressing the coefficient on that standardised scale. In our JASP output, we can find this value under the Standardized column in the Coefficients table; its absolute (i.e. non-negative) value is shown under the R column in the Model Summary table.

But what exactly does “standardisation” mean?

One of the most commonly employed “standard scores” is the z-score, which tells us how many “standard deviations” (SD) each score/value is from the mean/average score/value on a given measurement scale. The \(z\)-score for a value \(x\) is calculated as \(z_x={(x-\mu_X)\over\sigma_X}\) , where \(x\) is any single value on a scale \(X\), \(\mu_X\) is the mean/average of \(X\) and \(\sigma_X\) is the standard deviation of \(X\). In JASP, we can use the zScores() drag-and-drop function to calculate this. If we were to create two new variables that are the z-standardised versions of the trust_pct and inequality_s80s20 variables, respectively, and we ran the same regression as above on those variables, then the obtained regression coefficient on “standardised inequality” would coincide with the correlation coefficient, -0.430.¹ On these newly standardised scales, a value of 1 means “one standard deviation above the mean”, a value of 2 means “two standard deviations above the mean”, a value of -1 means “one standard deviation below the mean”, and so on. This means that the two standardised variables are measured on a similar scale and their values are therefore directly comparable, which was not the case for the original unstandardised values.
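The link between z-scores, the regression slope and the correlation coefficient can be checked outside JASP with a short Python sketch. The two lists below are small hypothetical datasets made up for illustration (they are not the worksheet data), the helper names are assumptions, and the sample standard deviation is used for the z-scores, which is also an assumption.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardise values: subtract the mean and divide by the (sample) standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical values, for illustration only
inequality = [3.5, 4.8, 5.1, 6.0, 9.2, 16.6]
trust = [41.0, 22.0, 35.0, 12.0, 18.0, 7.0]

raw_slope = ols_slope(inequality, trust)                      # slope on the original scales
std_slope = ols_slope(z_scores(inequality), z_scores(trust))  # slope on the standardised scales
r = pearson_r(inequality, trust)

print("unstandardised slope:", round(raw_slope, 3))
print("standardised slope:  ", round(std_slope, 3))
print("correlation (r):     ", round(r, 3))
```

The standardised slope and the correlation coefficient print the same value, while the unstandardised slope depends on the units of the two variables.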

A coefficient of -0.43 associated with “standardised inequality” would therefore be interpreted as: a one-standard-deviation positive change in “standardised inequality” is associated with a negative change of just under half (around 43%) of a standard deviation in “standardised trust”. In other words, it takes a difference of over two standard deviations (2.33) in inequality to observe a one-standard-deviation reduction in the level of trust.
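The arithmetic behind these figures, using only the standardised coefficient reported above, can be written out as follows. The second expression relies on the general fact that, in a regression with a single predictor, the coefficient of determination is the square of the correlation coefficient; the resulting value is a rounded calculation to compare against the \(R^2\) column in the Model Summary table, not a figure copied from the output.

\[ \frac{1}{0.43} \approx 2.33 \qquad \text{and} \qquad R^2 = (-0.430)^2 \approx 0.185 \]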

However, the purpose of the “correlation coefficient” is to provide a more general description of the association between two numeric variables. Its value ranges between -1 and 1, where 1 represents a perfect positive correlation between the two variables and -1 a perfect negative correlation, while 0 shows no correlation between the two at all. Visualising some of these extreme and intermediate correlations on scatter plots would look something like this:

Task 1.4: Find the correlation coefficient using a “correlation” test instead

To run a simple bivariate correlation analysis in JASP, go through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Correlation} \]

Move both of the variables of interest to the \(\text{Variables}\) box.

Check if the results are the same as those obtained using linear regression.

Solution

The correlation output should look something like this:

This output was simplified by ticking the “Display pairwise” box in the Correlation options and un-ticking the “Report significance” option:

Exercise 2: Linear regression with categorical predictors

Now we will build another simple bivariate regression model, but this time we will use the variable Region to model/explain/predict levels of “social trust” in different countries. Region is the only Nominal categorical variable in this dataset, and categorical variables behave differently in regression models.

Task 2.1: Describe the `Region` variable using a Frequency table

Tip: You have done this a few times in previous workshops. Check back on previous exercises if you need to remind yourself of how to create a frequency table.

Solution

The frequency table is:

Task 2.2: Build a simple bivariate regression model

The steps for fitting the regression are very similar to what we did in the previous exercise:

Click through the Menu tabs:

\[ \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box
BUT THIS TIME, we will move the Region variable to the \(\text{Factors}\) box instead.

This will tell JASP that the Region variable is categorical and that it should model it as such, treating each of its constituent categories as an individual factor/indicator variable and automatically leaving out the first category (Task 2.1 above will tell you which one that is!) from the model, so that the left-out category becomes the baseline/reference to which the coefficients on all the other categories are compared. What happens here is that the left-out category is absorbed into the “Intercept”, which then represents the average value of the dependent variable in that reference category.

The results from the linear regression model will appear in the outputs window on the right.

Solution

The modelling options chosen are:

And the outputs from the model are:

Task 2.3: Interpret the regression model

In the case of numeric Scale-type predictor/independent variables the interpretation of the coefficient (“unstandardized”) was that a one-unit difference/change on the independent variable scale is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient. When the predictor/independent variable is categorical, the interpretation changes somewhat. The coefficient associated with the listed category/level of the independent variable compares that category with the reference/baseline category. In other words, the unit of difference in this case is the difference between the stated and the reference category: being in the listed category as opposed to being in the reference category is associated with a difference/change in the dependent variable of the size shown by the value of the coefficient.

But what does this mean substantively in the context of our two variables?

Questions

Using the lecture slides and the assigned readings from Introduction to Modern Statistics (IMS), interpret the meaning of the regression coefficients on each reported level of the Region variable;
Which one is the “reference”/“baseline” category?
Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretation there. [Tip: You’ve already practiced adding notes to the outputs in Week 2, Exercise 3, Point 7]
Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))? Are they meaningful in this context? Why so, or why not?

Solution

In order to interpret the regression results, we need to understand how our variables are measured and coded. We already know everything we have to about the trust_pct variable from Exercise 1. To understand the Region variable, we should inspect the frequency table produced in Task 2.1 above. We see there that the variable has 7 non-missing (valid) categories, and if we look at the regression output, we see that only 6 are listed in the Coefficients table: the first category (“Europe & Central Asia”) is missing. Well, it’s not actually missing, but it was “absorbed” into the (Intercept). What this means is that the meaning of the (Intercept) in this case is: the average value of trust_pct when the value of Region is “Europe & Central Asia”; or, put differently, the (Intercept) denotes the average trust_pct in the European and Central Asian countries present in the dataset.

What happens when we enter a categorical variable as a predictor (independent variable) into a linear regression model in JASP (and most other statistical computing packages) is that the regression function automatically breaks the categorical variable into several “dummy” or “indicator” variables, one for each constituent category, that indicate membership (or not) in that category, and the first category is left out of the model so that all other categories can compare to this left-out “reference” or “base-line” category. Put differently, we have not one but six different predictors/independent variables in this model, each taking the value of 1 if the observation falls into that category and the value of 0 if it does not. For example, the observation “Australia” has a value of 0 on the “Region (Latin America & Caribbean)” variable, but a value of 1 on the “Region (East Asia & Pacific)” variable, and 0 on all other listed variables. The United Kingdom has a value of 0 on all six variables listed in the Coefficients table, because it has a value of 1 on the “Region (Europe & Central Asia)” variable, which was left out from among the predictors and to which all other variables compare.
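The dummy/indicator coding described above can be sketched in Python. This is a hypothetical illustration, not JASP’s internal code: the function name `dummy_code` is an assumption, and only the regions named in this worksheet’s text are listed (the remaining categories of Region are omitted here rather than guessed).

```python
# Reference category: left out of the predictors and absorbed into the (Intercept)
REFERENCE = "Europe & Central Asia"

# Only the non-reference regions named in the worksheet text; the other categories are omitted
LISTED_LEVELS = [
    "East Asia & Pacific",
    "Latin America & Caribbean",
    "North America",
]


def dummy_code(region: str) -> dict[str, int]:
    """Return one 0/1 indicator per listed (non-reference) category."""
    return {level: int(region == level) for level in LISTED_LEVELS}


australia = dummy_code("East Asia & Pacific")
united_kingdom = dummy_code(REFERENCE)

print("Australia:     ", australia)
print("United Kingdom:", united_kingdom)
```

A country in the reference category scores 0 on every indicator, which is why its expected trust level is given by the (Intercept) alone.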

The regression coefficients are therefore interpreted as comparisons to “Europe & Central Asia”. For example, the coefficient of -20.11 on “Latin America & Caribbean” means that the expected value of trust_pct of a country in Latin America or the Caribbean is 20.11 percentage points lower, on average, than the expected trust_pct value of a country in Europe or Central Asia. The only regions where we expect to see higher average social trust than in “Europe & Central Asia” are “East Asia & Pacific” (2.1 percentage points higher) and “North America” (13.6 percentage points higher). In other regions, the expected average level of social trust is lower than in Europe & Central Asia by the displayed coefficient.
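Written as an equation, the comparison looks like this. The symbols are introduced here only for illustration: \(b_0\) stands for the (Intercept), \(D_k\) for the 0/1 indicator of region \(k\), and \(b_k\) for its coefficient.

\[ \widehat{\text{trust\_pct}} = b_0 + \sum_{k} b_k D_k \]

For a country in “Europe & Central Asia” every \(D_k\) is 0, so the prediction is simply \(b_0\). For a country in “Latin America & Caribbean” only that region’s indicator is 1, so the prediction is:

\[ \widehat{\text{trust\_pct}} = b_0 + (-20.11) \times 1 = b_0 - 20.11 \]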

Where can you find the coefficient of correlation (\(R\))? What about the coefficient of determination (\(R^2\))? Are they meaningful in this context?

As we can see, the correlation coefficient no longer appears in the “Standardized” coefficients column in the Coefficients table. The reason is explained in the table. The correlation coefficient also loses its meaning in the context of a multiple regression model in which we have more than one predictor/independent/explanatory variable. Its meaning becomes “partial” to all the other variables in the model. In this case, its squared value, R², is more meaningful: it captures, roughly, the amount (percentage) of the variance in the dependent variable that is explained by the whole model, with all the predictors included. Here, we could say that Region is able to explain 27.1% of the variation in “social trust” across countries (R²=0.271; to express that as a percentage, we multiply it by 100; i.e. we move the decimal point two places to the right).
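What “variance explained” means can be illustrated with a minimal Python sketch. The data are hypothetical (four made-up countries in two made-up regions, “A” and “B”), and the helper name `r_squared` is an assumption. With a single categorical predictor, the model’s prediction for each country is the average of its region, so R² compares the spread around the regional averages with the spread around the overall average.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Share of the variation in `observed` accounted for by `predicted`."""
    grand_mean = mean(observed)
    ss_total = sum((y - grand_mean) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical data, for illustration only
trust = [30.0, 40.0, 10.0, 20.0]
region = ["A", "A", "B", "B"]

# With only a categorical predictor, each prediction is the mean of the country's region
region_means = {
    name: mean(t for t, r in zip(trust, region) if r == name)
    for name in set(region)
}
predicted = [region_means[r] for r in region]

r2 = r_squared(trust, predicted)
print(f"R² = {r2:.3f}, i.e. {r2 * 100:.1f}% of the variation explained")
```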

Exercise 3: Build a multiple regression model

We can now combine the separate bivariate analyses in the previous two exercises into a more elaborate multiple regression model. The procedure to build a multiple regression model is the same as in the simple regression models before, but this time we add both of the independent variables into the model:

\[ \text{MENU TABS: } \text{Regression} \longrightarrow \text{[Classical] Linear regression} \]

In the Linear regression panel, move the “social trust” variable to the \(\text{Dependent Variable}\) box
Move the “inequality” variable to the \(\text{Covariates}\) box
Move the Region variable to the \(\text{Factors}\) box

The results will appear in the outputs window on the right. We now have a statistical model which explains variation in “social trust” not only dependent on “inequality”, but also on “Region”. Put differently - if our main aim is to estimate how “inequality” is associated with “social trust” - we have obtained a more accurate estimate of the association between “inequality” and “social trust”, while also accounting for variation due to differences in the Region to which countries belong.

Another way in which this is often expressed is that the stated coefficients are those obtained after we keep constant or eliminate the effect of the other variables in the model. This procedure is expected to give us more accurate estimates because by including further variables into the model, we have removed them from the pool of the “unknown” factors affecting/related to our outcome measurement of interest.

Questions

Using the lecture slides and the assigned readings from Introduction to Modern Statistics (IMS), interpret the meaning of each regression coefficient, comparing them with the ones obtained from the simpler models in the previous exercises;
Add a note on the JASP output under the \(\text{Coefficients}\) output and write down your interpretations there.
Where can you find the coefficient of correlation (\(R\)) in the outputs? What about the coefficient of determination (\(R^2\))? Are they meaningful in this context? Why so, or why not?

Solution

The regression model options chosen are:

And the resulting output is:

The interpretation of the coefficients here is a combination of the interpretations given in Exercise 1 and Exercise 2. The only difference is that each coefficient tells us the expected difference in the value of trust_pct associated with the given predictor while the values of all the other predictors in the model are kept constant (i.e. their effect is nullified, taken out of the equation). This means that the coefficient of -3.63 for inequality_s80s20 is a more precise estimate of the impact of “inequality” on “social trust” than the coefficient of -3.11 obtained in the simple regression model in Exercise 1, because this coefficient now also accounts for the expected effect of a country belonging to a given world Region. Vice versa, compared to the coefficients we saw in Exercise 2, the coefficients for Region seen here are a more precise estimation of the expected effect of belonging to a Region because that effect now also accounts for variation in inequality_s80s20 between the various regions. In fact, we do see some markedly different Region coefficients than in the simple model, now that we have “controlled” for the effect of “inequality” in the model as well.
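The idea of “keeping the other predictors constant” can be made concrete with the model equation. The notation is introduced here only for illustration: \(b_0\) is the (Intercept), \(D_k\) is the 0/1 indicator for region \(k\) and \(b_k\) its coefficient.

\[ \widehat{\text{trust\_pct}} = b_0 + (-3.63) \times \text{inequality\_s80s20} + \sum_{k} b_k D_k \]

If we compare two countries in the same region, all the \(D_k\) terms (and \(b_0\)) are identical and cancel out. If their inequality scores are \(x\) and \(x + 1\), the difference between their predicted trust scores is therefore:

\[ -3.63 \times (x + 1) - (-3.63 \times x) = -3.63 \]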

Looking at the R² value we also find that this has increased to 0.346, indicating that the combined knowledge of the Region as well as the inequality_s80s20 score of a country can help explain 34.6% of the variation in trust_pct. The remaining 65%+ of the variation is associated with other factors that our model is not considering (maybe “crime rates”, or “recent history of civil war”, or “ethno-racial diversity/fragmentation”? If we have variables measuring these phenomena, then we could consider including them in our model).

Exercise 4 (take-home): Which of the assignment research questions could be addressed using a linear regression model?

Let’s look again at the assignment research questions. Some of these questions imply a dependent variable which is measured as a numeric scale or at least a long-ish (e.g. 7-point +) ordinal scale in one of the surveys we will use for the assignment (ESS10, WVS7, EVS2017). Other questions imply dependent variables that are more strictly categorical, and as such, we cannot model them using linear regression. For those, we may be able to apply another model type that better fits that kind of outcome variable (e.g. logistic regression), one of which we will be covering in Week 5.

In this exercise, explore the survey questionnaires (like we did in Week 2 and 3) to identify any available variables for answering one/some of the questions below, and check how the implied dependent variable was measured:

Are religious people more satisfied with life?
Are older people more likely to see the death penalty as justifiable?
What factors are associated with opinions about future European Union enlargement among Europeans?
Is higher internet use associated with stronger anti-immigrant sentiments?
How does victimisation relate to trust in the police?
What factors are associated with belief in life after death?
Are government/public sector employees more inclined to perceive higher levels of corruption than those working in the private sector?

The best way to explore the available Assignment datasets and questionnaires is via the Data page of this site.

Footnotes

¹ I demonstrate what “standardization” using z-scores means in the JASP file containing the solution analysis, available here.
