Week 8 Application Lab

Designs: Simulation-based power analysis for study design

Aims

By the end of the session you will learn how to perform:

Simulation of random variables
Power analysis via simulation

Setup

In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder.

Create a new Quarto markdown file (.qmd) for this session (e.g. “Lab_8.qmd”) and work in it to complete the exercises and report on your final analysis.

Packages

## Install first if not yet installed

library("tidyverse")
library("sjlabelled")
library("easystats")
library("ggformula")
library("ggeffects")
library("marginaleffects")
library("modelsummary")
library("gtsummary")
library("modelr")
library("MASS")

Statistical power

Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (typically a p-value below 0.05), given some assumed true effect size
If a certain effect of interest exists (e.g. a difference between two groups) power is the chance that we actually find the effect in a given study
A power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold
The conventional view is that you should avoid low-power studies because they are unlikely to succeed
There are several problems with this view, but it’s often required by research funding bodies

Example 1: Simulating a simple randomized experiment

This example comes from Gelman, Hill, and Vehtari (2020) [ROS, Ch. 16]. They demonstrate data simulation for power analysis with an artificial example of a randomized experiment on 100 students designed to test an intervention for improving final exam scores.

Data

set.seed(862)

sim_1 <- function(n = 100) {
  y_if_control <- rnorm(n, mean = 60, sd = 20)
  y_if_treated <- y_if_control + 5
  tibble(
    z = rep(0:1, n / 2) |> sample(),
    y = if_else(z == 1, y_if_treated, y_if_control)
  )
}

data_1a1 <- sim_1()

Having simulated the data, we can now compare treated to control outcomes and compute the standard error for the difference.

data_1a1 |> 
  summarize(
    diff = mean(y[z == 1]) - mean(y[z == 0]),
    diff_se = 
      sqrt(sd(y[z == 0])^2 / sum(z == 0) + sd(y[z == 1])^2 / sum(z == 1))
  )

# A tibble: 1 × 2
   diff diff_se
  <dbl>   <dbl>
1  4.66    3.80

Equivalently, we can run the regression:

set.seed(619)

fit_1a1 <- lm(y ~ z, data = data_1a1, refresh = 0)

fit_1a1


Call:
lm(formula = y ~ z, data = data_1a1, refresh = 0)

Coefficients:
(Intercept)            z  
     59.008        4.656

To give a sense of why it would be a mistake to focus on the point estimate, we repeat the above steps for a new batch of 100 students simulated from the model. Here is the result:

set.seed(827)

data_1a2 <- sim_1()

set.seed(619)

fit_1a2 <- lm(y ~ z, data = data_1a2, refresh = 0)

fit_1a2


Call:
lm(formula = y ~ z, data = data_1a2, refresh = 0)

Coefficients:
(Intercept)            z  
      56.19        12.19

A naive read of this table would be that the design with 100 students is just fine, as the estimate is well over two standard errors away from zero. But that conclusion would be a mistake, as the coefficient estimate here is too noisy to be useful.

Including a pre-treatment predictor

Add a pre-test variable simulated independently of the potential outcomes for the final test score.

set.seed(134)

data_1b <- 
  data_1a1 |> 
  mutate(x = rnorm(n(), mean = 50, sd = 20))

We can then adjust for pre-test in our regression:

set.seed(619)

fit_1b <- lm(y ~ z + x, data = data_1b, refresh = 0)

fit_1b


Call:
lm(formula = y ~ z + x, data = data_1b, refresh = 0)

Coefficients:
(Intercept)            z            x  
   62.18784      4.44862     -0.06444

Because the pre-test variable was simulated independently of the potential outcomes for the final test score, the standard error for the coefficient of z wasn’t reduced.

To perform a realistic simulation, we will now simulate both test scores in a correlated way.

set.seed(822)

n <- 100

true_ability <- rnorm(n, mean = 50, sd = 16)
y_if_control <- true_ability + rnorm(n, mean = 0, sd = 12) + 10
y_if_treated <- y_if_control + 5

data_2 <- 
  tibble(
    x = true_ability + rnorm(n, mean = 0, sd = 12),
    z = rep(0:1, n / 2) |> sample(),
    y = if_else(z == 1, y_if_treated, y_if_control)
  )

The simple comparison is equivalent to a regression on the treatment indicator:

set.seed(619)

fit_2a <- lm(y ~ z, data = data_2, refresh = 0)

fit_2a


Call:
lm(formula = y ~ z, data = data_2, refresh = 0)

Coefficients:
(Intercept)            z  
     59.671        6.363

And the estimate adjusting for pre-test:

set.seed(619)

fit_2b <- lm(y ~ z + x, data = data_2, refresh = 0)

fit_2b


Call:
lm(formula = y ~ z + x, data = data_2, refresh = 0)

Coefficients:
(Intercept)            z            x  
     20.673        7.357        0.795

In this case, with a strong dependence between pre-test and post-test, this adjustment has reduced the residual standard deviation by about a quarter.

Example 2: simulating a regression design

We can use simulation to test rather complex study designs
Imagine you are interested in students attitude towards smoking and how it depends on the medium of the message and the focus of the message
We want to know whether people’s attitude is different after seeing a visual anti-smoking message (these pictures on the package) vs. a text-message (the text belonging to that picture)
We are interested in whether the attitude that people report is different after seeing a message that regards the consequences on other people (e.g. smoking can harm your loved ones) as compared to yourself (smoking can cause cancer)

Study design

DV: attitude towards smoking (0-100) IV1: medium (text vs. visual) IV2: focus (internal vs. external)

This is, there are 4 groups:

group_TI will receive text-messages that are internal
group_TE will receive text-messages that are external
group_VI will receive visual messages that are internal
group_VE will receive visual messages that are external
assume that we expect that people’s attitude will be more negative after seeing a visual rather than text message if the focus is internal (i.e. the message is about yourself) because it might be difficult to imagine that oneself would get cancer after reading a text but seeing a picture might cause fear regardless
for the external focus on the other hand, we expect a more negative attitude after reading a text as compared to seeing a picture, as it might have more impact on attitude to imagine a loved one get hurt than seeing a stranger in a picture suffering from the consequences of second-hand smoking
we expect that the internal focus messages will be related to lower attitudes compared to the external focus messages on average but we expect no main-effect of picture vs. text-messages
visualize some rough means that show the desired behavior that we described in words earlier and see where we are going
we could make the overall mean of the internal focus groups (group_TI and group_VI) 20 and the mean of the external groups (group_TE and group_VE) 50 (this would already reflect the main-effect but also a belief that the smoking-attitudes are on average quite negative as we assume both means to be on the low end of the scale)
assume that the mean of group_TI is 30 while the mean of group_VI is 10 and we could assume that the mean of group_TE is 40 and the mean of group_VE is 60

focus <- rep(c("internal", "external"), each = 2)
media <- rep(c("text", "visual"), times = 2)
mean_TI <- 50
mean_VI <- 20
mean_TE <- 30
mean_VE <- 60

pd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)

interaction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))

focus <- rep(c("internal", "external"), each = 2)
media <- rep(c("text", "visual"), times = 2)
mean_TI <- 43
mean_VI <- 40
mean_TE <- 45
mean_VE <- 47

pd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)

interaction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))

in the new example there is a difference between the two media groups on average but it is only .50 points, so arguably it is small enough to represent the assumption of “no” effect, as in real-life “no” effect in terms of a difference being actually 0 is rather rare
come up with some reasonable standard-deviation; if we start at 50 and we want most people to be < 80, we can set the 2-SD bound at 80 to get a standard-deviation of 15 (80-50)/2.
let’s assume that each of our groups has a standard-deviation of 15 points.

group_TI = normal(n, 43, 15)
group_VI = normal(n, 40, 15)
group_TE = normal(n, 45, 15)
group_VE = normal(n, 47, 15)

n <- 1e5
group_TI <-  rnorm(n, 43, 15)
group_VI <-  rnorm(n, 40, 15)
group_TE <-  rnorm(n, 45, 15)
group_VE <-  rnorm(n, 47, 15)

participant <- c(1:(n*4))
focus <- rep(c("internal", "external"), each = n*2)
media <- rep(c("text", "visual"), each = n, times = 2)

data <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))

summary(data)

  participant       focus              media               score       
 Min.   :1e+00   Length:400000      Length:400000      Min.   :-36.36  
 1st Qu.:1e+05   Class :character   Class :character   1st Qu.: 33.51  
 Median :2e+05   Mode  :character   Mode  :character   Median : 43.77  
 Mean   :2e+05                                         Mean   : 43.77  
 3rd Qu.:3e+05                                         3rd Qu.: 54.01  
 Max.   :4e+05                                         Max.   :113.93

Power-analysis

Some additional assumptions: suppose we have enough funding for a sizeable data collection and the aim is to ensure that we do not draw unwarranted conclusions from the research
We should then set the alpha-level at a more conservative value (\(\alpha = .001\)); with this, we expect to draw non-realistic conclusions in the interaction effect in only about 1 in every 1,000 experiments
We also want to be sure that we do detect an existing effect and keep our power high at 95%; with this, we expect that if there is an interaction effect, we would detect it in 19 out of 20 cases (only miss it in 1 out of 20, or 5%)
Running the power-simulation can be very memory-demanding and the code can run a very long time to complete; it’s advised to start from various “low-resolution” sample-sizes (e.g. n = 10, n = 100, n = 200, etc.) to get a rough idea of where we can expect our loop to end. Then, the search can be made more specific in order to identify a more precise sample size.

set.seed(1)
n_sims <- 1000 # we want 1000 simulations
p_vals <- c()
power_at_n <- c(0) # this vector will contain the power for each sample-size (it needs the initial 0 for the while-loop to work)
n <- 100 # sample-size and start at 100 as we can be pretty sure this will not suffice for such a small effect
n_increase <- 100 # by which stepsize should n be increased
i <- 2

power_crit <- .95
alpha <- .001

while(power_at_n[i-1] < power_crit){
  for(sim in 1:n_sims){
    group_TI <-  rnorm(n, 43, 15)
    group_VI <-  rnorm(n, 40, 15)
    group_TE <-  rnorm(n, 45, 15)
    group_VE <-  rnorm(n, 47, 15)
    
    participant <- c(1:(n*4))
    focus <- rep(c("internal", "external"), each = n*2)
    media <- rep(c("text", "visual"), each = n, times = 2)
    
    data <- data.frame(participant = participant, focus = focus, media = media, score = c(group_TI, group_VI, group_TE, group_VE))
    data$media_sum_num <- ifelse(data$media == "text", 1, -1) # apply sum-to-zero coding
    data$focus_sum_num <- ifelse(data$focus == "external", 1, -1) 
    lm_int <- lm(score ~ 1 + focus_sum_num + media_sum_num + focus_sum_num:media_sum_num, data = data) # fit the model with the interaction
    lm_null <- lm(score ~ 1 + focus_sum_num + media_sum_num, data = data) # fit the model without the interaction
    p_vals[sim] <- anova(lm_int, lm_null)$`Pr(>F)`[2] # put the p-values in a list
  }
    print(n)
    power_at_n[i] <- mean(p_vals < alpha) # check power (i.e. proportion of p-values that are smaller than alpha-level of .10)
    names(power_at_n)[i] <- n
    n <- n+n_increase # increase sample-size by 100 for low-resolution testing first
    i <- i+1 # increase index of the while-loop by 1 to save power and cohens d to vector
}

[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900

power_at_n <- power_at_n[-1] # delete first 0 from the vector

We can plot the results form the power-simulation:

At roughly 900 participants we observe sufficient power

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press.