# Setting up simple simulation studies in R

### June 6, 2024

Setting up and running simulation studies are a ubiquitous task in applied statistics. In this post, I’ll write up a small simulation study to show how I usually approach setting them up in R. This post will assume a certain level of familiarity with causal inference.

The simulation study will compare a naive estimator of an Average Treatment Effect vs. a simple parametric g-estimator. The first step in setting up a simulation study is deciding on a data generating process for the observed data. I like to start by defining a bare-bones simulation function:

simulate <- function(N, seed) {
set.seed(seed)
}

This function will evolve with the simulation study, but at core it takes two arguments: N, the number of observations, and seed, the random number seed to use so that the function is reproducible.

For our simulation, suppose the observed data are $$n$$ i.i.d. draws of a variable $$O = (X, A, Y)$$ where $$X$$ is a covariate, $$A$$ is a binary treatment indicator, and $$Y$$ is a continuous outcome. Further define the potential outcomes $$Y(A)$$ for $$A \in \{ 0, 1 \}$$ for the outcome under treatment $$A = 0$$ and $$A = 1$$. Then we can write the observed outcome as $$Y = AY(1) + (1 - A)Y(0)$$.

Our target statistical parameter is the average treatment effect (ATE), defined as $\psi = \mathbb{E}[Y(1) - Y(0)].$

Under standard assumptions, the ATE is identified as $\psi = \mathbb{E}[\mathbb{E}[Y | A = 1, X] - \mathbb{E}[Y | A = 0, X]].$

Let’s get into setting up the actual simulation study. To be more concrete, we will assume the following data generating process (DGP): \begin{align*} X &\sim \mathrm{Normal}(0, 1), \\ A | X &\sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}(\alpha X)\right), \\ Y | A, X &\sim \mathrm{Normal}\left(\beta_1 X + \beta_2 A, \sigma^2\right), \end{align*} where $$\alpha, \beta_1, \beta_2$$, and $$\sigma > 0$$ are parameters that we can use to control the properties of the simulated data. Specifically, $$\alpha$$ controls the dependence of $$A$$ on $$X$$; we can use it to control the degree of covariate overlap in the treatment vs. control groups. The parameter $$\beta_1$$ controls the strength of association between the covariate and the outcome, and $$\beta_2$$ controls the magnitude of the treatment effect. Finally, $$\sigma$$ controls the conditional variance of the outcome.

For causal simulation studies, I usually like to explicitly generate the potential outcomes, if it’s feasible to do so. Based on the prior DGP, the conditional distributions of the potential outcomes are given by: \begin{align*} Y(0) | X &\sim \mathrm{Normal}(\beta_1 X, \sigma^2), \\ Y(1) | X &\sim \mathrm{Normal}(\beta_1 X + \beta_2, \sigma^2). \\ \end{align*}

Next, we want to fill out the simulate function to encode this data generating process. We’ll add the parameters as arguments so that we can control how the data are generated. I typically rely heavily on tidyverse packages, so we’ll load that as well.

library(tidyverse)

simulate <- function(N, alpha, beta1, beta2, sigma, seed) {
X <- rnorm(N, mean = 0, sd = 1)
A <- rbinom(N, size = 1, prob = plogis(alpha * X))

# Potential outcomes
Y0 <- rnorm(N, mean = beta1 * X , sd = sigma)
Y1 <- rnorm(N, mean = beta1 * X + beta2, sd = sigma)

# Observed outcome
Y <- A * Y1 + (1 - A) * Y0

# Return data frame with covariate, treatment, both potential outcomes,
# and observed outcome
tibble(X, A, Y0, Y1, Y)
}

In a simulation study we have the benefit of being able to observe both potential outcomes – in reality, we would only get to see one (this is called the “fundamental problem of causal inference”).

In the DGP we’re working with, it’s easy to read off the true value of the ATE parameter $$\psi$$: it is just $$\beta_2$$. In more complicated DGPs, it can be more difficult (or impossible) to find a closed form solution for the true parameter value. A useful approach here is to estimate the value of the causal parameter by drawing a very large number of observations, and then plugging those observations into the causal parameter. Since we simulate all the potential outcomes, we can get a very good estimate of the true causal value.

For the ATE, it’s pretty simple to write a function that, given a simulated dataset, takes the average of the two potential outcomes:

true_ate <- function(data) {
mean(data$Y1 - data$Y0)
}

For example, we can calculate the true ATE under a specific set of simulation parameters using a large draw from the simulation DGP:

true_ate(
simulate(N = 1e5, alpha = 1, beta1 = 1, beta2 = 1, sigma = 0.1, seed = 1)
)
## [1] 0.999703

For those parameter values, the true ATE should be exactly 1; this Monte-Carlo approach gets us a pretty close estimate.

Next, we can start defining some estimators of the ATE that we will be comparing in our simulation study. As a benchmark, we could consider a “naive” estimator that simply takes the difference of the empirical means of the treatment and control groups. Let’s put this into its own function:

ate_naive <- function(data) {
estimate <- mean(data$Y[data$A == 1]) - mean(data$Y[data$A == 0])

# Return estimate
tibble(
estimator = "naive",
estimate = estimate
)
}

Second, we can define a parametric g-computation estimator using linear models:

ate_gcomp <- function(data) {
# Estimate outcome regression model
model <- lm(Y ~ A + X, data = data)

# Predict E[Y | A = 0, X]
mu0 = predict(model, mutate(data, A = 0))

# Predict E[Y | A = 1, X]
mu1 = predict(model, mutate(data, A = 1))

# Form estimate E[E[Y | A = 1, X] - E[Y | A = 0, X]]
estimate <- mean(mu1 - mu0)

# Return estimate
tibble(
estimator = "g-computation",
estimate = estimate
)
}

Next, I like to have a single function that runs all of the estimators on a simulated dataset:

fit <- function(data) {
bind_rows(
ate_naive(data),
ate_gcomp(data)
)
}

Now we’re finally ready to set up and run the actual study. We’ll set up a data frame that contains the simulated datasets. In this example, we simulate $$100$$ datasets for every combination of sample sizes $$N = \{ 50, 100, 250, 500 \}$$ and parameters $$\alpha \in \{ 0, 1, 2 \}$$, $$\beta_1 = 1$$, $$\beta_2 = 1$$, and $$\sigma = 0.1$$. As $$\alpha$$ increases, the degree of covariate overlap between the treatment and control group will decrease, so this simulation will help us understand how the estimators behave under increasingly severe violations of positivity.

As mentioned earlier, in this simple example the true ATE has a closed form. We’ll use that knowledge and add an additional column giving the true ATEs corresponding to each simulated dataset. Alternatively, we could use the true_ate function we defined earlier and estimate the true ATE by drawing a large dataset for each unique set of parameters.

num_simulations <- 100

simulated_datasets <- expand_grid(
N     = c(50, 100, 250, 500),
index = 1:num_simulations,
alpha = c(0, 1, 2),
beta1 = 1,
beta2 = 1,
sigma = 0.1
) %>%
mutate(
seed = 1:n(),
true_ate = beta2,
data = pmap(list(N, alpha, beta1, beta2, sigma, seed), simulate)
)

Now we apply the estimators to each of the simulated datasets, and immediately unnest to access each estimator’s results:

simulation_results <- simulated_datasets %>%
mutate(fits = map(data, fit)) %>%
select(-data) %>%
unnest(fits)

Now we can summarize the results to understand how, for example, the mean bias of the two estimators differs:

simulation_summary <- simulation_results %>%
group_by(N, alpha, beta1, beta2, sigma, estimator) %>%
summarize(bias = mean(estimate - true_ate))
## summarise() has grouped output by 'N', 'alpha', 'beta1', 'beta2', 'sigma'.
## You can override using the .groups argument.
simulation_summary
## # A tibble: 24 × 7
## # Groups:   N, alpha, beta1, beta2, sigma [12]
##        N alpha beta1 beta2 sigma estimator         bias
##    <dbl> <dbl> <dbl> <dbl> <dbl> <chr>            <dbl>
##  1    50     0     1     1   0.1 g-computation  0.00162
##  2    50     0     1     1   0.1 naive          0.0374
##  3    50     1     1     1   0.1 g-computation  0.00171
##  4    50     1     1     1   0.1 naive          0.882
##  5    50     2     1     1   0.1 g-computation -0.00233
##  6    50     2     1     1   0.1 naive          1.23
##  7   100     0     1     1   0.1 g-computation  0.00296
##  8   100     0     1     1   0.1 naive          0.0273
##  9   100     1     1     1   0.1 g-computation  0.00192
## 10   100     1     1     1   0.1 naive          0.817
## # ℹ 14 more rows

To view the results, it’s usually nice to convert to some other format that’s easier to read. Ideally, we can format it in such a way that it’s easy to put directly into a paper.

simulation_table <- simulation_summary %>%
ungroup() %>%
select(alpha, N, estimator, bias) %>%
pivot_wider(names_from = estimator, values_from = bias) %>%
arrange(alpha, N)

simulation_table
## # A tibble: 12 × 4
##    alpha     N g-computation    naive
##    <dbl> <dbl>           <dbl>    <dbl>
##  1     0    50        0.00162  0.0374
##  2     0   100        0.00296  0.0273
##  3     0   250        0.00124  0.000927
##  4     0   500       -0.000239 0.00381
##  5     1    50        0.00171  0.882
##  6     1   100        0.00192  0.817
##  7     1   250       -0.00104  0.824
##  8     1   500       -0.000191 0.826
##  9     2    50       -0.00233  1.23
## 10     2   100       -0.00308  1.21
## 11     2   250       -0.00351  1.20
## 12     2   500       -0.000441 1.21

We can format this into an HTML table and clean up the precision of the estimates:

simulation_table %>%
mutate_at(vars(g-computation, naive), scales::number_format(accuracy = 0.001)) %>%
knitr::kable()
alpha N g-computation naive
0 50 0.002 0.037
0 100 0.003 0.027
0 250 0.001 0.001
0 500 0.000 0.004
1 50 0.002 0.882
1 100 0.002 0.817
1 250 -0.001 0.824
1 500 0.000 0.826
2 50 -0.002 1.227
2 100 -0.003 1.206
2 250 -0.004 1.205
2 500 0.000 1.206

From these results, we can see that the bias of the naive estimator is zero if there is no confounding between $$X$$ and $$A$$ (that is, when $$\alpha = 0$$). For larger values of $$\alpha$$, the bias of the naive method increases. On the other hand, the parametric g-computation method has low bias in all settings. This is not surprising, as the parametric model in the g-computation is correctly specified.

Now that we have a skeleton, we could continue building out the simulation study by adding more estimators, more metrics (say the mean absolute error, empirical coverage if our methods give confidence intervals, etc.), and testing more combinations of simulation parameters.

The code for this example lays out how I typically start simulation studies. The main idea is to have a function that simulates the data, and a function that runs all the estimators. Then a big data frame holds all the simulated datasets for the various combination of simulation parameters. The estimators are then run on each of the simulated datasets using the pmap function.

This setup can be extended to handle more complicated scenarios. If the estimators are computation intensive, we can easily switch to running them via future_pmap so that they are executed in parallel. We can also modify the fit function and how it is called to cache estimated values for a simulated dataset, so that if R crashes or is interrupted the simulation study can be restarted and pick up where it left off.

For very computationally intensive estimators, it’s often necessary to move to a high-performance computing cluster. This setup can be generalized to that case, where the R script is executed with various settings passed by environment variables by the cluster scheduler.