Using R formulas to pass data to Stan

Many statistical routines in R use R formulas as a flexible way to specify the terms of a model. With a little setup, we can use formulas to build inputs to Stan and avoid hard-coding any variables in the model.

For example, say you are writing a Stan model for linear regression. You would like to regress a response variable \(y\) on two predictors, \(x_1\) and \(y_1\):

// linear_regression.stan
data {
  int N; // Number of observations
  vector[N] y;
  vector[N] x1;
  vector[N] x2;
} 
parameters {
  real beta_0;
  real beta_1;
  real beta_2;
  real<lower=0> sigma;
} 
model {
  y ~ normal(beta_0 + beta_1 * x1 + beta_2 * x2, sigma);
}

But what if you later decide to add more predictors? We can make the above model more flexible by allowing a matrix of predictors \(X\) of arbitrary size:

// linear_regression.stan
data {
  int N; // Number of observations
  int K; // Number of predictors
  vector[N] y;
  matrix[N, K] X;
} 
parameters {
  vector[K] beta;
  real<lower=0> sigma;
} 
model {
  y ~ normal(beta * X, sigma);
}

Then we can use R formulas to build the predictor matrix \(X\) and pass it to Stan:

fit_linear_regression <- function(formula, data, ...) {
  model <- stan_model("./linear_regression.stan")

  X <- model.matrix(formula, data)
  y <- model.extract(model.frame(formula, data), "response")
  
  data <- list(
    N = nrow(X),
    K = ncol(X),
    X = X,
    y = y
  )

  sampling(model, data, ...)
}

Now it’s easy to fit the model with different predictors:

N <- 100
simulated_data <- tibble::tibble(
  x1 = rnorm(N, 0, 1),
  x2 = rnorm(N, 0, 1),
  y = x1 + 2*x2 + rnorm(N, 0, 0.1)
)


fit_linear_model(y ~ x1, simulated_data)
fit_linear_model(y ~ x1 + x2, simulated_data)

Writing a separate function for preparing the data for Stan based on a formula makes the model more usable and flexible.

Avatar
Herb Susmann
Graduate Student

Gradate student in Biostatistics at UMass Amherst matter.