# Using R formulas to pass data to Stan

Many statistical routines in R use R formulas as a flexible way to specify the terms of a model. With a little setup, we can use formulas to build inputs to Stan and avoid hard-coding any variables in the model.

For example, say you are writing a Stan model for linear regression. You would like to regress a response variable $$y$$ on two predictors, $$x_1$$ and $$y_1$$:

// linear_regression.stan
data {
int N; // Number of observations
vector[N] y;
vector[N] x1;
vector[N] x2;
}
parameters {
real beta_0;
real beta_1;
real beta_2;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x1 + beta_2 * x2, sigma);
}

But what if you later decide to add more predictors? We can make the above model more flexible by allowing a matrix of predictors $$X$$ of arbitrary size:

// linear_regression.stan
data {
int N; // Number of observations
int K; // Number of predictors
vector[N] y;
matrix[N, K] X;
}
parameters {
vector[K] beta;
real<lower=0> sigma;
}
model {
y ~ normal(beta * X, sigma);
}

Then we can use R formulas to build the predictor matrix $$X$$ and pass it to Stan:

fit_linear_regression <- function(formula, data, ...) {
model <- stan_model("./linear_regression.stan")

X <- model.matrix(formula, data)
y <- model.extract(model.frame(formula, data), "response")

data <- list(
N = nrow(X),
K = ncol(X),
X = X,
y = y
)

sampling(model, data, ...)
}

Now it’s easy to fit the model with different predictors:

N <- 100
simulated_data <- tibble::tibble(
x1 = rnorm(N, 0, 1),
x2 = rnorm(N, 0, 1),
y = x1 + 2*x2 + rnorm(N, 0, 0.1)
)

fit_linear_model(y ~ x1, simulated_data)
fit_linear_model(y ~ x1 + x2, simulated_data)

Writing a separate function for preparing the data for Stan based on a formula makes the model more usable and flexible.