Herb Susmann CV About

Using R formulas to pass data to Stan

Many statistical routines in R use R formulas as a flexible way to specify the terms of a model. With a little setup, we can use formulas to build inputs to Stan and avoid hard-coding any variables in the model.

For example, say you are writing a Stan model for linear regression. You would like to regress a response variable $y$ on two predictors, $x_1$ and $y_1$:

// linear_regression.stan
data {
  int N; // Number of observations
  vector[N] y;
  vector[N] x1;
  vector[N] x2;
} 
parameters {
  real beta_0;
  real beta_1;
  real beta_2;
  real<lower=0> sigma;
} 
model {
  y ~ normal(beta_0 + beta_1 * x1 + beta_2 * x2, sigma);
}

But what if you later decide to add more predictors? We can make the above model more flexible by allowing a matrix of predictors $X$ of arbitrary size:

// linear_regression.stan
data {
  int N; // Number of observations
  int K; // Number of predictors
  vector[N] y;
  matrix[N, K] X;
} 
parameters {
  vector[K] beta;
  real<lower=0> sigma;
} 
model {
  y ~ normal(beta * X, sigma);
}

Then we can use R formulas to build the predictor matrix $X$ and pass it to Stan:

fit_linear_regression <- function(formula, data, ...) {
  model <- stan_model("./linear_regression.stan")

  X <- model.matrix(formula, data)
  y <- model.extract(model.frame(formula, data), "response")
  
  data <- list(
    N = nrow(X),
    K = ncol(X),
    X = X,
    y = y
  )

  sampling(model, data, ...)
}

Now it’s easy to fit the model with different predictors:

N <- 100
simulated_data <- tibble::tibble(
  x1 = rnorm(N, 0, 1),
  x2 = rnorm(N, 0, 1),
  y = x1 + 2*x2 + rnorm(N, 0, 0.1)
)


fit_linear_model(y ~ x1, simulated_data)
fit_linear_model(y ~ x1 + x2, simulated_data)

Writing a separate function for preparing the data for Stan based on a formula makes the model more usable and flexible.

Smartphone interface for reporting research results to study participants

One of the themes that I worked on at Silent Spring Institute was on how to report complex personal data to our study participants. In the PROTECT study, mothers in Puerto Rico were tested for a host of environmental chemicals. Our job was to design a tool to report-back individual results to the participants.

While I was on the project, I designed and implemented a novel smartphone interface for communicating personal results to study participants. One aspect was designing a visualization that allowed participants to compare their results to other women in the study. Our approach uses a SinaPlot where the participant’s personal results are represented by an avatar that they chose when they enter their report.

Last May I left Silent Spring Institute to pursue graduate school, and I was happy to see that the tool was launched in October! To learn more, check out the article the PROTECT team wrote about the launch of the reports.

PROTECT Smartphone Interface

More animals from Twitter

More from Twitter, to complete a very silly triptych:

Code for the @common_squirrel plot and the @isacatinthesink plot are available as Github Gists.

From Twitter:

Here’s the R code I used to generate the histogram:

library(rtweet)
library(tidyverse)
library(stringr)
library(cowplot)
library(grid)
library(jpeg)

g <- rasterGrob(readJPEG("ellie.jpg"), interpolate = TRUE)

tmls <- get_timelines("dog_rates", n = 3200)

ratings <- tmls %>%
  filter(str_detect(text, "t.co")) %>%
  filter(!str_detect(text, "^RT")) %>%
  filter(!str_detect(text, "Here's a little more info on Dew")) %>%
  mutate(rating = map(text, str_extract_all, "1[0-5]/10"),
         rating = map(rating, `[[`, 1)) %>%
  unnest(rating) %>%
  count(rating)

ggplot(ratings, aes(x = rating, y = n)) +
  annotation_custom(g) +
  geom_col(fill = "white", alpha = 0.8) +
  labs(caption = "Data: @dog_rates, photo: @KatieNicoleF",
       title = "577 WeRateDogs™ Ratings")

The R code is also available as a gist.