# Presidential Primary Polling Analysis in Stan

Note: this post has been updated with more recent data.

I often use random walk/autoregressive models in my research as a component in time-series analysis, and I wanted to get some more experience fitting them to data. FiveThirtyEight publishes several polling datasets, including polling for the 2020 Democratic presidential primary. I used Stan to fit a Bayesian random walk model to the polling data, which I describe below. The Stan and R code used in this post is available as a Github gist.

Let $$\delta_{c,t}$$ be the true proportion of voters in favor of candidate $$c$$ at time $$t$$. Our modeling assumption is that the logit-transform of $$\delta_{c,t}$$ follows a random walk; that is: $\mathrm{logit}(\delta_{c,t}) \sim \mathrm{N}\left(\mathrm{logit}(\delta_{c,t-1}), \tau^2\right)$ We can’t observe $$\delta_{c,t}$$ directly; we have to infer it through the noisy observations we have from polls.

Let $$s_{i}$$ be the sample size of poll $$c[i]$$ and $$y_{i}$$ the number of poll respondents in favor of candidate $$c[i]$$ at time $$t[i]$$. Let $$\phi_i$$ be the proportion of poll respondents in favor of candidate $$c[i]$$ at time $$t[i]$$. To incorporate sampling error, we model $$y_i$$ as binomial: $y_i \sim \mathrm{Binomial}(s_i, \phi_i)$ We also allow for added variance in our observations by relating $$\phi_i$$ to the true logit proportion $$\delta_{c[i], t[i]}$$ with a normal distribution: $\mathrm{logit}(\phi_i) \sim \mathrm{N}(\mathrm{logit}(\delta_{c[i], t[i]}), \sigma^2)$

To finish defining the model half-normal priors on the hyperparameters. The prior for $$\tau^2$$ has a small variance to improve identification of the model (a vaguer prior can cause the MCMC chains to not mix well.) \begin{aligned} \tau^2 &\sim \mathrm{N}(0, 0.02)[0, \infty] \\ \sigma^2 &\sim \mathrm{N}(0, 1)[0, \infty] \end{aligned}

Here is the Stan representation of the statistical model:

S4 class stanmodel 'random_walk' coded as follows:
data {
int<lower=0> T; // Number of timepoints
int<lower=0> C; // Number of candidates
int<lower=0> N; // Number of poll observations

int sample_size[N]; // Sample size of each poll
int y[N]; // Number of respondents in poll for candidate (approximate)
int<lower=1, upper=T> get_t_i[N]; // timepoint for ith observation
int<lower=1, upper=C> get_c_i[N]; // candidate for ith observation
}
parameters {
matrix[C, T] delta_logit; // Percent for candidate c at time t
real<lower=0, upper=1> phi[N]; // Percent of participants in poll for candidate
real<lower=0> tau; // Random walk variance
real<lower=0,upper=0.5> sigma; // Overdispersion of observations
}
model {
// Priors
tau ~ normal(0, 0.2);
sigma ~ normal(0, 1);

// Random walk
for(c in 1:C) {
delta_logit[c, 2:T] ~ normal(delta_logit[c, 1:(T - 1)], tau);
}

// Observed data
y ~ binomial(sample_size, phi);
for(i in 1:N) {
// Overdispersion
delta_logit[get_c_i[i], get_t_i[i]] ~ normal(logit(phi[i]), sigma);
}
}
generated quantities {
matrix[C, T] delta = inv_logit(delta_logit);
} 

The raw dataset that we are going to fit:

I fitted the Stan model to the data using the standard HMC-NUTS algorithm and 1,000 MCMC iterations (with 500 of those for warmup.) The plot below shows the posterior median with 75% and 95% credible intervals.

Let’s zoom in on August-November 2019 where there are more data:

This model is too simple to be that useful, but it could serve as a starting point for more complex models. The model could be improved by incorporating more information, perhaps by finding a data source that includes more polls or modifying the model to incorporate results from head-to-head matchups. We might also want to model systematic biases by the type of poll (phone or internet) or the pollster.