I'm trying to fit a binomial GLM using the rethinking package (which draws on rstan for MCMC). The model runs, but sampling is inefficient and Rhat indicates something went wrong. I don't understand the cause of this fitting problem.
This is the data:
d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
This is the model:
m1_stan <- map2stan(
  alist(
    afd_votes ~ dbinom(votes_total, p),
    logit(p) <- beta0 + beta1*foreigner_n,
    beta0 ~ dnorm(0, 10),
    beta1 ~ dnorm(0, 10)
  ),
  data = d,
  WAIC = FALSE,
  iter = 1000)
Fit diagnostics (Rhat, number of effective samples) indicate that something went wrong:
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 -3.75 0 -3.75 -3.75 3 2.21
beta1 0.00 0 0.00 0.00 10 1.25
The traceplot does not show a "fat hairy caterpillar":
The Stan output suggested increasing two parameters, adapt_delta and max_treedepth, which I did. That improved the sampling somewhat:
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 18.1 0.09 18.11 18.16 28 1.06
beta1 0.0 0.00 0.00 0.00 28 1.06
But as the traceplot shows, there's still something wrong:
The pairs plot looks strange, too:
What else I tried:
But nothing helped so far. What could be the reason for the inefficient fitting? What could I try?
Could the large counts in each row be a problem?
mean(d$afd_votes)
[1] 19655.83
Data excerpt:
head(d)
afd_votes votes_total foreigner_n
1 11647 170396 16100
2 9023 138075 12600
3 11176 130875 11000
4 11578 156268 9299
5 10390 150173 25099
6 11161 130514 13000
session info:
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] viridis_0.5.1 viridisLite_0.3.0 sjmisc_2.7.3 pradadata_0.1.3 rethinking_1.59 rstan_2.17.3 StanHeaders_2.17.2 forcats_0.3.0 stringr_1.3.1
[10] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] httr_1.3.1 jsonlite_1.5 modelr_0.1.2 assertthat_0.2.0 stats4_3.5.0 cellranger_1.1.0 yaml_2.1.19 pillar_1.3.0 backports_1.1.2
[10] lattice_0.20-35 glue_1.3.0 digest_0.6.15 rvest_0.3.2 snakecase_0.9.1 colorspace_1.3-2 htmltools_0.3.6 plyr_1.8.4 pkgconfig_2.0.1
[19] broom_0.5.0 haven_1.1.2 bookdown_0.7 mvtnorm_1.0-8 scales_0.5.0 stringdist_0.9.5.1 sjlabelled_1.0.12 withr_2.1.2 RcppTOML_0.1.3
[28] lazyeval_0.2.1 cli_1.0.0 magrittr_1.5 crayon_1.3.4 readxl_1.1.0 evaluate_0.11 nlme_3.1-137 MASS_7.3-50 xml2_1.2.0
[37] blogdown_0.8 tools_3.5.0 loo_2.0.0 data.table_1.11.4 hms_0.4.2 matrixStats_0.54.0 munsell_0.5.0 prediction_0.3.6 bindrcpp_0.2.2
[46] compiler_3.5.0 rlang_0.2.1 grid_3.5.0 rstudioapi_0.7 labeling_0.3 rmarkdown_1.10 gtable_0.2.0 codetools_0.2-15 inline_0.3.15
[55] curl_3.2 R6_2.2.2 gridExtra_2.3 lubridate_1.7.4 knitr_1.20 bindr_0.1.1 rprojroot_1.3-2 KernSmooth_2.23-15 stringi_1.2.4
[64] Rcpp_0.12.18 tidyselect_0.2.4 xfun_0.3 coda_0.19-1
Stan does better with unit-scale, uncorrelated parameters. From the Stan manual, §28.4 "Model Conditioning and Curvature":
Ideally, all parameters should be programmed so that they have unit scale and so that posterior correlation is reduced; together, these properties mean that there is no rotation or scaling required for optimal performance of Stan’s algorithms. For Hamiltonian Monte Carlo, this implies a unit mass matrix, which requires no adaptation as it is where the algorithm initializes. Riemannian Hamiltonian Monte Carlo performs this conditioning on the fly at every step, but such conditioning is very expensive computationally.
In your case, beta1 is tied to foreigner_n, which does not have unit scale, and so is imbalanced relative to beta0. Additionally, since foreigner_n is not centered, both betas shift the location of p during sampling, hence the posterior correlation.
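The same mechanism shows up in a simulated linear-model analogue (hypothetical data, not the election data): with an uncentered, large-scale predictor the intercept and slope estimates are almost perfectly correlated, and centering removes that correlation.

```r
# Sketch: intercept/slope correlation for an uncentered vs. centered predictor.
# Simulated data roughly on the scale of foreigner_n; not the election data.
set.seed(1)
x <- rnorm(100, mean = 15000, sd = 5000)
y <- 0.1 + 1e-5 * x + rnorm(100, sd = 0.1)

cov2cor(vcov(lm(y ~ x)))              # intercept/slope correlation near -1
cov2cor(vcov(lm(y ~ I(x - mean(x))))) # correlation essentially 0
```

Because the intercept must compensate for any change in the slope when the predictor's mean is far from zero, the sampler has to traverse a long, narrow ridge in the posterior.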
Standardizing yields a more tractable model. Transforming foreigner_n to be centered and unit-scale lets the model converge rapidly and yields high effective sample sizes. I'd also argue that the betas in this form are more interpretable: beta0 controls only the location of p, while beta1 concerns only how variation in foreigner_n explains variation in afd_votes/votes_total.
library(readr)
library(rethinking)
d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_z <- scale(d$foreigner_n)
m1 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_z,
  c(beta0, beta1) ~ dnorm(0, 1)
)
m1_stan <- map2stan(m1, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
Examining the sampling, we have
> summary(m1_stan)
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -1.95 0.00 0.00 -1.95 -1.95 -1.95 -1.95 -1.95 16352 1
beta1 -0.24 0.00 0.00 -0.24 -0.24 -0.24 -0.24 -0.24 13456 1
dev 861952.93 0.02 1.97 861950.98 861951.50 861952.32 861953.73 861958.26 9348 1
lp__ -17523871.11 0.01 0.99 -17523873.77 -17523871.51 -17523870.80 -17523870.39 -17523870.13 9348 1
Samples were drawn using NUTS(diag_e) at Sat Sep 1 11:48:55 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
And looking at the pairs plot, we see the correlation between the betas is reduced to 0.15:
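That figure can also be checked directly from the posterior draws; extract.samples() is the rethinking package's helper for pulling samples out of a fitted map2stan object:

```r
# Posterior correlation between the betas, computed from the samples
post <- extract.samples(m1_stan)
round(cor(post$beta0, post$beta1), 2)
```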
I originally intuited that the uncentered foreigner_n was the main issue. At the same time, I was slightly confused, because Stan uses HMC/NUTS, which I had understood to be rather robust to correlated latent variables. However, the Stan manual notes practical problems with scale invariance due to numerical instability, which Michael Betancourt also discusses in a CrossValidated answer (though it is a rather old post). So I wanted to test whether centering or scaling was more impactful in improving sampling.
Centering alone still results in quite poor performance. Note that n_eff = 4 is literally one effective sample per chain.
library(readr)
library(rethinking)
d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_c <- d$foreigner_n - mean(d$foreigner_n)
m2 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_c,
  c(beta0, beta1) ~ dnorm(0, 1)
)
m2_stan <- map2stan(m2, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
which yields
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -0.64 0.4 0.75 -1.95 -1.29 -0.54 0.2 0.42 4 2.34
beta1 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.00 4 2.35
dev 18311608.99 8859262.1 17270228.21 861951.86 3379501.84 14661443.24 37563992.4 46468786.08 4 1.75
lp__ -26248697.70 4429630.9 8635113.76 -40327285.85 -35874888.93 -24423614.49 -18782644.5 -17523870.54 4 1.75
Samples were drawn using NUTS(diag_e) at Sun Sep 2 18:59:52 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
And still appears to have a problematic pairs plot:
Scaling alone vastly improves the sampling! Even though the resulting posteriors still have rather high correlation, the effective sample sizes are in acceptable ranges, although well below those of full standardization.
library(readr)
library(rethinking)
d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_s <- d$foreigner_n / sd(d$foreigner_n)
m3 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_s,
  c(beta0, beta1) ~ dnorm(0, 1)
)
m3_stan <- map2stan(m3, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
yielding
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -1.58 0.00 0.00 -1.58 -1.58 -1.58 -1.58 -1.57 5147 1
beta1 -0.24 0.00 0.00 -0.24 -0.24 -0.24 -0.24 -0.24 5615 1
dev 861952.93 0.03 2.01 861950.98 861951.50 861952.31 861953.69 861958.31 5593 1
lp__ -17523870.45 0.01 1.00 -17523873.15 -17523870.83 -17523870.14 -17523869.74 -17523869.48 5591 1
Samples were drawn using NUTS(diag_e) at Sun Sep 2 19:02:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The pairs plot shows that there is still significant correlation:
Hence, while decorrelating variables does improve sampling, eliminating scale imbalances is most important in this model.
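To make the comparison concrete, here is a small sketch (reusing the d loaded above) that lines up the location and scale of the four versions of the predictor; only the scaled and standardized versions have unit scale, which is what the sampler cares about most here:

```r
x <- d$foreigner_n
transforms <- list(
  raw          = x,
  centered     = x - mean(x),
  scaled       = x / sd(x),
  standardized = as.numeric(scale(x))
)
# mean and sd of each transform, side by side
round(sapply(transforms, function(v) c(mean = mean(v), sd = sd(v))), 2)
```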