multithreading bayesian mcmc stan pystan

Multi Threading in C++ Windows

I am using the pystan module in Windows where multithreading is not supported on Windows in the module. The pystan module is partially written in C++ and since I am trying to decrease the run time of the module, I am wondering if there is a way to hand write a multi-threading code in the C++ part of the module to decrease run time so I can increase the iterations? Below is the code:

from __future__  import division
import pystan
import numpy as np
import os 

x=np.array([453.05,453.05,453.24,453.35,453.44,453.44,453.83,454.02,454.89])
y=np.array([3232.12,3231.45,3231.90,3231.67,3231.84,3231.95,3231.89,3231.67,3231.45])
x=np.array(zip(x,y))
c=np.array([0.01,0.07,0.001,0.1,0.05,0.001,0.001,0.05,0.001])
s = np.array([454.4062631951059,3230.808656891571])
st=np.array([12,12,12,12,12,12,12,12,12])

model='''
    data {
     int D; //number of dimensions
     int K; //number of gaussians
     int N; //number of data 

     vector[D] y[N]; // observation data
     real con[N]; //concentration
     vector[D] s;//oil spill location
     real st[N]; // sample time
    }

    parameters {
     simplex[K] theta; //mixing proportions
     vector[D] v[K];
     vector<lower=0>[D] Dif[K];
     cholesky_factor_corr[D] L[K]; //cholesky factor of correlation matrix
    }

    transformed parameters {
      cholesky_factor_cov[D,D] cov[K,N];
      vector<lower=0>[D] sigma[K,N]; // standard deviations  
      vector[D] mu[K,N];
      real ro[K];
      matrix[D,D] Omega[K];
      matrix[D,D] Sigma[K,N];  
      vector[N] lamba;  

      for (k in 1:K) {  
      Omega[k] = multiply_lower_tri_self_transpose(L[k]);
         for (n in 1:N){
          sigma[k,n] = 0.05 + sqrt(2*st[n]*Dif[k]);
          mu[k,n] = s+v[k]*st[n];
          cov[k,n] = diag_pre_multiply(sigma[k,n],L[k]);
      Sigma[k,n] = quad_form_diag(Omega[k], sigma[k,n]); 
         }
      ro[k]=Omega[k,2,1]; 
      }

      for (i in 1 : N) {lamba[i] = 1/(theta[1]*(1./2./3.1415926/sqrt (Sigma[1,i, 1, 1])/sqrt (Sigma[1,i, 2, 2])/sqrt (1 - ro[1]*ro[1]))*exp (-1./2./(1 - ro[1]*ro[1])*(-(y[i, 1] - mu[1,i, 1])*(y[i, 1] - mu[1,i, 1])/Sigma[1, i,1, 1] - (y[i, 2] - mu[1, i,2])*(y[i, 2] - mu[1, i,2])/Sigma[1,i, 2, 2] + 2.*ro[1]*(y[i, 1] - mu[1,i, 1])*(y[i, 2] - mu[1,i, 2])/sqrt (Sigma[1, i,1, 1])/sqrt (Sigma[1,i, 2, 2]))) + 
           theta[2]*(1./2./3.1415926/sqrt (Sigma[2, i,1, 1])/sqrt (Sigma[2,i, 2, 2])/sqrt (1 - ro[2]*ro[2]))*exp (-1./2./(1 - ro[2]*ro[2])*(-(y[i, 1] - mu[2, i,1])*(y[i, 1] - mu[2, i,1])/Sigma[2, i,1, 1] - (y[i, 2] - mu[2,i, 2])*(y[i, 2] - mu[2, i,2])/Sigma[2,i, 2, 2] + 2.*ro[2]*(y[i, 1] - mu[2, i,1])*(y[i, 2] - mu[2, i,2])/sqrt (Sigma[2, i,1, 1])/sqrt (Sigma[2, i,2, 2]))) +
           theta[3]*(1./2./3.1415926/sqrt (Sigma[3, i,1, 1])/sqrt (Sigma[3,i, 2, 2])/sqrt (1 - ro[3]*ro[3]))*exp (-1./2./(1 - ro[3]*ro[3])*(-(y[i, 1] - mu[3, i,1])*(y[i, 1] - mu[3, i,1])/Sigma[3, i,1, 1] - (y[i, 2] - mu[3,i, 2])*(y[i, 2] - mu[3, i,2])/Sigma[3,i, 2, 2] + 2.*ro[3]*(y[i, 1] - mu[3, i,1])*(y[i, 2] - mu[3, i,2])/sqrt (Sigma[3, i,1, 1])/sqrt (Sigma[3, i,2, 2]))) +
           theta[4]*(1./2./3.1415926/sqrt (Sigma[4, i,1, 1])/sqrt (Sigma[4,i, 2, 2])/sqrt (1 - ro[4]*ro[4]))*exp (-1./2./(1 - ro[4]*ro[4])*(-(y[i, 1] - mu[4, i,1])*(y[i, 1] - mu[4, i,1])/Sigma[4, i,1, 1] - (y[i, 2] - mu[4,i, 2])*(y[i, 2] - mu[4, i,2])/Sigma[4,i, 2, 2] + 2.*ro[4]*(y[i, 1] - mu[4, i,1])*(y[i, 2] - mu[4, i,2])/sqrt (Sigma[4, i,1, 1])/sqrt (Sigma[4, i,2, 2]))));}
    }

    model {
     real ps[K];
     theta ~ dirichlet(rep_vector(2.0, 4));
     for(k in 1:K){
     v[k,1] ~ normal(0.0,4.1);// uniform(340/100,380/100);//
     v[k,2] ~  normal(0.0,4.1);//uniform(3160/100,3190/100);//
     Dif[k] ~ normal(0.5,0.2);//exponential(0.05);//beta(2,5);
     L[k] ~ lkj_corr_cholesky(2);// contain rho 
     con ~ exponential(lamba);
     }

     for (n in 1:N){
     for (k in 1:K){
      ps[k] = log(theta[k])+multi_normal_cholesky_lpdf(y[n] | mu[k,N], cov[k,N]); //increment log probability of the gaussian
     }
     target += log_sum_exp(ps);
     }
       for(i in 1:N){
       target +=   - lamba[i]*con[i]+log(lamba[i]);
      }
    }
    '''

    dat={'D':2,'K':4,'N':9,'y':x,'con':c,'s':s,'st':st}
    fit = pystan.stan(model_code=model,data=dat,iter=1000,warmup=500, chains=1,init_r=0.5)
    print(fit)

I'm not very proficient in C++ since I have been using python and the pystan module requires the code to be written in C++. I hope there is a way to multithread the number of iterations for the different cores on my Windows.

Solution

Since Stan is its own language, you can only achieve what the compiler is designed to parse and emit code for, which does not include support for arbitrary C++ code.

The backend of Stan does provide support for parallelization of single chains via MPI, but as you correctly note, this unfortunately doesn't extend to Windows at this time. Other than trying to compiling your own versions of the backends that somehow leverage the MPI libraries, there's not really anything you can do in the modeling language per se to remedy this.

If you are on a non-Windows platform, then you can start taking advantage of the new parallelized math library through a map-reduce operation map_rect. The User Guide (see 22. Map-Reduce, pp 237-244) has some details on using the method, with examples.

Note that you can still use this method on any platform with a v2.18+ Stan backend installed, but only on Linux and Mac OS X will MPI get used. I'm not sure what the plans are to roll out MPI support on Windows, but it might be worth keeping an eye on this MPI Parallelism wiki page.

Update: Possible Experimental Threading Support

It seems some have been playing around with compiling the Stan backend with the prerelease version of RTools, which includes version 8.x of g++ and that can enable threaded math libraries and MPI. Sounds like a rabbit hole, but if you want to try the red pill here it is.