rmissing-datastanbernoulli-probability

Bernoulli Prior in R STAN


I am fitting a logistic model in STAN (rstan library). My response variable does not have any missing values, however one of my covariates "HB" is binary and has missing entries.

Thus the goal is to impute at each iteration the missing entries in the binary vector, using a Bernoulli prior (with parameter, say, 0.5).

However, I am running into issues:

I used the guidelines provided in section 3.3 of the STAN user guide. With the model below, the parser gives me an error at the bernoulli assignment line (penultimate line in the model block), saying it needs integers. Note: I also tried defining HB_miss as a real in the parameters block and getting the same error.

m2 <- '
data {                          
int<lower=0> N;                // total number of observations
int<lower=0,upper=1> y[N];     // setting the dependent variable y as binary
vector[N] X;                   // independent variable 1

int<lower=0> N_obs; 
int<lower=0> N_miss; 
int<lower=1, upper=N> ii_obs[N_obs]; 
int<lower=1, upper=N> ii_miss[N_miss]; 

vector[N_obs] HB_obs;         // independent variable 2 (observed) 

}
parameters {
real b_0;                      // intercept
real b_X;                      // beta 1,2, ...
real b_HB;
vector[N_miss] HB_miss;
}
transformed parameters {
vector[N] HB;
HB[ii_obs] = HB_obs;
HB[ii_miss] = HB_miss;
}
model {
b_0 ~ normal(0,100);           
b_X ~ normal(0,100);           
b_HB ~ normal(0,100); 
HB_miss ~ bernoulli(0.5); // This is where the parser gives me an error
y ~ bernoulli_logit(b_0 + b_X * X + b_HB * HB); // model
}

Any ideas how I can assign a bernoulli prior to HB_miss effectively in STAN?


Solution

  • For the reasons you mentioned, it is not possible to treat missing discrete values as unknowns in a Stan program. All of the algorithms in Stan utilize gradients, and derivatives are not defined for discrete unknowns.

    Instead, you need to marginalize over the unknown values, which is not too tedious when everything is binary. Essentially, you can use the log_mix function whose arguments are:

    1. The probability the missing value is 1, which you say is 0.5 in your case
    2. The log-likelihood contribution for the observation in question if the missing value were 1
    3. The log-likelihood contribution for the observation in question if the missing value were 0

    So, it would be something like

    for (n in 1:N)
      target += log_mix(0.5, bernoulli_logit_lpmf(y[n] | b_0 + b_X * X[i] + b_HB),
                             bernoulli_logit_lpmf(y[n] | b_0 + b_X * X[i]));
    

    For more details, you could read this blog post.