Tags: r, bayesian, mcmc, stan, rstan

Estimating parameters using Stan when the distribution of the response variable in a regression is non-normal


I am using R + Stan to obtain Bayesian estimates of model parameters when the distribution of the response variable in a regression is not normal, but rather some custom distribution, as below.

Say I have the following data-generating process:

x <- rnorm(100, 0, 0.5)
noise <- rnorm(100, 0, 1)
y <- exp(10 * x + noise) / (1 + exp(10 * x + noise))

data <- list(x = x, y = y, N = length(x))

In Stan, I am creating the model object below:

Model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    for (f in 1:N) {
        mu[f] = alpha + beta * x[f];
    }
}

model {
  sigma ~ chi_square(5);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  y ~ ???;
}
"

However, as you can see, the question remains: what is the right continuous Stan distribution for y in the model block?

Any pointers would be highly appreciated.

Thanks for your time.


Solution

  • The problem is not so much that the distribution of the errors isn't normal (which is the assumption in an ordinary linear regression), but that the relationship between x and y is clearly not linear. You DO have a linear relationship with normally distributed noise (z = 10 * x + noise, where I use z to avoid confusion with your y), but you then pass z through the logistic function, i.e. the inverse logit: y = inv_logit(z). If you want to model this with a linear regression, you need to undo that transformation (i.e. recover z from y), which you do with the logit function, since the logit is the inverse of the inverse logit. Then you can do a standard linear regression on z.
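
    As a quick sanity check, note that in base R qlogis() is the logit and plogis() is the inverse logit, so the logit of y recovers the linear predictor exactly:

    # logit(y) gives back the linear predictor z = 10 * x + noise
    z <- 10 * x + noise
    all.equal(qlogis(y), z)  # TRUE, up to floating-point error

    With that confirmed, the Stan model is: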

    model = "
    data {
      int<lower=0> N;
      vector[N] x;
      vector[N] y;
    }
    
    transformed data {
      // undo the inverse-logit transform, so there's a linear relationship between x and z
      vector[N] z = logit(y);
    }
    
    parameters {
      real alpha;
      real beta;
      real<lower=0> sigma;
    }
    
    transformed parameters {
        vector[N] mu;
        // no need to loop here, this can be vectorized
        mu = alpha + beta * x;
    }
    
    model {
      sigma ~ chi_square(5);
      alpha ~ normal(0, 1);
      beta ~ normal(0, 1);
      z ~ normal(mu, sigma);
    }
    
    generated quantities {
     // to check predictions, predict on the linear scale
     // and then apply the inverse logit to get back to the y scale
     vector[N] y_pred = inv_logit(alpha + beta * x);
    }
    "
    

    If you won't use mu outside the model block, you can skip the transformed parameters block and compute the linear predictor directly where it's needed:

    z ~ normal(alpha + beta * x, sigma);
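
    The model block then reads:

    model {
      sigma ~ chi_square(5);
      alpha ~ normal(0, 1);
      beta ~ normal(0, 1);
      z ~ normal(alpha + beta * x, sigma);
    }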
    

    On a side note: you might want to reconsider your priors. The true values of alpha and beta here are 0 and 10 respectively. The likelihood is precise enough to largely overwhelm the prior, but you will probably still see some shrinkage toward zero for beta (i.e. you might get 9 instead of 10). Try something like normal(0, 10) instead. Also, a chi-squared distribution is an unusual prior for a standard deviation; a half-normal, exponential, or half-Cauchy prior is more typical.
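
    Because sigma is declared with <lower=0>, Stan automatically truncates its prior to the positive half-line, so a plain normal statement gives a half-normal prior. For example:

    sigma ~ normal(0, 1);   // effectively half-normal, since sigma is constrained positive
    alpha ~ normal(0, 10);
    beta ~ normal(0, 10);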