linear-regressionbayesianpymc3pymc

Predictions of coupled PyMC linear bayesian models for new inputs


I would like to model two observables, O_2 and O_3 using bayesian linear regression: O_2 ~ m_2 * O_1 + q_2 and O_3 ~ m_3 * O_2 + q_3. I would like to couple them via O_2. I am using the PyMC framework in Python. My code looks like the following:

with pm.Model() as model:
    # Note: only O_1 is mutable
    O_1_data = pm.Data('O_1', data['O_1'], mutable=True)
    O_2_data = pm.Data('O_2', data['O_2'], mutable=False)
    O_3_data = pm.Data('O_3', data['O_3'], mutable=False)

    # Weak priors for regression coefficients
    q_2 = pm.Normal('q_2', mu=0, sigma=20)
    m_2 = pm.Normal('m_2', mu=0, sigma=20)
    sigma_2 = pm.HalfNormal('sigma_2', sigma=std_O2)

    q_3 = pm.Normal('q_3', mu=0, sigma=20)
    m_3 = pm.Normal('m_3', mu=0, sigma=20)
    sigma_3 = pm.HalfNormal('sigma_3', sigma=std_O3)

    # Likelihood for O_2
    mu_O2 = q_2 + m_2 * O_1_data
    O_2_obs = pm.Normal('O_2_obs', mu=mu_O2, sigma=sigma_2, observed=O_2_data)

    # Likelihood for O_3 using O_2 predictions
    mu_O3 = q_3 + m_3 * O_2_obs 
    O_3_obs = pm.Normal('O_3_obs', mu=mu_O3, sigma=sigma_3, observed=O_3_data)

    # Run MCMC simulation
    trace = pm.sample(2000, tune=1000, return_inferencedata=True, random_seed=1503)

Now I would like to sample posterior predictive samples for O_3, given a new values of O_1 (unseen during models fitting). Following the suggestions in this link, here is what I tried:

# Example new O1 test data point:
new_O_1_value = np.array([2.3])

# Generate posterior predictive samples for new O1 values
with model:
    # Update with new O1 values
    pm.set_data({'O_1': new_O_1_value})
    
    # Generate posterior predictive samples for O3
    posterior_predictive = pm.sample_posterior_predictive(trace, predictions=True, random_seed=1503)
    
    # Extract O3 posterior predictive samples
    O3_post_pred = posterior_predictive.predictions['O_3_obs']

Contrarily to my expectations, O3_post_pred has shape (# chains, # Monte Carlo samples, # rows in training data) (i.e in my case (4, 2000, 95), but I believe it should be (# chains, # Monte Carlo samples, # rows in new data) (i.e in my case (4, 2000, 1), since new_O_1_value has only one value.

Question: What am I doing wrong? Any help appreciated.


Solution

  • The issue was that in the definition of the likelihood of the two models for O_2 and O_3 I was not passing the shape argument, which if not explicitly set, it defaults to the shape of the data used to fit the models, even when passing a new input to predict on. The correct code looks like:

    O_2_obs = pm.Normal('O_2_obs', mu=mu_O2, sigma=sigma_2, observed=O_2_data, shape=O_1.shape) 
    ...
    O_3_obs = pm.Normal('O_3_obs', mu=mu_O3, sigma=sigma_3, observed=O_3_data, shape=O_2_obs.shape)
    

    all the rest being the same.