I am trying to fit a linear model to data using Bayesian inference technique. For this, I thought of using PyMC. Naturally, after training a model, I want to test its performance on new data and that's where the problem occurs. I don't seem to be able to set a new dataset. Anybody with experience?
Example script that shows the error:
import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
def run_model():
# Generate synthetic data
np.random.seed(42)
x = np.linspace(0, 10, 100)
a_true = 2.5 # True slope
b_true = 1.0 # True intercept
y_true = a_true * x + b_true
y = y_true + np.random.normal(0, 1, size=x.size) # Add some noise
# Split into training and test sets
x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]
# Define and fit the model
with pm.Model() as linear_model:
# Define x as a pm.Data variable to allow updating with pm.set_data
x_shared = pm.Data("x", x_train)
# Priors for slope and intercept
a = pm.Normal("a", mu=0, sigma=10)
b = pm.Normal("b", mu=0, sigma=10)
sigma = pm.HalfNormal("sigma", sigma=1)
# Expected value of y
mu = a * x_shared + b
# Likelihood
y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_train)
# Sample from the posterior
trace = pm.sample(1000, tune=1000, return_inferencedata=True, chains=1)
# Predict on training data
with linear_model:
pm.set_data({"x": x_train}) # Update data to training
post_pred_train = pm.sample_posterior_predictive(trace)
# Predict on test data
with linear_model:
pm.set_data({"x": x_test}) # Update data to testing
post_pred_test = pm.sample_posterior_predictive(trace)
# Plot results
plt.figure(figsize=(10, 5))
# Plot training data
plt.scatter(x_train, y_train, c="blue", label="Training data")
plt.plot(x_train, y_true[:80], "k--", label="True function")
# Plot posterior predictive for training data
plt.plot(
x_train,
post_pred_train["y_obs"].mean(axis=0),
label="Posterior predictive (train)",
color="red",
)
# Plot test data
plt.scatter(x_test, y_test, c="green", label="Test data")
# Plot posterior predictive for test data
plt.plot(
x_test,
post_pred_test["y_obs"].mean(axis=0),
label="Posterior predictive (test)",
color="orange",
)
plt.legend()
plt.xlabel("x")
plt.ylabel("y")
plt.title("Bayesian Linear Regression with PyMC")
plt.show()
# Summary of the model parameters
print(az.summary(trace, var_names=["a", "b", "sigma"]))
# Only execute if run as the main module
if __name__ == '__main__':
run_model()
Error that results:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (80,) and arg 1 with shape (20,).
Apply node that caused the error: normal_rv{"(),()->()"}(RNG(<Generator(PCG64) at 0x1F23323B5A0>), [80], Composite{((i0 * i1) + i2)}.0, ExpandDims{axis=0}.0)
Toposort index: 4
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(1,)), TensorType(float64, shape=(None,)), TensorType(float64, shape=(1,))]
Inputs shapes: ['No shapes', (1,), (20,), (1,)]
Inputs strides: ['No strides', (8,), (8,), (8,)]
Inputs values: [Generator(PCG64) at 0x1F23323B5A0, array([80], dtype=int64), 'not shown', array([0.97974278])]
Outputs clients: [[output[1](normal_rv{"(),()->()"}.0)], [output[0](y_obs)]]
This blog post allowed me to figure out the answer, so credit mostly goes there and I would recommended you read it in full.
The key point is that you need to include the shape
argument in your definition of y_obs
:
y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, shape=x_shared.shape, observed=y_train)
Note the shape=x_shared.shape
!
I also saw in multiple examples (the blog and the pymc docs) that you should do
post_pred_test = pm.sample_posterior_predictive(trace, predictions=True)
i.e. set predictions=True
- not sure if this makes a big difference but it seems like the right thing to do...
Your code has some other issues, at least with my version of pymc:
You cannot do post_pred_train["y_obs"].mean(axis=0),
but I got it working with
plt.plot(
x_train,
post_pred_train.posterior_predictive["y_obs"].mean(("chain", "draw")),
label="Posterior predictive (train)",
color="red",
)
and similarly (but confusingly slightly different), for the test data:
plt.plot(
x_test,
post_pred_test.predictions["y_obs"].mean(("chain", "draw")),
label="Posterior predictive (test)",
color="orange",
)
Note: .predictions
here instead of .posterior_predictive
And in both cases, I needed to take the mean across chain and draw