
PyMC3: Different predictions for identical inputs


In PyMC3, single new observations passed via set_data() are currently not handled correctly by sample_posterior_predictive(), which in such cases predicts the training data instead (see #3640). As a workaround, I add a second artificial row, identical to the first, to my input data in order to bypass this behavior.

Now I have stumbled across something that I fail to make sense of: the predictions for the first and the second row are different. With a constant random_seed, I would have expected the two predictions to be identical. Can anyone please (i) confirm that this is intended behavior rather than a bug and, if so, (ii) explain why sample_posterior_predictive() produces different results for one and the same input data?

Here's a reproducible example based on the iris dataset, where petal width and length serve as predictor and response, respectively, and everything but the last row is used for training. The model is subsequently tested against the last row. pd.concat() is used to duplicate the first row of the test data frame to circumvent the above bug.

import seaborn as sns
import pymc3 as pm
import pandas as pd
import numpy as np


### . training ----

dat = sns.load_dataset('iris')
trn = dat.iloc[:-1]

with pm.Model() as model:
    # shared data container, so the predictor can be swapped out at test time
    s_data = pm.Data('s_data', trn['petal_width'])
    # GLM with petal width as the single predictor of petal length
    outcome = pm.glm.GLM(x = s_data, y = trn['petal_length'], labels = 'petal_width')
    trace = pm.sample(500, cores = 1, random_seed = 1899)


### . testing ----

tst = dat.iloc[-1:]
# duplicate the single test row to work around the set_data() bug (#3640)
tst = pd.concat([tst, tst], axis = 0, ignore_index = True)

with model:
    # swap in the test predictor values and draw posterior predictive samples
    pm.set_data({'s_data': tst['petal_width']})
    ppc = pm.sample_posterior_predictive(trace, random_seed = 1900)

np.mean(ppc['y'], axis = 0)
# array([5.09585088, 5.08377112]) # mean predicted value for [first, second] row

Solution

  • I don't think it's a bug, and I also don't find it troubling. Since PyMC3 doesn't check whether the points being predicted are identical, it treats them separately, and each one gets its own random draw from the model. While every PPC draw (each row of ppc['y']) uses the same parameter settings for the GLM, taken from the trace, the model itself is still stochastic (i.e., there is always measurement error), so two identical inputs yield two independent noisy outcomes. I think this explains the difference.

    If you increase the number of draws in the PPC, you will see that the difference between the two means shrinks, which is consistent with this being nothing more than sampling variation; the short check below illustrates this.
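
    As a rough check, here is a minimal sketch reusing model, trace, and ppc from the question. The samples argument of sample_posterior_predictive() sets the number of posterior-predictive draws in PyMC3; 20000 is an arbitrary choice for illustration.

    # Within one PPC draw the two points share the same GLM parameters,
    # but each gets an independent noise draw, so the per-draw difference
    # is non-zero yet centred on zero.
    diff = ppc['y'][:, 0] - ppc['y'][:, 1]
    print(diff.mean(), diff.std())

    # With more draws, the Monte Carlo error of each column mean shrinks
    # roughly like 1 / sqrt(n_draws), so the two means move closer together.
    with model:
        ppc_big = pm.sample_posterior_predictive(trace, samples = 20000,
                                                 random_seed = 1900)
    print(np.mean(ppc_big['y'], axis = 0))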