pythonstatisticspymc3arviz

Does pymc3 have an equivalent for the predict method in scikit-learn?


I'm going through the Statistical Rethinking course using PyMC3. At the end of Chapter 4, they ask for the HDI of individual values for data points that were not in the original (!Kung) dataset. Is it possible to do that in PyMC3?

In scikit-learn, you have fit() and predict(), and you're able to predict the output for completely new inputs.

With PyMC3, you can sample() to get your trace, and you can ask for a posterior predictive check, but I wasn't able to pass it any parameters for the values I'm interested in. I did manage to do it in a roundabout way using shared theano variables, and also manually.

edit: I added a pm.Data() and pm.set_data() example at the very end. I think this might be the answer, but I'm waiting for someone else to confirm before I mark it as Answered.


Here's what I did.

weight_s is the standardized weight data. Standardization is done with this function:

def standardize(array, reference=None):
    if reference is None:
        reference = array
    return (array - reference.mean()) / reference.std()

Here's the PyMC3 model:

import pymc3 as pm

with pm.Model() as m_adult:
    a = pm.Normal("α", mu=155, sd=20)
    b = pm.Lognormal("β", mu=0, sd=1)
    mu =  pm.Deterministic("μ", a + b * adults.weight_s)
    sigma = pm.Uniform("σ", 0, 50)
    height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
    trace_adult = pm.sample()

This is what the data looks like (note that the HDI is 89%):

height_pred = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]

fig, ax = plt.subplots()
ax.plot(adults.weight, adults.height, ".")
ax.plot(adults.weight, trace_adult.μ.mean(axis=0), color="black")
az.plot_hdi(adults.weight, trace_adult.μ, ax=ax, color="black")
az.plot_hdi(adults.weight, height_pred, ax=ax)
ax.set(xlabel="weight", ylabel="height")
fig.tight_layout()

The shape of the data

First, I'll show you the manual version:

missing_weights = np.array([45, 40, 65, 31, 53])

expected_height = np.array([
    (trace_adult.α + trace_adult.β * standardize(weight, adults.weight)).mean()
    for weight in missing_weights
])
hdis = np.array([
    az.hdi(np.random.normal(
        trace_adult.α + trace_adult.β * standardize(weight, adults.weight),
        trace_adult.σ,
    )) for weight in missing_weights
])

data = np.vstack((missing_weights, expected_height, hdis.T)).T
missing_df = pd.DataFrame(, columns=["weight", "expected_height", "hdi_lower", "hdi_upper"])
print(missing_df)

This gives us:

   weight  expected_height   hdi_lower   hdi_upper
0      45       154.603176  146.981285  163.149938
1      40       150.105295  142.095583  158.474277
2      65       172.594698  164.401102  180.786641
3      31       142.009110  134.163952  150.233028
4      53       161.799785  153.881956  170.209779

The numbers make sense if you look at the graph.

Now for the shared variable. We can modify the model like this:

from theano import shared

shared_weights_s = shared(adults.weight_s.values)
with pm.Model() as m_adult:
    a = pm.Normal("α", mu=155, sd=20)
    b = pm.Lognormal("β", mu=0, sd=1)
    mu =  pm.Deterministic("μ", a + b * shared_weights_s)
    sigma = pm.Uniform("σ", 0, 50)
    height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
    trace_adult = pm.sample()

Now, we have three choices for a new value of shared_weights:

For the one-by-one case:

missing_weights = np.array([45, 40, 65, 31, 53])

rows = []

for weight in missing_weights:
    row = [weight]
    shared_weights_s.set_value(standardize(np.array([weight]), adults.weight))
    height_pred_single = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
    row.append(height_pred_single.mean())
    row.extend(list(az.hdi(height_pred_single).mean(axis=0)))
    rows.append(row)

missing_df = pd.DataFrame(rows, columns=["weight", "expected_height", "hdi_lower", "hdi_upper"])
print(missing_df)

Doing it for all of them gives us:

   weight  expected_height   hdi_lower   hdi_upper
0      45       154.604520  146.485327  162.713345
1      40       150.113378  142.001151  158.263953
2      65       172.580212  164.357970  180.843184
3      31       142.010954  133.786200  150.142080
4      53       161.792962  153.651266  169.926615

You can do them all at once:

missing_weights = np.array([45, 40, 65, 31, 53])

shared_weights_s.set_value(standardize(missing_weights, adults.weight))
height_pred_replace = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_replace.mean(axis=0)
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_replace)
print(missing_df)

Which gives us:

   weight  expected_height   hdi_lower   hdi_upper
0      45       154.578096  147.066342  163.069805
1      40       150.042506  141.561599  158.120596
2      65       172.568430  164.079591  180.536870
3      31       142.080048  134.173959  150.345556
4      53       161.830472  153.327694  169.717058

Finally, we can add it to the end of the previous shared weight variable and take the tail:

missing_weights = np.array([45, 40, 65, 31, 53])

shared_weights_s.set_value(np.append(adults.weight_s.values, standardize(missing_weights, adults.weight)))
height_pred_append = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_append.mean(axis=0)[-len(missing_weights):]
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_append)[-len(missing_weights):]
print(missing_df)

Which gives us:

   weight  expected_height   hdi_lower   hdi_upper
0      45       154.640287  146.093825  162.477313
1      40       150.088713  142.168331  158.314038
2      65       172.633776  164.086280  180.483805
3      31       142.019331  133.516545  150.491937
4      53       161.880175  153.530868  169.771088

As you can see, all these methods end up giving the same results. Is there an official/best way of doing it? Can it be done without setting up the global shared variable and modifying it? Does PyMC3 have such a function, or is it something that might be added in the future? (I may be able to make a pull request for this if it's simple enough; I'm still new to PyMC3.)


edit: I think I found the answer: Use pm.Data().

with pm.Model() as m_adult:
    weight_s = pm.Data("weight_s", adults.weight_s.values)
    a = pm.Normal("α", mu=155, sd=20)
    b = pm.Lognormal("β", mu=0, sd=1)
    mu =  pm.Deterministic("μ", a + b * weight_s)
    sigma = pm.Uniform("σ", 0, 50)
    height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
    trace_adult = pm.sample()

Then, when trying things out, we pm.set_data():

missing_weights = np.array([45, 40, 65, 31, 53])

with m_adult:
    pm.set_data({"weight_s": standardize(missing_weights, adults.weight)})
    height_pred_data = pm.fast_sample_posterior_predictive(trace_adult)["height"]

missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_data.mean(axis=0)
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_data)
print(missing_df)

Which gives:

   weight  expected_height   hdi_lower   hdi_upper
0      45       154.584063  145.828088  162.512174
1      40       150.184853  142.272258  158.451555
2      65       172.662069  164.522903  180.803430
3      31       141.949137  133.310865  149.811098
4      53       161.719867  153.848599  169.638495

Solution

  • I think I found the answer: Use pm.Data().

    with pm.Model() as m_adult:
        weight_s = pm.Data("weight_s", adults.weight_s.values)
        a = pm.Normal("α", mu=155, sd=20)
        b = pm.Lognormal("β", mu=0, sd=1)
        mu =  pm.Deterministic("μ", a + b * weight_s)
        sigma = pm.Uniform("σ", 0, 50)
        height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
        trace_adult = pm.sample()
    

    Then, when trying things out, we pm.set_data():

    missing_weights = np.array([45, 40, 65, 31, 53])
    
    with m_adult:
        pm.set_data({"weight_s": standardize(missing_weights, adults.weight)})
        height_pred_data = pm.fast_sample_posterior_predictive(trace_adult)["height"]
    
    missing_df = pd.DataFrame(missing_weights, columns=["weight"])
    missing_df["expected_height"] = height_pred_data.mean(axis=0)
    missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_data)
    print(missing_df)
    

    Which gives:

       weight  expected_height   hdi_lower   hdi_upper
    0      45       154.584063  145.828088  162.512174
    1      40       150.184853  142.272258  158.451555
    2      65       172.662069  164.522903  180.803430
    3      31       141.949137  133.310865  149.811098
    4      53       161.719867  153.848599  169.638495