I'm going through the Statistical Rethinking course using PyMC3. At the end of Chapter 4, they ask for the HDI of individual values for data points that were not in the original (!Kung) dataset. Is it possible to do that in PyMC3?
In scikit-learn, you have fit()
and predict()
, and you're able to predict the output for completely new inputs.
With PyMC3, you can sample()
to get your trace, and you can ask for a posterior predictive check, but I wasn't able to pass it any parameters for the values I'm interested in. I did manage to do it in a roundabout way using shared theano variables, and also manually.
edit: I added a pm.Data()
and pm.set_data()
example at the very end. I think this might be the answer, but I'm waiting for someone else to confirm before I mark it as Answered.
Here's what I did.
weight_s
is the standardized weight data. Standardization is done with this function:
def standardize(array, reference=None):
if reference is None:
reference = array
return (array - reference.mean()) / reference.std()
Here's the PyMC3 model:
import pymc3 as pm
with pm.Model() as m_adult:
a = pm.Normal("α", mu=155, sd=20)
b = pm.Lognormal("β", mu=0, sd=1)
mu = pm.Deterministic("μ", a + b * adults.weight_s)
sigma = pm.Uniform("σ", 0, 50)
height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
trace_adult = pm.sample()
This is what the data looks like (note that the HDI is 89%):
height_pred = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
fig, ax = plt.subplots()
ax.plot(adults.weight, adults.height, ".")
ax.plot(adults.weight, trace_adult.μ.mean(axis=0), color="black")
az.plot_hdi(adults.weight, trace_adult.μ, ax=ax, color="black")
az.plot_hdi(adults.weight, height_pred, ax=ax)
ax.set(xlabel="weight", ylabel="height")
fig.tight_layout()
First, I'll show you the manual version:
missing_weights = np.array([45, 40, 65, 31, 53])
expected_height = np.array([
(trace_adult.α + trace_adult.β * standardize(weight, adults.weight)).mean()
for weight in missing_weights
])
hdis = np.array([
az.hdi(np.random.normal(
trace_adult.α + trace_adult.β * standardize(weight, adults.weight),
trace_adult.σ,
)) for weight in missing_weights
])
data = np.vstack((missing_weights, expected_height, hdis.T)).T
missing_df = pd.DataFrame(, columns=["weight", "expected_height", "hdi_lower", "hdi_upper"])
print(missing_df)
This gives us:
weight expected_height hdi_lower hdi_upper
0 45 154.603176 146.981285 163.149938
1 40 150.105295 142.095583 158.474277
2 65 172.594698 164.401102 180.786641
3 31 142.009110 134.163952 150.233028
4 53 161.799785 153.881956 170.209779
The numbers make sense if you look at the graph.
Now for the shared variable. We can modify the model like this:
from theano import shared
shared_weights_s = shared(adults.weight_s.values)
with pm.Model() as m_adult:
a = pm.Normal("α", mu=155, sd=20)
b = pm.Lognormal("β", mu=0, sd=1)
mu = pm.Deterministic("μ", a + b * shared_weights_s)
sigma = pm.Uniform("σ", 0, 50)
height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
trace_adult = pm.sample()
Now, we have three choices for a new value of shared_weights:
For the one-by-one case:
missing_weights = np.array([45, 40, 65, 31, 53])
rows = []
for weight in missing_weights:
row = [weight]
shared_weights_s.set_value(standardize(np.array([weight]), adults.weight))
height_pred_single = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
row.append(height_pred_single.mean())
row.extend(list(az.hdi(height_pred_single).mean(axis=0)))
rows.append(row)
missing_df = pd.DataFrame(rows, columns=["weight", "expected_height", "hdi_lower", "hdi_upper"])
print(missing_df)
Doing it for all of them gives us:
weight expected_height hdi_lower hdi_upper
0 45 154.604520 146.485327 162.713345
1 40 150.113378 142.001151 158.263953
2 65 172.580212 164.357970 180.843184
3 31 142.010954 133.786200 150.142080
4 53 161.792962 153.651266 169.926615
You can do them all at once:
missing_weights = np.array([45, 40, 65, 31, 53])
shared_weights_s.set_value(standardize(missing_weights, adults.weight))
height_pred_replace = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_replace.mean(axis=0)
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_replace)
print(missing_df)
Which gives us:
weight expected_height hdi_lower hdi_upper
0 45 154.578096 147.066342 163.069805
1 40 150.042506 141.561599 158.120596
2 65 172.568430 164.079591 180.536870
3 31 142.080048 134.173959 150.345556
4 53 161.830472 153.327694 169.717058
Finally, we can add it to the end of the previous shared weight variable and take the tail:
missing_weights = np.array([45, 40, 65, 31, 53])
shared_weights_s.set_value(np.append(adults.weight_s.values, standardize(missing_weights, adults.weight)))
height_pred_append = pm.fast_sample_posterior_predictive(trace_adult, model=m_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_append.mean(axis=0)[-len(missing_weights):]
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_append)[-len(missing_weights):]
print(missing_df)
Which gives us:
weight expected_height hdi_lower hdi_upper
0 45 154.640287 146.093825 162.477313
1 40 150.088713 142.168331 158.314038
2 65 172.633776 164.086280 180.483805
3 31 142.019331 133.516545 150.491937
4 53 161.880175 153.530868 169.771088
As you can see, all these methods end up giving the same results. Is there an official/best way of doing it? Can it be done without setting up the global shared variable and modifying it? Does PyMC3 have such a function, or is it something that might be added in the future? (I may be able to make a pull request for this if it's simple enough; I'm still new to PyMC3.)
edit: I think I found the answer: Use pm.Data()
.
with pm.Model() as m_adult:
weight_s = pm.Data("weight_s", adults.weight_s.values)
a = pm.Normal("α", mu=155, sd=20)
b = pm.Lognormal("β", mu=0, sd=1)
mu = pm.Deterministic("μ", a + b * weight_s)
sigma = pm.Uniform("σ", 0, 50)
height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
trace_adult = pm.sample()
Then, when trying things out, we pm.set_data()
:
missing_weights = np.array([45, 40, 65, 31, 53])
with m_adult:
pm.set_data({"weight_s": standardize(missing_weights, adults.weight)})
height_pred_data = pm.fast_sample_posterior_predictive(trace_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_data.mean(axis=0)
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_data)
print(missing_df)
Which gives:
weight expected_height hdi_lower hdi_upper
0 45 154.584063 145.828088 162.512174
1 40 150.184853 142.272258 158.451555
2 65 172.662069 164.522903 180.803430
3 31 141.949137 133.310865 149.811098
4 53 161.719867 153.848599 169.638495
I think I found the answer: Use pm.Data()
.
with pm.Model() as m_adult:
weight_s = pm.Data("weight_s", adults.weight_s.values)
a = pm.Normal("α", mu=155, sd=20)
b = pm.Lognormal("β", mu=0, sd=1)
mu = pm.Deterministic("μ", a + b * weight_s)
sigma = pm.Uniform("σ", 0, 50)
height = pm.Normal("height", mu=mu, sd=sigma, observed=adults.height)
trace_adult = pm.sample()
Then, when trying things out, we pm.set_data()
:
missing_weights = np.array([45, 40, 65, 31, 53])
with m_adult:
pm.set_data({"weight_s": standardize(missing_weights, adults.weight)})
height_pred_data = pm.fast_sample_posterior_predictive(trace_adult)["height"]
missing_df = pd.DataFrame(missing_weights, columns=["weight"])
missing_df["expected_height"] = height_pred_data.mean(axis=0)
missing_df[["hdi_lower", "hdi_upper"]] = az.hdi(height_pred_data)
print(missing_df)
Which gives:
weight expected_height hdi_lower hdi_upper
0 45 154.584063 145.828088 162.512174
1 40 150.184853 142.272258 158.451555
2 65 172.662069 164.522903 180.803430
3 31 141.949137 133.310865 149.811098
4 53 161.719867 153.848599 169.638495