I have measured the diameter of 80 fruits last year, and after checking what is the best distribution of the values, I've created a PyMC3 model
with Model() as diam_model:
mu = Normal('mu',mu=57,sd=5.42)
sigma = Uniform('sigma',0,10)
after, as far as I understand, I've "trained" the model with my prior data (the 80 values)
with diam_model:
dist = Normal('dist',mu=mu,sd=sigma, observed=prior_data.values)
with diam_model:
samples=fit().sample(1000)
then I used the plot_posterior
of the samples
, returning also the mean and HPD.
My idea is to measure again this year using Bayesian update to reduce the sample size. How can I add single values, and update the posterior, expecting that the HPD gets smaller and smaller?
Using the other answer suggested as a duplicate, one could extract approximate versions of the priors using the code from this Jupyter notebook.
I'll assume we have data from the first round of sampling, which we can impose the mean 57.0 and standard deviation of 5.42.
import numpy as np
import pymc3 as pm
from sklearn.preprocessing import scale
from scipy import stats
# generate data forced to match distribution indicated
Y0 = 57.0 + scale(np.random.normal(size=80))*5.42
with pm.Model() as m0:
# let's place an informed, but broad prior on the mean
mu = pm.Normal('mu', mu=50, sd=10)
sigma = pm.Uniform('sigma', 0, 10)
y = pm.Normal('y', mu=mu, sd=sigma, observed=Y0)
trace0 = pm.sample(5000, tune=5000)
We can then use the results of this model to extract KDE posteriors on the parameters with the following code from the referenced notebook:
def from_posterior(param, samples, k=100):
smin, smax = np.min(samples), np.max(samples)
width = smax - smin
x = np.linspace(smin, smax, k)
y = stats.gaussian_kde(samples)(x)
# what was never sampled should have a small probability but not 0,
# so we'll extend the domain and use linear approximation of density on it
x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
y = np.concatenate([[0], y, [0]])
return pm.Interpolated(param, x, y)
Now, if we have more data we can run a new model using the KDE updated priors:
Y1 = np.random.normal(loc=57, scale=5.42, size=100)
with pm.Model() as m1:
mu = from_posterior('mu', trace0['mu'])
sigma = from_posterior('sigma', trace0['sigma'])
y = pm.Normal('y', mu=mu, sd=sigma, observed=Y1)
trace1 = pm.sample(5000, tune=5000)
And similarly, one could use this trace to extract updated posterior estimates for future rounds of incoming data.
The above methodology yields approximations to true updated priors and would be most useful in cases where conjugate priors are not possible. It should also be noted that I'm not sure the extent to which such KDE-base approximations introduce errors and how they propagate in a model when used repeatedly. It's a neat trick but one should be cautious about putting this into production without further validation of its robustness.
In particular, I would be very concerned about situations where the posterior distributions have strong correlation structures. The code provided here generates a "prior" distribution using only the marginals of each latent variable. This seems fine for simple models like this, and admittedly the initial priors also lack correlations, so it may not be a huge issue here. Generally, however, summarizing to marginals involves discarding information about how variables are related, and in other contexts this could be rather significant. For example, the default parameterization of a Beta distribution always leads to correlated parameters in the posterior and thus the above technique would inappropriate. Instead, one would need to infer a multi-dimensional KDE that incorporates all the latent variables.
However, in your particular case the expected distribution is Gaussian and those distributions have established closed-form conjugate models, i.e., principled solutions rather than approximations. I strongly recommend working through Kevin Murphy's Conjugate Bayesian analysis of the Gaussian distribution.
The Normal-Inverse Gamma model estimates both the mean and the variance of an observed normal random variable. The mean is modeled with a normal prior; the variance with an inverse gamma. This model uses four parameters for the prior:
mu_0 = prior mean
nu = number of observations used to estimate the mean
alpha = half the number of obs used to estimate variance
beta = half the sum of squared deviations
Given your initial model, we could use the values
mu_0 = 57.0
nu = 80
alpha = 40
beta = alpha*5.42**2
You can then plot the log-likelihood of the prior as follows:
# points to compute likelihood at
mu_grid, sd_grid = np.meshgrid(np.linspace(47, 67, 101),
np.linspace(4, 8, 101))
# normal ~ N(X | mu_0, sigma/sqrt(nu))
logN = stats.norm.logpdf(x=mu_grid, loc=mu_0, scale=sd_grid/np.sqrt(nu))
# inv-gamma ~ IG(sigma^2 | alpha, beta)
logIG = stats.invgamma.logpdf(x=sd_grid**2, a=alpha, scale=beta)
# full log-likelihood
logNIG = logN + logIG
# actually, we'll plot the -log(-log(likelihood)) to get nicer contour
plt.figure(figsize=(8,8))
plt.contourf(mu_grid, sd_grid, -np.log(-logNIG))
plt.xlabel("$\mu$")
plt.ylabel("$\sigma$")
plt.show()
Given new data, Y1
, one updates the parameters as follows:
# precompute some helpful values
n = Y1.shape[0]
mu_y = Y1.mean()
# updated NIG parameters
mu_n = (nu*mu_0 + n*mu_y)/(nu + n)
nu_n = nu + n
alpha_n = alpha + n/2
beta_n = beta + 0.5*np.square(Y1 - mu_y).sum() + 0.5*(n*nu/nu_n)*(mu_y - mu_0)**2
For the sake of illustrating change in the model, let's generate some data from a slightly different distribution and then plot the resulting posterior log-likelihood:
np.random.seed(53211277)
Y1 = np.random.normal(loc=62, scale=7.0, size=20)
which yields
Here, the 20 observations are not enough to completely move to the new location and scale I provided, but both parameters appear shifted in that direction.