I was working on a simple Bayesian linear regression using PyMC3 in Python. While defining the likelihood function, I came across this syntax:
likelihood = pm.Normal('Y', mu=intercept + x_coeff * df['x'], sd=sigma, observed=df['y'])
In the parameters for `pm.Normal()`, what does the `observed=` argument do? Please explain with examples if possible.
`observed` means that the value of the linear regression's response variable (typically named `y`, but here confusingly named `likelihood`) is known, through observation, to be equal to `df['y']`.
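To make the distinction concrete, here is a minimal sketch (with made-up data) contrasting a free variable with an observed one:

```python
import numpy as np
import pymc3 as pm

data = np.array([0.9, 1.1, 1.3])  # made-up observations

with pm.Model() as toy_model:
    # Free variable: its value is unknown, so the sampler will explore it.
    mu = pm.Normal('mu', mu=0, sd=10)
    # Observed variable: its value is pinned to `data` by observed=,
    # so it is never sampled; it only contributes a likelihood term.
    y = pm.Normal('y', mu=mu, sd=1, observed=data)
```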
When the inference algorithm is run, the values in `df['y']` will be used to determine the likely values of the stochastic variables `intercept` and `x_coeff` that would have caused them. To do that, it uses the causal relationship between them, namely that the observed variable is normally distributed with mean `intercept + x_coeff * df['x']` and standard deviation `sigma`.
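Putting this together, here is a sketch of the full model from the question, using synthetic data so the true parameter values are known (the priors and the data-generating numbers below are illustrative choices, not anything prescribed by PyMC3):

```python
import numpy as np
import pandas as pd
import pymc3 as pm

# Synthetic data with known ground truth: y = 1.0 + 2.0 * x + noise
rng = np.random.default_rng(42)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 1.0 + 2.0 * df['x'] + rng.normal(scale=0.5, size=100)

with pm.Model() as model:
    # Priors on the unknowns the sampler will infer
    intercept = pm.Normal('intercept', mu=0, sd=10)
    x_coeff = pm.Normal('x_coeff', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Observed variable: conditions the model on df['y']
    likelihood = pm.Normal('Y', mu=intercept + x_coeff * df['x'],
                           sd=sigma, observed=df['y'])

    trace = pm.sample(1000, tune=1000)

# The posteriors for intercept and x_coeff should concentrate
# near the true values 1.0 and 2.0.
print(pm.summary(trace))
```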
Note that `df['y']` is typically an array with multiple observations, so the algorithm will try to infer the distributions of `intercept` and `x_coeff` likely to have induced all of these observations.
Note that the algorithm will not infer values for `df['x']`, since that is also fixed, observed data.
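You can see this directly in the trace from the sketch above: only the free variables are sampled, while the observed `'Y'` and the plain data `df['x']` never appear (the exact names may differ slightly by PyMC3 version):

```python
# Continuing the sketch above: only free variables show up as sampled.
print(trace.varnames)
# e.g. ['intercept', 'x_coeff', 'sigma_log__', 'sigma']
```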
I mentioned that the variable was confusingly named `likelihood` instead of `y`. That is because `pm.Normal` creates a stochastic variable object, not a real-valued likelihood. I believe this name was chosen by tradition, because the observed values define a likelihood that is used internally by the inference algorithm to infer the distributions of the other stochastic variables.
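Continuing the sketch above, you can check this yourself: the `likelihood` name is bound to a random variable object, while the actual log probability (including the term the observed values contribute) lives on the model. The attribute names below are PyMC3-specific:

```python
# `likelihood` is an observed random variable object, not a number.
print(type(likelihood))

# The model tracks which variables are observed, and the joint
# log density can be evaluated at a point, e.g. the test point:
print(model.observed_RVs)            # [Y]
print(model.logp(model.test_point))  # a single log-density value
```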
In fact, the PyMC introduction shows a similar definition using the name `Y_obs` instead:
Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)