I am trying to understand the differences between the two methods bayes and mle in the bn.fit function of the package bnlearn.
I know about the debate between the frequentist and the Bayesian approaches to interpreting probabilities. On a theoretical level I suppose the maximum likelihood estimate mle is a simple frequentist approach that sets the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I have already checked the bnlearn documentation, the description of the bn.fit function and some application examples, but nowhere is there a real description of what is happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, which led me to bnlearn:::bn.fit.backend and then to bnlearn:::smartSapply, but there I got stuck.
Any help would be much appreciated, as I use the package for academic work and therefore need to be able to explain what is happening.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx is the number of samples with X = x, and cptsize is the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line

    tab = tab + extra.args$iss/prod(dim(tab))
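To make this concrete, here is a minimal R sketch (the node name, the data and the iss value are just illustrative) that fits a single binary root node with method = "bayes" and then reconstructs the same estimate by hand from the formula above:

    library(bnlearn)

    # toy data: a single binary variable A, so A is a root node with no parents
    set.seed(1)
    dat <- data.frame(A = factor(sample(c("yes", "no"), 20, replace = TRUE,
                                        prob = c(0.7, 0.3))))

    net <- empty.graph("A")   # graph containing only the node A
    iss <- 4                  # imaginary sample size, chosen arbitrarily

    fit <- bn.fit(net, dat, method = "bayes", iss = iss)
    fit$A                     # CPT estimated by bnlearn

    # manual reconstruction: (Nx + iss / cptsize) / (N + iss), with cptsize = 2
    N  <- nrow(dat)
    Nx <- table(dat$A)
    (Nx + iss / 2) / (N + iss)

The two printed sets of probabilities should match.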
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting the maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss is with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
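As a quick arithmetic illustration of that convergence (the true proportion and the iss value below are made up), you can plug growing sample sizes into the formula above:

    # fixed iss, growing N: the Bayesian estimate approaches the MLE Nx / N
    iss    <- 10
    p_true <- 0.7
    for (N in c(10, 100, 1000, 10000)) {
      Nx    <- round(p_true * N)               # hypothetical count of X = x
      mle   <- Nx / N
      bayes <- (Nx + iss / 2) / (N + iss)
      cat(sprintf("N = %6d   mle = %.4f   bayes = %.4f\n", N, mle, bayes))
    }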
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, which correspond to combinations that were not observed in the data. Such zero entries can result in a network that generalizes poorly; this was a problem with some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.