This is something of a follow-up to this thread, where I was getting erroneous results with the GaussianNB classifier, which turned out to be because I had scikit-learn v0.10 on the Linux VM I was doing experiments on. I ended up using the Bernoulli and Multinomial NB classifiers instead, but when I (finally) got SciPy installed on my MacBook, the scikit-learn version I grabbed was 0.13, the latest as of this writing. I was then presented with a new problem:
Does anyone know what changed between the versions? I had a look at the repo history but didn't see anything that would account for this kind of change in accuracy. Since I'm getting really good results with BernoulliNB under 0.10, I'd obviously like to keep them, but I'm hesitant to do so without a better understanding of why the two versions disagree.
I've tried setting the (newer) class_prior parameter, but that didn't change the results under 0.13.
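For reference, this is roughly how I was setting it; the feature matrix and labels below are placeholders standing in for my actual data:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    # Placeholder binary features and labels (0 = F, 1 = M) standing in for my data
    X = np.random.randint(0, 2, size=(200, 50))
    y = np.random.randint(0, 2, size=200)

    # Force uniform priors instead of the empirical, male-heavy distribution
    clf = BernoulliNB(alpha=1.0, class_prior=[0.5, 0.5])
    clf.fit(X, y)
    print(clf.class_log_prior_)  # the log priors the classifier will use at predict time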
Edit: short of coming up with a worked example (which I will, well, work on), the 0.13 predictions are heavily biased toward one class, which is not something I would expect from a Bayesian classifier, and it leads me to believe there may have been a regression in the class prior calculation, though I haven't tracked it down yet. For example, here are the two confusion matrices (rows are the true classes, columns the predicted):
0.10:

    T\P    F    M
    F    120   18
    M     19  175

0.13:

    T\P    F    M
    F    119   19
    M     59  135
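For what it's worth, those tables are just confusion matrices over my held-out set, computed along these lines (toy labels here in place of my real ones):

    from sklearn.metrics import confusion_matrix

    # Toy true/predicted labels, 0 = F, 1 = M; rows of the result are the true
    # classes and columns the predicted classes, i.e. the T\P layout above
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 1, 0, 1, 1, 0]
    print(confusion_matrix(y_true, y_pred))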
Edit 2:
I worked through a few examples by hand. The 0.13 version is definitely correct and the 0.10 version definitely is not, which is what I both suspected and feared. The error in 0.10 appears to be in the class prior calculation: the _count function is bugged, and on this line of the file the class counts are simply wrong. Compare to the 0.13 branch, ignoring that the two branches pull in the smoothing factors at different places.
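To make it concrete, as I understand it the class counts coming out of _count feed directly into the smoothed per-class feature probabilities, roughly like this (my own sketch of the math, not the library code):

    import numpy as np

    def bernoulli_feature_log_prob(X, y, alpha=1.0):
        # Sketch of the smoothed P(x_i = 1 | c) that Bernoulli NB builds from its
        # counts; this mirrors my understanding of what _count/fit compute and is
        # not the scikit-learn source.
        log_probs = []
        for c in np.unique(y):
            Xc = X[y == c]
            feature_count = Xc.sum(axis=0)   # how often each feature fires in class c
            class_count = Xc.shape[0]        # number of class-c documents
            # If class_count is wrong (the 0.10 bug), every probability below is skewed
            prob = (feature_count + alpha) / (class_count + 2.0 * alpha)
            log_probs.append(np.log(prob))
        return np.array(log_probs)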
I still have to think about why the botched feature counts yield such good performance on my data, and I'm still a little unsure why setting the class priors didn't help. Perhaps the bug is effectively penalizing the male bias already present in the source documents?
Edit 3:
I believe that is exactly what it is doing. The _count function, and consequently the calculation of the feature priors within fit, does not take the class_prior parameter into account, so while the class priors are applied within predict, they are not used to build the model during training. Not sure if this is intentional: would you ever want to apply priors at testing time that were ignored when building the model?
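A sketch of why that matters: the prior is just an additive term in the decision rule at predict time, while the per-class feature probabilities learned in fit are unaffected by it (again my own summary of the formula, not the library source):

    import numpy as np

    def bernoulli_decision(x, feature_log_prob, class_log_prior):
        # Decision rule for a single binary sample x:
        #   argmax_c [ log P(c) + sum_i ( x_i*log P(x_i=1|c) + (1-x_i)*log P(x_i=0|c) ) ]
        # class_log_prior is the only place class_prior enters, i.e. prediction time;
        # feature_log_prob comes entirely from the training counts.
        neg = np.log(1.0 - np.exp(feature_log_prob))
        jll = class_log_prior + np.dot(x, (feature_log_prob - neg).T) + neg.sum(axis=1)
        return int(np.argmax(jll))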
To sum up my results: the bug was in the 0.10 version of the BernoulliNB classifier, which was skewing the class counts when calculating the feature priors and apparently biasing the resulting model in a way that happened to yield superior results on my data. I managed to adapt pieces of what it was doing and eventually got equivalent performance from the (correct) MultinomialNB in version 0.13.
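The rough shape of what I ended up with, assuming the adaptation boils down to an explicit class_prior (the prior values and data below are illustrative placeholders, not the numbers I actually settled on):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Placeholder count features and labels standing in for my documents
    X = np.random.randint(0, 5, size=(200, 50))
    y = np.random.randint(0, 2, size=200)

    # Nudging the prior away from the male-heavy empirical distribution; the
    # 0.6/0.4 split here is just a placeholder, not my tuned value
    clf = MultinomialNB(alpha=1.0, class_prior=[0.6, 0.4])
    clf.fit(X, y)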