As shown below, this balanced, one-dimensional data can be perfectly separated by sklearn's GaussianNB. Why does sklearn's ComplementNB classify every sample as zero for the same data?
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import ComplementNB
import numpy as np
N = 20
np.random.seed(9)
pos = np.random.uniform(size = N, low = 0.7, high = 0.8).reshape(-1, 1)
neg = np.random.uniform(size = N, low = 0.4, high = 0.5).reshape(-1, 1)
X = np.r_[pos, neg]
Y = np.array([1] * N + [0] * N)
gnb = GaussianNB()
cnb = ComplementNB()
gnb.fit(X,Y)
cnb.fit(X,Y)
# predict on the training data
print(gnb.predict(X))
print(cnb.predict(X))
The Gaussian Naive Bayes model is 100% correct. The Complement Naive Bayes model only predicts zeros. Why?
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Complement Naive Bayes loses its power when only one feature is available. Think of several books, each consisting of a single repeated word: the language models built for the positive and negative classes will each produce that word with probability one, so the feature carries no information.
To be more precise consider the weight calculations in the sklearn documentation:
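As I recall it from the scikit-learn user guide (worth checking against the current docs), the weights are computed as below. Here $d_{ij}$ is the value of feature $i$ in sample $j$, $\alpha_i$ is the smoothing parameter for feature $i$, and $\alpha = \sum_i \alpha_i$. The guide also describes an optional normalization step, which I leave out here.

```latex
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j : y_j \neq c} d_{ij}}
                         {\alpha + \sum_{j : y_j \neq c} \sum_{k} d_{kj}},
\qquad
w_{ci} = \log \hat{\theta}_{ci}
```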
When there is only one feature, the sum over k in the denominator has a single term, so the numerator and denominator are equal and theta is one. Its logarithm is zero, so the feature makes no contribution to the classification. You can confirm this by inspecting the fitted weights:
cnb.feature_log_prob_
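The weights are zero for both classes, which means the feature is weighted by zero. Every sample then gets the same score for both classes, and predict returns the first class, 0.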
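To make this concrete, here is a minimal sketch that rebuilds the data from the question, checks that the single-feature weights are zero, and then tries a workaround. The workaround (appending `1 - x` as a second, complementary feature) and the names `cnb_1d`, `cnb_2d` and `X_2d` are my own illustration, not something from the question or the sklearn docs. With two features the ratio inside the logarithm is no longer forced to one, so on this data the classes should become separable.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Same data as in the question: one feature, two balanced classes.
N = 20
np.random.seed(9)
pos = np.random.uniform(size=N, low=0.7, high=0.8).reshape(-1, 1)
neg = np.random.uniform(size=N, low=0.4, high=0.5).reshape(-1, 1)
X = np.r_[pos, neg]
Y = np.array([1] * N + [0] * N)

# One feature: theta is 1 for every class, so each weight is log(1) = 0.
cnb_1d = ComplementNB().fit(X, Y)
print(cnb_1d.feature_log_prob_)  # one row per class, one column per feature
print(np.allclose(cnb_1d.feature_log_prob_, 0.0))

# Hypothetical workaround (an assumption for illustration): add 1 - x as a
# second non-negative feature so the weights can differ between classes.
X_2d = np.c_[X, 1.0 - X]
cnb_2d = ComplementNB().fit(X_2d, Y)
print(cnb_2d.feature_log_prob_)
print((cnb_2d.predict(X_2d) == Y).mean())  # fraction of training points classified correctly
```

Whether a complementary feature is sensible depends on what the data means. ComplementNB is designed for count-like features such as word frequencies, so for a single continuous feature GaussianNB is probably the more natural choice anyway.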