I have to implement a naive Bayes classifier for classifying documents into classes. To get the conditional probability of a term given a class, with Laplace smoothing, we have:
prob(t | c) = (Num(occurrences of t in docs of class c) + 1) / (Num(documents in class c) + |V|)
It's a Bernoulli model, so each feature is either 1 or 0, and the vocabulary is really large, perhaps 20,000 words or so. So won't the Laplace smoothing give really small values due to the large size of the vocabulary, or am I doing something wrong?
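To make the worry concrete, here are the magnitudes with some made-up counts (only |V| = 20000 is from my actual setup, the other numbers are invented):

```python
V = 20000        # vocabulary size
n_c = 500        # documents in class c (invented)
count_t = 40     # occurrences of t in class-c docs (invented)

p = (count_t + 1) / (n_c + V)   # the smoothed estimate above
print(p)                        # ~0.002 -- |V| dominates the denominator
```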
According to the pseudocode from this link: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html, for the Bernoulli model we just add 2 instead of |V|. Why so?
Consider the case of multinomial naive Bayes. The smoothing you defined above is such that you can never get a zero probability.
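To see what that buys you: without the +1, a term never seen in class c makes the whole product collapse to zero. A toy illustration (counts invented here):

```python
counts = {"good": 3, "movie": 2}   # toy term counts for class c
total = sum(counts.values())
V = 20000

p_raw = counts.get("terrible", 0) / total                 # 0.0 -- zeroes the product
p_smooth = (counts.get("terrible", 0) + 1) / (total + V)  # tiny, but nonzero

print(p_raw, p_smooth)
```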
With the multivariate/Bernoulli case, there is an additional constraint: probabilities of exactly 1 are not allowed either. This is because when some t from the known vocabulary is not present in the document d, a probability of 1 - prob(t | c) is multiplied into the document probability. If prob(t | c) is 1, then once again this is going to produce a posterior probability of 0. (Likewise, when using logs instead, log(1 - prob(t | c)) is undefined when the probability is 1.)
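A minimal sketch of that computation (the function and variable names here are mine, not from the linked chapter) shows where an estimate of exactly 1 breaks things:

```python
import math

def log_likelihood(doc_terms, vocab, p_t_given_c):
    # Bernoulli model: every vocabulary term contributes a factor,
    # whether or not it actually appears in the document.
    total = 0.0
    for t in vocab:
        p = p_t_given_c[t]
        if t in doc_terms:
            total += math.log(p)        # math domain error if p == 0
        else:
            total += math.log(1.0 - p)  # math domain error if p == 1
    return total

vocab = {"good", "bad", "movie"}
p = {"good": 0.6, "bad": 0.1, "movie": 1.0}  # "movie" estimated at exactly 1
# "movie" is absent from this document, so this would hit log(1 - 1.0) = log(0):
# log_likelihood({"good"}, vocab, p)   # ValueError: math domain error
```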
So in the Bernoulli equation (Nct + 1) / (Nc + 2), both cases are protected against: the estimate can reach neither 0 nor 1. If Nct == Nc (every class-c document contains t), the probability is (Nc + 1) / (Nc + 2) rather than 1. And in the degenerate case Nc == 0, the estimate falls back to 1/2, producing a likelihood of 1/2 regardless of whether t is present (P(t | c) == 1/2) or absent (1 - P(t | c) == 1/2).
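Those bounds are easy to check numerically (a quick sketch, not from the book):

```python
def smoothed_p(n_ct, n_c):
    # Denominator is 2 because each term is a binary event
    # (present/absent) -- two outcomes, not |V| of them.
    return (n_ct + 1) / (n_c + 2)

print(smoothed_p(0, 100))    # 0.0098... never exactly 0
print(smoothed_p(100, 100))  # 0.9901... never exactly 1
print(smoothed_p(0, 0))      # 0.5       no class data -> neutral estimate
```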