In NLP when we use Laplace(Add-one) smoothing technique we assume that the every word is seen one more time than the actual count and the formula is like this
where V is the size of the vocabulary. My question is why do we add V when we are only considering the count of the previous word.
I only have a rough idea that every word is incremented by one so we have to normalize it by V time but I still don't understand it properly. As I said we are only considering the count of previous word right so why don't just add 1 to it.
I also saw that if we add V then the addition of all bigrams will be 1 which is what it should be. But is there any other explanation of why V?
The |V|
variable that we see in the determiner of additive smoothing function is not actually a direct definition of the probabilisitic estimation of the n-gram. It is derived from:
First, we start with the naive assumption that if we add 1 to the numerator, we also add 1 to denominator to avoid math division error.
But instead of adding +1 to all terms in the vocabulary, we could simply add the size of the vocab, thus you see the sum(c(wi-1)) + |V|
in the denominator, instead of sum(c(wi-1) + 1)
, note the scope of the "sum" function.
Sometimes I find it easier to see the math in code, consider this ngram without laplace:
# From https://docs.python.org/3/library/itertools.html#itertools.pairwise
try:
from itertools import pairwise
except:
from itertools import tee
def pairwise(iterable):
# pairwise('ABCDEFG') --> AB BC CD DE EF FG
a, b = tee(iterable)
next(b, None)
return zip(a, b)
def count_numerator(ngrams, sentences):
return sum([list(pairwise(s)).count(ngrams) for s in sentences])
def count_denominator(word, sentences):
return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) for s in sentences])
s1 = ["<s>", "this", "is", "foo", "bar", '</s>']
s2 = ["<s>", "foo", "bar", "is", "good", '</s>']
s3 = ["<s>", "fruit", "bar", "is", "better", '</s>']
s4 = ["<s>", "this", "is", "good", "</s>"]
s5 = ["<s>", "i", "like", "this", "thing", '</s>']
sentences = [s1, s2, s3, s4, s5]
prob_bos_this = count_numerator(('<s>', 'this'), sentences) / count_denominator("<s>", sentences)
prob_this_is = count_numerator(('this', 'is'), sentences) / count_denominator("this", sentences)
prob_is_good = count_numerator(('is', 'good'), sentences) / count_denominator("is", sentences)
prob_good_eos = count_numerator(('good', '</s>'), sentences) / count_denominator("good", sentences)
print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
print_this_is_good # Outputs: 0.1333
Now consider the laplace smoothing with the incremental +1 on the numerator and denominator:
def count_numerator_laplace(ngrams, sentences):
return sum([list(pairwise(s)).count(ngrams) + 1 for s in sentences])
def count_denominator_laplace(word, sentences):
return sum([sum([int(ng[0] == word) + 1 for ng in pairwise(s)]) for s in sentences])
prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences)
prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)
prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)
prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)
print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
print_this_is_good # 0.004212103350034384
Note that the +1 on the numerator is adding simple +1 to each wi-1, wi
count. The denominator is adding +1 for each ngram that exists in the corpus containing the wi-1, *
.
Now, we see that we are summing +1 for every ngrams that occurs in a similar fashion, we can just add the no. of ngrams that exists, e.g.
def count_numerator_laplace(ngrams, sentences):
return sum([list(pairwise(s)).count(ngrams) + 1 for s in sentences])
def count_denominator_laplace(word, sentences):
return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) + len(list(pairwise(s))) for s in sentences])
prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences)
prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)
prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)
prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)
print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
print_this_is_good # 0.004212103350034384
This is a good prove of concept for how the +1 in the inner sum becomes a + |V| in when you carry the +1 out of the summation, e.g.
my_list = [1,2,3]
sum(i+1 for i in my_list) == sum(i for i in my_list) + len(my_list)