nlpsmoothinglanguage-model

Why do we add |V| in the denominator in the Add-One smoothing for n-gram language models?


In NLP when we use Laplace(Add-one) smoothing technique we assume that the every word is seen one more time than the actual count and the formula is like this

formula for add one smoothing

where V is the size of the vocabulary. My question is why do we add V when we are only considering the count of the previous word.

I only have a rough idea that every word is incremented by one so we have to normalize it by V time but I still don't understand it properly. As I said we are only considering the count of previous word right so why don't just add 1 to it.

I also saw that if we add V then the addition of all bigrams will be 1 which is what it should be. But is there any other explanation of why V?


Solution

  • The |V| variable that we see in the determiner of additive smoothing function is not actually a direct definition of the probabilisitic estimation of the n-gram. It is derived from:

    enter image description here

    First, we start with the naive assumption that if we add 1 to the numerator, we also add 1 to denominator to avoid math division error.

    But instead of adding +1 to all terms in the vocabulary, we could simply add the size of the vocab, thus you see the sum(c(wi-1)) + |V| in the denominator, instead of sum(c(wi-1) + 1), note the scope of the "sum" function.


    More details

    Sometimes I find it easier to see the math in code, consider this ngram without laplace:

    # From https://docs.python.org/3/library/itertools.html#itertools.pairwise
    
    try:
        from itertools import pairwise
    except:
        from itertools import tee
        def pairwise(iterable):
            # pairwise('ABCDEFG') --> AB BC CD DE EF FG
            a, b = tee(iterable)
            next(b, None)
            return zip(a, b)
    
        
    def count_numerator(ngrams, sentences):
        return sum([list(pairwise(s)).count(ngrams) for s in sentences])
    
    def count_denominator(word, sentences):
        return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) for s in sentences])
    
    
    s1 = ["<s>", "this", "is", "foo", "bar", '</s>']
    s2 = ["<s>", "foo", "bar", "is", "good", '</s>']
    s3 = ["<s>", "fruit", "bar", "is", "better", '</s>']
    s4 = ["<s>", "this", "is", "good", "</s>"]
    s5 = ["<s>", "i", "like", "this", "thing", '</s>']
    
    sentences = [s1, s2, s3, s4, s5]
    
    
    prob_bos_this = count_numerator(('<s>', 'this'), sentences) / count_denominator("<s>", sentences) 
    
    prob_this_is = count_numerator(('this', 'is'), sentences) / count_denominator("this", sentences)
    
    prob_is_good = count_numerator(('is', 'good'), sentences) / count_denominator("is", sentences)
    
    prob_good_eos = count_numerator(('good', '</s>'), sentences) / count_denominator("good", sentences)
    
    print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
    
    print_this_is_good # Outputs: 0.1333
    

    Now consider the laplace smoothing with the incremental +1 on the numerator and denominator:

    def count_numerator_laplace(ngrams, sentences):
        return sum([list(pairwise(s)).count(ngrams) + 1  for s in sentences])
    
    def count_denominator_laplace(word, sentences):
        return sum([sum([int(ng[0] == word) + 1 for ng in pairwise(s)])  for s in sentences])
    
    
    prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 
    
    prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)
    
    prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)
    
    prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)
    
    print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
    print_this_is_good  # 0.004212103350034384
    

    Note that the +1 on the numerator is adding simple +1 to each wi-1, wi count. The denominator is adding +1 for each ngram that exists in the corpus containing the wi-1, *.


    Now, we see that we are summing +1 for every ngrams that occurs in a similar fashion, we can just add the no. of ngrams that exists, e.g.

    def count_numerator_laplace(ngrams, sentences):
        return sum([list(pairwise(s)).count(ngrams) + 1  for s in sentences])
    
    def count_denominator_laplace(word, sentences):
        return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) + len(list(pairwise(s))) for s in sentences])
    
    
    prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 
    
    prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)
    
    prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)
    
    prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)
    
    print_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
    print_this_is_good  # 0.004212103350034384
    

    This is a good prove of concept for how the +1 in the inner sum becomes a + |V| in when you carry the +1 out of the summation, e.g.

    my_list = [1,2,3]
    sum(i+1 for i in my_list) == sum(i for i in my_list) + len(my_list)