nlp word2vec subsampling

Subsampling formula for skip-gram (NLP)


I'm studying how to implement a Skip-Gram model using PyTorch. I am following this tutorial, and in the subsampling part the author uses this formula:

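import random
import math

# freq, word_to_ix, sum_freq and words are assumed to be defined earlier in the tutorial.
def subsample_prob(word, t=1e-3):
    # z: this word's share of all word occurrences in the corpus
    z = freq[word_to_ix[word]] / sum_freq
    return (math.sqrt(z/t) + 1) * t/z

# Keep an occurrence when a uniform draw from [0, 1) falls below the value above.
words_subsample = [w for w in words if random.random() < subsample_prob(w)]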

Here z is the count of a given word divided by the total number of words in the corpus. My doubt is that, depending on this proportion, the formula gives a result greater than one, in which case the word is always added to the subsampled corpus. Shouldn't it return a value between zero and one?


Solution

  • The frequent-word downsampling option ('subsampling') introduced in the original word2vec (as the -sample argument) does indeed apply downsampling only to a small subset of the most frequent words. (And, given the 'tall head'/Zipfian distribution of words in natural-language text, that's plenty.)

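    Typical values leave most words fully sampled, as reflected in this formula by a sampling probability greater than 1.0. Since random.random() returns a value in [0, 1), any result of 1.0 or more simply means the word is always kept.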

    So: there's no error here. It's how the original word2vec implementation, and others, interpret the sample parameter. Most words are exempt from any thinning, but some of the most common words are heavily dropped. (There are still plenty of their varied usage examples in the training set, and spending fewer training updates redundantly on those words lets other words get better vectors, with less contention/dilution from overtraining on common words.)
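
To see where the formula crosses 1.0, write x = z/t. The value is then (sqrt(x) + 1) / x, which is at least 1 exactly when sqrt(x) <= (1 + sqrt(5)) / 2, i.e. when z <= t * (3 + sqrt(5)) / 2, or roughly 2.618 * t. The sketch below checks this numerically and counts how many word types fall above that cutoff in a made-up Zipf-like vocabulary. The names keep_value, keep_probability and cutoff, and the synthetic 1/rank corpus, are illustrative assumptions and are not taken from the tutorial or from word2vec itself.

```python
import math

# (3 + sqrt(5)) / 2 is the square of the golden ratio, about 2.618.
PHI_SQUARED = (3 + math.sqrt(5)) / 2


def keep_value(z, t=1e-3):
    """Raw value of the formula from the question; may exceed 1.0."""
    return (math.sqrt(z / t) + 1) * t / z


def keep_probability(z, t=1e-3):
    """Same value clamped to 1.0, so it reads as a probability."""
    return min(1.0, keep_value(z, t))


def cutoff(t=1e-3):
    """Corpus share z at which the raw value equals 1.0."""
    return PHI_SQUARED * t


# Hypothetical corpus: 10,000 word types with frequency proportional to 1/rank.
vocab_size = 10_000
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
total = sum(weights)
shares = [w / total for w in weights]

# Only word types whose raw value is below 1.0 ever lose occurrences.
thinned = [z for z in shares if keep_value(z) < 1.0]

print(f"cutoff share for t=1e-3: {cutoff():.6f}")
print(f"word types thinned: {len(thinned)} of {vocab_size}")
print(f"keep probability of the most frequent word: {keep_probability(shares[0]):.3f}")
```

Clamping with min(1.0, ...) does not change which words are kept, because a uniform draw from [0, 1) is always below any value of 1.0 or more; it only makes the returned number easier to interpret.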