I'm studying how to implement a Skip-Gram model using PyTorch, following this tutorial. In the subsampling part, the author uses this formula:
import random
import math

def subsample_prob(word, t=1e-3):
    # z: fraction of all corpus tokens that are this word
    z = freq[word_to_ix[word]] / sum_freq
    # keep-probability, as in the original word2vec implementation
    return (math.sqrt(z/t) + 1) * t/z

words_subsample = [w for w in words if random.random() < subsample_prob(w)]
where z is the proportion of a given word's count to the total number of words in the corpus. My doubt is that, depending on a word's proportion, this formula gives a result greater than one, so the word is always added to the subsampled corpus. Shouldn't it return a value between zero and one?
The frequent-word downsampling option ('subsampling') introduced in the original word2vec (as its -sample argument) indeed applies downsampling only to a small subset of the very-most-frequent words. (And, given the 'tall head'/Zipfian distribution of words in natural-language texts, that's plenty.)
Typical values leave most words fully sampled, as reflected in this formula by a sampling probability greater than 1.0.
So: there's no error here. It's how the original word2vec implementation, and others, interpret the sample parameter. Most words are exempt from any thinning, but some of the most-common words are heavily dropped. (But there are still plenty of their varied usage examples in the training set; and indeed, spending fewer training updates redundantly on those words lets other words get better vectors, facing less contention/dilution from overtraining of common words.)
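To see that "most words exempt, very-common words heavily dropped" behaviour end to end, here's a rough sketch on a synthetic Zipf-like corpus (toy data, not the tutorial's corpus; variable names are my own):

import math
import random
from collections import Counter

random.seed(0)

# toy Zipf-like corpus: the word at rank r is drawn with weight ~ 1/r
vocab = [f"w{r}" for r in range(1, 1001)]
weights = [1.0 / r for r in range(1, 1001)]
words = random.choices(vocab, weights=weights, k=200_000)

counts = Counter(words)
total = len(words)

def subsample_prob(word, t=1e-3):
    z = counts[word] / total
    return (math.sqrt(z / t) + 1) * t / z

words_subsample = [w for w in words if random.random() < subsample_prob(w)]

# only words above the ~2.618*t frequency threshold are ever dropped
thinned = [w for w in counts if subsample_prob(w) < 1.0]
print(f"{len(thinned)} of {len(counts)} distinct words are eligible for thinning")
print(f"corpus shrank from {total} to {len(words_subsample)} tokens")

On a run like this, only a few dozen of the highest-ranked words fall below a keep-probability of 1.0, yet the corpus still shrinks noticeably, because those few words account for a large share of all tokens.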