I am trying to reimplement word2vec in PyTorch. I implemented subsampling according to the code of the original paper. However, I am trying to understand how subsampling is implemented in Gensim. I looked at the source code, but I did not manage to grasp how it connects back to the original paper.
Thanks a lot in advance.
The key line is:

```
if c.sample and word.sample_int < random_int32(&c.next_random):
    continue
```
The `if c.sample` part tests whether frequent-word downsampling is enabled at all (any non-zero value).
The `word.sample_int` is a value, per vocabulary word, that was precalculated during the vocabulary-discovery phase. It's essentially the 0.0-to-1.0 probability that a word should be kept, but scaled to the range 0 to (2^32-1).
Most words, which are never down-sampled, simply have the value (2^32-1) there - so no matter what random int was just generated, it can never exceed that threshold, and the word is retained.
The few most-frequent words have other, smaller scaled values there, and thus sometimes the random int generated is larger than their `sample_int`. Thus, that word is, in that one training cycle, skipped via the `continue` to the next word in the sentence. (That one word doesn't get made part of `effective_words`, this one time.)
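
If it helps to see that check outside of Cython, here's a minimal pure-Python sketch of the same skip logic. The function and variable names (`downsample_sentence`, `sample_ints`, `state`) are mine, not Gensim's, and the LCG constants are the Java-style ones Gensim's `random_int32` uses, written from memory:

```python
def random_int32(state):
    # Java-style 48-bit LCG (believed to match Gensim's random_int32):
    # return the top 32 bits of the current state, then advance it.
    this_random = state[0] >> 16
    state[0] = (state[0] * 25214903917 + 11) & (2**48 - 1)
    return this_random

def downsample_sentence(sentence, sample_ints, state):
    # Mirrors: if c.sample and word.sample_int < random_int32(&c.next_random): continue
    effective_words = []
    for word in sentence:
        if sample_ints.get(word, 2**32 - 1) < random_int32(state):
            continue  # word is dropped for this one training cycle
        effective_words.append(word)
    return effective_words

# Hypothetical usage: "the" gets a reduced keep-threshold, "zebra" is always kept.
state = [1]  # plays the role of c.next_random
sample_ints = {"the": 1_500_000_000, "zebra": 2**32 - 1}
print(downsample_sentence(["the", "quick", "zebra", "the"], sample_ints, state))
```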
You can see the original assignment & precalculation of the `.sample_int` values, per unique vocabulary word, at and around the `prepare_vocab()` step in `gensim/models/word2vec.py`.
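
For comparison with your PyTorch reimplementation, here's a rough Python paraphrase of that precalculation (names and structure are mine; Gensim's actual code also handles `sample >= 1` as an absolute-count threshold, which this sketch skips):

```python
from math import sqrt

def compute_sample_ints(counts, sample=1e-3):
    """Paraphrase of Gensim's prepare_vocab downsampling math (sample < 1 case).

    counts: dict mapping word -> raw corpus frequency
    sample: the Word2Vec `sample` parameter (0 disables downsampling)
    """
    total_words = sum(counts.values())
    threshold_count = sample * total_words  # frequency threshold in absolute counts
    sample_ints = {}
    for word, v in counts.items():
        # keep-probability from the word2vec paper's subsampling formula,
        # as used in Gensim: (sqrt(f/t) + 1) * (t/f)
        word_probability = (sqrt(v / threshold_count) + 1) * (threshold_count / v)
        word_probability = min(word_probability, 1.0)
        # scale the 0.0-to-1.0 keep-probability into the uint32 range
        sample_ints[word] = int(round(word_probability * 2**32))
    return sample_ints
```

Words far more frequent than `threshold_count` end up with small `sample_int` values and so get skipped often; rarer words hit the 1.0 cap and are always kept.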