gensim word2vec subsampling

How does Gensim implement subsampling in Word2Vec?


I am trying to reimplement word2vec in PyTorch. I implemented subsampling according to the code of the original paper. However, I am trying to understand how subsampling is implemented in Gensim. I looked at the source code, but I could not work out how it relates to the original paper.

Thanks a lot in advance.


Solution

  • The key line is:

    https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec_inner.pyx#L543

        if c.sample and word.sample_int < random_int32(&c.next_random):
            continue
    

    The c.sample check tests whether frequent-word downsampling is enabled at all (any non-zero value).

    The word.sample_int is a value, per vocabulary word, that was precalculated during the vocabulary-discovery phase. It's essentially the 0.0-to-1.0 probability that a word should be kept, but scaled to the range 0-to-(2^32-1).

    Most words, which are never down-sampled, simply have the value (2^32-1) there, so no matter what random int was just generated, it is never greater than that threshold, and the word is retained.

    The few most-frequent words have smaller scaled values there, so the random int generated is sometimes larger than their sample_int. When that happens, the word is skipped for that one training pass: the continue moves on to the next word in the sentence. (That word doesn't become part of effective_words, this one time.)

    You can see the original assignment & precalculation of the .sample_int values, per unique vocabulary word, at and around:

    https://github.com/RaRe-Technologies/gensim/blob/e391f0c25599c751e127dde925e062c7132e4737/gensim/models/word2vec.py#L1544
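For a reimplementation, the two steps described above (precalculating a per-word sample_int, then comparing it against a random 32-bit int for each token) can be sketched in plain Python. This is only an illustration, not Gensim's code: the names precalc_sample_ints and effective_words are hypothetical, Python's random module stands in for Gensim's own random_int32 generator, and the keep-probability formula is assumed to be the one from the original word2vec.c, which the linked precalculation appears to follow. It also assumes sample is either 0 or a fraction below 1.

```python
import math
import random

MAX_UINT32 = 2**32 - 1


def precalc_sample_ints(counts, sample=1e-3):
    """Map each word to its keep-probability scaled to 0..(2^32-1).

    Hypothetical helper. The formula is assumed from word2vec.c:
    keep_prob = (sqrt(count / t) + 1) * (t / count), with t = sample * total.
    """
    if not sample:
        # Downsampling disabled: every word always passes the check.
        return {word: MAX_UINT32 for word in counts}
    threshold_count = sample * sum(counts.values())
    sample_ints = {}
    for word, count in counts.items():
        keep_prob = (math.sqrt(count / threshold_count) + 1) * (threshold_count / count)
        # Rare words get keep_prob >= 1, which is clamped to "always keep".
        sample_ints[word] = int(min(keep_prob, 1.0) * MAX_UINT32)
    return sample_ints


def effective_words(sentence, sample_ints, sample=1e-3, rng=None):
    """Return the words of one sentence that survive downsampling this pass."""
    if rng is None:
        rng = random.Random()
    kept = []
    for word in sentence:
        # Same shape as the Cython line: skip when sample_int < random 32-bit int.
        if sample and sample_ints[word] < rng.getrandbits(32):
            continue
        kept.append(word)
    return kept
```

Calling precalc_sample_ints once on the vocabulary counts and then effective_words on each sentence in each epoch gives a different random subset of the frequent words every pass, while rare words are always kept.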