python, keras, statistics, word2vec, subsampling

Statistical reasoning: how and why does tf.keras.preprocessing.sequence.skipgrams use sampling_table this way?


The sampling_table parameter is used in the tf.keras.preprocessing.sequence.skipgrams method only once: to test whether the value stored for the target word in sampling_table is smaller than a random number drawn uniformly from [0, 1) (random.random()). If it is, the word is skipped.

If you have a large vocabulary and a sentence that uses a lot of infrequent words, doesn't this cause the method to skip many of the infrequent words when creating skipgrams? Given a sampling_table whose values fall off log-linearly, like a Zipf distribution, doesn't this mean you can end up with no skipgrams at all?

I am very confused by this. I am trying to replicate the Word2Vec tutorial and don't understand how the sampling_table is being used.
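
For context, this is roughly how the Word2Vec tutorial wires the table into skipgrams (a sketch; vocab_size and sequence are placeholder names I made up, not the tutorial's exact code):

    import tensorflow as tf

    vocab_size = 4096                    # placeholder vocabulary size
    sequence = [12, 7, 3051, 42, 7, 9]   # placeholder sentence, as word-rank IDs

    # Table of keep-probabilities, indexed by word rank (0 = most common)
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)
    pairs, labels = tf.keras.preprocessing.sequence.skipgrams(
        sequence, vocab_size, window_size=2, sampling_table=sampling_table)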

In the source code, these are the lines in question:

            if sampling_table[wi] < random.random():
                continue
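
To see what that test does in isolation, here is a quick simulation of my own (not library code) showing that the table value acts as a keep-probability:

    import random

    def keep_rate(table_value, trials=100_000):
        # The same test skipgrams applies: the word survives whenever
        # table_value >= random.random(), i.e. with probability table_value.
        kept = sum(1 for _ in range(trials) if table_value >= random.random())
        return kept / trials

    print(keep_rate(0.95))  # ~0.95: a high table value is almost always kept
    print(keep_rate(0.05))  # ~0.05: a low table value is almost always skipped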

Solution

  • This looks like the frequent-word downsampling feature common in word2vec implementations. (In the original Google word2vec.c code release, and in the Python Gensim library, it's adjusted by the sample parameter.)

    In practice, it's likely that sampling_table has been precalculated so that the rarest words are always kept, common words are skipped a little, and the very most common words are skipped a lot.

    That seems to be the intent reflected by the comment for make_sampling_table().

    You could go ahead and call make_sampling_table() with a probe vocabulary size and see what sampling_table it gives back (see the sketch at the end of this answer). I suspect it'll be numbers close to 0.0 early (drop lots of common words), and close to 1.0 late (keep most/all rare words).

    This tends to improve word-vector quality, by reserving more relative attention for medium- and low-frequency words, and not excessively overtraining/overweighting plentiful words.
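
    A minimal sketch of that probe (the 100,000-word probe size and the library's default sampling_factor are illustrative choices, not anything skipgrams requires):

        from tensorflow.keras.preprocessing.sequence import make_sampling_table

        # Keep-probabilities indexed by frequency rank (index 0 = most common
        # word), assuming a Zipf-like word-frequency distribution.
        table = make_sampling_table(100_000)

        print(table[:3])   # most common words: tiny values, usually skipped
        print(table[-3:])  # rarest words: 1.0, always kept

    Note the direction: since skipgrams keeps a word when sampling_table[wi] >= random.random(), a value near 1.0 means "almost always keep" and a value near 0.0 means "almost always skip".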