The sampling_table parameter is only used once in the tf.keras.preprocessing.sequence.skipgrams method, to test whether the probability of the target word in the sampling_table is smaller than a random number drawn uniformly from 0 to 1 (random.random()).
If you have a large vocabulary and a sentence that uses many infrequent words, doesn't this cause the method to skip a lot of the infrequent words when creating skipgrams? Given that the values of a sampling_table are log-linear, like a Zipf distribution, doesn't this mean you can end up with no skipgrams at all?
I'm very confused by this. I am trying to replicate the Word2Vec tutorial and don't understand how the sampling_table is being used.
In the source code, these are the lines in question:

    # Keep the target word with probability sampling_table[wi]; otherwise skip it.
    if sampling_table[wi] < random.random():
        continue
This looks like the frequent-word-downsampling feature common in word2vec implementations. (In the original Google word2vec.c code release, and in the Python Gensim library, it's adjusted by the sample parameter.)
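For intuition, the classic word2vec.c rule computes a keep-probability from a word's corpus frequency and the sample threshold. Here's a minimal sketch of that rule (the keep_probability name is mine, not from either library):

    import math

    def keep_probability(word_frequency, sample=1e-3):
        # word_frequency: the word's count divided by the total corpus word count.
        # sample: the downsampling threshold (the `sample` parameter in
        # word2vec.c and Gensim).
        ratio = word_frequency / sample
        # Words at or below the threshold frequency are always kept; words far
        # above it are kept with rapidly shrinking probability.
        return min(1.0, (math.sqrt(ratio) + 1) / ratio)

A word at exactly the threshold frequency is always kept, while one 100x more frequent survives only about 11% of the time.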
In practice, it's likely sampling_table has been precalculated so that the rarest words are always used, common words skipped a little, and the very-most-common words skipped a lot. That seems to be the intent reflected in the comment for make_sampling_table().
You could go ahead and call that with a probe value, say 1000 for a 1000-word vocabulary, and see what sampling_table it gives back. I suspect it'll be numbers close to 0.0 early (drop most occurrences of the most-common words), rising toward 1.0 late (keep most/all rare words).
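For example, a quick probe (the exact values depend on the default sampling_factor):

    from tensorflow.keras.preprocessing.sequence import make_sampling_table

    table = make_sampling_table(1000)
    print(table[:3])   # tiny keep-probabilities: most-common words usually skipped
    print(table[-3:])  # larger keep-probabilities: rare words mostly retained

Note the table is built from word rank under an assumed Zipf-like frequency distribution, not from your corpus's actual counts.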
This tends to improve word-vector quality by reserving more relative attention for medium- and low-frequency words, and not excessively overtraining/overweighting plentiful words.
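If you want to see the end-to-end effect, you can pass such a table into skipgrams; the sentence of word ids below is made up for illustration (lower id = more frequent word):

    from tensorflow.keras.preprocessing.sequence import (
        make_sampling_table, skipgrams)

    vocab_size = 1000
    table = make_sampling_table(vocab_size)

    # Hypothetical sentence as word ids (rank order: 2 is very common, 999 rare).
    sentence = [2, 3, 7, 50, 300, 999]
    pairs, labels = skipgrams(
        sentence, vocab_size, window_size=2, sampling_table=table)
    print(pairs, labels)

On any single short sentence you can indeed get few or even zero pairs back, since each word's retention is a random draw; across a whole corpus, though, rarer words still generate plenty of pairs, which is the intended balance.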