word2vec zipf

Why is the '[UNK]' token first in the word2vec vocabulary?


If the vocabulary is ordered from the most frequent word to the least frequent, placing '[UNK]' at the beginning implies that it occurs most often. But what if '[UNK]' isn't the most frequent word? Should I put it somewhere else in the vocabulary, according to its frequency?

I ran into this issue while working through this tutorial: https://www.tensorflow.org/tutorials/text/word2vec

When I do negative sampling with the function tf.random.log_uniform_candidate_sampler, the negative samples with low token IDs (e.g. 0, 1, 2, ...) are sampled most often. If '[UNK]' is first (or second when using padding) in the vocabulary, it has token ID 0 (or 1 when using padding), so '[UNK]' will be heavily sampled as a negative sample. If '[UNK]' occurs a lot, there is no problem, but what if it doesn't? Then it should receive a higher token ID, shouldn't it?
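To make the concern concrete: the TensorFlow documentation gives the sampler's distribution as P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). Below is a minimal sketch in plain Python that evaluates this formula without calling TensorFlow. The vocabulary size of 4096 and the helper name log_uniform_prob are illustrative assumptions, not something from the sampler's API.

```python
import math


def log_uniform_prob(token_id: int, range_max: int) -> float:
    """Probability of drawing token_id under the log-uniform (Zipfian)
    distribution documented for tf.random.log_uniform_candidate_sampler."""
    return (math.log(token_id + 2) - math.log(token_id + 1)) / math.log(range_max + 1)


vocab_size = 4096  # illustrative value (assumption)
probs = [log_uniform_prob(i, vocab_size) for i in range(vocab_size)]

# Low token IDs get most of the probability mass.
for token_id in (0, 1, 2, 10, 100, 1000):
    print(token_id, round(probs[token_id], 5))
```

The probabilities decrease monotonically with the token ID, so whichever token sits at ID 0 or 1 is treated as if it were the most frequent word.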


Solution

  • The method that TextVectorization.get_vocabulary() calls always puts the padding and "OOV" (out-of-vocabulary) tokens first in the returned list, which would imply that they're the most common, as you've mentioned.

    I'm not sure why it was written that way, since the OOV token may not always be the most frequent, but that's how it was implemented:

    Source: https://github.com/keras-team/keras/blob/v2.13.1/keras/layers/preprocessing/index_lookup.py#L370

    However, to keep it (or any other very frequent tokens, such as stop words) from being oversampled, as you were concerned about, the tutorial shows how to use the "tf.keras.preprocessing.sequence.make_sampling_table" function to downweight the probability of sampling items that appear early in the vocabulary. Note that in the tutorial this table is passed to the skipgrams function, so it affects how the positive skip-gram pairs are generated.

    If you simply don't want the OOV token in the vocab, you can exclude it (this also drops the padding token, so positions in the returned list no longer match the layer's token IDs):

    inverse_vocab = vectorize_layer.get_vocabulary(include_special_tokens=False)

    It also seems like you could manually move the "[UNK]" entry to the index that matches its actual frequency, as you suggested, if you want the ranking to be as accurate as possible.
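The last suggestion could be sketched roughly as follows, in plain Python rather than with the Keras layer. The function name build_vocab_with_ranked_unk and its parameters are hypothetical, and the sketch assumes you have the raw token stream, so that the '[UNK]' count is simply the total count of all words that fall outside the kept vocabulary.

```python
from collections import Counter


def build_vocab_with_ranked_unk(tokens, max_words, unk="[UNK]", pad=""):
    """Keep the max_words most frequent words, fold every other word into
    unk, and place unk at the rank that matches its combined count."""
    counts = Counter(tokens)
    kept = counts.most_common(max_words)
    # Everything that was not kept would be mapped to the OOV token.
    unk_count = sum(counts.values()) - sum(count for _, count in kept)
    entries = kept + [(unk, unk_count)]
    # Stable sort by descending count; on a tie, unk stays after real words.
    entries.sort(key=lambda word_count: -word_count[1])
    # Padding keeps index 0 because it is a mask value, not a real word.
    return [pad] + [word for word, _ in entries]


tokens = "the cat sat on the mat the cat".split()
vocab = build_vocab_with_ranked_unk(tokens, max_words=2)
print(vocab)  # ['', 'the', '[UNK]', 'cat']
```

Treat the result as a remapping for your own lookup tables. The Keras lookup layer may not accept a custom vocabulary with the OOV token away from the start, so this is probably easier to apply on top of the layer's output than inside it.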