machine-learning nlp word2vec

What is the concept of negative-sampling in word2vec?


I'm reading the 2014 paper word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method (note: direct download link) and it references the concept of "negative-sampling":

Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word embeddings. While negative-sampling is based on the skip-gram model, it is in fact optimizing a different objective.

I'm having trouble understanding the concept of negative sampling.

https://arxiv.org/pdf/1402.3722v1.pdf

Can anyone explain in layman's terms what negative-sampling is?


Solution

  • The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have

          v_c . v_w
     -------------------
       sum_i(v_ci . v_w)
    

     The numerator is basically the similarity between the context word c and the target word w. The denominator computes the similarity between every other context ci and the target word w. Maximising this ratio ensures that words which appear close together in text end up with more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one way of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of the vectors of all other words in the language. This makes word2vec much, much faster to train. A small sketch contrasting the two objectives follows below.
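
     For concreteness, here is a minimal NumPy sketch of that contrast. It is not Mikolov et al.'s actual implementation: the vocabulary size, vector dimension, word indices, number of negatives k, and the logistic (sigmoid) form of the sampled objective are all illustrative assumptions; the point is only to show that the full softmax sums over the whole vocabulary while negative sampling touches just k random contexts.

         import numpy as np

         rng = np.random.default_rng(0)

         vocab_size, dim = 10_000, 100
         W_target = rng.normal(scale=0.1, size=(vocab_size, dim))   # vectors v_w for target words
         W_context = rng.normal(scale=0.1, size=(vocab_size, dim))  # vectors v_c for context words

         w = 42        # index of the target word, e.g. "cat" (illustrative)
         c_pos = 7     # index of an observed context word, e.g. "food" (illustrative)

         # Full softmax (equation (3) in the paper): the denominator sums the
         # exponentiated similarity over *every* context in the vocabulary.
         # This is the expensive part.
         scores_all = W_context @ W_target[w]                 # v_ci . v_w for all ci
         full_softmax = np.exp(scores_all[c_pos]) / np.exp(scores_all).sum()

         # Negative sampling: keep the observed (w, c_pos) pair and only k
         # randomly drawn "negative" contexts, pushing their similarity to w
         # down with a logistic objective instead of normalising over the
         # whole vocabulary.
         def sigmoid(x):
             return 1.0 / (1.0 + np.exp(-x))

         k = 5
         c_neg = rng.integers(0, vocab_size, size=k)          # e.g. "democracy", "greed", ...

         pos_term = np.log(sigmoid(W_context[c_pos] @ W_target[w]))
         neg_term = np.log(sigmoid(-(W_context[c_neg] @ W_target[w]))).sum()
         neg_sampling_objective = pos_term + neg_term         # maximise this w.r.t. the vectors

         print(full_softmax, neg_sampling_objective)

     The full-softmax line does O(vocab_size) work per training pair, while the negative-sampling lines do O(k) work, which is why training becomes so much faster.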