nlp, word2vec, embedding, language-model

Does adding a list of Word2Vec embeddings give a meaningful representation?


I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words obtained by tokenizing a sentence; it is just a list of words that describe a given image.

Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense, or should I consider averaging? Also, I would like the vector to be of a constant size, so concatenating the embeddings is not an option.

It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.


Solution

  • Averaging is most typical when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.

    You could try a simple sum, as well.

    But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.

    On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.

    Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, sum and average give effectively the same vector, since the only thing that distinguished them was magnitude. (The sketch below demonstrates this, along with the cosine-vs-euclidean contrast.)

    What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
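    Below is a minimal sketch of these trade-offs, assuming the gensim library is available and using the pre-trained word2vec-google-news-300 vectors mentioned in the question. The tag lists and the combine/cosine helpers are just hypothetical illustrations, not part of any established API:

    ```python
    import numpy as np
    import gensim.downloader as api

    # Pre-trained vectors named in the question (a large download on first use).
    kv = api.load("word2vec-google-news-300")

    def combine(words, mode="mean", normalize_each=False):
        """Combine in-vocabulary word vectors into one fixed-size vector."""
        vecs = [kv[w] for w in words if w in kv]
        if not vecs:
            return np.zeros(kv.vector_size, dtype=np.float32)
        vecs = np.asarray(vecs)
        if normalize_each:
            # Unit-length-normalize each word vector before combining.
            vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs.sum(axis=0) if mode == "sum" else vecs.mean(axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical image-tag lists.
    tags_a = ["dog", "grass", "frisbee"]
    tags_b = ["cat", "sofa", "window"]

    # Cosine similarity ignores magnitude, so sum and average agree:
    print(cosine(combine(tags_a, "sum"),  combine(tags_b, "sum")))
    print(cosine(combine(tags_a, "mean"), combine(tags_b, "mean")))

    # Euclidean distance is magnitude-sensitive, so sum and average differ:
    print(np.linalg.norm(combine(tags_a, "sum")  - combine(tags_b, "sum")))
    print(np.linalg.norm(combine(tags_a, "mean") - combine(tags_b, "mean")))
    ```

    Swapping mode between "sum" and "mean", or toggling normalize_each, is exactly the kind of cheap permutation worth trying alongside your other parameters.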

    Separately: