gensim, word-embedding, fasttext

How does gensim calculate sentence embeddings when using a pretrained fasttext model?


According to this answer, sentence embeddings for FastText are calculated in one of two ways, depending on whether the model was trained supervised or unsupervised:

  1. The mean of the normalized word vectors (unsupervised)
  2. The mean of the word vectors (supervised)
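To make the comparison below concrete, here is a minimal numpy sketch of the two variants, using toy 3-dimensional vectors as stand-ins for real word embeddings:

```python
import numpy as np

def mean_of_normalized(word_vectors):
    # Variant 1 (unsupervised): L2-normalize each word vector,
    # then average the normalized vectors.
    return np.mean([v / np.linalg.norm(v) for v in word_vectors], axis=0)

def mean_of_raw(word_vectors):
    # Variant 2 (supervised): plain average of the raw word vectors.
    return np.mean(word_vectors, axis=0)

# Toy vectors standing in for wv["til"] and wv["skat"]
v1 = np.array([3.0, 0.0, 4.0])  # norm 5 -> [0.6, 0.0, 0.8]
v2 = np.array([0.0, 2.0, 0.0])  # norm 2 -> [0.0, 1.0, 0.0]

mean_of_normalized([v1, v2])  # -> [0.3, 0.5, 0.4]
mean_of_raw([v1, v2])         # -> [1.5, 1.0, 2.0]
```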

But I cannot make either of those match the sentence embedding gensim returns:

from gensim.models import fasttext
import numpy as np

wv = fasttext.load_facebook_vectors("transtotag/cc.da.300.bin")

w1 = wv["til"]
norm_w1 = np.linalg.norm(w1, ord=2)
s1 = w1 / norm_w1

w2 = wv["skat"]
norm_w2 = np.linalg.norm(w2, ord=2)
s2 = w2 / norm_w2

w3 = wv["til skat"]  # the whole sentence looked up as a single key

# Using "raw" embeddings
((w1 + w2) / 2 - w3).max()  # 0.25
(w1 + w2 - w3).max()        # 0.5

# Using normalized embeddings
((s1 + s2) / 2 - w3).max()  # 0.18
(s1 + s2 - w3).max()        # 0.37

I even tried adding the EOS token (as stated in the answer) as well:

nl = wv["</s>"]
norm_nl = np.linalg.norm(nl, ord=2)
snl = nl / norm_nl

((s1 + s2 + snl) / 3 - w3).max()  # 0.12

If we look in the source code, wv[] just returns vstack([self.get_vector(key) for key in key_or_keys]), i.e. it treats "til skat" as a single (out-of-vocabulary) word.

I also cannot find anything in the docs about how sentence embeddings are created.


Solution

  • In Gensim, you should use the get_sentence_vector method, which was recently added.

    Please read the docs and notice that this method expects a list of words specified by string or int ids.