According to this answer, sentence similarity for FastText is calculated in one of two ways (depending on whether the embeddings were created supervised or unsupervised).
But I cannot get either of those to produce the same result as the sentence embedding:
from gensim.models import fasttext
import numpy as np
wv = fasttext.load_facebook_vectors("transtotag/cc.da.300.bin")
w1 = wv["til"]
norm_w1 = np.linalg.norm(w1, ord=2)
s1 = w1 / norm_w1
w2 = wv["skat"]
norm_w2 = np.linalg.norm(w2, ord=2)
s2 = w2 / norm_w2
w3 = wv["til skat"]  # supposed sentence embedding
# Using "raw" embeddings
((w1+w2)/2-w3).max() #0.25
((w1+w2)-w3).max() # 0.5
# using normalized embeddings
((s1+s2)/2-w3).max() # 0.18
((s1+s2)-w3).max() # 0.37
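(For readability, the normalized-average attempt can also be wrapped in a small helper; avg_normalized is just my own name for this sketch, not a gensim function:)
def avg_normalized(words):
    # Average of L2-normalized word vectors, the unsupervised rule
    # described in the linked answer
    vecs = np.stack([wv[w] / np.linalg.norm(wv[w]) for w in words])
    return vecs.mean(axis=0)

(avg_normalized(["til", "skat"]) - w3).max()  # 0.18, same as above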
I even tried adding the EOS token (as stated in the answer) as well:
nl = wv["</s>"]
norm_nl = np.linalg.norm(nl, ord=2)
snl = nl / norm_nl
((s1 + s2 + snl) / 3 - w3).max()  # 0.12
If we look in the source code, then wv[] just returns vstack([self.get_vector(key) for key in key_or_keys]), i.e. it treats "til skat" as a single word.
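One way to confirm this (a sketch; key_to_index is the gensim 4.x vocabulary dict):
# "til skat" is not a vocabulary key, so its vector is synthesized from
# character n-grams of the whole string rather than from the two words
print("til skat" in wv.key_to_index)               # expected: False
print(np.allclose(w3, wv.get_vector("til skat")))  # True: wv[] delegates to get_vector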
I also cannot find anything in the docs about how sentence embeddings are created.
In Gensim, you should use the get_sentence_vector method, which was recently added.
Please read the docs and note that this method expects a list of words, specified as strings or int ids.
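For example (a sketch: whether this exactly matches the manual average depends on get_sentence_vector averaging L2-normalized word vectors, as fastText's getSentenceVector does for unsupervised models):
sent = wv.get_sentence_vector(["til", "skat"])
# Compare against the manual average of L2-normalized word vectors from the question
manual = (s1 + s2) / 2
print(np.allclose(sent, manual, atol=1e-6))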