I'm implementing a standard gensim word2vec model (continuous bag of words) on a series of Chinese characters, and for a comparison between Chinese homophones and words of similar frequency, our cosine similarities are all positive and surprisingly high (>0.3). Any clue why this is the case? We have `vector_size` set to 300 and `min_count` set to 1; other than that, no modifications were made to gensim's standard implementation of word2vec.
Also, if anyone has any resources for learning how these embeddings are actually generated, that would be really helpful. Thank you very much in advance!
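For reference, the training setup is roughly like this (a minimal sketch; the tiny `sentences` list is just a placeholder for our actual character-tokenized corpus):

```python
from gensim.models import Word2Vec

# Placeholder corpus: each sentence is a list of single Chinese characters.
sentences = [
    ["我", "们", "去", "学", "校"],
    ["他", "们", "在", "家", "学", "习"],
]

# sg=0 selects CBOW (gensim's default); only vector_size and min_count were changed.
model = Word2Vec(sentences, vector_size=300, min_count=1, sg=0)

# Cosine similarity between two characters; on our real corpus these come out > 0.3.
print(model.wv.similarity("我", "他"))
```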
I don't think it's typical for all pairwise comparisons within a model to be positive, but large sets of tokens-of-interest for one specific investigation might all have positive similarities.
`0.3` isn't necessarily a particularly high similarity, but also note such similarity values don't have any absolute interpretation. Rather, they only have meaning compared to the other similarities in the same model.

Depending on other chosen parameters, especially `vector_size` dimensionality, the very best nearest-neighbors to a token might be of nearly any positive similarity. That token B is most-similar to A, or that B is more-similar than other tokens, is more meaningful & reliable than whether `cossim(a_vector, b_vector)` is ~0.3 or ~0.9.
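For example, rather than reading the raw number, you might check where one token ranks among the other's nearest neighbors (a rough sketch; `target` and `candidate` are placeholder tokens standing in for one of your homophone pairs):

```python
# Placeholder tokens; substitute one of your actual homophone pairs.
target, candidate = "是", "事"

raw_sim = model.wv.similarity(target, candidate)

# Look at the target's nearest neighbors & see where the candidate ranks;
# that rank is more meaningful than whether raw_sim is ~0.3 or ~0.9.
neighbors = [tok for tok, _ in model.wv.most_similar(target, topn=50)]
rank = neighbors.index(candidate) + 1 if candidate in neighbors else None

print(f"similarity={raw_sim:.3f}, rank among top-50 neighbors: {rank}")
```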
Separately, `min_count=1` is almost always a bad idea for these models. A token that appears in only a single context won't get a good vector from this sort of algorithm, but typical natural-language corpora may have many such one-off or few-off rare words. Altogether they soak up a lot of training time while contributing nothing of value, and they also serve as 'noise' that worsens the vectors of other tokens which do have adequately numerous/varied contexts. Discarding rarer words, per the default `min_count=5` (or even higher values as soon as your corpus is large enough), is a best practice which yields far more improvement in the remaining tokens' vectors than loss from ignoring the rarer words.
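A rough way to gauge the tradeoff is to build the vocabulary at a few different `min_count` values and compare how many tokens survive, before committing to full training (a sketch; `sentences` is your tokenized corpus):

```python
from gensim.models import Word2Vec

for mc in (1, 5, 10):
    model = Word2Vec(vector_size=300, min_count=mc)
    model.build_vocab(sentences)   # scans the corpus without training
    print(f"min_count={mc}: {len(model.wv)} tokens retained")
```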
And, `vector_size=300` would only be appropriate if you have a large-enough corpus to justify it. How large is your corpus, in terms of (1) total tokens; (2) unique words (overall & after applying a reasonable `min_count`); (3) average text length (in token count)? Most of these stats about your corpus will appear in logging output, as Gensim `Word2Vec` works, if you enable Python logging to the `INFO` level.
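That's just standard-library Python logging, nothing gensim-specific, for example:

```python
import logging

# INFO-level logging makes gensim print corpus stats (total words, retained
# vocabulary size, training progress) as Word2Vec builds its vocab & trains.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
```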
If you continue to have problems, you should either expand (edit) this question, or ask a new question, with more details, including:

- `Word2Vec` parameters used