When I tested the two words "ally" and "powerful" for cosine similarity (using a function verified to be a correct implementation) in Python 3.6 with GloVe word vectors, the cosine similarity was 0.6274969008615137. When I tested "ally" and "friend", however, the result was 0.4700224263147646.
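For reference, here is a minimal sketch of the kind of check described above; the GloVe file name and the loading code are illustrative assumptions, not necessarily the exact setup I used:

```python
# Sketch of the cosine-similarity check; "glove.6B.300d.txt" is an assumed
# file name from the pre-trained GloVe releases, not necessarily the one used.
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine_similarity(a, b):
    """Standard cosine similarity: dot product of the two normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = load_glove("glove.6B.300d.txt")
print(cosine_similarity(vectors["ally"], vectors["powerful"]))
print(cosine_similarity(vectors["ally"], vectors["friend"]))
```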
It seems that "ally" and "friend", two nouns given as synonyms, should have a larger cosine similarity than "ally" and "powerful", a noun and a barely related word.
Am I misunderstanding the idea of word vectors or cosine similarity?
Welcome to the wonderful world of learned embeddings. And to its pitfalls.
I will try to explain this at a high level, but feel free to read up on the topic, as there is quite a bit of literature on the problem.
Neural networks in general suffer from the problem that their results are not naturally intuitive to humans: they often simply pick up statistically significant similarities in your training data, whether those are wanted or not.
To take your specific example (GloVe) and analyze some of the problems, let us cite its official documentation:
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
What this tells us is that the learned representations depend, in general, on the contexts in which a word appears. Imagine, for example, a training set consisting of news articles: you are more likely to encounter articles that mention "allies"/"ally" and "powerful" in the same context (think of political reporting) than articles that use "ally" and "friend" as synonyms in the same context.
Unless the corpus actually contains plenty of examples in which both words appear in very similar contexts (and the learned representations therefore become similar), it is unlikely that the two representations end up close to each other.
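If you want to see this effect for yourself, one rough check is to list the nearest neighbours of "ally" in the embedding space and look at what kinds of words dominate. Here is a minimal sketch; the GloVe file name, the loader, and the helper names are my own assumptions, not part of any particular library:

```python
# Rough way to inspect which contexts dominate a word's representation:
# list its nearest neighbours by cosine similarity over the whole vocabulary.
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def nearest_neighbours(word, vectors, k=10):
    """Return the k words whose vectors are most cosine-similar to `word`."""
    target = vectors[word]
    target = target / np.linalg.norm(target)
    scored = []
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = float(np.dot(target, vec / np.linalg.norm(vec)))
        scored.append((sim, other))
    return sorted(scored, reverse=True)[:k]

vectors = load_glove("glove.6B.300d.txt")  # assumed pre-trained file name
# The top neighbours give a feel for which contexts dominate "ally" in the corpus.
print(nearest_neighbours("ally", vectors))
```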
The thing about embeddings is that, while we can certainly find such counter-examples in our data, overall they provide a really good numerical representation of our vocabulary, at least for the languages most common in research (English, Spanish, and French probably being the most popular).
So the question becomes whether you want to spend the time manually annotating a large number of words, probably forgetting about associations in their respective contexts (Apple, for example, might be a good example of a word covering both the fruit and the company, but not everyone who hears "Toyota" also thinks of it as a very common Japanese last name).
This, plus the obvious potential for automated processing, is what makes word embeddings so attractive at the moment. I'm sure I have missed a few obvious points, and I want to add that the acceptance of embeddings varies widely between different research areas, so please take this with a grain of salt.