Tags: python, nlp, nltk, gensim, wordnet

Improve gensim most_similar() return values by using WordNet hypernyms


import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')

I first ran this code to download the pre-trained model.

glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])

This would then result in:

[('nahyan', 0.5181387066841125),
 ('caviar', 0.4778318405151367),
 ('paella', 0.4497394263744354),
 ('nahayan', 0.44313961267471313),
 ('zayed', 0.4321245849132538),
 ('omani', 0.4285220503807068),
 ('seafood', 0.4279175102710724),
 ('saif', 0.426000714302063),
 ('dirham', 0.4214130640029907),
 ('sashimi', 0.4165934920310974)]

In this example, we can see that the method failed to capture the 'type' or 'category' of the query: 'zayed' and 'nahyan' are not actually of type 'food'; rather, they are person names.

The approach suggested by my professor is to use WordNet hypernyms to find the 'type'.

After much research, the closest solution I have found is to somehow incorporate lowest_common_hypernyms(), which gives the lowest common hypernym between two synsets, and use it to filter the results of most_similar().
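
For reference, this is roughly how lowest_common_hypernyms() behaves in NLTK (the senses I pick here and the exact synset it returns depend on the WordNet data, so the comments are only illustrative):

from nltk.corpus import wordnet as wn   # may require nltk.download('wordnet') first

# take the first noun sense of each word (sense selection is a simplification)
sushi = wn.synsets('sushi', pos=wn.NOUN)[0]
caviar = wn.synsets('caviar', pos=wn.NOUN)[0]

# lowest_common_hypernyms() returns the deepest ancestor synset(s) shared by both
common = sushi.lowest_common_hypernyms(caviar)
print(common)                  # likely a food-related synset; exact result depends on the WordNet version
print(common[0].max_depth())   # how deep (i.e. how specific) that shared ancestor is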

I am not sure if my idea makes sense and would like the community's feedback on this.

My idea is to compute the hypernyms of, e.g., 'sushi' and the hypernyms of all the similar words returned by most_similar(), and to keep only the words whose lowest common hypernym with the query has the 'longest' (deepest) path. I expect this to return the words that best match the 'type'.
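
Something like the following rough sketch is what I have in mind (type_score is just a helper name I made up, topn=20 is arbitrary, and the re-ranking key is only one possible way to use the hypernym depth):

import gensim.downloader as api
from nltk.corpus import wordnet as wn

glove = api.load('glove-wiki-gigaword-200')   # same model as above

def type_score(query, candidate):
    # depth of the deepest lowest common hypernym between any noun senses of
    # query and candidate; None if either word is missing from WordNet
    q_syns = wn.synsets(query, pos=wn.NOUN)
    c_syns = wn.synsets(candidate, pos=wn.NOUN)
    if not q_syns or not c_syns:
        return None
    depths = [lch.max_depth()
              for q in q_syns
              for c in c_syns
              for lch in q.lowest_common_hypernyms(c)]
    return max(depths) if depths else None

query = 'sushi'
candidates = glove.most_similar(positive=[query, 'uae'], negative=['japan'], topn=20)

# re-rank: deeper (more specific) common hypernym with the query first;
# words WordNet does not know at all (e.g. person names) go last
reranked = sorted(candidates,
                  key=lambda pair: (type_score(query, pair[0]) is None,
                                    -(type_score(query, pair[0]) or 0)))
for word, similarity in reranked:
    print(word, similarity, type_score(query, word))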

Not sure if it makes sense...


Solution

  • Does your proposed approach give adequate results when you try it?

    That's the only test of whether the idea makes sense.

    Word2vec is generally oblivious to all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)

    Word2vec is also fairly oblivious to polysemy in its standard formulation.

    Some other things worth trying might be: