The task I am trying to achieve is finding the top 20 most common hypernyms for all nouns and verbs in a text file. I believe my output is erroneous and that there is a more elegant solution, particularly one that avoids manually creating a list of the most common nouns and verbs and hand-writing the code that iterates over the synsets to identify the hypernyms.
Please see below for the code I have attempted so far; any guidance would be appreciated:
nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]
def check_hypernym(word_list):
    return_list = []
    for word in word_list:
        w = wordnet.synsets(word)
        for syn in w:
            if not (len(syn.hypernyms()) == 0):
                return_list.append(word)
                break
    return return_list
hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)
word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)

final_list = []
for syn in syn_list:
    hypernyms_syn = syn.hypernyms()
    final_list.append(hypernyms_syn)

final_list
I tried identifying the top 20 most common nouns and verbs, then identified their synsets and subsequently their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.
For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but there are two things to tighten up. First, because and binds tighter than or, your condition parses as (not token.is_stop and not token.is_punct and token.pos_ == "VERB") or token.pos_ == "NOUN", so stop-word and punctuation nouns slip through the filter. Second, you can drop the is_punct check entirely, since a token tagged as a noun or verb is never punctuation, and use in so that you don't need two separate boolean conditions for NOUN and VERB:
nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]
Other than that it looks fine.
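If you want to convince yourself of the precedence behaviour, here is a tiny standalone check (plain Python, no spaCy needed; the is_stop and pos values are made up for illustration):

# "a and b or c" parses as "(a and b) or c", not "a and (b or c)".
is_stop, pos = True, "NOUN"   # a hypothetical stop-word noun such as "s"
print(not is_stop and pos == "VERB" or pos == "NOUN")  # True  -> slips through the filter
print(not is_stop and pos in ["VERB", "NOUN"])         # False -> correctly filtered out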
For the second part, getting the most common hypernyms: your general approach is fine. You could make it a little more memory-efficient for long texts, where the same hypernym can appear many times, by using a Counter object from the get-go instead of building a long intermediate list. See the code below.
from nltk.corpus import wordnet as wn
from collections import Counter

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

# Tally every hypernym of every synset of every word in a single pass.
hypernym_counts = Counter()
for word in word_list:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

# most_common(20) returns the 20 highest counts directly.
top_20_hypernyms = hypernym_counts.most_common(20)
for i, (hypernym, count) in enumerate(top_20_hypernyms, start=1):
    print(f"{i}. {hypernym.name()} ({count})")
Outputs:
1. time_period.n.01 (6)
2. be.v.01 (3)
3. communicate.v.02 (3)
4. male.n.02 (3)
5. think.v.03 (3)
6. male_aristocrat.n.01 (2)
7. letter.n.02 (2)
8. thyroid_hormone.n.01 (2)
9. experience.v.01 (2)
10. copulate.v.01 (2)
11. travel.v.01 (2)
12. time_unit.n.01 (2)
13. serve.n.01 (2)
14. induce.v.02 (2)
15. accept.v.03 (2)
16. make.v.02 (2)
17. leave.v.04 (2)
18. give.v.03 (2)
19. parent.n.01 (2)
20. make.v.03 (2)
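Putting both parts together, here is a minimal end-to-end sketch that skips the manual word list entirely and feeds the spaCy tokens straight into the counter. It assumes hamlet_spacy is the spaCy Doc from your question and that NLTK's wordnet corpus has been downloaded:

from collections import Counter
from nltk.corpus import wordnet as wn

# Assumes hamlet_spacy is an existing spaCy Doc, e.g. built with
# nlp = spacy.load("en_core_web_sm"); hamlet_spacy = nlp(text)

# Non-stopword nouns and verbs, straight from the Doc.
nouns_verbs = [token.text for token in hamlet_spacy
               if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]

# Count hypernyms across every synset of every word.
hypernym_counts = Counter()
for word in nouns_verbs:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

for i, (hypernym, count) in enumerate(hypernym_counts.most_common(20), start=1):
    print(f"{i}. {hypernym.name()} ({count})")

One thing to keep in mind: wn.synsets() looks up all senses of each surface form, so counting token.lemma_ instead of token.text would merge variants like 'Let' and 'let', which appear as separate entries in your word list.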