python-3.x, nlp, nltk, wordnet

WordNet: Finding the most common hypernyms


The task I am trying to achieve is to find the top 20 most common hypernyms for all nouns and verbs in a text file. I believe my output is erroneous, and I suspect there is a more elegant solution, particularly one that avoids manually creating a list of the most common nouns and verbs and hand-writing the loop over the synsets to identify the hypernyms.

Please see below for the code I have attempted so far; any guidance would be appreciated:

nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]

def check_hypernym(word_list):
    # Keep only the words that have at least one synset with a hypernym.
    return_list = []
    for word in word_list:
        for syn in wordnet.synsets(word):
            if syn.hypernyms():  # this sense has at least one hypernym
                return_list.append(word)
                break
    return return_list

hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)

    final_list = []
    for syn in syn_list:
        hypernyms_syn = syn.hypernyms()
        final_list.append(hypernyms_syn)

final_list

I tried identifying the top 20 most common nouns and verbs, then identified their synsets and subsequently their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.


Solution

  • For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but you can shorten the condition. The punctuation check is redundant, since a token tagged as a noun or verb is never punctuation, and using in lets you replace the two separate boolean comparisons for NOUN and VERB with a single membership test. Note also that in your original comprehension, and binds tighter than or, so the condition groups as (not stop and not punct and VERB) or NOUN, which lets stop-word nouns slip past the filter (see the short demo below).

    nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]
    

    Other than that it looks fine.
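
    Here is a minimal, self-contained demo of that precedence pitfall, with plain booleans standing in for a token's attributes:

    # "and" binds tighter than "or", so the unparenthesized condition groups
    # as (not stop and not punct and VERB) or (NOUN): a stop-word noun passes.
    is_stop, is_punct, pos = True, False, "NOUN"
    original = not is_stop and not is_punct and pos == "VERB" or pos == "NOUN"
    fixed = not is_stop and not is_punct and pos in ["VERB", "NOUN"]
    print(original, fixed)  # prints: True False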

    For the second part, getting the most common hypernyms, your general approach is reasonable, but note two problems in the posted code. First, check_hypernym collects the words that have at least one hypernym, so your FreqDist ends up counting words rather than the hypernyms themselves. Second, in your last snippet the line final_list = [] sits inside the loop, so the list is reset on every iteration and only ever holds the last word's hypernyms. The version below tallies the hypernym synsets directly, and it is also more memory efficient for long texts, where the same hypernym can appear many times, because it updates a Counter object from the get-go instead of constructing one long list.

    from nltk.corpus import wordnet as wn
    from collections import Counter
    
    word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']
    
    hypernym_counts = Counter()
    for word in word_list:
        for synset in wn.synsets(word):
            # Tally every direct hypernym of every sense of this word.
            hypernym_counts.update(synset.hypernyms())
    
    for i, (hypernym, count) in enumerate(hypernym_counts.most_common(20), start=1):
        print(f"{i}. {hypernym.name()} ({count})")
    

    Outputs:

    1. time_period.n.01 (6)
    2. be.v.01 (3)
    3. communicate.v.02 (3)
    4. male.n.02 (3)
    5. think.v.03 (3)
    6. male_aristocrat.n.01 (2)
    7. letter.n.02 (2)
    8. thyroid_hormone.n.01 (2)
    9. experience.v.01 (2)
    10. copulate.v.01 (2)
    11. travel.v.01 (2)
    12. time_unit.n.01 (2)
    13. serve.n.01 (2)
    14. induce.v.02 (2)
    15. accept.v.03 (2)
    16. make.v.02 (2)
    17. leave.v.04 (2)
    18. give.v.03 (2)
    19. parent.n.01 (2)
    20. make.v.03 (2)
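
    Finally, since you asked for a more cohesive solution: the two parts can be fused so that no intermediate top-20 word list is needed at all. Below is a sketch that assumes, as in your question, that hamlet_spacy is an already-loaded spaCy Doc; the POS_MAP filter is an optional refinement (not in your original code) that restricts each lookup to synsets matching the token's part of speech.

    from collections import Counter
    from nltk.corpus import wordnet as wn
    
    # Optional refinement: limit each WordNet lookup to synsets whose part of
    # speech matches the spaCy tag, so a noun doesn't pull in verb senses.
    POS_MAP = {"NOUN": wn.NOUN, "VERB": wn.VERB}
    
    hypernym_counts = Counter()
    for token in hamlet_spacy:
        if token.is_stop or token.pos_ not in POS_MAP:
            continue
        # token.lemma_ usually matches WordNet entries better than token.text
        # (e.g. "loves" -> "love"); swap in token.text to mirror your code.
        for synset in wn.synsets(token.lemma_, pos=POS_MAP[token.pos_]):
            hypernym_counts.update(synset.hypernyms())
    
    for i, (hypernym, count) in enumerate(hypernym_counts.most_common(20), start=1):
        print(f"{i}. {hypernym.name()} ({count})")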