python nlp nltk wordnet

How to get the domain of words using WordNet in Python?


How can I find the domain of words using the nltk Python module and WordNet?

Suppose I have words like (transaction, Demand Draft, cheque, passbook) and the domain for all these words is "BANK". How can we get this using nltk and WordNet in Python?

I have been trying to do this through the hypernym and hyponym relationships:

For example:

from nltk.corpus import wordnet as wn
sports = wn.synset('sport.n.01')
sports.hyponyms()
[Synset('judo.n.01'), Synset('athletic_game.n.01'), Synset('spectator_sport.n.01'), Synset('contact_sport.n.01'), Synset('cycling.n.01'), Synset('funambulism.n.01'), Synset('water_sport.n.01'), Synset('riding.n.01'), Synset('gymnastics.n.01'), Synset('sledding.n.01'), Synset('skating.n.01'), Synset('skiing.n.01'), Synset('outdoor_sport.n.01'), Synset('rowing.n.01'), Synset('track_and_field.n.01'), Synset('archery.n.01'), Synset('team_sport.n.01'), Synset('rock_climbing.n.01'), Synset('racing.n.01'), Synset('blood_sport.n.01')]

and

bark = wn.synset('bark.n.02')
bark.hypernyms()
[Synset('noise.n.01')]

Solution

  • There is no explicit domain information in Princeton WordNet, nor in NLTK's WordNet API.

    I would recommend getting a copy of the WordNet Domain resource and then linking your synsets to its domains; see http://wndomains.fbk.eu/

    After you have registered and completed the download, you will see a text file named wn-domains-3.2-20070223. It is tab-delimited: the first column holds the offset-PartOfSpeech identifier of a synset, and the second column holds that synset's domain tags, separated by spaces, e.g.

    00584282-v  military pedagogy
    00584395-v  military school university
    00584526-v  animals pedagogy
    00584634-v  pedagogy
    00584743-v  school university
    00585097-v  school university
    00585271-v  pedagogy
    00585495-v  pedagogy
    00585683-v  psychological_features
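
    To make the file format concrete, here is a minimal sketch that parses a single line of the kind shown above. The helper name parse_domain_line is made up for illustration and is not part of NLTK or the WordNet Domain distribution.

```python
def parse_domain_line(line):
    """Split one line of the domains file into (synset_id, [domain, ...]).

    Hypothetical helper: assumes exactly one tab between the
    offset-PartOfSpeech identifier and the space-separated domain tags.
    """
    ssid, doms = line.strip().split("\t")
    return ssid, doms.split()


# One of the sample lines from the file excerpt.
sample = "00584395-v\tmilitary school university\n"
print(parse_domain_line(sample))
# prints ('00584395-v', ['military', 'school', 'university'])
```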
    

    You can then use the following script to look up the domain(s) of each synset:

    from collections import defaultdict
    from nltk.corpus import wordnet as wn
    
    # Load the WordNet Domain file into two lookup tables.
    domain2synsets = defaultdict(list)
    synset2domains = defaultdict(list)
    with open('wn-domains-3.2-20070223', 'r') as fin:
        for line in fin:
            ssid, doms = line.strip().split('\t')
            doms = doms.split()
            synset2domains[ssid] = doms
            for d in doms:
                domain2synsets[d].append(ssid)
    
    # Get the domains for each synset.
    for ss in wn.all_synsets():
        # offset() is a method in NLTK v3.0 and later.
        ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
        if synset2domains[ssid]:  # not all synsets are in WordNet Domain.
            print(ss, ssid, synset2domains[ssid])
    
    # Get the first few synsets for each domain.
    for dom in sorted(domain2synsets):
        print(dom, domain2synsets[dom][:3])
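
    To get back to the original question (one domain for a group of words), one possible approach is to count the domain tags over all senses of all the words and take the most frequent one. The sketch below is only an illustration under that assumption: rank_domains and ssids_for_words are hypothetical helper names, the toy table reuses a few of the sample lines from the file excerpt, and counting every sense means ambiguous words will add noise.

```python
from collections import Counter


def rank_domains(ssids, synset2domains):
    """Count how often each domain tag occurs across the given synset IDs."""
    counts = Counter()
    for ssid in ssids:
        # .get() avoids inserting empty entries if a defaultdict is passed in.
        counts.update(synset2domains.get(ssid, []))
    return counts.most_common()


def ssids_for_words(words):
    """Collect 'offset-pos' IDs for every sense of every word.

    Needs NLTK with the WordNet corpus installed; the import is done
    lazily so that rank_domains can be used without NLTK.
    """
    from nltk.corpus import wordnet as wn
    ids = []
    for word in words:
        # WordNet stores multi-word lemmas with underscores.
        for ss in wn.synsets(word.replace(" ", "_")):
            ids.append(str(ss.offset()).zfill(8) + "-" + ss.pos())
    return ids


# Toy lookup table built from sample lines of the domains file.
toy_synset2domains = {
    "00584282-v": ["military", "pedagogy"],
    "00584634-v": ["pedagogy"],
    "00584743-v": ["school", "university"],
}
ranked = rank_domains(["00584282-v", "00584634-v", "00584743-v"], toy_synset2domains)
print(ranked[0])
# prints ('pedagogy', 2)
```

    With the real tables loaded, the call would look like rank_domains(ssids_for_words(['transaction', 'cheque', 'passbook']), synset2domains); whether the top tag matches the domain you expect depends on the coverage of the resource.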
    

    Also have a look at wn-affect, which is part of the WordNet Domain resource and is very useful for disambiguating words by sentiment.


    NLTK v3.0 ships with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/). Since the French synsets share the same offset IDs, you can simply use the WordNet Domain file as a cross-lingual resource. The French lemma names can be accessed like this:

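    # Get the domains and French lemma names for each synset.
    # If 'fre' is rejected by your installation, check wn.langs() for the available codes.
    for ss in wn.all_synsets():
        ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
        if synset2domains[ssid]:  # not all synsets are in WordNet Domain.
            print(ss, ss.lemma_names('fre'), ssid, synset2domains[ssid])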
    

    Note that recent versions of NLTK turned synset properties into "get" methods, e.g. Synset.offset became Synset.offset().
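
    If the same script has to run against both styles of the API, one option is to check whether the attribute is callable before using it. This is a minimal sketch, not something NLTK provides: synset_key is a hypothetical helper, and the two small classes only imitate the old and new Synset interfaces so the example runs without NLTK.

```python
def synset_key(ss):
    """Build the 'offset-pos' key, whether offset/pos are attributes or methods."""
    offset = ss.offset() if callable(ss.offset) else ss.offset
    pos = ss.pos() if callable(ss.pos) else ss.pos
    return str(offset).zfill(8) + "-" + pos


class OldStyleSynset:
    """Stand-in for a synset that exposes plain attributes."""
    offset = 584282
    pos = "v"


class NewStyleSynset:
    """Stand-in for a synset that exposes getter methods."""
    def offset(self):
        return 584282

    def pos(self):
        return "v"


print(synset_key(OldStyleSynset()))
print(synset_key(NewStyleSynset()))
# both lines print 00584282-v
```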