python nltk wordnet open-multilingual-wordnet

Language confusion using Open Multilingual Wordnet with NLTK


I'm trying to understand NLTK's WordNet module. Even after downloading the omw-1.4 package, when I request the synsets of a Portuguese word, the results come back with English names. When I ask which languages are loaded, only 'eng' appears, although I expected others to be listed, since I downloaded Open Multilingual Wordnet (omw-1.4). This does not happen when I request synonyms: they all appear in Portuguese.

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

    [nltk_data] Downloading package wordnet to /root/nltk_data...
    [nltk_data]   Package wordnet is already up-to-date!
    [nltk_data] Downloading package omw-1.4 to /root/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    True

from nltk.corpus import wordnet as wnet
sorted(wnet.langs())
    ['eng']

wnet.synsets("casa", lang='por')
[Synset('apartment.n.01'),
 Synset('building.n.01'),
 Synset('chalet.n.01'),
 Synset('cringle.n.01'),
 Synset('detached_house.n.01'),
 Synset('dwelling.n.01'),
 Synset('house.n.01'),
 Synset('house.n.12'),
 Synset('housing.n.01'),
 Synset('theater.n.01'),
 Synset('house.n.06'),
 Synset('manufacturer.n.01'),
 Synset('family.n.01'),
 Synset('residence.n.01'),
 Synset('domicile.n.01'),
 Synset('home.n.01'),
 Synset('home.n.06'),
 Synset('home.n.07')]

wnet.synonyms("casa", lang='por')

    # output:
    [['apartamento', 'aposentos'],
     ['edifício', 'edifícios', 'prédio'],
     ['chalé'],
     ['ilhó'],
     [],
     ['aposento',
      'domicílio',
      'habitação',
      'lar',
      'morada',
      'moradia',
      'residência'],
     ['Habitações',
      'edifícios_residenciais',
      'firma',
      'habitação',
      'teatro',
      'vivenda'],
     [],
     ['abrigo', 'abrigos', 'alojamentos'],
     ['teatro'],
     [],
     ['fabricante'],
     ['agregado_familiar',
      'classe',
      'família',
      'linhagem',
      'pessoa_da_família_que_mora_na_mesma_casa'],
     ['domicílio', 'residência'],
     ['domicílio'],
     ['Lar', 'lar'],
     ['lar'],
     ['lar']]

Solution

  • Synset('apartment.n.01') is the name of a graph node (a concept), not an English translation. The node happens to be named after an English word, but you need an extra step to get its words in English or in any other language.

    To get the human-language words for a synset, call lemma_names() with the language code, like this:

    wnet.synset('apartment.n.01').lemma_names('por')
    

    Ref: https://www.nltk.org/howto/wordnet.html
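
To apply the same idea to every synset returned for "casa", something like the following might work. It is only a sketch: the helper names lemma_names_by_synset and show_lemmas are made up for illustration and are not part of NLTK, and it assumes the wordnet and omw-1.4 packages have already been downloaded as in the question.

```python
def lemma_names_by_synset(synsets, lang):
    """Pair each synset's node name with its lemma names in `lang`."""
    # s.name() is the graph-node label (e.g. 'apartment.n.01');
    # s.lemma_names(lang) gives the actual words in the requested language.
    return [(s.name(), s.lemma_names(lang)) for s in synsets]


def show_lemmas(word, lang):
    """Print every synset of `word` next to its lemma names in `lang`."""
    # Imported here so the helper above can be used without NLTK data installed.
    from nltk.corpus import wordnet as wnet

    synsets = wnet.synsets(word, lang=lang)
    for name, lemmas in lemma_names_by_synset(synsets, lang):
        print(name, lemmas)
```

Calling show_lemmas("casa", "por") should then list each English-named node together with its Portuguese words, which is the same information wnet.synonyms() returns but with the synset attached.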