python nlp nltk arabic word-sense-disambiguation

Word Sense Disambiguation for Arabic text with NLTK


NLTK allows me to disambiguate text with nltk.wsd.lesk, e.g.

>>> from nltk.corpus import wordnet as wn
>>> from nltk.wsd import lesk
>>> sent = "I went to the bank to deposit money"
>>> ambiguous = "deposit"
>>> lesk(sent, ambiguous, pos='v')
Synset('deposit.v.02')

PyWSD does the same, but only for English text.


NLTK supports the Arabic WordNet from the Open Multilingual WordNet, e.g.

>>> wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')
[u'\u0623\u064e\u0648\u0652\u062f\u064e\u0639\u064e']
>>> print wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')[0]
أَوْدَعَ

Also, the synsets are indexed for Arabic:

>>> wn.synsets(u'أَوْدَعَ', lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]

But how can I disambiguate Arabic text and extract concepts from a query using NLTK?

Is it possible to use the Lesk algorithm on Arabic text through NLTK?


Solution

  • It's a little tricky but maybe this will work:

    1. Translate the sentence and the ambiguous word
    2. Use lesk on the English version of the sentence
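
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation: `fake_translate` is a hypothetical stub standing in for a real translation call, `TOY_GLOSSES` is a hand-written sense inventory rather than WordNet data, and the stopword filtering is a choice made for this sketch (NLTK's `lesk` compares the raw context tokens against the whitespace-split definition).

```python
import re

# Toy sense inventory: sense name -> English gloss.
# Illustrative glosses written for this sketch, not loaded from WordNet.
TOY_GLOSSES = {
    "deposit.v.02": "put into a bank account",
    "entrust.v.02": "put into the care or protection of someone",
    "entrust.v.01": "confer a trust upon",
}

# Small stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "the", "of", "or", "in", "into", "to", "has", "he", "put"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simplified_lesk(context, candidates, glosses):
    """Return the candidate whose gloss shares the most tokens with the context."""
    context_tokens = tokenize(context)

    def overlap(sense):
        return len(context_tokens & tokenize(glosses[sense]))

    # On ties, max() keeps the first candidate in the list.
    return max(candidates, key=overlap)


def disambiguate(source_text, candidates, translate, glosses=TOY_GLOSSES):
    """Step 1: translate to English. Step 2: run Lesk on the translation."""
    english = translate(source_text)
    return simplified_lesk(english, candidates, glosses)


def fake_translate(text):
    # Hypothetical stub: a real pipeline would call a translation service here.
    return "He has deposited the money in the bank."


candidates = ["entrust.v.02", "deposit.v.02", "entrust.v.01"]
best = disambiguate("لديه يودع المال في البنك", candidates, fake_translate)
print(best)
```

Restricting the search to `candidates` plays the same role as passing `synsets=wn.synsets(ambiguous, lang='arb')` to `nltk.wsd.lesk`: the Arabic lemma chooses the sense inventory, and the English translation only supplies the context.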

    Try:

    alvas@ubi:~$ wget -O translate.sh http://pastebin.com/raw.php?i=aHgFzmMU
    --2015-08-05 23:32:46--  http://pastebin.com/raw.php?i=aHgFzmMU
    Resolving pastebin.com (pastebin.com)... 190.93.241.15, 190.93.240.15, 141.101.112.16, ...
    Connecting to pastebin.com (pastebin.com)|190.93.241.15|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/plain]
    Saving to: ‘translate.sh’
    
        [ <=>                                                                                                                            ] 212         --.-K/s   in 0s      
    
    2015-08-05 23:32:47 (9.99 MB/s) - ‘translate.sh’ saved [212]
    
    alvas@ubi:~$ python
    Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> import nltk
    >>> from nltk.corpus import wordnet as wn
    >>> text = 'لديه يودع المال في البنك'
    >>> cmd = 'echo "{}" | bash translate.sh'.format(text)
    >>> translation = os.popen(cmd).read()
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   193    0    40  100   153     21     83  0:00:01  0:00:01 --:--:--    83
    >>> translation
    'He has deposited the money in the bank. '
    >>> ambiguous = u'أَوْدَعَ'
    >>> wn.synsets(ambiguous, lang='arb')
    [Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
    >>> tokens = translation.split()  # lesk expects a list of context tokens
    >>> nltk.wsd.lesk(tokens, '', synsets=wn.synsets(ambiguous, lang='arb'))
    Synset('entrust.v.02')
    

    But as you can see, there are many limitations: