Tags: python, nltk, wordnet, lemmatization, plural

Why can't the NLTK lemmatizer lemmatize some plural words?


I have tried to lemmatize words from the Holy Quran, but some words can't be lemmatized.

Here's my sentence:

sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"

That sentence is part of my text dataset. As you can see, it contains "surahs", which is the plural form of "surah". Here is the code I tried:

from nltk.stem import WordNetLemmatizer

def lemmatize(self, ayat):
    # ayat is a list of tokens; 'v' makes WordNet treat every token as a verb
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []

    for word in ayat:
        result.append(wordnet_lemmatizer.lemmatize(word, 'v'))
    return result

When I run it and print the result, I get:

['bring', 'ten', 'surahs', 'like', u'invent', 'call', 'upon', 'assistance', 'whomever', 'besides', 'Allah', 'truthful']

The word 'surahs' isn't changed into 'surah'.

Can anybody tell me why? Thanks.


Solution

  • For most non-standard English words, the WordNet lemmatizer is not going to help much in getting the correct lemma: it returns the input unchanged when it cannot find the word in WordNet. Try a stemmer instead:

    >>> from nltk.stem import PorterStemmer
    >>> porter = PorterStemmer()
    >>> porter.stem('surahs')
    u'surah'
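
To make the difference concrete, here is a toy sketch of a dictionary-backed lemmatizer with a rule-based fallback. It is not NLTK's code: VOCAB, dictionary_lemmatize, naive_stem and lemma_or_stem are hypothetical stand-ins, and the tiny vocabulary only exists to show why an unknown word like 'surahs' comes back unchanged.

```python
# Toy illustration only: a dictionary-backed lemmatizer with a stemmer fallback.
# VOCAB and every function below are hypothetical stand-ins, not NLTK code.

VOCAB = {"book", "verse", "chapter"}  # pretend these are the only known lemmas


def dictionary_lemmatize(word):
    """Strip a plural 's' only if the result is a known lemma; else return the word unchanged."""
    if word.endswith("s") and word[:-1] in VOCAB:
        return word[:-1]
    return word


def naive_stem(word):
    """Rule-based fallback: strip a trailing 's' without consulting any dictionary."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def lemma_or_stem(word, lemmatize=dictionary_lemmatize, stem=naive_stem):
    """Keep the lemmatizer's answer if it changed the word; otherwise try the stemmer."""
    lemma = lemmatize(word)
    return lemma if lemma != word else stem(word)


for w in ["books", "surahs", "chapter"]:
    print(w, dictionary_lemmatize(w), lemma_or_stem(w))
```

With the real library you would presumably pass WordNetLemmatizer().lemmatize and PorterStemmer().stem as the two callables. Keep in mind that this fallback also stems words that were already valid lemmas, so a real stemmer may shorten some of them.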
    

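    Also, try lemmatize_sent in earthy (an NLTK wrapper, "shameless plug"), which returns (word, lemma, POS tag) triples. Note that it also leaves 'surahs' unchanged, while the porter_stem wrapper does not: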

    >>> from earthy.nltk_wrappers import lemmatize_sent
    >>> sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
    >>> lemmatize_sent(sentence)
    [('Then', 'Then', 'RB'), ('bring', 'bring', 'VBG'), ('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'), ('like', 'like', 'IN'), ('it', 'it', 'PRP'), ('that', 'that', 'WDT'), ('have', 'have', 'VBP'), ('been', u'be', 'VBN'), ('invented', u'invent', 'VBN'), ('and', 'and', 'CC'), ('call', 'call', 'VB'), ('upon', 'upon', 'NN'), ('for', 'for', 'IN'), ('assistance', 'assistance', 'NN'), ('whomever', 'whomever', 'NN'), ('you', 'you', 'PRP'), ('can', 'can', 'MD'), ('besides', 'besides', 'VB'), ('Allah', 'Allah', 'NNP'), ('if', 'if', 'IN'), ('you', 'you', 'PRP'), ('should', 'should', 'MD'), ('be', 'be', 'VB'), ('truthful', 'truthful', 'JJ')]
    
    >>> words, lemmas, tags = zip(*lemmatize_sent(sentence))
    >>> lemmas
    ('Then', 'bring', 'ten', 'surahs', 'like', 'it', 'that', 'have', u'be', u'invent', 'and', 'call', 'upon', 'for', 'assistance', 'whomever', 'you', 'can', 'besides', 'Allah', 'if', 'you', 'should', 'be', 'truthful')
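
The (word, lemma, tag) triples above suggest that a tagger runs first and the tag then selects the part of speech used for the WordNet lookup. How earthy does this internally is not shown here, so the following is only a sketch of the usual approach; penn_to_wordnet_pos is a hypothetical helper name, and defaulting to noun is an assumption.

```python
# Sketch: map Penn Treebank tags (e.g. 'NNS', 'VBN') to WordNet POS letters.
# penn_to_wordnet_pos is a hypothetical helper, not part of NLTK or earthy.

def penn_to_wordnet_pos(penn_tag, default="n"):
    """Return 'n', 'v', 'a' or 'r' based on the first two letters of a Penn tag."""
    prefix_map = {
        "NN": "n",  # nouns: NN, NNS, NNP, NNPS
        "VB": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
        "JJ": "a",  # adjectives: JJ, JJR, JJS
        "RB": "r",  # adverbs: RB, RBR, RBS
    }
    # Tags outside these four families (PRP, IN, MD, ...) fall back to the default.
    return prefix_map.get(penn_tag[:2], default)


for tag in ["NNS", "VBN", "JJ", "RB", "PRP"]:
    print(tag, penn_to_wordnet_pos(tag))
```

The returned letter can be passed as the second argument to WordNetLemmatizer.lemmatize in place of a hard-coded 'v'. This still would not turn 'surahs' into 'surah', because the word is missing from WordNet whatever the part of speech.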
    
    >>> from earthy.nltk_wrappers import pywsd_lemmatize
    >>> pywsd_lemmatize('surahs')
    'surahs'
    
    >>> from earthy.nltk_wrappers import porter_stem
    >>> porter_stem('surahs')
    u'surah'