I have tried to lemmatize words from the Holy Quran, but some words can't be lemmatized.
Here's my sentence:
sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
That sentence is part of my txt dataset. As you can see, it contains "surahs", which is the plural form of "surah". Here is the code I tried:
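from nltk.stem import WordNetLemmatizer

def lemmatize(self, ayat):
    # ayat is expected to be a list of word tokens
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []
    for i in xrange(len(ayat)):
        # 'v' makes the lemmatizer treat every token as a verb
        result.append(wordnet_lemmatizer.lemmatize(ayat[i], 'v'))
    return result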
When I run it and print the result, I get this:
['bring', 'ten', 'surahs', 'like', u'invent', 'call', 'upon', 'assistance', 'whomever', 'besides', 'Allah', 'truthful']
The 'surahs' isn't changed into 'surah'.
Can anybody tell me why? Thanks.
For most non-standard English words, the WordNet lemmatizer is not going to help much in getting the correct lemma. Try a stemmer instead:
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('surahs')
u'surah'
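One more thing about the code in the question: it passes 'v' for every token, so every word is looked up as a verb. A minimal sketch of choosing the WordNet part of speech from a Penn Treebank tag (such as the tags returned by nltk.pos_tag) could look like the following. The helper name penn_to_wordnet is made up for illustration; the single-letter codes 'n', 'v', 'a' and 'r' are the ones WordNet uses. This alone is unlikely to fix 'surahs' if WordNet has no entry for 'surah', but it keeps nouns from being treated as verbs.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag
# to the single-letter part-of-speech code that WordNet expects.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # everything else is treated as a noun


# Tagged tokens as a POS tagger might return them.
tagged = [('surahs', 'NNS'), ('invented', 'VBN'), ('truthful', 'JJ')]
pos_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Each (word, pos) pair can then be passed on as wordnet_lemmatizer.lemmatize(word, pos).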
Also, try the lemmatize_sent in earthy (an nltk wrapper, "shameless plug"):
>>> from earthy.nltk_wrappers import lemmatize_sent
>>> sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
>>> lemmatize_sent(sentence)
[('Then', 'Then', 'RB'), ('bring', 'bring', 'VBG'), ('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'), ('like', 'like', 'IN'), ('it', 'it', 'PRP'), ('that', 'that', 'WDT'), ('have', 'have', 'VBP'), ('been', u'be', 'VBN'), ('invented', u'invent', 'VBN'), ('and', 'and', 'CC'), ('call', 'call', 'VB'), ('upon', 'upon', 'NN'), ('for', 'for', 'IN'), ('assistance', 'assistance', 'NN'), ('whomever', 'whomever', 'NN'), ('you', 'you', 'PRP'), ('can', 'can', 'MD'), ('besides', 'besides', 'VB'), ('Allah', 'Allah', 'NNP'), ('if', 'if', 'IN'), ('you', 'you', 'PRP'), ('should', 'should', 'MD'), ('be', 'be', 'VB'), ('truthful', 'truthful', 'JJ')]
>>> words, lemmas, tags = zip(*lemmatize_sent(sentence))
>>> lemmas
('Then', 'bring', 'ten', 'surahs', 'like', 'it', 'that', 'have', u'be', u'invent', 'and', 'call', 'upon', 'for', 'assistance', 'whomever', 'you', 'can', 'besides', 'Allah', 'if', 'you', 'should', 'be', 'truthful')
>>> from earthy.nltk_wrappers import pywsd_lemmatize
>>> pywsd_lemmatize('surahs')
'surahs'
>>> from earthy.nltk_wrappers import porter_stem
>>> porter_stem('surahs')
u'surah'
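The two suggestions can be combined: keep the lemma wherever the lemmatizer changed the word, and fall back to a stemmer only for plural nouns (tags NNS or NNPS) that came back unchanged. The sketch below works on (word, lemma, tag) triples shaped like the lemmatize_sent output above. The names fallback_to_stem and strip_plural_s are made up for illustration, and strip_plural_s is only a toy stand-in so that the snippet runs without NLTK; in practice you would pass something like PorterStemmer().stem instead.

```python
def fallback_to_stem(triples, stem):
    """Stem plural nouns that the lemmatizer returned unchanged."""
    result = []
    for word, lemma, tag in triples:
        # Only touch plural nouns that the lemmatizer could not reduce.
        if lemma == word and tag in ('NNS', 'NNPS'):
            lemma = stem(word)
        result.append(lemma)
    return result


def strip_plural_s(word):
    # Toy stand-in for a real stemmer (an assumption, not an NLTK function).
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


triples = [('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'),
           ('invented', 'invent', 'VBN'), ('assistance', 'assistance', 'NN')]
lemmas = fallback_to_stem(triples, strip_plural_s)
print(lemmas)  # ['ten', 'surah', 'invent', 'assistance']
```

Restricting the fallback to plural nouns leaves other unchanged words alone, so ordinary words that the lemmatizer already handled correctly are not shortened by the stemmer.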