python nltk text-analysis lemmatization

Necessary condition to fix weird lemmas?


(Executed in a Jupyter notebook.) I'm applying lemmatization to documents that I've tokenised, and I can't help but notice that the word "us" gets lemmatized to "u" every time. That doesn't make sense from a clarity point of view and could lead readers to understand it as something else. Am I missing a condition in my POS function? How can I fix this problem?

Defining the function

from nltk import pos_tag

def penn2wordNet(penntag):
    # Map the first two letters of a Penn Treebank tag to a WordNet POS
    wordNet_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return wordNet_tag[penntag[:2]]
    except KeyError:
        # Any other tag (including PRP) falls back to noun
        return 'n'

paired_tags = []
for doc in wordTokens:
    paired_tags.append(pos_tag(doc))
    print(paired_tags)

snippet output of the code above

Lemmatizing the tokens

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

print(wordTokens[1])
lemmatized_wordTokens = []
for tagged_doc in paired_tags:
    lemmatized_wordTokens.append([wnl.lemmatize(word, pos=penn2wordNet(tag)) for word, tag in tagged_doc])
print(lemmatized_wordTokens[1])

output after lemmatization showing before and after


Solution

  • Your penn2wordNet function assigns the noun POS tag to "us", even though pos_tag(['us']) returns [('us', 'PRP')]. This makes WordNetLemmatizer treat "us" as a plural noun and strip the trailing "s", which gives "u". You have to add an additional condition to handle personal pronouns.
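
One possible way to add that condition is sketched below. It is a minimal example, not the only fix: it assumes that pronouns (tags PRP and PRP$) should simply be passed through unchanged, and it keeps the question's noun fallback for every other unmapped tag. The names penn2wordnet, lemmatize_tagged and WORDNET_TAGS are hypothetical helpers introduced for illustration. The lemmatizer is passed in as a callable so the logic does not depend on how it is constructed.

```python
# Penn Treebank prefix -> WordNet POS, same mapping as in the question.
WORDNET_TAGS = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}

def penn2wordnet(penntag):
    """Return a WordNet POS letter, or None for personal/possessive pronouns."""
    if penntag.startswith('PRP'):
        # Assumption: pronouns such as "us" should not be lemmatized at all.
        return None
    # Unmapped tags still fall back to noun, as in the question's function.
    return WORDNET_TAGS.get(penntag[:2], 'n')

def lemmatize_tagged(tagged_doc, lemmatize):
    """Lemmatize a list of (word, tag) pairs, leaving pronouns untouched.

    `lemmatize` is any callable accepting (word, pos=...),
    for example WordNetLemmatizer().lemmatize.
    """
    result = []
    for word, tag in tagged_doc:
        pos = penn2wordnet(tag)
        result.append(word if pos is None else lemmatize(word, pos=pos))
    return result

# Usage with the question's variables (requires NLTK and its WordNet data):
# from nltk.stem import WordNetLemmatizer
# wnl = WordNetLemmatizer()
# lemmatized_wordTokens = [lemmatize_tagged(doc, wnl.lemmatize) for doc in paired_tags]
```

Returning None for pronouns and checking for it at the call site keeps the tag mapping and the "skip this word" decision in one place. If other tags turn out to give odd lemmas as well, the same check can be extended to them.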