I am writing a text processing program and need to stem words for later exploratory analysis. I have to use the Porter stemmer.
I store my data in a DataFrame and have written a function to apply to it. When I apply the function, the stemming works, but the capitalisation of capitalised words (such as proper nouns) is lost.
A snippet of my code:
    from nltk.stem.porter import PorterStemmer

    def stemming(word):
        stemmer = PorterStemmer()
        word = str(word)
        if word.title():
            stemmer.stem(word).capitalize()
        elif word.isupper():
            stemmer.stem(word).upper()
        else:
            stemmer.stem(word)
        return word

    dfBody['body'] = dfBody['body'].apply(lambda x: [stemming(y) for y in x])
My result has no capitalised words.
Sample of dataset (my dataset is very large):
file body
PP3169 ['performing', 'Maker', 'USA', 'computer', 'Conference', 'NIPS']
Expected output (after applying stemming function):
file body
PP3169 ['perform', 'Make', 'USA', 'comput', 'Confer', 'NIPS']
Any advice will be greatly appreciated!
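First: you should assign the result back to word, because stemmer.stem(word) returns a new string and does not modify word in place:

    word = stemmer.stem(word).capitalize()

Second: word.title() doesn't check whether word is capitalized. It creates a capitalized copy, which is truthy for any non-empty string, so the first branch always runs. You should compare instead:

    if word == word.title():

or alternatively:

    if word[0].isupper() and word[1:].islower():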
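To make the difference concrete, here is a small check on three words from the sample data. It only uses built-in string methods; the variable names `words` and `results` are just for illustration.

```python
words = ["performing", "Maker", "USA"]

# For each word: is word.title() truthy, and does the word equal its title-cased form?
results = [(w, bool(w.title()), w == w.title()) for w in words]

for word, truthy, is_title in results:
    print(f"{word!r}: bool(word.title())={truthy}, word == word.title() -> {is_title}")
```

`bool(word.title())` is `True` for all three words, while the comparison is `True` only for `'Maker'`. Note that `word[0]` raises an `IndexError` on an empty string, so the second variant assumes non-empty tokens.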
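Corrected function:

    from nltk.stem.porter import PorterStemmer

    def stemming(word):
        stemmer = PorterStemmer()
        word = str(word)
        if word == word.title():
            # Capitalized word: stem it, then restore the leading capital
            word = stemmer.stem(word).capitalize()
        elif word.isupper():
            # All-caps word: stem it, then restore upper case
            word = stemmer.stem(word).upper()
        else:
            word = stemmer.stem(word)
        return word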
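Since the dataset is very large, one possible variation is to create the stemmer once and pass its `stem` method in, instead of building a new `PorterStemmer` for every word. The sketch below is only an illustration of that idea: `stem_keep_case`, `stem_tokens` and `fake_stem` are hypothetical names, and `fake_stem` is a stand-in so the example runs without NLTK. It is not the Porter algorithm.

```python
from typing import Callable, List


def stem_keep_case(word: str, stem: Callable[[str], str]) -> str:
    """Stem `word` with the given callable and re-apply its original casing."""
    word = str(word)
    stemmed = stem(word)
    if word.isupper():
        # All-caps word (also covers a single capital letter)
        return stemmed.upper()
    if word[:1].isupper() and word[1:].islower():
        # Leading capital followed by lower case; slicing is safe on empty strings
        return stemmed.capitalize()
    return stemmed


def stem_tokens(tokens: List[str], stem: Callable[[str], str]) -> List[str]:
    """Apply stem_keep_case to every token in a list."""
    return [stem_keep_case(token, stem) for token in tokens]


def fake_stem(word: str) -> str:
    """Illustrative stand-in only: lower-cases and strips a trailing 'ing'."""
    lowered = word.lower()
    return lowered[:-3] if lowered.endswith("ing") else lowered


print(stem_tokens(["performing", "Maker", "USA"], fake_stem))
```

With NLTK you would presumably create `stemmer = PorterStemmer()` once and call `dfBody['body'].apply(lambda tokens: stem_tokens(tokens, stemmer.stem))`; the stems you get then depend on the Porter rules, not on the stand-in above.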