I have the following code for an off-line environment:
import pandas as pd
import re
from nltk.stem import PorterStemmer
test = {'grams': ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}
def rower(x):
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    originQ = []
    for i in splitQ:
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
rower(test.grams)
Which produces this:
['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']
The first list shows the sentences before applying the PorterStemmer() function. The second list shows the sentences after applying the PorterStemmer() function.
As you can see, PorterStemmer() trims the word 'three' into 'thre' only when the word is positioned as the last word in a sentence. When 'three' is not the last word, it stays 'three'. I can't seem to figure out why it is doing this. I also worry that if I applied the rower(x) function to other sentences, it may produce similar outcomes without me noticing.
How do I prevent PorterStemmer from treating the last word differently?
The main mistake here is that you are passing a whole sentence to the stemmer instead of one word at a time. The stemmer treats the entire string ('third donkey three') as a single word, so only the end of that string is stemmed.
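A quick way to see the difference is to stem the same sentence both ways. This is only a minimal sketch; the variable names (sentence, whole, per_word) are made up for illustration and assume nothing beyond NLTK being installed:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = 'third donkey three'  # illustrative input taken from the question

# Passing the full sentence: the stemmer sees one long "word",
# so its rules only ever touch the tail of the string.
whole = ps.stem(sentence)

# Splitting first: each real word is stemmed on its own.
per_word = [ps.stem(word) for word in sentence.split()]

print(whole)     # third donkey thre
print(per_word)  # ['third', 'donkey', 'three']
```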
import pandas as pd
import re
from nltk.stem import PorterStemmer
test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}
ps = PorterStemmer()
def rower(x):
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)
rower(test.grams)
Output:
IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
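If you need the result back as one string per sentence, as in the original rower(), one option is to join the words again after stemming them individually. The helper name stem_sentence below is hypothetical, and this is a sketch rather than the only way to do it:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_sentence(tokens):
    """Stem each token separately, then rejoin them into one string."""
    return ' '.join(ps.stem(token) for token in tokens)

# Two of the already-cleaned token lists from the IN: output above.
split_q = [['first', 'value', 'one', 'two', 'three'],
           ['third', 'donkey', 'three']]

joined = [stem_sentence(tokens) for tokens in split_q]
print(joined)  # ['first valu one two three', 'third donkey three']
```

The important part is the order: stem first, join afterwards, so the stemmer never sees more than one word at a time.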
There are good explanations for why stemming drops the trailing 'e' from some words (here, 'value' becomes 'valu'). Consider using a lemmatizer if the output doesn't meet your expectations.
How to stop NLTK stemmer from removing the trailing "e"?