I need to apply stemming to the Words column of the following dataframe:
D Words
0 2020-06-19 excellent
1 2020-06-19 make
2 2020-06-19 many
3 2020-06-19 game
4 2020-06-19 play
... ... ...
3042607 2020-07-28 praised
3042608 2020-07-28 playing
3042609 2020-07-28 made
3042610 2020-07-28 terms
3042611 2020-07-28 bad
I have tried to use PorterStemmer to do it as follows:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
for w in df.Words:
    print(w, " : ", ps.stem(w))
but I do not get the desired output (stemmed words). I need to keep the date (D) information, so at the end I should have a similar dataset, with the stemmed words in the Words column, something like this:
D Words
0 2020-06-19 excellent
1 2020-06-19 make
2 2020-06-19 many
3 2020-06-19 game
4 2020-06-19 play
... ... ...
3042607 2020-07-28 praise
3042608 2020-07-28 play
3042609 2020-07-28 make
3042610 2020-07-28 terms
3042611 2020-07-28 bad
Any tips will be welcomed.
When I run your code:
ps = PorterStemmer()
for w in df.Words:
    print(w, " : ", ps.stem(w))
it prints the word : stem structure correctly (according to the PorterStemmer, at least).
If you want the stem as a column in your dataframe, you'll need to create a new column by applying the ps.stem function over the whole Words column, like this:
df['stem'] = df.Words.apply(ps.stem)
This turns your dataframe into this form:
D Words stem
0 2020-06-19 excellent excel
1 2020-06-19 make make
2 2020-06-19 many mani
3 2020-06-19 game game
4 2020-06-19 play play
and so now you can use the stem column for any further analysis without dropping the rest of the data.
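With about three million rows, most words are likely to repeat, so it may be worth stemming each distinct word only once and mapping the results back onto the column. The following is a minimal sketch of that idea, not a benchmarked solution: the small dataframe is a made-up stand-in for the real one, stem_by_word is a name chosen here for illustration, and it assumes every value in Words is a string (no missing values).

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Made-up stand-in for the real dataframe, using the same column names.
df = pd.DataFrame({
    "D": ["2020-06-19", "2020-06-19", "2020-07-28", "2020-07-28"],
    "Words": ["excellent", "many", "playing", "playing"],
})

ps = PorterStemmer()

# Stem each distinct word once (assumes all values are strings).
stem_by_word = {w: ps.stem(w) for w in df["Words"].unique()}

# Look up the stem for every row; the D column is left untouched.
df["stem"] = df["Words"].map(stem_by_word)

print(df)
```

The result should match what df.Words.apply(ps.stem) produces; the only difference is that ps.stem is called once per distinct word rather than once per row.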