Tags: python, pandas, stemming, porter-stemmer

Stemming words within a column


I need to stem the words in the Words column of this dataframe:

         D           Words
0        2020-06-19  excellent
1        2020-06-19  make
2        2020-06-19  many
3        2020-06-19  game
4        2020-06-19  play
...      ...         ...
3042607  2020-07-28  praised
3042608  2020-07-28  playing
3042609  2020-07-28  made
3042610  2020-07-28  terms
3042611  2020-07-28  bad
 

I have tried to use PorterStemmer as follows:

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
for w in df.Words: 
    print(w, " : ", ps.stem(w)) 

but I do not get the desired output (stemmed words). I need to keep the date (D) information, so I would like to apply the stemmer to the Words column and end up with a similar dataset containing the stemmed words, something like this:

         D           Words
0        2020-06-19  excellent
1        2020-06-19  make
2        2020-06-19  many
3        2020-06-19  game
4        2020-06-19  play
...      ...         ...
3042607  2020-07-28  praise
3042608  2020-07-28  play
3042609  2020-07-28  make
3042610  2020-07-28  terms
3042611  2020-07-28  bad

Any tips will be welcomed.


Solution

  • When I run your code

    ps = PorterStemmer() 
    for w in df.Words: 
        print(w, " : ", ps.stem(w)) 
    

    it prints each word : stem pair correctly (at least according to PorterStemmer).

    If you want the stems as a column in your dataframe, create a new column by applying the ps.stem function to the whole Words column, like this:

    df['stem'] = df.Words.apply(ps.stem)
    

    This turns your dataframe to this form:

        D           Words     stem
    0   2020-06-19  excellent excel
    1   2020-06-19  make      make
    2   2020-06-19  many      mani
    3   2020-06-19  game      game
    4   2020-06-19  play      play
    

    and so now you can use the stem column for any further analysis without dropping the rest of the data.
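
With about 3 million rows, many words are likely to repeat, so it may be worth stemming each distinct word only once and mapping the results back onto the column. The following is a minimal sketch of that idea, not a benchmarked solution. The small dataframe is built from the sample rows in the question, and the name stem_of is just an illustrative helper. It assumes the Words column holds strings; missing values are skipped and come out as NaN.

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Small stand-in for the real dataframe, using rows from the question.
df = pd.DataFrame({
    "D": ["2020-06-19"] * 5,
    "Words": ["excellent", "make", "many", "game", "play"],
})

ps = PorterStemmer()

# Stem every distinct word once (missing values are dropped first),
# then look the stem up for each row. stem_of is a hypothetical name.
stem_of = {w: ps.stem(w) for w in df["Words"].dropna().unique()}
df["stem"] = df["Words"].map(stem_of)

print(df)
```

If you would rather replace the original words than add a column, assign the result to df["Words"] instead of df["stem"]; the D column is left untouched either way.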