
How to remove punctuation and irrelevant words with stopwords (Text Mining)


The libraries I'm using are:

      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk

I have the following dataframe:

     df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells '
                                 'with clearly defined nuclei).',
                                 'The Golgi apparatus is responsible for transporting, modifying, and '
                                 'packaging proteins',
                                 'Non-foliated metamorphic rocks do not have a platy or sheet-like '
                                 'structure.',
                                 'The process of metamorphism does not melt the rocks.'],
                        'Class': ['biology', 'biology', 'geography', 'geography']})

     print(df)

                              Send                           Class
         Golgi body, membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated metamorphic rocks do not have a p...  geography
         The process of metamorphism does not melt the ...  geography

I would like to write a function that cleans the data in the 'Send' column. It should:

  1. Remove the punctuation;
  2. Remove the stop words (NLTK 'stopwords');
  3. Return a new data frame with the 'Send' column containing the "clean words".

My attempt was the following function:

      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, the return value is not quite what I want. When I run:

        Text_Process(df['Send'])

The output is:

       ['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
        'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible',  'transporting,', 
        'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
        'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
        'melt', 'rocks.']

I would like the output to be the dataframe with the modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells '
                                   'clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying '
                                   'packaging proteins',
                                   'Non foliated metamorphic rocks platy sheet like '
                                   'structure',
                                   'process metamorphism melt rocks'],
                          'Class': ['biology', 'biology', 'geography', 'geography']})

In other words, the 'Send' column should come back clean, without punctuation and without irrelevant words.

Thank you.


Solution

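  • Here is a script to clean the column. Your function received the whole column at once: iterating over a Series yields each full sentence rather than each character, so no punctuation was removed and ''.join fused the sentences together. The version below is applied to one cell at a time instead. Note that you may want to add more words to the stop word set to meet your requirements.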

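    import pandas as pd
    import string
    import re
    from nltk.corpus import stopwords
    # The corpus must be downloaded once: import nltk; nltk.download('stopwords')
    
    df = pd.DataFrame(
        {'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                  'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                  'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                  'The process of metamorphism does not melt the rocks.'],
         'Class': ['biology', 'biology', 'geography', 'geography']})
    
    # Translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    # Build the stop word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    
    def text_process(mess):
        # Split on runs of non-word characters (spaces, commas, hyphens, brackets)
        words = re.split(r'\W+', mess)
        # Remove leftover punctuation such as underscores, which \W does not match
        nopunc = [w.translate(table) for w in words]
        # Drop empty strings and stop words, then rejoin with single spaces
        return ' '.join(word for word in nopunc if word and word.lower() not in stop_words)
    
    # Apply the function to each cell of the column
    df['Send'] = df['Send'].apply(text_process)
    
    print(df)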
    

    Output:

                                                                                     Send      Class
    0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
    1               Golgi apparatus responsible transporting modifying packaging proteins    biology
    2                          Non foliated metamorphic rocks platy sheet like structure   geography
    3                                                    process metamorphism melt rocks   geography
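
  • On the note about adding more words to the stop word set: the following is only a minimal sketch of one way to do it. To keep it runnable without pandas or the NLTK corpus, it assumes a small hand-written base set; `BASE_STOP_WORDS`, `EXTRA_STOP_WORDS` and `clean` are hypothetical names that do not appear in the script above. The extra word 'body' is just an illustration, chosen because the expected output in the question omits it.

```python
import re

# Assumption: a tiny stand-in for set(stopwords.words('english')),
# used only so this sketch runs without downloading the NLTK corpus.
BASE_STOP_WORDS = {'the', 'of', 'is', 'for', 'and', 'do', 'does', 'not',
                   'have', 'a', 'or', 'with'}

# Hypothetical domain-specific additions; choose whatever is irrelevant in your data.
EXTRA_STOP_WORDS = {'body'}

# Set union keeps the base set untouched and adds the extra words.
STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def clean(text, stop_words=STOP_WORDS):
    """Split on non-word characters, then drop empty strings and stop words."""
    words = re.split(r'\W+', text)
    return ' '.join(w for w in words if w and w.lower() not in stop_words)


sentences = [
    'Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
    'The process of metamorphism does not melt the rocks.',
]
cleaned = [clean(s) for s in sentences]
print(cleaned)
```

    With NLTK installed, the base set would presumably be built as set(stopwords.words('english')) | {'body'} and passed in through something like df['Send'].apply(lambda s: clean(s, stop_words)).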