The libraries I'm using are:
import pandas as pd
import string
from nltk.corpus import stopwords
import nltk
I have the following dataframe:
df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                            'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                            'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                            'The process of metamorphism does not melt the rocks.'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
print(df)
                                                Send      Class
0  Golgi body, membrane-bound organelle of eukary...    biology
1  The Golgi apparatus is responsible for transpo...    biology
2  Non-foliated metamorphic rocks do not have a p...  geography
3  The process of metamorphism does not melt the ...  geography
I would like to write a function that cleans the text in the 'Send' column by removing punctuation and stop words.
My attempt was the following function:
def Text_Process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
However, the result is not what I expected. When I run:
Text_Process(df['Send'])
The output is:
['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible', 'transporting,',
'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
'melt', 'rocks.']
I would like the output to be the dataframe with the modified 'Send' column:
df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
                            'Golgi apparatus responsible transporting modifying packaging proteins',
                            'Non foliated metamorphic rocks platy sheet like structure',
                            'process metamorphism melt rocks'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
In other words, the 'Send' column should be clean: no punctuation and no irrelevant words (stop words).
Thank you.
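The output in the question looks the way it does because the whole column is passed to the function at once. Iterating over a Series yields one full cell per step, not one character, so no element ever matches a punctuation character and the sentences are simply glued together. A minimal sketch of the effect, using a plain list as a stand-in for df['Send'] (the name text_process_original and the two short sentences are illustrative, not from the question):

```python
import string

def text_process_original(mess):
    # Same punctuation step as the function in the question.
    nopunc = [char for char in mess if char not in string.punctuation]
    return ''.join(nopunc)

# Stand-in for df['Send']: iterating it yields one whole sentence at a time.
sentences = ['Golgi body, membrane-bound.', 'The Golgi apparatus.']

# Whole collection passed in: each "char" is a full sentence, so nothing is
# filtered and the sentences are concatenated with no separator.
joined = text_process_original(sentences)
print(joined)

# One string at a time: iteration is per character, so punctuation is removed.
cleaned = [text_process_original(s) for s in sentences]
print(cleaned)
```

Note that even per string, deleting the hyphen turns 'membrane-bound' into 'membranebound', which is why splitting on non-word characters is a better fit for the desired output.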
Here is a script to clean the column. Note you may want to add more words to the stopword set to meet your requirements.
import pandas as pd
import string
import re
from nltk.corpus import stopwords
df = pd.DataFrame(
{'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
'The process of metamorphism does not melt the rocks.'],
'Class': ['biology', 'biology', 'geography', 'geography']})
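# Build the stop word set once (requires nltk.download('stopwords')).
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)

def text_process(mess):
    # Split on runs of non-word characters, which also drops most punctuation.
    words = re.split(r'\W+', mess)
    # \W does not match underscores, so strip any leftover punctuation.
    nopunc = [w.translate(table) for w in words]
    # Drop empty tokens and stop words, then rejoin with single spaces.
    return ' '.join(word for word in nopunc if word and word.lower() not in stop_words)

df['Send'] = df['Send'].apply(text_process)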
print(df)
Output:
                                                                                Send      Class
0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei    biology
1              Golgi apparatus responsible transporting modifying packaging proteins    biology
2                          Non foliated metamorphic rocks platy sheet like structure  geography
3                                                    process metamorphism melt rocks  geography
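As for adding more words to the stop word set: the desired output in the question drops 'body' from the first row, which is not a standard English stop word, so it would have to be added by hand. A rough sketch of one way to do that follows. BASE_STOP_WORDS is a tiny hypothetical stand-in for the NLTK list so the example runs without the corpus, and EXTRA_STOP_WORDS and clean_text are illustrative names, not part of the script above.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')); kept tiny so the
# sketch runs without downloading the NLTK corpus.
BASE_STOP_WORDS = {'the', 'of', 'is', 'for', 'and', 'with',
                   'do', 'does', 'not', 'have', 'a', 'or'}

# Extra domain-specific words to drop (an assumption based on the desired output).
EXTRA_STOP_WORDS = {'body'}

STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS

def clean_text(text, stop_words=STOP_WORDS):
    # Split on runs of non-word characters, then drop empty tokens and stop words.
    tokens = re.split(r'\W+', text)
    return ' '.join(t for t in tokens if t and t.lower() not in stop_words)

print(clean_text('Golgi body, membrane-bound organelle of eukaryotic cells '
                 '(cells with clearly defined nuclei).'))
```

With NLTK installed, the same idea should work by building the set as set(stopwords.words('english')) | EXTRA_STOP_WORDS and applying the function with df['Send'].apply(clean_text).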