Below is a subset of my dataset. I am trying to clean it using the Porter stemmer available in the nltk package. I would like to drop columns whose names share the same stem: for example, 'abandon', 'abandoned', and 'abandoning' should be reduced to just 'abandoned' in my dataset. Below is the code I have tried; it prints the stemmed words/columns, but I am not sure how to drop the redundant columns. I have already tokenized the corpus and removed punctuation.
Note: I am new to Python and text mining.
Dataset Subset
{
'aaaahhhs':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'aahs':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'aamir':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'aardman':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'aaron':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'abandon':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'abandoned':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'abandoning':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'abandonment':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
},
'abandons':{
0:0,
1:0,
2:0,
3:0,
4:0,
5:0
}
}
Code so far:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
# Print the stem of every column name (each column is one word)
for w in clean_df.columns:
    print(ps.stem(w))
I think something like this does what you want:
import collections
# Build the associations between stems and column names:
stems = collections.defaultdict(list)
for column_name in clean_df.columns:
    stems[ps.stem(column_name)].append(column_name)
# For each stem, keep the first column name in lexicographical order:
new_columns = [sorted(columns)[0] for columns in stems.values()]
# Create the new `DataFrame` containing only the selected columns:
new_df = clean_df[new_columns]
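To see the grouping step in isolation, here is a minimal sketch that runs without pandas or nltk. The names `toy_stem` and `pick_representatives` are hypothetical, made up for this illustration. `toy_stem` is only a crude stand-in for `ps.stem`: it strips a few suffixes and is not the Porter algorithm.

```python
import collections


def toy_stem(word):
    # Hypothetical stand-in for ps.stem: strips a few common suffixes only.
    # This is NOT the Porter algorithm; it just keeps the sketch dependency-free.
    for suffix in ("ment", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def pick_representatives(column_names, stem):
    # Group the column names by their stem (stems keep first-seen order).
    groups = collections.defaultdict(list)
    for name in column_names:
        groups[stem(name)].append(name)
    # Keep the lexicographically smallest name from each group.
    return [min(names) for names in groups.values()]


columns = ["aaron", "abandon", "abandoned", "abandoning", "abandonment", "abandons"]
kept = pick_representatives(columns, toy_stem)
print(kept)  # ['aaron', 'abandon']
```

With the objects from the question, the same helper would presumably be called as `clean_df[pick_representatives(clean_df.columns, ps.stem)]`. Note that `min` selects 'abandon' here, not 'abandoned'; if a different representative is wanted, only that one selection rule needs to change.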