import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'title': ['amd ryzen 7 5800x cpu processor',
              'amd ryzen 8 5200x cpu processor',
              'amd ryzen 5 2400x cpu processor',
              'amd ryzen computer accessories for processor',
              'amd ryzen cpu processor for gamers'],
    'gen_key': ['amd, ryzen, processor, cpu, gamer'] * 5,
    'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                 'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                 'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                 'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                 'processor, RYZEN, gamers best world, available on sale']
})
This is my dataframe. I am trying to preprocess it so that the final "elas_key" is a lowercase set without stopwords, specific punctuation marks, certain objective claims, plural nouns, duplicates from "gen_key" and "title", and org names which are not in the title. I have processed some of these steps already, but I am stuck at tokenization: I keep getting extra spaces and commas when tokenizing the list:
def lower_case(new_keys):
    return [w.lower() for w in new_keys]
stop = stopwords.words('english')
other_claims = ['best','sale','available','avail','new','hurry','promotion']
stop += other_claims
def stopwords_removal(new_keys):
    stop_removed = [' '.join(word for word in x.split() if word not in stop) for x in new_keys]
    return stop_removed
def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', '', i) for i in new_keys]))
    return punkt
df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df
After the removal of the punctuation marks I get the following table (named List1).
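For example, applied to the first row's list of phrases (assuming the column already holds lists at this point), this step deletes the hyphen outright instead of replacing it with a space, so 'ryzen-7' collapses into 'ryzen7':
row = ['ryzen-7', 'best processor for processing', 'sale now for christmas gift']
# the character class [;:-] is substituted with '', so the hyphen simply disappears
print([re.sub(r'[;:-]', '', i) for i in row])
# ['ryzen7', 'best processor for processing', 'sale now for christmas gift']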
But then, when I run the tokenization script, I get a list of lists with added commas and spaces. I have tried strip() and replace() to remove them, but nothing gives the expected result:
def word_tokenizing(new_keys):
    tokenized_words = [word_tokenize(i) for i in new_keys]
    return tokenized_words
df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df
The table is as follows (named List2).
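In other words, word_tokenize is called once per phrase, so every phrase becomes its own inner list. A standalone example (illustrative values) of what each cell ends up looking like:
row = ['ryzen7', 'processor for processing', 'christmas gift']
print([word_tokenize(i) for i in row])
# [['ryzen7'], ['processor', 'for', 'processing'], ['christmas', 'gift']]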
Can someone please help me out with this? Also, after removing stopwords, I am getting some of the rows like this:
[processor, ryzen, gamers world,]
The actual list was:
[processor, ryzen, gamers best world, available on sale]
The words "available", "on" and "sale" were either stopwords or other_claims, and even though those words are getting removed, I am getting an additional "," at the end.
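I think the trailing comma might come from ' '.join() producing an empty string when every word of a phrase is filtered out; the empty string stays in the list and prints as a dangling comma. A minimal repro with an illustrative subset of my stop list:
stops_demo = {'available', 'on', 'sale', 'best'}  # illustrative subset of stop + other_claims
row = ['processor', 'ryzen', 'gamers best world', 'available on sale']
print([' '.join(w for w in x.split() if w not in stops_demo) for x in row])
# ['processor', 'ryzen', 'gamers world', '']  <- the '' shows up as the trailing comma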
My expected output should look something like this after removing stopwords, punctuation and other_claims:
[[ryzen, 7, processor, processing]]
[[ryzen, 8, gamer, processor, processing]]
[[ryzen, 5, processor, gamers, amd]]
[[ryzen, accessories, gamers, headsets, pro, players]]
[[processor, ryzen, gamers, world]]
For example, where ryzen7 was a single word, it should become the two tokens ryzen, 7. I am able to do it if the keywords are in multiple rows, like:
[ryzen, 7]
[processor, processing]
[gamers, world]
That way it will be easier for me to pos_tag them.
Apologies if the question is confusing; I am still at the learning stage.
You could try the following:
from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    # implicitly returns None when nothing survives, so .dropna() can discard the phrase
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    # flatten: tokenize each remaining phrase and collect all tokens in one list
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()                             # lowercase
    .str.split(', ').explode()                             # one phrase per row
    .str.replace(r'[;:-]', ' ', regex=True).str.strip()    # punctuation -> space, so ryzen-7 -> 'ryzen 7'
    .str.split().map(remove_stops).dropna().str.join(' ')  # drop stops/claims and now-empty phrases
    .groupby(level=0).agg(list)                            # collapse back to one list per original row
    .map(word_tokenizing)                                  # flat token list per row
)
Result for your sample dataframe (only column elas_key):
elas_key
0 [ryzen, 7, processor, processing, christmas, gift]
1 [ryzen, 8, gamer, processor, processing, christmas, gift]
2 [ryzen, 5, processor, gamers, christmas, gift, amd]
3 [ryzen, accessories, gamers, headsets, pro, players]
4 [processor, ryzen, gamers, world]
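Two details are doing the work here: [;:-] is replaced with a space instead of being deleted, which is why ryzen-7 ends up as the two tokens ryzen and 7, and remove_stops falls through to return None when a phrase loses all of its words, so .dropna() removes the empty group entirely (this is what gets rid of your trailing comma). For example:
print(remove_stops(['available', 'on', 'sale']))  # None -> dropped by .dropna()
print(remove_stops(['gamers', 'best', 'world']))  # ['gamers', 'world']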