python, pandas, list, tokenize, stop-words

How to tokenize the list without getting extra spaces and commas (Python)


import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'title': ['amd ryzen 7 5800x cpu processor',
                             'amd ryzen 8 5200x cpu processor',
                             'amd ryzen 5 2400x cpu processor',
                             'amd ryzen computer accessories for processor',
                             'amd ryzen cpu processor for gamers'],
                   'gen_key': ['amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer'],
                   'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                                'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                'processor, RYZEN, gamers best world, available on sale']})

So this is my dataframe. I am trying to preprocess it so that the final "elas_key" column becomes a lowercase set without stopwords, specific punctuation marks, certain objective claims, plural nouns, duplicates from "gen_key" and "title", or org names that are not in the title. I have handled some of these steps, but I am stuck at tokenization: I keep getting extra spaces and commas when tokenizing the list:

def lower_case(new_keys):
  lower = list(w.lower() for w in new_keys)
  return lower 

from nltk.corpus import stopwords

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
  stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
  return stop_removed

import re

def remove_specific_punkt(new_keys):
  punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
  return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df

After the removal of punctuation marks, I get the following table (named List1): [image of the resulting dataframe]

But then, when I run the tokenization script, I get a list of lists with extra commas and spaces added. I have tried using strip() and replace() to remove them, but nothing gives me the expected result:

from nltk.tokenize import word_tokenize

def word_tokenizing(new_keys):
  tokenized_words = [word_tokenize(i) for i in new_keys]
  return tokenized_words

df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df

The resulting table is as follows (named List2): [image of the tokenized dataframe]

Can someone please help me out with this? Also, after removing stopwords, I am getting some rows like this:

[processor, ryzen, gamers world,]

The actual list was:

[processor, ryzen, gamers best world, available on sale]

The words "available", "on", and "sale" were either stopwords or other_claims, and although they do get removed, I am left with an additional "," at the end.
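For reference, here is a small standalone sketch of where those commas come from: NLTK's word_tokenize keeps punctuation as separate tokens, so any comma that is still inside the string comes back as its own ',' token; splitting on the commas before tokenizing keeps them out entirely (the phrase below is taken from the sample row above):

from nltk.tokenize import word_tokenize

phrase = 'processor, ryzen, gamers world,'

# word_tokenize keeps punctuation as separate tokens
print(word_tokenize(phrase))
# ['processor', ',', 'ryzen', ',', 'gamers', 'world', ',']

# splitting on the commas first keeps them away from the tokenizer
print([tok for part in phrase.split(',') if part.strip() for tok in word_tokenize(part)])
# ['processor', 'ryzen', 'gamers', 'world']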

My expected output should look something like this after removing stopwords, punctuation and other_claims:

[[ryzen, 7, processor, processing]]
[[ryzen, 8, gamer, processor, processing]]
[[ryzen, 5, processor, gamers, amd]]
[[ryzen, accessories, gamers, headsets, pro, players]]
[[processor, ryzen, gamers, world]]

For example, ryzen-7 was a single word and it becomes ryzen, 7. I am able to do this if the keywords are split across multiple rows, like:

[ryzen, 7]
[processor, processing]
[gamers, world]

That way it will be easier for me to pos_tag them (see the sketch below for what I mean).
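For reference, nltk.pos_tag takes a flat list of tokens and returns (token, tag) pairs, which is why the nested lists with stray ',' tokens get in the way; a minimal sketch (the exact tags depend on the tagger):

from nltk import pos_tag

# pos_tag expects a flat list of tokens and returns (token, POS) pairs,
# e.g. [('ryzen', 'NN'), ('7', 'CD'), ('processor', 'NN'), ('processing', 'NN')]
print(pos_tag(['ryzen', '7', 'processor', 'processing']))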

Apologies if the question is confusing; I am still in the learning stage.


Solution

  • You could try the following:

    from nltk import word_tokenize
    from nltk.corpus import stopwords

    other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
    STOPS = set(stopwords.words('english') + other_claims)

    def remove_stops(words):
        # Return the filtered phrase, or fall through to None when nothing
        # is left, so that empty phrases can be dropped with .dropna().
        if (words := [word for word in words if word not in STOPS]):
            return words

    def word_tokenizing(words):
        # Flatten each row's cleaned phrases into a single list of tokens.
        return [token for word in words for token in word_tokenize(word)]

    df['elas_key'] = (
        df['elas_key'].str.lower()
        .str.split(', ').explode()                             # one phrase per row, original index kept
        .str.replace(r'[;:-]', r' ', regex=True).str.strip()   # 'ryzen-7' -> 'ryzen 7', drop ; : -
        .str.split().map(remove_stops).dropna().str.join(' ')  # drop stopwords/claims and empty phrases
        .groupby(level=0).agg(list)                            # regroup phrases by original row
        .map(word_tokenizing)
    )
    

    Result for your sample dataframe (only column elas_key):

                                                        elas_key  
    0         [ryzen, 7, processor, processing, christmas, gift]  
    1  [ryzen, 8, gamer, processor, processing, christmas, gift]  
    2        [ryzen, 5, processor, gamers, christmas, gift, amd]  
    3       [ryzen, accessories, gamers, headsets, pro, players]  
    4                          [processor, ryzen, gamers, world]
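    A note on the chain above: Series.explode keeps the original row index, which is what lets groupby(level=0) reassemble the cleaned phrases into one list per row. A tiny standalone illustration on toy data (not your dataframe):

    import pandas as pd

    s = pd.Series(['a, b', 'c']).str.split(', ').explode()
    # index 0 now appears twice (one row per phrase), index 1 once
    print(s.tolist(), s.index.tolist())            # ['a', 'b', 'c'] [0, 0, 1]

    # grouping on that index puts the phrases back into one list per original row
    print(s.groupby(level=0).agg(list).tolist())   # [['a', 'b'], ['c']]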