python pandas keyword

How to remove words from one column by comparing another column in Pandas


I am trying to generate keywords automatically with a machine learning algorithm. The output also contains some unwanted keywords, and I need to remove these unwanted/redundant words from the output column algorithmically. (Unwanted keywords are words that do not exist in the input column but are still generated in the output column.) In the example below, I generate keywords from the "query_text" column, and the results are stored in the "auto generated keywords" column. A few keywords are extracted unnecessarily ('diamond' in row 1 and 'ring' in row 3), which I highlighted in red. The final "corrected keywords" column contains only the necessary words.

How can I do this algorithmically by comparing the results ("auto generated keywords") with the input ("query_text")?

 S.No                      query_text auto generated keywords corrected keywords
    1                     I want ring            diamond|ring               ring
    2             I want wedding band            band|wedding       band|wedding
    3  I look for sapphire collection           ring|sapphire           sapphire
    4          I want diamond earring         diamond|earring    diamond|earring
    5 I am looking for stackable ring          ring|stackable     ring|stackable
    6            I need gold bracelet           bracelet|gold      bracelet|gold
    7            I look for gold ring               gold|ring          gold|ring
    8            I need sapphire ring           ring|sapphire      ring|sapphire

[Image: data with highlighted extra words]


Solution

  • You can use a list comprehension over pairs of query words and auto generated keywords (zip), converting each query to a set of its whitespace-separated words for efficient membership testing:

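    # S: set of words in query_text; l: list of auto generated keywords for the same row.
    # Keep only the keywords that also occur as words in the query.
    df['corrected keywords'] = ['|'.join(w for w in l if w in S)
                                for S, l in zip(df['query_text'].apply(lambda x: set(x.split())),
                                                df['auto generated keywords'].str.split('|'))]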
    

    Output:

       S.No                       query_text auto generated keywords corrected keywords
    0     1                      I want ring            diamond|ring               ring
    1     2              I want wedding band            band|wedding       band|wedding
    2     3   I look for sapphire collection           ring|sapphire           sapphire
    3     4           I want diamond earring         diamond|earring    diamond|earring
    4     5  I am looking for stackable ring          ring|stackable     ring|stackable
    5     6             I need gold bracelet           bracelet|gold      bracelet|gold
    6     7             I look for gold ring               gold|ring          gold|ring
    7     8             I need sapphire ring           ring|sapphire      ring|sapphire
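
For reference, here is a minimal self-contained sketch of the same idea, using the first three rows of the question's data. The helper name `keep_query_words` is hypothetical, and lower-casing both sides before comparing is an added assumption (the list comprehension above compares case-sensitively); drop the `.lower()` calls if exact-case matching is wanted.

```python
import pandas as pd

# Sample data: the first three rows of the table in the question.
df = pd.DataFrame({
    "query_text": [
        "I want ring",
        "I want wedding band",
        "I look for sapphire collection",
    ],
    "auto generated keywords": ["diamond|ring", "band|wedding", "ring|sapphire"],
})


def keep_query_words(query, keywords, sep="|"):
    """Return the keywords (joined by sep) that occur as whole words in query.

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    query_words = set(query.lower().split())  # set gives O(1) membership tests
    return sep.join(k for k in keywords.split(sep) if k.lower() in query_words)


# Pair each query with its generated keywords, row by row.
df["corrected keywords"] = [
    keep_query_words(q, k)
    for q, k in zip(df["query_text"], df["auto generated keywords"])
]

print(df["corrected keywords"].tolist())
# ['ring', 'band|wedding', 'sapphire']
```

Because the query is split on whitespace, a keyword only survives if it matches a whole word of the query; punctuation attached to a word (for example "ring.") would prevent a match unless it is stripped first.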