Tags: python, python-3.x, pandas, dataframe, modin

Remove rows from one DataFrame based on conditions from another DataFrame in pandas


I have two pandas DataFrames, each containing millions of rows. I want to remove every row from the first DataFrame that contains a word from the second DataFrame, under any of these three conditions:

  1. If the word appears at the beginning of the sentence in a row
  2. If the word appears at the end of the sentence in a row
  3. If the word appears in the middle of the sentence in a row (as an exact word, not as a substring)

Example:

First Dataframe:

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence 

Second Dataframe:

Second
forth
fifth

Output Expected:

This is the first sentence
This is fifth_sentence 

Please note that both DataFrames hold millions of records. How can I process them and export the result in the most efficient way?

I tried the following, but it takes a very long time:

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)

Thanks


Solution

  • You can use the numpy.where function to create a column called 'remove' that is set to 1 when any of the conditions you outlined is satisfied. First, create a list with the values of df2.

    Condition 1: checks whether the cell value starts with any of the values in your list

    Condition 2: same as above, but checks whether the cell value ends with any of the values in your list

    Condition 3: splits each cell on spaces and checks whether any of the resulting words is in your list

    After that, you can create your new DataFrame by filtering out the rows marked with 1:

    # Imports
    import pandas as pd
    import numpy as np
    
    # Get the values from df2 in a list
    l = list(set(df2['col']))
    
    # Set conditions
    c = df['col']
    
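    # Any one of the three conditions marks the row for removal
    # (axis must be passed by keyword in pandas 2.0 and later)
    cond = (c.str.startswith(tuple(l))
            | c.str.endswith(tuple(l))
            | pd.DataFrame(c.str.split(' ').tolist()).isin(l).any(axis=1))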
    
    # Assign 1 or 0
    df['remove'] = np.where(cond,1,0)
    
    # Create the filtered DataFrame and drop the helper column
    out = (df[df['remove']!=1]).drop(['remove'],axis=1)
    

    out prints:

                              col
    0  This is the first sentence
    4      This is fifth_sentence
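
If whole-word matching is what is wanted, the three conditions collapse into one: a word at the start, in the middle, or at the end of a sentence is simply one of its whitespace-separated tokens. The following is a minimal sketch of that idea, not the method used above. The helper name has_bad_word is hypothetical, and the sketch assumes that every cell in 'col' is a string, that matching is case-sensitive, and that words are separated by whitespace only (a token such as 'forth.' with trailing punctuation would not match). Unlike startswith and endswith, it does not flag prefix matches such as a sentence beginning with 'Secondary'.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# A set gives constant-time membership tests, which matters with millions of words
bad_words = set(df2["col"])


def has_bad_word(sentence: str) -> bool:
    """Return True if any whitespace-separated token of sentence is in bad_words."""
    return not bad_words.isdisjoint(sentence.split())


# Keep only the rows that contain none of the listed words
out = df[~df["col"].map(has_bad_word)]
print(out)
```

Each sentence is split once and each token costs one hash lookup, so the work grows with the total number of tokens rather than with the number of sentences multiplied by the number of words.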
    

    References:

    Pandas Row Select Where String Starts With Any Item In List

    check if a columns contains any str from list

    Dataframes used:

    >>> df.to_dict()
    
    {'col': {0: 'This is the first sentence',
      1: 'Second this is another sentence',
      2: 'This is the third sentence forth',
      3: 'This is fifth sentence',
      4: 'This is fifth_sentence'}}
    
    >>> df2.to_dict()
    
    {'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
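
Since the question also asks about processing and exporting millions of records, one option is to avoid loading the whole sentence file into memory and stream it line by line instead. The sketch below is an assumption-laden illustration rather than a benchmarked solution: the function name filter_file is hypothetical, and it assumes one word per line in the words file, one sentence per line in the sentence file, and case-sensitive whole-word matching on whitespace-separated tokens. The demonstration writes the question's sample data to temporary files.

```python
import os
import tempfile


def filter_file(sentences_path, words_path, out_path):
    """Copy sentences_path to out_path, skipping lines that contain a listed word.

    Returns the number of lines written.
    """
    # Load the words into a set for constant-time lookups (assumes one word per line)
    with open(words_path, encoding="utf-8") as f:
        bad_words = {line.strip() for line in f if line.strip()}

    kept = 0
    with open(sentences_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Keep the line only if none of its tokens is a listed word
            if bad_words.isdisjoint(line.split()):
                dst.write(line)
                kept += 1
    return kept


# Small demonstration with temporary files and the data from the question
with tempfile.TemporaryDirectory() as tmp:
    words_file = os.path.join(tmp, "words.txt")
    sentences_file = os.path.join(tmp, "sentences.txt")
    filtered_file = os.path.join(tmp, "filtered.txt")

    with open(words_file, "w", encoding="utf-8") as f:
        f.write("Second\nforth\nfifth\n")
    with open(sentences_file, "w", encoding="utf-8") as f:
        f.write(
            "This is the first sentence\n"
            "Second this is another sentence\n"
            "This is the third sentence forth\n"
            "This is fifth sentence\n"
            "This is fifth_sentence\n"
        )

    kept = filter_file(sentences_file, words_file, filtered_file)
    with open(filtered_file, encoding="utf-8") as f:
        result = f.read().splitlines()

print(kept, result)
```

Only the set of words has to fit in memory with this approach; the sentences are read and written one at a time.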