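I have two pandas DataFrames in Python, each containing millions of rows. I want to remove every row from the first DataFrame that contains a word from the second DataFrame if any of three conditions holds: the sentence starts with the word, the sentence ends with the word, or the word appears as a separate space-delimited word inside the sentence.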
Example:
First Dataframe:
This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence
Second Dataframe:
Second
forth
fifth
Output Expected:
This is the first sentence
This is fifth_sentence
Please note that both DataFrames contain millions of records. How can I process them and export the result as efficiently as possible?
I tried the following, but it takes a very long time:
import pandas as pd
import re
bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)
bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)
Thanks
You can use the numpy.where function to create a column called 'remove', which is set to 1 when any of the conditions you outlined is satisfied. First, create a list of the values in df2.
Condition 1: checks whether the cell value starts with any of the values in your list
Condition 2: same as above, but checks whether the cell value ends with any of the values in your list
Condition 3: splits each cell on spaces and checks whether any of the resulting words is in your list
After that, you can create your new dataframe by filtering out the rows marked 1:
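# Imports
import pandas as pd
import numpy as np
# Get the unique values from df2 as a list
l = list(set(df2['col']))
# Set conditions (str.startswith/endswith accept a tuple in pandas >= 1.5)
c = df['col']
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        # keep the original index so the mask aligns with df;
        # axis must be passed by keyword in recent pandas versions
        | pd.DataFrame(c.str.split(' ').tolist(), index=c.index).isin(l).any(axis=1))
# Assign 1 (remove) or 0 (keep)
df['remove'] = np.where(cond, 1, 0)
# Create the filtered dataframe and drop the helper column
out = df[df['remove'] != 1].drop(['remove'], axis=1)
out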
prints:
col
0 This is the first sentence
4 This is fifth_sentence
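With millions of rows, expanding every sentence into a wide DataFrame of tokens may use a lot of memory. The following is only a sketch of an alternative, not part of the approach above: it applies the same three conditions row by row but uses a Python set for the whole-word check. The helper name `filter_sentences` is hypothetical, and the code assumes every sentence is a plain string with no missing values.

```python
import pandas as pd

def filter_sentences(sentences: pd.Series, bad_words) -> pd.Series:
    """Return the sentences that match none of the three conditions."""
    words = set(bad_words)      # set membership for the whole-word check
    affixes = tuple(words)      # str.startswith/endswith accept a tuple

    def is_bad(sentence: str) -> bool:
        return (
            sentence.startswith(affixes)                  # condition 1
            or sentence.endswith(affixes)                 # condition 2
            or not words.isdisjoint(sentence.split(" "))  # condition 3
        )

    mask = sentences.map(is_bad)
    return sentences[~mask]

# Example data taken from the question
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

kept = filter_sentences(df["col"], df2["col"])
```

Note that the prefix and suffix checks still scan the whole tuple of words for each sentence, so only the middle-word check benefits from the set. If whole-word matching at the start and end is acceptable, comparing only the first and last token against the set would likely be cheaper, but that changes the semantics slightly.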
References:
Pandas Row Select Where String Starts With Any Item In List
check if a columns contains any str from list
Dataframes used:
>>> df.to_dict()
{'col': {0: 'This is the first sentence',
1: 'Second this is another sentence',
2: 'This is the third sentence forth',
3: 'This is fifth sentence',
4: 'This is fifth_sentence'}}
>>> df2.to_dict()
{'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
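To reproduce the example, the two dataframes can be rebuilt from the dictionaries shown above. This is a minimal sketch; the column name 'col' comes from the output above, and the default integer index is assumed.

```python
import pandas as pd

# First dataframe: the sentences to filter
df = pd.DataFrame({'col': [
    'This is the first sentence',
    'Second this is another sentence',
    'This is the third sentence forth',
    'This is fifth sentence',
    'This is fifth_sentence',
]})

# Second dataframe: the words that mark a sentence for removal
df2 = pd.DataFrame({'col': ['Second', 'forth', 'fifth']})
```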