I have two dataframes, df1 and df2:
print(df1.shape)
(1042009, 40)
print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
dtype='object')
print(df2.shape)
(734738, 37)
print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
dtype='object')
I would like to remove from df1 every row whose transaction_id appears in df2, and keep the rest.
I did the following:
Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465
noMatched = df1.loc[
(~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False))]
My problem is that computing noMatched takes almost 5 hours. Is there a more efficient way to write this piece of code, so the output is generated in much less time?
You could do it this way. isin performs exact membership tests against a hash set, which is far faster than building one giant regex, and it also avoids the substring false positives that str.contains can produce (e.g. pattern '123' matching id 'A1234'):
import time

import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
'amount': np.random.rand(1042009),
'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='min')
})
df2 = pd.DataFrame({
'transaction_id': np.random.randint(1500000, 2500000, size=734738),
'amount': np.random.rand(734738),
'date': pd.date_range('2020-01-01', periods=734738, freq='min')
})
start_time = time.time()
filtre_set = set(df2['transaction_id'].unique())
noMatched = df1[~df1['transaction_id'].isin(filtre_set)]
end_time = time.time()
print(f"Filtered DataFrame:\n{noMatched}")
print(f"Execution time: {end_time - start_time:.2f} seconds")
which returns
Filtered DataFrame:
         transaction_id    amount           date_acte
1 1231651 0.849124 2020-01-01 00:01:00
2 1443550 0.031414 2020-01-01 00:02:00
3 1164444 0.973699 2020-01-01 00:03:00
4 1371353 0.554666 2020-01-01 00:04:00
7 1072327 0.867207 2020-01-01 00:07:00
... ... ... ...
1042004 1499512 0.114861 2021-12-24 14:44:00
1042005 1255963 0.756608 2021-12-24 14:45:00
1042006 1203341 0.091380 2021-12-24 14:46:00
1042007 1016687 0.153179 2021-12-24 14:47:00
1042008 1036581 0.382781 2021-12-24 14:48:00
[770625 rows x 3 columns]
Execution time: 0.52 seconds
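As an aside, the same anti-join can be written with merge and indicator=True, which is handy when you also want to inspect which rows did match. A minimal sketch on small hypothetical data (not the asker's actual frames):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df1 = pd.DataFrame({'transaction_id': ['a1', 'b2', 'c3', 'd4'],
                    'amount': [10.0, 20.0, 30.0, 40.0]})
df2 = pd.DataFrame({'transaction_id': ['b2', 'd4'],
                    'date': ['2020-01-01', '2020-01-02']})

# Exact anti-join with isin, as in the answer above.
no_matched_isin = df1[~df1['transaction_id'].isin(df2['transaction_id'])]

# Equivalent anti-join via a left merge: rows present only in df1
# are tagged 'left_only' in the generated _merge column.
merged = df1.merge(df2[['transaction_id']].drop_duplicates(),
                   on='transaction_id', how='left', indicator=True)
no_matched_merge = merged.loc[merged['_merge'] == 'left_only', df1.columns]

print(no_matched_isin['transaction_id'].tolist())
print(no_matched_merge['transaction_id'].tolist())
```

Both versions should keep the same rows; merge just costs a bit more than isin but exposes the match status per row.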