pythonpandasdataframecontains

Removing a large number of IDs from a large dataframe takes a long time


I have two dataframes df1 and df2

print(df1.shape)
(1042009, 40)

print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
      dtype='object')

print(df2.shape)
(734738, 37)

print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
      dtype='object')

I would like to remove the unique transaction_id in df2 from df1 and keep the rest.

I did the following:

Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465

noMatched = df1.loc[
    (~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False))]

My problem is that the output noMatched takes almost 5 hours to get ready. I wonder if there is a more efficient way to write this piece of code. Can the output be generated in less than 5 hours?


Solution

  • You could do it this way:

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({
        'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
        'amount': np.random.rand(1042009),
        'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
    })
    
    df2 = pd.DataFrame({
        'transaction_id': np.random.randint(1500000, 2500000, size=734738),
        'amount': np.random.rand(734738),
        'date': pd.date_range('2020-01-01', periods=734738, freq='T')
    })
    
    start_time = time.time()
    
    filtre_set = set(df2['transaction_id'].unique())
    
    noMatched = df1[~df1['transaction_id'].isin(filtre_set)]
    
    end_time = time.time()
    
    print(f"Filtered DataFrame shape: {noMatched}")
    print(f"Execution time: {end_time - start_time:.2f} seconds")
    

    which returns

    Filtered DataFrame shape:          transaction_id    amount           date_acte
    1               1231651  0.849124 2020-01-01 00:01:00
    2               1443550  0.031414 2020-01-01 00:02:00
    3               1164444  0.973699 2020-01-01 00:03:00
    4               1371353  0.554666 2020-01-01 00:04:00
    7               1072327  0.867207 2020-01-01 00:07:00
    ...                 ...       ...                 ...
    1042004         1499512  0.114861 2021-12-24 14:44:00
    1042005         1255963  0.756608 2021-12-24 14:45:00
    1042006         1203341  0.091380 2021-12-24 14:46:00
    1042007         1016687  0.153179 2021-12-24 14:47:00
    1042008         1036581  0.382781 2021-12-24 14:48:00
    
    [770625 rows x 3 columns]
    Execution time: 0.52 seconds