Tags: python, pandas, dataframe, pairwise

Flag positive/negative pairs where id is the same


I have a dataset with ~50k rows. There are two columns relevant to my problem: 'Saldo' and 'id'.

I would like to flag pairs (2 rows) where the ids match and where one row's Saldo is the negative of the other's (Saldo = -Saldo).

A data example:

import pandas as pd
data = {'id': ['A', 'B', 'A', 'D', 'A', 'A', 'F', 'F', 'F', 'G', 'H', 'H', 'I', 'I', 'I'],
        'Saldo': [-2209.00, -17391.00, 2209.00, -2209.00, 2209.00, -2209.00, -17588.51, -18457.71,
                  -104179.79, -16957.20, 23644.95, -23644.95, -2209.00, 2209.00, -2209.00]}
df = pd.DataFrame(data)
   id      Saldo
0   A   -2209.00
1   B  -17391.00
2   A    2209.00
3   D   -2209.00
4   A    2209.00
5   A   -2209.00
6   F  -17588.51
7   F  -18457.71
8   F -104179.79
9   G  -16957.20
10  H   23644.95
11  H  -23644.95
12  I   -2209.00
13  I    2209.00
14  I   -2209.00

I came up with this code:

df['Reconciled'] = ''  # initialise the flag column first
for i in range(len(df)-1):
    if df.iloc[i]['Reconciled'] == 'Reconciled':  # if row is already reconciled, skip it
        continue
    # Look for a matching row among the remaining rows
    for j in range(i+1, len(df)):
        if (df.iloc[i]['id'] == df.iloc[j]['id'] and
                df.iloc[j]['Reconciled'] == '' and
                df.iloc[i]['Saldo'] == -df.iloc[j]['Saldo']):
            # Mark both rows as reconciled
            df.at[i, 'Reconciled'] = 'Reconciled'
            df.at[j, 'Reconciled'] = 'Reconciled'
            break

It works, but it takes more than a day to process my dataset.

I think this process could be sped up with a good mask or by vectorizing. I was also thinking about sorting and grouping.

The solution would look something like this:

   id      Saldo  Reconciled
0   A   -2209.00  Reconciled
1   B  -17391.00
2   A    2209.00  Reconciled
3   D   -2209.00
4   A    2209.00  Reconciled
5   A   -2209.00  Reconciled
6   F  -17588.51
7   F  -18457.71
8   F -104179.79
9   G  -16957.20
10  H   23644.95  Reconciled
11  H  -23644.95  Reconciled
12  I   -2209.00  Reconciled
13  I    2209.00  Reconciled
14  I   -2209.00

Here is another attempt I made, but it flags the unpaired entry in trios as Reconciled:

# No good mask, trio bug :(
for index, row in df.iterrows():
    matching_rows = ((df.loc[index:, 'id'] == row['id'])
                     & (df.loc[index:, 'Saldo'] == -row['Saldo'])
                     & (row['Reconciled'] == ''))

Solution

  • You want to group your data by id, which should nudge you towards using df.groupby. Since each id can have multiple Saldo values, you also want to group by the absolute value of Saldo:

    df["Saldo_abs"] = df["Saldo"].abs()
    for k, v in df.groupby(["id", "Saldo_abs"]):
        print(v, end="\n\n")
    

    I've truncated the output to show only selected groups

      id   Saldo  Saldo_abs
    0  A -2209.0     2209.0
    2  A  2209.0     2209.0
    4  A  2209.0     2209.0
    5  A -2209.0     2209.0
    
      id    Saldo  Saldo_abs
    1  B -17391.0    17391.0
    
    ...
    
      id     Saldo  Saldo_abs
    6  F -17588.51   17588.51
    
      id     Saldo  Saldo_abs
    7  F -18457.71   18457.71
    
      id      Saldo  Saldo_abs
    8  F -104179.79  104179.79
    
    ...
    

    Next, you want to separate the positive and negative values in each group, and pair them up:

    all_pairs = []
    df["Saldo_abs"] = df["Saldo"].abs()
    for (grp_id, saldo_abs), grp in df.groupby(["id", "Saldo_abs"]):
        saldos = grp["Saldo"]
        grp_pos = saldos[saldos > 0].reset_index() # We don't care about the index when creating the pairs, so reset the index
        grp_neg = saldos[saldos < 0].reset_index()
        pairs = pd.concat((grp_pos, grp_neg), axis=1)
        pairs.columns = ["Index1", "Saldo1", "Index2", "Saldo2"]
        pairs.insert(0, "id", value=grp_id)
        all_pairs.append(pairs)
    

    This gives a list of dataframes in all_pairs, which you can concatenate into a single dataframe:

    pd.concat(all_pairs).reset_index(drop=True)
    

    which gives:

       id  Index1    Saldo1  Index2     Saldo2
    0   A     2.0   2209.00       0   -2209.00
    1   A     4.0   2209.00       5   -2209.00
    2   B     NaN       NaN       1  -17391.00
    3   D     NaN       NaN       3   -2209.00
    4   F     NaN       NaN       6  -17588.51
    5   F     NaN       NaN       7  -18457.71
    6   F     NaN       NaN       8 -104179.79
    7   G     NaN       NaN       9  -16957.20
    8   H    10.0  23644.95      11  -23644.95
    9   I    13.0   2209.00      12   -2209.00
    10  I     NaN       NaN      14   -2209.00
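
As a side note (my addition, not part of the original answer): the Reconciled flag can also be derived directly from this pairs table, since rows where Index1 or Index2 is NaN are exactly the unpaired ones. A self-contained sketch, rebuilding all_pairs with the loop from above:

```python
import pandas as pd

data = {'id': ['A', 'B', 'A', 'D', 'A', 'A', 'F', 'F', 'F', 'G', 'H', 'H', 'I', 'I', 'I'],
        'Saldo': [-2209.00, -17391.00, 2209.00, -2209.00, 2209.00, -2209.00, -17588.51,
                  -18457.71, -104179.79, -16957.20, 23644.95, -23644.95, -2209.00, 2209.00, -2209.00]}
df = pd.DataFrame(data)

# Build the pairs table (same loop as above)
all_pairs = []
df["Saldo_abs"] = df["Saldo"].abs()
for (grp_id, saldo_abs), grp in df.groupby(["id", "Saldo_abs"]):
    saldos = grp["Saldo"]
    grp_pos = saldos[saldos > 0].reset_index()
    grp_neg = saldos[saldos < 0].reset_index()
    pairs = pd.concat((grp_pos, grp_neg), axis=1)
    pairs.columns = ["Index1", "Saldo1", "Index2", "Saldo2"]
    pairs.insert(0, "id", value=grp_id)
    all_pairs.append(pairs)

pairs_df = pd.concat(all_pairs).reset_index(drop=True)

# Keep only complete pairs (both sides present), then flag both row indices
complete = pairs_df.dropna(subset=["Index1", "Index2"])
idx = pd.concat([complete["Index1"], complete["Index2"]]).astype(int)
df["Reconciled"] = ""
df.loc[idx, "Reconciled"] = "Reconciled"
df.drop("Saldo_abs", axis=1, inplace=True)
```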
    

    After you updated your question with your desired output, it looks like you don't care about the actual pairs, only whether each row has a pair. You can change the code in the loop to modify the original dataframe instead of building a list of dataframes in all_pairs:

    df["Saldo_abs"] = df["Saldo"].abs()
    for (grp_id, saldo_abs), grp in df.groupby(["id", "Saldo_abs"]):
        saldos = grp["Saldo"]
        grp_pos = saldos[saldos > 0] # We DO care about the original index now, because that's how we know which rows to set as Reconciled
        grp_neg = saldos[saldos < 0] # So don't reset index
    
        num_pairs = min(len(grp_pos), len(grp_neg))
        # The first `num_pairs` elements from grp_pos and grp_neg are paired
        # So set the flag in the original dataframe
        df.loc[grp_pos.index[:num_pairs], "Reconciled"] = "Reconciled"
        df.loc[grp_neg.index[:num_pairs], "Reconciled"] = "Reconciled"
    
    df.drop("Saldo_abs", axis=1, inplace=True)
    df["Reconciled"] = df["Reconciled"].fillna("")  # assignment avoids the warning from calling fillna(inplace=True) on a column
    

    Which gives your desired:

       id      Saldo  Reconciled
    0   A   -2209.00  Reconciled
    1   B  -17391.00            
    2   A    2209.00  Reconciled
    3   D   -2209.00            
    4   A    2209.00  Reconciled
    5   A   -2209.00  Reconciled
    6   F  -17588.51            
    7   F  -18457.71            
    8   F -104179.79            
    9   G  -16957.20            
    10  H   23644.95  Reconciled
    11  H  -23644.95  Reconciled
    12  I   -2209.00  Reconciled
    13  I    2209.00  Reconciled
    14  I   -2209.00   
    

    While this approach still involves iteration, it only loops over the groups. Modifying the actual rows of the dataframe is done in a vectorized way, so this method will be much faster than iterating over every row of your dataframe.
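
For completeness, here is a fully loop-free variant. This is a sketch of my own (not from the original answer), assuming the same pairing rule: within each (id, |Saldo|) group, the k-th positive pairs with the k-th negative, so a row is reconciled exactly when its running count within its sign is below min(number of positives, number of negatives). Note that any Saldo of exactly zero would land on the "not positive" side here.

```python
import numpy as np
import pandas as pd

data = {'id': ['A', 'B', 'A', 'D', 'A', 'A', 'F', 'F', 'F', 'G', 'H', 'H', 'I', 'I', 'I'],
        'Saldo': [-2209.00, -17391.00, 2209.00, -2209.00, 2209.00, -2209.00, -17588.51,
                  -18457.71, -104179.79, -16957.20, 23644.95, -23644.95, -2209.00, 2209.00, -2209.00]}
df = pd.DataFrame(data)

saldo_abs = df['Saldo'].abs()
is_pos = df['Saldo'] > 0

# Per (id, |Saldo|) group: how many positive and negative entries there are
n_pos = is_pos.groupby([df['id'], saldo_abs]).transform('sum')
n_neg = (~is_pos).groupby([df['id'], saldo_abs]).transform('sum')
num_pairs = np.minimum(n_pos, n_neg)

# Running count within (id, |Saldo|, sign): the k-th positive pairs with the k-th negative
rank = df.groupby([df['id'], saldo_abs, is_pos]).cumcount()

df['Reconciled'] = np.where(rank < num_pairs, 'Reconciled', '')
```

Both groupby calls and cumcount are implemented in pandas internals, so this removes the Python-level loop over groups as well.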