I have a dataset with ~50k rows. There are 2 relevant columns for my problem: 'Saldo' and 'id'.
I would like to flag pairs (2 rows) where the ids match and one Saldo is the exact negative of the other (Saldo = -Saldo).
NB: an id can have multiple pairs (see id "A"), and an unmatched row can be left over, as with id "I" in the data below, where only index 12 and 13 should be flagged.
A data example is:
import pandas as pd
data = {'id': ['A', 'B', 'A', 'D', 'A', 'A', 'F', 'F', 'F', 'G', 'H', 'H', 'I','I','I'], \
'Saldo': [-2209.00, -17391.00, 2209.00, -2209.00, 2209.00, -2209.00, -17588.51, -18457.71, -104179.79, -16957.20, 23644.95, -23644.95, -2209.00, 2209.00, -2209.00]}
df = pd.DataFrame(data)
id Saldo
0 A -2209.00
1 B -17391.00
2 A 2209.00
3 D -2209.00
4 A 2209.00
5 A -2209.00
6 F -17588.51
7 F -18457.71
8 F -104179.79
9 G -16957.20
10 H 23644.95
11 H -23644.95
12 I -2209.00
13 I 2209.00
14 I -2209.00
I came up with this code:
df['Reconciled'] = ''  # start with an empty flag column
for i in range(len(df)-1):
    if df.iloc[i]['Reconciled'] == 'Reconciled':  # If row is already Reconciled, skip
        continue
    # Look for matching rows in the remaining rows
    for j in range(i+1, len(df)):
        if (df.iloc[i]['id'] == df.iloc[j]['id'] and
                df.iloc[j]['Reconciled'] == '' and
                df.iloc[i]['Saldo'] == -df.iloc[j]['Saldo']):
            # Mark both rows as reconciled
            df.at[i, 'Reconciled'] = 'Reconciled'
            df.at[j, 'Reconciled'] = 'Reconciled'
            break
It works, but it takes more than a day to process my dataset.
I think this process could be sped up with a good mask or by vectorizing. I was also thinking about sorting and grouping.
The solution would look something like this:
id Saldo Reconciled
0 A -2209.00 Reconciled
1 B -17391.00
2 A 2209.00 Reconciled
3 D -2209.00
4 A 2209.00 Reconciled
5 A -2209.00 Reconciled
6 F -17588.51
7 F -18457.71
8 F -104179.79
9 G -16957.20
10 H 23644.95 Reconciled
11 H -23644.95 Reconciled
12 I -2209.00 Reconciled
13 I 2209.00 Reconciled
14 I -2209.00
Here is another attempt I made, but it flags unpaired entries in trios as Reconciled:
# No good mask, trio bug :(
for index, row in df.iterrows():
    matching_rows = ((df.loc[index:, 'id'] == row['id'])
                     & (df['Saldo'] == -row['Saldo'])
                     & (row['Reconciled'] == ''))
You want to group your data by id, which should nudge you towards using df.groupby. Since each id can have multiple Saldo values, you also want to group by the absolute value of Saldo:
df["Saldo_abs"] = df["Saldo"].abs()
for k, v in df.groupby(["id", "Saldo_abs"]):
    print(v, end="\n\n")
I've truncated the output to show only selected groups:
id Saldo Saldo_abs
0 A -2209.0 2209.0
2 A 2209.0 2209.0
4 A 2209.0 2209.0
5 A -2209.0 2209.0
id Saldo Saldo_abs
1 B -17391.0 17391.0
...
id Saldo Saldo_abs
6 F -17588.51 17588.51
id Saldo Saldo_abs
7 F -18457.71 18457.71
id Saldo Saldo_abs
8 F -104179.79 104179.79
...
Next, you want to separate the positive and negative values in each group, and pair them up:
all_pairs = []
df["Saldo_abs"] = df["Saldo"].abs()
for (grp_id, saldo_abs), grp in df.groupby(["id", "Saldo_abs"]):
    saldos = grp["Saldo"]
    grp_pos = saldos[saldos > 0].reset_index()  # We don't care about the index when creating the pairs, so reset it
    grp_neg = saldos[saldos < 0].reset_index()
    pairs = pd.concat((grp_pos, grp_neg), axis=1)
    pairs.columns = ["Index1", "Saldo1", "Index2", "Saldo2"]
    pairs.insert(0, "id", value=grp_id)
    all_pairs.append(pairs)
This gives a list of dataframes in all_pairs, which you can concatenate into a single dataframe:
pd.concat(all_pairs).reset_index(drop=True)
which gives:
id Index1 Saldo1 Index2 Saldo2
0 A 2.0 2209.00 0 -2209.00
1 A 4.0 2209.00 5 -2209.00
2 B NaN NaN 1 -17391.00
3 D NaN NaN 3 -2209.00
4 F NaN NaN 6 -17588.51
5 F NaN NaN 7 -18457.71
6 F NaN NaN 8 -104179.79
7 G NaN NaN 9 -16957.20
8 H 10.0 23644.95 11 -23644.95
9 I 13.0 2209.00 12 -2209.00
10 I NaN NaN 14 -2209.00
After you updated your question with your desired output, it looks like you don't care about the actual pairs, only about whether each row has a pair. You can change the code in the loop to modify the original dataframe instead of creating a bunch of dataframes in all_pairs:
df["Saldo_abs"] = df["Saldo"].abs()
for (grp_id, saldo_abs), grp in df.groupby(["id", "Saldo_abs"]):
saldos = grp["Saldo"]
grp_pos = saldos[saldos > 0] # We DO care about the original index now, because that's how we know which rows to set as Reconciled
grp_neg = saldos[saldos < 0] # So don't reset index
num_pairs = min(len(grp_pos), len(grp_neg))
# The first `num_pairs` elements from grp_pos and grp_neg are paired
# So set the flag in the original dataframe
df.loc[grp_pos.index[:num_pairs], "Reconciled"] = "Reconciled"
df.loc[grp_neg.index[:num_pairs], "Reconciled"] = "Reconciled"
df.drop("Saldo_abs", axis=1, inplace=True)
df["Reconciled"].fillna("", inplace=True)
which gives your desired output:
id Saldo Reconciled
0 A -2209.00 Reconciled
1 B -17391.00
2 A 2209.00 Reconciled
3 D -2209.00
4 A 2209.00 Reconciled
5 A -2209.00 Reconciled
6 F -17588.51
7 F -18457.71
8 F -104179.79
9 G -16957.20
10 H 23644.95 Reconciled
11 H -23644.95 Reconciled
12 I -2209.00 Reconciled
13 I 2209.00 Reconciled
14 I -2209.00
While this approach still involves iteration, it only loops over the groups. Modifying the actual rows of the dataframe is done in a vectorized way, so this method will be faster than iterating over all the rows of your dataframe.
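If you want to remove the remaining Python loop as well, the pairing can be done fully vectorized. The sketch below is my own variant rather than a rewrite of the code above, and it assumes Saldo is never exactly zero: it ranks the rows of each sign within every (id, |Saldo|) group using groupby(...).cumcount() and flags a row whenever a counterpart with the same rank exists on the opposite sign, which reproduces the "first num_pairs of each sign" behaviour.
import numpy as np
import pandas as pd

df = pd.DataFrame(data)

# Group key: same id and same absolute Saldo.
df["Saldo_abs"] = df["Saldo"].abs()
sign = np.sign(df["Saldo"]).rename("sign")  # +1 / -1 (assumes no zero Saldo)

# Rank each row within its (id, |Saldo|, sign) bucket: 0, 1, 2, ...
rank = df.groupby(["id", "Saldo_abs", sign]).cumcount()

# Count how many rows of each sign share the same (id, |Saldo|) group.
keys = [df["id"], df["Saldo_abs"]]
n_pos = (sign > 0).groupby(keys).transform("sum")
n_neg = (sign < 0).groupby(keys).transform("sum")

# A row is paired exactly when its rank fits inside the opposite-sign count.
opposite = np.where(sign > 0, n_neg, n_pos)
df["Reconciled"] = np.where(rank < opposite, "Reconciled", "")
df = df.drop(columns="Saldo_abs")
On the example data this produces the same flags as the groupby loop; whether it is noticeably faster on the full 50k rows depends on how many (id, Saldo_abs) groups there are.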