I have to match Payment Slips with Transaction data. The two data sets are not directly linked, as there is no shared ID between them. I came up with a Block_ID, a calculated value I can use to match them as closely as possible. Unfortunately, Block_ID values may repeat, so several Transactions can match the same Payment Slip while the other Payment Slips with the same Block_ID remain unmatched.
import pandas as pd
transactions = pd.DataFrame(
    [["9103cd45-95a4-4671-892b-10c1192d774e", 2.40, "2023-09-20", "20230920|2.40", None],
     ["900899fb-13eb-483d-9fc4-aa62d18e0b8a", 2.40, "2023-09-20", "20230920|2.40", None],
     ["6007e789-fa40-4983-9dc7-06560c98616b", 2.40, "2023-09-20", "20230920|2.40", None],
     ["421714d0-b6dc-4dd3-b747-6ba2fc7b63b7", 414.66, "2023-09-20", "20230920|414.66", None],
     ["8845a1ef-c625-4874-89cf-0ad83abd7ec3", 429.64, "2023-09-20", "20230920|429.64", None],
     ["64934365-c5b6-4976-9b73-ebd0605e4ad6", 1377.50, "2023-09-20", "20230920|1377.50", None],
     ["571bc685-6459-4fe9-ac1c-ea4a80a2533b", 2190.00, "2023-09-20", "20230920|2190.00", None],
     ["20876834-40a2-42e3-80af-728d7dda1c84", 2600.00, "2023-09-20", "20230920|2600.00", None],
     ["e642c6b7-d154-49d2-9863-7849ff64b8c8", 10480.02, "2023-09-20", "20230920|10480.02", None],
     ["317f414e-8570-41e3-ac4c-39a5bd20a6a4", 32856.66, "2023-09-20", "20230920|32856.66", None]],
    columns=["id", "amount", "date", "block_id", "slip_id"])

payment_slips = pd.DataFrame(
    [["7f2176dd-68ee-4b49-9f57-9e661cb3bf41", 2.40, "2023-09-20", "20230920|2.40"],
     ["21843656-15b1-4ea6-b9a8-6ad9672b4a2e", 10.80, "2023-09-20", "20230920|10.80"],
     ["b0389f34-086d-4e66-ac75-a69c215d7c9a", 206.85, "2023-09-20", "20230920|206.85"],
     ["857a161b-a8fe-4939-b951-686e73a7c5b6", 414.66, "2023-09-20", "20230920|414.66"],
     ["f6f8305c-5260-4bce-9be8-0618ed500389", 429.64, "2023-09-20", "20230920|429.64"],
     ["790d5425-ed53-4ecc-8e13-5d7726cae289", 1377.50, "2023-09-20", "20230920|1377.50"],
     ["f0885f50-6617-4fa8-a2e4-79e917cb2237", 2190.00, "2023-09-20", "20230920|2190.00"],
     ["5f24b735-8dfb-4592-bef5-6fb74d6f4e24", 2600.00, "2023-09-20", "20230920|2600.00"],
     ["717cc01c-7520-45ad-bb72-2c0bafc79c6d", 10480.02, "2023-09-20", "20230920|10480.02"],
     ["f3b31d20-9a74-43ac-aaee-56a3b87674d8", 32856.66, "2023-09-20", "20230920|32856.66"]],
    columns=["id", "amount", "date", "block_id"])
known_slips = set()
for transaction in transactions.itertuples():
    match = payment_slips.loc[
        (payment_slips['block_id'] == transaction.block_id)
        & (~payment_slips['id'].isin(known_slips))]
    if not match.empty:
        transactions.at[transaction.Index, 'slip_id'] = match.iloc[0]['id']
        known_slips.add(match.iloc[0]['id'])
In the example above, without any bookkeeping the top three Transactions would all match the first Payment Slip, which is wrong: only the first Transaction should get a match, while the others should get none. I solved this by keeping track of all matched slip IDs, so a previously matched ID cannot be paired again.
While this approach works, I know it is inefficient: I iterate over the whole transactions DataFrame and re-search the slips DataFrame on every row.
Although not shown in the example above, the data resides in a SQL database, and I would like to use pandas' read_sql, since I can batch the data with its chunksize parameter.
My research so far points me to DataFrame.merge, but it seems I would have to perform a cross join and then filter. That also looks wasteful, as it produces a Cartesian product of the frames that easily exceeds thousands of records.
I realize there may be other pandas approaches, but I am not familiar enough with the library. Is there a way to solve this problem with pandas?
From what I understand, you want to pair up payment slips with transactions by occurrence of the block ID. That means all you need to do is enumerate the occurrences of each block ID on both sides; see piRSquared's answer on "enumerate items in each group". Then you can merge on the block ID plus the occurrence number.
That would look like this:
join_key = 'block_id'

def _enum(df: pd.DataFrame) -> pd.Series:
    return df.groupby(join_key).cumcount()

result = (
    transactions
    .drop(columns=['slip_id'])  # Will be merged in instead
    .assign(_n=_enum)
    .merge(
        payment_slips
        .rename(columns={'id': 'slip_id'})
        .assign(_n=_enum),
        how='left',
    )
    # .drop(columns=['block_id', '_n'])  # Drop merge keys
)
print(result.drop(columns=['id', 'block_id'])) # To show result on one page
     amount        date  _n                               slip_id
0      2.40  2023-09-20   0  7f2176dd-68ee-4b49-9f57-9e661cb3bf41
1      2.40  2023-09-20   1                                   NaN
2      2.40  2023-09-20   2                                   NaN
3    414.66  2023-09-20   0  857a161b-a8fe-4939-b951-686e73a7c5b6
4    429.64  2023-09-20   0  f6f8305c-5260-4bce-9be8-0618ed500389
5   1377.50  2023-09-20   0  790d5425-ed53-4ecc-8e13-5d7726cae289
6   2190.00  2023-09-20   0  f0885f50-6617-4fa8-a2e4-79e917cb2237
7   2600.00  2023-09-20   0  5f24b735-8dfb-4592-bef5-6fb74d6f4e24
8  10480.02  2023-09-20   0  717cc01c-7520-45ad-bb72-2c0bafc79c6d
9  32856.66  2023-09-20   0  f3b31d20-9a74-43ac-aaee-56a3b87674d8
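To see the enumeration trick in isolation, here is a minimal sketch with toy data (the names `tx`, `slips`, `t_id` are made up for illustration, not from your frames): `cumcount` assigns 0, 1, 2, … within each `block_id` group, so a left merge on `(block_id, _n)` pairs the n-th transaction with the n-th slip and leaves any surplus duplicates unmatched.

```python
import pandas as pd

# Toy data: two transactions share block_id "A", but only one slip exists for it.
tx = pd.DataFrame({"block_id": ["A", "A", "B"], "t_id": [1, 2, 3]})
slips = pd.DataFrame({"block_id": ["A", "B"], "slip_id": ["s1", "s2"]})

# Enumerate occurrences of each block_id on both sides.
tx["_n"] = tx.groupby("block_id").cumcount()
slips["_n"] = slips.groupby("block_id").cumcount()

# Left merge on (block_id, _n): the second "A" transaction finds no slip.
out = tx.merge(slips, on=["block_id", "_n"], how="left")
print(out[["t_id", "slip_id"]])
```

Since the pairing is a single vectorized merge rather than a per-row search, it also scales far better than the loop-plus-set approach.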