python-3.x, pandas, dataframe, combinations, nested-for-loop

How to speed up a nested loop over hundreds of thousands of record combinations in Python?


Consider two data frames, GL_df and RE_df:

GL_ID       GL_ACCNO       GL_DATE     GL_CRDR  GL_AMOUNT   ISMATCHED  MATCH_TABLE  MATCH_ID
1175595887  0004366490004  2022-03-14  C          17482.12  0          NULL         NULL
1175595893  0004366490004  2022-03-14  D            -91.22  0          NULL         NULL
1175595897  0004366490004  2022-03-14  D            -18.24  0          NULL         NULL
1179466130  0004366490004  2022-03-22  D        -400000.00  0          NULL         NULL
1179466158  0004366490004  2022-03-22  D        -500000.00  0          NULL         NULL

RE_ID    RE_ACCNO       RE_DATE     RE_CRDR  RE_AMOUNT  ISMATCHED
1261337  0004366490004  2022-03-22  C        500000.00  0
1261342  0004366490004  2022-03-22  D        -44707.99  0
1261343  0004366490004  2022-03-22  D        -16226.15  0
1261346  0004366490004  2022-03-22  D        -17338.43  0
1261348  0004366490004  2022-03-22  C        500000.00  0
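
For reference, the sample frames above can be built like this (dates parsed as datetimes, which the .dt accessor below relies on):

import pandas as pd

# Sample data from the tables above; GL_DATE/RE_DATE parsed as datetimes.
GL_df = pd.DataFrame({
    'GL_ID': [1175595887, 1175595893, 1175595897, 1179466130, 1179466158],
    'GL_ACCNO': ['0004366490004'] * 5,
    'GL_DATE': pd.to_datetime(['2022-03-14', '2022-03-14', '2022-03-14',
                               '2022-03-22', '2022-03-22']),
    'GL_CRDR': ['C', 'D', 'D', 'D', 'D'],
    'GL_AMOUNT': [17482.12, -91.22, -18.24, -400000.00, -500000.00],
    'ISMATCHED': 0,
    'MATCH_TABLE': None,
    'MATCH_ID': None,
})

RE_df = pd.DataFrame({
    'RE_ID': [1261337, 1261342, 1261343, 1261346, 1261348],
    'RE_ACCNO': ['0004366490004'] * 5,
    'RE_DATE': pd.to_datetime(['2022-03-22'] * 5),
    'RE_CRDR': ['C', 'D', 'D', 'D', 'C'],
    'RE_AMOUNT': [500000.00, -44707.99, -16226.15, -17338.43, 500000.00],
    'ISMATCHED': 0,
})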

In the above:

  1. I have to find all possible combinations of IDs from each data frame where the ISMATCHED column equals zero:
from itertools import combinations
import pandas as pd

Max_per_com = 4  # maximum number of records per combination

# Map unmatched GL_IDs to their amounts, then build every ID combination
# of size 1..Max_per_com, skipping combinations whose amounts sum to zero.
dict_data = GL_df.loc[GL_df['ISMATCHED'] == 0].set_index('GL_ID')['GL_AMOUNT'].to_dict()
GL_per_com_list = [i for j in range(1, Max_per_com + 1) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]

# Same for the unmatched RE records.
dict_data = RE_df.loc[RE_df['ISMATCHED'] == 0].set_index('RE_ID')['RE_AMOUNT'].to_dict()
RE_per_com_list = [i for j in range(1, Max_per_com + 1) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]
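
With the sample rows above, the first few entries of GL_per_com_list would look like this (singletons first, then pairs, up to size Max_per_com):

print(GL_per_com_list[:7])
# [(1175595887,), (1175595893,), (1175595897,), (1179466130,), (1179466158,),
#  (1175595887, 1175595893), (1175595887, 1175595897)]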

  2. Pair every GL_df ID combination with every RE_df ID combination and sum their amounts; whenever the combined amount falls within the Variance level, mark both sides as matched and skip them for further combinations:

    • skip records that are already matched in either table
    • within a combination, the GL_df / RE_df dates (format '%m/%Y') must fall in a single month
    • the GL_df and RE_df months (format '%m/%Y') must also be the same
Variance = 1

for i in range(len(GL_per_com_list)):
    gl_rows = GL_df[GL_df['GL_ID'].isin(GL_per_com_list[i])]
    # Skip combinations that touch an already matched GL record.
    if 1 in gl_rows['ISMATCHED'].values:
        continue
    # All GL dates in the combination must fall in the same month.
    gl_months = gl_rows['GL_DATE'].dt.strftime('%m/%Y').unique()
    if len(gl_months) > 1:
        continue
    for j in range(len(RE_per_com_list)):
        re_rows = RE_df[RE_df['RE_ID'].isin(RE_per_com_list[j])]
        # Skip combinations that touch an already matched RE record.
        if 1 in re_rows['ISMATCHED'].values:
            continue
        # All RE dates must fall in one month, and it must equal the GL month.
        re_months = re_rows['RE_DATE'].dt.strftime('%m/%Y').unique()
        if len(re_months) > 1 or gl_months[0] != re_months[0]:
            continue
        # Credits and debits carry opposite signs, so a matching set sums to ~0.
        amount = abs(gl_rows['GL_AMOUNT'].sum() + re_rows['RE_AMOUNT'].sum())
        if amount <= Variance:
            GL_df.loc[GL_df['GL_ID'].isin(GL_per_com_list[i]), 'ISMATCHED'] = 1
            RE_df.loc[RE_df['RE_ID'].isin(RE_per_com_list[j]), 'ISMATCHED'] = 1
            GL_df.loc[GL_df['GL_ID'].isin(GL_per_com_list[i]), 'MATCH_TABLE'] = 'tbl$matched$entry'
            GL_df.loc[GL_df['GL_ID'].isin(GL_per_com_list[i]), 'MATCH_ID'] = str(list(RE_per_com_list[j]))
            break

The above works as expected, but it takes hours to run. I just want to speed it up to seconds or minutes.


Solution

  • Thank God, Big thanks to Myself

    I went through the internet and got an idea: to process millions of records efficiently, you need to convert the data into Python dictionaries or JSON data. Here I converted the data frames to JSON, ran the above logic on the plain Python objects instead of the data frames, and the processing time dropped to seconds.

    # Precompute the month keys as plain string columns: after to_json the
    # raw datetime columns come back as epoch milliseconds, so the month
    # comparisons below use these strings instead.
    GL_df['G_Date'] = GL_df['GL_DATE'].dt.strftime('%Y-%m')
    RE_df['R_Date'] = RE_df['RE_DATE'].dt.strftime('%Y-%m')
    GL_dict = GL_df.to_json(orient='records')
    RE_dict = RE_df.to_json(orient='records')
    
    # Transform json input to python objects
    import json
    GL_dict = json.loads(GL_dict)
    RE_dict = json.loads(RE_dict)
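
    Each element of GL_dict / RE_dict is now a plain dict. For the first sample GL row it looks roughly like this (GL_DATE comes back as epoch milliseconds after the JSON round trip):

    print(GL_dict[0])
    # {'GL_ID': 1175595887, 'GL_ACCNO': '0004366490004',
    #  'GL_DATE': 1647216000000, 'GL_CRDR': 'C', 'GL_AMOUNT': 17482.12,
    #  'ISMATCHED': 0, 'MATCH_TABLE': None, 'MATCH_ID': None,
    #  'G_Date': '2022-03'}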
    
    for i in range(len(GL_per_com_list)):
        gl_rows = [dt for dt in GL_dict if dt['GL_ID'] in GL_per_com_list[i]]
        # Skip combinations that touch an already matched GL record.
        if any(k['ISMATCHED'] == 1 for k in gl_rows):
            continue
        # All GL dates in the combination must fall in the same month.
        gl_months = set(k['G_Date'] for k in gl_rows)
        if len(gl_months) > 1:
            continue
        for j in range(len(RE_per_com_list)):
            re_rows = [dt for dt in RE_dict if dt['RE_ID'] in RE_per_com_list[j]]
            # Skip combinations that touch an already matched RE record.
            if any(k['ISMATCHED'] == 1 for k in re_rows):
                continue
            # All RE dates must fall in one month, and it must equal the GL month.
            re_months = set(k['R_Date'] for k in re_rows)
            if len(re_months) > 1 or gl_months != re_months:
                continue
            # Credits and debits carry opposite signs, so a matching set sums to ~0.
            amount = abs(sum(k['GL_AMOUNT'] for k in gl_rows) + sum(k['RE_AMOUNT'] for k in re_rows))
            if amount <= Variance:
                for k in gl_rows:
                    k['ISMATCHED'] = 1
                    k['MATCH_TABLE'] = 'tbl$matched$entry'
                    k['MATCH_ID'] = str(list(RE_per_com_list[j]))
                for k in re_rows:
                    k['ISMATCHED'] = 1
                break
    

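    If you need the results back as pandas objects afterwards (not shown above), the updated record dicts convert straight back:

    # Rebuild the frames from the mutated record dicts.
    GL_df = pd.DataFrame(GL_dict)
    RE_df = pd.DataFrame(RE_dict)
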
    With the above, the execution time is about four and a half seconds for 200 thousand records.

    In the earlier version, by contrast, the pandas data frame filtering took around 13 minutes.
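
    To reproduce such timings, a simple wall-clock measurement around the loop works (a sketch using time.perf_counter; exact numbers vary with data and hardware):

    import time

    start = time.perf_counter()
    # ... run the matching loop above ...
    print(f"elapsed: {time.perf_counter() - start:.2f} s")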