Suppose I have two data frames, GL_df and RE_df:
GL_ID | GL_ACCNO | GL_DATE | GL_CRDR | GL_AMOUNT | ISMATCHED | MATCH_TABLE | MATCH_ID |
---|---|---|---|---|---|---|---|
1175595887 | 0004366490004 | 2022-03-14 | C | 17482.12 | 0 | NULL | NULL |
1175595893 | 0004366490004 | 2022-03-14 | D | -91.22 | 0 | NULL | NULL |
1175595897 | 0004366490004 | 2022-03-14 | D | -18.24 | 0 | NULL | NULL |
1179466130 | 0004366490004 | 2022-03-22 | D | -400000.00 | 0 | NULL | NULL |
1179466158 | 0004366490004 | 2022-03-22 | D | -500000.00 | 0 | NULL | NULL |
RE_ID | RE_ACCNO | RE_DATE | RE_CRDR | RE_AMOUNT | ISMATCHED |
---|---|---|---|---|---|
1261337 | 0004366490004 | 2022-03-22 | C | 500000.00 | 0 |
1261342 | 0004366490004 | 2022-03-22 | D | -44707.99 | 0 |
1261343 | 0004366490004 | 2022-03-22 | D | -16226.15 | 0 |
1261346 | 0004366490004 | 2022-03-22 | D | -17338.43 | 0 |
1261348 | 0004366490004 | 2022-03-22 | C | 500000.00 | 0 |
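For anyone who wants to run the snippets below, here is a minimal sketch that rebuilds the two sample frames from the tables above (parsing the dates matters, since the matching code uses the `.dt` accessor):

```python
import pandas as pd

GL_df = pd.DataFrame({
    'GL_ID': [1175595887, 1175595893, 1175595897, 1179466130, 1179466158],
    'GL_ACCNO': ['0004366490004'] * 5,
    'GL_DATE': pd.to_datetime(['2022-03-14', '2022-03-14', '2022-03-14',
                               '2022-03-22', '2022-03-22']),
    'GL_CRDR': ['C', 'D', 'D', 'D', 'D'],
    'GL_AMOUNT': [17482.12, -91.22, -18.24, -400000.00, -500000.00],
    'ISMATCHED': [0] * 5,
    'MATCH_TABLE': [None] * 5,
    'MATCH_ID': [None] * 5,
})

RE_df = pd.DataFrame({
    'RE_ID': [1261337, 1261342, 1261343, 1261346, 1261348],
    'RE_ACCNO': ['0004366490004'] * 5,
    'RE_DATE': pd.to_datetime(['2022-03-22'] * 5),
    'RE_CRDR': ['C', 'D', 'D', 'D', 'C'],
    'RE_AMOUNT': [500000.00, -44707.99, -16226.15, -17338.43, 500000.00],
    'ISMATCHED': [0] * 5,
})
```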
Given the above, I first generate every combination of the unmatched IDs from each frame (capped by Max_per_com), keeping only combinations whose amounts do not sum to zero:
from itertools import combinations
import pandas as pd
import numpy as np

Max_per_com = 4

# Unmatched GL entries as {GL_ID: GL_AMOUNT}
dict_data = GL_df.loc[GL_df['ISMATCHED']==0].set_index('GL_ID')['GL_AMOUNT'].to_dict()
# All ID combinations whose amounts do not sum to zero; note range(Max_per_com)
# yields sizes 0..Max_per_com-1 (the empty tuple sums to 0 and is filtered out),
# so use range(1, Max_per_com + 1) if sizes up to 4 are intended
GL_per_com_list = [i for j in range(Max_per_com) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]

# Same for the unmatched RE entries
dict_data = RE_df.loc[RE_df['ISMATCHED']==0].set_index('RE_ID')['RE_AMOUNT'].to_dict()
RE_per_com_list = [i for j in range(Max_per_com) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]
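To make the comprehension's intent concrete, here is a toy run (hypothetical IDs and amounts, not taken from the frames above). The trailing `if sum(...)` drops the empty tuple as well as any combination that nets to exactly zero:

```python
from itertools import combinations

toy = {1: 10.0, 2: -10.0, 3: 5.0}  # hypothetical {ID: amount} pairs
out = [i for j in range(4) for i in combinations(toy, j) if sum(map(toy.get, i))]
print(out)  # [(1,), (2,), (3,), (1, 3), (2, 3), (1, 2, 3)] -- (1, 2) nets to 0 and is dropped
```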
Then I filter and sum every possible combination of GL_df IDs against every combination of RE_df IDs; whenever a combined amount falls within the Variance threshold, I mark both sides as matched and skip them in later combinations:
Variance = 1
for i in range(len(GL_per_com_list)):
    gl_mask = GL_df['GL_ID'].isin(GL_per_com_list[i])
    # Skip GL combinations that touch an already matched entry
    if 1 in GL_df.loc[gl_mask, 'ISMATCHED'].values:
        continue
    gl_months = GL_df.loc[gl_mask, 'GL_DATE'].dt.strftime('%m/%Y').unique()
    # Skip GL combinations that span more than one month
    if len(gl_months) > 1:
        continue
    for j in range(len(RE_per_com_list)):
        re_mask = RE_df['RE_ID'].isin(RE_per_com_list[j])
        # Skip RE combinations that touch an already matched entry
        if 1 in RE_df.loc[re_mask, 'ISMATCHED'].values:
            continue
        re_months = RE_df.loc[re_mask, 'RE_DATE'].dt.strftime('%m/%Y').unique()
        # Skip RE combinations that span more than one month
        if len(re_months) > 1:
            continue
        # Both sides must fall in the same month
        if gl_months[0] != re_months[0]:
            continue
        # Credits and debits carry opposite signs, so a match nets to (near) zero
        amount = abs(GL_df.loc[gl_mask, 'GL_AMOUNT'].sum() + RE_df.loc[re_mask, 'RE_AMOUNT'].sum())
        if amount <= Variance:
            GL_df.loc[gl_mask, 'ISMATCHED'] = 1
            RE_df.loc[re_mask, 'ISMATCHED'] = 1
            GL_df.loc[gl_mask, 'MATCH_TABLE'] = 'tbl$matched$entry'
            GL_df.loc[gl_mask, 'MATCH_ID'] = str(list(RE_per_com_list[j]))
            break
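To make the rule concrete with the sample rows above: the only combination that nets to within the variance is the single GL entry 1179466158 (-500000.00) against the single RE entry 1261337 (+500000.00), since abs(-500000.00 + 500000.00) = 0 <= Variance; those two rows get flagged, and every later combination containing either ID is skipped.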
In the cases above this works as expected, but it takes hours to run: every pass through the nested loops re-filters both DataFrames several times, which is expensive. I want to bring the runtime down to seconds or minutes.
Thank God, and big thanks to myself!
I went through the internet and got an idea: to process millions of records efficiently, move the data out of pandas into plain Python dictionaries (JSON records). So I converted each data frame to JSON, ran the same logic on the resulting Python objects, and the processing time dropped to seconds.
import json

# Pre-compute the month key as a string column; to_json serializes the raw
# datetime columns as epoch milliseconds, so the comparison key must be built first
GL_df['G_Date'] = GL_df['GL_DATE'].dt.strftime('%Y-%m')
RE_df['R_Date'] = RE_df['RE_DATE'].dt.strftime('%Y-%m')

# Serialize each frame to JSON, then parse back into lists of plain Python dicts
GL_dict = json.loads(GL_df.to_json(orient='records'))
RE_dict = json.loads(RE_df.to_json(orient='records'))
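After json.loads, each element is a plain dict. For the first GL row above it looks roughly like this (the GL_DATE value is an assumption based on to_json's default epoch-millisecond date format):

```python
# One element of GL_dict (illustrative; GL_DATE shown as epoch ms for 2022-03-14 UTC)
{'GL_ID': 1175595887, 'GL_ACCNO': '0004366490004', 'GL_DATE': 1647216000000,
 'GL_CRDR': 'C', 'GL_AMOUNT': 17482.12, 'ISMATCHED': 0,
 'MATCH_TABLE': None, 'MATCH_ID': None, 'G_Date': '2022-03'}
```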
for i in range(len(GL_per_com_list)):
    gl_rows = [dt for dt in GL_dict if dt['GL_ID'] in GL_per_com_list[i]]
    # Skip GL combinations that touch an already matched entry
    if any(k['ISMATCHED'] == 1 for k in gl_rows):
        continue
    gl_months = set(k['G_Date'] for k in gl_rows)
    # Skip GL combinations that span more than one month
    if len(gl_months) > 1:
        continue
    for j in range(len(RE_per_com_list)):
        re_rows = [dt for dt in RE_dict if dt['RE_ID'] in RE_per_com_list[j]]
        # Skip RE combinations that touch an already matched entry
        if any(k['ISMATCHED'] == 1 for k in re_rows):
            continue
        re_months = set(k['R_Date'] for k in re_rows)
        # Skip RE combinations that span more than one month
        if len(re_months) > 1:
            continue
        # Both sides must fall in the same month
        if gl_months != re_months:
            continue
        # A true match nets to within the variance
        amount = abs(sum(k['GL_AMOUNT'] for k in gl_rows) + sum(k['RE_AMOUNT'] for k in re_rows))
        if amount <= Variance:
            for k in gl_rows:
                k['ISMATCHED'] = 1
                k['MATCH_TABLE'] = 'tbl$matched$entry'
                k['MATCH_ID'] = str(list(RE_per_com_list[j]))
            for k in re_rows:
                k['ISMATCHED'] = 1
            break
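If the results are needed back in pandas afterwards (an extra step, not part of my original code), the mutated record lists convert straight back:

```python
# Rebuild the frames from the mutated records and drop the helper month columns
GL_df = pd.DataFrame(GL_dict).drop(columns=['G_Date'])
RE_df = pd.DataFrame(RE_dict).drop(columns=['R_Date'])
# The date columns come back as epoch milliseconds after the JSON round trip;
# restore them with pd.to_datetime(GL_df['GL_DATE'], unit='ms') if needed
```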
With this version, the execution time is about four and a half seconds for 200 thousand records, whereas the earlier pandas filtering approach took around 13 minutes.