I am working on a small function to do a simple cleanup of a CSV using pandas. Here is the code:
import numpy as np
import pandas as pd
from datetime import datetime

def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate',
                                                              'PrimaryInsurancePaymentPostingDate',
                                                              'SecondaryInsurancePaymentPostingDate',
                                                              'TertiaryInsurancePaymentPostingDate'])
    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())
    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] +
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0
    charges = pd.concat(charges_split)
    charges.to_csv('csv/updated_charges.csv', index=False)
The total size of all_charges.csv is about 270,000 rows, but I am running into an issue where it processes the first 10,000 rows very quickly and then slows way down. Approximate timing is 5 seconds for the first 10,000 rows, then about 2 minutes for every thousand after that. This was an issue both when I was working on the full set as one dataframe and after I split it into 10, as you can see in my code now.

I don't see anything that would cause this. My code may not be 100% optimized, but I feel like I'm not doing anything incredibly stupid. My computer is also only running at 15% CPU usage and 40% memory usage, so I don't think it's a hardware issue.
I would appreciate any help I can get to figure out why this is running so slowly!
Dropping records from a dataframe one at a time is known to be slow: each drop() call rebuilds the underlying data, so a loop that drops many rows from a large frame does roughly quadratic work. It would be better to use pandas' filtering features instead.
Generating a 70,000-record CSV and processing only the first 10,000 rows:
import pandas as pd
from datetime import datetime

def clean_charges(charges):
    flt_date = datetime(2024, 9, 1)
    count = 0
    total = 0
    # for cur_charge in charges_split:
    for index, charge in charges.iterrows():
        if total % 1000 == 0:
            print(total)
        total += 1
        # Delete it from the dataframe if it's a charge from the current month
        if charge['PostingDate'] >= flt_date:
            count += 1
            charges.drop(index, inplace=True)
            continue
        if total == 10000:
            break

charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')
clean_charges(charges)
print(f'df length: {len(charges.index)}')
Running it:
time python filter.py
It takes about 40 s to process only 10k rows:
df length: 70000
0
1000
2000
...
9000
df length: 66694
real 0m40.134s
user 0m40.555s
sys 0m0.096s
Using pandas filtering instead:
charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
print(f'df length: {len(charges.index)}')
flt_date = datetime(2024, 9, 1)
# Keep only the rows posted before the cutoff (the complement of the drop condition)
charges_flt = charges[charges['PostingDate'] < flt_date]
print(f'df length: {len(charges_flt.index)}')
Result:
df length: 70000
df length: 46908
real 0m0.542s
user 0m1.023s
sys 0m0.040s