python, pandas

Pandas Slowing way down after processing 10,000 rows


I am working on a small function to do a simple cleanup of a csv using pandas. Here is the code:

from datetime import datetime

import numpy as np
import pandas as pd

def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate', 
                                                            'PrimaryInsurancePaymentPostingDate', 
                                                            'SecondaryInsurancePaymentPostingDate', 
                                                            'TertiaryInsurancePaymentPostingDate'])
    
    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)
    
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] + 
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0

    charges = pd.concat(charges_split)
    
    charges.to_csv('csv/updated_charges.csv', index=False)

The total size of all_charges.csv is about 270,000 rows, but I am running into an issue where it processes the first 10,000 rows very quickly and then slows way down. The approximate timing is 5 seconds for the first 10,000 rows and then about 2 minutes for every thousand after that. This was an issue both when I was working on the full set as one dataframe and after I split it into 10, as you can see in my code now. I don't see anything that would be the cause of this; my code may not be 100% optimized, but I feel like I'm not doing anything incredibly stupid. My computer is also only running at 15% CPU usage and 40% memory usage, so I don't think it's a hardware issue.

I would appreciate any help I can get to figure out why this is running so slowly!


Solution

  • Dropping records one at a time from a dataframe is slow: each drop(..., inplace=True) has to rebuild the index and copy the remaining rows, so calling it inside an iterrows() loop makes the cleanup roughly quadratic in the number of rows. It is better to use pandas' filtering features instead.

    Generating a 70,000-record test CSV and processing only the first 10,000 rows with the original row-by-row approach.
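
    The answer does not show how faker_data_70000.csv was produced; a minimal sketch that could generate a comparable file, assuming the Faker library and that only a PostingDate column is needed for this demo:

    import csv
    from faker import Faker

    fake = Faker()

    # Write 70,000 rows with a single PostingDate column; the dates are spread
    # over the last two years so that some fall on either side of the cut-off
    # date used in the test below.
    with open('faker_data_70000.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['PostingDate'])
        for _ in range(70000):
            writer.writerow([fake.date_between(start_date='-2y', end_date='today')])

    The row-by-row test itself: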

    import pandas as pd
    from datetime import datetime

    def clean_charges(charges):
    
        flt_date = datetime(2024, 9, 1)
    
        count = 0
        total = 0
        # for cur_charge in charges_split:
        for index, charge in charges.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= flt_date:
                count += 1
                charges.drop(index, inplace=True)
                continue
            if total == 10000:
                break
    
    charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
    print(f'df length: {len(charges.index)}')
    clean_charges(charges)
    print(f'df length: {len(charges.index)}')
    

    Running it

    time python filter.py
    

    It takes about 40 seconds to process only the first 10,000 rows:

    df length: 70000
    0
    1000
    2000
    ...
    9000
    df length: 66694
    
    real    0m40.134s
    user    0m40.555s
    sys     0m0.096s
    

    Using pandas filtering

    import pandas as pd
    from datetime import datetime

    charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
    print(f'df length: {len(charges.index)}')

    flt_date = datetime(2024, 9, 1)
    # Keep everything up to the cut-off with a single vectorized comparison
    # (the loop above dropped rows with PostingDate >= flt_date).
    charges_flt = charges[charges['PostingDate'] <= flt_date]
    print(f'df length: {len(charges_flt.index)}')
    

    Result

    df length: 70000
    df length: 46908
    
    real    0m0.542s
    user    0m1.023s
    sys     0m0.040s
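
    The same idea extends to the rest of the cleanup in the question. Two other things to note about the original function: the assignments to charge inside iterrows() modify a temporary Series rather than the dataframe itself, and the final pd.concat(charges_split) rebuilds the dataframe from the untouched splits, so the rows dropped from charges are put back. A vectorized sketch of the whole cleanup, assuming the column names exactly as they appear in the question (including the doubled "InsuranceInsurance" spelling), might look like this:

    from datetime import datetime

    import pandas as pd

    def clean_charges_vectorized(path='csv/all_charges.csv'):
        date_cols = ['CreatedDate', 'PostingDate',
                     'PrimaryInsurancePaymentPostingDate',
                     'SecondaryInsurancePaymentPostingDate',
                     'TertiaryInsurancePaymentPostingDate']
        charges = pd.read_csv(path, parse_dates=date_cols)

        cur_month = datetime.combine(datetime.now().date().replace(day=1),
                                     datetime.min.time())

        # Drop charges posted in the current month with one boolean filter.
        # The negated comparison keeps rows with a missing PostingDate,
        # matching the behaviour of the original row-by-row check.
        charges = charges[~(charges['PostingDate'] >= cur_month)].copy()

        # Reverse payments that were posted in the current month.
        for level in ('Primary', 'Secondary', 'Tertiary'):
            posted = charges[f'{level}InsurancePaymentPostingDate'] >= cur_month
            charges.loc[posted, 'TotalBalance'] += charges.loc[posted, f'{level}InsuranceInsurancePayment']
            charges.loc[posted, f'{level}InsurancePayment'] = 0

        # Zero the secondary payment where the balance does not reconcile.
        paid = (charges['PrimaryInsuranceInsurancePayment']
                + charges['SecondaryInsuranceInsurancePayment']
                + charges['TertiaryInsuranceInsurancePayment']
                + charges['PatientPaymentAmount'])
        mismatch = (charges['AdjustedCharges'] - paid) != charges['TotalBalance']
        charges.loc[mismatch, 'SecondaryInsurancePayment'] = 0

        charges.to_csv('csv/updated_charges.csv', index=False)

    Every step works on whole columns at once, so the runtime no longer grows with each row that gets dropped or modified.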