python, pandas

Pandas Slowing way down after processing 10,000 rows


I am working on a small function to do a simple cleanup of a csv using pandas. Here is the code:

from datetime import datetime

import numpy as np
import pandas as pd

def clean_charges(conn, cur):
    charges = pd.read_csv('csv/all_charges.csv', parse_dates=['CreatedDate', 'PostingDate', 
                                                            'PrimaryInsurancePaymentPostingDate', 
                                                            'SecondaryInsurancePaymentPostingDate', 
                                                            'TertiaryInsurancePaymentPostingDate'])
    
    # Split charges into 10 equal sized dataframes
    num_splits = 10
    charges_split = np.array_split(charges, num_splits)
    
    cur_month = datetime.combine(datetime.now().date().replace(day=1), datetime.min.time())

    count = 0
    total = 0
    for cur_charge in charges_split:
        for index, charge in cur_charge.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= cur_month:
                count += 1
                charges.drop(index, inplace=True)
                continue
            # Delete the payments if they were applied in the current month
            if charge['PrimaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['PrimaryInsuranceInsurancePayment']
                charge['PrimaryInsurancePayment'] = 0
            if charge['SecondaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['SecondaryInsuranceInsurancePayment']
                charge['SecondaryInsurancePayment'] = 0
            if charge['TertiaryInsurancePaymentPostingDate'] >= cur_month:
                charge['TotalBalance'] = charge['TotalBalance'] + charge['TertiaryInsuranceInsurancePayment']
                charge['TertiaryInsurancePayment'] = 0
            # Delete duplicate payments
            if charge['AdjustedCharges'] - (charge['PrimaryInsuranceInsurancePayment'] + charge['SecondaryInsuranceInsurancePayment'] + 
                                            charge['TertiaryInsuranceInsurancePayment'] + charge['PatientPaymentAmount']) != charge['TotalBalance']:
                charge['SecondaryInsurancePayment'] = 0

    charges = pd.concat(charges_split)
    
    charges.to_csv('csv/updated_charges.csv', index=False)

The total size of all_charges.csv is about 270,000 rows, but I am running into an issue where it processes the first 10,000 rows very quickly and then slows way down. The approximate timing is 5 seconds for the first 10,000 rows and then about 2 minutes for every thousand after that. This was an issue both when I was working on the full set as one dataframe and after I split it into 10, as you can see in my code now. I don't see anything that would be the cause of this; my code may not be 100% optimized, but I feel like I'm not doing anything incredibly stupid. My computer is also only running at 15% CPU usage and 40% memory usage, so I don't think it's a hardware issue.

I would appreciate any help I can get to figure out why this is running so slowly!


Solution

  • Dropping records one at a time from a dataframe is slow: each drop(..., inplace=True) has to rebuild the index and copy the remaining rows, so calling it inside an iterrows() loop makes the cleanup roughly quadratic in the number of rows. It is better to use pandas' filtering features instead.

    Generating a 70,000-record test CSV and processing only the first 10,000 rows with the original row-by-row approach.
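
    The answer does not show how faker_data_70000.csv was produced; a minimal sketch that could generate a comparable file, assuming the Faker library and that only a PostingDate column is needed for this demo:

    import csv
    from faker import Faker

    fake = Faker()

    # Write 70,000 rows with a single PostingDate column; the dates are spread
    # over the last two years so that some fall on either side of the cut-off
    # date used in the test below.
    with open('faker_data_70000.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['PostingDate'])
        for _ in range(70000):
            writer.writerow([fake.date_between(start_date='-2y', end_date='today')])

    The row-by-row test itself: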

    import pandas as pd
    from datetime import datetime

    def clean_charges(charges):
    
        flt_date = datetime(2024, 9, 1)
    
        count = 0
        total = 0
        # for cur_charge in charges_split:
        for index, charge in charges.iterrows():
            if total % 1000 == 0:
                print(total)
            total += 1
            # Delete it from the dataframe if it's a charge from the current month
            if charge['PostingDate'] >= flt_date:
                count += 1
                charges.drop(index, inplace=True)
                continue
            if total == 10000:
                break
    
    charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
    print(f'df length: {len(charges.index)}')
    clean_charges(charges)
    print(f'df length: {len(charges.index)}')
    

    Running it

    time python filter.py
    

    It takes about 40 seconds to process only the first 10,000 rows:

    df length: 70000
    0
    1000
    2000
    ...
    9000
    df length: 66694
    
    real    0m40.134s
    user    0m40.555s
    sys     0m0.096s
    

    Using pandas filtering

    import pandas as pd
    from datetime import datetime

    charges = pd.read_csv('faker_data_70000.csv', parse_dates=['PostingDate'])
    print(f'df length: {len(charges.index)}')

    flt_date = datetime(2024, 9, 1)
    # Keep everything up to the cut-off with a single vectorized comparison
    # (the loop above dropped rows with PostingDate >= flt_date).
    charges_flt = charges[charges['PostingDate'] <= flt_date]
    print(f'df length: {len(charges_flt.index)}')
    

    Result

    df length: 70000
    df length: 46908
    
    real    0m0.542s
    user    0m1.023s
    sys     0m0.040s
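
    The same idea extends to the rest of the cleanup in the question. Two other things to note about the original function: the assignments to charge inside iterrows() modify a temporary Series rather than the dataframe itself, and the final pd.concat(charges_split) rebuilds the dataframe from the untouched splits, so the rows dropped from charges are put back. A vectorized sketch of the whole cleanup, assuming the column names exactly as they appear in the question (including the doubled "InsuranceInsurance" spelling), might look like this:

    from datetime import datetime

    import pandas as pd

    def clean_charges_vectorized(path='csv/all_charges.csv'):
        date_cols = ['CreatedDate', 'PostingDate',
                     'PrimaryInsurancePaymentPostingDate',
                     'SecondaryInsurancePaymentPostingDate',
                     'TertiaryInsurancePaymentPostingDate']
        charges = pd.read_csv(path, parse_dates=date_cols)

        cur_month = datetime.combine(datetime.now().date().replace(day=1),
                                     datetime.min.time())

        # Drop charges posted in the current month with one boolean filter.
        # The negated comparison keeps rows with a missing PostingDate,
        # matching the behaviour of the original row-by-row check.
        charges = charges[~(charges['PostingDate'] >= cur_month)].copy()

        # Reverse payments that were posted in the current month.
        for level in ('Primary', 'Secondary', 'Tertiary'):
            posted = charges[f'{level}InsurancePaymentPostingDate'] >= cur_month
            charges.loc[posted, 'TotalBalance'] += charges.loc[posted, f'{level}InsuranceInsurancePayment']
            charges.loc[posted, f'{level}InsurancePayment'] = 0

        # Zero the secondary payment where the balance does not reconcile.
        paid = (charges['PrimaryInsuranceInsurancePayment']
                + charges['SecondaryInsuranceInsurancePayment']
                + charges['TertiaryInsuranceInsurancePayment']
                + charges['PatientPaymentAmount'])
        mismatch = (charges['AdjustedCharges'] - paid) != charges['TotalBalance']
        charges.loc[mismatch, 'SecondaryInsurancePayment'] = 0

        charges.to_csv('csv/updated_charges.csv', index=False)

    Every step works on whole columns at once, so the runtime no longer grows with each row that gets dropped or modified.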