pythonpandasperformanceapache-spark

What is the fastest way to calculate a daily balance with compound interest in Pandas or Spark?


I have a DataFrame (DF) with deposits and withdrawals aggregated by day, and I want to know what is the fastest way to calculate the balance for each day. Because it must be able to scale. Answers in both Pandas and Spark are welcome! Here is an example of how the input DF looks like:

Input

date deposit withdrawal
2024-01-01 100.00 0.00
2024-01-02 0.00 0.00
2024-01-03 50.00 30.00
2024-01-04 0.00 0.00
2024-01-05 0.00 200.00
2024-01-06 20.00 0.00
2024-01-07 20.00 0.00
2024-01-08 0.00 0.00

These deposits and withdrawals are from an investment account that yields 10% per day. Unless the balance is negative. In this case, the daily return must be zero. The pseudo-code calculations to get the daily_return and balance columns are:

Movements = Previous day balance + Deposit - Withdrawal
Interest = 0.1 if Movements > 0 else 0
Daily return = Movements * Interest
Balance = Movements + Daily return

And below is an example of the desired output DF:

Desired output

date deposit withdrawal daily_return balance
2024-01-01 100.00 0.00 10.00 110.00
2024-01-02 0.00 0.00 11.00 121.00
2024-01-03 50.00 30.00 14.10 155.10
2024-01-04 0.00 0.00 15.51 170.61
2024-01-05 0.00 200.00 0.00 -29.39
2024-01-06 20.00 0.00 0.00 -9.39
2024-01-07 20.00 0.00 1.06 11.67
2024-01-08 0.00 0.00 1.17 12.84

What I have

I have a solution in Pandas that achieves the desired output, however it iterates over every line of the DF, i.e. it's slow. Is there a way to vectorize this calculation to speed it up? Or maybe another approach? Here is my implementation:

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2024-01-01", end="2024-01-08"),
    "deposit": [100.0, 0.0, 50.0, 0.0, 0.0, 20.0, 20.0, 0.0],
    "withdrawal": [0.0, 0.0, 30.0, 0.0, 200.0, 0.0, 0.0, 0.0]
})

daily_returns = []
balances = []
prev_balance = 0

for _, row in df.iterrows():

    movements = prev_balance + row["deposit"] - row["withdrawal"] 
    interest = 0.1 if movements > 0 else 0
    daily_return = movements * interest
    balance = movements + daily_return
    
    daily_returns.append(daily_return)
    balances.append(balance)
    
    prev_balance = balance

df["daily_return"] = daily_returns
df["balance"] = balances

Solution

  • For this type of computations I'd use , e.g.:

    from numba import njit
    
    
    @njit
    def calculate(deposits, withdrawals, out_daily_return, out_balance):
        prev_balance = 0
    
        for i, (deposit, withdrawal) in enumerate(zip(deposits, withdrawals)):
            movements = prev_balance + deposit - withdrawal
            interest = 0.1 if movements > 0 else 0
            daily_return = movements * interest
            balance = movements + daily_return
    
            out_daily_return[i] = daily_return
            out_balance[i] = balance
    
            prev_balance = balance
    
    
    df["daily_return"] = 0.0
    df["balance"] = 0.0
    
    calculate(
        df["deposit"].values,
        df["withdrawal"].values,
        df["daily_return"].values,
        df["balance"].values,
    )
    
    print(df)
    

    Prints:

            date  deposit  withdrawal  daily_return   balance
    0 2024-01-01    100.0         0.0       10.0000  110.0000
    1 2024-01-02      0.0         0.0       11.0000  121.0000
    2 2024-01-03     50.0        30.0       14.1000  155.1000
    3 2024-01-04      0.0         0.0       15.5100  170.6100
    4 2024-01-05      0.0       200.0       -0.0000  -29.3900
    5 2024-01-06     20.0         0.0       -0.0000   -9.3900
    6 2024-01-07     20.0         0.0        1.0610   11.6710
    7 2024-01-08      0.0         0.0        1.1671   12.8381
    

    Quick benchmark:

    from time import monotonic
    
    df = pd.DataFrame(
        {
            "date": pd.date_range(start="2024-01-01", end="2024-01-08"),
            "deposit": [100.0, 0.0, 50.0, 0.0, 0.0, 20.0, 20.0, 0.0],
            "withdrawal": [0.0, 0.0, 30.0, 0.0, 200.0, 0.0, 0.0, 0.0],
        }
    )
    
    df = pd.concat([df] * 1_000_000)
    print(f"{len(df)=}")
    
    start_time = monotonic()
    
    df["daily_return"] = 0.0
    df["balance"] = 0.0
    
    calculate(
        df["deposit"].values,
        df["withdrawal"].values,
        df["daily_return"].values,
        df["balance"].values,
    )
    
    print("Time =", monotonic() - start_time)
    

    Prints on my AMD 5700x:

    len(df)=8000000
    Time = 0.11215395800536498