pythonpandasrolling-computation

"window must be an integer 0 or greater" issue with '30D' style rolling calculations


I've had a look and can't seem to find a solution to this issue. I'm wanting to calculate the rolling sum of the previous 30 days' worth of data at each date in the dataframe - by subgroup - for a set of data that isn't daily - it's spaced fairly irregularly. I've been attempting to use ChatGPT which is getting in a twist over it.

Initially the suggestion was that I'd not converted the Date column to datetime format to allow for the rolling calculation, but now from the code below:

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

# Create a dataset with irregularly spaced dates spanning two years
np.random.seed(42)
date_rng = pd.date_range(start='2022-01-01', end='2023-12-31', freq='10D')  # Every 10 days
data = {'Date': np.random.choice(date_rng, size=30),
        'Group': np.random.choice(['A', 'B'], size=30),
        'Value': np.random.randint(1, 30, size=30)}

df = pd.DataFrame(data)

# Sort DataFrame by date
df.sort_values(by='Date', inplace=True)

df['Date'] = pd.to_datetime(df['Date'])

# Calculate cumulative sum by group within the previous 30 days from each day
df['RollingSum_Last30Days'] = df.groupby('Group')['Value'].transform(lambda x: x.rolling(window='30D', min_periods=1).sum())

I'm getting an error of:

ValueError: window must be an integer 0 or greater

I've found conflicting comments online as to whether the format '30D' works in rolling windows but I'm none the wiser as to a solution to this. Any help appreciated.

Running in VSCode in Python 3.11.8.


Solution

  • The issue if that you need to specify which column to use as Date but don't have access to the Date with groupby.transform.

    You could use groupby.apply:

    # Calculate cumulative sum by group within the previous 30 days from each day
    df['RollingSum_Last30Days'] = (df.groupby('Group', group_keys=False)
                                     .apply(lambda x: x.rolling(window='30D', on='Date', min_periods=1)['Value'].sum())
                                  )
    

    Output:

             Date Group  Value  RollingSum_Last30Days
    9  2022-01-11     A     22                   22.0
    12 2022-01-11     A     22                   44.0
    6  2022-01-21     A      4                   48.0
    1  2022-05-21     B     14                   14.0
    23 2022-05-21     A      8                    8.0
    15 2022-07-20     B     26                   26.0
    4  2022-07-20     A     18                   18.0
    18 2022-07-30     B     10                   36.0
    7  2022-07-30     A      2                   20.0
    5  2022-08-19     A      8                   10.0
    10 2022-10-18     B     10                   10.0
    16 2022-11-17     B     12                   12.0
    11 2023-01-06     B      4                    4.0
    21 2023-02-15     B     16                   16.0
    26 2023-04-06     B     28                   28.0
    19 2023-04-26     A      4                    4.0
    28 2023-05-16     B      8                    8.0
    0  2023-05-26     B      3                   11.0
    8  2023-06-05     A      6                    6.0
    29 2023-06-25     A     21                   27.0
    17 2023-07-25     A      2                    2.0
    20 2023-08-04     B     14                   14.0
    22 2023-08-14     B     15                   29.0
    14 2023-08-14     B     18                   47.0
    3  2023-08-24     A      4                    4.0
    24 2023-09-03     B     14                   47.0
    25 2023-09-03     A     23                   27.0
    27 2023-09-03     A     25                   52.0
    13 2023-09-23     B     29                   43.0
    2  2023-12-12     A     17                   17.0