pythonpandasdataframegroup-byrolling-computation

Pandas rolling sum within a group


I am trying to calculate a rolling sum or any other statistic (e.g. mean), within each group. Below I am giving an example where the window is 2 and the statistic is sum.

df = pd.DataFrame.from_dict({'class': ['a', 'b', 'b', 'c', 'c', 'c', 'b', 'a', 'b'],
                        'val': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df['sum2_per_class'] = [1, 2, 5, 4, 9, 11, 10, 9, 16] # I want to compute this column
# df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum() # what I tried

    class  val  sum2_per_class
 0     a    1               1
 1     b    2               2
 2     b    3               5
 3     c    4               4
 4     c    5               9
 5     c    6              11
 6     b    7              10
 7     a    8               9
 8     b    9              16

Here's what I tried and the corresponding error:

df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum()

TypeError: incompatible index of inserted column with frame index

Solution

  • As the error message conveys, the rolling sum operation returns a pandas Series with a MultiIndex, which can't be directly assigned to a single column in a dataframe.

    A possible fix is to use reset_index() to convert the MultiIndex to a normal index like the following:

    df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum().reset_index(level=0, drop=True)
    

    However, after running the above code a few of the I was getting unexpected NaN values in the 'sum2_per_class' column as follows: [NaN, NaN, 5, NaN, 9, 11, 10, 9, 16] while other values are as expected.

    After investigating the NaN issues I came to the following conclusion:

    Update: To solve the NaN issues you can specify min_periods=1 in the Rolling function like below:

    df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2, min_periods=1).sum().reset_index(level=0, drop=True)