I am trying to calculate a rolling sum or any other statistic (e.g. mean), within each group. Below I am giving an example where the window is 2 and the statistic is sum.
df = pd.DataFrame.from_dict({'class': ['a', 'b', 'b', 'c', 'c', 'c', 'b', 'a', 'b'],
'val': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df['sum2_per_class'] = [1, 2, 5, 4, 9, 11, 10, 9, 16] # I want to compute this column
# df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum() # what I tried
class val sum2_per_class
0 a 1 1
1 b 2 2
2 b 3 5
3 c 4 4
4 c 5 9
5 c 6 11
6 b 7 10
7 a 8 9
8 b 9 16
Here's what I tried and the corresponding error:
df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum()
TypeError: incompatible index of inserted column with frame index
As the error message conveys, the rolling sum operation returns a pandas Series with a MultiIndex, which can't be directly assigned to a single column in a dataframe.
A possible fix is to use reset_index()
to convert the MultiIndex to a normal index like the following:
df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum().reset_index(level=0, drop=True)
However, after running the above code a few of the I was getting unexpected NaN values in the 'sum2_per_class' column as follows: [NaN, NaN, 5, NaN, 9, 11, 10, 9, 16]
while other values are as expected.
After investigating the NaN issues I came to the following conclusion:
Row 0
with val1=1
and Row 7
with val=8
you expect the rolling sum to be 1 + 8 = 9
while these rows are not consecutive and will result in NaN. For other groups where we got the expected rolling sum the grouped rows are consecutive. For example for group 'c' we have: Row 3
, Row 4
, and Row 5
.Update:
To solve the NaN issues you can specify min_periods=1
in the Rolling
function like below:
df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2, min_periods=1).sum().reset_index(level=0, drop=True)