I am trying to use a the pandas rolling function with window size 2 with groupby. This would be pretty standard other than that I also want the window to include the current value and the proceeding value.
Specifically, given
df = pd.DataFrame({'groups':['a','a','a','a','a','b','b','b','b','b'],
'info': [i for i in range(10)]})
I want
pd.DataFrame({'groups':['a','a','a','a','a','b','b','b','b','b'],
'info': [i for i in range(10)],
'groupsum':[1, 3, 5, 7, nan, 11, 13, 15, 17, nan]})
I have tried 2 strategies, both of which did not work. I first tried
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
df['groupsum'] = df.groupby('groups')['info'].rolling(window=indexer).mean().values
This way leads to kernel crashing, even for this toy dataframe. Very curious as to why.
My second way is to reverse the dataframe then use a regular groupby rolling operation:
df = df.iloc[::-1].copy()
df.index = range(df.shape[0])
df['groupsum'] = df.groupby('groups')['info'].rolling(2).sum().values
While the kernel does not crash with this method, it does not yield the dataframe I'd hoped for; it yields
pd.DataFrame({'groups':['a','a','a','a','a','b','b','b','b','b'],
'info': [i for i in range(10)],
'groupsum':[nan, 7., 5., 3., 1., nan, 17., 15., 13., 11.]})
I suppose there is an obvious solution here that I just don't know. Any help is appreciated!
If assign numpy array instead Series it not aligned correctly, never do it for avoid this problems. Need remove first level of MultiIndex
by Series.reset_index
with drop=True
and then change order by indexing:
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
df['groupsum'] = df.groupby('groups')['info'].rolling(window=indexer).sum().reset_index(level=0, drop=True)
print (df)
groups info groupsum
0 a 0 1.0
1 a 1 3.0
2 a 2 5.0
3 a 3 7.0
4 a 4 NaN
5 b 5 11.0
6 b 6 13.0
7 b 7 15.0
8 b 8 17.0
9 b 9 NaN
df = df.iloc[::-1].copy()
df.index = range(df.shape[0])
df['groupsum'] = df.groupby('groups')['info'].rolling(2).sum().reset_index(level=0, drop=True)
df = df.iloc[::-1]
print (df)
groups info groupsum
9 a 0 1.0
8 a 1 3.0
7 a 2 5.0
6 a 3 7.0
5 a 4 NaN
4 b 5 11.0
3 b 6 13.0
2 b 7 15.0
1 b 8 17.0
0 b 9 NaN