pythonpandasdataframerolling-computation

How to group by on multiple columns and retain the original index in a pandas dataframe?


I need to group by multiple columns on a dataframe and calculate the rolling mean in the group. But the original index needs to be preserved.

Simple python code below :

data = {'values': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
        'type':['A','B','A','B','A','B','A','B','A','B','A','B','A','B','A'],
       'type2':['C','D','C','D','C','D','C','D','C','D','C','D','C','D','C']}
df = pd.DataFrame(data)

window_size = 3

df['mean'] = (df.groupby(['type','type2'])['values'].rolling(window=3).mean().reset_index(drop=True))

print(df)

Output :

    values type type2  mean
0        1    A     C   NaN
1        2    B     D   NaN
2        3    A     C   3.0
3        4    B     D   5.0
4        5    A     C   7.0
5        6    B     D   9.0
6        7    A     C  11.0
7        8    B     D  13.0
8        9    A     C   NaN
9       10    B     D   NaN
10      11    A     C   4.0
11      12    B     D   6.0
12      13    A     C   8.0
13      14    B     D  10.0
14      15    A     C  12.0

What I need :

      values type type2  mean
0        1    A     C   1
1        2    B     D   2
2        3    A     C   2
3        4    B     D   3
4        5    A     C   3
5        6    B     D   4
6        7    A     C   5
7        8    B     D   6
8        9    A     C   7
9       10    B     D   8
10      11    A     C   9
11      12    B     D   10
12      13    A     C   11
13      14    B     D   12
14      15    A     C   13

The requirement is very simple. The mean has to be calcluated in the groups. So last row is group (A,C) . So it is 15+ 13(previous) + 11 (previous to previous because window is 3) = 39 /3 =13

Same for other rows.

But when I do this with level = 0 with below code I get

df['mean'] = (df.groupby(['type','type2'])['values'].rolling(window=3).mean().reset_index(level=0,drop=True))


raised in MultiIndex.from_tuples, see test_insert_error_msmgs
  12690         if not value.index.is_unique:
  12691             # duplicate axis
  12692             raise err
  12693 
> 12694         raise TypeError(
  12695             "incompatible index of inserted column with frame index"
  12696         ) from err
  12697     return reindexed_value, None

TypeError: incompatible index of inserted column with frame index

How to go about this simple requirement ?


Solution

  • You should use droplevel:

    cols = ['type', 'type2']
    df['mean'] = (df.groupby(cols)['values']
                    .rolling(window=3).mean()
                    .droplevel(cols)
                 )
    

    Output:

        values type type2  mean
    0        1    A     C   NaN
    1        2    B     D   NaN
    2        3    A     C   NaN
    3        4    B     D   NaN
    4        5    A     C   3.0
    5        6    B     D   4.0
    6        7    A     C   5.0
    7        8    B     D   6.0
    8        9    A     C   7.0
    9       10    B     D   8.0
    10      11    A     C   9.0
    11      12    B     D  10.0
    12      13    A     C  11.0
    13      14    B     D  12.0
    14      15    A     C  13.0
    

    And to avoid the NaNs, add min_periods=1:

    cols = ['type', 'type2']
    df['mean'] = (df.groupby(cols)['values']
                    .rolling(window=3, min_periods=1).mean()
                    .droplevel(cols)
                 )
    

    Output:

        values type type2  mean
    0        1    A     C   1.0
    1        2    B     D   2.0
    2        3    A     C   2.0
    3        4    B     D   3.0
    4        5    A     C   3.0
    5        6    B     D   4.0
    6        7    A     C   5.0
    7        8    B     D   6.0
    8        9    A     C   7.0
    9       10    B     D   8.0
    10      11    A     C   9.0
    11      12    B     D  10.0
    12      13    A     C  11.0
    13      14    B     D  12.0
    14      15    A     C  13.0