python-2.7 pandas

Doing a groupby and rolling window on a Pandas DataFrame with a multilevel index leads to a duplicated index entry


If I do a groupby() followed by a rolling() calculation on a DataFrame with a multi-level index, one of the index levels is repeated in the result, which is most odd. I am using Pandas 0.18.1.

import pandas as pd
df = pd.DataFrame(data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
                        [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
                  columns=['id', 'date', 'd1', 'd2'])

df.set_index(['id', 'date'], inplace=True)
df = df.groupby(level='id').rolling(window=2)['d1'].sum()
print(df)
print(df.index)

The output is as follows:

id  id  date
1   1   1        NaN
        2       40.0
        3       80.0
2   2   1        NaN
        2       42.0
        3       82.0
Name: d1, dtype: float64
MultiIndex(levels=[[1, 2], [1, 2], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[u'id', u'id', u'date'])

What is odd is that the id level now shows up twice in the multi-index. Moving the ['d1'] column selection around doesn't make any difference.

Any help would be much appreciated.

Thanks, Paul


Solution

  • It is a bug.

    The version with apply works well, though. Here is that alternative (only the d1 selection was moved in front of apply):

    df = df.groupby(level='id').d1.apply(lambda x: x.rolling(window=2).sum())
    print(df)
    id  date
    1   1        NaN
        2       40.0
        3       80.0
    2   1        NaN
        2       42.0
        3       82.0
    Name: d1, dtype: float64
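
Another option that may be worth trying is transform() instead of apply(). This is only a sketch, not something from the original answer: it assumes that transform() returns a result aligned to the original (id, date) index, so no extra group-key level is prepended. The variable name result is just for illustration.

```python
import pandas as pd

# Same sample data as in the question, indexed by (id, date).
df = pd.DataFrame(
    data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
          [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
    columns=['id', 'date', 'd1', 'd2'],
).set_index(['id', 'date'])

# Rolling sum of d1 within each id group. transform() keeps the
# original index, so the id level should appear only once.
result = df.groupby(level='id')['d1'].transform(
    lambda s: s.rolling(window=2).sum()
)

print(result)
print(result.index.names)
```

The values should match the apply version above; checking result.index.names is a quick way to confirm that the index has only the id and date levels on your Pandas version.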