I am trying to do a groupby on a DataFrame which has multiindex columns using a Series (without multiindex) as an input to group by. Specifically, given the below DataFrame
>>> df
X Y
A B C A B C
2020-01-01 9 1 2 1 6 5
2020-01-02 5 7 8 0 6 9
2020-01-03 6 3 4 8 6 1
2020-01-06 0 0 9 0 5 1
2020-01-07 8 7 4 8 3 1
and the Series representing the groups
>>> groups
A D
B D
C E
dtype: object
I try to run the following
>>> df.groupby(groups, axis=1, level=1).sum()
and expect to get
X Y
D E D E
2020-01-01 10 2 7 5
2020-01-02 12 8 6 9
2020-01-03 9 4 14 1
2020-01-06 0 9 5 1
2020-01-07 15 4 11 1
Instead however I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/frame.py", line 6717, in groupby
return DataFrameGroupBy(
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 560, in __init__
grouper, exclusions, obj = get_grouper(
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 828, in get_grouper
Grouping(
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 485, in __init__
) = index._get_grouper_for_level(self.grouper, level)
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 1487, in _get_grouper_for_level
grouper = level_values.map(mapper)
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5098, in map
new_values = super()._map_values(mapper, na_action=na_action)
File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/base.py", line 937, in _map_values
new_values = map_f(values, mapper)
File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
TypeError: 'numpy.ndarray' object is not callable
I'm using Python 3.8.8 and Pandas version 1.2.3.
One way I found to achieve the above is with the following code, but I'm specifically wondering whether there is a cleaner way to do so. If not, why not? To me the above attempt would be the expected behaviour of the groupby method, but it seems I'm misunderstanding the logic behind it.
>>> df, groups = df.align(groups, axis=1, level=1)
>>> df.groupby(groups, axis=1).apply(lambda x: x.sum(axis=1, level=0)).swaplevel(axis=1).sort_index(axis=1)
X Y
D E D E
2020-01-01 10 2 7 5
2020-01-02 12 8 6 9
2020-01-03 9 4 14 1
2020-01-06 0 9 5 1
2020-01-07 15 4 11 1
You can use rename
by second level of MultiIndex
and then aggregate by both levels:
df = df.rename(columns=groups, level=1).sum(axis=1, level=[0,1])
#working like
#df = df.rename(columns=groups, level=1).groupby(axis=1, level=[0,1]).sum()
print (df)
X Y
D E D E
2020-01-01 10 2 7 5
2020-01-02 12 8 6 9
2020-01-03 9 4 14 1
2020-01-06 0 9 5 1
2020-01-07 15 4 11 1
Your solution should be changed by lambda function, but output is different:
df = df.groupby(lambda x: groups[x], axis=1, level=1).sum()
print (df)
D E
2020-01-01 17 7
2020-01-02 18 17
2020-01-03 23 5
2020-01-06 5 10
2020-01-07 26 5