pythonpandasdataframegroup-bymulti-index

Group Pandas DataFrame on criteria from another DataFrame to multi-index


I have the following two DataFrames:

df
   100  101  102  103  104  105  106  107  108  109
0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0  10.0
2  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
6  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
7  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
8  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
9  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
df2
  crit1 crit2
110     a     A
109     a     B
108     a     A
107     b     B
106     b     A
105     a     A
104     a     B
103     a     A
102     b     B
101     b     A
100     b     A
99      b     A

df contains data for ten entities 100-109 and df2 describes two criteria categorizing the ten entities 100-109 (and others, in a different order). I'd like to group df on a two-level column index (crit1,crit2) with one value per combination of (crit1,crit2), being the sum of all columns with this combination.

For example, the new column with the index ('a','A') would contain the sum of columns [108,105,103].

expected result:

crit1     a           b      
crit2     A     B     A     B
0       3.0   2.0   3.0   2.0
1      19.0  15.0  10.0  11.0
2       3.0   2.0   3.0   2.0
3       3.0   2.0   3.0   2.0
4       3.0   2.0   3.0   2.0
5       3.0   2.0   3.0   2.0
6       3.0   2.0   3.0   2.0
7       3.0   2.0   3.0   2.0
8       3.0   2.0   3.0   2.0
9       3.0   2.0   3.0   2.0

To reproduce the DataFrames:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.ones((10,10)), index=np.arange(10), columns=np.arange(100,110))
df2 = pd.DataFrame(np.array([['a','A'],['a','B'],['a','A'],['b','B'],['b','A'],['a','A'],['a','B'],['a','A'],['b','B'],['b','A'],['b','A'],['b','A']]), index=np.arange(110,98,-1), columns=['crit1','crit2'])
df.iloc[1] = np.arange(1,11)

Solution

  • You can reindex, set_axis with MultiIndex.from_frame then groupby.sum:

    out = (df.reindex(columns=df2.index)
             .set_axis(pd.MultiIndex.from_frame(df2), axis=1)
             .groupby(axis=1, level=[0, 1]).sum()
          )
    

    For the latest version of pandas, using the transpose as intermediate:

    out = (df.T
             .reindex(df2.index)
             .set_axis(pd.MultiIndex.from_frame(df2))
             .groupby(level=[0, 1]).sum().T
          )
    

    Output:

    crit1    a         b     
    crit2    A    B    A    B
    0      4.0  2.0  2.0  2.0
    1      4.0  2.0  2.0  2.0
    2      4.0  2.0  2.0  2.0
    3      4.0  2.0  2.0  2.0
    4      4.0  2.0  2.0  2.0
    5      4.0  2.0  2.0  2.0
    6      4.0  2.0  2.0  2.0
    7      4.0  2.0  2.0  2.0
    8      4.0  2.0  2.0  2.0
    9      4.0  2.0  2.0  2.0
    

    Alternatively, pre-aggregating the indices, then building the output with concat:

    groups = df2.reset_index().groupby(list(df2))['index'].agg(list)
    
    out = pd.concat({k: df.reindex(columns=lst).sum(axis=1)
                     for k, lst in groups.items()}, axis=1)
    

    Output:

         a         b     
         A    B    A    B
    0  4.0  2.0  2.0  2.0
    1  4.0  2.0  2.0  2.0
    2  4.0  2.0  2.0  2.0
    3  4.0  2.0  2.0  2.0
    4  4.0  2.0  2.0  2.0
    5  4.0  2.0  2.0  2.0
    6  4.0  2.0  2.0  2.0
    7  4.0  2.0  2.0  2.0
    8  4.0  2.0  2.0  2.0
    9  4.0  2.0  2.0  2.0
    

    Intermediate groups:

    crit1  crit2
    a      A        [100, 102, 105, 107]
           B                  [101, 106]
    b      A                  [104, 109]
           B                  [103, 108]
    Name: index, dtype: object