pythonpandasdataframeaggregatemulti-index

How to aggregate a MultiIndex DataFrame by a column level


Have been struggling with this for some time. I have the following multi-indexed by column Dataframe:

Category     Sweet                                                    
Destination    NWE                         MED                        
Week            W1     W2     W3     W4     W1      W2      W3      W4
2022-06-01     0.0    0.0    0.0   70.0   70.0    70.0    70.0    70.0
2022-07-01   142.0  142.0  779.0  779.0  945.0   945.0  1015.0  1015.0
2022-08-01   236.0  661.0  732.0  914.0  996.0  1044.0  1044.0  1096.0

I want to get the sum of NWE + MED columns by Week in one slice, which would look like this:

Category     Sweet                                                    
Destination    (NWE+MED)                                                
Week            W1     W2     W3     W4     
2022-06-01     70.0    70.0    70.0   140.0   
2022-07-01   1087.0  1087.0  1794.0  1794.0 
2022-08-01   1232.0  1705.0  1776.0  2010.0  

Will be doing it for a larger list so slicing each country then adding them together would be a solution but also very cumbersome in terms of code. This is similar to an excel pivot table where you can select several different sub categories (like countries) and you could get the sums as values. (In my case the cumulative sums).


Solution

  • You can try the stack() + (do something) + unstack() trick to sum the columns.

    stack() lets you reshape the wide df into a long one so that you can sum() the data by week; this creates a Series. After it's done, convert it into a frame by calling to_frame() and call unstack() to reshape the long frame into a wide one.

    # input
    df = (
        pd.DataFrame({
            ('Sweet', 'NWE', 'W1'): {'2022-06-01': 0.0, '2022-07-01': 142.0, '2022-08-01': 236.0}, 
            ('Sweet', 'NWE', 'W2'): {'2022-06-01': 0.0, '2022-07-01': 142.0, '2022-08-01': 661.0}, 
            ('Sweet', 'NWE', 'W3'): {'2022-06-01': 0.0, '2022-07-01': 779.0, '2022-08-01': 732.0}, 
            ('Sweet', 'NWE', 'W4'): {'2022-06-01': 70.0, '2022-07-01': 779.0, '2022-08-01': 914.0}, 
            ('Sweet', 'MED', 'W1'): {'2022-06-01': 70.0, '2022-07-01': 945.0, '2022-08-01': 996.0}, 
            ('Sweet', 'MED', 'W2'): {'2022-06-01': 70.0, '2022-07-01': 945.0, '2022-08-01': 1044.0},
            ('Sweet', 'MED', 'W3'): {'2022-06-01': 70.0, '2022-07-01': 1015.0, '2022-08-01': 1044.0}, 
            ('Sweet', 'MED', 'W4'): {'2022-06-01': 70.0, '2022-07-01': 1015.0, '2022-08-01': 1096.0}})
        .rename_axis(columns=['Category', 'Destination', 'Week'])
    )
    
    # stack the df to make it long, sum along columns, add column names and unstack back
    df = df.stack().sum(axis=1).to_frame(name=('Sweet', 'NWE+MED')).unstack()
    # output
    df
    

    Another option is to simply index the columns and add them together. Using tuples lets us index levels of a multi-index column.

    df = df.sort_index(axis=1)
    df = df[("Sweet", "NWE")] + df[("Sweet", "MED")]
    df.columns = pd.MultiIndex.from_product(
        (["Sweet"], ["MED+NWE"], df.columns),
        names=["Category", "Destination", "Week"]
    )
    

    Both of the above options produce the following output:

    result