pandascolumnheader

Groupby and sum based on column name


I have a dataframe:

df = pd.DataFrame({ 
    'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'], 
    'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '201901': [100, 120, 130, 200, 190, 210],
    '201902': [100, 120, 130, 200, 190, 210],
    '201903': [200, 250, 450, 120, 180, 190],
    '202001': [200, 250, 450, 120, 180, 190],
    '202002': [200, 250, 450, 120, 180, 190],
    '202003': [200, 250, 450, 120, 180, 190]
})

The columns represent years and months respectively. I would like to sum the columns for months into a new columns for the year. The result should look like the following:

df = pd.DataFrame({ 
    'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'], 
    'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '201901': [100, 120, 130, 200, 190, 210],
    '201902': [100, 120, 130, 200, 190, 210],
    '201903': [200, 250, 450, 120, 180, 190],
    '202001': [200, 250, 450, 120, 180, 190],
    '202002': [200, 250, 450, 120, 180, 190],
    '202003': [200, 250, 450, 120, 180, 190],
    '2019': [400, 490, 710, 520, 560, 610],
    '2020': [600, 750, 1350, 360, 540, 570]
})

My actual dataset has a number of years and has 12 months for each year. Hoping not to have to add the columns manually.


Solution

  • Try creating a DataFrame that contains the year columns and convert the column names to_datetime :

    data_df = df.iloc[:, 2:]
    data_df.columns = pd.to_datetime(data_df.columns, format='%Y%m')
    
       2019-01-01  2019-02-01  2019-03-01  2020-01-01  2020-02-01  2020-03-01
    0         100         100         200         200         200         200
    1         120         120         250         250         250         250
    2         130         130         450         450         450         450
    3         200         200         120         120         120         120
    4         190         190         180         180         180         180
    5         210         210         190         190         190         190
    

    resample sum the columns by Year and rename columns to just the year values:

    data_df = (
        data_df.resample('Y', axis=1).sum().rename(columns=lambda c: c.year)
    )
    
       2019  2020
    0   400   600
    1   490   750
    2   710  1350
    3   520   360
    4   560   540
    5   610   570
    

    Then join back to the original DataFrame:

    new_df = df.join(data_df)
    

    new_df:

       BU Line_Item  201901  201902  201903  202001  202002  202003  2019  2020
    0  AA  Revenues     100     100     200     200     200     200   400   600
    1  AA       EBT     120     120     250     250     250     250   490   750
    2  AA  Expenses     130     130     450     450     450     450   710  1350
    3  BB  Revenues     200     200     120     120     120     120   520   360
    4  BB       EBT     190     190     180     180     180     180   560   540
    5  BB  Expenses     210     210     190     190     190     190   610   570