Tags: python-3.x, pandas, dataframe, group-by, cumsum

Cumsum in Pandas Dataframe but reset to zero on a specific date every new year


I have a dataframe that is in descending order of time and indexed by Race_ID and Student_ID.

Race_ID   Date           Student_ID      Mark     
1         1/10/2023      1               5        
1         1/10/2023      2               8        
1         1/10/2023      3               7        
8         1/10/2023      4               4        
8         1/1/2023       1               9        
8         1/1/2023       2               3        
8         1/1/2023       3               5        
8         1/1/2023       4               10       
2         11/9/2022      1               2        
2         11/9/2022      2               4        
2         11/9/2022      3               9        
3         17/4/2022      5               3        
3         17/4/2022      2               4        
3         17/4/2022      3               3        
3         17/4/2022      4               7        
4         1/3/2022       1               4        
4         1/3/2022       2               9        
5         1/1/2021       1               6        
5         1/1/2021       2               1        
5         1/1/2021       3               8        

I want to create a new column Seasonal_Mark which is the cumsum of Mark by each student, with the caveat that the sum is set to zero on every 1 February. So the desired outcome looks like:

Race_ID   Date           Student_ID      Mark     Seasonal_Mark 
1         1/10/2023      1               5        5
1         1/10/2023      2               8        8
1         1/10/2023      3               7        7
8         1/10/2023      4               4        4
8         1/1/2023       1               9        15 (4+2+9)
8         1/1/2023       2               3        20 (4+9+4+3)
8         1/1/2023       3               5        17 (9+3+5)
8         1/1/2023       4               10       17 (7+10)                  
2         11/9/2022      1               2        6  (4+2)
2         11/9/2022      2               4        17 (4+9+4)
2         11/9/2022      3               9        12 (9+3)
3         17/4/2022      5               3        3
3         17/4/2022      2               4        13 (4+9)
3         17/4/2022      3               3        3
3         17/4/2022      4               7        7
4         1/3/2022       1               4        4
4         1/3/2022       2               9        9
5         1/1/2021       1               6        6
5         1/1/2021       2               1        1
5         1/1/2021       3               8        8

Thank you so much in advance.


Solution

  • Another possible solution: create an adjusted-year column that equals the year of the date when the date falls in February or later, and the year minus 1 when the date falls in January. Grouping by this column (together with Student_ID) makes the cumulative sum reset every 1 February.

    import numpy as np
    import pandas as pd

    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

    m = df['Date'].dt.month
    y = df['Date'].dt.year

    out = df.assign(
        Seasonal_Mark = (df.assign(
            adjusted_year = np.where(m >= 2, y, y - 1))  # January counts toward the previous season
        .sort_values(by='Date')                          # cumsum must run in chronological order
        .groupby(['Student_ID', 'adjusted_year'])['Mark'].cumsum()))
    

    Output:

        Race_ID       Date  Student_ID  Mark  Seasonal_Mark
    0         1 2023-10-01           1     5              5
    1         1 2023-10-01           2     8              8
    2         1 2023-10-01           3     7              7
    3         8 2023-10-01           4     4              4
    4         8 2023-01-01           1     9             15
    5         8 2023-01-01           2     3             20
    6         8 2023-01-01           3     5             17
    7         8 2023-01-01           4    10             17
    8         2 2022-09-11           1     2              6
    9         2 2022-09-11           2     4             17
    10        2 2022-09-11           3     9             12
    11        3 2022-04-17           5     3              3
    12        3 2022-04-17           2     4             13
    13        3 2022-04-17           3     3              3
    14        3 2022-04-17           4     7              7
    15        4 2022-03-01           1     4              4
    16        4 2022-03-01           2     9              9
    17        5 2021-01-01           1     6              6
    18        5 2021-01-01           2     1              1
    19        5 2021-01-01           3     8              8
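
    As a side note, the adjusted season year can also be computed without `np.where`: shifting every date back by one month maps January into the previous calendar year, so a season running 1 February to 31 January collapses to a single year label. A minimal sketch of this equivalent approach (using a hypothetical `season` helper column and a small toy frame, not the question's full data):

    ```python
    import pandas as pd

    # Toy data: one student, three races spanning two seasons.
    df = pd.DataFrame({
        'Date': pd.to_datetime(['1/10/2023', '1/1/2023', '1/3/2022'], dayfirst=True),
        'Student_ID': [1, 1, 1],
        'Mark': [5, 9, 4],
    })

    # Shifting one month back sends 1 Jan 2023 to Dec 2022, so its year
    # (2022) labels the Feb-2022-to-Jan-2023 season.
    df['season'] = (df['Date'] - pd.DateOffset(months=1)).dt.year

    # cumsum on the chronologically sorted frame; the result aligns back
    # to df by index, so assignment restores the original row order.
    df['Seasonal_Mark'] = (df.sort_values('Date')
                             .groupby(['Student_ID', 'season'])['Mark']
                             .cumsum())
    ```

    This gives the same grouping key as the `np.where(m >= 2, y, y - 1)` expression above; which form is clearer is a matter of taste.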