I have a data frame that looks like
Date Student_ID Exam_Score
2020-12-24 1 79
2020-12-24 3 100
2020-12-24 4 88
2021-01-19 1 100
2021-01-19 2 100
2021-01-19 3 99
2021-01-19 4 72
2022-09-30 3 100
2022-09-30 2 100
2022-09-30 1 100
2022-09-30 5 46
2023-04-23 3 100
2023-04-23 2 97
2023-04-23 1 100
2024-07-19 2 89
2024-07-19 1 100
2024-07-19 4 93
2024-07-19 3 100
2024-09-19 1 100
2024-09-19 2 80
2024-09-19 3 100
2024-09-19 4 80
2024-10-20 1 80
2024-10-20 3 99
2024-10-20 2 80
And I would like to compute a new column called Recent_Full_Marks
which uses the following logic: for each Student_ID
, compute the number of times that student getting 100 marks on the exam before the current date and after the Sept-1 one year before. So for example, on 2024-11-22, Student 1 has gotten 2 full marks before 2024-11-22 and after 2024-09-01. And the desired column looks like:
Date Student_ID Exam_Score Recent_Full_Marks
2020-12-24 1 79 0
2020-12-24 3 100 0
2020-12-24 4 88 0
2021-01-19 1 100 0
2021-01-19 2 100 0
2021-01-19 3 99 1
2021-01-19 4 72 0
2022-09-30 3 100 0
2022-09-30 2 100 0
2022-09-30 1 100 0
2022-09-30 5 46 0
2023-11-23 3 100 0
2023-11-23 2 97 0
2023-11-23 1 100 0
2024-07-19 2 89 0
2024-07-19 1 100 1
2024-07-19 4 93 0
2024-07-19 3 100 1
2024-09-19 1 100 0
2024-09-19 2 80 0
2024-09-19 3 100 0
2024-09-19 4 80 0
2024-10-20 1 100 1
2024-10-20 3 99 1
2024-10-20 2 80 0
2024-11-22 1 70 2
2024-11-22 3 100 1
2024-11-22 2 78 0
Here is what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
full = (df.assign(Date=Date)
.sort_values(['Student_ID','Date'], ascending=[True,True])
['Exam_Score'].eq(100))
df['Recent_Full_Marks']=(full.groupby([df['Student_ID'], Date.dt.year], group_keys=False).apply(lambda g: g.shift(1, fill_value=0).cumsum()))
However, the above method only counts the number of full marks after every year start and not every sept-1 and I was unable to modify it to work.
You could use a yearly period that ends on AUG-31 (Y-AUG
) and groupby.transform
with shift
+cumsum
:
df['Recent_Full_Marks'] = (df
.sort_values('Date')['Exam_Score'].eq(100).astype(int)
.groupby([df['Student_ID'],
pd.to_datetime(df['Date']).dt.to_period('Y-AUG')])
.transform(lambda x: x.shift(fill_value=0).cumsum())
)
Output:
Date Student_ID Exam_Score Recent_Full_Marks
0 2020-12-24 1 79 0
1 2020-12-24 3 100 0
2 2020-12-24 4 88 0
3 2021-01-19 1 100 0
4 2021-01-19 2 100 0
5 2021-01-19 3 99 1
6 2021-01-19 4 72 0
7 2022-09-30 3 100 0
8 2022-09-30 2 100 0
9 2022-09-30 1 100 0
10 2022-09-30 5 46 0
11 2023-11-23 3 100 0
12 2023-11-23 2 97 0
13 2023-11-23 1 100 0
14 2024-07-19 2 89 0
15 2024-07-19 1 100 1
16 2024-07-19 4 93 0
17 2024-07-19 3 100 1
18 2024-09-19 1 100 0
19 2024-09-19 2 80 0
20 2024-09-19 3 100 0
21 2024-09-19 4 80 0
22 2024-10-20 1 100 1
23 2024-10-20 3 99 1
24 2024-10-20 2 80 0
25 2024-11-22 1 70 2
26 2024-11-22 3 100 1
27 2024-11-22 2 78 0