I have a pandas DataFrame that looks like this:
import numpy as np
import pandas as pd

data = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7]
}
df = pd.DataFrame(data)
and I would like to create three new columns 'student_rec_1', 'student_rec_2', 'student_rec_3' as follows:
for each Student_ID, student_rec_1 is that student's Place in the last test on the closest earlier date, or np.nan if no such test exists.
Similarly, student_rec_2 is the Place in the second-to-last test on the closest earlier date (np.nan if it does not exist),
and student_rec_3 is the Place in the third-to-last test on the closest earlier date (np.nan if it does not exist). So the desired outcome looks like
data_new = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7],
'student_rec_1': [9,9,9,9,9,7,7,7,8,np.nan,np.nan,np.nan,2,2,6,6,6,2,np.nan,np.nan],
'student_rec_2': [6,6,6,6,6,8,8,8,2,np.nan,np.nan,np.nan,1,1,2,2,2,7,np.nan,np.nan],
'student_rec_3': [3,3,3,3,3,2,2,2,1,np.nan,np.nan,np.nan,5,5,7,7,7,np.nan,np.nan,np.nan]
}
df_new = pd.DataFrame(data_new)
That's what I have tried:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Date', 'Test_Number'], ascending=[False, False])
def get_last_n_records(group, n):
    return group['Place'].shift(-n)

df['student_rec_1'] = df.groupby('Student_ID').apply(get_last_n_records, 1).reset_index(level=0, drop=True)
df['student_rec_2'] = df.groupby('Student_ID').apply(get_last_n_records, 2).reset_index(level=0, drop=True)
df['student_rec_3'] = df.groupby('Student_ID').apply(get_last_n_records, 3).reset_index(level=0, drop=True)
but this just shifts each student's Place row by row; it does not account for the "closest earlier date" aspect and shifts Place regardless of the date.
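A minimal sketch of the failure mode (toy frame, hypothetical values): shift(-1) always takes the Place from the next row, even when that row belongs to the same date rather than an earlier one.

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2024-07-14', '2024-07-14', '2024-03-14'],
                    'Place': [3, 5, 9]})
# shift(-1) grabs the next row's Place even when it is on the same date
toy['rec_1'] = toy['Place'].shift(-1)
print(toy['rec_1'].tolist())  # [5.0, 9.0, nan]
```

The first row should get 9 (last test of the earlier date), but shift gives it 5 from the same-date row below.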
First convert column Date with to_datetime. Then create a helper DataFrame df_cand with renamed columns, so a left join back to the original is possible (the original index is preserved as a column via a rename, to avoid losing it in the merge). Filter the merged rows to keep only earlier datetimes, sort, and create a per-row counter with GroupBy.cumcount to get the 3 most recent values, which are pivoted and joined back to the original df:
df['Date'] = pd.to_datetime(df['Date'])
# keep the original index as a column so the results can be joined back later
df = df.reset_index().rename(columns={'index': 'orig_index'})
# candidate rows: every row of the same student, with renamed columns
df_cand = (df.rename(columns={'Date': 'cand_Date',
                              'Test_Number': 'cand_Test_Number',
                              'Place': 'cand_Place'})
             .drop(['orig_index'], axis=1))
merged = df.merge(df_cand, on='Student_ID', how='left')
# keep only candidates from strictly earlier dates
merged = merged[merged['cand_Date'].lt(merged['Date'])]
merged = merged.sort_values(['Student_ID', 'orig_index', 'cand_Date', 'cand_Test_Number'],
                            ascending=[True, True, False, False])
# rank the candidates per original row: 1 = most recent
merged['cand_rank'] = merged.groupby('orig_index').cumcount().add(1)
pivot = (merged[merged['cand_rank'].le(3)]
         .pivot(index='orig_index', columns='cand_rank', values='cand_Place')
         .add_prefix('student_rec_'))
out = df.join(pivot).drop('orig_index', axis=1)
print(out)
Date Test_Number Student_ID Place student_rec_1 student_rec_2 \
0 2024-07-14 5 2 3 9.0 6.0
1 2024-07-14 4 2 5 9.0 6.0
2 2024-07-14 3 2 7 9.0 6.0
3 2024-07-14 2 2 3 9.0 6.0
4 2024-07-14 1 2 1 9.0 6.0
5 2024-03-14 3 2 9 7.0 8.0
6 2024-03-14 2 2 6 7.0 8.0
7 2024-03-14 1 2 3 7.0 8.0
8 2024-02-14 4 2 7 8.0 2.0
9 2024-02-10 3 2 8 NaN NaN
10 2024-02-10 2 2 2 NaN NaN
11 2024-02-10 1 2 1 NaN NaN
12 2024-04-13 2 1 3 2.0 1.0
13 2024-04-13 1 1 4 2.0 1.0
14 2023-02-11 3 1 2 6.0 2.0
15 2023-02-11 2 1 1 6.0 2.0
16 2023-02-11 1 1 5 6.0 2.0
17 2011-10-11 1 1 6 2.0 7.0
18 2011-05-02 2 1 2 NaN NaN
19 2011-05-02 1 1 7 NaN NaN
student_rec_3
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 2.0
6 2.0
7 2.0
8 1.0
9 NaN
10 NaN
11 NaN
12 5.0
13 5.0
14 7.0
15 7.0
16 7.0
17 NaN
18 NaN
19 NaN
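To see how the ranking step works in isolation, here is a minimal sketch of `GroupBy.cumcount` on a toy frame (toy values, not the question's data): cumcount numbers the rows inside each group starting from 0, and adding 1 turns that into the 1-based cand_rank used above.

```python
import pandas as pd

toy = pd.DataFrame({'orig_index': [0, 0, 0, 1, 1],
                    'cand_Place': [9, 6, 3, 7, 8]})
# cumcount numbers rows within each group 0, 1, 2, ...; add(1) makes it 1-based
toy['cand_rank'] = toy.groupby('orig_index').cumcount().add(1)
print(toy['cand_rank'].tolist())  # [1, 2, 3, 1, 2]
```

Because the frame was sorted by date and test number descending first, rank 1 is always the most recent candidate.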
EDIT: For better performance it is possible to use a solution working per group with numpy: compare each row's date against all other dates of the group to build a mask of strictly earlier rows, create an ordering with the cumulative sum numpy.cumsum, and then get the positions of the first N matches with numpy.argmax. Because some of those values may not exist, it is necessary to add a condition with numpy.any before returning the new columns:
df['Date'] = pd.to_datetime(df['Date'])
N = 3

def f(x):
    # assumes rows within each group are sorted by Date and Test_Number
    # descending, as in the sample data
    dates = x['Date'].to_numpy()
    places = x['Place'].astype(float).to_numpy()
    # mask[i, j] is True where row j has a strictly earlier date than row i
    mask = dates < dates[:, None]
    # running count of earlier rows, in sorted order
    cs = np.cumsum(mask, axis=1)
    targets = np.arange(1, N + 1)[None, :]
    cs_ext = cs[..., None]
    cond = cs_ext == targets
    # position of the 1st, 2nd, ..., Nth earlier row for each row
    first_idx = np.argmax(cond, axis=1)
    # mark targets that actually exist (rows with fewer than N earlier rows get NaN)
    m = np.any(cond, axis=1)
    arr = places[first_idx]
    arr[~m] = np.nan
    return pd.DataFrame(arr,
                        index=x.index,
                        columns=[f'student_rec_{i+1}' for i in range(N)])

out = df.join(df.groupby('Student_ID', group_keys=False)[['Place', 'Date']].apply(f))
print(out)
Date Test_Number Student_ID Place student_rec_1 student_rec_2 \
0 2024-07-14 5 2 3 9.0 6.0
1 2024-07-14 4 2 5 9.0 6.0
2 2024-07-14 3 2 7 9.0 6.0
3 2024-07-14 2 2 3 9.0 6.0
4 2024-07-14 1 2 1 9.0 6.0
5 2024-03-14 3 2 9 7.0 8.0
6 2024-03-14 2 2 6 7.0 8.0
7 2024-03-14 1 2 3 7.0 8.0
8 2024-02-14 4 2 7 8.0 2.0
9 2024-02-10 3 2 8 NaN NaN
10 2024-02-10 2 2 2 NaN NaN
11 2024-02-10 1 2 1 NaN NaN
12 2024-04-13 2 1 3 2.0 1.0
13 2024-04-13 1 1 4 2.0 1.0
14 2023-02-11 3 1 2 6.0 2.0
15 2023-02-11 2 1 1 6.0 2.0
16 2023-02-11 1 1 5 6.0 2.0
17 2011-10-11 1 1 6 2.0 7.0
18 2011-05-02 2 1 2 NaN NaN
19 2011-05-02 1 1 7 NaN NaN
student_rec_3
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 2.0
6 2.0
7 2.0
8 1.0
9 NaN
10 NaN
11 NaN
12 5.0
13 5.0
14 7.0
15 7.0
16 7.0
17 NaN
18 NaN
19 NaN
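The broadcasting trick at the heart of the numpy version can be seen on a tiny array (toy dates, not the question's data): `dates[:, None]` turns the 1-D date array into a column, so the `<` comparison produces the full pairwise "strictly earlier" matrix in one step.

```python
import numpy as np

dates = np.array(['2024-07-14', '2024-03-14', '2024-02-10'], dtype='datetime64[D]')
# mask[i, j] is True where date j is strictly earlier than date i
mask = dates < dates[:, None]
print(mask.astype(int))
# running count of earlier rows; argmax finds the first column where it hits 1
cs = np.cumsum(mask, axis=1)
first = np.argmax(cs == 1, axis=1)
print(first.tolist())  # [1, 2, 0] -- row 2 has no earlier date, argmax falls back to 0
```

The fallback to 0 for rows with no match is exactly why the solution also computes `np.any(cond, axis=1)` and overwrites those positions with np.nan.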