pandasdataframegroup-by

Creating new column according to the closest last date in Pandas dataframe


I have a pandas dataframe that looks like

data = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7]
}
df = pd.DataFrame(data)

and I would like to create three new columns 'student_rec_1', 'student_rec_2', 'student_rec_3' using the following method:

for each Student_ID, student_rec_1 is equal to the Place of that student in the last test in the closest last date, and is equal to np.nan if it does not exist.

Similarly, student_rec_2 is equal to the Place of that student in the second last test in the closest last date, and is equal to np.nan if it does not exist,

student_rec_3 is equal to the Place of that student in the third last test in the closest last date, and is equal to np.nan if it does not exist. So the desired outcome looks like

data_new = {
'Date': ['2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-07-14','2024-03-14','2024-03-14','2024-03-14','2024-02-14','2024-02-10','2024-02-10','2024-02-10','2024-04-13','2024-04-13','2023-02-11','2023-02-11','2023-02-11','2011-10-11','2011-05-02','2011-05-02'],
'Test_Number': [5,4,3,2,1,3,2,1,4,3,2,1,2,1,3,2,1,1,2,1],
'Student_ID': [2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
'Place': [3,5,7,3,1,9,6,3,7,8,2,1,3,4,2,1,5,6,2,7],
'student_rec_1': [9,9,9,9,9,7,7,7,8,np.nan,np.nan,np.nan,2,2,6,6,6,2,np.nan,np.nan],
'student_rec_2': [6,6,6,6,6,8,8,8,2,np.nan,np.nan,np.nan,1,1,2,2,2,7,np.nan,np.nan],
'student_rec_3': [3,3,3,3,3,2,2,2,1,np.nan,np.nan,np.nan,5,5,7,7,7,np.nan,np.nan,np.nan]
}
df_new = pd.DataFrame(data_new)

That's what I have tried:

df['Date'] = pd.to_datetime(df['Date'])

df = df.sort_values(['Date', 'Test_Number'], ascending=[False, False])

def get_last_n_records(group, n): return group['Place'].shift(-n)

df['student_rec_1'] = df.groupby('Student_ID').apply(get_last_n_records, 1).reset_index(level=0, drop=True) df['student_rec_2'] = df.groupby('Student_ID').apply(get_last_n_records, 2).reset_index(level=0, drop=True) df['student_rec_3'] = df.groupby('Student_ID').apply(get_last_n_records, 3).reset_index(level=0, drop=True)

but it just shifted the student's place for each student and didn't account for the "last day" aspect and would just shift the Place irregardless.


Solution

  • First convert column Date by to_datetime, create helper DataFrame with rename columns df_cand so possible use left join to original (for avoid remove original index is used rename). Then filter by datetimes, sorting and create counter by GroupBy.cumcount for get 3 last values, which are merged to original df:

    df['Date'] = pd.to_datetime(df['Date'])
    
    df = df.reset_index().rename(columns={'index':'orig_index'})
    
    df_cand = (df.rename(columns={'Date':'cand_Date',
                                 'Test_Number':'cand_Test_Number',
                                 'Place':'cand_Place'})
                 .drop(['orig_index'], axis=1))
    
    merged = df.merge(df_cand, on='Student_ID', how='left')
    
    merged = merged[merged['cand_Date'].lt(merged['Date'])]
    merged = merged.sort_values(['Student_ID','orig_index','cand_Date','cand_Test_Number'],
                                 ascending=[True,True,False,False])
    
    merged['cand_rank'] = merged.groupby('orig_index').cumcount().add(1)
    
    pivot = (merged[merged['cand_rank'].le(3)]
              .pivot(index='orig_index',columns='cand_rank',values='cand_Place')
              .add_prefix('student_rec'))
    
    out = df.join(pivot).drop('orig_index', axis=1)
    

    print(out)
    
             Date  Test_Number  Student_ID  Place  student_rec_1  student_rec_2  \
    0  2024-07-14            5           2      3            9.0            6.0   
    1  2024-07-14            4           2      5            9.0            6.0   
    2  2024-07-14            3           2      7            9.0            6.0   
    3  2024-07-14            2           2      3            9.0            6.0   
    4  2024-07-14            1           2      1            9.0            6.0   
    5  2024-03-14            3           2      9            7.0            8.0   
    6  2024-03-14            2           2      6            7.0            8.0   
    7  2024-03-14            1           2      3            7.0            8.0   
    8  2024-02-14            4           2      7            8.0            2.0   
    9  2024-02-10            3           2      8            NaN            NaN   
    10 2024-02-10            2           2      2            NaN            NaN   
    11 2024-02-10            1           2      1            NaN            NaN   
    12 2024-04-13            2           1      3            2.0            1.0   
    13 2024-04-13            1           1      4            2.0            1.0   
    14 2023-02-11            3           1      2            6.0            2.0   
    15 2023-02-11            2           1      1            6.0            2.0   
    16 2023-02-11            1           1      5            6.0            2.0   
    17 2011-10-11            1           1      6            2.0            7.0   
    18 2011-05-02            2           1      2            NaN            NaN   
    19 2011-05-02            1           1      7            NaN            NaN   
    
        student_rec_3  
    0             3.0  
    1             3.0  
    2             3.0  
    3             3.0  
    4             3.0  
    5             2.0  
    6             2.0  
    7             2.0  
    8             1.0  
    9             NaN  
    10            NaN  
    11            NaN  
    12            5.0  
    13            5.0  
    14            7.0  
    15            7.0  
    16            7.0  
    17            NaN  
    18            NaN  
    19            NaN  
    

    EDIT: For better performance is possible use solution working per groups with numpy - compare dates for all previous to mask, create order by cumulative sum by numpy.cumsum, so possible get N top ordering with numpy.argmax. Because there is possible some values not exist is necessary add condition with numpy.any and return necessary columns:

    df['Date'] = pd.to_datetime(df['Date'])
    
    N = 3
    
    def f(x):
    
        dates = x['Date'].to_numpy()        
        places = x['Place'].astype(float).to_numpy() 
    
        mask = dates < dates[:, None]  
        cs = np.cumsum(mask, axis=1) 
        targets = np.array(range(1, N+1))[None, :] 
        cs_ext = cs[..., None]
    
        cond = cs_ext == targets
        first_idx = np.argmax(cond, axis=1)
        m = np.any(cond, axis=1) 
    
        arr = places[first_idx]  
        arr[~m] = np.nan
    
        return pd.DataFrame(arr, 
                            index=x.index, 
                            columns=[f'student_rec_{i+1}' for i in range(N)])
    
    
    out = df.join(df.groupby('Student_ID', group_keys=False)[['Place','Date']].apply(f))
    

    print(out)
             Date  Test_Number  Student_ID  Place  student_rec_1  student_rec_2  \
    0  2024-07-14            5           2      3            9.0            6.0   
    1  2024-07-14            4           2      5            9.0            6.0   
    2  2024-07-14            3           2      7            9.0            6.0   
    3  2024-07-14            2           2      3            9.0            6.0   
    4  2024-07-14            1           2      1            9.0            6.0   
    5  2024-03-14            3           2      9            7.0            8.0   
    6  2024-03-14            2           2      6            7.0            8.0   
    7  2024-03-14            1           2      3            7.0            8.0   
    8  2024-02-14            4           2      7            8.0            2.0   
    9  2024-02-10            3           2      8            NaN            NaN   
    10 2024-02-10            2           2      2            NaN            NaN   
    11 2024-02-10            1           2      1            NaN            NaN   
    12 2024-04-13            2           1      3            2.0            1.0   
    13 2024-04-13            1           1      4            2.0            1.0   
    14 2023-02-11            3           1      2            6.0            2.0   
    15 2023-02-11            2           1      1            6.0            2.0   
    16 2023-02-11            1           1      5            6.0            2.0   
    17 2011-10-11            1           1      6            2.0            7.0   
    18 2011-05-02            2           1      2            NaN            NaN   
    19 2011-05-02            1           1      7            NaN            NaN   
    
        student_rec_3  
    0             3.0  
    1             3.0  
    2             3.0  
    3             3.0  
    4             3.0  
    5             2.0  
    6             2.0  
    7             2.0  
    8             1.0  
    9             NaN  
    10            NaN  
    11            NaN  
    12            5.0  
    13            5.0  
    14            7.0  
    15            7.0  
    16            7.0  
    17            NaN  
    18            NaN  
    19            NaN