pythonpython-3.xpandasdataframegroup-by

Comparing value of one column with the next row in another column in pandas dataframe


I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           adv          C_k
1         1/1/2023       2.5          2.7     
1         1/1/2023       1.4          2.6     
1         1/1/2023       1.3          1.9     
1         1/1/2023       1.1          1.2     
2         11/9/2022      1.4          1.1     
2         11/9/2022      1.3          1.2     
2         11/9/2022      1.0          0.4     
3         17/4/2022      0.9          0.2     
3         17/4/2022      0.8          0.4     
3         17/4/2022      0.7          0.5     
3         17/4/2022      0.6          0.2     
3         17/4/2022      0.5          0.4     

That is grouped by Group_ID, the values in adv is sorted in descending value by groups. For each group, I want to define a value

t= min{n| adv_(n+1) <= C_n} and C_t be the corresponding value in the C_k column. In words, I want to create a new column C_t for each group, where the value is equal to the first C_k such that C_k is greater or equal to the adv value in the next row. And if no such value exists, then set C_t = 1 (for example, in Race_ID = 3)

So the desired column looks like this:

Race_ID   Date           adv          C_k     C_t
1         1/1/2023       2.5          2.7     1.9      
1         1/1/2023       1.4          2.6     1.9     
1         1/1/2023       1.3          1.9     1.9     
1         1/1/2023       1.1          1.2     1.9     
2         11/9/2022      1.4          1.1     1.2     
2         11/9/2022      1.3          1.2     1.2     
2         11/9/2022      1.0          0.4     1.2     
3         17/4/2022      0.9          0.2     1.0     
3         17/4/2022      0.8          0.4     1.0     
3         17/4/2022      0.7          0.5     1.0     
3         17/4/2022      0.6          0.2     1.0     
3         17/4/2022      0.5          0.4     1.0     

Thanks in advance.


Solution

  • Compare shifted values per groups by DataFrameGroupBy.shift with Series.le, filter by boolean indexing, remove duplicates with keep last value by DataFrame.drop_duplicates and mapping by Series.map:

    s = (df[df.groupby('Race_ID')['adv'].shift(-1).le(df['C_k'])]
             .drop_duplicates(subset=['Race_ID'], keep='last')
             .set_index('Race_ID')['C_k'])
    
    df['C_t'] = df['Race_ID'].map(s).fillna(1)
    print (df)
        Race_ID       Date  adv  C_k  C_t
    0         1   1/1/2023  2.5  2.7  1.9
    1         1   1/1/2023  1.4  2.6  1.9
    2         1   1/1/2023  1.3  1.9  1.9
    3         1   1/1/2023  1.1  1.2  1.9
    4         2  11/9/2022  1.4  1.1  1.2
    5         2  11/9/2022  1.3  1.2  1.2
    6         2  11/9/2022  1.0  0.4  1.2
    7         3  17/4/2022  0.9  0.2  1.0
    8         3  17/4/2022  0.8  0.4  1.0
    9         3  17/4/2022  0.7  0.5  1.0
    10        3  17/4/2022  0.6  0.2  1.0
    11        3  17/4/2022  0.5  0.4  1.0
    

    Or use Series.where for set NaNs for not matched values and get last non missing value per group by GroupBy.transform with last:

    df['C_t'] = (df['C_k'].where(df.groupby('Race_ID')['adv'].shift(-1).le(df['C_k']))
                          .groupby(df['Race_ID'])
                          .transform('last')
                          .fillna(1))
    print (df)
        Race_ID       Date  adv  C_k  C_t
    0         1   1/1/2023  2.5  2.7  1.9
    1         1   1/1/2023  1.4  2.6  1.9
    2         1   1/1/2023  1.3  1.9  1.9
    3         1   1/1/2023  1.1  1.2  1.9
    4         2  11/9/2022  1.4  1.1  1.2
    5         2  11/9/2022  1.3  1.2  1.2
    6         2  11/9/2022  1.0  0.4  1.2
    7         3  17/4/2022  0.9  0.2  1.0
    8         3  17/4/2022  0.8  0.4  1.0
    9         3  17/4/2022  0.7  0.5  1.0
    10        3  17/4/2022  0.6  0.2  1.0
    11        3  17/4/2022  0.5  0.4  1.0