pythonpandasdataframe

How can I check previous values of a column to find the value that is greater than the selected row in another column?


This is my DataFrame:

import pandas as pd
df = pd.DataFrame({
    'a': [10, 20, 30, 1, 20, 3, 4, 0],
    'b': [30, 3, 11, 25, 24, 31, 29, 2],
    'c': [True, True, True, False, False, True, True, True]
})

Expected output is creating column d:

    a   b      c    d
0  10  30   True    NaN
1  20   3   True    10
2  30  11   True    20
3   1  25  False    NaN
4  20  24  False    NaN
5   3  31   True    NaN
6   4  29   True    30
7   0   2   True    4

First of all the values in b that their c are True are selected. I explain the process from row 1 because it is easier to understand.

The value in b is 3, then all values above it should be checked. And the NEAREST value in a that is greater than 3 should be selected. So 10 is selected.

for row number 2, the value is 11. The nearest value to this one that is greater than it in a is 20.

For rows 3 and 4 since c is False. NaN should be selected.

For row 5, since there are no previous values that are greater than 31 in a, NaN is selected.

For row 6, the nearest value in a that is greater than 29 is 30.

This is what I have tried so far. It doesn't give me the output. I think the approach that I'm taking might be correct.

t = df['a'].to_numpy()
h = df['b'].to_numpy()

m2 = t < h[:, None]
df['d'] = np.nanmax(np.where(m2, t, np.nan), axis=1)

Solution

  • An efficient approach would be to use janitor's conditional_join:

    #  pip install pyjanitor
    import janitor
    
    tmp = df.reset_index()
    
    df.loc[df['c'], 'd'] = (tmp[['index', 'b']]
                            .conditional_join(tmp[['index', 'a']],
                                              ('index', 'index', '>'),
                                              ('b', 'a', '<='),
                                              keep='last', how='left',
                                              right_columns=['a'])
                            .set_index('index')['a']
                           )
    

    Output:

        a   b      c     d
    0  10  30   True   NaN
    1  20   3   True  10.0
    2  30  11   True  20.0
    3   1  25  False   NaN
    4  20  24  False   NaN
    5   3  31   True   NaN
    6   4  29   True  30.0
    7   0   2   True   4.0