This is my DataFrame:
import pandas as pd
df = pd.DataFrame({
'a': [10, 20, 30, 1, 20, 3, 4, 0],
'b': [30, 3, 11, 25, 24, 31, 29, 2],
'c': [True, True, True, False, False, True, True, True]
})
Expected output is creating column d
:
a b c d
0 10 30 True NaN
1 20 3 True 10
2 30 11 True 20
3 1 25 False NaN
4 20 24 False NaN
5 3 31 True NaN
6 4 29 True 30
7 0 2 True 4
First of all the values in b
that their c
are True
are selected. I explain the process from row 1
because it is easier to understand.
The value in b
is 3, then all values above it should be checked. And the NEAREST value in a
that is greater than 3 should be selected. So 10 is selected.
for row number 2
, the value is 11. The nearest value to this one that is greater than it in a
is 20.
For rows 3 and 4 since c
is False
. NaN
should be selected.
For row 5
, since there are no previous values that are greater than 31 in a
, NaN
is selected.
For row 6
, the nearest value in a
that is greater than 29 is 30.
This is what I have tried so far. It doesn't give me the output. I think the approach that I'm taking might be correct.
t = df['a'].to_numpy()
h = df['b'].to_numpy()
m2 = t < h[:, None]
df['d'] = np.nanmax(np.where(m2, t, np.nan), axis=1)
An efficient approach would be to use janitor
's conditional_join
:
# pip install pyjanitor
import janitor
tmp = df.reset_index()
df.loc[df['c'], 'd'] = (tmp[['index', 'b']]
.conditional_join(tmp[['index', 'a']],
('index', 'index', '>'),
('b', 'a', '<='),
keep='last', how='left',
right_columns=['a'])
.set_index('index')['a']
)
Output:
a b c d
0 10 30 True NaN
1 20 3 True 10.0
2 30 11 True 20.0
3 1 25 False NaN
4 20 24 False NaN
5 3 31 True NaN
6 4 29 True 30.0
7 0 2 True 4.0