There is a dataframe like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'x': np.arange(1, 29),
    'y': [5.69, 6.03, 6.03, 6.03, 6.03, 6.03, 6.03, 5.38, 5.21, 5.4, 5.24,
          5.4, 5.36, 5.47, 5.58, 5.5, 5.61, 5.53, 5.4, 5.51, 5.47, 5.44,
          5.39, 5.27, 5.38, 5.35, 5.32, 5.09],
    'valley': [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 1],
    'peak': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0]})
>>> df
x y valley peak
0 1 5.69 1 0
1 2 6.03 0 0
2 3 6.03 0 1
3 4 6.03 0 0
4 5 6.03 0 0
5 6 6.03 0 0
6 7 6.03 0 0
7 8 5.38 0 0
8 9 5.21 1 0
9 10 5.40 0 0
10 11 5.24 0 0
11 12 5.40 0 0
12 13 5.36 0 0
13 14 5.47 0 0
14 15 5.58 0 0
15 16 5.50 0 0
16 17 5.61 0 1
17 18 5.53 0 0
18 19 5.40 0 0
19 20 5.51 0 0
20 21 5.47 0 0
21 22 5.44 0 0
22 23 5.39 0 0
23 24 5.27 0 0
24 25 5.38 0 0
25 26 5.35 0 0
26 27 5.32 0 0
27 28 5.09 1 0
I would like to add a new column 'grp' to this dataframe: for each run of rows that starts at a 1 in the valley column and continues up to (but not including) the next 1 in the peak column, the value in the new column should be 'A'; conversely, for each run that starts at a 1 in the peak column and continues up to the next 1 in the valley column, the value should be 'B'.
The desired result is:
>>> out
x y valley peak grp
0 1 5.69 1 0 A
1 2 6.03 0 0 A
2 3 6.03 0 1 B
3 4 6.03 0 0 B
4 5 6.03 0 0 B
5 6 6.03 0 0 B
6 7 6.03 0 0 B
7 8 5.38 0 0 B
8 9 5.21 1 0 A
9 10 5.40 0 0 A
10 11 5.24 0 0 A
11 12 5.40 0 0 A
12 13 5.36 0 0 A
13 14 5.47 0 0 A
14 15 5.58 0 0 A
15 16 5.50 0 0 A
16 17 5.61 0 1 B
17 18 5.53 0 0 B
18 19 5.40 0 0 B
19 20 5.51 0 0 B
20 21 5.47 0 0 B
21 22 5.44 0 0 B
22 23 5.39 0 0 B
23 24 5.27 0 0 B
24 25 5.38 0 0 B
25 26 5.35 0 0 B
26 27 5.32 0 0 B
27 28 5.09 1 0 A
Is there a native way to achieve this in pandas, without apply with a custom function or for-loops?
Given your description, you could use np.select and ffill. This way, even if you have multiple peaks before a valley (or the other way around), the order is preserved:
m1 = df['valley'].eq(1)
m2 = df['peak'].eq(1)
df['grp'] = pd.Series(np.select([m1, m2], ['A', 'B'], pd.NA),
                      index=df.index).ffill()
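For example, before the ffill only the marker rows carry a label, so repeated markers of the same kind are simply carried forward in order. A minimal sketch on toy data (not the frame above):
toy = pd.DataFrame({'valley': [1, 0, 0, 1, 0],
                    'peak':   [0, 1, 1, 0, 0]})
s = pd.Series(np.select([toy['valley'].eq(1), toy['peak'].eq(1)],
                        ['A', 'B'], pd.NA),
              index=toy.index).ffill()
print(s.tolist())  # ['A', 'B', 'B', 'A', 'A']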
Variant with case_when:
m1 = df['valley'].eq(1)
m2 = df['peak'].eq(1)
df['grp'] = df['valley'].case_when([(m1, 'A'), (m2, 'B'),
                                    (~(m1 | m2), pd.NA)]).ffill()
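Note that Series.case_when was added in pandas 2.2. On older versions, a rough equivalent sketch (reusing m1/m2 from above) with Series.mask:
df['grp'] = (pd.Series(pd.NA, index=df.index)
               .mask(m1, 'A').mask(m2, 'B').ffill())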
Or with from_dummies, after adding a helper column so that every row contains exactly one 1:
df['grp'] = (pd.from_dummies(df[['valley', 'peak']]
                             .assign(other=df[['valley', 'peak']]
                                     .sum(axis=1).rsub(1)))
             .squeeze().map({'valley': 'A', 'peak': 'B'}).ffill()
             )
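The helper column is needed because from_dummies expects exactly one 1 per row (assuming valley and peak are never both 1 on the same row). If your pandas version supports it (from_dummies and its default_category argument were added in 1.5), a shorter sketch lets the all-zero rows fall into a default category instead:
df['grp'] = (pd.from_dummies(df[['valley', 'peak']], default_category='other')
               .squeeze().map({'valley': 'A', 'peak': 'B'}).ffill())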
Or with reshaping:
df['grp'] = (df[['valley', 'peak']]
             .rename_axis(columns='col')
             .replace(0, pd.NA).stack().reset_index(-1, name='val')
             .replace({'col': {'valley': 'A', 'peak': 'B'}})['col']
             .reindex(df.index).ffill()
             )
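The idea is to keep only the marker rows, label them, then spread the labels back over the full index with reindex + ffill. A rough equivalent sketch of the same idea without stack:
labels = (df.loc[df['valley'].eq(1) | df['peak'].eq(1), ['valley', 'peak']]
            .idxmax(axis=1).map({'valley': 'A', 'peak': 'B'}))
print(labels.to_dict())  # {0: 'A', 2: 'B', 8: 'A', 16: 'B', 27: 'A'}
df['grp'] = labels.reindex(df.index).ffill()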
Alternatively, if the markers always alternate strictly (valley, then peak, then valley, ...), you could use cumsum + mod and map the group after identifying whether a peak or a valley comes first:
df['grp'] = (df[['valley', 'peak']].max(axis=1).cumsum()
             .add(df[['valley', 'peak']].idxmax().idxmin() == 'peak')
             .mod(2).map({0: 'B', 1: 'A'})
             )
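To see why the parity trick works here (a sketch on the same data): every marker row bumps a running count, so all rows from a valley up to the next peak share one parity and all rows from a peak up to the next valley share the other; the add(...) term flips the mapping if the first marker happens to be a peak.
counts = df[['valley', 'peak']].max(axis=1).cumsum()
print(counts.head(10).tolist())  # [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
print(df[['valley', 'peak']].idxmax().idxmin() == 'peak')  # False: a valley comes first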
Output:
x y valley peak grp
0 1 5.69 1 0 A
1 2 6.03 0 0 A
2 3 6.03 0 1 B
3 4 6.03 0 0 B
4 5 6.03 0 0 B
5 6 6.03 0 0 B
6 7 6.03 0 0 B
7 8 5.38 0 0 B
8 9 5.21 1 0 A
9 10 5.40 0 0 A
10 11 5.24 0 0 A
11 12 5.40 0 0 A
12 13 5.36 0 0 A
13 14 5.47 0 0 A
14 15 5.58 0 0 A
15 16 5.50 0 0 A
16 17 5.61 0 1 B
17 18 5.53 0 0 B
18 19 5.40 0 0 B
19 20 5.51 0 0 B
20 21 5.47 0 0 B
21 22 5.44 0 0 B
22 23 5.39 0 0 B
23 24 5.27 0 0 B
24 25 5.38 0 0 B
25 26 5.35 0 0 B
26 27 5.32 0 0 B
27 28 5.09 1 0 A