pandasdataframe

Pandas dataframe add a mark columns by two other columns condition


There is a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x':np.arange(1,29),'y':[5.69, 6.03, 6.03, 6.03, 6.03, 6.03, 6.03, 5.38, 5.21, 5.4 , 5.24,
       5.4 , 5.36, 5.47, 5.58, 5.5 , 5.61, 5.53, 5.4 , 5.51, 5.47, 5.44,5.39, 5.27, 5.38, 5.35, 5.32, 5.09],
          'valley':[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
          'peak':[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0]})

>>> df
     x     y  valley  peak
0    1  5.69       1     0
1    2  6.03       0     0
2    3  6.03       0     1
3    4  6.03       0     0
4    5  6.03       0     0
5    6  6.03       0     0
6    7  6.03       0     0
7    8  5.38       0     0
8    9  5.21       1     0
9   10  5.40       0     0
10  11  5.24       0     0
11  12  5.40       0     0
12  13  5.36       0     0
13  14  5.47       0     0
14  15  5.58       0     0
15  16  5.50       0     0
16  17  5.61       0     1
17  18  5.53       0     0
18  19  5.40       0     0
19  20  5.51       0     0
20  21  5.47       0     0
21  22  5.44       0     0
22  23  5.39       0     0
23  24  5.27       0     0
24  25  5.38       0     0
25  26  5.35       0     0
26  27  5.32       0     0
27  28  5.09       1     0

I hope to add a new column 'grp' to this dataframe, with the requirement that for each row that starts with "1" in the valley column and ends with "1" in the peak column, the value in the added column is "A", and conversely, for each row that starts with "1" in the peak column and ends with "1" in the valley column, the value in the added column is 'B'.

The desire result is:

>>> out
     x     y  valley  peak  grp
0    1  5.69       1     0  A
1    2  6.03       0     0  A
2    3  6.03       0     1  B
3    4  6.03       0     0  B
4    5  6.03       0     0  B
5    6  6.03       0     0  B
6    7  6.03       0     0  B
7    8  5.38       0     0  B
8    9  5.21       1     0  A
9   10  5.40       0     0  A
10  11  5.24       0     0  A
11  12  5.40       0     0  A
12  13  5.36       0     0  A
13  14  5.47       0     0  A
14  15  5.58       0     0  A
15  16  5.50       0     0  A
16  17  5.61       0     1  B
17  18  5.53       0     0  B
18  19  5.40       0     0  B
19  20  5.51       0     0  B
20  21  5.47       0     0  B
21  22  5.44       0     0  B
22  23  5.39       0     0  B
23  24  5.27       0     0  B
24  25  5.38       0     0  B
25  26  5.35       0     0  B
26  27  5.32       0     0  B
27  28  5.09       1     0  A

If we don't use apply with a function and for-loops, is there a native way to achieve by use pandas?


Solution

  • Given your description, you could use np.select and ffill. This way you'll ensure that even if you have multiple peaks before a valley or conversely this will keep the order:

    m1 = df['valley'].eq(1)
    m2 = df['peak'].eq(1)
    df['grp'] = pd.Series(np.select([m1, m2], ['A', 'B'], pd.NA),
                          index=df.index).ffill()
    

    Variant with case_when:

    m1 = df['valley'].eq(1)
    m2 = df['peak'].eq(1)
    df['grp'] = df['valley'].case_when([(m1, 'A'), (m2, 'B'), (~(m1|m2), pd.NA)]
                                       ).ffill()
    

    Or with from_dummies after adding a new column:

    df['grp'] = (pd.from_dummies(df[['valley', 'peak']]
                                 .assign(other=df[['valley', 'peak']]
                                         .sum(axis=1).rsub(1)))
                   .squeeze().map({'valley': 'A', 'peak': 'B'}).ffill()
                )
    

    Or with reshaping:

    df['grp'] = (df[['valley', 'peak']]
     .rename_axis(columns='col')
     .replace(0, pd.NA).stack().reset_index(-1, name='val')
     .replace({'col': {'valley': 'A', 'peak': 'B'}})['col']
     .reindex(df.index).ffill()
    )
    

    Alternatively, if there is always a valley then peak then valley... you could use cumsum+mod and map the group after identify which one of peak/valley is the first:

    df['grp'] = (df[['valley', 'peak']].max(axis=1).cumsum()
                .add(df[['valley', 'peak']].idxmax().idxmin() == 'peak')
                .mod(2).map({0: 'B', 1: 'A'})
               )
    

    Output:

         x     y  valley  peak grp
    0    1  5.69       1     0   A
    1    2  6.03       0     0   A
    2    3  6.03       0     1   B
    3    4  6.03       0     0   B
    4    5  6.03       0     0   B
    5    6  6.03       0     0   B
    6    7  6.03       0     0   B
    7    8  5.38       0     0   B
    8    9  5.21       1     0   A
    9   10  5.40       0     0   A
    10  11  5.24       0     0   A
    11  12  5.40       0     0   A
    12  13  5.36       0     0   A
    13  14  5.47       0     0   A
    14  15  5.58       0     0   A
    15  16  5.50       0     0   A
    16  17  5.61       0     1   B
    17  18  5.53       0     0   B
    18  19  5.40       0     0   B
    19  20  5.51       0     0   B
    20  21  5.47       0     0   B
    21  22  5.44       0     0   B
    22  23  5.39       0     0   B
    23  24  5.27       0     0   B
    24  25  5.38       0     0   B
    25  26  5.35       0     0   B
    26  27  5.32       0     0   B
    27  28  5.09       1     0   A