pandas · group-by · row

group by, order and detect change between rows


I have a dataframe. I would like to group by col1, order by col3, and detect changes from row to row in col2.

Here is my example:

import pandas as pd
import datetime

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})

my_df.sort_values(by=['col3'], inplace=True)
my_df_temp = my_df.groupby('col1')['col2'].apply(
    lambda x: x != x.shift(1)
).reset_index(name='col2_change')

Here is how my dataframe looks:

  col1  col2        col3
0    a     2  2023-02-01
1    a     2  2023-03-01
2    a     3  2023-04-01
3    b     5  2023-02-01
4    b     5  2023-03-01
5    b     5  2023-04-01

Here is what the result looks like:

  col1  level_1  col2_change
0    a        0         True
1    a        1        False
2    a        2         True
3    b        3         True
4    b        4        False
5    b        5        False

This is clearly incorrect. What am I doing wrong?


Solution

  • First of all, your issue is not obvious; you should provide the expected output for clarity.

    I imagine that you want to add a new column and keep the existing columns unchanged. For that you would need to use groupby.transform:

    my_df['col2_change'] = (my_df
                            .groupby('col1')['col2']
                            .transform(lambda x: x != x.shift())
                           )
    

    Variant with groupby.shift:

    my_df['col2_change'] = (my_df
                            .groupby('col1')['col2']
                            .shift().ne(my_df['col2'])
                           )
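
    As a quick sanity check (reusing the sample data from the question), both forms produce the same index-aligned boolean Series:

```python
import datetime

import pandas as pd

# Sample data from the question
my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})
my_df.sort_values(by=['col3'], inplace=True)

# transform returns a result aligned with my_df's original index
via_transform = my_df.groupby('col1')['col2'].transform(lambda x: x != x.shift())

# groupby.shift also returns an aligned Series, so .ne compares row-wise
via_shift = my_df.groupby('col1')['col2'].shift().ne(my_df['col2'])

print((via_transform == via_shift).all())  # True
print(via_transform.tolist())              # [True, True, False, False, True, False]
```

    Note that the list order follows the col3-sorted index (0, 3, 1, 4, 2, 5), which is why group a and group b values interleave.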
    

    In addition, if you don't want to map the first value of a group as True, you could perform a double shift:

    my_df['col2_change2'] = (my_df
                             .groupby('col1')['col2']
                             .transform(lambda x: x.ne(x.shift(-1))
                                                   .shift(fill_value=False))
                            )
    

    NB: a double shift is preferred to bfill, which would incorrectly fill any internal NaNs.
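
    To illustrate with a minimal standalone sketch (a hypothetical Series with an internal NaN, not taken from the question's data):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 3.0, 3.0])

# bfill-based variant: shift, then back-fill the NaNs in the shifted Series
via_bfill = s.ne(s.shift().bfill())

# double-shift variant: compare each value to the next, then shift the result down
via_double_shift = s.ne(s.shift(-1)).shift(fill_value=False)

print(via_bfill.tolist())         # [False, True, False, False]
print(via_double_shift.tolist())  # [False, True, True, False]
```

    The bfill version back-fills the internal NaN in the shifted Series with the following 3.0, so the change at index 2 (NaN → 3.0) goes undetected; the double shift reports it.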

    Or using duplicated and where:

    my_df['col2_change2'] = (my_df
                             .groupby('col1')['col2']
                             .transform(lambda x: x != x.shift())
                             .where(my_df['col1'].duplicated(), False)
                            )
    

    Output:

      col1  col2        col3  col2_change  col2_change2
    0    a     2  2023-02-01         True         False
    3    b     5  2023-02-01         True         False
    1    a     2  2023-03-01        False         False
    4    b     5  2023-03-01        False         False
    2    a     3  2023-04-01         True          True
    5    b     5  2023-04-01        False         False
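
    For completeness, a quick check on the same sample data that the two "skip the first row" variants agree:

```python
import datetime

import pandas as pd

# Sample data from the question
my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 4, 1)]})
my_df.sort_values(by=['col3'], inplace=True)

# Double-shift variant
via_double_shift = (my_df.groupby('col1')['col2']
                    .transform(lambda x: x.ne(x.shift(-1)).shift(fill_value=False)))

# duplicated/where variant: force the first row of each group to False
via_where = (my_df.groupby('col1')['col2']
             .transform(lambda x: x != x.shift())
             .where(my_df['col1'].duplicated(), False))

print((via_double_shift == via_where).all())  # True
print(via_double_shift.tolist())              # [False, False, False, False, True, False]
```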