python · pandas · dataframe · lambda · group-by

How to use vectorized calculations in pandas to find where a value or category changes, with the first row of each group corrected?


In a dataset with millions of records, I have items with various categories and measurements, and I'm trying to figure out how many of the records have changed, in particular when the category or measurement goes to NaN (or NULL from the database query) during the sequence.

In SQL, I'd use some PARTITION-style OLAP functions to do this. It seems like it should be fairly straightforward in Python with pandas, but I can't quite wrap my head around the vectorized notation.

I've tried various df.groupby clauses and lambda functions, but nothing quite gets it into the required format. Basically, with df.groupby('item')['measure'] in this example, the first row of each grouped subset of item & measure always returns True, where I'd like it to be False or NaN. Simply put, these are false positives. I understand that from pandas' perspective it is a change, since the first x.shift() is NaN, but I can't figure out how to filter that out or handle it in the lambda function.

Sample Code:

import pandas as pd
import numpy as np

test_df = pd.DataFrame({'item': [20, 20, 20, 20, 20, 20, 20, 20, 30, 30, 30, 30, 30, 30, 30, 30, 40, 40, 40, 40, 40, 40, 40, 40 ],
                        'measure': [1, 1, 1, 3, 3, 3, 3, 3, 6, 6, 6, 6, 6, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10 ],
                        'cat': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']})

test_df['measure_change'] = test_df.groupby('item')['measure'].transform(lambda x: x.shift() != x)
test_df['cat_change'] = test_df.groupby('item')['cat'].transform(lambda x: x.shift() != x)

In the output below, rows 0, 8, and 16 are examples where measure_change should be False. With that fix, all of item 40 would have measure_change == False, which would indicate no changes within that item. Any & all suggestions are appreciated.

(cat_change set up the same way)

# item measure measure_change
0 20 1 True
1 20 1 False
2 20 1 False
3 20 3 True
4 20 3 False
5 20 3 False
6 20 3 False
7 20 3 False
8 30 6 True
9 30 6 False
10 30 6 False
11 30 6 False
12 30 6 False
13 30 7 True
14 30 7 False
15 30 7 False
16 40 10 True
17 40 10 False
18 40 10 False
19 40 10 False
20 40 10 False
21 40 10 False
22 40 10 False
23 40 10 False

Solution

  • You can use a combination of groupby.diff and fillna to achieve this. diff computes the difference from the previous row within each group (NaN for the first row of each group, which fillna(0) turns into "no change"); comparing the result with 0 flags the rows where measure changed:

    test_df['measure_change'] = test_df.groupby('item')['measure'].diff().fillna(0) != 0
    

    Result:

        item  measure  measure_change
    0     20        1           False
    1     20        1           False
    2     20        1           False
    3     20        3            True
    4     20        3           False
    5     20        3           False
    6     20        3           False
    7     20        3           False
    8     30        6           False
    9     30        6           False
    10    30        6           False
    11    30        6           False
    12    30        6           False
    13    30        7            True
    14    30        7           False
    15    30        7           False
    16    40       10           False
    17    40       10           False
    18    40       10           False
    19    40       10           False
    20    40       10           False
    21    40       10           False
    22    40       10           False
    23    40       10           False
    
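    One caveat, given the NaN/NULL values mentioned in the question: diff between a number and NaN is itself NaN, so fillna(0) silently reports a step into or out of NaN as "no change". If those transitions should count as changes, one possible sketch (not part of the original answer) is to compare against the per-group shift explicitly, treat NaN-to-NaN as unchanged, and mask the first row of each group with cumcount:

    ```python
    import pandas as pd
    import numpy as np

    # illustrative frame with a NaN run inside item 20
    df = pd.DataFrame({'item': [20, 20, 20, 30, 30, 30],
                       'measure': [1, np.nan, np.nan, 6, 6, 7]})

    prev = df.groupby('item')['measure'].shift()

    # a row changed if it differs from its predecessor,
    # except when both values are NaN (no real change)
    changed = (df['measure'] != prev) & ~(df['measure'].isna() & prev.isna())

    # the first row of each group has no predecessor, so force it to False
    changed &= df.groupby('item').cumcount() > 0

    df['measure_change'] = changed
    ```

    Here row 1 (1 → NaN) is flagged as a change, row 2 (NaN → NaN) is not, and the first row of each item stays False.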

    Alternatively, if you have string columns to compare as well (where diff is not available), you can keep the transform approach and add a second condition that checks the shifted value for NaN: x.shift().notna().

    test_df['measure_change'] = test_df.groupby('item')['measure'].transform(lambda x: (x != x.shift()) & (x.shift().notna()))
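    Applied to the string cat column, the same pattern gives the corrected cat_change (shown here on a small illustrative frame rather than the full sample data):

    ```python
    import pandas as pd

    test_df = pd.DataFrame({'item': [20, 20, 30, 30],
                            'cat': ['a', 'b', 'c', 'c']})

    # first row of each group: x.shift() is NaN, so notna() masks it to False
    test_df['cat_change'] = test_df.groupby('item')['cat'].transform(
        lambda x: (x != x.shift()) & x.shift().notna())
    ```

    Only the a → b transition within item 20 is flagged; the first row of each item is False instead of the false-positive True.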