python · pandas · dataframe · lambda · group-by

How to use vectorized calculations in pandas to find where a value or category changes, with the first row of each group corrected?


In a dataset with millions of records, I have items with various categories and measurements, and I'm trying to figure out how many of the records have changed, in particular when the category or measurement goes to NaN (or NULL from the database query) during the sequence.

In SQL, I'd use some PARTITION-style OLAP functions to do this. It seems like it should be fairly straightforward in Python with pandas, but I can't quite wrap my head around the vectorized notation.

I've tried various df.groupby clauses and lambda functions, but nothing quite gets it into the required format. Basically, with df.groupby('item')['measure'] in this example, the first row of each grouped subset of item & measure always returns True, where I'd like it to be False or NaN. Simply put, these are false positives. I understand that from pandas' perspective it is a change, since the first x.shift() is NaN, but I can't figure out how to filter that out or handle it in the lambda function.

Sample Code:

import pandas as pd
import numpy as np

test_df = pd.DataFrame({'item': [20, 20, 20, 20, 20, 20, 20, 20, 30, 30, 30, 30, 30, 30, 30, 30, 40, 40, 40, 40, 40, 40, 40, 40 ],
                        'measure': [1, 1, 1, 3, 3, 3, 3, 3, 6, 6, 6, 6, 6, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10 ],
                        'cat': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']})

test_df['measure_change'] = test_df.groupby('item')['measure'].transform(lambda x: x.shift() != x)
test_df['cat_change'] = test_df.groupby('item')['cat'].transform(lambda x: x.shift() != x)

In the output below, rows 0, 8, and 16 are examples where measure_change should be False. With that fix, all of item 40 would have measure_change == False, which would indicate no changes within that item. Any & all suggestions are appreciated.

(cat_change set up the same way)

# item measure measure_change
0 20 1 True
1 20 1 False
2 20 1 False
3 20 3 True
4 20 3 False
5 20 3 False
6 20 3 False
7 20 3 False
8 30 6 True
9 30 6 False
10 30 6 False
11 30 6 False
12 30 6 False
13 30 7 True
14 30 7 False
15 30 7 False
16 40 10 True
17 40 10 False
18 40 10 False
19 40 10 False
20 40 10 False
21 40 10 False
22 40 10 False
23 40 10 False

Solution

  • You can use a combination of groupby.diff and fillna to achieve this. diff computes the difference from the previous row within each group (NaN for the first row of each group, which fillna(0) turns into "no change"); comparing the result with 0 flags the rows where measure changed:

    test_df['measure_change'] = test_df.groupby('item')['measure'].diff().fillna(0) != 0
    

    Result:

        item  measure  measure_change
    0     20        1           False
    1     20        1           False
    2     20        1           False
    3     20        3            True
    4     20        3           False
    5     20        3           False
    6     20        3           False
    7     20        3           False
    8     30        6           False
    9     30        6           False
    10    30        6           False
    11    30        6           False
    12    30        6           False
    13    30        7            True
    14    30        7           False
    15    30        7           False
    16    40       10           False
    17    40       10           False
    18    40       10           False
    19    40       10           False
    20    40       10           False
    21    40       10           False
    22    40       10           False
    23    40       10           False
    
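    One caveat, given the NaN/NULL values mentioned in the question: diff between a number and NaN is itself NaN, so fillna(0) silently reports a step into or out of NaN as "no change". If those transitions should count as changes, one possible sketch (not part of the original answer) is to compare against the per-group shift explicitly, treat NaN-to-NaN as unchanged, and mask the first row of each group with cumcount:

    ```python
    import pandas as pd
    import numpy as np

    # illustrative frame with a NaN run inside item 20
    df = pd.DataFrame({'item': [20, 20, 20, 30, 30, 30],
                       'measure': [1, np.nan, np.nan, 6, 6, 7]})

    prev = df.groupby('item')['measure'].shift()

    # a row changed if it differs from its predecessor,
    # except when both values are NaN (no real change)
    changed = (df['measure'] != prev) & ~(df['measure'].isna() & prev.isna())

    # the first row of each group has no predecessor, so force it to False
    changed &= df.groupby('item').cumcount() > 0

    df['measure_change'] = changed
    ```

    Here row 1 (1 → NaN) is flagged as a change, row 2 (NaN → NaN) is not, and the first row of each item stays False.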

    Alternatively, if you have string columns to compare as well (where diff is not available), you can keep the transform approach and add a second condition that checks the shifted value for NaN: x.shift().notna().

    test_df['measure_change'] = test_df.groupby('item')['measure'].transform(lambda x: (x != x.shift()) & (x.shift().notna()))
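    Applied to the string cat column, the same pattern gives the corrected cat_change (shown here on a small illustrative frame rather than the full sample data):

    ```python
    import pandas as pd

    test_df = pd.DataFrame({'item': [20, 20, 30, 30],
                            'cat': ['a', 'b', 'c', 'c']})

    # first row of each group: x.shift() is NaN, so notna() masks it to False
    test_df['cat_change'] = test_df.groupby('item')['cat'].transform(
        lambda x: (x != x.shift()) & x.shift().notna())
    ```

    Only the a → b transition within item 20 is flagged; the first row of each item is False instead of the false-positive True.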