I have a dataset with millions of records containing items with various categories and measurements, and I'm trying to figure out how many of the records have changed, in particular when the category or measurement goes to NaN (or NULL from the database query) during the sequence.
In SQL I'd use PARTITION-style OLAP window functions for this, and it seems like it should be fairly straightforward in Python with pandas, but I can't quite wrap my head around the vectorized notation.
I've tried various `df.groupby` clauses and lambda functions, but nothing quite gets it into the required format. Basically, with `df.groupby('item')['measure']` in this example, the first row of each grouped subset of item & measure always returns `True`, where I'd like it to be `False` or `NaN`. Simply put, they are false positives. I understand that from pandas' perspective it's a change, since the first `x.shift()` would be NaN, but I can't figure out how to filter that out or handle it in the lambda function.
Sample Code:
```python
import pandas as pd
import numpy as np

test_df = pd.DataFrame({
    'item':    [20, 20, 20, 20, 20, 20, 20, 20,
                30, 30, 30, 30, 30, 30, 30, 30,
                40, 40, 40, 40, 40, 40, 40, 40],
    'measure': [1, 1, 1, 3, 3, 3, 3, 3,
                6, 6, 6, 6, 6, 7, 7, 7,
                10, 10, 10, 10, 10, 10, 10, 10],
    'cat':     ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b',
                'c', 'c', 'c', 'c', 'c', 'd', 'd', 'd',
                'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']})

test_df['measure_change'] = test_df.groupby('item')['measure'].transform(lambda x: x.shift() != x)
test_df['cat_change'] = test_df.groupby('item')['cat'].transform(lambda x: x.shift() != x)
```
In the output below, for example, rows 0, 8, and 16 should have `measure_change == False`. So all of item 40 would have `measure_change == False`, which would indicate no changes for that item (`cat_change` is set up the same way). Any & all suggestions are appreciated.
# | item | measure | measure_change |
---|---|---|---|
0 | 20 | 1 | True |
1 | 20 | 1 | False |
2 | 20 | 1 | False |
3 | 20 | 3 | True |
4 | 20 | 3 | False |
5 | 20 | 3 | False |
6 | 20 | 3 | False |
7 | 20 | 3 | False |
8 | 30 | 6 | True |
9 | 30 | 6 | False |
10 | 30 | 6 | False |
11 | 30 | 6 | False |
12 | 30 | 6 | False |
13 | 30 | 7 | True |
14 | 30 | 7 | False |
15 | 30 | 7 | False |
16 | 40 | 10 | True |
17 | 40 | 10 | False |
18 | 40 | 10 | False |
19 | 40 | 10 | False |
20 | 40 | 10 | False |
21 | 40 | 10 | False |
22 | 40 | 10 | False |
23 | 40 | 10 | False |
You can use a combination of `groupby.diff` and `fillna` to achieve this. We compare the row difference with 0 to find any rows where `measure` changed:

```python
test_df['measure_change'] = test_df.groupby('item')['measure'].diff().fillna(0) != 0
```
Result:

```
    item  measure  measure_change
0     20        1           False
1     20        1           False
2     20        1           False
3     20        3            True
4     20        3           False
5     20        3           False
6     20        3           False
7     20        3           False
8     30        6           False
9     30        6           False
10    30        6           False
11    30        6           False
12    30        6           False
13    30        7            True
14    30        7           False
15    30        7           False
16    40       10           False
17    40       10           False
18    40       10           False
19    40       10           False
20    40       10           False
21    40       10           False
22    40       10           False
23    40       10           False
```
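A quick note on why the `fillna(0)` matters: `diff()` leaves NaN in the first row of each group, and `NaN != 0` actually evaluates `True`, so without the fill the first rows would come back as false positives again. A minimal sketch on a single series:

```python
import pandas as pd

s = pd.Series([1, 1, 3])

# diff() puts NaN in the first position
print(s.diff().tolist())                   # [nan, 0.0, 2.0]

# NaN != 0 is True, so comparing directly re-flags the first row
print((s.diff() != 0).tolist())            # [True, False, True]

# filling with 0 first guarantees the first row is False
print((s.diff().fillna(0) != 0).tolist())  # [False, False, True]
```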
Alternatively, if you have strings to compare as well, you can add a secondary condition that checks the shifted values for NaNs with `x.shift().notna()`:

```python
test_df['measure_change'] = test_df.groupby('item')['measure'].transform(
    lambda x: (x != x.shift()) & (x.shift().notna()))
```
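One caveat, assuming `measure` itself can contain NaN (the NULL case mentioned in the question): `diff()` propagates NaN, so the `fillna(0)` version silently reports `False` for a change *to* NaN, while the `shift`-based comparison still flags it, because `NaN != 1` evaluates `True`. A small sketch with a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item':    [20, 20, 20, 30, 30],
                   'measure': [1.0, 1.0, np.nan, 6.0, 7.0]})

# diff-based: NaN - 1.0 is NaN, and fillna(0) masks the change to NaN
diff_based = df.groupby('item')['measure'].diff().fillna(0) != 0

# shift-based: NaN != 1.0 evaluates True, so the change is still flagged
shift_based = df.groupby('item')['measure'].transform(
    lambda x: (x != x.shift()) & (x.shift().notna()))

print(diff_based.tolist())   # [False, False, False, False, True] -> row 2 missed
print(shift_based.tolist())  # [False, False, True, False, True]  -> row 2 flagged
```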