I'm trying to calculate the balance (the level) of an inventory over time and have incoming and outgoing quantities as input (and a category for each type of inventory). Usually I would calculate incoming - outgoing
and carry over to the next period (cumulative sum), but in this case an added difficulty is that the balance can be overridden at various points in time which "resets" the balance to these values (and incoming/outgoings need to be added to these overrides from this point in time forward).
I came up with a way to calculate this by offsetting the calculated balance (=cumsum(incoming-outgoing)) when there's an override balance (by the negative calculated cumsum; i.e. setting the inventory to 0 when there's an override balance), but that doesn't work when there's multiple overrides at different times.
This is my current approach which works fine for the given dataframe (=only one override (bal
) per category (cat
)).
>>> df = pd.DataFrame({
... 'cat': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b'],
... 'time': [1, 2, 1, 2, 4, 5, 6, 7, 8, 9],
... 'in': [None, 10, None, None, None, 20, 11, 9, 10, None],
... 'out': [10, None, None, 20, 10, 5, None, 30, None, None],
... 'bal': [None, None, None, None, 50, None, None, None, None, None]
^ at this time, the balance should be set to 50, irrespective of prior `in` and `out`.
... })
>>>
>>> # cumsum goes by row, so order matters
>>> df = df.sort_values(by=['time'])
>>> df
cat time in out bal
0 a 1 NaN 10.0 NaN
2 b 1 NaN NaN NaN
1 a 2 10.0 NaN NaN
3 b 2 NaN 20.0 NaN
4 a 4 NaN 10.0 50.0
5 a 5 20.0 5.0 NaN
6 a 6 11.0 NaN NaN
7 a 7 9.0 30.0 NaN
8 a 8 10.0 NaN NaN
9 b 9 NaN NaN NaN
>>>
>>>
>>> # Calculate the balance as if 'bal' (the override) wasn't there (cumsum(in - out))
>>> df['inout'] = df['in'].fillna(0) - df['out'].fillna(0)
>>> df['cumsum'] = df[['cat', 'inout']].groupby(['cat']).cumsum()
>>> df
cat time in out bal inout cumsum
0 a 1 NaN 10.0 NaN -10.0 -10.0
2 b 1 NaN NaN NaN 0.0 0.0
1 a 2 10.0 NaN NaN 10.0 0.0
3 b 2 NaN 20.0 NaN -20.0 -20.0
4 a 4 NaN 10.0 50.0 -10.0 -10.0 <-- we want to override this with the value from 'bal' (50) and continue the calculation
5 a 5 20.0 5.0 NaN 15.0 5.0
6 a 6 11.0 NaN NaN 11.0 16.0
7 a 7 9.0 30.0 NaN -21.0 -5.0
8 a 8 10.0 NaN NaN 10.0 5.0
9 b 9 NaN NaN NaN 0.0 -20.0
>>>
>>> # Find the positions where a balance would override the calculated balance
>>> df['correction'] = -df.loc[pd.notnull(df['bal']), 'cumsum']
>>> df
cat time in out bal inout cumsum correction
0 a 1 NaN 10.0 NaN -10.0 -10.0 NaN
2 b 1 NaN NaN NaN 0.0 0.0 NaN
1 a 2 10.0 NaN NaN 10.0 0.0 NaN
3 b 2 NaN 20.0 NaN -20.0 -20.0 NaN
4 a 4 NaN 10.0 50.0 -10.0 -10.0 10.0
5 a 5 20.0 5.0 NaN 15.0 5.0 NaN
6 a 6 11.0 NaN NaN 11.0 16.0 NaN
7 a 7 9.0 30.0 NaN -21.0 -5.0 NaN
8 a 8 10.0 NaN NaN 10.0 5.0 NaN
9 b 9 NaN NaN NaN 0.0 -20.0 NaN
>>>
>>>
>>> # Calculate with the corrected balance
>>> df['inout2'] = df['in'].fillna(0) - df['out'].fillna(0) + df['bal'].fillna(0) + df['correction'].fillna(0)
>>> df['cumsum2'] = df[['cat', 'inout2']].groupby(['cat']).cumsum()
>>> df
cat time in out bal inout cumsum correction inout2 cumsum2
0 a 1 NaN 10.0 NaN -10.0 -10.0 NaN -10.0 -10.0
2 b 1 NaN NaN NaN 0.0 0.0 NaN 0.0 0.0
1 a 2 10.0 NaN NaN 10.0 0.0 NaN 10.0 0.0
3 b 2 NaN 20.0 NaN -20.0 -20.0 NaN -20.0 -20.0
4 a 4 NaN 10.0 50.0 -10.0 -10.0 10.0 50.0 50.0 (override from 'bal')
5 a 5 20.0 5.0 NaN 15.0 5.0 NaN 15.0 65.0 <--- 50 (override) +15 (in-out)
6 a 6 11.0 NaN NaN 11.0 16.0 NaN 11.0 76.0
7 a 7 9.0 30.0 NaN -21.0 -5.0 NaN -21.0 55.0
8 a 8 10.0 NaN NaN 10.0 5.0 NaN 10.0 65.0
9 b 9 NaN NaN NaN 0.0 -20.0 NaN 0.0 -20.0
>>>
>>>
>>> df[df['cat'] == 'a']
cat time in out bal inout cumsum correction inout2 cumsum2
0 a 1 NaN 10.0 NaN -10.0 -10.0 NaN -10.0 -10.0
1 a 2 10.0 NaN NaN 10.0 0.0 NaN 10.0 0.0
4 a 4 NaN 10.0 50.0 -10.0 -10.0 10.0 50.0 50.0
5 a 5 20.0 5.0 NaN 15.0 5.0 NaN 15.0 65.0
6 a 6 11.0 NaN NaN 11.0 16.0 NaN 11.0 76.0
7 a 7 9.0 30.0 NaN -21.0 -5.0 NaN -21.0 55.0
8 a 8 10.0 NaN NaN 10.0 5.0 NaN 10.0 65.0
That looks good. At index 4, the simple balance calculation is overridden (was -10, now is 50 as expected) and subsequent period inout flows are added as expected.
However, when I introduce another override the above algorithm breaks.
df = pd.DataFrame({
'cat': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b'],
'time': [1, 2, 1, 2, 4, 5, 6, 7, 8, 9],
'in': [None, 10, None, None, None, 20, 11, 9, 10, None],
'out': [10, None, None, 20, 10, 5, None, 30, None, None],
'bal': [None, None, None, None, 50, None, None, 30, None, None]
# ^
})
... same pipeline as before
>>> df
cat time in out bal inout cumsum correction inout2 cumsum2
0 a 1 NaN 10.0 NaN -10.0 -10.0 NaN -10.0 -10.0
2 b 1 NaN NaN NaN 0.0 0.0 NaN 0.0 0.0
1 a 2 10.0 NaN NaN 10.0 0.0 NaN 10.0 0.0
3 b 2 NaN 20.0 NaN -20.0 -20.0 NaN -20.0 -20.0
4 a 4 NaN 10.0 50.0 -10.0 -10.0 10.0 50.0 50.0 # still ok
5 a 5 20.0 5.0 NaN 15.0 5.0 NaN 15.0 65.0
6 a 6 11.0 NaN NaN 11.0 16.0 NaN 11.0 76.0
7 a 7 9.0 30.0 30.0 -21.0 -5.0 5.0 14.0 90.0 # expect 30
8 a 8 10.0 NaN NaN 10.0 5.0 NaN 10.0 100.0 # expect 30 + 10 = 40
9 b 9 NaN NaN NaN 0.0 -20.0 NaN 0.0 -20.0
I'd like to modify the algorithm to keep with the simplicity of using cumsum
(functional), but can't work out how to proceed. It's almost like I need a conditional cumsum which replaces the intermediate values when there's a condition being met (in this case, a value in bal
). However, I'd much rather calculate yet another correcting column (or fix the existing one) and add it (but I hit a wall as I probably looked at it for too long). Any help is greatly appreciated.
Code
cond = df['bal'].notna()
df['cumsum2'] = (
df['in'].fillna(0).sub(df['out'].fillna(0)).mask(cond, df['bal'])
.groupby([df['cat'], cond.groupby(df['cat']).cumsum()]).cumsum()
)
df
(your first example)
cat time in out bal cumsum2
0 a 1 NaN 10.0 NaN -10.0
1 a 2 10.0 NaN NaN 0.0
2 b 1 NaN NaN NaN 0.0
3 b 2 NaN 20.0 NaN -20.0
4 a 4 NaN 10.0 50.0 50.0
5 a 5 20.0 5.0 NaN 65.0
6 a 6 11.0 NaN NaN 76.0
7 a 7 9.0 30.0 NaN 55.0
8 a 8 10.0 NaN NaN 65.0
9 b 9 NaN NaN NaN -20.0
df
(your second example)
cat time in out bal cumsum2
0 a 1 NaN 10.0 NaN -10.0
1 a 2 10.0 NaN NaN 0.0
2 b 1 NaN NaN NaN 0.0
3 b 2 NaN 20.0 NaN -20.0
4 a 4 NaN 10.0 50.0 50.0
5 a 5 20.0 5.0 NaN 65.0
6 a 6 11.0 NaN NaN 76.0
7 a 7 9.0 30.0 30.0 30.0
8 a 8 10.0 NaN NaN 40.0
9 b 9 NaN NaN NaN -20.0