pythonpandasdataframecumulative-sum

Cumulative sum with overrides on condition


I'm trying to calculate the balance (the level) of an inventory over time and have incoming and outgoing quantities as input (and a category for each type of inventory). Usually I would calculate incoming - outgoing and carry over to the next period (cumulative sum), but in this case an added difficulty is that the balance can be overridden at various points in time which "resets" the balance to these values (and incoming/outgoings need to be added to these overrides from this point in time forward).

I came up with a way to calculate this by offsetting the calculated balance (=cumsum(incoming-outgoing)) when there's an override balance (by the negative calculated cumsum; i.e. setting the inventory to 0 when there's an override balance), but that doesn't work when there's multiple overrides at different times.

This is my current approach which works fine for the given dataframe (=only one override (bal) per category (cat)).

>>> df = pd.DataFrame({
...   'cat': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b'],
...   'time': [1, 2, 1, 2, 4, 5, 6, 7, 8, 9],
...   'in': [None, 10, None, None, None, 20, 11, 9, 10, None],
...   'out': [10, None, None, 20, 10, 5, None, 30, None, None],
...   'bal': [None, None, None, None, 50, None, None, None, None, None]
                                       ^ at this time, the balance should be set to 50, irrespective of prior `in` and `out`.
... })
>>> 
>>> # cumsum goes by row, so order matters
>>> df = df.sort_values(by=['time'])
>>> df
  cat  time    in   out   bal
0   a     1   NaN  10.0   NaN
2   b     1   NaN   NaN   NaN
1   a     2  10.0   NaN   NaN
3   b     2   NaN  20.0   NaN
4   a     4   NaN  10.0  50.0
5   a     5  20.0   5.0   NaN
6   a     6  11.0   NaN   NaN
7   a     7   9.0  30.0   NaN
8   a     8  10.0   NaN   NaN
9   b     9   NaN   NaN   NaN
>>> 
>>> 
>>> # Calculate the balance as if 'bal' (the override) wasn't there (cumsum(in - out))
>>> df['inout'] = df['in'].fillna(0) - df['out'].fillna(0)
>>> df['cumsum'] = df[['cat', 'inout']].groupby(['cat']).cumsum()
>>> df
  cat  time    in   out   bal  inout  cumsum
0   a     1   NaN  10.0   NaN  -10.0   -10.0
2   b     1   NaN   NaN   NaN    0.0     0.0
1   a     2  10.0   NaN   NaN   10.0     0.0
3   b     2   NaN  20.0   NaN  -20.0   -20.0
4   a     4   NaN  10.0  50.0  -10.0   -10.0   <-- we want to override this with the value from 'bal' (50) and continue the calculation
5   a     5  20.0   5.0   NaN   15.0     5.0
6   a     6  11.0   NaN   NaN   11.0    16.0
7   a     7   9.0  30.0   NaN  -21.0    -5.0
8   a     8  10.0   NaN   NaN   10.0     5.0
9   b     9   NaN   NaN   NaN    0.0   -20.0
>>> 
>>> # Find the positions where a balance would override the calculated balance
>>> df['correction'] = -df.loc[pd.notnull(df['bal']), 'cumsum']
>>> df
  cat  time    in   out   bal  inout  cumsum  correction
0   a     1   NaN  10.0   NaN  -10.0   -10.0         NaN
2   b     1   NaN   NaN   NaN    0.0     0.0         NaN
1   a     2  10.0   NaN   NaN   10.0     0.0         NaN
3   b     2   NaN  20.0   NaN  -20.0   -20.0         NaN
4   a     4   NaN  10.0  50.0  -10.0   -10.0        10.0
5   a     5  20.0   5.0   NaN   15.0     5.0         NaN
6   a     6  11.0   NaN   NaN   11.0    16.0         NaN
7   a     7   9.0  30.0   NaN  -21.0    -5.0         NaN
8   a     8  10.0   NaN   NaN   10.0     5.0         NaN
9   b     9   NaN   NaN   NaN    0.0   -20.0         NaN
>>> 
>>> 
>>> # Calculate with the corrected balance
>>> df['inout2'] = df['in'].fillna(0) - df['out'].fillna(0) + df['bal'].fillna(0) + df['correction'].fillna(0)
>>> df['cumsum2'] = df[['cat', 'inout2']].groupby(['cat']).cumsum()
>>> df
  cat  time    in   out   bal  inout  cumsum  correction  inout2  cumsum2
0   a     1   NaN  10.0   NaN  -10.0   -10.0         NaN   -10.0    -10.0
2   b     1   NaN   NaN   NaN    0.0     0.0         NaN     0.0      0.0
1   a     2  10.0   NaN   NaN   10.0     0.0         NaN    10.0      0.0
3   b     2   NaN  20.0   NaN  -20.0   -20.0         NaN   -20.0    -20.0
4   a     4   NaN  10.0  50.0  -10.0   -10.0        10.0    50.0     50.0 (override from 'bal')
5   a     5  20.0   5.0   NaN   15.0     5.0         NaN    15.0     65.0 <--- 50 (override) +15 (in-out)
6   a     6  11.0   NaN   NaN   11.0    16.0         NaN    11.0     76.0
7   a     7   9.0  30.0   NaN  -21.0    -5.0         NaN   -21.0     55.0
8   a     8  10.0   NaN   NaN   10.0     5.0         NaN    10.0     65.0
9   b     9   NaN   NaN   NaN    0.0   -20.0         NaN     0.0    -20.0
>>> 
>>> 
>>> df[df['cat'] == 'a']
  cat  time    in   out   bal  inout  cumsum  correction  inout2  cumsum2
0   a     1   NaN  10.0   NaN  -10.0   -10.0         NaN   -10.0    -10.0
1   a     2  10.0   NaN   NaN   10.0     0.0         NaN    10.0      0.0
4   a     4   NaN  10.0  50.0  -10.0   -10.0        10.0    50.0     50.0
5   a     5  20.0   5.0   NaN   15.0     5.0         NaN    15.0     65.0
6   a     6  11.0   NaN   NaN   11.0    16.0         NaN    11.0     76.0
7   a     7   9.0  30.0   NaN  -21.0    -5.0         NaN   -21.0     55.0
8   a     8  10.0   NaN   NaN   10.0     5.0         NaN    10.0     65.0
 

That looks good. At index 4, the simple balance calculation is overridden (was -10, now is 50 as expected) and subsequent period inout flows are added as expected.

However, when I introduce another override the above algorithm breaks.

df = pd.DataFrame({
  'cat': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'b'],
  'time': [1, 2, 1, 2, 4, 5, 6, 7, 8, 9],
  'in': [None, 10, None, None, None, 20, 11, 9, 10, None],
  'out': [10, None, None, 20, 10, 5, None, 30, None, None],
  'bal': [None, None, None, None, 50, None, None, 30, None, None]
  #                                                ^
})

... same pipeline as before

>>> df
  cat  time    in   out   bal  inout  cumsum  correction  inout2  cumsum2
0   a     1   NaN  10.0   NaN  -10.0   -10.0         NaN   -10.0    -10.0
2   b     1   NaN   NaN   NaN    0.0     0.0         NaN     0.0      0.0
1   a     2  10.0   NaN   NaN   10.0     0.0         NaN    10.0      0.0
3   b     2   NaN  20.0   NaN  -20.0   -20.0         NaN   -20.0    -20.0
4   a     4   NaN  10.0  50.0  -10.0   -10.0        10.0    50.0     50.0  # still ok
5   a     5  20.0   5.0   NaN   15.0     5.0         NaN    15.0     65.0
6   a     6  11.0   NaN   NaN   11.0    16.0         NaN    11.0     76.0
7   a     7   9.0  30.0  30.0  -21.0    -5.0         5.0    14.0     90.0  # expect 30
8   a     8  10.0   NaN   NaN   10.0     5.0         NaN    10.0    100.0  # expect 30 + 10 = 40
9   b     9   NaN   NaN   NaN    0.0   -20.0         NaN     0.0    -20.0

I'd like to modify the algorithm to keep with the simplicity of using cumsum (functional), but can't work out how to proceed. It's almost like I need a conditional cumsum which replaces the intermediate values when there's a condition being met (in this case, a value in bal). However, I'd much rather calculate yet another correcting column (or fix the existing one) and add it (but I hit a wall as I probably looked at it for too long). Any help is greatly appreciated.


Solution

  • Code

    cond = df['bal'].notna()
    df['cumsum2'] = (
        df['in'].fillna(0).sub(df['out'].fillna(0)).mask(cond, df['bal'])
        .groupby([df['cat'], cond.groupby(df['cat']).cumsum()]).cumsum()
    )
    

    df(your first example)

      cat  time    in   out   bal  cumsum2
    0   a     1   NaN  10.0   NaN    -10.0
    1   a     2  10.0   NaN   NaN      0.0
    2   b     1   NaN   NaN   NaN      0.0
    3   b     2   NaN  20.0   NaN    -20.0
    4   a     4   NaN  10.0  50.0     50.0
    5   a     5  20.0   5.0   NaN     65.0
    6   a     6  11.0   NaN   NaN     76.0
    7   a     7   9.0  30.0   NaN     55.0
    8   a     8  10.0   NaN   NaN     65.0
    9   b     9   NaN   NaN   NaN    -20.0
    

    df(your second example)

      cat  time    in   out   bal  cumsum2
    0   a     1   NaN  10.0   NaN    -10.0
    1   a     2  10.0   NaN   NaN      0.0
    2   b     1   NaN   NaN   NaN      0.0
    3   b     2   NaN  20.0   NaN    -20.0
    4   a     4   NaN  10.0  50.0     50.0
    5   a     5  20.0   5.0   NaN     65.0
    6   a     6  11.0   NaN   NaN     76.0
    7   a     7   9.0  30.0  30.0     30.0
    8   a     8  10.0   NaN   NaN     40.0
    9   b     9   NaN   NaN   NaN    -20.0