python pandas dataframe duplicates cumsum

How do I perform pandas cumsum while skipping rows that are duplicated in another field?


I am trying to use the pandas cumsum() method, but in a way that accounts for duplicated values in the id column: for each id, only its most recent value should count toward the cumulative sum, and all earlier values for that id should be ignored. Example code is below (I can't share the real code, which is for work).

import numpy as np
import pandas as pd

# Plural names avoid shadowing the built-in id().
ids = ['a','b','c','a','b','e','f','a','b','k']
values = [12,14,3,13,16,7,4,6,10,18]

df = pd.DataFrame({'id': ids, 'value': values})
df["cumsum_of_value"] = df['value'].cumsum()
df["desired_output"] = [
    12,26,29,30,32,39,43,36,30,48
]
df["comments"] = [""]*len(df)
df.loc[df.index==0, "comments"]="standard cumsum"
df.loc[df.index==1, "comments"]="standard cumsum"
df.loc[df.index==2, "comments"]="standard cumsum"
df.loc[df.index==3, "comments"]="cumsum of rows 1-3, ignore row 0"
df.loc[df.index==4, "comments"]="cumsum of rows 2-4, ignore rows 0, 1"
df.loc[df.index==5, "comments"]="cumsum of rows 2-5, ignore rows 0, 1"
df.loc[df.index==6, "comments"]="cumsum of rows 2-6, ignore rows 0, 1"
df.loc[df.index==7, "comments"]="cumsum of rows 2,4-7, ignore rows 0, 1, 3"
df.loc[df.index==8, "comments"]="cumsum of rows 2,5-8, ignore rows 0, 1, 3, 4"
df.loc[df.index==9, "comments"]="cumsum of rows 2,5-9, ignore rows 0, 1, 3, 4"
print(df)

In this example, there are six (6) unique values in the id column (a, b, c, e, f, k), so the cumsum should never sum more than six (6) records on any row.
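For reference, the rule can be pinned down with a slow, loop-based sketch. The helper name `brute_force_latest_sum` is hypothetical and not part of the original code; it is only meant to define the expected numbers, not to serve as the solution.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"],
    "value": [12, 14, 3, 13, 16, 7, 4, 6, 10, 18],
})

def brute_force_latest_sum(frame):
    """For each row i, sum the most recent value of every id seen in rows 0..i."""
    totals = []
    for i in range(len(frame)):
        # Keep only the latest row per id among the rows seen so far.
        seen = frame.iloc[: i + 1].drop_duplicates("id", keep="last")
        totals.append(int(seen["value"].sum()))
    return totals

reference = brute_force_latest_sum(df)
print(reference)  # [12, 26, 29, 30, 32, 39, 43, 36, 30, 48]
```

This matches the desired_output column, but it re-scans the frame on every row, which is exactly the kind of iteration to avoid on real data.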

Is this possible using combinations of functions such as cumsum(), groupby(), duplicated(), drop_duplicates(), and avoiding the use of an iterative loop?

I've tried the following:

df["duped"] = np.where(df["id"].duplicated(keep='last'),0,1)
df["value_duped"] = df["duped"] * df["value"]
df["desired_output_attempt"] = df["cumsum_of_value"] - df["value_duped"]

But it doesn't come close to the correct answer. I can't think of how to get something like this to result in the desired output without iterating.
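One likely reason the attempt falls short, offered as a tentative reading rather than a full diagnosis: duplicated(keep='last') is evaluated once over the whole column, so each row's flag is fixed and depends on later rows, whereas the desired output at row i depends only on rows 0 through i. A minimal sketch (the names `ids` and `superseded` are illustrative):

```python
import pandas as pd

ids = pd.Series(["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"])

# duplicated(keep="last") flags every occurrence of an id except its final one,
# judged against the entire column at once.
superseded = ids.duplicated(keep="last")
print(superseded.tolist())
# [True, True, False, True, True, False, False, False, False, False]
```

Row 3 (a = 13) is flagged because of row 7, yet the desired output for rows 3 to 6 still has to count it, so a single static mask cannot express the rule.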


Solution

  • Try:

    df["out"] = (
        df.groupby("id")["value"].transform("diff").fillna(df["value"]).cumsum().astype(int)
    )
    
    print(df)
    

    Prints:

      id  value  cumsum_of_value  desired_output  out
    0  a     12               12              12   12
    1  b     14               26              26   26
    2  c      3               29              29   29
    3  a     13               42              30   30
    4  b     16               58              32   32
    5  e      7               65              39   39
    6  f      4               69              43   43
    7  a      6               75              36   36
    8  b     10               85              30   30
    9  k     18              103              48   48
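
To see why this works, the one-liner can be unpacked into steps. The sketch below is an equivalent formulation rather than the code above: it swaps diff for an explicit per-group shift, and the names `previous`, `delta`, and `out_stepwise` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"],
    "value": [12, 14, 3, 13, 16, 7, 4, 6, 10, 18],
})

# Previous value recorded for the same id (NaN on an id's first appearance).
previous = df.groupby("id")["value"].shift()

# Net change each row contributes to the running total: the full value the
# first time an id appears, otherwise the new value minus the one it replaces.
delta = df["value"] - previous.fillna(0)

# A plain cumulative sum of those net changes gives the desired column.
df["out_stepwise"] = delta.cumsum().astype(int)
print(df)
```

Each repeated id adds its new value and cancels its old one in a single step, so the running total always reflects only the latest value per id, with no explicit loop.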