pythonpandasdataframegroup-byitertools-groupby

Groupby using a sequence and computing sum


I've a dataframe like this,

  Name COST      Timestamp
0   c   8   2023-09-20 15:14:46
1   a   8   2023-09-20 15:14:48
2   c   9   2023-09-20 15:14:55
3   b   10  2023-09-20 15:15:00
4   c   4   2023-09-20 15:15:02
5   a   9   2023-09-20 15:15:04
6   b   3   2023-09-20 15:15:12
7   a   3   2023-09-20 15:15:17
8   c   6   2023-09-20 15:15:20
9   c   6   2023-09-20 15:15:29

What I want is to create a new dataframe. It will look for the sequence of a,b,c (order doesn't matter here) and sum_cost is sum of cost of a,b,c and it's time_stamp will be the last time_stamp of a,b,c whichever you get. Incase of multiple entries of a same name say like b,c,c,b,b,a then take the last c then then the last b and a.

An detailed example:

output should be from the given dataframe will be constructed like this, for first one it will take 'a' of index 1, 'c' of index 2 (as this came after 'c' of index 0) and 'b' of index 3, the sum_cost will be 27 for this group and timestamp will be 2023-09-20 00:14:26 which is of 'b' of index 3 as it came last for this group. And next group it will take 'c' of index 4, 'a' of index 5 and 'b' of index 6 and sum_cost will be 16 and timestamp will be 2023-09-20 00:06:51 which is of 'b' of index 6 as it came last of this sequence.

The output will look like this,

 sum_cost   Timestamp
0   27  2023-09-20 15:15:00
1   16  2023-09-20 15:15:12

Please help me with this. Thank you!


Solution

  • You need to build a custom grouper, for that a loop is required, here using a custom function:

    def group_consecutive(s, target):
        out = []
        i = 0
    
        g = 1
        while i < len(s)-len(target)+1:
            if target == set(s.iloc[i:i+len(target)]):
                out.extend([g]*len(target))
                g += 1
                i += len(target)
            else:
                out.append(0)
                i += 1
        out.extend([0]*(len(target)-1))
    
        return pd.Series(out, index=s.index)
    
    group = group_consecutive(df['Name'], {'a', 'b', 'c'})
    
    out = df[group>0].groupby(group).agg({'COST': 'sum', 'Timestamp': 'last'})
    

    Output:

       COST            Timestamp
    1    27  2023-09-20 15:15:00
    2    16  2023-09-20 15:15:12
    

    Intermediate with group:

      Name  COST            Timestamp  group
    0    c     8  2023-09-20 15:14:46      0
    1    a     8  2023-09-20 15:14:48      1
    2    c     9  2023-09-20 15:14:55      1
    3    b    10  2023-09-20 15:15:00      1
    4    c     4  2023-09-20 15:15:02      2
    5    a     9  2023-09-20 15:15:04      2
    6    b     3  2023-09-20 15:15:12      2
    7    a     3  2023-09-20 15:15:17      0
    8    c     6  2023-09-20 15:15:20      0
    9    c     6  2023-09-20 15:15:29      0