
Pandas groupby transform mean with date before current row for huge dataframe


I have a Pandas dataframe that looks like

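import pandas as pd

# DPD is stored as integers (not strings) so that a mean can be computed
df = pd.DataFrame([['John', 'A', '1/1/2017', 10],
                   ['John', 'A', '2/2/2017', 15],
                   ['John', 'A', '2/2/2017', 20],
                   ['John', 'A', '3/3/2017', 30],
                   ['Sue', 'B', '1/1/2017', 10],
                   ['Sue', 'B', '2/2/2017', 15],
                   ['Sue', 'B', '3/2/2017', 20],
                   ['Sue', 'B', '3/3/2017', 7],
                   ['Sue', 'B', '4/4/2017', 20]],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])

# Parse the month/day/year strings so dates compare chronologically
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'], format='%m/%d/%Y')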

I want to create a new column called PreviousMean. This column is the year-to-date average of DPD for that customer: it includes all DPDs up to, but not including, the rows that match the current deposit date. If there are no previous records, the value should be null or 0.

So the desired outcome looks like

  Customer  Group  Deposit_Date  DPD  PreviousMean
0     John      A    2017-01-01   10           NaN
1     John      A    2017-02-02   15          10.0
2     John      A    2017-02-02   20          10.0
3     John      A    2017-03-03   30          15.0
4      Sue      B    2017-01-01   10           NaN
5      Sue      B    2017-02-02   15          10.0
6      Sue      B    2017-03-02   20          12.5
7      Sue      B    2017-03-03    7          15.0
8      Sue      B    2017-04-04   20          13.0

After some research on this site and elsewhere, I found one solution:

# For every row, rescan the whole frame for earlier rows of the same customer and group
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) &
                 (df.Group == x.Group) &
                 (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)

And it works fine. However, my actual dataframe is much larger (~1 million rows) and the above code is very slow.

I have asked a similar question before: Pandas groupby transform mean with date before current row for huge huge dataframe

This time, however, the groupby is done on two columns, so those solutions do not work as written and I was unable to generalize them. Is there a better way to do this? Thanks.


Solution

  • The linked solution works here too, but you have to pass every grouping column to groupby and then remove those same levels with droplevel:

    df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
    
    groups = ['Customer', 'Group']
    
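    # Assumes DPD is numeric and rows are sorted by Deposit_Date within each group.
    # The include_groups argument requires pandas >= 2.2.
    df['PreviousMean'] = (df.groupby(groups)
                            .apply(lambda s: s['DPD'].expanding().mean()  # running mean including the current row
                                                     .shift()             # shift down one row to exclude the current row
                                                     .mask(s['Deposit_Date'].duplicated())  # blank out repeated dates
                                                     .ffill(),            # reuse the value from the first row of that date
                                   include_groups=False)
                            .droplevel(groups)  # drop the group keys to recover the original index
                         )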
    

    Output:

      Customer Group Deposit_Date  DPD  PreviousMean
    0     John     A   2017-01-01   10           NaN
    1     John     A   2017-02-02   15          10.0
    2     John     A   2017-02-02   20          10.0
    3     John     A   2017-03-03   30          15.0
    4      Sue     B   2017-01-01   10           NaN
    5      Sue     B   2017-02-02   15          10.0
    6      Sue     B   2017-03-02   20          12.5
    7      Sue     B   2017-03-03    7          15.0
    8      Sue     B   2017-04-04   20          13.0
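
As a possible alternative for very large frames, here is a minimal sketch that avoids groupby.apply altogether. It is not the linked solution but a different technique: aggregate DPD per customer, group and date, take per-group cumulative sums, and subtract each date's own contribution so that only earlier dates remain. The names `daily`, `day_sum`, `day_cnt`, `prev_sum`, `prev_cnt` and `result` are made up for illustration, and the sketch assumes DPD is numeric and Deposit_Date is already datetime. Whether it is faster on your data is worth timing rather than assuming.

```python
import pandas as pd

df = pd.DataFrame([['John', 'A', '1/1/2017', 10],
                   ['John', 'A', '2/2/2017', 15],
                   ['John', 'A', '2/2/2017', 20],
                   ['John', 'A', '3/3/2017', 30],
                   ['Sue', 'B', '1/1/2017', 10],
                   ['Sue', 'B', '2/2/2017', 15],
                   ['Sue', 'B', '3/2/2017', 20],
                   ['Sue', 'B', '3/3/2017', 7],
                   ['Sue', 'B', '4/4/2017', 20]],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'], format='%m/%d/%Y')

groups = ['Customer', 'Group']
keys = groups + ['Deposit_Date']

# One row per (Customer, Group, Deposit_Date), sorted by those keys:
# the total and the number of DPD values recorded on that date.
daily = (df.groupby(keys)['DPD']
           .agg(day_sum='sum', day_cnt='count')
           .reset_index())

# Cumulative totals within each (Customer, Group), minus the current date's
# own contribution, leave the totals over strictly earlier dates.
by_group = daily.groupby(groups)
prev_sum = by_group['day_sum'].cumsum() - daily['day_sum']
prev_cnt = by_group['day_cnt'].cumsum() - daily['day_cnt']

# No earlier rows means a count of 0; turn that into NaN instead of dividing by 0.
daily['PreviousMean'] = prev_sum / prev_cnt.where(prev_cnt > 0)

# Attach the per-date value to every original row. A left merge keeps the
# row order of df but returns a fresh RangeIndex.
result = df.merge(daily[keys + ['PreviousMean']], on=keys, how='left')
print(result)
```

Because the per-date table is sorted by the groupby keys, this version does not depend on the original row order. Note that merge discards a custom index on df, so restore it afterwards if you need it.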