pythonpandasdataframegroup-by

Pandas groupby transform mean with date before current row for huge huge dataframe


I have a Pandas dataframe that looks like

df = pd.DataFrame([['John', '1/1/2017','10'],
                   ['John', '2/2/2017','15'],
                   ['John', '2/2/2017','20'],
                   ['John', '3/3/2017','30'],
                   ['Sue', '1/1/2017','10'],
                   ['Sue', '2/2/2017','15'],
                   ['Sue', '3/2/2017','20'],
                   ['Sue', '3/3/2017','7'],
                   ['Sue', '4/4/2017','20']],
                   columns=['Customer', 'Deposit_Date','DPD'])

And I want to create a new row called PreviousMean. This column is the year to date average of DPD for that customer. i.e. Includes all DPDs up to but not including rows that match the current deposit date. If no previous records existed then it's null or 0.

So the desired outcome looks like

  Customer Deposit_Date  DPD  PreviousMean
0     John   2017-01-01   10           NaN
1     John   2017-02-02   15          10.0
2     John   2017-02-02   20          10.0
3     John   2017-03-03   30          15.0
4      Sue   2017-01-01   10           NaN
5      Sue   2017-02-02   15          10.0
6      Sue   2017-03-02   20          12.5
7      Sue   2017-03-03    7          15.0
8      Sue   2017-04-04   20          13.0

And after some researching on the site and internet here is one solution:

df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(), 
axis=1)

And it works fine. However, my actual datafram is much larger (~1 million rows) and the above code is very slow. Is there any better way to do it? Thanks


Solution

  • You could use a custom groupby.apply with expanding.mean and a mask on the duplicated date to ffill the output:

    df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
    df['PreviousMean'] = (df.groupby('Customer')
                            .apply(lambda s: s['DPD'].expanding().mean().shift()
                                                     .mask(s['Deposit_Date'].duplicated())
                                                     .ffill(),
                                   include_groups=False)
                            .droplevel(0)
                         )
    

    NB. this is assuming the dates are sorted.

    Output:

      Customer Deposit_Date DPD  PreviousMean
    0     John   2017-01-01  10           NaN
    1     John   2017-02-02  15          10.0
    2     John   2017-02-02  20          10.0
    3     John   2017-03-03  30          15.0
    4      Sue   2017-01-01  10           NaN
    5      Sue   2017-02-02  15          10.0
    6      Sue   2017-03-02  20          12.5
    7      Sue   2017-03-03   7          15.0
    8      Sue   2017-04-04  20          13.0
    

    Intermediates:

      Customer Deposit_Date DPD  expanding.mean  shift  duplicated  mask  PreviousMean
    0     John   2017-01-01  10           10.00    NaN       False   NaN           NaN
    1     John   2017-02-02  15           12.50   10.0       False  10.0          10.0
    2     John   2017-02-02  20           15.00   12.5        True   NaN          10.0
    3     John   2017-03-03  30           18.75   15.0       False  15.0          15.0
    4      Sue   2017-01-01  10           10.00    NaN       False   NaN           NaN
    5      Sue   2017-02-02  15           12.50   10.0       False  10.0          10.0
    6      Sue   2017-03-02  20           15.00   12.5       False  12.5          12.5
    7      Sue   2017-03-03   7           13.00   15.0       False  15.0          15.0
    8      Sue   2017-04-04  20           14.40   13.0       False  13.0          13.0
    

    Note: the generalization of this answer to multiple groups is discussed here.