I have a Pandas DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame([['John', '1/1/2017', 10],
                   ['John', '2/2/2017', 15],
                   ['John', '2/2/2017', 20],
                   ['John', '3/3/2017', 30],
                   ['Sue', '1/1/2017', 10],
                   ['Sue', '2/2/2017', 15],
                   ['Sue', '3/2/2017', 20],
                   ['Sue', '3/3/2017', 7],
                   ['Sue', '4/4/2017', 20]],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
I want to create a new column called PreviousMean. This column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to but not including rows that match the current deposit date. If no previous records exist, it is null (or 0).
So the desired outcome looks like
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
After some research on this site and the internet, here is one solution:
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
It works fine. However, my actual dataframe is much larger (~1 million rows) and the above code is very slow. Is there a better way to do it? Thanks
You could use a custom groupby.apply with expanding.mean, and a mask on the duplicated dates to ffill the output:
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])

df['PreviousMean'] = (df.groupby('Customer')
                        .apply(lambda s: s['DPD'].expanding().mean().shift()      # mean of all prior rows
                                          .mask(s['Deposit_Date'].duplicated())   # blank out repeated dates
                                          .ffill(),                               # reuse the previous value
                               include_groups=False)  # don't pass the grouping column to the lambda (pandas >= 2.2)
                        .droplevel(0)                 # drop the Customer level so the index aligns with df
                      )
NB. this assumes the dates are already sorted within each customer (see the sketch after the output below if they are not).
Output:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
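If the dates are not guaranteed to be sorted, a minimal sketch of one way to handle it, assuming it is acceptable to temporarily reorder the rows (the frame keeps its original index, which is used to restore the order afterwards):

# sort chronologically within each customer, compute as above, then restore the original order
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df = df.sort_values(['Customer', 'Deposit_Date'])

df['PreviousMean'] = (df.groupby('Customer')
                        .apply(lambda s: s['DPD'].expanding().mean().shift()
                                          .mask(s['Deposit_Date'].duplicated())
                                          .ffill(),
                               include_groups=False)
                        .droplevel(0)
                      )

df = df.sort_index()  # back to the original row order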
Intermediates:
Customer Deposit_Date DPD expanding.mean shift duplicated mask PreviousMean
0 John 2017-01-01 10 10.00 NaN False NaN NaN
1 John 2017-02-02 15 12.50 10.0 False 10.0 10.0
2 John 2017-02-02 20 15.00 12.5 True NaN 10.0
3 John 2017-03-03 30 18.75 15.0 False 15.0 15.0
4 Sue 2017-01-01 10 10.00 NaN False NaN NaN
5 Sue 2017-02-02 15 12.50 10.0 False 10.0 10.0
6 Sue 2017-03-02 20 15.00 12.5 False 12.5 12.5
7 Sue 2017-03-03 7 13.00 15.0 False 15.0 15.0
8 Sue 2017-04-04 20 14.40 13.0 False 13.0 13.0
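For reference, a short sketch of how these intermediate columns can be reproduced for inspection (tmp and g are just throwaway names used here, not part of the solution):

g = df.groupby('Customer')

tmp = df.copy()
tmp['expanding.mean'] = g['DPD'].transform(lambda s: s.expanding().mean())   # running mean per customer
tmp['shift'] = g['DPD'].transform(lambda s: s.expanding().mean().shift())    # exclude the current row
tmp['duplicated'] = g['Deposit_Date'].transform(lambda s: s.duplicated())    # repeated date within customer
tmp['mask'] = tmp['shift'].mask(tmp['duplicated'])                           # drop values on repeated dates
tmp['PreviousMean'] = tmp.groupby('Customer')['mask'].ffill()                # carry the previous value forward

print(tmp)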
Note: the generalization of this answer to multiple groups is discussed here.