I have a Pandas dataframe that looks like
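import pandas as pd

df = pd.DataFrame([['John', 'A', '1/1/2017', '10'],
                   ['John', 'A', '2/2/2017', '15'],
                   ['John', 'A', '2/2/2017', '20'],
                   ['John', 'A', '3/3/2017', '30'],
                   ['Sue', 'B', '1/1/2017', '10'],
                   ['Sue', 'B', '2/2/2017', '15'],
                   ['Sue', 'B', '3/2/2017', '20'],
                   ['Sue', 'B', '3/3/2017', '7'],
                   ['Sue', 'B', '4/4/2017', '20']],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])
# Convert the string columns so dates compare chronologically and DPD can be averaged
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)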
And I want to create a new column called PreviousMean. This column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist, it is null or 0.
So the desired outcome looks like
  Customer Group Deposit_Date  DPD  PreviousMean
0     John     A   2017-01-01   10           NaN
1     John     A   2017-02-02   15          10.0
2     John     A   2017-02-02   20          10.0
3     John     A   2017-03-03   30          15.0
4      Sue     B   2017-01-01   10           NaN
5      Sue     B   2017-02-02   15          10.0
6      Sue     B   2017-03-02   20          12.5
7      Sue     B   2017-03-03    7          15.0
8      Sue     B   2017-04-04   20          13.0
After some searching on this site and elsewhere, here is one solution:
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) &
                 (df.Group == x.Group) &
                 (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
And it works fine. However, my actual dataframe is much larger (~1 million rows) and the above code is very slow.
I have asked a similar question before: Pandas groupby transform mean with date before current row for huge huge dataframe
This time, however, the groupby is done on two columns, so the solutions there do not work as written and I could not generalize them. Is there a better way to do it? Thanks
The linked solution works fine, but you have to carefully add all the groups in groupby and then remove the matching levels in droplevel:
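df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = pd.to_numeric(df['DPD'])  # the sample data stores DPD as strings

groups = ['Customer', 'Group']
# Assumes rows are already sorted by Deposit_Date within each group.
# include_groups=False requires pandas >= 2.2.
df['PreviousMean'] = (df.groupby(groups)
   .apply(lambda s: s['DPD'].expanding().mean().shift()            # mean of all earlier rows
                            .mask(s['Deposit_Date'].duplicated())  # blank out repeated dates
                            .ffill(),                              # reuse the first row's value for that date
          include_groups=False)
   .droplevel(groups)
)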
Output:
  Customer Group Deposit_Date  DPD  PreviousMean
0     John     A   2017-01-01   10           NaN
1     John     A   2017-02-02   15          10.0
2     John     A   2017-02-02   20          10.0
3     John     A   2017-03-03   30          15.0
4      Sue     B   2017-01-01   10           NaN
5      Sue     B   2017-02-02   15          10.0
6      Sue     B   2017-03-02   20          12.5
7      Sue     B   2017-03-03    7          15.0
8      Sue     B   2017-04-04   20          13.0
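
If the groupby/apply above is still too slow on ~1 million rows, one possible alternative (not the linked solution, just a sketch) is to avoid apply entirely: collapse the data to one row per customer, group and date, take cumulative sums and counts within each group, subtract the current day's own contribution, and merge the result back. The names daily, running and merged are illustrative, and DPD is assumed to be numeric already. Like the code above, this does not reset at year boundaries.

```python
import pandas as pd

df = pd.DataFrame([['John', 'A', '1/1/2017', 10],
                   ['John', 'A', '2/2/2017', 15],
                   ['John', 'A', '2/2/2017', 20],
                   ['John', 'A', '3/3/2017', 30],
                   ['Sue', 'B', '1/1/2017', 10],
                   ['Sue', 'B', '2/2/2017', 15],
                   ['Sue', 'B', '3/2/2017', 20],
                   ['Sue', 'B', '3/3/2017', 7],
                   ['Sue', 'B', '4/4/2017', 20]],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])

groups = ['Customer', 'Group']
keys = groups + ['Deposit_Date']

# One row per (customer, group, date): total DPD and number of deposits that day.
# groupby sorts its keys, so dates are ascending within each group.
daily = df.groupby(keys)['DPD'].agg(['sum', 'count']).reset_index()

# Running totals within each group, minus the current day's own share,
# leave the sum and count over strictly earlier dates.
running = daily.groupby(groups)[['sum', 'count']].cumsum()
prev_sum = running['sum'] - daily['sum']
prev_count = running['count'] - daily['count']

# 0 / 0 on the first date of each group evaluates to NaN in pandas.
daily['PreviousMean'] = prev_sum / prev_count

# A left merge keeps the original row order; copy the values back by position.
merged = df.merge(daily[keys + ['PreviousMean']], on=keys, how='left')
df['PreviousMean'] = merged['PreviousMean'].to_numpy()

print(df)
```

On the sample data this gives the same PreviousMean column as the output above. It does not rely on the input being pre-sorted by date, because the aggregation step sorts the keys.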