For each row, I need to get a) the number of, and b) the mean score of, all rows with the same member_ID dated within the 28 days prior to that row's date.
df:
        member_ID     score      date    past28
0              93  0.341937  20090515  20090417
1             496  0.075530  20090515  20090417
2             742  0.072468  20090515  20090417
3             668       NaN  20090515  20090417
4             351  0.581490  20090515  20090417
...           ...       ...       ...       ...
799995        792       NaN  20230225  20230128
799996        180  0.251769  20230225  20230128
799997        413  0.195291  20230225  20230128
799998        786  0.389303  20230225  20230128
799999        235  0.545911  20230225  20230128
I want rows with NaN score values to go through the main loop (i.e. their qualifying rows are still computed), but I exclude them from qualifying_rows when calculating the valid score_counts and score_averages.
score_counts = pd.Series([0] * len(df))
score_averages = pd.Series([None] * len(df))
for id, id_rows in df.groupby('member_ID'):
    for date, date_rows in id_rows.groupby('date'):
        # There can be multiple member_ID rows on the same date, so I use the first
        # index of date_rows to ensure I exclude the current date from past_member_rows
        past_member_rows = id_rows.loc[:date_rows.index[0] - 1]
        # For any given row, its past28 value is its date value minus 28 days
        qualifying_rows = past_member_rows.loc[
            (past_member_rows['date'] >= date_rows['past28'].iloc[0])
            & (~past_member_rows['score'].isnull())
        ]
        score_counts[date_rows.index] = len(qualifying_rows)
        score_averages[date_rows.index] = qualifying_rows['score'].mean()
With this many .locs it obviously takes a very long time, and I'd love to hear recommendations for how to re-think this. I usually rely on vectorization, but with this relatively complex row-by-row lookup I haven't worked out how to do so here.
You can in fact avoid the id_date column. Here is a version of your code that is slightly shorter, because you can group by two columns and use merge to combine the results at the end. You may also want to compare the speed of map against merge (see the sketch after the code below).
As far as I can tell, you unfortunately need the second dataframe because of the duplicate dates. If a way to handle the duplicate dates is found I will update, but for now my understanding is that to use the time-aware rolling you have to use a date as the index, while also needing unique indices. So, conundrum!
Note that the choice of closed='left' for rolling (to match your loop, which excludes the current date) means the last data point is excluded from the sum and count (and therefore the mean). You can include it with a different value for the closed parameter.
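To make the effect of closed concrete, here is a tiny, purely hypothetical series (not your data) rolled over a 28-day window:

import pandas as pd

# toy data, only to illustrate the closed parameter
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-20']))

# closed='left' uses the window [t-28D, t), excluding the current point
print(s.rolling('28D', closed='left').sum())   # NaN, 1.0, 3.0

# closed='both' uses [t-28D, t], including the current point
print(s.rolling('28D', closed='both').sum())   # 1.0, 3.0, 6.0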
# use datetime type; your sample shows dates stored as YYYYMMDD, so parse them explicitly
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
# create columns for each member ID & date with a sum and count for that combo
df['sum_date'] = df.groupby(['member_ID', 'date'])['score'].transform('sum')
df['ct_date'] = df.groupby(['member_ID', 'date'])['score'].transform('count')
# sort by date for rolling function
df.sort_values(by=['date', 'member_ID'], inplace=True)
# copy data, drop duplicates, reset index with date
df2 = df.copy().drop_duplicates(['member_ID', 'date'])
df2.set_index('date', inplace=True)
# perform rolling calculations for group
grouped = df2.groupby('member_ID')['sum_date']
rolling_sum = (grouped.rolling('28D', closed='left').sum()
               .reset_index().rename(columns={'sum_date': 'rolling_sum'}))
grouped = df2.groupby('member_ID')['ct_date']
rolling_count = (grouped.rolling('28D', closed='left').sum()
                 .reset_index().rename(columns={'ct_date': 'rolling_ct'}))
# merge data, calculate mean, drop the helper columns
df = (df.merge(rolling_sum, how='left', on=['member_ID', 'date'])
        .merge(rolling_count, how='left', on=['member_ID', 'date']))
df['rolling_mean'] = df['rolling_sum'] / df['rolling_ct']
# keep rolling_ct, since it is the 28-day score count you also need
df = df.drop(columns=['sum_date', 'ct_date', 'rolling_sum'])
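If you do want to test map against merge, here is a rough sketch of one way to do it (untested for speed); it reuses the rolling_sum and rolling_count frames built above and replaces the two merge calls:

# build plain dicts keyed by (member_ID, date) tuples
sum_lookup = rolling_sum.set_index(['member_ID', 'date'])['rolling_sum'].to_dict()
ct_lookup = rolling_count.set_index(['member_ID', 'date'])['rolling_ct'].to_dict()
# map each row's (member_ID, date) pair onto the rolling results
keys = pd.Series(list(zip(df['member_ID'], df['date'])), index=df.index)
df['rolling_sum'] = keys.map(sum_lookup)
df['rolling_ct'] = keys.map(ct_lookup)
df['rolling_mean'] = df['rolling_sum'] / df['rolling_ct']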
Edit to add: In my test I did not include data with NA, so you'd want to double check that. If you need to drop them, groupby has options for that, or you could move the creation of df2 to before sum_date and ct_date and drop the rows with NA values there.
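A minimal sketch of that second suggestion, assuming df['date'] is already datetime (note that sum and count skip NaN anyway, so this is mainly a safeguard, and any member_ID/date combination whose scores are all NaN would simply come back with NaN rolling values after the merge):

# build df2 only from rows with a valid score, then aggregate per (member_ID, date)
df2 = df.dropna(subset=['score']).sort_values(['date', 'member_ID'])
df2['sum_date'] = df2.groupby(['member_ID', 'date'])['score'].transform('sum')
df2['ct_date'] = df2.groupby(['member_ID', 'date'])['score'].transform('count')
df2 = df2.drop_duplicates(['member_ID', 'date']).set_index('date')
# the rolling and merge steps above then run unchanged; df itself no longer carries
# sum_date/ct_date, so the final drop only needs the rolling helper column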