Tags: python, pandas, dataframe, vectorization, pandas-loc

Faster method for complex .loc in large Pandas dataframe?


For each row, I need to get (a) the count and (b) the mean score of all rows within the 28 days prior to that row's date that share the same member_ID.

df:

        member_ID     score      date    past28
0              93  0.341937  20090515  20090417
1             496  0.075530  20090515  20090417
2             742  0.072468  20090515  20090417
3             668       NaN  20090515  20090417
4             351  0.581490  20090515  20090417
...           ...       ...       ...       ...
799995        792       NaN  20230225  20230128
799996        180  0.251769  20230225  20230128
799997        413  0.195291  20230225  20230128
799998        786  0.389303  20230225  20230128
799999        235  0.545911  20230225  20230128

I want rows with NaN score values to be included in the main loop (i.e. their qualifying rows are still looked up), but I exclude them from qualifying_rows so that score_counts and score_averages are calculated from valid scores only.

score_counts = pd.Series([0] * len(df))
score_averages = pd.Series([None] * len(df))

for id, id_rows in df.groupby('member_ID'):
    for date, date_rows in id_rows.groupby('date'):
        # There can be multiple rows for a member_ID on the same date, so I use the first
        # index of date_rows to ensure I exclude the current date from past_member_rows
        past_member_rows = id_rows.loc[:date_rows.index[0] - 1]
        # For any given row, its past28 value is its date value minus 28 days
        qualifying_rows = past_member_rows.loc[
            (past_member_rows['date'] >= date_rows['past28'].iloc[0])
            & (~past_member_rows['score'].isnull())
        ]
        score_counts[date_rows.index] = len(qualifying_rows)
        score_averages[date_rows.index] = qualifying_rows['score'].mean()

With this many .loc calls it obviously takes a very long time, and I'd love to hear some recommendations for how to rethink this. I usually employ vectorization, but with this relatively complex row-by-row lookup I haven't worked out how to do so here.
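
(For anyone who wants to experiment with the code in this question and the answer below, here is a small synthetic stand-in for df; the values are made up, and the YYYYMMDD integer format for date and past28 is assumed from the printout above.)

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# 100 consecutive dates with 10 rows each, stored as YYYYMMDD integers like the real data
dates = pd.Series(pd.date_range('2009-05-15', periods=100).repeat(10))
df = pd.DataFrame({
    'member_ID': rng.integers(1, 50, n),
    'score': rng.random(n),
    'date': dates.dt.strftime('%Y%m%d').astype(int),
})
df.loc[rng.random(n) < 0.1, 'score'] = np.nan   # sprinkle in some missing scores
df['past28'] = (dates - pd.Timedelta(days=28)).dt.strftime('%Y%m%d').astype(int)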


Solution

  • You can in fact avoid the id_date column. Here is a version of your code that is just slightly shorter, because you can group by two columns and use merge to combine the results at the end. You may also want to benchmark map against merge (a sketch of the map variant follows the main code below).

    As far as I can tell, you need the second dataframe because of the duplicate dates, unfortunately. If I find a way to handle the duplicate dates I will update. For now though, my understanding is that to use the time-aware rolling you have to use a date as the index, but you also need unique indices. So, conundrum!

    Note that choosing closed='left' for the rolling window means the current data point is excluded from the sum and count (and therefore the mean), which matches your exclusion of the current date. You can include it by passing a different value for the closed parameter.
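
    For instance, on a toy series (values made up here, not your data), the two choices of closed give:

    import pandas as pd

    s = pd.Series(
        [1.0, 2.0, 4.0],
        index=pd.to_datetime(['2023-01-01', '2023-01-15', '2023-01-29']),
    )

    # default closed='right': the window is (t - 28 days, t], so the current point is included
    s.rolling('28D').sum()                   # 1.0, 3.0, 6.0

    # closed='left': the window is [t - 28 days, t), so the current point is excluded
    s.rolling('28D', closed='left').sum()    # NaN, 1.0, 3.0

    With that in mind, here is the full version: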

    # use datetime type (date looks like YYYYMMDD integers, hence the explicit format)
    df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

    # create columns holding, for each member_ID & date combo, the sum and count of scores
    df['sum_date'] = df.groupby(['member_ID', 'date'])['score'].transform('sum')
    df['ct_date'] = df.groupby(['member_ID', 'date'])['score'].transform('count')

    # sort by date for the rolling function
    df.sort_values(by=['date', 'member_ID'], inplace=True)

    # copy the data, drop duplicate member/date combos, use date as the index
    df2 = df.copy().drop_duplicates(['member_ID', 'date'])
    df2.set_index('date', inplace=True)

    # perform the rolling calculations per member
    grouped = df2.groupby('member_ID')['sum_date']
    rolling_sum = (grouped.rolling('28D', closed='left').sum()
                          .reset_index()
                          .rename(columns={'sum_date': 'rolling_sum'}))
    grouped = df2.groupby('member_ID')['ct_date']
    rolling_count = (grouped.rolling('28D', closed='left').sum()
                            .reset_index()
                            .rename(columns={'ct_date': 'rolling_ct'}))

    # merge the rolling results back, calculate the mean, drop the helper columns
    df = (df.merge(rolling_sum, how='left', on=['member_ID', 'date'])
            .merge(rolling_count, how='left', on=['member_ID', 'date']))
    df['rolling_mean'] = df['rolling_sum'] / df['rolling_ct']
    df = df.drop(columns=['sum_date', 'ct_date', 'rolling_sum', 'rolling_ct'])
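
    If you want to try map instead of merge, one possible variant (a sketch, not benchmarked; keys, sum_map and ct_map are names introduced here) replaces the merge/mean/drop step above by looking the rolling values up with (member_ID, date) tuples:

    # alternative to the two merges above: build tuple keys and map the rolling values
    keys = pd.Series(list(zip(df['member_ID'], df['date'])), index=df.index)
    sum_map = rolling_sum.set_index(['member_ID', 'date'])['rolling_sum'].to_dict()
    ct_map = rolling_count.set_index(['member_ID', 'date'])['rolling_ct'].to_dict()
    df['rolling_mean'] = keys.map(sum_map) / keys.map(ct_map)
    df = df.drop(columns=['sum_date', 'ct_date'])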
    

    Edit to add: In my test I did not include data with NA, so you'd want to double check that. If you need to drop them, groupby has options for that, or you could move the creation of df2 to before the sum_date and ct_date step and drop rows with NA values there.
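
    If it helps with that check: on a tiny made-up frame, the transform step appears to skip NaN scores already (sum ignores them and count doesn't count them), while the NaN row still carries its group's totals and so should still receive a rolling_mean through the merge:

    import numpy as np
    import pandas as pd

    toy = pd.DataFrame({
        'member_ID': [1, 1, 1],
        'score': [0.5, np.nan, 0.7],
        'date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02']),
    })

    # sum skips the NaN: 0.5, 0.5, 0.7
    toy.groupby(['member_ID', 'date'])['score'].transform('sum')
    # count ignores the NaN: 1, 1, 1
    toy.groupby(['member_ID', 'date'])['score'].transform('count')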