Please help to find an optimal solution for this task.
We have a pandas dataframe with two main date columns and many others (and >20mln rows).
Here is a toy example of the dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'),
                             pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'),
                                   pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})
We need to group and transform the dataframe at the client level, but with a per-row window for each client: for a given row, only that client's records satisfying date1 + 12 months <= sample_date should be counted.
The outcome would be a new column in the df dataframe with these values: [1, 2, 0, 2].
Below is my code. It works, but it is very slow. Please help me optimize it using pandas methods I'm not yet aware of!
# initializing the outcome column
df['count_products'] = np.nan
for i in range(df.shape[0]):
    df_temp = df[(df['date1'] + pd.DateOffset(months=12)) <= df['sample_date'].iloc[i]]
    df_temp = df_temp[df_temp['clients'] == df['clients'].iloc[i]]
    df.loc[i, 'count_products'] = df_temp['products'].count()
I would appreciate any help!
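One way to vectorize this (a sketch, not from the original post): shift date1 by 12 months once, then within each client group sort the shifted dates and use np.searchsorted, so each row's count is a binary search over that client's dates instead of a full-frame scan per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'),
                             pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'),
                                   pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

def counts(g):
    # sorted shifted dates for this client; searchsorted(side='right') returns,
    # for each sample_date, how many shifted dates are <= it
    dates = np.sort(g['shifted'].values)
    return pd.Series(np.searchsorted(dates, g['sample_date'].values, side='right'),
                     index=g.index)

# date1 + 12 months, computed once for the whole frame
tmp = df.assign(shifted=df['date1'] + pd.DateOffset(months=12))
df['count_products'] = tmp.groupby('clients', group_keys=False).apply(counts)
```

On the toy data this yields [1, 2, 0, 2]; the per-client work is O(n log n) rather than quadratic.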
Latest update 31.05.2023: Additional dataset:
# dates rewritten in unambiguous ISO format (the original strings were
# day-first, e.g. '31.05.2018' = 2018-05-31)
df = pd.DataFrame({'date1': [pd.Timestamp('2018-08-06'), pd.Timestamp('2019-07-30'),
                             pd.Timestamp('2021-07-07'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2018-05-31'), pd.Timestamp('2019-07-24'),
                                   pd.Timestamp('2021-06-28'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client1', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})
The result (reading the date strings day-first): [0, 0, 2, 1].
Latest update: the logic is, for each client & sample_date pair, to count the products whose date1 is at least 12 months before sample_date.
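That rule can be pinned down with a small brute-force reference implementation (a sketch; it re-creates the second dataset with ISO dates, assuming the day-first reading of strings like '31.05.2018'):

```python
import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2018-08-06'), pd.Timestamp('2019-07-30'),
                             pd.Timestamp('2021-07-07'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2018-05-31'), pd.Timestamp('2019-07-24'),
                                   pd.Timestamp('2021-06-28'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client1', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

# brute force: for each row, count this client's products whose
# date1 + 12 months does not exceed the row's sample_date
def count_products(df):
    result = []
    for _, row in df.iterrows():
        mask = ((df['clients'] == row['clients'])
                & (df['date1'] + pd.DateOffset(months=12) <= row['sample_date']))
        result.append(int(mask.sum()))
    return result

df['count_products'] = count_products(df)
```

This is far too slow for 20M rows, but it is an unambiguous statement of the rule to test faster solutions against; on this dataset it produces [0, 0, 2, 1].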
The answer is the following:
You can use iterrows(); if you must iterate over a dataframe at all, iterrows() is the way to do it. But there is a better way:
def calc_func(partition):
    # note the 12-month offset, per the stated condition
    partition['count_products'] = partition.apply(
        lambda row: partition.loc[partition['date1'] + pd.DateOffset(months=12)
                                  <= row['sample_date'], 'products'].count(),
        axis=1)
    return partition
result_df = df.groupby('clients', group_keys=False).apply(calc_func)
df = df.merge(result_df.groupby(['clients', 'sample_date'], as_index=False)['count_products'].first(),
              on=['clients', 'sample_date'])
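Since the last step collapses the counts to one value per (clients, sample_date) pair and merges them back, the same idea can be sketched end-to-end without any row-wise apply: compute the counts once per unique pair, then merge. (A sketch on the first toy dataset; the intermediate merge grows with rows-per-client, so watch memory at 20M rows.)

```python
import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'),
                             pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'),
                                   pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

# unique (client, sample_date) pairs: counts are computed once per pair
pairs = df[['clients', 'sample_date']].drop_duplicates()

# shift each event date by 12 months up front
events = df[['clients', 'date1']].assign(
    shifted=lambda d: d['date1'] + pd.DateOffset(months=12))

# pair each (client, sample_date) with all of that client's events, count hits
merged = pairs.merge(events, on='clients')
merged['hit'] = merged['shifted'] <= merged['sample_date']
counts = (merged.groupby(['clients', 'sample_date'], as_index=False)['hit']
                .sum()
                .rename(columns={'hit': 'count_products'}))

df = df.merge(counts, on=['clients', 'sample_date'], how='left')
```

On the first toy dataset this gives count_products = [1, 2, 0, 2], matching the row-wise approach.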