python, pandas, group-by, array-broadcasting, numpy-ufunc

Groupby and transform in pandas based on window conditions


Please help me find an optimal solution for this task.

We have a pandas DataFrame with two main date columns and many others (and more than 20 million rows).

Here is a toy example of the dataset:

import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

The input df

We need to group by and transform the DataFrame at the client level, but with a window condition for each client: a row counts only if date1 + 12 months <= sample_date.

The outcome would be a new column in the df DataFrame with the resulting per-row counts.
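
To make the rule concrete, here is a small brute-force sketch of the condition (for illustration only; expected_counts is just an illustrative helper, not a pandas method). On the four toy rows it gives counts of 1, 2, 0 and 2:

import pandas as pd

def expected_counts(frame):
    # for each row, count same-client rows whose date1 + 12 months
    # falls on or before that row's sample_date
    offset = pd.DateOffset(months=12)
    return [((frame['clients'] == row.clients) &
             (frame['date1'] + offset <= row.sample_date)).sum()
            for row in frame.itertuples()]

# expected_counts(df)  ->  [1, 2, 0, 2] for the toy df above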

Below is my code. It works, but it is very slow. Please help me optimize it using pandas methods that I'm not yet aware of!

import numpy as np

# initializing the outcome column
df['count_products'] = np.nan

for i in range(df.shape[0]):
    # rows whose date1 is at least 12 months before this row's sample_date
    df_temp = df[(df['date1'] + pd.DateOffset(months=12)) <= df['sample_date'].iloc[i]]
    # keep only the current row's client
    df_temp = df_temp[df_temp['clients'] == df['clients'].iloc[i]]
    # assign the count for this row (avoids chained assignment)
    df.loc[i, 'count_products'] = df_temp['products'].count()

I would appreciate any help!

Latest update 31.05.2023: Additional dataset:

df = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                   'clients': ['client1', 'client1', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

The result should be this:

Latest update: the logic is to count, for each client & sample_date pair, the number of products whose date1 is at least 12 months before that sample_date.
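
For illustration, the same pair-level rule can also be written as a vectorized sketch (count_products_vectorized below is only a placeholder helper, not a pandas method): sort each client's date1 + 12 months thresholds once, then binary-search the sample dates with numpy.searchsorted.

import numpy as np
import pandas as pd

def count_products_vectorized(frame):
    # a product counts for a row when its date1 + 12 months is on or before
    # the row's sample_date, restricted to the same client
    thresholds = (frame['date1'] + pd.DateOffset(months=12)).to_numpy()
    samples = frame['sample_date'].to_numpy()
    out = np.empty(len(frame), dtype='int64')
    for client, pos in frame.groupby('clients').indices.items():
        sorted_thr = np.sort(thresholds[pos])  # sort this client's thresholds once
        # number of thresholds <= each of this client's sample dates
        out[pos] = np.searchsorted(sorted_thr, samples[pos], side='right')
    return out

# df['count_products'] = count_products_vectorized(df)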


Solution

  • The answer is the following:

    1. You can use iterrows(): if you iterate in pandas at all, do it only with iterrows().

    2. There is a better way:

    def calc_func(partition):
        # per row: count this client's products whose date1 + 12 months is on or before the row's sample_date
        partition['count_products'] = partition.apply(
            lambda row: partition.loc[partition['date1'] + pd.DateOffset(months=12) <= row['sample_date'], 'products'].count(),
            axis=1)
        return partition

    result_df = df.groupby('clients', group_keys=False).apply(calc_func)

    # keep one count per (clients, sample_date) pair and merge it back onto df
    result_df = df.merge(result_df.groupby(['clients', 'sample_date'], as_index=False)['count_products'].first(),
                         on=['clients', 'sample_date'])
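
    As a quick sanity check on the first toy dataset (assuming the 12-month offset belongs in the filter, as in the snippet above), this reproduces the counts from the brute-force sketch in the question:

    check = df.groupby('clients', group_keys=False).apply(calc_func).sort_index()
    check['count_products'].tolist()   # [1, 2, 0, 2] under the 12-month rule

    Note that this approach still runs a Python-level apply over every row inside each client partition, so on more than 20 million rows a sorted searchsorted approach like the sketch in the question should scale considerably better.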