I have the pandas dataframe with id, number and date. I want to create a new column which shows how many dates are in range of last 2 years of current id's date with same number(group by number). You can assume the data has no duplicates.
import pandas as pd
df = pd.DataFrame({'id': [1,2,3,4,5,6],'number':['a', 'a', 'a', 'a', 'b', 'c'], 'date' : ['2023-06-20', '2022-06-20', '2021-06-20', '2021-07-20', '2022-06-20', '2023-06-20']})
df['date'] = pd.to_datetime(df['date'] )
print(df)
Initial data:
id number date
0 1 a 2023-06-20
1 2 a 2022-06-20
2 3 a 2021-06-20
3 4 a 2021-07-20
4 5 b 2022-06-20
5 6 c 2023-06-20
I have tried using pandas group by with filters but I have 300k+ rows and it is taking 2+ hours to execute below code. How can I optimize it?
df['before_2_yrs'] = df.apply(lambda x : x.date - pd.DateOffset(years=2), axis = 1)
print(df)
df['results'] = df.apply(lambda x:len(df[(df['number'] == x['number']) & (df['date'] >= x['before_2_yrs']) & (df['date'] < x['date'])]), axis = 1)
print(df)
After adding dateoffset:
id number date before_2_yrs
0 1 a 2023-06-20 2021-06-20
1 2 a 2022-06-20 2020-06-20
2 3 a 2021-06-20 2019-06-20
3 4 a 2021-07-20 2019-07-20
4 5 b 2022-06-20 2020-06-20
5 6 c 2023-06-20 2021-06-20
final data:
id number date before_2_yrs results
0 1 a 2023-06-20 2021-06-20 3
1 2 a 2022-06-20 2020-06-20 2
2 3 a 2021-06-20 2019-06-20 0
3 4 a 2021-07-20 2019-07-20 1
4 5 b 2022-06-20 2020-06-20 0
5 6 c 2023-06-20 2021-06-20 0
Perhaps you could try to use .groupby
to not compare for the same number
each iteration?
def fn(x):
out = []
for d, b in zip(x['date'], x['before_2_yrs']):
out.append(((x['date'] >= b) & (x['date'] < d)).sum())
return pd.Series(out, index=x.index)
df['before_2_yrs'] = df['date'] - pd.DateOffset(years=2)
df['results'] = df.groupby('number', group_keys=False).apply(fn)
print(df)
Prints:
id number date before_2_yrs results
0 1 a 2023-06-20 2021-06-20 3
1 2 a 2022-06-20 2020-06-20 2
2 3 a 2021-06-20 2019-06-20 0
3 4 a 2021-07-20 2019-07-20 1
4 5 b 2022-06-20 2020-06-20 0
5 6 c 2023-06-20 2021-06-20 0