I have a pandas DataFrame with id, number and date columns. I want to create a new column that shows, for each row, how many rows with the same number have an earlier date (that is, a count grouped by number).
import pandas as pd
df = pd.DataFrame({'id': [1,2,3,4,5,6],'number':['a', 'a', 'a', 'a', 'b', 'c'], 'date' : ['2023-06-20', '2022-06-20', '2021-06-20', '2021-07-20', '2022-06-20', '2023-06-20']})
print(df)
id number date
0 1 a 2023-06-20
1 2 a 2022-06-20
2 3 a 2021-06-20
3 4 a 2021-07-20
4 5 b 2022-06-20
5 6 c 2023-06-20
I have used the pandas apply function with a boolean filter. The issue is that I have 300k+ rows and the code below takes around 2 hours to run, since it filters the whole DataFrame once per row. How can I optimize this?
df['result'] = df.apply(lambda x:len(df[(df['number'] == x['number']) & (df['date'] < x['date'])]), axis = 1)
print(df)
id number date result
0 1 a 2023-06-20 3
1 2 a 2022-06-20 2
2 3 a 2021-06-20 0
3 4 a 2021-07-20 1
4 5 b 2022-06-20 0
5 6 c 2023-06-20 0
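You can sort your DataFrame by date, then compute the cumulative count within each number group. cumcount keeps the original index, so the assignment aligns the counts back to the unsorted rows: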
df['result'] = df.sort_values('date').groupby('number').cumcount()
print(df)
# Output
id number date result
0 1 a 2023-06-20 3
1 2 a 2022-06-20 2
2 3 a 2021-06-20 0
3 4 a 2021-07-20 1
4 5 b 2022-06-20 0
5 6 c 2023-06-20 0
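Note: this works only because the dates are unique within each number group. If two rows in the same group shared a date, cumcount would give them different counts, whereas the original apply code gives them the same count.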
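If dates can repeat within a group, one possible alternative is to rank the dates per group instead of counting positions. The following is a minimal sketch, not part of the original answer: the sample data is hypothetical (ids 3 and 4 are given the same date), and it assumes the date column parses cleanly with pd.to_datetime and contains no missing values.

```python
import pandas as pd

# Hypothetical sample data: ids 3 and 4 share a date within group 'a'.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'number': ['a', 'a', 'a', 'a', 'b', 'c'],
    'date': ['2023-06-20', '2022-06-20', '2021-06-20',
             '2021-06-20', '2022-06-20', '2023-06-20'],
})

# Parse the strings so that ranking operates on real datetimes.
dates = pd.to_datetime(df['date'])

# rank(method='min') gives each row 1 + the number of strictly smaller
# values in its group, so subtracting 1 counts the earlier dates.
df['result'] = (
    dates.groupby(df['number']).rank(method='min').sub(1).astype(int)
)
print(df)
```

Tied rows receive the same rank, so both tied rows get the same count, which matches what the strict `<` comparison in the apply version produces.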