python pandas group-by data-preprocessing

Pandas: group by number and count dates earlier than the current row's date


I have a pandas dataframe with id, number and date columns. I want to create a new column that shows, for each row, how many rows with the same number (i.e. grouped by number) have a date earlier than that row's date.

import pandas as pd
df = pd.DataFrame({'id': [1,2,3,4,5,6],'number':['a', 'a', 'a', 'a', 'b', 'c'], 'date' : ['2023-06-20', '2022-06-20', '2021-06-20', '2021-07-20', '2022-06-20', '2023-06-20']})
print(df)

   id number        date
0   1      a  2023-06-20
1   2      a  2022-06-20
2   3      a  2021-06-20
3   4      a  2021-07-20
4   5      b  2022-06-20
5   6      c  2023-06-20

I used the pandas apply function with a boolean filter. The problem is that I have 300k+ rows, and the code below takes around 2 hours to run. How can I optimize it?

df['result'] = df.apply(lambda x:len(df[(df['number'] == x['number']) & (df['date'] < x['date'])]), axis = 1)
print(df)
   id number        date  result
0   1      a  2023-06-20       3
1   2      a  2022-06-20       2
2   3      a  2021-06-20       0
3   4      a  2021-07-20       1
4   5      b  2022-06-20       0
5   6      c  2023-06-20       0

Solution

  • You can sort your dataframe by date, then compute the cumulative count within each group. The assignment aligns on the index, so each count lands on its original row:

    df['result'] = df.sort_values('date').groupby('number').cumcount()
    print(df)
    
    # Output
       id number        date  result
    0   1      a  2023-06-20       3
    1   2      a  2022-06-20       2
    2   3      a  2021-06-20       0
    3   4      a  2021-07-20       1
    4   5      b  2022-06-20       0
    5   6      c  2023-06-20       0
    

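    Note: this works only because the dates are unique within each number group in your dataframe. With duplicate dates in a group, cumcount would also count tied rows as earlier.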
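
If a group can contain duplicate dates, one possible alternative is to rank the dates within each group with `method='min'` and subtract 1, which gives the number of strictly earlier dates. The following is a minimal sketch and not part of the original answer: the sample data with a tied date and the name `df_dup` are made up for illustration, and the date strings are parsed with `pd.to_datetime` before ranking.

```python
import pandas as pd

# Hypothetical sample: group 'a' contains a duplicated date
df_dup = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'number': ['a', 'a', 'a', 'b'],
    'date': ['2021-01-01', '2021-01-01', '2022-01-01', '2020-05-05'],
})

# Parse the strings so that ranking compares real dates
dates = pd.to_datetime(df_dup['date'])

# method='min' gives tied dates the lowest rank of the tie,
# so (rank - 1) is the number of strictly earlier dates in the group
df_dup['result'] = (
    dates.groupby(df_dup['number']).rank(method='min').sub(1).astype(int)
)
print(df_dup)
```

On this sample, both tied rows in group a get 0, whereas the sorted cumcount approach would assign 0 to one of them and 1 to the other.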