pythonpandasgroup-bydata-preprocessing

Pandas group by number and find dates in range of past 2 years


I have the pandas dataframe with id, number and date. I want to create a new column which shows how many dates are in range of last 2 years of current id's date with same number(group by number). You can assume the data has no duplicates.

import pandas as pd
df = pd.DataFrame({'id': [1,2,3,4,5,6],'number':['a', 'a', 'a', 'a', 'b', 'c'], 'date' : ['2023-06-20', '2022-06-20', '2021-06-20', '2021-07-20', '2022-06-20', '2023-06-20']})
df['date'] = pd.to_datetime(df['date'] )
print(df)
Initial data: 
    id number       date
0   1      a 2023-06-20
1   2      a 2022-06-20
2   3      a 2021-06-20
3   4      a 2021-07-20
4   5      b 2022-06-20
5   6      c 2023-06-20

I have tried using pandas group by with filters but I have 300k+ rows and it is taking 2+ hours to execute below code. How can I optimize it?

df['before_2_yrs'] = df.apply(lambda x : x.date - pd.DateOffset(years=2), axis = 1) 
print(df)
df['results'] = df.apply(lambda x:len(df[(df['number'] == x['number']) & (df['date'] >= x['before_2_yrs']) & (df['date'] < x['date'])]), axis = 1)
print(df)

After adding dateoffset: 
    id number       date before_2_yrs
0   1      a 2023-06-20   2021-06-20
1   2      a 2022-06-20   2020-06-20
2   3      a 2021-06-20   2019-06-20
3   4      a 2021-07-20   2019-07-20
4   5      b 2022-06-20   2020-06-20
5   6      c 2023-06-20   2021-06-20
final data: 
    id number       date before_2_yrs  results
0   1      a 2023-06-20   2021-06-20        3
1   2      a 2022-06-20   2020-06-20        2
2   3      a 2021-06-20   2019-06-20        0
3   4      a 2021-07-20   2019-07-20        1
4   5      b 2022-06-20   2020-06-20        0
5   6      c 2023-06-20   2021-06-20        0

Solution

  • Perhaps you could try to use .groupby to not compare for the same number each iteration?

    def fn(x):
        out = []
        for d, b in zip(x['date'], x['before_2_yrs']):
            out.append(((x['date'] >= b) & (x['date'] < d)).sum())
        return pd.Series(out, index=x.index)
    
    df['before_2_yrs'] = df['date'] - pd.DateOffset(years=2)
    df['results'] = df.groupby('number', group_keys=False).apply(fn)
    print(df)
    

    Prints:

       id number       date before_2_yrs  results
    0   1      a 2023-06-20   2021-06-20        3
    1   2      a 2022-06-20   2020-06-20        2
    2   3      a 2021-06-20   2019-06-20        0
    3   4      a 2021-07-20   2019-07-20        1
    4   5      b 2022-06-20   2020-06-20        0
    5   6      c 2023-06-20   2021-06-20        0