pandasdategroup-bymerge

Nearest date between 2 column dates, generate new column


I have this toydataset:

df = pd.DataFrame({'user':[1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4],
                  'd1':['1995-09-01','1995-09-02','1995-10-03','1995-10-04','1995-10-05','1995-11-07','1995-11-08','1995-11-09','1995-11-10','1995-11-15','1995-12-18','1995-12-19','1995-12-20','1995-12-23','1995-12-26','1995-12-30'],
                  'd2':['1995-10-05','1995-10-05','1995-10-05',\
                        '1995-11-08','1995-11-08','1995-11-08','1995-11-08',\
                        '1995-12-10','1995-12-10','1995-12-10','1995-12-10',\
                        '1995-12-27','1995-12-27','1995-12-27','1995-12-27','1995-12-27'],})

When sorted by user and d1 (df = df.sort_values(['user', 'd1'])) yields :

  user      d1         d2
    1   1995-09-01  1995-10-05
    1   1995-09-02  1995-10-05
    1   1995-10-03  1995-10-05
    2   1995-10-04  1995-11-08
    2   1995-10-05  1995-11-08
    2   1995-11-07  1995-11-08
    2   1995-11-08  1995-11-08
    3   1995-11-09  1995-12-10
    3   1995-11-10  1995-12-10
    3   1995-11-15  1995-12-10
    3   1995-12-18  1995-12-10
    4   1995-12-19  1995-12-27
    4   1995-12-20  1995-12-27
    4   1995-12-23  1995-12-27
    4   1995-12-26  1995-12-27
    4   1995-12-30  1995-12-27

Need to generate a new column [d3] with the nearest date on d1 to d2 column. Such as if d2 date is present in d1, d3 shows d2 date. Else show nearest date.

Note that results are grouped by user.

Following dataframe is the desired result:

  user      d1          d2         d3
    1   1995-09-01  1995-10-05  1995-10-03
    1   1995-09-02  1995-10-05  1995-10-03
    1   1995-10-03  1995-10-05  1995-10-03
    2   1995-10-04  1995-11-08  1995-11-08
    2   1995-10-05  1995-11-08  1995-11-08
    2   1995-11-07  1995-11-08  1995-11-08
    2   1995-11-08  1995-11-08  1995-11-08
    3   1995-11-09  1995-12-10  1995-12-18
    3   1995-11-10  1995-12-10  1995-12-18
    3   1995-11-15  1995-12-10  1995-12-18
    3   1995-12-18  1995-12-10  1995-12-18
    4   1995-12-19  1995-12-27  1995-12-26
    4   1995-12-20  1995-12-27  1995-12-26
    4   1995-12-23  1995-12-27  1995-12-26
    4   1995-12-26  1995-12-27  1995-12-26
    4   1995-12-30  1995-12-27  1995-12-26

I tried adapting the solution from this post, and this other one, but it did not work.


Solution

  • You can compute the absolute difference between both dates, get the min per group with idxmin and map the values:

    df[['d1', 'd2']] = df[['d1', 'd2']].apply(pd.to_datetime)
    
    # get index with min difference per user
    idx = df['d2'].sub(df['d1']).abs().groupby(df['user']).idxmin()
    
    df['d3'] = df['user'] .map(df.loc[idx, 'd1'].set_axis(idx.index))
    

    NB. this approach assumes that all d2 are identical for a given user.

    Or using a merge_asof:

    df[['d1', 'd2']] = df[['d1', 'd2']].apply(pd.to_datetime)
    df['d3'] = (pd.merge_asof(df[['user', 'd2']].reset_index().sort_values(by='d2'),
                              df[['user', 'd1']].sort_values(by='d1'),
                              by='user', left_on='d2', right_on='d1',
                              direction='nearest')
                  .set_index('index')['d1']
               )
    

    NB. this approach does not assume that all d2 are identical for a given user.

    Output:

        user         d1         d2         d3
    0      1 1995-09-01 1995-10-05 1995-10-03
    1      1 1995-09-02 1995-10-05 1995-10-03
    2      1 1995-10-03 1995-10-05 1995-10-03
    3      2 1995-10-04 1995-11-08 1995-11-08
    4      2 1995-10-05 1995-11-08 1995-11-08
    5      2 1995-11-07 1995-11-08 1995-11-08
    6      2 1995-11-08 1995-11-08 1995-11-08
    7      3 1995-11-09 1995-12-10 1995-12-18
    8      3 1995-11-10 1995-12-10 1995-12-18
    9      3 1995-11-15 1995-12-10 1995-12-18
    10     3 1995-12-18 1995-12-10 1995-12-18
    11     4 1995-12-19 1995-12-27 1995-12-26
    12     4 1995-12-20 1995-12-27 1995-12-26
    13     4 1995-12-23 1995-12-27 1995-12-26
    14     4 1995-12-26 1995-12-27 1995-12-26
    15     4 1995-12-30 1995-12-27 1995-12-26