pythonpandasmerge

Merge asof missing values


Consider the following pandas:

df1:

user_id create_at
0   1   2023-01-01
1   1   2023-01-02
2   2   2023-01-01
3   2   2023-01-03
4   2   2023-01-05

df2:

user_id withdrawal_request_date
0   1   2023-01-03
1   2   2023-01-02
2   3   2023-01-01

I want to do a merge_asof BUT when I do, I lose the user with user_id=3. He is not in df1 so I lose him as you see below:

pd.merge_asof(df1.sort_values('create_at'), 
              df2.sort_values('withdrawal_request_date'), 
              by = 'user_id', 
              left_on = 'create_at', 
              right_on = 'withdrawal_request_date', 
              direction = 'forward')

user_id create_at   withdrawal_request_date
0   1   2023-01-01  2023-01-03
1   2   2023-01-01  2023-01-02
2   1   2023-01-02  2023-01-03
3   2   2023-01-03  NaT
4   2   2023-01-05  NaT

However, I would like to keep him in the final dataframe. The expected output should look like:

user_id create_at   withdrawal_request_date
0   1   2023-01-01  2023-01-03
1   2   2023-01-01  2023-01-02
2   1   2023-01-02  2023-01-03
3   2   2023-01-03  NaT
4   2   2023-01-05  NaT
5   3   NaT         2023-01-01

NOTE that my dataset is huge with thousands of users each doing a lot of activity so doing something like outer joins will not work here.

Any ideas?

DATA

df1 = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'create_at': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-03', '2023-01-05'])
})

df2 = pd.DataFrame({
    'user_id': [1, 2, 3],
    'withdrawal_request_date': pd.to_datetime(['2023-01-03', '2023-01-02', '2023-01-01'])
})

Solution

  • You can't perform what you want directly with a merge_asof. A merge_asof is anyway a left join.

    What you can do however is post-process the output of the merge_asof to add the missing user_ids with help of isin and concat:

    tmp = pd.merge_asof(df1.sort_values('create_at'), 
                        df2.sort_values('withdrawal_request_date'), 
                        by='user_id', 
                        left_on='create_at', 
                        right_on='withdrawal_request_date', 
                        direction='forward')
    
    out = pd.concat([tmp, df2[~df2['user_id'].isin(tmp['user_id'])]],
                    ignore_index=True)
    

    Output:

       user_id  create_at withdrawal_request_date
    0        1 2023-01-01              2023-01-03
    1        2 2023-01-01              2023-01-02
    2        1 2023-01-02              2023-01-03
    3        2 2023-01-03                     NaT
    4        2 2023-01-05                     NaT
    5        3        NaT              2023-01-01