Consider the following pandas:
df1:
user_id create_at
0 1 2023-01-01
1 1 2023-01-02
2 2 2023-01-01
3 2 2023-01-03
4 2 2023-01-05
df2:
user_id withdrawal_request_date
0 1 2023-01-03
1 2 2023-01-02
2 3 2023-01-01
I want to do a merge_asof
BUT when I do, I lose the user with user_id=3
. He is not in df1 so I lose him as you see below:
pd.merge_asof(df1.sort_values('create_at'),
df2.sort_values('withdrawal_request_date'),
by = 'user_id',
left_on = 'create_at',
right_on = 'withdrawal_request_date',
direction = 'forward')
user_id create_at withdrawal_request_date
0 1 2023-01-01 2023-01-03
1 2 2023-01-01 2023-01-02
2 1 2023-01-02 2023-01-03
3 2 2023-01-03 NaT
4 2 2023-01-05 NaT
However, I would like to keep him in the final dataframe. The expected output should look like:
user_id create_at withdrawal_request_date
0 1 2023-01-01 2023-01-03
1 2 2023-01-01 2023-01-02
2 1 2023-01-02 2023-01-03
3 2 2023-01-03 NaT
4 2 2023-01-05 NaT
5 3 NaT 2023-01-01
NOTE that my dataset is huge with thousands of users each doing a lot of activity so doing something like outer joins
will not work here.
Any ideas?
DATA
df1 = pd.DataFrame({
'user_id': [1, 1, 2, 2, 2],
'create_at': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-03', '2023-01-05'])
})
df2 = pd.DataFrame({
'user_id': [1, 2, 3],
'withdrawal_request_date': pd.to_datetime(['2023-01-03', '2023-01-02', '2023-01-01'])
})
You can't perform what you want directly with a merge_asof
. A merge_asof
is anyway a left join.
What you can do however is post-process the output of the merge_asof
to add the missing user_ids with help of isin
and concat
:
tmp = pd.merge_asof(df1.sort_values('create_at'),
df2.sort_values('withdrawal_request_date'),
by='user_id',
left_on='create_at',
right_on='withdrawal_request_date',
direction='forward')
out = pd.concat([tmp, df2[~df2['user_id'].isin(tmp['user_id'])]],
ignore_index=True)
Output:
user_id create_at withdrawal_request_date
0 1 2023-01-01 2023-01-03
1 2 2023-01-01 2023-01-02
2 1 2023-01-02 2023-01-03
3 2 2023-01-03 NaT
4 2 2023-01-05 NaT
5 3 NaT 2023-01-01