There are two tables, df1 and df2. df1 has the columns id and predicted_date; df2 has the columns id and actual_date.
import pandas as pd

df1 = pd.DataFrame({
    'id': ['1', '1', '1', '2', '2', '2', '3', '3'],
    'predicted_date': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01']
})
df2 = pd.DataFrame({
    'id': ['1', '1', '2', '2', '3', '3', '3', '3'],
    'actual_date': ['2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02', '2022-02-02']
})
I want to join them to get a dataframe with id, predicted_date, and actual_date, where predicted_date and actual_date correspond to the same id.
I tried to concatenate, but the ids are repeated, so the result is not correct. If I merge the dataframes instead, the predicted_date or actual_date observations get repeated.
df_new = pd.concat([df1, df2], axis = 1)
With concat, the result is:
I want to have something like:
How can it be done?
Try using merge on id and a helper column, key, created using groupby on id with cumcount:
df1.assign(key=df1.groupby('id').cumcount())\
   .merge(df2.assign(key=df2.groupby('id').cumcount()),
          on=['id', 'key'],
          how='outer')\
   .drop('key', axis=1)
Output:
  id predicted_date actual_date
0  1     2022-01-01  2022-01-02
1  1     2022-02-01  2022-02-02
2  1     2022-03-01         NaN
3  2     2022-01-01  2022-03-02
4  2     2022-02-01  2022-01-02
5  2     2022-03-01         NaN
6  3     2022-01-01  2022-02-02
7  3     2022-02-01  2022-03-02
8  3            NaN  2022-01-02
9  3            NaN  2022-02-02
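
For reference, here is the same idea as a self-contained sketch that you can run end to end. The function name merge_by_position and its by parameter are my own illustrative choices, not pandas API, and the sketch assumes neither frame already has a column called key. It also counts the rows of a plain merge on id, which is where the repeated observations in the question come from: every predicted row of an id is paired with every actual row of that id.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['1', '1', '1', '2', '2', '2', '3', '3'],
    'predicted_date': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01',
                       '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01'],
})
df2 = pd.DataFrame({
    'id': ['1', '1', '2', '2', '3', '3', '3', '3'],
    'actual_date': ['2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02',
                    '2022-02-02', '2022-03-02', '2022-01-02', '2022-02-02'],
})


def merge_by_position(left, right, by='id'):
    """Pair the n-th row of each id in `left` with the n-th row of that id in `right`.

    Hypothetical helper for illustration; assumes no existing column named 'key'.
    """
    # cumcount numbers the rows 0, 1, 2, ... within each group of `by`
    left_keyed = left.assign(key=left.groupby(by).cumcount())
    right_keyed = right.assign(key=right.groupby(by).cumcount())
    # An outer merge keeps rows that have no partner on the other side (filled with NaN)
    merged = left_keyed.merge(right_keyed, on=[by, 'key'], how='outer')
    return merged.drop(columns='key')


# The helper key for df1: a running counter that restarts for every id
df1_key = df1.groupby('id').cumcount().tolist()

# A plain merge on id alone pairs every predicted row with every actual row per id
plain_rows = len(df1.merge(df2, on='id'))

result = merge_by_position(df1, df2)
print(result)
```

With the sample data, the plain merge produces 3*2 + 3*2 + 2*4 = 20 rows, while the keyed merge produces 10. Note that the pairing depends purely on row order within each id, so sort both frames first if their order is not already meaningful.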