python pandas join merge concatenation

Join dataframes by a column with repeated values


There are two tables, df1 and df2. df1 has the columns id and predicted_date; df2 has the columns id and actual_date.

import pandas as pd

df1 = pd.DataFrame({
                    'id': ['1', '1', '1', '2', '2', '2', '3', '3'],
                    'predicted_date': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01','2022-02-01']
                   })

df2 = pd.DataFrame({
                    'id': ['1', '1', '2', '2', '3', '3', '3', '3'],
                    'actual_date': ['2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02','2022-02-02']
                   })

I want to join them into a single dataframe with the columns id, predicted_date, and actual_date, where predicted_date and actual_date are matched within each id.

I tried concatenating, but because the ids are repeated the result is not correct. If I merge the dataframes on id instead, the predicted_date or actual_date observations get repeated.

df_new = pd.concat([df1, df2], axis = 1)

With concat, the result is:

[image: output of the concat call above]

I want to have something like:

[image: desired result]

There can be NAs instead of blanks.

How can it be done?


Solution

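  • Try a merge on id plus a helper column, key, created with groupby on id and cumcount. key numbers the rows within each id (0, 1, 2, ...), so the n-th predicted_date of an id is paired with the n-th actual_date of the same id: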

    df1.assign(key=df1.groupby('id').cumcount())\
       .merge(df2.assign(key=df2.groupby('id').cumcount()), 
              on=['id', 'key'], 
              how='outer')\
       .drop('key', axis=1)
    

    Output:

      id predicted_date actual_date
    0  1     2022-01-01  2022-01-02
    1  1     2022-02-01  2022-02-02
    2  1     2022-03-01         NaN
    3  2     2022-01-01  2022-03-02
    4  2     2022-02-01  2022-01-02
    5  2     2022-03-01         NaN
    6  3     2022-01-01  2022-02-02
    7  3     2022-02-01  2022-03-02
    8  3            NaN  2022-01-02
    9  3            NaN  2022-02-02
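
For reference, here is a self-contained sketch of the same idea that can be pasted and run as is. The function name merge_by_occurrence and the helper column name key are only illustrative choices (pick another helper name if your data already has a key column), and the plain merge at the end is included just to show the row repetition described in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['1', '1', '1', '2', '2', '2', '3', '3'],
    'predicted_date': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01',
                       '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01'],
})

df2 = pd.DataFrame({
    'id': ['1', '1', '2', '2', '3', '3', '3', '3'],
    'actual_date': ['2022-01-02', '2022-02-02', '2022-03-02', '2022-01-02',
                    '2022-02-02', '2022-03-02', '2022-01-02', '2022-02-02'],
})


def merge_by_occurrence(left, right, on='id'):
    """Pair the n-th row of each id in `left` with the n-th row of that id in `right`.

    `merge_by_occurrence` and the helper column `key` are hypothetical names
    used for this sketch only.
    """
    # cumcount numbers the rows inside each group: 0, 1, 2, ...
    left_keyed = left.assign(key=left.groupby(on).cumcount())
    right_keyed = right.assign(key=right.groupby(on).cumcount())
    # The outer merge keeps rows that have no partner on the other side (filled with NaN)
    return (left_keyed
            .merge(right_keyed, on=[on, 'key'], how='outer')
            .drop(columns='key'))


result = merge_by_occurrence(df1, df2)
print(result)

# For comparison: merging on id alone pairs every row with every row of the same id
plain = df1.merge(df2, on='id')
print(len(result), len(plain))  # 10 rows vs. 20 rows
```

Note that the pairing is purely positional within each id, so it relies on the rows of both dataframes already being in the order you want them matched; sort by date first if that is not guaranteed.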