pythonpandasexplodefill

Adding rows for missing months and Fill na values with last value in a partition Python


I have a pandas dataframe and I need to fill NULL values with the last value on a partition. Specifically, for each "id" and "month", I need to create and explode last "value" on subsequent months.

Example of my dataset:

id month value
1 2023-01-01 London
1 2023-02-01 Paris
2 2023-01-01 New York
3 2023-02-01 Paris
4 2023-03-01 NULL

My desidered output (Exploding the values up to April 2023):

id month value
1 2023-01-01 London
1 2023-02-01 Paris
1 2023-03-01 Paris
1 2023-04-01 Paris
2 2023-01-01 New York
2 2023-02-01 New York
2 2023-03-01 New York
2 2023-04-01 New York
3 2023-02-01 Paris
3 2023-03-01 Paris
3 2023-04-01 Paris
4 2023-03-01 NULL
4 2023-04-01 NULL

Thank u!


Solution

  • One option using reshaping as a rectangular intermediate with pivot/stack:

    df['month'] = pd.to_datetime(df['month'])
    
    out = (df
        .pivot(index='id', columns='month', values='value')
        .reindex(columns=pd.date_range('2023-01-01', '2023-04-01', freq='MS').rename('month'))
        .ffill(axis=1)
        .stack().reset_index(name='value')
    )
    

    Or with a MultiIndex and groupby.ffill:

    out = (df
       .set_index(['id', 'month'])
       .reindex(pd.MultiIndex.from_product([df['id'].unique(),
                                            pd.date_range('2023-01-01', '2023-04-01', freq='MS')],
                                          names=['id', 'month']
                                          ))
       .groupby(level=0).ffill().dropna().reset_index()
    )
    

    Output:

        id      month     value
    0    1 2023-01-01    London
    1    1 2023-02-01     Paris
    2    1 2023-03-01     Paris
    3    1 2023-04-01     Paris
    4    2 2023-01-01  New York
    5    2 2023-02-01  New York
    6    2 2023-03-01  New York
    7    2 2023-04-01  New York
    8    3 2023-02-01     Paris
    9    3 2023-03-01     Paris
    10   3 2023-04-01     Paris
    

    keeping/filling original NaNs:

    You can add a helper column to identify the original NaNs:

    out = (df
       .set_index(['id', 'month']).assign(flag=True)
       .reindex(pd.MultiIndex.from_product([df['id'].unique(),
                                            pd.date_range('2023-01-01', '2023-04-01', freq='MS')],
                                          names=['id', 'month']
                                          ))
       .groupby(level=0).ffill().loc[lambda d: d.pop('flag').notna()].reset_index()
    )
    

    Output:

        id      month     value
    0    1 2023-01-01    London
    1    1 2023-02-01     Paris
    2    1 2023-03-01     Paris
    3    1 2023-04-01     Paris
    4    2 2023-01-01  New York
    5    2 2023-02-01  New York
    6    2 2023-03-01  New York
    7    2 2023-04-01  New York
    8    3 2023-02-01     Paris
    9    3 2023-03-01     Paris
    10   3 2023-04-01     Paris
    11   4 2023-03-01       NaN
    12   4 2023-04-01       NaN