pythonpandasdataframedatetimeffill

Fill in missing dates and add lists together from column by row and front fill in pandas dataframe


I'd like to fill in the following dataframe with dates between a range and front fill all the columns. As I am completing this, I would like to append the lists in the Wells column so it becomes continues to add items by date.

Dataframe to expand by date

   StartDate     Wells                     count Sum_Cumm   vol
0   1967-10-01  [MUN-523, MUN-354, MUN-2660] 50   50    8503.323620
1   1968-01-01  [MUN-152]                     1   51    8336.591784
2   1968-03-01  [MUN-1032]                    1   52    8176.272712
3   1968-10-01  [MUN-16128]                   1   53    9191.110200

Code that I'm working on

newdf = (newdf.set_index('StartDate').reindex(pd.date_range('10-01-1967', '12-31-1994', freq='MS')).rename_axis(['StartDate']).reset_index()).ffill(newdf['vol'])

Dataframe that I'd like to end up with

 StartDate   Wells                                        count Sum_Cumm    vol
0   1967-10-01  [MUN-523, MUN-354, MUN-2660]                     50   50    8503.323620
1   1967-11-01  [MUN-523, MUN-354, MUN-2660]                     1    51    8503.323620    
2   1967-12-01  [MUN-523, MUN-354, MUN-2660]                     1    51    8503.323620
3   1968-01-01  [MUN-523, MUN-354, MUN-2660,MUN-152]             1    52    8336.591784
4   1968-02-01  [MUN-523, MUN-354, MUN-2660,MUN-152]             1    53    8336.591784
5   1968-03-01  [MUN-523, MUN-354, MUN-2660,MUN-152,MUN-1032]    1    53    8176.272712
6   1968-04-01  [MUN-523, MUN-354, MUN-2660,MUN-152,MUN-1032]    1    53    8176.272712

Solution

  • You can use period_range to create a new index and reindex the existing df to create a new dataframe and then ffill.

    For Wells column do a cumsum and then apply np.unique

    new_idx = pd.to_datetime(
        pd.period_range(df["StartDate"].min(), df["StartDate"].max(), freq="M")
        .asfreq("D", how="S")
        .strftime("%Y-%m-%d")
    )
    df2 = df.set_index("StartDate").reindex(new_idx)
    df2 = df2.ffill(downcast="infer")
    df2["Wells"] = df2["Wells"].cumsum().apply(np.unique)
    df2 = df2.rename_axis("StartDate").reset_index()
    
    print(df2)
    
        StartDate                                              Wells  count  \
    0  1967-10-01                       [MUN-2660, MUN-354, MUN-523]     50   
    1  1967-11-01                       [MUN-2660, MUN-354, MUN-523]     50   
    2  1967-12-01                       [MUN-2660, MUN-354, MUN-523]     50   
    3  1968-01-01              [MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    4  1968-02-01              [MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    5  1968-03-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    6  1968-04-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    7  1968-05-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    8  1968-06-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    9  1968-07-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    10 1968-08-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    11 1968-09-01    [MUN-1032, MUN-152, MUN-2660, MUN-354, MUN-523]      1   
    12 1968-10-01  [MUN-1032, MUN-152, MUN-16128, MUN-2660, MUN-3...      1   
    
        Sum_Cumm          vol  
    0         50  8503.323620  
    1         50  8503.323620  
    2         50  8503.323620  
    3         51  8336.591784  
    4         51  8336.591784  
    5         52  8176.272712  
    6         52  8176.272712  
    7         52  8176.272712  
    8         52  8176.272712  
    9         52  8176.272712  
    10        52  8176.272712  
    11        52  8176.272712  
    12        53  9191.110200