Tags: python, pandas, dataframe, interpolation, extrapolation

Fill in NaN in Pandas Dataframe using trend of previous valid values


I am trying to fill in gaps in data by grouping and then using the trend of the previous data points to predict what the missing values are.

df

Group  Week  Value
B      1     5
B      2     6
B      3     NaN
B      4     NaN
B      5     NaN
B      6     8
B      7     8
B      8     7
B      9     6
B      10    NaN

Which graphically looks like this: Initial df plot

Once the desired function has taken place the dataframe will look like the following:

Group  Week  Value
B      1     5
B      2     6
B      3     7
B      4     8
B      5     9
B      6     8
B      7     8
B      8     7
B      9     6
B      10    5.5

The trend of previous points to find these NaN values is shown graphically here: NaN values calculated

The first three NaN values in this example are found by fitting a straight line (y = mx + c) through the known values 5 and 6, then substituting each missing Week as x to calculate y. The same process would be carried out for every run of NaN values.
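As a quick sanity check, the straight-line fit described above can be sketched with numpy.polyfit (the points here are taken from the example; the snippet is only an illustration, not part of the original question):

```python
import numpy as np

# Fit y = m*x + c through the two valid points (Week 1, 5) and (Week 2, 6)
m, c = np.polyfit([1, 2], [5, 6], deg=1)

# Extrapolate the line to the missing weeks 3, 4 and 5
predicted = [m * week + c for week in (3, 4, 5)]
# predicted is approximately [7.0, 8.0, 9.0], matching the filled values above
```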

I have tried interpolating with df = df.groupby('Group').apply(lambda group: group.interpolate(method='index')), but this obviously looks at the next valid data point and includes it in the calculation, which I am trying to avoid.
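To see why plain index interpolation gives the wrong answer here, consider a small sketch (values taken from the example; this snippet is illustrative only):

```python
import numpy as np
import pandas as pd

# Weeks as the index, with the gap between the two valid runs
s = pd.Series([5, 6, np.nan, np.nan, np.nan, 8], index=[1, 2, 3, 4, 5, 6])

# method='index' draws a line between the surrounding valid points
# (weeks 2 and 6), so the gap is filled with 6.5, 7.0, 7.5 rather than
# continuing the 5 -> 6 trend to 7, 8, 9.
filled = s.interpolate(method='index')
```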

It may be worth noting that the dataframe I am using has 200,000 rows and 4,000 groups!


Solution

  • You can create a subgroup Series g and pass method="spline" with order=1 to interpolate:

    g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
    df['Value'] = (df.groupby(['Group', g])['Value']
                     .apply(lambda x: x.interpolate(method="spline", order=1)))
    df
    Out[1]: 
      Group  Week  Value
    0     B     1    5.0
    1     B     2    6.0
    2     B     3    7.0
    3     B     4    8.0
    4     B     5    9.0
    5     B     6    8.0
    6     B     7    8.0
    7     B     8    7.0
    8     B     9    6.0
    9     B    10    5.5
    

    The intermediary step to get g looks like this.

    g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
    g
    Out[1]: 
    0    0.0
    1    0.0
    2    0.0
    3    0.0
    4    0.0
    5    3.0
    6    3.0
    7    3.0
    8    3.0
    9    3.0
    

    The numbers themselves are arbitrary; they just label the subgroups so that each run of NaNs is grouped with the valid values before it. My method was one way to achieve this.
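    The same partition can also be built by flagging the start of each run of valid values (a hedged alternative to the mask/cumsum approach above, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [5, 6, np.nan, np.nan, np.nan, 8, 8, 7, 6, np.nan]})

# A new subgroup starts wherever a valid value follows a NaN (or the start
# of the Series); trailing NaNs then inherit the label of the valid run
# before them, exactly as g does above.
starts = df['Value'].notnull() & df['Value'].shift().isnull()
g2 = starts.cumsum()
# g2 -> [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]: the same two subgroups as g
```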


    Per your comment, I create a mask m that flags subgroups containing more than one valid value (spline interpolation needs at least two points to fit a line). Then I combine the two methods using fillna():

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Group': {0: 'A',
      1: 'B',
      2: 'B',
      3: 'B',
      4: 'B',
      5: 'B',
      6: 'B',
      7: 'B',
      8: 'B',
      9: 'B'},
     'Week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
     'Value': {0: 5.0,
      1: 6.0,
      2: np.nan,
      3: np.nan,
      4: np.nan,
      5: 8.0,
      6: 8.0,
      7: 7.0,
      8: 6.0,
      9: np.nan}})
    g = df['Value'].iloc[1:].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
    m = df.groupby(['Group', g])['Value'].transform('count') > 1
    v1 = (df[m].groupby(['Group', g])['Value']
                         .apply(lambda x: x.interpolate(method="spline", order=1)))
    v2 = (df.groupby(['Group', g])['Value']
                         .apply(lambda x: x.interpolate(method="index")))
    df['Value'] = df['Value'].fillna(v1).fillna(v2)
    df