I am trying to fill in gaps in data by grouping and then using the trend of the previous data points to predict what the missing values are.
df
Group Week Value
B 1 5
B 2 6
B 3 NaN
B 4 NaN
B 5 NaN
B 6 8
B 7 8
B 8 7
B 9 6
B 10 NaN
Which graphically looks like this: [image: Initial df plot]
Once the desired fill has been applied, the dataframe should look like the following:
Group Week Value
B 1 5
B 2 6
B 3 7
B 4 8
B 5 9
B 6 8
B 7 8
B 8 7
B 9 6
B 10 5.5
The trend of the previous points used to find these NaN values is shown graphically here: [image: NaN values calculated]
The first three NaN values in this example are found by taking the two preceding values, 5 and 6, fitting the linear equation (y = mx + c) through them, and substituting the Week for x to calculate y. The same process would then be carried out for every run of NaN values.
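For example, a minimal sketch of that per-gap fit using numpy.polyfit, with the numbers taken from the table above:

import numpy as np

# Fit y = mx + c through the two points before the gap (Weeks 1-2)
m, c = np.polyfit([1, 2], [5, 6], deg=1)

# Extrapolate to the missing Weeks 3-5
print([m * week + c for week in (3, 4, 5)])  # [7.0, 8.0, 9.0]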
I have tried interpolating with df = df.groupby('Group').apply(lambda group: group.interpolate(method='index')), but this obviously looks ahead to the next valid data point and includes it in the calculation, which is what I am trying to avoid.
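To illustrate the problem with a small sketch: index interpolation across the first gap just draws a straight line between 6 (Week 2) and the next valid point, 8 (Week 6), instead of continuing the earlier trend:

import numpy as np
import pandas as pd

s = pd.Series([5, 6, np.nan, np.nan, np.nan, 8], index=[1, 2, 3, 4, 5, 6])
print(s.interpolate(method='index').tolist())
# [5.0, 6.0, 6.5, 7.0, 7.5, 8.0] -- not the 7, 8, 9 the previous trend gives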
It may be worth noting that the dataframe I am using has 200,000 rows and 4,000 groups!
You can create a subgroup Series g and pass method="spline" and order=1 to interpolate:
# Label each run of NaNs together with the valid points that precede it
g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
# Within each (Group, subgroup) pair, a first-order spline extends the prior trend
df['Value'] = (df.groupby(['Group', g])['Value']
               .apply(lambda x: x.interpolate(method="spline", order=1)))
df
Out[1]:
Group Week Value
0 B 1 5.0
1 B 2 6.0
2 B 3 7.0
3 B 4 8.0
4 B 5 9.0
5 B 6 8.0
6 B 7 8.0
7 B 8 7.0
8 B 9 6.0
9 B 10 5.5
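Why this works where method="index" does not: pandas hands method="spline" to SciPy's UnivariateSpline, and with order=1 that amounts to a straight line fitted through the subgroup's valid points and extrapolated past the last one (that is how 8, 8, 7, 6 produces 5.5 at Week 10). A minimal sketch on the first subgroup in isolation (SciPy required):

import numpy as np
import pandas as pd

# Only the points *before* the gap are in the subgroup, so the spline
# can only follow the 5 -> 6 trend
sub = pd.Series([5.0, 6.0, np.nan, np.nan, np.nan])
print(sub.interpolate(method="spline", order=1).tolist())
# [5.0, 6.0, 7.0, 8.0, 9.0]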
The intermediate step to get g looks like this:
g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
g
Out[1]:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 3.0
6 3.0
7 3.0
8 3.0
9 3.0
The numbers just label subgroups: every gap of NaNs shares a label with the valid points immediately before it, so each gap is filled from the preceding trend only. My mask/cumsum/ffill chain is one way to build these labels; see the breakdown below.
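Breaking the one-liner into its pieces on the example column (a sketch; the names step1 and step2 are invented purely for illustration):

import numpy as np
import pandas as pd

v = pd.Series([5, 6, np.nan, np.nan, np.nan, 8, 8, 7, 6, np.nan])

step1 = v.isnull().cumsum()         # 0 0 1 2 3 3 3 3 3 4 -- running NaN count
step2 = v.mask(v.notnull(), step1)  # 0 0 NaN NaN NaN 3 3 3 3 NaN -- label valid rows
g = step2.ffill()                   # 0 0 0 0 0 3 3 3 3 3 -- gaps inherit the label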
Per your comment, I create a mask m that filters out subgroups containing only one valid value, since the spline needs at least two points to fit. Then I combine the two interpolation methods using fillna():
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': {0: 'A',
1: 'B',
2: 'B',
3: 'B',
4: 'B',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B'},
'Week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Value': {0: 5.0,
1: 6.0,
2: np.nan,
3: np.nan,
4: np.nan,
5: 8.0,
6: 8.0,
7: 7.0,
8: 6.0,
9: np.nan}})
# Subgroup labels as before; iloc[1:] skips row 0, leaving Group A's lone
# (non-NaN) value out of the subgroups entirely
g = df['Value'].iloc[1:].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
# True only for subgroups with at least two valid points -- enough for a spline
m = df.groupby(['Group', g])['Value'].transform('count') > 1
# Spline extrapolation where possible...
v1 = (df[m].groupby(['Group', g])['Value']
      .apply(lambda x: x.interpolate(method="spline", order=1)))
# ...and plain index interpolation for the remaining subgroups
v2 = (df.groupby(['Group', g])['Value']
      .apply(lambda x: x.interpolate(method="index")))
# Prefer the spline result, fall back to index interpolation
df['Value'] = df['Value'].fillna(v1).fillna(v2)
df
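One caveat, which is an assumption about your environment rather than part of the answer: on newer pandas versions groupby(...).apply(...) may prepend the group keys to the result's index, which breaks the index alignment that fillna() relies on. If that happens, passing group_keys=False keeps the original index, e.g.:

# Adjusted call for newer pandas -- group_keys=False preserves the row index
v1 = (df[m].groupby(['Group', g], group_keys=False)['Value']
      .apply(lambda x: x.interpolate(method="spline", order=1)))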