I have a sample dataset like below. Please note that ['ID', 'Date']
constitute the index.
ID | Date | ValueA | ValueB | ValueC |
---|---|---|---|---|
A | 01-01-2020 | NaN | NaN | NaN |
A | 01-02-2020 | 1 | 5 | 7 |
A | 01-03-2020 | NaN | 6 | 8 |
A | 01-04-2020 | 2 | NaN | NaN |
A | 01-05-2020 | 2 | NaN | 7 |
B | 01-01-2020 | 2 | NaN | 6 |
B | 01-02-2020 | 3 | 5 | 7 |
B | 01-03-2020 | NaN | NaN | NaN |
B | 01-04-2020 | 5 | NaN | 10 |
B | 01-05-2020 | 5 | 8 | NaN |
Below is the desired output
ID | Date | ValueA | ValueB | ValueC |
---|---|---|---|---|
A | 01-01-2020 | NaN | NaN | NaN |
A | 01-02-2020 | 1 | 5 | 7 |
A | 01-03-2020 | 1 | 6 | 8 |
A | 01-04-2020 | 2 | NaN | 8 |
A | 01-05-2020 | 2 | NaN | 7 |
B | 01-01-2020 | 2 | NaN | 6 |
B | 01-02-2020 | 3 | 5 | 7 |
B | 01-03-2020 | 3 | 5 | 7 |
B | 01-04-2020 | 5 | 5 | 10 |
B | 01-05-2020 | 5 | 8 | NaN |
For each ID, I need to forward-fill the NaN values in ValueA/ValueB/ValueC, but only inside each ID's series (leading and trailing NaNs should stay NaN). I've tried:
df_padded = df.interpolate(limit_area='inside')
df_fill = df.fillna(method="ffill")
mask = df_padded.isna()
df_fill[mask] = np.nan
The above code works when there is only a single ID. But with multiple IDs in the dataset, the forward fill bleeds across the ID boundary and the output becomes:
ID | Date | ValueA | ValueB | ValueC |
---|---|---|---|---|
A | 01-01-2020 | NaN | NaN | NaN |
A | 01-02-2020 | 1 | 5 | 7 |
A | 01-03-2020 | 1 | 6 | 8 |
A | 01-04-2020 | 2 | 6 | 8 |
A | 01-05-2020 | 2 | 6 | 7 |
B | 01-01-2020 | 2 | 6 | 6 |
B | 01-02-2020 | 3 | 5 | 7 |
B | 01-03-2020 | 3 | 5 | 7 |
B | 01-04-2020 | 5 | 5 | 10 |
B | 01-05-2020 | 5 | 8 | NaN |
I've also tried applying the function below, but I could not get the result back into the ['ID', 'Date'] MultiIndex so that I could then apply the mask:
def fwdfill(df_monthlies):
    ID_list = df_monthlies['ID'].unique().tolist()
    for ID in ID_list:
        ID_data = df_monthlies[df_monthlies['ID'] == ID]
        ID_data = ID_data.sort_values(by=['Date'])
        ID_data = ID_data.reset_index(drop=True)
Use df.reset_index + df.groupby, and then:
Option 1
groupby.apply + df.interpolate with method="pad" (only linear interpolation is supported on a MultiIndex, hence the need to reset the index first):
out = (
    df.reset_index()
      .groupby('ID', group_keys=False)
      .apply(lambda x: x.interpolate('pad', limit_area='inside'))
      .set_index(['ID', 'Date'])
)
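Side note (not part of the original answer): recent pandas versions deprecate the fill methods of interpolate ('pad'/'ffill'/'bfill'), and from pandas 2.2 onward ffill itself accepts limit_area, which has no MultiIndex restriction. A minimal sketch under that assumption, grouping directly on the index level so no reset_index is needed:
# Requires pandas >= 2.2 (ffill gained the limit_area parameter there).
out_ffill = df.groupby(level='ID').transform(
    lambda s: s.ffill(limit_area='inside')  # pad only between the first and last valid value
)
For this data, out_ffill should match out/out2 above.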
Option 2
The basic operation being:
cols = ['ValueA', 'ValueB', 'ValueC']
(
df.reset_index()
.groupby('ID')[cols]
.transform(lambda x: x.interpolate('pad', limit_area='inside'))
)
You can assign that back in a one-liner:
out2 = (
    df.reset_index()
      .pipe(lambda d: d.assign(
          **d.groupby('ID')[cols].transform(
              lambda s: s.interpolate('pad', limit_area='inside')
          )
      ))
      .set_index(['ID', 'Date'])
)
out2.equals(out)
# True
Output:
ValueA ValueB ValueC
ID Date
A 01-01-2020 NaN NaN NaN
01-02-2020 1.0 5.0 7.0
01-03-2020 1.0 6.0 8.0
01-04-2020 2.0 NaN 8.0
01-05-2020 2.0 NaN 7.0
B 01-01-2020 2.0 NaN 6.0
01-02-2020 3.0 5.0 7.0
01-03-2020 3.0 5.0 7.0
01-04-2020 5.0 5.0 10.0
01-05-2020 5.0 8.0 NaN
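For completeness, the masking idea from the question also works once it is applied per ID. A minimal sketch (not part of the original answer; fill_inside and out3 are just illustrative names), operating directly on the MultiIndexed df from the "Data used" section below:
def fill_inside(g):
    # linear interpolation (the only method pandas allows on a MultiIndex) is used
    # purely to find each column's "inside" region
    inside = g.interpolate(limit_area='inside').notna()
    # forward fill within the group, then blank out everything outside that region
    return g.ffill().where(inside)

out3 = df.groupby(level='ID', group_keys=False).apply(fill_inside)
# out3.equals(out)  ->  True for this data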
Data used
import pandas as pd
import numpy as np
data = {'ID': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B',
7: 'B', 8: 'B', 9: 'B'},
'Date': {0: '01-01-2020', 1: '01-02-2020', 2: '01-03-2020', 3: '01-04-2020',
4: '01-05-2020', 5: '01-01-2020', 6: '01-02-2020', 7: '01-03-2020',
8: '01-04-2020', 9: '01-05-2020'},
'ValueA': {0: np.nan, 1: 1.0, 2: np.nan, 3: 2.0, 4: 2.0, 5: 2.0, 6: 3.0,
7: np.nan, 8: 5.0, 9: 5.0},
'ValueB': {0: np.nan, 1: 5.0, 2: 6.0, 3: np.nan, 4: np.nan, 5: np.nan,
6: 5.0, 7: np.nan, 8: np.nan, 9: 8.0},
'ValueC': {0: np.nan, 1: 7.0, 2: 8.0, 3: np.nan, 4: 7.0, 5: 6.0, 6: 7.0,
7: np.nan, 8: 10.0, 9: np.nan}}
df = pd.DataFrame(data).set_index(['ID', 'Date'])
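One last note, an assumption on my part rather than something stated in the question: if the Date level is meant to hold real dates (the name df_monthlies suggests monthly, i.e. day-month-year data), you may want to parse it so the index sorts chronologically rather than as strings:
# Assumed format '%d-%m-%Y'; adjust to '%m-%d-%Y' if the dates are month-day-year.
df.index = df.index.set_levels(
    pd.to_datetime(df.index.levels[1], format='%d-%m-%Y'),
    level='Date',
)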