I want to impute missing values using the grouped summary statistics based on a different dataframe. For instance, I would like to impute missing values in numvar_original
in df1
to be like numvar_ideal
where the missing values are based on group-level means from df2
:
df1
catvar numvar_original numvar_ideal
1 10 10
1 NaN 5.5
2 30 30
2 NaN 6.5
df2
catvar numvar_original
1 5
1 6
2 6
2 7
# I tried the following:
df1['numvar'].fillna(df2.groupby('catvar')['numvar'].transform('mean'), inplace=True)
# The missing values weren't replaced
df1['numvar'].fillna(df1.groupby('catvar')['numvar'].transform('mean'), inplace=True)
# I checked that this works in filling up the missing values but I have to use group-level mean values from df1
Is there a way to do this without resorting to a combination of apply
or map
with a dictionary (since I've read that apply/map can be slower with larger datasets)?
Try:
means = df2.groupby("catvar")["numvar_original"].mean().to_dict()
df1 = df1.groupby("catvar", group_keys=False).apply(
lambda x: x.fillna(means[x["catvar"].iloc[0]])
)
print(df1)
Prints:
catvar numvar_original numvar_ideal
0 1 10.0 10.0
1 1 5.5 5.5
2 2 30.0 30.0
3 2 6.5 6.5
OR: Using .merge
:
means = df2.groupby("catvar")["numvar_original"].mean()
df1["numvar_original"] = df1["numvar_original"].fillna(
df1.merge(means, on="catvar")["numvar_original_y"]
)
print(df1)
Prints:
catvar numvar_original numvar_ideal
0 1 10.0 10.0
1 1 5.5 5.5
2 2 30.0 30.0
3 2 6.5 6.5