I have a column with missing categorical data and I am trying to replace them by existing categorical variables from the same column.
I do not want to use the mode because I have too many missing data, it will skew the data and I would rather not drop the rows with missing data.
I think the ideal way would be to get the proportion of each variables for my column and then replace the missing proportionally by the existing categorical variables.
Example dataframe:
ClientId Apple_cat Region Price
0 21 cat_1 Reg_A 5
1 15 cat_2 Nan 6
2 6 Nan Reg_B 7
3 91 cat_3 Reg_A 3
4 45 Nan Reg_C 7
5 89 cat_2 Nan 6
Note: Ideally, I'd like to avoid hardcoding each category and region name.
You can roll your own function for a neat and vectorized method to solving this:
def na_randomfill(series):
na_mask = pd.isnull(series) # boolean mask for null values
n_null = na_mask.sum() # number of nulls in the Series
if n_null == 0:
return series # if there are no nulls, no need to resample
# Randomly sample the non-null values from our series
# only sample this Series as many times as we have nulls
fill_values = series[~na_mask].sample(n=n_null, replace=True, random_state=0)
# This ensures our new values will replace NaNs in the correct locations
fill_values.index = series.index[na_mask]
return series.fillna(fill_values)
This solution works on 1 Series at time and can be called like so:
out = na_randomfill(df["Apple_cat"])
print(out)
0 cat_1
1 cat_2
2 cat_3
3 cat_3
4 cat_2
5 cat_2
Name: Apple_cat, dtype: object
Alternatively you can use apply to call it on each of your columns. Note that because of the if
statement in our function, we do not need to specify null-containing columns in advance before calling apply
:
out = df.apply(na_randomfill)
print(out)
ClientId Apple_cat Region Price
0 21 cat_1 Reg_A 5
1 15 cat_2 Reg_A 6
2 6 cat_3 Reg_B 7
3 91 cat_3 Reg_A 3
4 45 cat_2 Reg_C 7
5 89 cat_2 Reg_C 6