pythonpandasdata-cleaningdata-presentation

How to generate random categorical data from existing ones to fill missing values - Python


I have a column with missing categorical data and I am trying to replace them by existing categorical variables from the same column.

I do not want to use the mode because I have too many missing data, it will skew the data and I would rather not drop the rows with missing data.

I think the ideal way would be to get the proportion of each variables for my column and then replace the missing proportionally by the existing categorical variables.

Example dataframe:

   ClientId    Apple_cat    Region    Price
0  21          cat_1        Reg_A     5
1  15          cat_2        Nan       6
2  6           Nan          Reg_B     7
3  91          cat_3        Reg_A     3
4  45          Nan          Reg_C     7
5  89          cat_2        Nan       6

Note: Ideally, I'd like to avoid hardcoding each category and region name.


Solution

  • You can roll your own function for a neat and vectorized method to solving this:

    def na_randomfill(series):
        na_mask = pd.isnull(series)   # boolean mask for null values
        n_null = na_mask.sum()        # number of nulls in the Series
        
        if n_null == 0:
            return series             # if there are no nulls, no need to resample
        
        # Randomly sample the non-null values from our series
        #  only sample this Series as many times as we have nulls 
        fill_values = series[~na_mask].sample(n=n_null, replace=True, random_state=0)
    
        # This ensures our new values will replace NaNs in the correct locations
        fill_values.index = series.index[na_mask]
        
        return series.fillna(fill_values) 
    

    This solution works on 1 Series at time and can be called like so:

    out = na_randomfill(df["Apple_cat"])
    
    print(out)
    0    cat_1
    1    cat_2
    2    cat_3
    3    cat_3
    4    cat_2
    5    cat_2
    Name: Apple_cat, dtype: object
    

    Alternatively you can use apply to call it on each of your columns. Note that because of the if statement in our function, we do not need to specify null-containing columns in advance before calling apply:

    out = df.apply(na_randomfill)
    
    print(out)
       ClientId Apple_cat Region  Price
    0        21     cat_1  Reg_A      5
    1        15     cat_2  Reg_A      6
    2         6     cat_3  Reg_B      7
    3        91     cat_3  Reg_A      3
    4        45     cat_2  Reg_C      7
    5        89     cat_2  Reg_C      6