pythonpandasfillnaproportions

Filling null values based on the proportion of the categories in that column


I have the following data

col=['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
dd=pd.DataFrame({'color':col})
dd.replace('nan',np.NaN,inplace=True)
dd.sample(5)
Out[1]: 
    color
8    grey
14    NaN
7     NaN
2   black
9   black

The following is the proportion of each color in the column

dd.color.value_counts(normalize=True)
Out[2]: 
black    0.500000
grey     0.285714
white    0.142857
red      0.071429

Now I want to fill the null values based on these proportions above. So 50% of null values will become black, 28% grey,14% white and 7.1% red


Solution

  • You can randomly assign the value based on the probability using numpy.random.choice (https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)

    STEP #1. Calculate the probabilities for each values.

    STEP #2. Assign the values to NaN based on the probability so that the values are assigned as the original distribution.

    This way will work regardless of the number of values.

    import pandas as pd
    import numpy as np
    
    col = ['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
    dd = pd.DataFrame({'color': col})
    dd.replace('nan', np.NaN, inplace=True)
    print(dd.color.value_counts(normalize=True))
    """
    black    0.500000
    grey     0.285714
    white    0.142857
    red      0.071429
    Name: color, dtype: float64
    """
    
    probability = list(dd.color.value_counts(normalize=True).values)
    dd['color'] = dd['color'].apply(lambda x: np.random.choice(['black', 'grey', 'white', 'red'], 1, replace=False, p=probability)[0] if pd.isna(x) else x)
    
    print(dd.color.value_counts(normalize=True))
    """
    black    0.526316
    grey     0.315789
    white    0.105263
    red      0.052632
    Name: color, dtype: float64
    """