I have the following data
col=['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
dd=pd.DataFrame({'color':col})
dd.replace('nan',np.NaN,inplace=True)
dd.sample(5)
Out[1]:
color
8 grey
14 NaN
7 NaN
2 black
9 black
The following is the proportion of each color in the column
dd.color.value_counts(normalize=True)
Out[2]:
black 0.500000
grey 0.285714
white 0.142857
red 0.071429
Now I want to fill the null values based on these proportions above. So 50% of null values will become black, 28% grey,14% white and 7.1% red
You can randomly assign the value based on the probability using numpy.random.choice
(https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)
STEP #1. Calculate the probabilities for each values.
STEP #2. Assign the values to NaN
based on the probability so that the values are assigned as the original distribution.
This way will work regardless of the number of values.
import pandas as pd
import numpy as np
col = ['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
dd = pd.DataFrame({'color': col})
dd.replace('nan', np.NaN, inplace=True)
print(dd.color.value_counts(normalize=True))
"""
black 0.500000
grey 0.285714
white 0.142857
red 0.071429
Name: color, dtype: float64
"""
probability = list(dd.color.value_counts(normalize=True).values)
dd['color'] = dd['color'].apply(lambda x: np.random.choice(['black', 'grey', 'white', 'red'], 1, replace=False, p=probability)[0] if pd.isna(x) else x)
print(dd.color.value_counts(normalize=True))
"""
black 0.526316
grey 0.315789
white 0.105263
red 0.052632
Name: color, dtype: float64
"""