python pandas dataframe nan

Replace NaN in a DataFrame with random values in a range


I have a DataFrame in Python with NaN values, like this:

import pandas as pd
details = {
    'info1' : [10,None,None,None,None,None,15,None,None,None,5],
    'info2' : [15,None,None,None,10,None,None,None,None,None,20],
}
df = pd.DataFrame(details)
print(df)
    info1  info2
0    10.0   15.0
1     NaN    NaN
2     NaN    NaN
3     NaN    NaN
4     NaN   10.0
5     NaN    NaN
6    15.0    NaN
7     NaN    NaN
8     NaN    NaN
9     NaN    NaN
10    5.0   20.0

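How can I replace each NaN with a random number (e.g., drawn from a uniform distribution) within a specific range, where the range is determined by the nearest rows above and below that have values? For example, rows 1 to 5 of info1 should get random values between 10 and 15.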


Solution

  • For a vectorized solution, call np.random.uniform directly, using the forward-filled (ffill) and backward-filled (bfill) frames as the boundaries:

    import numpy as np
    
    df[:] = np.random.uniform(df.ffill(), df.bfill())
    

    Output (with np.random.seed(0)):

            info1      info2
    0   10.000000  15.000000
    1   13.013817  12.275584
    2   12.118274  11.770529
    3   12.187936  10.541135
    4   14.818314  10.000000
    5   13.958625  15.288949
    6   15.000000  19.255966
    7   14.289639  10.871293
    8   14.797816  18.326198
    9    7.218432  18.700121
    10   5.000000  20.000000
    
    Note

    As pointed out by @wjandrea, the behavior of np.random.uniform is not officially supported when low > high (*). If you want a robust solution, use my original approach, which builds an intermediate array and sorts it so that the lower bound always comes first:

    import numpy as np
    
    # ensure the low boundary is before the high
    tmp = np.sort(np.dstack([df.ffill(),
                             df.bfill()]),
                  axis=2)
    
    # generate random numbers between low and high
    df[:] = np.random.uniform(tmp[..., 0], tmp[..., 1])
    

    (*) From the NumPy documentation of np.random.uniform:

    If high < low, the results are officially undefined and may eventually raise an error, i.e. do not rely on this function to behave when passed arguments satisfying that inequality condition.