I have a DataFrame in Python with NaNs, like this:
import pandas as pd
details = {
'info1' : [10,None,None,None,None,None,15,None,None,None,5],
'info2' : [15,None,None,None,10,None,None,None,None,None,20],
}
df = pd.DataFrame(details)
print(df)
| | info1 | info2 |
|---|---|---|
| 0 | 10 | 15 |
| 1 | nan | nan |
| 2 | nan | nan |
| 3 | nan | nan |
| 4 | nan | 10 |
| 5 | nan | nan |
| 6 | 15 | nan |
| 7 | nan | nan |
| 8 | nan | nan |
| 9 | nan | nan |
| 10 | 5 | 20 |
How can I replace the NaNs with random numbers (e.g., drawn from a uniform distribution) within a specific range, based on the surrounding rows that have values?
For a vectorized solution, call np.random.uniform directly with the ffill/bfill results as boundaries:
import numpy as np
df[:] = np.random.uniform(df.ffill(), df.bfill())
Output (with np.random.seed(0)):
info1 info2
0 10.000000 15.000000
1 13.013817 12.275584
2 12.118274 11.770529
3 12.187936 10.541135
4 14.818314 10.000000
5 13.958625 15.288949
6 15.000000 19.255966
7 14.289639 10.871293
8 14.797816 18.326198
9 7.218432 18.700121
10 5.000000 20.000000
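To see why ffill/bfill work as boundaries, here is a minimal sketch on a shortened, made-up column (the values and the name `bounds` are illustrative only, not taken from the output above). For every row, the forward fill carries the last known value down and the backward fill carries the next known value up, so each NaN ends up bracketed by its two nearest known neighbours:

```python
import pandas as pd

# shortened, illustrative column with gaps
s = pd.Series([10, None, None, 15, None, 5], name="info1")

bounds = pd.DataFrame({
    "value": s,
    "ffill": s.ffill(),  # last known value at or before each row
    "bfill": s.bfill(),  # next known value at or after each row
})
print(bounds)
```

Rows that already hold a value get identical bounds, so they stay unchanged. Note that at row 4 the forward-filled bound (15) is larger than the backward-filled one (5), which is the low > high situation.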
As pointed out by @wjandrea, the behavior of uniform is not officially supported when low > high (*). If you want a robust solution, use my original approach with an intermediate array and sort:
import numpy as np
# ensure the low boundary is before the high
tmp = np.sort(np.dstack([df.ffill(), df.bfill()]), axis=2)
# generate random numbers between low and high
df[:] = np.random.uniform(tmp[..., 0], tmp[..., 1])
(*) If high < low, the results are officially undefined and may eventually raise an error, i.e. do not rely on this function to behave when passed arguments satisfying that inequality condition.
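As a possible variation on the robust approach, the sketch below wraps the idea in a helper. The name `fill_between_neighbours` and its `seed` parameter are hypothetical, not part of pandas or NumPy. It swaps the dstack/sort step for np.minimum/np.maximum, uses np.random.default_rng instead of the global seed, and only draws numbers for gaps that have a known value on both sides. It assumes the columns are numeric, and leading or trailing NaNs are deliberately left as they are.

```python
import numpy as np
import pandas as pd


def fill_between_neighbours(df, seed=None):
    """Hypothetical helper: fill NaNs with uniform draws between the
    nearest known values before and after each gap."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(dtype=float, copy=True)  # work on a copy
    prev_known = df.ffill().to_numpy(dtype=float)
    next_known = df.bfill().to_numpy(dtype=float)
    # order the bounds so that low <= high everywhere
    low = np.minimum(prev_known, next_known)
    high = np.maximum(prev_known, next_known)
    # only fill gaps that are bracketed by known values on both sides
    fillable = np.isnan(values) & ~np.isnan(low) & ~np.isnan(high)
    values[fillable] = rng.uniform(low[fillable], high[fillable])
    return pd.DataFrame(values, index=df.index, columns=df.columns)


details = {
    'info1': [10, None, None, None, None, None, 15, None, None, None, 5],
    'info2': [15, None, None, None, 10, None, None, None, None, None, 20],
}
df = pd.DataFrame(details)
filled = fill_between_neighbours(df, seed=0)
print(filled)
```

The helper returns a new DataFrame rather than assigning into df[:], so the original data stays available for comparison.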