It is necessary to replace the empty Nan values (the 'area' column) with the median value from other areas (equal to or less than the 'total' value). For example: Line 2 has the value 'total'==8. Select the table with the values 'total'==8, look for the median for 'area'.median() and write the value (if any). If there is no value, then reduce the 'total' by 1 and search further. And in line 6 there is the value 'total'==59. Therefore, let's take the median of 'total'==56 and the value of 'area'=34.
Here's what the data should look like: result
import pandas as pd
import numpy as np
df = pd.DataFrame({'total': [5, 8, 8, 8, 20, 56, 59], \
'area': [40, 51, 53, np.nan, np.nan, 34, np.nan]})
df
# total area
0 5 40.0
1 8 51.0
2 8 53.0
3 8 NaN
4 20 NaN
5 56 34.0
6 59 NaN
result = pd.DataFrame({'total': [5, 8, 8, 8, 20, 56, 59], 'area': [40, 51, 53, 52, 52, 34, 34]})
result
# total area
0 5 40
1 8 51
2 8 53
3 8 52
4 20 52
5 56 34
6 59 34
I have created a function, but it does not produce the desired result:
def find_area(total_num, x=1):
while x > 0:
y = df.query('total == @total_num')['area'].sum()
if y > 0:
return df.query('total == @total_num')['area'].median()
x=0
break
else:
total_num -= 1
df['area'] = df['area'].fillna(find_area)
df
First compute the median per group (with groupby.median
), get rid of the NaNs with dropna
, and perform a merge_asof
to map the missing values. Use the output to fill the NaN by boolean indexing:
tmp = df.groupby('total')['area'].median().dropna()
df.loc[df['area'].isna(), 'area'] = pd.merge_asof(
df['total'].reset_index(), tmp, on='total'
).set_index('index')['area']
Output:
total area
0 5 40.0
1 8 51.0
2 8 53.0
3 8 52.0
4 20 52.0
5 56 34.0
6 59 34.0