I have a large DataFrame (about 3 GB) and I want to compute a kind of median within a group-by on a few columns. When a group has an even number of values, I don't want the mean of the two central elements; I want the lower of the two. I know how to compute a normal median. Here is an example that reproduces my issue:
import numpy as np
import pandas as pd

data = {'idx': [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]}
df = pd.DataFrame(data, columns=['idx', 'value'])
# Standard median per group, broadcast back to every row
df['median'] = df.groupby(['idx'])['value'].transform(np.median)
print(df)
    idx  value  median
0     1    5.0     8.0
1     1   12.0     8.0
2     1    7.0     8.0
3     1    8.0     8.0
4     1   10.0     8.0
5     2    3.0     5.0
6     2    8.0     5.0
7     2    4.0     5.0
8     2    6.0     5.0
9     2    1.0     5.0
10    2   19.0     5.0
11    3    5.0    10.0
12    3   10.0    10.0
13    3   12.0    10.0
14    4    3.0     5.5
15    4    8.0     5.5
16    5   14.0    14.0
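But, as I said, this is not the result I want.
I want the lower of the two middle values for groups with an even count: 4.0 instead of 5.0 for idx 2, and 3.0 instead of 5.5 for idx 4.
I can do this with the function below, but the performance is very poor: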
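def calcul_median(x):
    a = x['value'].values.tolist()  # the example column is named 'value'
    if len(a) % 2 == 1:
        # Odd count: the ordinary median is already an element of the group
        a = np.median(a)
    elif len(a) == 0:
        a = 0
    else:
        # Even count: take the lower of the two middle values
        a.sort()
        a = a[len(a) // 2 - 1]
    x['median'] = a
    return x

df2 = df.groupby(['idx']).apply(calcul_median)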
This function works, but it is very slow (about 50 times slower than the regular median).
EDIT
The function statistics.median_low does this, but it is also slow: 3 s with numpy versus 52 s with statistics.
I have also tried another function using argpartition:
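def calcul_tps_medianv2(x):
    a = x['dureetrajet'].values.tolist()
    if len(a) % 2 == 1:
        a = np.median(a)
    elif len(a) == 0:
        a = 0
    else:
        # Position of the lower middle element in sorted order
        k = len(a) // 2 - 1
        # argpartition places the index of the k-th smallest value at position k
        a = a[np.argpartition(a, k)[k]]
    x['median'] = a
    return x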
But that is even slower than the statistics solution.
Do you have any idea how to speed up this function, or any other approach? Thanks for your help.
The standard library's statistics module contains a median_low() function that does just that.
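As a minimal sketch of what median_low() returns, here it is applied to the groups from the question's example. The `groups` dictionary is written out by hand for illustration and is not part of the original code.

```python
from statistics import median_low

# Groups copied from the question's example data, keyed by idx
# (hand-written here for illustration only)
groups = {
    1: [5, 12, 7, 8, 10],
    2: [3, 8, 4, 6, 1, 19],
    3: [5, 10, 12],
    4: [3, 8],
    5: [14],
}

# median_low returns the middle value for an odd count and the
# smaller of the two middle values for an even count
low_medians = {idx: median_low(values) for idx, values in groups.items()}
print(low_medians)  # {1: 8, 2: 4, 3: 10, 4: 3, 5: 14}
```

Groups 2 and 4 have an even count, so they give 4 and 3 rather than the averaged 5.0 and 5.5. The same function can be passed to a pandas group-by, for example `df.groupby('idx')['value'].transform(median_low)`, although the question reports that the statistics route is slow on large data.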
Tim Pietzcker
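Since the question reports that median_low() is slow on large data, one alternative that may be worth trying is the pandas group-by quantile with `interpolation='lower'`. This is only a sketch on the example data: it has not been benchmarked against the 3 GB DataFrame, and the variable name `low` is chosen here for illustration.

```python
import pandas as pd

data = {'idx': [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]}
df = pd.DataFrame(data)

# Per-group 0.5 quantile; interpolation='lower' picks the lower of the
# two neighbouring values instead of averaging them
low = df.groupby('idx')['value'].quantile(0.5, interpolation='lower')

# Broadcast the per-group result back onto every row
df['median'] = df['idx'].map(low)
print(df)
```

For an odd group size the 0.5 quantile falls exactly on the middle element, so the result matches the ordinary median; for an even size it falls between the two middle elements and `'lower'` selects the smaller one. Because the quantile is computed by pandas itself rather than by a Python function called once per group, it may avoid much of the overhead of `apply`.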