python pandas median

Python function to calculate a median without mean in a dataframe


I have a big DataFrame (about 3 GB) and I want to compute a kind of median in a group-by over a few columns. When a group has an even number of values, I don't want the mean of the two central elements; I want the lower of the two. I know how to compute a normal median; here is an example that reproduces my issue:

import numpy as np
import pandas as pd

data = {'idx':   [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]
        }

df = pd.DataFrame(data, columns=['idx', 'value'])
# Standard median: averages the two middle values in even-sized groups
df['median'] = df.groupby(['idx'])['value'].transform(np.median)
print(df)

    idx  value  median
0     1    5.0     8.0
1     1   12.0     8.0
2     1    7.0     8.0
3     1    8.0     8.0
4     1   10.0     8.0
5     2    3.0     5.0
6     2    8.0     5.0
7     2    4.0     5.0
8     2    6.0     5.0
9     2    1.0     5.0
10    2   19.0     5.0
11    3    5.0    10.0
12    3   10.0    10.0
13    3   12.0    10.0
14    4    3.0     5.5
15    4    8.0     5.5
16    5   14.0    14.0

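But as I said, this is not the result I want.

I want the lower of the two middle values for even-sized groups: 4.0 instead of 5.0 for idx 2, and 3.0 instead of 5.5 for idx 4.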

I can do this with the function below, but the performance is very poor:

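def calcul_median(x):
    a = x['value'].values.tolist()
    if len(a) % 2 == 1:
        # Odd count: the ordinary median is already an element of the group
        a = np.median(a)
    elif len(a) == 0:
        a = 0
    else:
        # Even count: sort and take the lower of the two middle values
        a.sort()
        a = a[len(a) // 2 - 1]
    x['median'] = a
    return x

df2 = df.groupby(['idx']).apply(calcul_median)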

This function works, but it is very slow (50 times slower than the plain median).

EDIT

The function statistics.median_low does exactly that, but it is also slow: 3 s with numpy vs 52 s with statistics.
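
For reference, this is roughly how statistics.median_low can be plugged into the same group-by on the sample data. It is only a sketch: the column name median_low is chosen here for illustration, and the timings quoted above come from the real data, not from this snippet.

```python
import statistics

import pandas as pd

data = {'idx':   [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]}
df = pd.DataFrame(data, columns=['idx', 'value'])

# transform calls statistics.median_low once per group (a Python-level call)
# and broadcasts the returned scalar to every row of that group.
df['median_low'] = df.groupby('idx')['value'].transform(statistics.median_low)
```

Because the function runs once per group in pure Python, the cost grows with the number of groups, which would be consistent with the slowdown described.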

I have tried another function using argpartition:

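def calcul_tps_medianv2(x):
    a = x['dureetrajet'].values.tolist()
    if len(a) % 2 == 1:
        a = np.median(a)
    elif len(a) == 0:
        a = 0
    else:
        # k is the position of the lower middle value in sorted order
        k = len(a) // 2 - 1
        # argpartition puts the index of the k-th smallest element at position k
        a = a[np.argpartition(a, k)[k]]
    x['median'] = a
    return x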

But this is even slower than the statistics solution.

Do you have any idea how to speed up this function, or any other approach? Thanks for your help.


Solution

  • The standard library's statistics module contains a median_low() function that does just that.

    Tim Pietzcker
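
Since the question's edit reports that median_low is still slow when applied per group, a possible alternative is pandas' own group-wise quantile with interpolation='lower', which returns the lower of the two middle values for even-sized groups. This is a different technique from median_low, shown here as a sketch on the question's sample data; it has not been benchmarked on the 3 GB frame, and it assumes a single grouping column.

```python
import pandas as pd

data = {'idx':   [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]}
df = pd.DataFrame(data, columns=['idx', 'value'])

# One value per group: the 0.5 quantile, rounded down to an existing element
# instead of interpolating between the two middle values.
low = df.groupby('idx')['value'].quantile(0.5, interpolation='lower')

# Broadcast the per-group result back onto the original rows.
df['median'] = df['idx'].map(low)
```

The aggregation avoids a Python function call per group, so it may scale better than transform or apply with a custom function, but that should be measured on the real data.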