pythonpandasgroup-byquantile

How to flag tukey outliers using python pandas groupby


I would like to use pandas groupby to flag values in a df that are outliers. I think I've got it working, but as I'm new to python, wanted to ask if there is a more obvious / pythonic approach.

Given input data with two groups, two variables X and Y:

n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
                 ,"x"  : np.hstack((
                  np.random.normal(10, 1.0, size=n)
                 ,np.random.normal(100, 1.0, size=n)
                  ))
                 ,"y"  : np.hstack((
                  np.random.normal(20, 1.0, size=n)
                 ,np.random.normal(200, 1.0, size=n)
                  ))              
                 })

To identify outliers I need to calculate the quartiles and inter-quartile range for each group to calculate the limits. Seemed reasonable to create a function:

def get_outlier(x,tukeymultiplier=2):
    Q1=x.quantile(.25)
    Q3=x.quantile(.75)
    IQR=Q3-Q1
    lowerlimit = Q1 - tukeymultiplier*IQR
    upperlimit = Q3 + tukeymultiplier*IQR
    return (x<lowerlimit) | (x>upperlimit)

And then use groupby and call the function via transform, e.g.:

g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.transform(get_outlier).x
df['y_outlierflag']=g.transform(get_outlier).y

df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]

I'm not worried about performance at this point, because the data are small. But not sure if there is a more natural way to do this? For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?


Solution

  • Here is the alternative more efficient and concise approach with groupby and transform

    c = ['x', 'y']
    Q1 = df.groupby('key')[c].transform('quantile', 0.25)
    Q3 = df.groupby('key')[c].transform('quantile', 0.75)
    IQR = Q3 - Q1
        
    lower = df[c].lt(Q1 - IQR * tukeymultiplier)
    upper = df[c].gt(Q3 + IQR * tukeymultiplier)
    
    df.join((lower | upper).add_suffix('_outlier_flag'))
    

      key           x           y  x_outlier_flag  y_outlier_flag
    0   a   37.179263  124.097101           False            True
    1   a  -20.823970  -22.472672           False           False
    2   a  123.438609 -121.588755            True            True
    3   a  -53.817197   31.569917           False           False
    4   a  -83.406668   48.850183            True           False
    5   a  125.069615 -119.059600            True            True
    6   a -177.185957  -39.235412            True           False
    7   a  -57.935018  -36.746727           False           False
    8   a   95.881448  -27.108608            True           False
    9   a  -15.649324   86.914889           False            True
    ....