I would like to use pandas groupby to flag values in a df that are outliers. I think I've got it working, but as I'm new to python, wanted to ask if there is a more obvious / pythonic approach.
Given input data with two groups, two variables X and Y:
n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
,"x" : np.hstack((
np.random.normal(10, 1.0, size=n)
,np.random.normal(100, 1.0, size=n)
))
,"y" : np.hstack((
np.random.normal(20, 1.0, size=n)
,np.random.normal(200, 1.0, size=n)
))
})
To identify outliers I need to calculate the quartiles and inter-quartile range for each group to calculate the limits. Seemed reasonable to create a function:
def get_outlier(x,tukeymultiplier=2):
Q1=x.quantile(.25)
Q3=x.quantile(.75)
IQR=Q3-Q1
lowerlimit = Q1 - tukeymultiplier*IQR
upperlimit = Q3 + tukeymultiplier*IQR
return (x<lowerlimit) | (x>upperlimit)
And then use groupby and call the function via transform, e.g.:
g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.transform(get_outlier).x
df['y_outlierflag']=g.transform(get_outlier).y
df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]
I'm not worried about performance at this point, because the data are small. But not sure if there is a more natural way to do this? For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?
Here is the alternative more efficient and concise approach with groupby
and transform
c = ['x', 'y']
Q1 = df.groupby('key')[c].transform('quantile', 0.25)
Q3 = df.groupby('key')[c].transform('quantile', 0.75)
IQR = Q3 - Q1
lower = df[c].lt(Q1 - IQR * tukeymultiplier)
upper = df[c].gt(Q3 + IQR * tukeymultiplier)
df.join((lower | upper).add_suffix('_outlier_flag'))
key x y x_outlier_flag y_outlier_flag
0 a 37.179263 124.097101 False True
1 a -20.823970 -22.472672 False False
2 a 123.438609 -121.588755 True True
3 a -53.817197 31.569917 False False
4 a -83.406668 48.850183 True False
5 a 125.069615 -119.059600 True True
6 a -177.185957 -39.235412 True False
7 a -57.935018 -36.746727 False False
8 a 95.881448 -27.108608 True False
9 a -15.649324 86.914889 False True
....