I am working with a Timeseries data wherein I am trying to perform outlier detection using IQR method.
Sample Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'datecol' : pd.date_range('2024-1-1', '2024-12-31'),
'val' : np.random.random.randin(low = 100, high = 5000, size = 8366})
my function:
def is_outlier(x):
iqr = x.quantile(.75) - x.quantile(.25)
outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
return np.select([outlier], [1], 0)
df.groupby(df['datecol'].dt.weekday)['val'].apply(is_outlier)
to which the output is something like below:
0 [1,1,0,0,....
1 [1,0,0,0,....
2 [1,1,0,0,....
3 [1,0,1,0,....
4 [1,1,0,0,....
5 [1,1,0,0,....
6 [1,0,0,1,....
I am expecting a single series as output which I can add back to the original dataframe
as a flag column.
Can someone please help me with this
You should use groupby.transform
, not apply
:
df['flag'] = df.groupby(df['datecol'].dt.weekday)['val'].transform(is_outlier)
Alternatively, explicitly return a Series and use group_keys=False
:
def is_outlier(x):
iqr = x.quantile(.75) - x.quantile(.25)
outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
return pd.Series(np.where(outlier, 1, 0), index=x.index)
df['flag'] = (df.groupby(df['datecol'].dt.weekday, group_keys=False)
['val'].apply(is_outlier)
)
Note that with a single condition, np.where
should be preferred to np.select
.
You could also use a vectorial approach with groupby.quantile
:
wd = df['datecol'].dt.weekday
g = df.groupby(wd)['val']
q25 = g.quantile(.25)
q75 = g.quantile(.75)
iqr = wd.map(q75-q25)
df['flag'] = 1 - df['val'].between(wd.map(q25) - 1.5*iqr, wd.map(q75) + 1.5*iqr)
Output:
datecol val flag
0 2024-01-01 3193 0
1 2024-01-02 1044 0
2 2024-01-03 2963 0
3 2024-01-04 4448 0
4 2024-01-05 1286 0
.. ... ... ...
361 2024-12-27 1531 0
362 2024-12-28 4565 0
363 2024-12-29 3396 0
364 2024-12-30 1870 0
365 2024-12-31 3818 0