pythonpandasnumpy

Getting strange output when using group by apply with np.select function


I am working with a Timeseries data wherein I am trying to perform outlier detection using IQR method.

Sample Data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'datecol' : pd.date_range('2024-1-1', '2024-12-31'),
                   'val' : np.random.random.randin(low = 100, high = 5000, size = 8366})

my function:

def is_outlier(x):
    iqr = x.quantile(.75) - x.quantile(.25)
    outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
    return np.select([outlier], [1], 0)

df.groupby(df['datecol'].dt.weekday)['val'].apply(is_outlier)

to which the output is something like below:

0    [1,1,0,0,....
1    [1,0,0,0,....
2    [1,1,0,0,....
3    [1,0,1,0,....
4    [1,1,0,0,....
5    [1,1,0,0,....
6    [1,0,0,1,....

I am expecting a single series as output which I can add back to the original dataframe as a flag column.

Can someone please help me with this


Solution

  • You should use groupby.transform, not apply:

    df['flag'] = df.groupby(df['datecol'].dt.weekday)['val'].transform(is_outlier)
    

    Alternatively, explicitly return a Series and use group_keys=False:

    def is_outlier(x):
        iqr = x.quantile(.75) - x.quantile(.25)
        outlier = (x <= x.quantile(.25) - 1.5*iqr) | (x >= x.quantile(.75) + 1.5*iqr)
        return pd.Series(np.where(outlier, 1, 0), index=x.index)
    
    df['flag'] = (df.groupby(df['datecol'].dt.weekday, group_keys=False)
                  ['val'].apply(is_outlier)
                  )
    

    Note that with a single condition, np.where should be preferred to np.select.

    You could also use a vectorial approach with groupby.quantile:

    wd = df['datecol'].dt.weekday
    g = df.groupby(wd)['val']
    q25 = g.quantile(.25)
    q75 = g.quantile(.75)
    iqr = wd.map(q75-q25)
    df['flag'] = 1 - df['val'].between(wd.map(q25) - 1.5*iqr, wd.map(q75) + 1.5*iqr)
    

    Output:

           datecol   val  flag
    0   2024-01-01  3193     0
    1   2024-01-02  1044     0
    2   2024-01-03  2963     0
    3   2024-01-04  4448     0
    4   2024-01-05  1286     0
    ..         ...   ...   ...
    361 2024-12-27  1531     0
    362 2024-12-28  4565     0
    363 2024-12-29  3396     0
    364 2024-12-30  1870     0
    365 2024-12-31  3818     0