python-3.xpandasdataframeiqr

How to Remove outlier from DataFrame using IQR?


I Have Dataframe with a lot of columns (Around 100 feature), I want to apply the interquartile method and wanted to remove the outlier from the data frame.

I am using this link stackOverflow

But the problem is nan of the above method is working correctly,

As I am trying like this

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()

it is giving me this

((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]: 
Day                      0
Col1                     0
Col2                     0
col3                     0
Col4                     0
Step_Count            1179
dtype: int64

I just wanted to know that, What I will do next so that all the outlier from the data frame will be removed.

if i am using this

def remove_outlier(df_in, col_name):
q1 = df_in[col_name].quantile(0.25)
q3 = df_in[col_name].quantile(0.75)
iqr = q3-q1 #Interquartile range
fence_low  = q1-1.5*iqr
fence_high = q3+1.5*iqr
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
return df_out

re_dat = remove_outlier(stepframe, stepframe.columns)

I am getting this error

ValueError: Cannot index with multidimensional key

in this line

    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

Solution

  • You can use:

    np.random.seed(33454)
    stepframe = pd.DataFrame({'a': np.random.randint(1, 200, 20), 
                              'b': np.random.randint(1, 200, 20),
                              'c': np.random.randint(1, 200, 20)})
    
    stepframe[stepframe > 150] *= 10
    print (stepframe)
    
    Q1 = stepframe.quantile(0.25)
    Q3 = stepframe.quantile(0.75)
    IQR = Q3 - Q1
    
    df = stepframe[~((stepframe < (Q1 - 1.5 * IQR)) |(stepframe > (Q3 + 1.5 * IQR))).any(axis=1)]
    
    print (df)
          a    b     c
    1   109   50   124
    3   137   60  1990
    4    19  138   100
    5    86   83   143
    6    55   23    58
    7    78  145    18
    8   132   39    65
    9    37  146  1970
    13   67  148  1880
    15  124  102    21
    16   93   61    56
    17   84   21    25
    19   34   52   126
    

    Details:

    First create boolean DataFrame with chain by |:

    print (((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))))
            a      b      c
    0   False   True  False
    1   False  False  False
    2    True  False  False
    3   False  False  False
    4   False  False  False
    5   False  False  False
    6   False  False  False
    7   False  False  False
    8   False  False  False
    9   False  False  False
    10   True  False  False
    11  False   True  False
    12  False   True  False
    13  False  False  False
    14  False   True  False
    15  False  False  False
    16  False  False  False
    17  False  False  False
    18  False   True  False
    19  False  False  False
    

    And then use DataFrame.any for check at least one True per row and last invert boolean mask by ~:

    print (~((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).any(axis=1))
    0     False
    1      True
    2     False
    3      True
    4      True
    5      True
    6      True
    7      True
    8      True
    9      True
    10    False
    11    False
    12    False
    13     True
    14    False
    15     True
    16     True
    17     True
    18    False
    19     True
    dtype: bool
    

    invert solution with changed conditions - < to >= and > to <=, chain by & for AND and last filter by all for check all Trues per rows

    print (((stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1))
    0     False
    1      True
    2     False
    3      True
    4      True
    5      True
    6      True
    7      True
    8      True
    9      True
    10    False
    11    False
    12    False
    13     True
    14    False
    15     True
    16     True
    17     True
    18    False
    19     True
    dtype: bool
    
    
    df = stepframe[((stepframe >= (Q1 - 1.5 * IQR))& (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1)]