pythonpandasdataframeoutliers

Removing outliers using IQR don't work, dataframe shape doesn't change


I have a dataset where I have weights and heights and I want to remove all outliers in two columns using the same code, like this (my variables are float640:

for c in df.columns[0:2]:
    Q1= np.percentile(df[c],25)
    Q3= np.percentile(df[c],75)
    IQR= Q3-Q1
    upper= Q3+ 1.5*IQR
    lower= Q1-1.5*IQR    
    upper_array=np.where(df[c]>=upper)[0]
    lower_array=np.where(df[c]<=lower)[0]
    df[c].drop(index=upper_array,inplace=True)
    df[c].drop(index=lower_array, inplace=True)

The initial data shape was (3000,3) but after running the code, the shape does not change. However in fact, outlier exists in my data enter image description here enter image description here

I tried to modify the drop method (suggesthed by an AI):

    outlier_indices= np.concatenate((upper_array,lower_array))
    df[c].drop(index=outlier_indices, inplace=True)

and

    df= df[~upper_array]
    df= df[~lower_array]

But to no avail.

Is there anything wrong with my code?


Solution

  • The issue is that you're removing the indices from a copy of the column. You should remove them from the full dataframe:

    df.drop(index=upper_array,inplace=True)
    

    Be aware that this would also remove rows in which only one of the two columns has an outlier.

    Also, since this process is iterative, you don't consider the original population for the second column, but only a subset. I've already shown in this answer that outliers removal is sensitive to iteration.

    A better approach in my opinion should be to use a function to flag the outliers:

    def iqr_outliers_flag(s):
        q1, q3 = np.quantile(s, [0.25, 0.75])
        iqr = q3 - q1
        upper_bound = q1 + 1.5 * iqr
        lower_bound = q3 - 1.5 * iqr
        
        return (s<lower_bound) | (s>upper_bound)
    

    Then you can remove rows in which:

    all columns have an outlier:

    df[~df.apply(iqr_outliers_flag).all(axis=1)].shape
    # final shape (2909, 2)
    

    any column has an outlier:

    df[~df.apply(iqr_outliers_flag).any(axis=1)].shape
    # final shape (2011, 2)
    

    Or leave the shape unchanged and mask the outliers with NaNs:

    df2 = df.mask(df.apply(iqr_outliers_flag))
    
                 0         1
    0          NaN  0.400157
    1     0.978738       NaN
    2          NaN -0.977278
    3     0.950088 -0.151357
    4    -0.103219  0.410599
    ...        ...       ...
    2995  0.682201 -0.011214
    2996       NaN -0.831492
    2997 -0.407972       NaN
    2998  0.352599 -0.537885
    2999  0.393444  0.286518
    
    [3000 rows x 2 columns]
    

    Note that even if you mask/remove outliers, your boxplot might show values outside of the 1.5 IQR range, which is normal because the filtered population now has different statistics and thus potentially new outliers relative to the new IQR value, as explained here.

    Reproducible input:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    df = pd.DataFrame(np.random.normal(size=(3000, 2)))
    

    Original data:

    boxplot without filtering

    Filtering rows with all outliers:

    boxplot filtering all

    Filtering rows with any outlier:

    boxplot filtering any