python dataframe outliers

How to remove y-outliers from x-y Scatter plot in Python?


I am plotting a DataFrame, df, with columns x and y as a scatter plot. For many x values, the corresponding y values are widely scattered. I want to remove the y outliers separately for each x value. This is different from bulk outlier removal, where a single IQR is computed over the whole y column.

Can anyone help with this?

I was not able to find any ready-made code for this. The examples I found remove outliers in bulk, not separately for each x.


Solution

  • Group the DataFrame by the x column, then apply a function that removes outliers from each group using the 1.5 × IQR rule:

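    def remove_outliers(group, column='y'):
        # Quartiles of y within this single x group
        Q1 = group[column].quantile(0.25)
        Q3 = group[column].quantile(0.75)
        IQR = Q3 - Q1
        # Standard 1.5 * IQR fences
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Keep only the rows whose y lies inside the fences
        return group[(group[column] >= lower_bound) & (group[column] <= upper_bound)]

    # Filter each x group separately, then rebuild a flat 0..n-1 index
    df = df.groupby('x').apply(remove_outliers).reset_index(drop=True)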
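
A possible alternative, sketched here under some assumptions, computes the per-group quartiles with transform and filters with a boolean mask instead of using apply. It applies the same 1.5 × IQR rule, but it does not depend on how apply treats the grouping column, which I believe has changed in newer pandas releases. The function name remove_y_outliers_per_x, the parameter k, and the sample data are made up for illustration and are not from the question.

```python
import pandas as pd


def remove_y_outliers_per_x(df, x_col="x", y_col="y", k=1.5):
    """Drop rows whose y lies outside [Q1 - k*IQR, Q3 + k*IQR] of their x group.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Work on a float copy of y so the per-group quantiles are not cast back to int
    y = df[y_col].astype(float)
    grouped = y.groupby(df[x_col])

    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    # Row-wise comparison against the fences of the row's own x group
    keep = y.between(q1 - k * iqr, q3 + k * iqr)
    return df[keep].reset_index(drop=True)


# Made-up sample: each x value has four close y values and one obvious outlier
sample = pd.DataFrame({
    "x": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "y": [10, 11, 12, 13, 100, 20, 21, 22, 23, -50],
})
cleaned = remove_y_outliers_per_x(sample)
print(cleaned)
```

With this sample, the rows with y = 100 (for x = 1) and y = -50 (for x = 2) are dropped and the other eight rows are kept. Groups with very few points give unreliable quartiles, so it may be worth checking group sizes before trusting the result.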