[SOLVED] Removing outliers and calculating a trimmed mean in Python for multiple columns with different number of actual values

I have a dataset of, say, 10010 rows and 100 columns. Columns may contain NaN values, and the number of NaNs can differ from column to column.

I want to:

pick n columns from this dataset (say 20, in no particular order, e.g. Column1, Column2, etc.).
trim outliers (the highest 2.5% and the lowest 2.5% of each selected column), excluding NaN values (so if 10 of the 10010 values in Column1 are NaN, I need to trim the 250 highest and the 250 lowest of the remaining 10000 actual values).
if Column2 initially has 110 NaNs, trim 2.5% from each side based on its actual number of values (9900 in this case, not 10000 as in Column1).
calculate the trimmed mean for each of the selected columns.
get a new dataset after trimming in which all trimmed outliers have been converted to NaN.

Solution

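The simplified example below shows one approach that might be useful. It uses pandas' Series.quantile, which ignores NaN values when computing the thresholds, so each column is trimmed relative to its own actual values. The code can be adapted to your requirements, in particular the quantile parameters (for example 0.025 and 0.975 for a 2.5% trim on each side).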

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})

cols = ['col1', 'col2']
for col in cols:
    lo = df[col].quantile(0.1)   # lower threshold; NaNs are ignored
    hi = df[col].quantile(0.9)   # upper threshold
    # keep values strictly between the thresholds; everything else becomes NaN
    df[col] = df[col].where((df[col] > lo) & (df[col] < hi))
    print(f'mean for {col} is: ', df[col].mean().round(2))

print(df)

gives:

mean for col1 is:  5.0
mean for col2 is:  12.29

   col1  col2
0   NaN   3.0
1   2.0   5.0
2   3.0  13.0
3   4.0  14.0
4   NaN   NaN
5   6.0  16.0
6   7.0  17.0
7   8.0  18.0
8   NaN   NaN
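
As a rough sketch, the same quantile idea might be wrapped in a helper that handles several selected columns at once and defaults to the 2.5% limits from the question. The name trim_by_quantile and its parameters are illustrative only (they are not part of pandas), and the helper returns a trimmed copy rather than modifying the input.

```python
import pandas as pd

def trim_by_quantile(df, cols, lower=0.025, upper=0.975):
    """Return (trimmed copy, trimmed means) using per-column quantile thresholds."""
    out = df.copy()
    lo = out[cols].quantile(lower)   # one threshold per column; NaNs are skipped
    hi = out[cols].quantile(upper)
    # strict comparisons, as in the loop version; NaN compares as False
    keep = out[cols].gt(lo, axis=1) & out[cols].lt(hi, axis=1)
    out[cols] = out[cols].where(keep)   # values outside the thresholds become NaN
    return out, out[cols].mean()

# Same small frame as above, trimmed at the 10% / 90% quantiles
example = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                        'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})
trimmed_q, means_q = trim_by_quantile(example, ['col1', 'col2'], lower=0.1, upper=0.9)
print(means_q.round(2))   # col1 -> 5.0, col2 -> 12.29, matching the output above
print(trimmed_q)
```

For the dataset in the question, the call would be something like trim_by_quantile(df, selected_columns) with the default limits, where selected_columns is the list of 20 column names.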

The quantile approach uses a value threshold to change outliers to NaN; this would be the usual approach. If the requirement is instead to change a fixed number of values at each extreme, this can be done by saving and manipulating the index: sort by value, set the required proportion at each end to NaN, then restore the original order using the index. The code below assumes that the default numerical index is in use; if not, the user's index would need to be saved first and reinstated at the end.

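import pandas as pd

# Start again from the original, untrimmed data used in the first example
df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})

cut_val = 0.2     # proportion of non-NaN values to remove from each extreme
num_rows = len(df)

cols = ['col1', 'col2']
for col in cols:
    num_not_nan = df[col].notna().sum()         # actual values in this column
    cut = int(num_not_nan * cut_val)            # number of values to drop at each end
    dfx = df[col].astype(float).sort_values()   # ascending; NaNs are placed last
    idx = dfx.index.to_list()                   # save sorted index
    dfx.index = range(num_rows)                 # numerical re-index so .loc slices by rank
    dfx.loc[0:cut - 1] = float('nan')           # lowest values (.loc slices include the end)
    dfx.loc[num_not_nan - cut:num_not_nan - 1] = float('nan')   # highest values
    dfx.index = idx                             # impose original index
    df[col] = dfx.sort_index()                  # back to the original row order
    print(f'mean for {col} is: ', df[col].mean().round(2))

print(df)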
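
A possible alternative to the index juggling, sketched here under the assumption that trimming an exact count per column is what is wanted, is to rank the values and keep only the middle ranks. Series.rank leaves NaN unranked, so the count is based on the actual values in each column. The helper name trim_by_count and the synthetic data (columns a, b, c) are made up for illustration; the row and NaN counts mirror the question.

```python
import numpy as np
import pandas as pd

def trim_by_count(df, cols, proportion=0.025):
    """Set the lowest and highest `proportion` of actual values in each column to NaN."""
    out = df.copy()
    for col in cols:
        s = out[col].astype(float)
        n = int(s.notna().sum())          # actual (non-NaN) values in this column
        cut = int(n * proportion)         # values to drop at each end (rounded down)
        ranks = s.rank(method='first')    # 1..n by value; NaN keeps a NaN rank
        keep = (ranks > cut) & (ranks <= n - cut)
        out[col] = s.where(keep)          # trimmed values and original NaNs are NaN
    return out, out[cols].mean()

# Synthetic data shaped like the question: 10010 rows, different NaN counts
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10010, 3)), columns=['a', 'b', 'c'])
data.loc[:9, 'a'] = np.nan      # 10 NaNs  -> 10000 actual values
data.loc[:109, 'b'] = np.nan    # 110 NaNs -> 9900 actual values

trimmed, means = trim_by_count(data, ['a', 'b'])
print(trimmed[['a', 'b']].notna().sum())   # values left in each column after trimming
print(means)                               # trimmed mean per selected column
```

Note that int() rounds the cut down, so a column with 9900 actual values loses 247 values at each end rather than 247.5; whether to round down or up is a choice the question leaves open. Ties are broken by order of appearance (method='first'), which guarantees that exactly cut values are removed from each side.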