python pandas trim outliers

Removing outliers and calculating a trimmed mean in Python for multiple columns with different number of actual values


I have a dataset with, say, 10010 rows and 100 columns. Column values may include NaN, and the number of NaNs can differ from column to column.

I want


Solution

  • The simplified example below shows one approach that might be useful; it uses Series.quantile. The code can be adapted to your requirements (in particular, the quantile values).

    import pandas as pd
    
    df = pd.DataFrame({'col1': [ 1, 2, 3, 4, None, 6, 7, 8, 54],
                       'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19] })
    
    cols = ['col1', 'col2']
    for col in cols:
        lo = df[col].quantile(0.1)
        hi = df[col].quantile(0.9)
        df[col] = df[col].where((df[col] > lo) & (df[col] < hi))   # values outside (lo, hi) become NaN
        print(f'mean for {col} is: ', df[col].mean().round(2))
    
    
    print(df)
    

    gives:

    mean for col1 is:  5.0
    mean for col2 is:  12.29
    
       col1  col2
    0   NaN   3.0
    1   2.0   5.0
    2   3.0  13.0
    3   4.0  14.0
    4   NaN   NaN
    5   6.0  16.0
    6   7.0  17.0
    7   8.0  18.0
    8   NaN   NaN
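
    With around 100 columns, the same quantile thresholding can probably be done without an explicit loop. The block below is a minimal sketch under that assumption: it reuses the sample data from the example, leaves df untouched, and stores the result in new (hypothetical) names trimmed and trimmed_means.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})

# Per-column thresholds; quantile() skips NaN, so each column is
# measured against its own non-NaN values only.
lo = df.quantile(0.1)
hi = df.quantile(0.9)

# Compare every column with its own thresholds (aligned on column names).
# Comparisons involving NaN are False, so missing values stay missing.
mask = df.gt(lo, axis='columns') & df.lt(hi, axis='columns')

trimmed = df.where(mask)         # outliers become NaN; df itself is unchanged
trimmed_means = trimmed.mean()   # mean() skips NaN by default
print(trimmed_means.round(2))
```

    Keeping the trimmed values in a separate frame means the original data is still available if different quantile values need to be tried.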
    

    The quantile-based code above uses a value threshold to change outliers to NaN; this would be the usual approach. If the requirement were instead to change a fixed number of values at each extreme, this could be done by sorting by value, saving the sorted index, blanking the required proportion at each end, then restoring the original order using the index. The code below assumes that the default numerical index has been used; if not, the user index would need to be saved and finally reinstated.

    cut_val = 0.2     # proportion of non-NaN values to remove from each extreme
    num_rows = len(df)
    
    cols = ['col1', 'col2']
    for col in cols:
        num_not_nan = df[col].notna().sum()
        cut = int(num_not_nan*cut_val)              # number of values to blank at each end
        dfx = df[col].astype(float).sort_values()   # float so NaN can be stored; NaNs sort to the end
        idx = dfx.index.to_list()                   # save sorted index
        dfx.index = range(num_rows)                 # positional labels so .loc slices can be used
        dfx.loc[0:cut-1] = float('nan')             # lowest values (.loc slices include both ends)
        dfx.loc[num_not_nan-cut:num_not_nan-1] = float('nan')   # highest non-NaN values
        dfx.index = idx                             # restore the original index labels
        df[col] = dfx.sort_index()
        print(f'mean for {col} is: ', df[col].mean().round(2))
    
    print(df)
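
    The count-based trimming described above could also be sketched with Series.rank, which avoids sorting and re-indexing and so should not depend on the index being the default numerical one. This is an illustrative alternative, not the approach used in the answer's code; the helper name trim_by_count and the letter index are assumptions made for the example.

```python
import pandas as pd

def trim_by_count(s, cut_val):
    """Set the lowest and highest `cut_val` share of non-NaN values to NaN."""
    n = s.notna().sum()              # number of actual values in this column
    cut = int(n * cut_val)           # how many values to drop at each end
    ranks = s.rank(method='first')   # 1..n for actual values, NaN stays NaN
    keep = (ranks > cut) & (ranks <= n - cut)
    return s.where(keep)             # everything not kept becomes NaN

# Same sample values as the example, with a non-numeric index
# to show that the index is left alone.
df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]},
                  index=list('abcdefghi'))

trimmed = df.apply(trim_by_count, cut_val=0.2)
print(trimmed.mean().round(2))
print(trimmed)
```

    Because method='first' gives tied values distinct ranks, exactly cut values are removed from each end even when a column contains duplicates.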