I have a dataset, say 10010 rows and 100 columns; column values might include NaN, and the number of NaNs can differ from column to column. I want to treat the outliers in each column as NaN and then compute per-column statistics such as the mean.
The simplified example below shows an approach which might be useful; it uses pd.Series.quantile. The code could be developed to suit your requirement (obviously including the choice of quantile parameters).
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})

cols = ['col1', 'col2']
for col in cols:
    lo = df[col].quantile(0.1)   # lower threshold: 10th percentile (NaN ignored)
    hi = df[col].quantile(0.9)   # upper threshold: 90th percentile
    df[col] = df[col].where((df[col] > lo) & (df[col] < hi), None)
    print(f'mean for {col} is: ', df[col].mean().round(2))
print(df)
gives:
mean for col1 is:  5.0
mean for col2 is:  12.29
   col1  col2
0   NaN   3.0
1   2.0   5.0
2   3.0  13.0
3   4.0  14.0
4   NaN   NaN
5   6.0  16.0
6   7.0  17.0
7   8.0  18.0
8   NaN   NaN
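To apply the same idea to a wider frame like the one described (many columns, with a different number of NaNs per column), the loop can be wrapped in a small helper function. This is only a minimal sketch, not part of the code above; the name trim_quantiles, the default quantiles and the use of select_dtypes to pick numeric columns are my own choices:

import pandas as pd

def trim_quantiles(data, cols=None, lo_q=0.1, hi_q=0.9):
    # Return a copy of `data` with values outside (lo_q, hi_q) in each column set to NaN.
    # quantile() skips NaN by default, so columns with different NaN counts
    # are handled independently.
    out = data.copy()
    if cols is None:
        cols = out.select_dtypes('number').columns
    for col in cols:
        lo = out[col].quantile(lo_q)
        hi = out[col].quantile(hi_q)
        out[col] = out[col].where((out[col] > lo) & (out[col] < hi))
    return out

# e.g. applied to the original (untrimmed) example frame:
# trimmed = trim_quantiles(df)
# print(trimmed.mean().round(2))   # per-column means; mean() ignores NaN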
The code above uses a value threshold to change outliers to NaN; this would be the usual approach. If the requirement were instead to change a fixed proportion of values at either extreme, this could be done by saving the index, sorting by value, setting the chosen proportion of values at each end to NaN, then restoring the original order using the saved index. The code below assumes that the default numerical index has been used from the start; if not, the user index would need to be saved and finally re-instated (a sketch of that variant follows the code below).
cut_val = 0.2   # proportion of non-NaN values to remove from each extreme
num_rows = len(df)
cols = ['col1', 'col2']
for col in cols:
    num_not_nan = num_rows - df[col].isna().sum()
    cut = int(num_not_nan * cut_val)                # number of values to drop at each end
    dfx = df[col].sort_values()                     # NaNs are placed at the end
    idx = dfx.index.to_list()                       # save sorted index
    dfx.index = range(num_rows)                     # use numerical re-index so .loc can be used
    dfx.loc[0:cut - 1] = None                       # drop the smallest `cut` values
    dfx.loc[num_not_nan - cut:num_not_nan] = None   # drop the largest `cut` values
    dfx.index = idx                                 # impose original index
    df[col] = dfx.sort_index()
    print(f'mean for {col} is: ', df[col].mean().round(2))
print(df)
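For completeness, here is a minimal sketch of the variant mentioned above for a non-default index. df_user is a hypothetical frame with an arbitrary user index; the only additions are saving that index up front and re-instating it at the end:

cut_val = 0.2                          # proportion of non-NaN values to remove from each extreme

orig_index = df_user.index             # df_user: hypothetical frame with a non-default index
df2 = df_user.reset_index(drop=True)   # work on a default numerical index

num_rows = len(df2)
for col in ['col1', 'col2']:
    num_not_nan = num_rows - df2[col].isna().sum()
    cut = int(num_not_nan * cut_val)
    dfx = df2[col].sort_values()       # NaNs are placed at the end
    idx = dfx.index.to_list()          # save sorted index
    dfx.index = range(num_rows)        # numerical re-index so .loc can be used
    dfx.loc[0:cut - 1] = None          # drop the smallest `cut` values
    dfx.loc[num_not_nan - cut:num_not_nan] = None   # drop the largest `cut` values
    dfx.index = idx                    # impose the positional order saved earlier
    df2[col] = dfx.sort_index()

df2.index = orig_index                 # finally re-instate the user index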