I am trying to keep the top values of a nominal (unordered) categorical column that together account for 50% or more of the total count, and replace all other occurrences with 'others'.
values_df = df['column'].value_counts(normalize=True)
total = 0
for i, row in enumerate(values_df.values):
    row = round(row, 2)
    if total <= 0.5:
        total += row
    else:
        df['column'][i] = 'others'
But when I then run print(df['column'].value_counts()), the values below the threshold have not been changed to 'others'.
IIUC, you can use cumsum to compute the cumulative total, then boolean indexing with isin:
values_df = (df['column']
             .value_counts(normalize=True)
             .round(2)
            )
m = values_df.cumsum().gt(0.5)
df.loc[df['column'].isin(values_df.index[m]), 'column'] = 'others'
Example output:
    column
0        4
1   others
2   others
3        3
4        3
5        3
6   others
7        3
8   others
9   others
10       4
11  others
12  others
13       4
14  others
Used input:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'column': np.random.randint(0, 6, size=15)})
    column
0        4
1        5
2        0
3        3
4        3
5        3
6        1
7        3
8        5
9        2
10       4
11       0
12       0
13       4
14       2
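
One thing to be aware of with the cumsum mask above: it also replaces the category whose share pushes the running total past 0.5, so in the example the kept values 3 and 4 cover only 7 of 15 rows. If the goal is for the kept values to reach at least 50%, as the loop in the question seems to intend, a possible variant is to compare the cumulative share of the categories ranked before each one. The sketch below assumes that reading of the question; `lump_rare` and its parameters are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd


def lump_rare(s, threshold=0.5, other="others"):
    """Keep the most frequent values of `s` until their cumulative share
    reaches `threshold`; replace every remaining value with `other`.

    Hypothetical helper written for illustration only.
    """
    shares = s.value_counts(normalize=True)
    # Cumulative share of the categories ranked *before* each category.
    before = shares.cumsum().shift(fill_value=0.0)
    # Keep a category while the ones ahead of it are still short of the threshold.
    keep = shares.index[before < threshold]
    # Cast to object first so numeric values and the string label can coexist.
    return s.astype(object).where(s.isin(keep), other)


np.random.seed(0)
df = pd.DataFrame({"column": np.random.randint(0, 6, size=15)})
df["column"] = lump_rare(df["column"])
print(df["column"].value_counts())
```

On the example input this keeps 3, 4 and 0 (10 of 15 rows) and turns the remaining 5 rows into 'others'. No rounding is applied here, so the comparison uses the exact shares.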