python pandas exploratory-data-analysis

Issues with categorical binning


I am trying to keep the top values of an unordered (nominal) categorical column that together account for 50% or more of the total count, and replace all other occurrences with 'others'.

values_df = df['column'].value_counts(normalize = True)

total = 0
for i, row in enumerate(values_df.values):
      row = round(row,2)
      if total <= 0.5:
           total+=row
      else:
           df['column'][i] = 'others'

But when I print(df['column'].value_counts()), the less frequent values have not been changed to 'others'.


Solution

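  • IIUC, the loop fails because i is a position in the value_counts() result, not the label of a row in df, so df['column'][i] = 'others' targets the row labelled i rather than the rows holding the rare categories. Instead, you can use cumsum to compute the cumulative share, then boolean indexing with isin: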

    values_df = (df['column']
                 .value_counts(normalize = True)
                 .round(2)
                )
    m = values_df.cumsum().gt(0.5)
    df.loc[df['column'].isin(values_df.index[m]), 'column'] = 'others'
    

    Example output:

        column
    0        4
    1   others
    2   others
    3        3
    4        3
    5        3
    6   others
    7        3
    8   others
    9   others
    10       4
    11  others
    12  others
    13       4
    14  others
    

    Input used:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    df = pd.DataFrame({'column': np.random.randint(0, 6, size=15)})
    
        column
    0        4
    1        5
    2        0
    3        3
    4        3
    5        3
    6        1
    7        3
    8        5
    9        2
    10       4
    11       0
    12       0
    13       4
    14       2
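
One caveat with cumsum().gt(0.5): the category that crosses the 50% mark is itself replaced, so in the example above the kept values (3 and 4) cover only 7 of the 15 rows. If the kept categories should reach at least 50%, a possible variant is to compare the share covered before each category instead. The sketch below is only one way to do that; the helper name lump_rare and its parameters are made up for illustration, and it skips the rounding step so the cumulative sum is not distorted.

```python
import pandas as pd


def lump_rare(s: pd.Series, threshold: float = 0.5, other="others") -> pd.Series:
    """Hypothetical helper: keep the most frequent values until they
    cover at least `threshold` of the rows, replace the rest with `other`."""
    # Relative frequency of each value, sorted from most to least frequent
    shares = s.value_counts(normalize=True)
    # Share already covered by the more frequent values (0 for the top one)
    covered_before = shares.cumsum() - shares
    # A value is kept while the values ranked above it have not reached the threshold
    keep = shares.index[covered_before < threshold]
    # Series.where keeps entries where the condition holds and replaces the others;
    # it returns a new Series instead of modifying `s` in place
    return s.where(s.isin(keep), other)


# Same data as the example input above, written out explicitly
df = pd.DataFrame({"column": [4, 5, 0, 3, 3, 3, 1, 3, 5, 2, 4, 0, 0, 4, 2]})
df["column"] = lump_rare(df["column"])
print(df["column"].value_counts())
```

With this data the values 3, 4 and 0 are kept (10 of 15 rows) and 5, 2 and 1 become 'others'. Because where returns a new Series, the string 'others' is never assigned into the integer column in place; the result simply has object dtype.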