python pandas dataframe series categorical-data

Python: Combining Low Frequency Factors/Category Counts


There is a great solution in R.

My df.column looks like:

Windows
Windows
Mac
Mac
Mac
Linux
Windows
...

I want to replace low-frequency categories with 'Other' in this df.column vector. For example, I need my df.column to look like:

Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...

I would like to rename these rare categories to reduce the number of factors in my regression. This is why I need the original vector. In Python, running the command for the frequency table gives:

pd.value_counts(df.column)


Windows          26083
iOS              19711
Android          13077
Macintosh         5799
Chrome OS          347
Linux              285
Windows Phone      167
(not set)           22
BlackBerry          11

I wonder if there is a method to rename 'Chrome OS', 'Linux' and the other low-frequency values into another category (for example, 'Other'), and to do so in an efficient way.


Solution

  • Build a mask from each category's percentage share of the rows, i.e.:

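    import numpy as np

    series = df['column'].value_counts()
    # True for every category that makes up less than 1% of the rows
    mask = (series / series.sum() * 100).lt(1)
    # Replace those categories in df['column'] with 'Other' using np.where
    df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])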
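The replacement can be tried end to end on a small frame. This is only a sketch: the toy data (120 'Windows', 79 'Mac', 1 'Linux') is made up for illustration, and the 1% threshold is the one used in this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Linux' makes up 0.5% of the 200 rows.
df = pd.DataFrame({"column": ["Windows"] * 120 + ["Mac"] * 79 + ["Linux"] * 1})

counts = df["column"].value_counts()
# Boolean mask over the categories: True where the share is below 1%.
mask = (counts / counts.sum() * 100).lt(1)
rare = counts[mask].index

# Keep frequent values as they are, replace rare ones with 'Other'.
df["column"] = np.where(df["column"].isin(rare), "Other", df["column"])

result = df["column"].value_counts().to_dict()
print(result)
```

The column keeps its original length and row order, so it can go straight into a regression; only the labels of the rare rows change.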
    

    To collapse the rare categories of the frequency table into a single 'Other' row holding their sum:

    new = series[~mask]
    new['Other'] = series[mask].sum()
    
    Windows      26083
    iOS          19711
    Android      13077
    Macintosh     5799
    Other          832
    Name: 1, dtype: int64
    

    If you only want to relabel the index of the frequency table, without summing, then:

    series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)
    
    Windows      26083
    iOS          19711
    Android      13077
    Macintosh     5799
    Other          347
    Other          285
    Other          167
    Other           22
    Other           11
    Name: 1, dtype: int64
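After relabeling, the index contains 'Other' several times. If a single row is wanted, one possible follow-up is to group by the index and sum. The sketch below rebuilds the frequency table from the counts in the question; `combined` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Frequency table copied from the question.
series = pd.Series(
    [26083, 19711, 13077, 5799, 347, 285, 167, 22, 11],
    index=["Windows", "iOS", "Android", "Macintosh", "Chrome OS",
           "Linux", "Windows Phone", "(not set)", "BlackBerry"],
)
mask = (series / series.sum() * 100).lt(1)

# Relabel the rare categories; the index now holds 'Other' five times.
series.index = np.where(series.index.isin(series[mask].index), "Other", series.index)

# Merge the duplicate labels by summing, then restore descending order.
combined = series.groupby(level=0).sum().sort_values(ascending=False)
print(combined)
```

This gives the same five rows as the `new['Other'] = series[mask].sum()` approach above, with 'Other' equal to 832.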
    

    Explanation

    (series / series.sum() * 100)  # each category's share of the total, in percent:
    
    Windows          39.820158
    iOS              30.092211
    Android          19.964276
    Macintosh         8.853165
    Chrome OS         0.529755
    Linux             0.435101
    Windows Phone     0.254954
    (not set)         0.033587
    BlackBerry        0.016793
    Name: 1, dtype: float64
    

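    .lt(1) is the method form of the comparison < 1. It returns a boolean mask that is True for every category below 1%. Indexing the counts with that mask (series[mask].index) gives the labels to replace, and np.where assigns 'Other' wherever the column holds one of them.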
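To make the mask step concrete, here is a small check on the counts from the question. It shows that `.lt(1)` and `< 1` produce the same mask and which labels the mask selects; the variable names are illustrative only.

```python
import pandas as pd

# Frequency table copied from the question.
series = pd.Series(
    [26083, 19711, 13077, 5799, 347, 285, 167, 22, 11],
    index=["Windows", "iOS", "Android", "Macintosh", "Chrome OS",
           "Linux", "Windows Phone", "(not set)", "BlackBerry"],
)

pct = series / series.sum() * 100   # share of each category in percent
mask_method = pct.lt(1)             # method form
mask_operator = pct < 1             # operator form

# Labels of the categories that fall below the 1% threshold.
rare_labels = list(series[mask_method].index)

print(mask_method.equals(mask_operator))
print(rare_labels)
```

The threshold is expressed in percent, so changing `1` to, say, `5` would also move 'Macintosh' (about 8.85%) no further; it stays above the cutoff, while anything under 5% is lumped together.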