python numpy

Sorting of categorical variables using np.unique


I'm trying to get the unique values of a categorical variable in a meaningful sorted order using the code below, but without success.

import numpy as np

unique_values, unique_value_counts = np.unique(
    ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large', 'Small', 'Medium'],
    return_counts=True)

print(unique_values)

which gives me the following output:

['Large', 'Medium', 'Small']

However, I expect the values in ascending order of size, like this:

['Small', 'Medium', 'Large']

Is there a way to get the categorical values in this order using np.unique()?


Solution

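  • np.unique always returns its values sorted, which for strings means alphabetical order, so it has no way of knowing that 'Small' should come before 'Large'. You can first translate your strings to integer ranks using a dictionary mapping, call np.unique on the ranks, and use the returned indices to recover the labels: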

    a = np.array(['Small', 'Medium', 'Large', 'Medium',
                  'Small', 'Large', 'Small', 'Medium'])
    order = ['Small', 'Medium', 'Large']
    
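    # Map each label to its position in the desired order.
    key = {k: v for v, k in enumerate(order)}
    # {'Small': 0, 'Medium': 1, 'Large': 2}
    
    # np.unique sorts the integer ranks; idx holds the first occurrence of each rank in a.
    _, idx, unique_value_counts = np.unique(np.vectorize(key.get)(a),
                                            return_index=True,
                                            return_counts=True)
    # Index back into a to get the labels in rank order.
    unique_values = a[idx]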
    
    unique_values
    # array(['Small', 'Medium', 'Large'], dtype='<U6')
    
    unique_value_counts
    # array([3, 3, 2])
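
As a possible alternative (a minimal sketch, not part of the answer above), you could let np.unique sort the labels alphabetically as usual and then reorder its output. The helper name `ordered_unique` is hypothetical, and the sketch assumes every label in the data also appears in `order`:

```python
import numpy as np


def ordered_unique(values, order):
    """Hypothetical helper: unique labels and their counts, arranged by `order`."""
    # np.unique returns the labels in alphabetical order with matching counts.
    labels, counts = np.unique(values, return_counts=True)
    # Position of each label in the desired order.
    # Assumption: a label missing from `order` raises KeyError here.
    rank = {label: i for i, label in enumerate(order)}
    # Permutation that rearranges the alphabetical result into the desired order.
    perm = np.argsort([rank[label] for label in labels.tolist()])
    return labels[perm], counts[perm]


a = np.array(['Small', 'Medium', 'Large', 'Medium',
              'Small', 'Large', 'Small', 'Medium'])
unique_values, unique_value_counts = ordered_unique(a, ['Small', 'Medium', 'Large'])

print(unique_values)
print(unique_value_counts)
```

This only looks up a rank for each distinct label rather than for every element, which may matter for large arrays with few categories.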