pandas, numpy, machine-learning, scikit-learn, dimensionality-reduction

Reducing the number of unique categories in pandas


I have a dataset of job metrics, and one of my features is industry. It is a categorical feature with 1200 unique values. Before I start building a model, I need to figure out how best to encode it, especially given that many unique values. Does anyone have tips or guidance on where I should start?

The picture below shows the 9 most frequent industries. I am thinking of selective encoding, perhaps one-hot encoding only the 15-20 most frequent values, but I would be thankful for any suggestions. Thanks.

I looked through several resources but couldn't find anything promising so far.
[A picture of the 9 most occurring industries]
https://i.sstatic.net/tDAEk.jpg


Solution

  • You could one-hot encode everything, then check each encoded column's correlation with the target to see which industries may be informative features.

    If the data is too large for that, then yes, selective encoding as you suggested is a reasonable option: keep the most frequent values, fill everything else with "Other", and then proceed with one-hot encoding.
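
A minimal sketch of the approach described above, assuming a DataFrame with an `industry` column. The toy values, the numeric `target` column, and `top_n = 2` are made up for illustration; on the real data `top_n` would be something like 15-20.

```python
import pandas as pd

# Hypothetical toy data: "industry" mirrors the feature in the question,
# "target" stands in for whatever numeric value the model should predict.
df = pd.DataFrame({
    "industry": ["Tech", "Tech", "Tech", "Finance", "Finance",
                 "Retail", "Mining", "Farming"],
    "target": [10, 12, 11, 7, 8, 3, 4, 2],
})

top_n = 2  # assumption for this toy example

# The top_n most frequent categories
top = df["industry"].value_counts().nlargest(top_n).index

# Keep frequent categories as-is and replace everything else with "Other"
df["industry_reduced"] = df["industry"].where(df["industry"].isin(top), "Other")

# One-hot encode the reduced column (float dtype so correlations are easy to compute)
encoded = pd.get_dummies(df["industry_reduced"], prefix="industry", dtype=float)

# Correlation of each encoded column with the target
correlations = encoded.corrwith(df["target"])
print(correlations.sort_values(ascending=False))
```

Correlation with the target is only a rough first screen, and it assumes a numeric target. Grouping rare values before encoding keeps the number of new columns at `top_n + 1` instead of one per unique value.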