python, one-hot-encoding, recode

How to Automatically Dummy Code High Cardinality Variables in Python


I am working my way through the data engineer salary data set on Kaggle. The salary_currency column has the following value counts.

salary_currency
USD 13695
GBP   558
EUR   406
INR    51
CAD    49
...

16494 values total

Is there a way to dummy code only the values that make up at least 2% (or any chosen percentage) of a given column? In other words, to create dummy columns only for USD, GBP, and EUR?


Solution

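  • Yes. Recent versions of scikit-learn's OneHotEncoder support this directly: min_frequency (added in version 1.1) pools every category below the threshold into a single "infrequent" column. The sparse_output argument used here requires version 1.2 or later.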

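    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    # Categories found in fewer than 2% of rows are pooled into one "infrequent" column
    oh = OneHotEncoder(min_frequency=0.02, sparse_output=False)
    data = oh.fit_transform(df[['salary_currency']])
    cols = oh.get_feature_names_out()
    # Reuse df's index so the dummy columns can be joined back onto df
    features = pd.DataFrame(data, columns=cols, index=df.index)
    features.sum(axis=0)  # number of rows per dummy column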
    

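    This returns the following row counts per column. Note that they sum to 607 rows, so they come from a smaller version of the data than the 16,494 rows in the question; there CAD and INR each exceed 2% and keep their own columns. On the counts shown in the question, only USD, GBP, and EUR would get their own columns and everything else would fall into salary_currency_infrequent_sklearn.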

    salary_currency_CAD                    18.0
    salary_currency_EUR                    95.0
    salary_currency_GBP                    44.0
    salary_currency_INR                    27.0
    salary_currency_USD                   398.0
    salary_currency_infrequent_sklearn     25.0
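    If you would rather stay in pandas, the same idea can be sketched without scikit-learn. This is only an illustration, not what OneHotEncoder does internally: the helper name dummy_code_frequent, the "infrequent" label, and the toy series are all made up for this example.

```python
import pandas as pd


def dummy_code_frequent(series: pd.Series, min_frequency: float = 0.02,
                        other_label: str = "infrequent") -> pd.DataFrame:
    """One-hot encode `series`, pooling categories rarer than `min_frequency`.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    # Fraction of rows taken by each category
    shares = series.value_counts(normalize=True)
    frequent = shares[shares >= min_frequency].index
    # Keep frequent values as they are; everything else (including missing
    # values) is replaced by the pooled label
    pooled = series.where(series.isin(frequent), other_label)
    return pd.get_dummies(pooled, prefix=series.name, dtype=int)


# Synthetic stand-in for the real column (100 rows, made-up proportions)
toy = pd.Series(["USD"] * 90 + ["GBP"] * 5 + ["EUR"] * 3 + ["INR"] + ["CAD"],
                name="salary_currency")
dummies = dummy_code_frequent(toy, min_frequency=0.02)
print(dummies.sum())
```

    In the toy data INR and CAD each make up 1% of the rows, so both end up in the pooled column. One practical difference: the scikit-learn encoder remembers the fitted categories and can be reused on new data with transform, while this helper recomputes the frequencies every time it is called.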