I am working my way through the data engineer salary data set on Kaggle. The salary_currency column has the following value counts.
salary_currency
USD 13695
GBP 558
EUR 406
INR 51
CAD 49
...
16494 values total
Is there a way to dummy-code only the values that make up at least 2% (or any chosen percentage) of a given column? In other words, create dummy columns only for USD, GBP, and EUR?
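Yes. Recent versions of scikit-learn's OneHotEncoder support this through the min_frequency parameter (added in 1.1; the sparse_output argument needs 1.2 or later). A float is read as a fraction of the rows, and every category below that share is pooled into a single "infrequent" column instead of getting its own dummy: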
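import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categories that make up less than 2% of the rows are grouped as infrequent
oh = OneHotEncoder(min_frequency=0.02, sparse_output=False)
data = oh.fit_transform(df[['salary_currency']])
cols = oh.get_feature_names_out()
features = pd.DataFrame(data, columns=cols)
# Number of rows in each dummy column
features.sum(axis=0)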
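This returns the following counts per column. The numbers come from a different sample than the counts in the question (607 rows in total), so CAD and INR clear the 2% threshold here; everything below it lands in salary_currency_infrequent_sklearn: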
salary_currency_CAD 18.0
salary_currency_EUR 95.0
salary_currency_GBP 44.0
salary_currency_INR 27.0
salary_currency_USD 398.0
salary_currency_infrequent_sklearn 25.0