python-3.x machine-learning sklearn-pandas data-preprocessing standardization

Do we need to exclude OneHotEncoded columns while standardizing or normalizing using MinMaxScaler() or StandardScaler()?


This is the final cleaned DataFrame (df2) before Standardizing

My code:

```python
scaler = StandardScaler()
df2[list(df2.columns)] = scaler.fit_transform(df2[list(df2.columns)])
df2
```

This returns a DataFrame in which every column has been standardized, including the dummy and category columns. Is this the correct way, or should we standardize only the numerical columns?


Solution

  • It doesn't really matter for MinMaxScaler: on a column containing only 0s and 1s it is an identity transform (the minimum 0 maps to 0 and the maximum 1 maps to 1). StandardScaler, on the other hand, is more interesting: applied to a one-hot encoded column, it replaces the 0/1 codes with two values whose magnitudes depend on how many samples fall into that category. Whether that helps boils down to an empirical question of what works for your application, as both paths can be justified. Simply standardizing everything is a more "unified" and therefore simpler approach overall, but in the end ML is an empirical field: do whatever gives you the best results.
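A small sketch of both effects on a toy frame (column names here are made up for illustration, not taken from your `df2`). It also shows one way to scale only the numeric columns, using `ColumnTransformer` with `remainder="passthrough"` so the dummies are left untouched:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame: one numeric column plus a 0/1 dummy (hypothetical names).
df2 = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 80_000],
    "gender_male": [1, 0, 1, 1],
})

# MinMaxScaler leaves a 0/1 dummy exactly as it was: 0 -> 0, 1 -> 1.
mm = MinMaxScaler().fit_transform(df2)
print(mm[:, 1])  # [1. 0. 1. 1.]

# StandardScaler maps the dummy to two values whose spread depends on
# the category frequency (here the mean is 0.75, so 1 -> ~0.577 and
# 0 -> ~-1.732).
ss = StandardScaler().fit_transform(df2)
print(ss[:, 1].round(3))

# Alternative: standardize only the numeric column(s) and pass the
# dummy columns through unchanged.
ct = ColumnTransformer(
    [("num", StandardScaler(), ["income"])],
    remainder="passthrough",
)
print(ct.fit_transform(df2))
```

Note that `ColumnTransformer` returns a NumPy array with the transformed columns first, so you may want `pd.DataFrame(..., columns=ct.get_feature_names_out())` to recover labeled columns.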