I know that we have to one-hot encode categorical data before training a machine learning algorithm. My question is: do we need to remove one column manually, or will sklearn do it for us?
I assume you want to drop one column for non-binary categorical features as well, to avoid multicollinearity, which can cause problems for linear models. It is as easy as passing the `drop_first=True` argument to `pd.get_dummies()`. In older scikit-learn releases, `sklearn.preprocessing.OneHotEncoder` has no simple interface for this, and its usage is more complicated anyway, because categorical features have to be encoded as ints beforehand. Releases from 0.21 on accept a `drop` argument (for example `drop='first'`), but nothing is dropped by default, so sklearn will not remove the column for you unless you ask it to.
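To make the `pd.get_dummies()` suggestion concrete, here is a minimal sketch. The DataFrame and its column names (`color`, `size`) are made up for illustration and are not from the question; the only thing it is meant to show is how `drop_first=True` changes the resulting columns.

```python
import pandas as pd

# Hypothetical toy data: one non-binary and one binary categorical feature.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "L", "S", "L"],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(df)

# drop_first=True drops the first category (in sorted order) of each
# feature, so a feature with k categories yields k - 1 indicator columns.
reduced = pd.get_dummies(df, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With this toy data, the dropped categories (`blue` and `L`) become the implicit baseline: a row where all remaining indicators of a feature are zero belongs to the dropped category.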