[SOLVED] One-hot-encoding with missing categories

I have a dataset with a category column. In order to use linear regression, I 1-hot encode this column.

My set has 10 columns, including the category column. After dropping that column and appending the 1-hot encoded matrix, I end up with 14 columns (10 - 1 + 5).

So I train (fit) my LinearRegression model with a matrix of shape (n, 14).

After training it, I want to test it on a subset of the training set, so I take only the first 5 rows and put them through the same pipeline. But these 5 rows contain only 3 of the categories, so after going through the pipeline I'm left with a matrix of shape (5, 12) instead of (5, 14), because 2 categories are missing.

How can I force the 1-hot encoder to use all 5 categories?

I'm using LabelBinarizer from sklearn.
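
The shape mismatch described above can be reproduced in isolation. This is only a minimal sketch: the category names and the array are made up for illustration, and it uses LabelBinarizer directly rather than the full pipeline.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category column with 5 distinct values (names are made up).
# The first 5 entries contain only 3 of them: "a", "b" and "c".
categories = np.array(["a", "b", "c", "a", "b", "d", "e", "d"])

# Fitting on the full column yields one output column per category.
full_encoded = LabelBinarizer().fit_transform(categories)
print(full_encoded.shape)

# Fitting a fresh encoder on the first 5 entries: it only sees 3 categories,
# so it produces fewer columns.
subset_encoded = LabelBinarizer().fit_transform(categories[:5])
print(subset_encoded.shape)
```

The two encoded matrices have different widths, which is exactly why the model trained on the first one rejects the second.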

Solution

The mistake was in how I "put the test data through the same pipeline". Basically, I was doing:

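from sklearn.linear_model import LinearRegression

# Fit the pipeline on the full training set
data_prepared = full_pipeline.fit_transform(train_set)

lin_reg = LinearRegression()
lin_reg.fit(data_prepared, labels)

# Take the first 5 rows (only 3 of the 5 categories appear)
some_data = train_set.iloc[:5]
# Bug: fit_transform refits the pipeline on the subset
some_data_prepared = full_pipeline.fit_transform(some_data)

lin_reg.predict(some_data_prepared)
# => error because of mismatched shapes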

The problematic line is:

some_data_prepared = full_pipeline.fit_transform(some_data)

Calling fit_transform here refits the LabelBinarizer on a subset containing only 3 of the 5 labels, so it outputs 3 columns instead of 5. Instead I should do:

some_data_prepared = full_pipeline.transform(some_data)

This way I reuse the pipeline that was fitted on the full set (train_set), so the subset is transformed in the same way, with all 5 categories.
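
For anyone who wants to check this end to end, here is a self-contained sketch of the fit-once, transform-later pattern. It is an assumption-laden stand-in for my setup: the data is random, the names (numeric, category, encoder) are hypothetical, and a bare LabelBinarizer plus np.hstack replaces full_pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelBinarizer

rng = np.random.default_rng(0)

# Hypothetical training data: 2 numeric columns plus a 5-valued category column.
# The first 5 rows contain only the categories "a", "b" and "c".
numeric = rng.normal(size=(20, 2))
category = np.array(["a", "b", "c", "a", "b"] + ["d", "e", "a", "b", "c"] * 3)
labels = rng.normal(size=20)

# Fit the encoder once, on the full training set.
encoder = LabelBinarizer()
data_prepared = np.hstack([numeric, encoder.fit_transform(category)])

lin_reg = LinearRegression()
lin_reg.fit(data_prepared, labels)

# transform (not fit_transform) reuses the 5 categories learned above,
# so the subset gets the same number of columns as the training matrix.
some_prepared = np.hstack([numeric[:5], encoder.transform(category[:5])])
predictions = lin_reg.predict(some_prepared)
print(data_prepared.shape, some_prepared.shape, predictions.shape)
```

Both matrices end up with the same number of columns, so predict runs without the shape error.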

Thanks @Vivek Kumar