How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?

I have recently started learning python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprised of both numerical and categorical data. The dataset has lots of missing values. I am currently trying to encode the categorical features using OneHotEncoder. When I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all the feature's categories, as such:

0     Male 
1     Female
2     NaN

After applying OneHotEncoder:

0     10 
1     01
2     00

However, when running the following code:

    # Encoding categorical data
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder


    ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
                           remainder='passthrough')
    obj_df = np.array(ct.fit_transform(obj_df))
    print(obj_df)

I am getting the error ValueError: Input contains NaN

So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong. Is there a way for me to get the functionality described above? I know imputing the missing values before encoding will resolve this issue, but I am reluctant to do this as I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.

I found this question that is similar but the answer doesn't offer a detailed enough solution on how to deal with the NaN values.

Let me know what your thoughts are, thanks.

Solution

You will need to impute the missing values before. You can define a Pipeline with an imputing step using SimpleImputer setting a constant strategy to input a new category for null fields, prior to the OneHot encoding:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])