I'm working on a multilabel classification problem using the ClassifierChain approach with RandomForestClassifier as the base estimator. My input matrix X contains np.nan values. RandomForestClassifier on its own handles these without any problem, since it natively supports missing values through its tree-splitting mechanism.
This is confusing to me: the base estimator (RandomForestClassifier) handles NaN values correctly, so I don't understand why ClassifierChain, which is essentially a wrapper around it, raises an error when the underlying classifier has no issue with NaNs.
When I train a RandomForestClassifier on its own, it handles np.nan without issue:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1)
y_single_label = [0, 0, 1, 1]
# native NaN support in forests requires scikit-learn >= 1.4
forest = RandomForestClassifier(random_state=0)
forest.fit(X, y_single_label)
X_test = np.array([np.nan]).reshape(-1, 1)
forest.predict(X_test)
Even when I use MultiOutputClassifier (which, unlike ClassifierChain, doesn't model dependencies between labels), training proceeds without errors despite the NaNs in the input, as expected.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1)
# Two label columns for multilabel classification
y = np.array([[0, 1], [0, 0], [1, 0], [1, 1]])
# Base classifier
base_clf = RandomForestClassifier()
# MultiOutputClassifier (Binary Relevance) with the base classifier
clf_BR = MultiOutputClassifier(base_clf)
# Fitting the model
clf_BR.fit(X, y)
However, when I switch to the ClassifierChain approach:
# Classifier chain with the base classifier
clf_chain = ClassifierChain(base_clf)
# Fitting the model
clf_chain.fit(X, y)
I get the following error (this trace is from a hyperparameter-tuning run, but the plain fit call above fails with the same ValueError):
Trial 0 failed with parameters: {'n_estimators': 30, 'max_depth': 16, 'max_samples': 0.4497444900238575, 'max_features': 550, 'order_type': 'random'} because of the following error: ValueError('Input X contains NaN.\nClassifierChain does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values')
Since it's important for us to keep the missing values as they are and not impute or drop them, I'm wondering if there's a way to make ClassifierChain work with missing values. Is there any workaround or something I'm missing here?
Here are my environment details:
Yes, this seems to be how ClassifierChain is written. If you look at the full stack trace, you will see that the chain's own input validation is eventually called, and that is where the NaN check happens, before the base estimator ever sees the data.
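You can see this for yourself with a small reproduction (behavior as of recent scikit-learn versions): the ValueError comes out of ClassifierChain.fit itself, and the message names ClassifierChain rather than the base estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain

X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1)
y = np.array([[0, 1], [0, 0], [1, 0], [1, 1]])

try:
    ClassifierChain(RandomForestClassifier()).fit(X, y)
    failed = False
except ValueError as err:
    # The NaN check runs inside the chain's own input validation,
    # before the forest is ever trained.
    failed = True
    print(err)
```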
One workaround is to impute the missing values but also add a binary is_nan feature, so the model knows when a value was actually missing; if your model is complex enough, it can learn to ignore the imputed value whenever is_nan is true.
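A minimal sketch of this idea, assuming the questioner's setup: SimpleImputer with add_indicator=True appends a binary missing-value indicator column, and putting it in a pipeline in front of the chain means ClassifierChain never sees a NaN.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline

X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1)
y = np.array([[0, 1], [0, 0], [1, 0], [1, 1]])

# Impute NaNs (here with the column mean) and append an indicator
# column flagging which values were originally missing.
pipe = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    ClassifierChain(RandomForestClassifier(random_state=0)),
)
pipe.fit(X, y)
preds = pipe.predict(X)
```

The chain's validation passes because the imputer runs first, and the indicator column preserves the "missingness" signal the forest can split on.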
I agree that the class should support NaNs, so it might be worth filing a feature request with scikit-learn.