scikit-learn random-forest missing-data

SKLearn algorithms that handle NaN values natively


I have a large data set with many missing values. The SKLearn documentation has a list of algorithms that handle NaN values natively: 6.4.7. Estimators that handle NaN values

This list includes RandomForestClassifier.

However, when I tried to run an RF model in SKLearn with this large data set, I got the following error message:

ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Does anyone have insight into this issue? Perhaps SKLearn has not updated its list of algorithms that handle NaN values?

For now, I will try one or more of the other algorithms on this list, starting with HistGradientBoostingClassifier.

Thanks!


Solution

  • ValueError: Input X contains NaN. RandomForestClassifier does not accept missing values encoded as NaN natively.

    Missing value support in "classical" SkLearn tree models is a fairly recent addition.

    For DecisionTreeClassifier it has been available since SkLearn 1.3.0. For RandomForestClassifier it has been available since SkLearn 1.4.0. See the release notes for more details.

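    Check your SkLearn version (print(sklearn.__version__)). If it is lower than 1.4.0, upgrade your installation. The documentation page you were reading describes the latest stable release, which may be newer than the version you have installed.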
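
The version check can be sketched in code as below. This is a minimal sketch, not a definitive recipe: the toy arrays X and y are made up for illustration, and the median-imputation fallback for older installations is an assumption taken from the hint in the error message, not something the native NaN support requires.

```python
import re

import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Extract "major.minor" from the version string, e.g. "1.4.2" -> (1, 4).
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
installed = (int(match.group(1)), int(match.group(2)))

# RandomForestClassifier accepts NaN natively from 1.4 onwards.
supports_nan = installed >= (1, 4)

# Hypothetical toy data with missing values encoded as NaN.
X = np.array([
    [1.0, np.nan],
    [2.0, 0.5],
    [np.nan, 1.5],
    [4.0, 2.0],
    [5.0, np.nan],
    [6.0, 3.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=10, random_state=0)
if supports_nan:
    # New enough: the forest can be fitted on X directly.
    model = forest
else:
    # Older installation (assumed fallback): impute before the forest.
    model = make_pipeline(SimpleImputer(strategy="median"), forest)

model.fit(X, y)
predictions = model.predict(X)
print(sklearn.__version__, supports_nan, predictions)
```

On an installation older than 1.4.0 the same script still runs, but through the imputer pipeline, so upgrading (for example with pip install --upgrade scikit-learn) is what actually enables the native behaviour.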