python, machine-learning, scikit-learn, random-forest, missing-data

Why does RandomForestClassifier in scikit-learn predict even on all-NaN input?


I am training a random forest classifier in Python with scikit-learn; see the code below:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X=df.drop("AP", axis=1), y=df["AP"].astype(int))

When I use this classifier to predict on another dataset that contains NaN values, the model still produces output. Not only that: I tried predicting on a row in which every feature is NaN, and it still returned a prediction.

import numpy as np
import pandas as pd

# making a row with all NaN values
row = pd.DataFrame([np.nan] * len(rf.feature_names_in_), index=rf.feature_names_in_).T
rf.predict(row)

It predicts: array([1])

I know that RandomForestClassifier in scikit-learn does not natively support missing values. So I expected a ValueError, not a prediction.

I can drop the NaN rows and predict only on fully observed rows, but I am concerned that something is wrong with this classifier. Any insight would be appreciated.
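For reference, the filtering workaround mentioned above can be sketched as follows; `X_new` and its column names are hypothetical stand-ins for the dataset being predicted on:

```python
import numpy as np
import pandas as pd

# Hypothetical new data with some missing values
X_new = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Keep only rows with no NaN in any feature column
mask = X_new.notna().all(axis=1)
X_clean = X_new[mask]

# Predictions then align with X_clean.index, e.g.:
# preds = rf.predict(X_clean)
```

Because `X_clean` keeps the original index, the predictions can be joined back onto the full frame afterwards.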


Solution

  • In scikit-learn v1.4, support for missing values was added to RandomForestClassifier when the criterion is gini (the default), entropy, or log_loss. Fitting and predicting with NaN inputs therefore no longer raises a ValueError on that version, which is why your model produces predictions instead of an error.

    Source: https://scikit-learn.org/dev/whats_new/v1.4.html#id7
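A minimal sketch of the version-dependent behavior on toy data (the arrays and the version check below are illustrative, not taken from the question):

```python
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier

# Toy training data containing missing entries
X = np.array([[0.0, 1.0], [1.0, 0.0], [np.nan, 1.0], [1.0, np.nan]])
y = np.array([0, 1, 0, 1])

major, minor = map(int, sklearn.__version__.split(".")[:2])
rf = RandomForestClassifier(random_state=42)

if (major, minor) >= (1, 4):
    # v1.4+: NaNs are handled by the tree splitter, so fit/predict
    # succeed even on an all-NaN row (matching the behavior observed
    # in the question)
    rf.fit(X, y)
    pred = rf.predict([[np.nan, np.nan]])
else:
    # Older versions reject NaNs during input validation
    try:
        rf.fit(X, y)
    except ValueError as e:
        print("ValueError:", e)
```

So the behavior in the question is expected on v1.4+; on older versions the same call would fail with the ValueError you anticipated.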