I am training a random forest classifier in Python with scikit-learn; see the code below:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X=df.drop("AP", axis=1), y=df["AP"].astype(int))
When I predict with this classifier on another dataset that contains NaN values, the model still produces output. Not only that, I tried predicting on a row in which every variable is NaN, and it still returned a prediction.
import numpy as np
import pandas as pd

# Make a single row in which every feature is NaN
row = pd.DataFrame([np.nan] * len(rf.feature_names_in_), index=rf.feature_names_in_).T
rf.predict(row)
It predicts:

array([1])
I know that RandomForestClassifier in scikit-learn does not natively support missing values. So I expected a ValueError, not a prediction.
I can drop the NaN rows and only predict on the non-NaN rows, but I am concerned that something might be wrong with this classifier. Any insight would be appreciated.
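For reference, this is roughly how I would skip the incomplete rows in the meantime (X_new is just a placeholder name for the new dataset, assumed to have the same feature columns the model was trained on):

import pandas as pd

# Keep only rows with no NaN in any feature, predict on those,
# and leave NaN in the output for the rows that were skipped
mask = X_new.notna().all(axis=1)
preds = pd.Series(index=X_new.index, dtype="float")
preds[mask] = rf.predict(X_new[mask])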
In the most recent version of scikit-learn (v1.4), support for missing values was added to RandomForestClassifier when the criterion is gini (the default).
Source: https://scikit-learn.org/dev/whats_new/v1.4.html#id7
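You can confirm this on your own machine by checking the installed version; here is a minimal, self-contained sketch (the toy data is purely illustrative):

import numpy as np
import pandas as pd
import sklearn
from sklearn.ensemble import RandomForestClassifier

print(sklearn.__version__)  # NaN inputs are accepted from 1.4 onwards

# Toy training data with no missing values
X_train = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
y_train = [0, 1, 0, 1]

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# A single all-NaN row: predicts on >= 1.4, raises ValueError on older versions
row = pd.DataFrame([[np.nan, np.nan]], columns=X_train.columns)
print(clf.predict(row))

So the behaviour you are seeing is expected on 1.4 and later; on an older version the same predict call would raise the ValueError you anticipated.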