Tags: machine-learning, scikit-learn, random-forest, sklearn-pandas, grid-search

Sklearn Random Forest: determine the names of the features selected by the parameter grid for model fit and prediction


New to ML here and trying my hand at fitting a model using a Random Forest. Here is my simplified code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=42)

model = RandomForestRegressor()
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'max_features': [3, 5, 7],
    'random_state': [42]
}

Next, I perform grid search for the best parameters:

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

This yields the output:

{'max_depth': 7, 'max_features': 3, 'n_estimators': 500, 'random_state': 42}

Next, I run predictions with the best model. I get R² ≈ 0.998 for both the train and test data:

best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(train_r2, test_r2)

Question:

The above code ascertained 'max_features' to be 3.

  1. I suppose those 3 features were used by the model for prediction and then to calculate R². Is that right?
  2. If #1 is correct, how do I print the 3 features that were used for the best prediction and yielded an R² of 0.998?

Solution

  • The 'max_features' parameter of RandomForestRegressor does not refer to the top 3 most important features used by the model; it sets how many features are considered when looking for the best split at each node.

    So, when you find 'max_features': 3 in your best parameters, it means that at each node the forest draws a random subset of 3 features and picks the best split among them, not necessarily the same 3 features each time. The candidate features can change for every tree and every split in your random forest, as the sketch below illustrates.
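
    One way to see this for yourself is to inspect which features each fitted tree actually split on. A minimal sketch, assuming grid_search has been fitted as above and features is a pandas DataFrame (tree_.feature stores the feature index used at every internal node; leaves are marked with -2):

    import numpy as np

    best_model = grid_search.best_estimator_
    for i, tree in enumerate(best_model.estimators_[:3]):  # look at the first 3 trees
        used = np.unique(tree.tree_.feature)
        used = used[used >= 0]  # drop the -2 markers that denote leaf nodes
        print('Tree %d split on: %s' % (i, list(features.columns[used])))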

    In the context of random forests, what you can get is the feature importances: an importance score for each feature, aggregated across all trees. Here is how you can do it:

    feature_importances = grid_search.best_estimator_.feature_importances_
    feature_names = features.columns

    # Pair each feature name with its importance score and sort, highest first
    important_features = sorted(zip(feature_names, feature_importances),
                                key=lambda pair: pair[1],
                                reverse=True)
    print('Most important features: %s' % [name for name, _ in important_features[:3]])
    

    This gives you the top 3 features that are most important across all trees in the forest, not necessarily the ones used at each individual split. You should interpret this as a general measure of which features the model considers important overall, rather than a specific indication of which 3 features were used in any particular decision within the model.
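
    If you already use pandas, the same ranking can be computed more compactly. A minimal sketch, again assuming features is a DataFrame:

    import pandas as pd

    importances = pd.Series(grid_search.best_estimator_.feature_importances_,
                            index=features.columns).sort_values(ascending=False)
    print(importances.head(3))  # top 3 features by overall importance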