Tags: machine-learning, scikit-learn, random-forest, sklearn-pandas, grid-search

Sklearn Random Forest: determine the names of the features selected by the parameter grid for model fit and prediction


New to ML here and trying my hand at fitting a model using a Random Forest. Here is my simplified code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=42)

model = RandomForestRegressor()
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'max_features': [3, 5, 7],
    'random_state': [42]
}

Next, I perform grid search for the best parameters:

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

This yields the output:

{'max_depth': 7, 'max_features': 3, 'n_estimators': 500, 'random_state': 42}

Next, I run predictions with the best model. I get R² ≈ 0.998 for both the train and test data:

best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(train_r2, test_r2)

Question:

The above code ascertained 'max_features' to be 3.

  1. I suppose those 3 features were used by the model for prediction and then to calculate R². Is that right?
  2. If #1 is correct, how do I print the 3 features that were used for the best prediction and yielded an R² of 0.998?

Solution

  • The 'max_features' parameter of RandomForestRegressor does not refer to the top 3 most important features used by the model; it sets how many features are considered when looking for the best split at each node.

    So, when you find 'max_features': 3 in your best parameters, it means that at each node the forest draws a random subset of 3 features and picks the best split among them, not necessarily the same 3 features each time. The candidate features can change for every tree and every split in your random forest, as the sketch below illustrates.
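
    One way to see this for yourself is to inspect which features each fitted tree actually split on. A minimal sketch, assuming grid_search has been fitted as above and features is a pandas DataFrame (tree_.feature stores the feature index used at every internal node; leaves are marked with -2):

    import numpy as np

    best_model = grid_search.best_estimator_
    for i, tree in enumerate(best_model.estimators_[:3]):  # look at the first 3 trees
        used = np.unique(tree.tree_.feature)
        used = used[used >= 0]  # drop the -2 markers that denote leaf nodes
        print('Tree %d split on: %s' % (i, list(features.columns[used])))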

    In the context of random forests, what you can get is the feature importances: an importance score for each feature, aggregated across all trees. Here is how you can do it:

    feature_importances = grid_search.best_estimator_.feature_importances_
    feature_names = features.columns

    # Pair each feature name with its importance score and sort, highest first
    important_features = sorted(zip(feature_names, feature_importances),
                                key=lambda pair: pair[1],
                                reverse=True)
    print('Most important features: %s' % [name for name, _ in important_features[:3]])
    

    This gives you the top 3 features that are most important across all trees in the forest, not necessarily the ones used at each individual split. You should interpret this as a general measure of which features the model considers important overall, rather than a specific indication of which 3 features were used in any particular decision within the model.
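
    If you already use pandas, the same ranking can be computed more compactly. A minimal sketch, again assuming features is a DataFrame:

    import pandas as pd

    importances = pd.Series(grid_search.best_estimator_.feature_importances_,
                            index=features.columns).sort_values(ascending=False)
    print(importances.head(3))  # top 3 features by overall importance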