New to ML here and trying my hand at fitting a model using a Random Forest. Here is my simplified code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out 15% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.15, random_state=42)

model = RandomForestRegressor()
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'max_features': [3, 5, 7],
    'random_state': [42],
}
Next, I perform a grid search for the best parameters:
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
This yields the output:
{'max_depth': 7, 'max_features': 3, 'n_estimators': 500, 'random_state': 42}
Next, I generate predictions with the best model found by the search. I get R2 = 0.998 for both the train and test data:

best_model = grid_search.best_estimator_  # already refit on the full training set

y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
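For reference, the mean cross-validated R2 of the winning configuration can also be read off the fitted search object (GridSearchCV scores a regressor with R2 by default):

# Mean cross-validated R2 of the best parameter combination
print(grid_search.best_score_)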
Question: The grid search above determined 'max_features' to be 3. Does that mean the model picks out the 3 most important features?

The 'max_features' parameter in RandomForestRegressor does not refer to the top 3 most important features used by the model; rather, it determines the number of features to consider when looking for the best split.
Specifically:

- If max_features is an integer, that many features are considered at each split.
- If max_features is a float, it is a fraction, and int(max_features * n_features) features are considered at each split.
- If max_features is 'auto' (deprecated and removed in recent scikit-learn versions), then max_features=n_features for the regressor; it meant sqrt(n_features) only for the classifier.
- If max_features is 'sqrt', then max_features=sqrt(n_features).
- If max_features is 'log2', then max_features=log2(n_features).
- If max_features is None, then max_features=n_features.

So, when you find 'max_features': 3 in your best parameters, it means that at each node the random forest draws a random subset of 3 candidate features and searches only those for the best split; it is not necessarily the same 3 features each time. The subset changes from split to split and from tree to tree.
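To make that mechanism concrete, here is a toy sketch of a single node's split search. This is my own illustration (the function name, data, and the brute-force threshold scan are all made up for the example), not scikit-learn's actual implementation:

import numpy as np

def best_split_at_node(X, y, max_features, rng):
    # Draw a fresh random subset of candidate columns for THIS node only
    candidates = rng.choice(X.shape[1], size=max_features, replace=False)
    best = None  # (weighted MSE, feature index, threshold)
    for j in candidates:
        for t in np.unique(X[:, j])[:-1]:  # thresholds between observed values
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Weighted within-child variance: the MSE split criterion
            score = (left.var() * left.size + right.var() * right.size) / y.size
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 7))  # pretend the data has 7 features
y = X[:, 0] + 0.1 * rng.normal(size=100)
print(best_split_at_node(X, y, max_features=3, rng=rng))  # new candidate subset each call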
In the context of random forests, you can get the feature importances, which give you an importance score for each feature's contribution to the predictions. Here is how you can do it:
# Importance scores from the refit best estimator
feature_importances = grid_search.best_estimator_.feature_importances_
feature_names = features.columns

# Map each feature name to its importance score
important_features_dict = dict(zip(feature_names, feature_importances))

# Sort feature names by importance, highest first
important_features_list = sorted(important_features_dict,
                                 key=important_features_dict.get,
                                 reverse=True)

print('Most important features: %s' % important_features_list[:3])
This gives you the top 3 features that are most important across all trees in the forest, not necessarily the ones used at each individual split. You should interpret this as a general measure of which features the model considers important overall, rather than a specific indication of which 3 features were used in any particular decision within the model.
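As an aside, the same ranking can be computed a bit more compactly, assuming features is a pandas DataFrame (which the use of features.columns above suggests):

import pandas as pd

# Pair importances with column names and sort, highest first
importances = pd.Series(
    grid_search.best_estimator_.feature_importances_,
    index=features.columns,
)
print(importances.sort_values(ascending=False).head(3))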