I made a random forest model and visualized the result.
#training code
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()
forest_param = {'max_depth': np.arange(1, 15),
                'n_estimators': [50, 100, 150, 200, 250, 300, 350, 400]}
forest_classifier = RandomForestClassifier()
forest_grid = GridSearchCV(forest_classifier, forest_param, n_jobs=-1,
                           return_train_score=True, cv=10)
digit_data = digits.data
digit_target = digits.target
forest_grid.fit(digit_data, digit_target)

print("best forest validation score")
print(forest_grid.best_score_)
#visualize code
import numpy as np
import matplotlib.pyplot as plt

def plot_search_results(grid, lsi_log_index):
    """
    Params:
        grid: A trained GridSearchCV object.
        lsi_log_index: Indexes of the subplots whose x-axis should be log-scaled.
    """
    ## Results from grid search
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']

    ## Getting indexes of values per hyper-parameter
    masks = []
    masks_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_' + p_k].data == p_v))

    params = grid.param_grid

    ## Plotting results
    fig, ax = plt.subplots(1, len(params), sharex='none', sharey='all', figsize=(20, 5))
    fig.suptitle('Score per parameter')
    fig.text(0.04, 0.5, 'MEAN SCORE', va='center', rotation='vertical')
    for i, p in enumerate(masks_names):
        # Hold every other parameter at its best value and vary only p
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        y_2 = np.array(means_train[best_index])
        e_2 = np.array(stds_train[best_index])
        ax[i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[i].errorbar(x, y_2, e_2, linestyle='-', marker='^', label='train')
        ax[i].set_xlabel(p.upper())
    for log_scaler in lsi_log_index:
        ax[log_scaler].set_xscale("log")
    plt.legend()
    plt.show()
plot_search_results(forest_grid,[])
I expect the validation score to shrink when overfitting occurs, like with the SVR C parameter (Image1): there, the validation score shrinks when overfitting occurs.
But the max_depth parameter's validation score does not shrink when overfitting occurs (Image2).
I've learned that the validation score shrinks when an overfitting situation occurs.
Can you tell me why this happens? :)
Well, it all depends on the dataset. From your Image2, we can see that for your RandomForestClassifier, max_depth is not overfitting your train set. Your trees are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples (2 by default). Those conditions ensure that your trees do not expand to their maximum depth. Therefore your model is not overfitting.
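You can check this directly on the digits data: a quick sketch (using scikit-learn's `get_depth()` on each fitted tree) fits a forest with no depth cap and reports how deep the trees actually grow before every leaf becomes pure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# No max_depth limit: growth stops only when leaves are pure
# or contain fewer than min_samples_split samples.
forest = RandomForestClassifier(max_depth=None, n_estimators=50, random_state=0)
forest.fit(X, y)

depths = [tree.get_depth() for tree in forest.estimators_]
print("natural tree depths:", min(depths), "to", max(depths))
```

If the natural depths stay near (or below) the largest max_depth in your grid, raising max_depth further cannot change the trees, so the validation score has no reason to drop.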
On the other hand, with SVR, a large C parameter will ensure that all samples are fitted as closely as possible. Hence the model is overfitting.
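A minimal sketch of that effect on the same digits data (using an RBF SVC with an illustrative gamma=0.001, which I'm picking here just for demonstration): as C grows, the training accuracy is pushed toward 1.0 while the cross-validated score lags behind it.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

train_acc = {}
for C in (0.01, 1e6):
    clf = SVC(C=C, gamma=0.001)
    # Validation score via 5-fold CV, train score on the full fit
    val = cross_val_score(clf, X, y, cv=5).mean()
    train_acc[C] = clf.fit(X, y).score(X, y)
    print(f"C={C}: train={train_acc[C]:.3f}, validation={val:.3f}")
```

The growing train/validation gap at large C is the shrink you saw in Image1; max_depth never produces it here because the trees stop growing on their own first.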