python, scikit-learn, regression, grid-search, gridsearchcv

sklearn GridSearchCV gives questionable results


I have input data X_train with dimensions (477 x 200) and y_train with length 477. I want to fit a support vector machine regressor, so I am running a grid search over its hyperparameters.

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

regressor_2 = SVR()  # base estimator; the grid supplies the hyperparameters
param_grid = {'kernel': ['poly', 'rbf', 'linear', 'sigmoid'], 'degree': [2, 3, 4, 5], 'C': [0.01, 0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 5, 10]}
grid = GridSearchCV(estimator=regressor_2, param_grid=param_grid, scoring='neg_root_mean_squared_error', n_jobs=1, cv=3, verbose=1)
grid_result = grid.fit(X_train, y_train)

For grid_result.best_params_ I get {'C': 0.3, 'degree': 2, 'kernel': 'linear'} with a score of -7.76, and {'C': 10, 'degree': 2, 'kernel': 'rbf'} gives me -8.0.
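
For reference, these numbers can be read off the fitted search object; cv_results_ holds the mean validation score for every parameter combination tried:

print(grid_result.best_params_, grid_result.best_score_)

# mean validation score for every parameter combination in the grid
for params, score in zip(grid_result.cv_results_['params'],
                         grid_result.cv_results_['mean_test_score']):
    print(params, score)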

However, when I do

import numpy as np

regressor_opt = SVR(kernel='linear', degree=2, C=0.3)  # degree is ignored by non-poly kernels
regressor_opt.fit(X_train, y_train)

y_train_pred = regressor_opt.predict(X_train)
print("rmse =", np.sqrt(np.mean((y_train - y_train_pred) ** 2)))  # RMSE on the training set

I get 7.4, and when I do

regressor_2 = SVR(kernel='rbf', degree=2, C=10)
regressor_2.fit(X_train, y_train)

y_train_pred = regressor_2.predict(X_train)
print("rmse =", np.sqrt(np.mean((y_train - y_train_pred) ** 2)))  # RMSE on the training set

I get 5.9. This is clearly better than 7.4, but in the grid search the negative RMSE I got for that parameter combination was -8, and therefore worse than 7.4. Can anybody explain to me what is going on? Should I not use scoring='neg_root_mean_squared_error'?
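
To rule out a slip in my hand-rolled formula, the same RMSE can also be computed with scikit-learn's metrics helper (using y_train_pred from above):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE as the square root of the mean squared error
print("rmse =", np.sqrt(mean_squared_error(y_train, y_train_pred)))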


Solution

  • GridSearchCV will give you the score based on the left-out data; this is fundamentally how cross-validation works. When you train and assess on the full training set, you are not doing that cross-validation, so you will obtain an overly optimistic result. You see this slightly for the linear kernel (7.4 vs 7.76) and in more exaggerated form for the more flexible RBF kernel (5.9 vs 8). GridSearchCV has, I expect correctly, identified that your more flexible model does not generalise as well.

    You should be able to see this effect more clearly by taking your specific estimators (regressor_opt and regressor_2) and using sklearn's cross_validate() to get the results for the left-out folds, as in the sketch below. I expect you will see regressor_2 performing considerably worse than your optimistic value of 5.9. You may find that an informative exercise.
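
    A rough sketch of that check, reusing your X_train and y_train and matching the grid search's cv=3:

        from sklearn.model_selection import cross_validate
        from sklearn.svm import SVR
        import numpy as np

        # Score each of your two models on the left-out folds instead of the training data
        for name, reg in [('linear, C=0.3', SVR(kernel='linear', C=0.3)),
                          ('rbf, C=10', SVR(kernel='rbf', C=10))]:
            cv_res = cross_validate(reg, X_train, y_train, cv=3,
                                    scoring='neg_root_mean_squared_error')
            print(name, 'mean held-out RMSE:', -np.mean(cv_res['test_score']))

    I would expect the RBF model's held-out RMSE to land near the 8.0 the grid search reported, not near 5.9.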

    Remember, you want a model that will perform best on new data, not a model that fits arbitrarily well to your training data.

    I suggest further discussion of this does not belong on Stack Overflow, but on Cross Validated.