Tags: scikit-learn, shuffle, gridsearchcv, overfitting-underfitting

GridSearchCV - shuffle data for every single parameter combination


I am using GridSearchCV to determine model hyperparameters:

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[(self.FE, FE_algorithm), (self.CA, Class_algorithm)])
param_grid = {**FE_grid, **CA_grid}

scorer = make_scorer(f1_score, average='macro')

search = GridSearchCV(
    pipe,
    param_grid,
    cv=ShuffleSplit(test_size=0.20, n_splits=5, random_state=0),
    n_jobs=-1,
    verbose=3,
    scoring=scorer,
)

search.fit(self.data_input, self.data_output)

However, I believe I am running into problems with overfitting.

I would like the data to be reshuffled for every single parameter combination. Is there any way to do this? Currently, the cross-validation splits are fixed, so the same sets of validation data are evaluated for every parameter combination, and overfitting to those particular splits is becoming an issue.
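To illustrate the issue: with a fixed `random_state`, `ShuffleSplit` regenerates identical train/validation indices on every call, so every candidate in the grid is scored on exactly the same validation sets (a minimal sketch with toy data):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data just to exercise the splitter.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

cv = ShuffleSplit(test_size=0.20, n_splits=5, random_state=0)

# Generating the splits twice yields identical index sets, so every
# parameter combination in GridSearchCV sees the same validation folds.
splits_a = [(train.tolist(), test.tolist()) for train, test in cv.split(X, y)]
splits_b = [(train.tolist(), test.tolist()) for train, test in cv.split(X, y)]
print("all 5 splits identical across calls:", splits_a == splits_b)
```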


Solution

  • No, there isn't. The search splits the data once and creates one task per (fold, parameter combination) pair (source).

    Shuffling per parameter combination is probably not desirable anyway: the selection might then just pick the "easiest" split instead of the "best" parameter. If you think you are overfitting to the validation folds, then consider using

    1. fewer parameter options
    2. more folds, or repeated splits*
    3. a scoring callable that customizes evaluation
    4. models that are more conservative

    *my favorite among these, although the computation cost may be too high
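A sketch of option 2 with repeated splits: `RepeatedStratifiedKFold` re-partitions the data `n_repeats` times, so each candidate is scored on 5 × 3 = 15 different validation sets and is less able to get lucky on any single split. The classifier and parameter grid below are hypothetical stand-ins for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic binary-classification data as a stand-in for real input.
X, y = make_classification(n_samples=200, random_state=0)

# 5 folds repeated 3 times -> 15 train/validation splits per candidate,
# averaging away the noise of any single partition.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),     # placeholder estimator
    param_grid={"C": [0.1, 1.0, 10.0]},    # placeholder grid
    cv=cv,
    scoring=make_scorer(f1_score, average="macro"),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note the computational cost scales linearly with `n_repeats`, which is the trade-off mentioned above.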