python, save, gridsearchcv

How to save the best estimator in GridSearchCV?


When faced with a large dataset, I need to spend a day using GridSearchCV() to train an SVM with the best parameters. How can I save the best estimator so that I can use this trained estimator directly when I start my computer next time?


Solution

  • GridSearchCV can keep the best model itself, not just the parameter set that led to the highest score. This is controlled by the refit argument, which defaults to True; if you are using multiple metrics, you have to set refit=name-of-your-decider-metric instead. With refit enabled, a final training step is run on the full dataset using the best parameters found. That extra step is needed because, while searching for the optimal parameters, GridSearchCV does not train on the entire dataset: each cross-validation split holds out part of the data for validation.

    Once the search has been fitted with refit enabled, you can get the model via the best_estimator_ attribute. You can then pickle that model using joblib and reload it the next day to make your predictions. In a mix of pseudo and real code, that would read like:

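    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from joblib import dump, load

    # Example grid only; replace with the parameters you want to search
    parameters = {"kernel": ["linear", "rbf"], "C": [1, 10]}

    svc = SVC()  # Probably not what you are using, but just as an example
    gcv = GridSearchCV(svc, parameters, refit=True)  # refit=True is the default
    gcv.fit(X, y)  # X, y: your training data
    estimator = gcv.best_estimator_  # refit on all of X, y with the best parameters
    dump(estimator, "your-model.joblib")

    # Somewhere else (e.g. after restarting your computer)
    estimator = load("your-model.joblib")
    predictions = estimator.predict(X_new)  # X_new: the new samples to predict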
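
For a version you can run end to end, here is a minimal sketch of the same save-and-reload round trip. The synthetic dataset, the parameter grid, the cv=3 setting and the file name best-svc.joblib are all assumptions made for illustration and are not taken from the question; swap in your own data and grid.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset standing in for the real (large) one
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical parameter grid, chosen only for illustration
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# refit=True (the default) retrains the best configuration on all of X, y
search = GridSearchCV(SVC(), param_grid, cv=3, refit=True)
search.fit(X, y)

# Persist only the refitted best model to a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "best-svc.joblib")
dump(search.best_estimator_, model_path)

# Later, e.g. in a fresh Python session
restored = load(model_path)

# The restored model should predict exactly like the fitted search object
same_predictions = np.array_equal(restored.predict(X), search.predict(X))
print(search.best_params_)
print(same_predictions)
```

If you also want the search history (cv_results_, best_params_, best_score_), you can dump the fitted GridSearchCV object itself instead of only best_estimator_. Keep in mind that joblib files are pickle-based, so only load files you trust, and load them with the same scikit-learn version that wrote them where possible.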