When faced with a large dataset, I need to spend a day using GridSearchCV() to train an SVM with the best parameters. How can I save the best estimator so that I can use this trained estimator directly when I start my computer next time?
GridSearchCV only keeps a fitted best model if refit is enabled; with refit=False it just reports the parameter set that led to the highest score. refit=True is the default for a single-metric search, though it does no harm to state it explicitly. If you are using multiple metrics, you have to pass refit=name-of-your-decider-metric. The refit runs a final training step using the full dataset and the best parameters found. During the search itself, GridSearchCV obviously does not use the entire dataset for training, as each cross-validation split has to hold out part of the data for validation.
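To make the refit options concrete, here is a minimal sketch on a small synthetic dataset. The data, the parameter grid, and the metric names "acc" and "f1" are assumptions picked for illustration, not something from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset (assumption) so the search finishes in seconds.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"C": [0.1, 1, 10]}  # hypothetical grid, substitute your own

# Multiple metrics: refit must name the metric that decides the winner.
gcv = GridSearchCV(
    SVC(),
    param_grid,
    scoring={"acc": "accuracy", "f1": "f1"},  # metric names chosen here
    refit="f1",
    cv=3,
)
gcv.fit(X, y)

# The winning parameters are picked by mean test "f1", and
# best_estimator_ has been retrained on all of X, y with them.
print(gcv.best_params_)
print(gcv.best_estimator_)

# With refit=False the search only reports parameters and scores;
# no best_estimator_ attribute is created.
no_refit = GridSearchCV(SVC(), param_grid, refit=False, cv=3).fit(X, y)
print(no_refit.best_params_)
print(hasattr(no_refit, "best_estimator_"))
```

The per-metric results are still available in gcv.cv_results_ (for example under "mean_test_acc"), so the other metric is not lost by refitting on one of them.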
Once the search has been refit, you can get the model via the best_estimator_ attribute. You can then pickle that model using joblib and reload it the next day to do your prediction. In a mix of pseudo and real code, that would read like:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from joblib import dump, load
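svc = SVC()  # Probably not what you are using, but just as an example
parameters = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # example grid, substitute your own
gcv = GridSearchCV(svc, parameters, refit=True)
gcv.fit(X, y)  # X, y: your training data
estimator = gcv.best_estimator_  # already retrained on all of X, y
dump(estimator, "your-model.joblib")
# Somewhere else, e.g. in a new session the next day
estimator = load("your-model.joblib")
predictions = estimator.predict(X_new)  # X_new: placeholder for the samples you want to predict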
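If you want to convince yourself that the reloaded model behaves like the one you saved, a self-contained round trip along these lines should do. This is only a sketch: the synthetic data, the small grid, and the temporary file location are assumptions for the demo.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real (large) dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # hypothetical grid
gcv = GridSearchCV(SVC(), grid, refit=True, cv=3)
gcv.fit(X, y)

# Write the refit model to a temporary directory (assumed location).
model_path = os.path.join(tempfile.mkdtemp(), "your-model.joblib")
dump(gcv.best_estimator_, model_path)

# "Next day": load the file, no grid search needed.
restored = load(model_path)

# The restored model should predict exactly like the original one.
same = np.array_equal(restored.predict(X), gcv.best_estimator_.predict(X))
print(same)
```

You can also dump the whole gcv object instead of just best_estimator_ if you want to keep cv_results_ around. Either way, load the file with the same scikit-learn version that wrote it, and only load files you trust, since joblib files are pickles under the hood.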