[SOLVED] Fails to save model after running GridSearchCV with a scikit pipeline

I have the following toy example that reproduces the issue:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=30, n_features=5, noise=0.2)

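# Single-step pipeline: the XGBoost regressor is registered under the name 'reg'
reg = xgb.XGBRegressor(tree_method='hist', eval_metric='mae', n_jobs=4)
pipeline = Pipeline(steps=[('reg', reg)])

# The 'reg__' prefix routes the parameter to the 'reg' step of the pipeline
param_grid = {'reg__max_depth': [2, 4, 6]}
cv = 3
model = GridSearchCV(pipeline, param_grid, cv=cv, scoring='neg_mean_absolute_error')

# fit() returns the fitted GridSearchCV itself, so best_model is the same object as model
best_model = model.fit(X=X, y=y)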

Then the following four methods fail to save the fitted model:

model.save_model('test_1.json')

# AttributeError: 'GridSearchCV' object has no attribute 'save_model'

best_model.save_model('test2.json')

# AttributeError: 'GridSearchCV' object has no attribute 'save_model'

best_model.best_estimator_.save_model('test3.json')

# AttributeError: 'Pipeline' object has no attribute 'save_model'

model.best_estimator_.save_model('test4.json')

# AttributeError: 'Pipeline' object has no attribute 'save_model'

But these two calls work:

import joblib
joblib.dump(model.best_estimator_, 'naive_model.joblib')
joblib.dump(best_model.best_estimator_, 'naive_best_model.joblib')

Can anyone tell me whether the way I constructed my pipeline is what breaks saving the best model?

Solution

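Only the XGBoost estimator itself (here, the XGBRegressor) has a "save_model" method. GridSearchCV is a different object that wraps the estimator you pass in, and Pipeline is yet another wrapper, so neither of them exposes "save_model". That is why all four calls raise AttributeError: model and best_model are the same GridSearchCV object, and best_estimator_ is a Pipeline. To reach the XGBoost estimator, index the pipeline by step name: model.best_estimator_['reg'].save_model('test.json'). Note that this saves only the XGBoost model, without any data transformations from the pipeline. "joblib" and "pickle" are more universal solutions, imho, because they serialize the whole pipeline.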