I want to undersample 3 cross-validation folds from a dataset using, say, RandomUnderSampler from imblearn, and then optimize the hyperparameters of various GBMs using those undersampled folds as input.
The code I have so far is:
from imblearn.pipeline import Pipeline  # imblearn's Pipeline supports samplers
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.preprocessing import MinMaxScaler

def train_model_with_undersampling(undersampler, estimator, scale, params, X_train, y_train):
    # The resampler must live inside a pipeline: we are using
    # cross-validation to optimize hyperparameters, so the left-out fold
    # must stay unresampled for evaluation.
    # The only way is with imblearn's Pipeline (see the imblearn docs);
    # annoying, because we resample on every fit.
    if scale is True:
        pipe = Pipeline([
            ("scaler", MinMaxScaler()),
            ("sampler", undersampler),
            ("model", estimator),
        ])
    else:
        pipe = Pipeline([
            ("sampler", undersampler),
            ("model", estimator),
        ])
    search = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=params,
        n_candidates="exhaust",
        factor=3,  # only a third of the candidates are promoted
        resource="model__n_estimators",  # the limiting resource
        max_resources=500,  # max number of trees
        min_resources=10,
        scoring="roc_auc",
        cv=3,
        random_state=10,
        refit=True,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search
However, this function re-runs the undersampling for every model (and every candidate fit) during tuning. This is inefficient, because the undersampled folds are identical every time.
What I would like is to somehow pass HalvingRandomSearchCV the undersampled train folds and the corresponding test folds.
In short, I want to undersample 3 different folds of X_train, and then use those folds to optimize the hyperparameters of XGBoost, CatBoost, GradientBoostingClassifier, and other models.
Is there a way to do so?
You can do this:
1. Get the initial folds with the .split() method of your sklearn CV object; it yields train and test indices for each fold.
2. Undersample the train-fold data with an imblearn sampler. You can discard the resampled data itself, since you only need the indices.
3. Read the fitted sampler's sample_indices_ attribute; these are positions within the train fold, so index the train-fold indices with them to recover indices into the full dataset.
4. For each fold, save the tuple (fold_train_sampled_indices, fold_test_indices).
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import KFold

def cv_undersample_split(X, y, cv, imb_sampler):
    # X and y are assumed to be numpy arrays (use .values / .iloc for DataFrames)
    folds = []
    for fold_train_idx, fold_test_idx in cv.split(X, y):
        # the resampled data is discarded; we only need sample_indices_
        imb_sampler.fit_resample(X[fold_train_idx], y[fold_train_idx])
        # sample_indices_ are positions within the fold; map back to the full dataset
        fold_train_sampled_idx = fold_train_idx[imb_sampler.sample_indices_]
        folds.append((fold_train_sampled_idx, fold_test_idx))
    return folds

folds = cv_undersample_split(
    X=X_train, y=y_train, cv=KFold(3), imb_sampler=RandomUnderSampler()
)
Now you can pass folds as the cv parameter of HalvingRandomSearchCV:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.pipeline import Pipeline  # no sampler step needed now, so sklearn's Pipeline suffices

estimators = [
    (GradientBoostingClassifier(), {"model__max_depth": [1, 3]}),
    (RandomForestClassifier(), {"model__max_depth": [1, 3]}),
]

for estimator, params in estimators:
    print(estimator)
    pipe = Pipeline([
        ("model", estimator),
    ])
    search = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=params,
        n_candidates="exhaust",
        factor=3,  # only a third of the candidates are promoted
        resource="model__n_estimators",  # the limiting resource
        max_resources=500,  # max number of trees
        min_resources=10,
        scoring="roc_auc",
        cv=folds,  # <---- use the pre-computed folds here
        random_state=10,
        refit=True,
        n_jobs=-1,
        verbose=True,
    )
    search.fit(X_train, y_train)
which gives the following output (among other things):
Fitting 3 folds for each of 2 candidates, totalling 6 fits