I want to undersample 3 cross-validation folds from a dataset using, say, RandomUnderSampler from imblearn, and then optimize the hyperparameters of various GBMs using those undersampled folds as input.
The code I have so far is:
from imblearn.pipeline import Pipeline  # imblearn's Pipeline supports samplers
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.preprocessing import MinMaxScaler

def train_model_with_undersampling(undersampler, estimator, scale, params, X_train, y_train):
    # The resampler must live inside a pipeline: we are using
    # cross-validation to optimize hyperparameters, so the left-out fold
    # must stay unresampled for evaluation.
    # The only way is with imblearn's Pipeline (see the imblearn docs);
    # annoying, because we resample on every fit.
    if scale is True:
        pipe = Pipeline([
            ("scaler", MinMaxScaler()),
            ("sampler", undersampler),
            ("model", estimator),
        ])
    else:
        pipe = Pipeline([
            ("sampler", undersampler),
            ("model", estimator),
        ])
    search = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=params,
        n_candidates="exhaust",
        factor=3,  # only a third of the candidates are promoted
        resource="model__n_estimators",  # the limiting resource
        max_resources=500,  # max number of trees
        min_resources=10,
        scoring="roc_auc",
        cv=3,
        random_state=10,
        refit=True,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search
However, this function re-runs the undersampling for every model (and every candidate fit) during tuning. This is inefficient, because the undersampled folds are identical every time.
What I would like is to somehow pass HalvingRandomSearchCV the undersampled train folds and the corresponding test folds.
In short, I want to undersample 3 different folds of X_train, and then use those folds to optimize the hyperparameters of XGBoost, CatBoost, GradientBoostingClassifier, and other models.
Is there a way to do so?
You can do this:
1. Get the initial folds with the .split() method of your sklearn CV object; it yields train and test indices for each fold.
2. Undersample the train-fold data with an imblearn sampler. You can discard the resampled data itself, since you only need the indices.
3. Read the fitted sampler's sample_indices_ attribute; these are positions within the train fold, so index the train-fold indices with them to recover indices into the full dataset.
4. For each fold, save the tuple (fold_train_sampled_indices, fold_test_indices).
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import KFold

def cv_undersample_split(X, y, cv, imb_sampler):
    # X and y are assumed to be numpy arrays (use .values / .iloc for DataFrames)
    folds = []
    for fold_train_idx, fold_test_idx in cv.split(X, y):
        # the resampled data is discarded; we only need sample_indices_
        imb_sampler.fit_resample(X[fold_train_idx], y[fold_train_idx])
        # sample_indices_ are positions within the fold; map back to the full dataset
        fold_train_sampled_idx = fold_train_idx[imb_sampler.sample_indices_]
        folds.append((fold_train_sampled_idx, fold_test_idx))
    return folds

folds = cv_undersample_split(
    X=X_train, y=y_train, cv=KFold(3), imb_sampler=RandomUnderSampler()
)
Now you can pass folds as the cv parameter of HalvingRandomSearchCV:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.pipeline import Pipeline  # no sampler step needed now, so sklearn's Pipeline suffices

estimators = [
    (GradientBoostingClassifier(), {"model__max_depth": [1, 3]}),
    (RandomForestClassifier(), {"model__max_depth": [1, 3]}),
]

for estimator, params in estimators:
    print(estimator)
    pipe = Pipeline([
        ("model", estimator),
    ])
    search = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=params,
        n_candidates="exhaust",
        factor=3,  # only a third of the candidates are promoted
        resource="model__n_estimators",  # the limiting resource
        max_resources=500,  # max number of trees
        min_resources=10,
        scoring="roc_auc",
        cv=folds,  # <---- use the pre-computed folds here
        random_state=10,
        refit=True,
        n_jobs=-1,
        verbose=True,
    )
    search.fit(X_train, y_train)
which gives the following output (among other things):
Fitting 3 folds for each of 2 candidates, totalling 6 fits