Tags: python, scikit-learn, cross-validation, calibration, oversampling

Combination of CalibratedClassifierCV with RandomOverSampler


When using a classifier like GaussianNB(), the resulting .predict_proba() values are sometimes poorly calibrated; that's why I'd like to wrap this classifier in sklearn's CalibratedClassifierCV.

I now have a binary classification problem with only very few positive samples - so few that CalibratedClassifierCV fails because there are fewer samples than folds (the resulting error is then Requesting 5-fold cross-validation but provided less than 5 examples for at least one class.). Thus, I'd like to upsample the minority class before applying the classifier. I use imblearn's pipeline for this, as it ensures that resampling takes place only during fit and not during inference.

However, I cannot find a way to upsample my training data and combine it with CalibratedClassifierCV while ensuring that upsampling takes place only during fit and not during inference.

I tried the following reproducible example, but it seems that CalibratedClassifierCV splits the data before the pipeline's upsampling step runs, and it fails.

Is there a way to correctly upsample data while using CalibratedClassifierCV?

from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(
    n_samples = 100,
    n_features = 10,
    n_classes = 2,
    weights = (0.95,), # 5% of samples are of class 1
    random_state = 10,
    shuffle = True
)

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 10,
    shuffle = True,
    stratify = y
)

pipeline = Pipeline([
    ("resampling", RandomOverSampler(
        sampling_strategy=0.2,
        random_state=10
    )),
    ("model", GaussianNB())
])

m = CalibratedClassifierCV(
    base_estimator=pipeline,
    method="isotonic",
    cv=5,
    n_jobs=-1
)

m.fit(X_train, y_train) # results in error

Solution

  • I think I now understand my conceptual error: the cross-validation split has to happen BEFORE upsampling, not after (otherwise there would be information leakage from the validation folds into training). But if the split happens first, I cannot have more folds than samples of the positive class. Thus, oversampling does not save me from having too few samples for CalibratedClassifierCV, and I indeed have to reduce the number of folds, as @NMH1013 suggests (a sketch of that fix follows below).
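
A minimal sketch of that fix, reusing the pipeline from the question above: cap the number of folds at the minority-class count in the training split. The helper names n_minority and n_folds are my own; note also that on scikit-learn >= 1.2 the parameter is called estimator rather than base_estimator.

import numpy as np

# Every CV fold needs at least one sample of each class, so cap the
# number of folds at the size of the rarest class in the training split.
n_minority = np.bincount(y_train).min()
n_folds = min(5, n_minority)

m = CalibratedClassifierCV(
    base_estimator=pipeline,  # 'estimator=' on scikit-learn >= 1.2
    method="isotonic",
    cv=n_folds,
    n_jobs=-1
)

m.fit(X_train, y_train)  # fits now: every fold contains the minority class

With only a handful of positives per fold, the scikit-learn docs suggest method="sigmoid" rather than "isotonic", since isotonic regression tends to overfit small calibration sets. The oversampling step inside the pipeline still runs only on each fold's training portion, so the calibration folds stay leakage-free.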