Tags: python, machine-learning, imblearn

Does imblearn pipeline turn off sampling for testing?


Let us suppose the following code (from the imblearn example on pipelines):

# (imports assumed here, matching the imblearn pipeline example)
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split as tts
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)

...  # data creation elided; X and y are assumed to be defined here

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)

pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)

I want to make sure that when executing pipeline.predict(X_test), the sampling procedures enn and renn will not be executed (but of course the pca must be executed).

  1. First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

  2. I browsed through the imblearn Pipeline code, but I could not find the predict method there.

  3. I also would like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.

I just need some assurance that this is what happens with the imblearn.Pipeline.

EDIT: 2020-08-28

@wundermahn's answer is all I needed.

This edit is just to add that this is the reason one should use imblearn.Pipeline for imbalanced pre-processing rather than sklearn.Pipeline. Nowhere in the imblearn documentation did I find an explanation of why imblearn.Pipeline is needed when sklearn.Pipeline already exists. A minimal sketch of the difference is below.
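The sketch (SMOTE and LogisticRegression are just illustrative choices): sklearn.Pipeline requires every intermediate step to implement transform, which samplers do not, whereas imblearn.Pipeline also accepts steps that implement fit_resample and runs them only during fit.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# sklearn's Pipeline rejects SMOTE: it has no transform method.
try:
    SkPipeline([('smote', SMOTE()), ('clf', LogisticRegression())]).fit(X, y)
except TypeError as e:
    print(e)  # "All intermediate steps should be transformers..."

# imblearn's Pipeline accepts samplers and applies them during fit only.
ImbPipeline([('smote', SMOTE()), ('clf', LogisticRegression())]).fit(X, y)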


Solution

  • Great question(s). To go through them in the order you posted:

    1. First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

    That is correct. You certainly do not want to test (whether on your test or validation data) on data that is not representative of the actual, live, "production" dataset. You should really only apply sampling to training data. Please note that if you are using techniques like cross-validation, you should apply the sampling to each fold individually, as indicated by this IEEE paper; a sketch of that with cross_val_score follows below.
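    Here is that sketch, with SMOTE and KNeighborsClassifier chosen purely for illustration: because the sampler lives inside an imblearn pipeline, cross_val_score resamples each training fold separately and scores each held-out validation fold untouched.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)

    # The sampler is a pipeline step, so each CV split oversamples only
    # its own training fold; the validation fold is scored as-is.
    pipe = make_pipeline(SMOTE(random_state=0), KNeighborsClassifier())
    print(cross_val_score(pipe, X, y, cv=5, scoring='f1'))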

    2. I browsed through the imblearn Pipeline code, but I could not find the predict method there.

    I'm assuming you found the imblearn.pipeline source code. If so, take a look at the fit_predict method:

    @if_delegate_has_method(delegate="_final_estimator")
    def fit_predict(self, X, y=None, **fit_params):
        """Apply `fit_predict` of last step in pipeline after transforms.
        Applies fit_transforms of a pipeline to the data, followed by the
        fit_predict method of the final estimator in the pipeline. Valid
        only if the final estimator implements fit_predict.
        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first step of
            the pipeline.
        y : iterable, default=None
            Training targets. Must fulfill label requirements for all steps
            of the pipeline.
        **fit_params : dict of string -> object
            Parameters passed to the ``fit`` method of each step, where
            each parameter name is prefixed such that parameter ``p`` for step
            ``s`` has key ``s__p``.
        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            The predicted target.
        """
        Xt, yt, fit_params = self._fit(X, y, **fit_params)
        with _print_elapsed_time('Pipeline',
                                 self._log_message(len(self.steps) - 1)):
            y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
        return y_pred
    

    In both cases, the pipeline ends up delegating to the corresponding method of the final estimator; in the example you posted, that is the .predict method of scikit-learn's knn:

    def predict(self, X):
        """Predict the class labels for the provided data.
        Parameters
        ----------
        X : array-like of shape (n_queries, n_features), \
                or (n_queries, n_indexed) if metric == 'precomputed'
            Test samples.
        Returns
        -------
        y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
            Class labels for each data sample.
        """
        X = check_array(X, accept_sparse='csr')

        neigh_dist, neigh_ind = self.kneighbors(X)
        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]

        n_outputs = len(classes_)
        n_queries = _num_samples(X)
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            if weights is None:
                mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
            else:
                mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

            mode = np.asarray(mode.ravel(), dtype=np.intp)
            y_pred[:, k] = classes_k.take(mode)

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred
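    As a quick hedged check that no resampling happens at predict time (make_classification here is just a stand-in for your data), fit the pipeline from your question on an imbalanced set and confirm you get exactly one prediction per test row:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import EditedNearestNeighbours

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = make_pipeline(PCA(n_components=2), EditedNearestNeighbours(),
                         KNeighborsClassifier(n_neighbors=1))
    pipe.fit(X_train, y_train)  # the sampler shrinks the training set here

    y_hat = pipe.predict(X_test)  # the sampler is skipped here
    assert y_hat.shape[0] == X_test.shape[0]  # one prediction per test row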
    
    3. I also would like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.

    This sort of follows from the above two answers, and I am taking it to mean you want a complete, minimal, reproducible example of this working inside a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below:

    import numpy as np
    
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    
    param_grid = [
        {
            'classification__n_neighbors': [1,3,5,7,10],
        }
    ]
    
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)
    
    pipe = Pipeline([
        ('sampling', SMOTE()),
        ('classification', KNeighborsClassifier())
    ])
    
    grid = GridSearchCV(pipe, param_grid=param_grid)
    grid.fit(X_train, y_train)
    mean_scores = np.array(grid.cv_results_['mean_test_score'])
    print(mean_scores)
    
    # [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]
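
    As a hedged follow-up, once the search has run you can score the refit best pipeline on the held-out split; at predict time the SMOTE step is skipped, so only the fitted KNN sees X_test:

    print(grid.best_params_)           # e.g. {'classification__n_neighbors': 3}
    print(grid.score(X_test, y_test))  # accuracy on the untouched test split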
    

    Your intuition was spot on, good job :)