Suppose the following code (from the imblearn example on pipelines):
...
# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
# Create the classifier
knn = KNN(1)  # KNN: presumably KNeighborsClassifier, aliased in the elided imports
# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)  # tts: presumably train_test_split
# Add one transformer and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)
I want to make sure that when executing pipeline.predict(X_test) the sampling procedures enn and renn will not be executed (but of course the pca must be executed).
First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.
I browsed through the imblearn Pipeline code but I could not find the predict method there.
I also would like to be sure that this correct behaviour works when the pipeline is inside a GridSearchCV.
I just need some assurance that this is what happens with imblearn.Pipeline.
EDIT: 2020-08-28
@wundermahn's answer is all I needed.
This edit is just to add that this is the reason one should use imblearn.Pipeline for imbalanced pre-processing and not sklearn.Pipeline: nowhere in the imblearn documentation did I find an explanation of why imblearn.Pipeline is needed when sklearn.Pipeline already exists.
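To illustrate the difference with a minimal sketch (my own, not from the answer below): a sampler such as SMOTE exposes fit_resample rather than transform, so sklearn's Pipeline rejects it as an intermediate step, while imblearn's Pipeline accepts it:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.9, 0.1], random_state=0)
steps = [('sampling', SMOTE(random_state=0)),
         ('classification', KNeighborsClassifier())]

try:
    SkPipeline(steps).fit(X, y)   # sklearn: intermediate steps must implement fit and transform
except TypeError as exc:
    print(exc)

ImbPipeline(steps).fit(X, y)      # imblearn accepts samplers as intermediate steps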
Great question(s). To go through them in the order you posted:
- First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.
That is correct. You certainly do not want to test (whether on your test or validation data) on data that is not representative of the actual, live, "production" dataset. You should really only apply this to training data. Please note that if you are using techniques like k-fold cross-validation, you should apply the sampling to each fold individually, as indicated by this IEEE paper.
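For instance, a minimal sketch of doing that (my own illustration, not from the paper): putting the sampler inside an imblearn pipeline and handing it to cross_val_score means the sampler is re-fit on each training fold only, and every validation fold is scored untouched:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
pipe = make_pipeline(SMOTE(random_state=0), KNeighborsClassifier())

# SMOTE is applied to each training fold; the held-out fold is scored as-is.
scores = cross_val_score(pipe, X, y, cv=5, scoring='balanced_accuracy')
print(scores.mean())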
- I browsed through the imblearn Pipeline code but I could not find the predict method there.
I'm assuming you found the imblearn.pipeline source code; if you did, take a look at the fit_predict method:
@if_delegate_has_method(delegate="_final_estimator")
def fit_predict(self, X, y=None, **fit_params):
"""Apply `fit_predict` of last step in pipeline after transforms.
Applies fit_transforms of a pipeline to the data, followed by the
fit_predict method of the final estimator in the pipeline. Valid
only if the final estimator implements fit_predict.
Parameters
----------
X : iterable
Training data. Must fulfill input requirements of first step of
the pipeline.
y : iterable, default=None
Training targets. Must fulfill label requirements for all steps
of the pipeline.
**fit_params : dict of string -> object
Parameters passed to the ``fit`` method of each step, where
each parameter name is prefixed such that parameter ``p`` for step
``s`` has key ``s__p``.
Returns
-------
y_pred : ndarray of shape (n_samples,)
The predicted target.
"""
Xt, yt, fit_params = self._fit(X, y, **fit_params)
with _print_elapsed_time('Pipeline',
self._log_message(len(self.steps) - 1)):
y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
return y_pred
Here we can see that the pipeline ultimately relies on the predict method of the final estimator in the pipeline; in the example you posted, that is scikit-learn's knn:
def predict(self, X):
"""Predict the class labels for the provided data.
Parameters
----------
X : array-like of shape (n_queries, n_features), \
or (n_queries, n_indexed) if metric == 'precomputed'
Test samples.
Returns
-------
y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
Class labels for each data sample.
"""
X = check_array(X, accept_sparse='csr')
neigh_dist, neigh_ind = self.kneighbors(X)
classes_ = self.classes_
_y = self._y
if not self.outputs_2d_:
_y = self._y.reshape((-1, 1))
classes_ = [self.classes_]
n_outputs = len(classes_)
n_queries = _num_samples(X)
weights = _get_weights(neigh_dist, self.weights)
y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
for k, classes_k in enumerate(classes_):
if weights is None:
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
else:
mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)
mode = np.asarray(mode.ravel(), dtype=np.intp)
y_pred[:, k] = classes_k.take(mode)
if not self.outputs_2d_:
y_pred = y_pred.ravel()
return y_pred
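If you also want an empirical check that the samplers are skipped at predict time (my addition, not part of the original answer), imblearn's FunctionSampler can wrap a counting function so you can watch when the sampling step actually runs:
from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

calls = {'n': 0}

def counting_sampler(X, y):
    # Pass the data through unchanged, but record that the sampler ran.
    calls['n'] += 1
    return X, y

X, y = make_classification(weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = make_pipeline(PCA(n_components=2),
                     FunctionSampler(func=counting_sampler),
                     KNeighborsClassifier(n_neighbors=1))

pipe.fit(X_train, y_train)
print(calls['n'])   # 1 -> the sampler ran once, during fit
pipe.predict(X_test)
print(calls['n'])   # still 1 -> the sampler was not executed by predict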
- I also would like to be sure that this correct behaviour works when the pipeline is inside a GridSearchCV
This builds on the two points above, and I take it to mean you want a complete, minimal, reproducible example of this working inside a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below:
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
param_grid = [
{
'classification__n_neighbors': [1,3,5,7,10],
}
]
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)
pipe = Pipeline([
    ('sampling', SMOTE()),                      # resamples only during fit, i.e. on each training fold
    ('classification', KNeighborsClassifier())  # the only step invoked when predicting
])
grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
mean_scores = np.array(grid.cv_results_['mean_test_score'])
print(mean_scores)
# [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]
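As a follow-up (not in the snippet above), the refit best pipeline can then be evaluated on the held-out test set; again, the SMOTE step is bypassed at predict time:
from sklearn.metrics import balanced_accuracy_score

y_hat = grid.best_estimator_.predict(X_test)   # only the classifier's predict runs here
print(balanced_accuracy_score(y_test, y_hat))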
Your intuition was spot on, good job :)