I have written the function below, which builds a pipeline and returns it.
```python
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline as ImblearnPipeline


def make_final_pipeline(columns_transformer, onehotencoder, estimator,
                        Name_of_estimator, index_of_categorical_features,
                        use_smote=True):
    if use_smote:
        # Final pipeline with SMOTE-NC and the estimator.
        finalPipeline = ImblearnPipeline(
            steps=[('col_transformer', columns_transformer),
                   ('smote', SMOTENC(categorical_features=index_of_categorical_features,
                                     sampling_strategy='auto')),
                   ('oneHotColumnEncoder', onehotencoder),
                   (Name_of_estimator, estimator)]
        )
    else:
        # Final pipeline with the estimator only.
        finalPipeline = ImblearnPipeline(
            steps=[('col_transformer', columns_transformer),
                   ('oneHotColumnEncoder', onehotencoder),
                   (Name_of_estimator, estimator)]
        )
    return finalPipeline
```
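For illustration, here is a minimal sketch of how it could be called (the toy data, transformers and classifier below are placeholders, not my actual setup):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy imbalanced data: two numeric columns and one integer-coded categorical column (index 2).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.normal(size=200),
                     rng.integers(0, 3, size=200)])
y = np.array([0] * 180 + [1] * 20)

# Scale the numeric columns; the categorical column is passed through and stays at index 2.
columns_transformer = ColumnTransformer([('scale', StandardScaler(), [0, 1])],
                                        remainder='passthrough')
# One-hot encode the categorical column after resampling.
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), [2])],
                                  remainder='passthrough')

pipeline = make_final_pipeline(columns_transformer, onehotencoder,
                               LogisticRegression(max_iter=1000), 'logreg',
                               index_of_categorical_features=[2], use_smote=True)
pipeline.fit(X, y)
```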
In the returned pipeline, the `SMOTENC` step is made optional through `use_smote`. However, according to [this question](Is it possible to toggle a certain step in sklearn pipeline?), it should be possible to create a custom `OptionalSMOTENC` that takes all the arguments of `SMOTENC` as well as `use_smote`, so that `make_final_pipeline` could be written as:
```python
def make_final_pipeline(columns_transformer, onehotencoder, estimator,
                        Name_of_estimator, index_of_categorical_features,
                        use_smote=True):
    # Final pipeline with the optional SMOTE-NC and the estimator.
    finalPipeline = ImblearnPipeline(
        steps=[('col_transformer', columns_transformer),
               ('smote', OptionalSMOTENC(categorical_features=index_of_categorical_features,
                                         sampling_strategy='auto', use_smote=use_smote)),
               ('oneHotColumnEncoder', onehotencoder),
               (Name_of_estimator, estimator)]
    )
    return finalPipeline
```
I guess that `OptionalSMOTENC` should look something like this:
```python
class OptionalSMOTENC(SMOTENC):
    def __init__(self, categorical_features, sampling_strategy='auto', use_smote=True):
        super().__init__()
        self.categorical_features = categorical_features
        self.sampling_strategy = sampling_strategy
        self.use_smote = use_smote

    def fit(self, X, y=None):
        if self.use_smote:
            pass  # fit SMOTENC
        else:
            pass  # do nothing

    def fit_resample(self, X, y=None):
        if self.use_smote:
            pass  # fit_resample SMOTENC
        else:
            pass  # do nothing
```
But I do not know how to write it correctly: can I write `class OptionalSMOTENC(SMOTENC)`, or should I just write `class OptionalSMOTENC()`? Did I put `super().__init__()` in the right place?

To conclude, I am not familiar with writing such an estimator; could you help me?
I was finally able to come up with a solution:
```python
class OptionalSMOTENC(SMOTENC):
    def __init__(self, categorical_features, sampling_strategy='auto',
                 random_state=None, k_neighbors=5, n_jobs=None, use_smote=True):
        super().__init__(categorical_features, sampling_strategy=sampling_strategy,
                         random_state=random_state, k_neighbors=k_neighbors,
                         n_jobs=n_jobs)
        self.use_smote = use_smote

    def fit(self, X, y=None):
        if self.use_smote:
            return SMOTENC.fit(self, X, y)
        else:
            return self

    def fit_resample(self, X, y=None):
        if self.use_smote:
            return SMOTENC.fit_resample(self, X, y)
        else:
            return X, y
```
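A nice consequence (if I am not mistaken) is that `use_smote` is now an ordinary hyperparameter of the `smote` step, so it can be toggled with `set_params` or searched over with `GridSearchCV`. A small sketch, reusing the placeholder data and transformers from above:

```python
from sklearn.model_selection import GridSearchCV

pipeline = make_final_pipeline(columns_transformer, onehotencoder,
                               LogisticRegression(max_iter=1000), 'logreg',
                               index_of_categorical_features=[2])

# Switch the sampler off without rebuilding the pipeline...
pipeline.set_params(smote__use_smote=False)

# ...or let a grid search decide whether oversampling helps.
search = GridSearchCV(pipeline, param_grid={'smote__use_smote': [True, False]},
                      scoring='f1', cv=3)
search.fit(X, y)
print(search.best_params_)
```

This works because every `__init__` argument of `OptionalSMOTENC` is stored as an attribute of the same name, which is what scikit-learn's `get_params`/`clone` machinery expects.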
From my understanding, one could replace `SMOTENC` with any estimator and create a class like:
```python
class OptionalEstimator(Estimator):
    def __init__(self, arg1, arg2, arg3, use_estimator=True):
        # Replace arg1, arg2, arg3 with the arguments of Estimator.
        super().__init__(arg1, arg2, arg3)
        self.use_estimator = use_estimator

    def fit(self, X, y=None):
        if self.use_estimator:
            return Estimator.fit(self, X, y)
        else:
            return self

    def transform(self, X):
        if self.use_estimator:
            return Estimator.transform(self, X)
        else:
            return X
```
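For example, a concrete instance of this pattern could be an optional scaler (just an illustrative sketch; `OptionalStandardScaler` and `use_scaler` are names I made up, not part of scikit-learn):

```python
from sklearn.preprocessing import StandardScaler


class OptionalStandardScaler(StandardScaler):
    def __init__(self, copy=True, with_mean=True, with_std=True, use_scaler=True):
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)
        self.use_scaler = use_scaler

    def fit(self, X, y=None):
        if self.use_scaler:
            return StandardScaler.fit(self, X, y)
        return self

    def transform(self, X):
        if self.use_scaler:
            return StandardScaler.transform(self, X)
        return X
```

Used as a pipeline step named, say, `'scaler'`, it can then be toggled with `param_grid={'scaler__use_scaler': [True, False]}` in the same way as the SMOTE step.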