python-3.x, machine-learning, classification, imbalanced-data, oversampling

Oversampling (SMOTE) does not work properly when fitted inside a pipeline


I have an imbalanced classification problem and I am using make_pipeline from imblearn.

The steps are as follows:

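from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import RobustScaler

kf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
params = {
    'max_depth': [2, 3, 5],
    # 'max_features': ['auto', 'sqrt', 'log2'],
    # 'min_samples_leaf': [5, 10, 20, 50, 100, 200, 300],
    'n_estimators': [10, 25, 30, 50],
    # 'bootstrap': [True, False],
}
# Pipeline steps: oversample with SMOTE, scale, then classify
imba_pipeline = make_pipeline(SMOTE(random_state=42), RobustScaler(), RandomForestClassifier(random_state=42))
imba_pipeline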

out:Pipeline(steps=[('smote', SMOTE(random_state=42)),
                ('robustscaler', RobustScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=42))])

new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                        return_train_score=True, n_jobs=-1, verbose=2)

grid_imba.fit(X_train, y_train)

Everything runs fine and I get to the end of my problem (i.e. I can see the classification report).

However, when I try to look inside the black box with eli5, using eli.explain_weights(imba_pipeline),

I get the following error:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=42)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't

I know that this is a common problem and I have read the related questions, but I am confused because the error occurs after my classification procedure has finished.

Any suggestions?


Solution

  • Your pipeline has two fitted steps (plus the scaler): the SMOTE oversampling and the random forest. SMOTE is a sampler rather than a transformer (it implements fit_resample but not transform), and this appears to be what confuses eli5, which seems to assume that every step before the final estimator is an ordinary transformer. To get the weight explanation of the random forest, you could try calling eli5 only on that step of the pipeline with

    from eli5 import explain_weights
    
    explain_weights(imba_pipeline['randomforestclassifier'])
    

    provided the pipeline is fitted. In your code, however, you fitted the grid search, which fits clones and leaves imba_pipeline itself unfitted, so

    explain_weights(grid_imba.best_estimator_['randomforestclassifier'])
    

    would be more appropriate.