python, machine-learning, scikit-learn, feature-selection, gbm

Integration of RFE with GBM for Feature Selection and Hyperparameter Tuning


My name is Lucas, and I'm relatively new to machine learning. I wrote this code with the help of some online documentation and tutorials, and I'd like some help understanding whether my integration of RFE() with GradientBoostingClassifier() (GBM) is correct.

# Imports assumed by the snippet below
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def evaluateAlgorithm(X_train, X_test, y_train, y_test, dataset):
    Kfold = StratifiedKFold(n_splits=20, shuffle=True)

    GBM = GradientBoostingClassifier(
        loss='log_loss', learning_rate=0.01,
        n_estimators=1000, subsample=0.9,
        min_samples_split=2, min_samples_leaf=1,
        min_weight_fraction_leaf=0.0, max_depth=8,
        init=None, random_state=None,
        max_features=None, verbose=0,
        max_leaf_nodes=None, warm_start=False)

    pipeline = Pipeline(steps=[('feature_selection', RFE(GBM)),
                               ('model', GBM)])

    parameters = {'model__learning_rate': [0.01, 0.02, 0.03],
                  'model__subsample': [0.9, 0.5, 0.3, 0.1],
                  'model__n_estimators': [100, 500, 1000],
                  'model__max_depth': [1, 2, 3],
                  'feature_selection__n_features_to_select': [7, 14, 27]}

    grid_GBM = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=Kfold,
                            verbose=1, n_jobs=-1, refit=True, scoring='accuracy')
    grid_GBM.fit(X_train, y_train)

    print("\n=========================================================================")
    print(" Results from Grid Search Gradient Boosting")
    print("=========================================================================")
    print("\n The best estimator across ALL searched params: \n", grid_GBM.best_estimator_)
    print("\n The best score across ALL searched params: \n", grid_GBM.best_score_)
    print("\n The best parameters across ALL searched params: \n", grid_GBM.best_params_)
    print("\n=========================================================================")

    # Obtain the features selected by RFE (support_ is a boolean mask over columns)
    rfe_support_mask = grid_GBM.best_estimator_['feature_selection'].support_
    rfe_selected_features_names = X_train.columns[rfe_support_mask]
    print("Features selected by RFE:", rfe_selected_features_names)

    model_GBM = grid_GBM.best_estimator_

    # Cross-validation
    cv_results_GBM = cross_val_score(model_GBM, X_train, y_train, cv=Kfold, scoring='accuracy', n_jobs=-1, verbose=0)

    print()
    print("Cross Validation results Gradient Boosting: ", cv_results_GBM)
    prt_string = "CV Mean accuracy: %f (Std: %f)" % (cv_results_GBM.mean(), cv_results_GBM.std())
    print(prt_string)

    trained_Model_GBM = model_GBM.fit(X_train, y_train)

    print()
    print('=========================================================')
    print()
    print(trained_Model_GBM.get_params(deep=True))
    print()
    print('=========================================================')

    # Make predictions on the test set
    pred_Labels_GBM = trained_Model_GBM.predict(X_test)
    pred_proba_GBM = trained_Model_GBM.predict_proba(X_test)

    # Evaluate performance on the test set
    # (pos_label='positive' assumes the positive class is labelled 'positive')
    print()
    print('Evaluation of the trained model Gradient Boosting:')
    accuracy = accuracy_score(y_test, pred_Labels_GBM)
    print()
    print('Accuracy Gradient Boosting:', accuracy)
    precision = precision_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('Precision Gradient Boosting:', precision)
    recall = recall_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('Recall Score Gradient Boosting:', recall)
    f1 = f1_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('F1 Score Gradient Boosting:', f1)
    confusion_mat = confusion_matrix(y_test, pred_Labels_GBM)
    classReport = classification_report(y_test, pred_Labels_GBM)
    print()
    print('Classification Report Gradient Boosting:\n', classReport)
    kappa_score = cohen_kappa_score(y_test, pred_Labels_GBM)
    print()
    print('Kappa Score Gradient Boosting:', kappa_score)

    # Diagnostic plots via scikit-plot
    skplt.estimators.plot_learning_curve(model_GBM, X_train, y_train, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_roc(y_test, pred_proba_GBM, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_confusion_matrix(y_test, pred_Labels_GBM, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_precision_recall(y_test, pred_proba_GBM,
                                        title='Precision-Recall Curve', plot_micro=True,
                                        figsize=(8, 6), cmap='nipy_spectral',
                                        title_fontsize='large', text_fontsize='medium')
    plt.show()


evaluateAlgorithm(X_train, X_test, y_train, y_test, dataset)

My goal is to use RFE to find the best subset of features alongside the grid search over the GBM hyperparameters. However, as written, RFE seems to select features using the GBM's initial hyperparameters, before the grid search tunes anything. How can I resolve this so that feature selection and hyperparameter tuning happen together, giving the best combination of both? Additionally, do you have any suggestions for improving this code?

Based on Ben Reiniger's answer, I arrived at the following:

pipeline = Pipeline(steps=[('RFE', RFE(estimator=GBM))])

parameters = {
    'RFE__estimator__learning_rate': [0.001, 0.01, 0.05],  # 0.01, 0.02, 0.03, 0.05
    'RFE__estimator__subsample': [0.9, 0.5, 0.1],          # 0.7, 0.3
    'RFE__estimator__n_estimators': [500, 1000, 2000],     # 1500
    'RFE__estimator__max_depth': [4, 5, 6],                # 3, 7, 8
    # 'RFE__estimator__max_features': ['auto', 'sqrt', 'log2', None],
    # 'RFE__estimator__min_samples_split': [2, 5, 10],
    # 'RFE__estimator__min_samples_leaf': [1, 2, 4],
    # 'RFE__estimator__max_leaf_nodes': [None, 5, 10, 20],
    'RFE__n_features_to_select': [5, 10, 20],
}
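For reference, here's a minimal sketch of how this RFE-only pipeline could be wired into the grid search and then used for prediction; the CV and scoring settings below are illustrative assumptions, not part of the original code:

grid = GridSearchCV(estimator=pipeline, param_grid=parameters,
                    cv=StratifiedKFold(n_splits=5, shuffle=True),
                    scoring='accuracy', n_jobs=-1, refit=True)
grid.fit(X_train, y_train)

# The fitted RFE step exposes both the selected features (support_)
# and a final model trained on them (estimator_), so the separate
# 'model' step is no longer needed.
best_rfe = grid.best_estimator_['RFE']
print("Features selected by RFE:", X_train.columns[best_rfe.support_])
print("Test accuracy:", grid.score(X_test, y_test))  # predict() runs through RFE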

Solution

  • As written, your code tunes the hyperparameters of the final model, but not those of the gbm inside the feature-selection step. A couple of options:

    1. Expand the search space to include hyperparameters of the selection-gbm, e.g. feature_selection__estimator__max_depth (see the sketch after this list).

    2. Drop the model step. RFE gives access to a final model on the selected feature set (estimator_) and the methods you're likely to need from it are made available directly from the RFE object (e.g. rfe.predict). So then just modify the names of the hyperparameters as above.

    The difference between these approaches is that the first allows a selection-gbm with different hyperparameters from the model-gbm. That'll tend to be more computationally expensive, but more flexible. I personally would be surprised if it provided significant improvement, so I'd suggest the second approach unless you have some time and an inclination to experiment.
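    For illustration, a minimal sketch of option 1, keeping both pipeline steps so the selection-gbm and the model-gbm are tuned independently; the parameter values are placeholders, not recommendations:

        selection_gbm = GradientBoostingClassifier()
        model_gbm = GradientBoostingClassifier()

        pipeline = Pipeline(steps=[('feature_selection', RFE(selection_gbm)),
                                   ('model', model_gbm)])

        parameters = {
            # hyperparameters of the gbm used inside RFE to rank features
            'feature_selection__estimator__max_depth': [2, 4],
            'feature_selection__n_features_to_select': [7, 14, 27],
            # hyperparameters of the final gbm fit on the selected features
            'model__max_depth': [1, 2, 3],
            'model__learning_rate': [0.01, 0.02, 0.03],
        }

        grid = GridSearchCV(pipeline, parameters, cv=5, scoring='accuracy', n_jobs=-1)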