python · scikit-learn · feature-selection · rfe

Force RFECV to keep some features


I'm running feature selection and I've been using RFECV to find the optimal number of features. However, there are certain features I'd like to keep regardless, so I was wondering if there's any way to force the algorithm to keep those and run the RFECV only on the remaining ones.

So far, I'm running it on all of the features:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_squared_error


def main():

    df_data = pd.read_csv(csv_file_path, index_col=0)
    
    X_train, y_train, X_test, y_test = split_data(df_data)
    feats_selection(X_train, y_train, X_test, y_test)


def feats_selection(X_train, y_train, X_test, y_test):
    nr_splits = 10
    nr_repeats = 1
    features_step = 1
    est = DecisionTreeRegressor()

    ## repeated k-fold CV is used to score each candidate feature subset
    cv_mode = RepeatedKFold(n_splits=nr_splits, n_repeats=nr_repeats, random_state=1)
    rfecv = RFECV(estimator=est, step=features_step, cv=cv_mode, scoring='neg_mean_squared_error', verbose=0)

    ## >>> here, the RFECV algorithm is automatically selecting the optimal features <<<
    X_train_transformed = rfecv.fit_transform(X_train, y_train)
    X_test_transformed = rfecv.transform(X_test)


    ## test on test subset
    est.fit(X_train_transformed, y_train)
    y_pred = est.predict(X_test_transformed)
    rmse = mean_squared_error(y_test, y_pred, squared=False)

Solution

  • RFECV doesn't have such a parameter, no.

    Perhaps the cleanest way to accomplish this is with a ColumnTransformer:

    cols_to_always_keep = [...]  # column names if you'll fit on a dataframe, column indices otherwise
    col_sel = ColumnTransformer(
        transformers=[("keep", "passthrough", cols_to_always_keep)],
        remainder=rfecv,
    )
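
    Here's a minimal, self-contained sketch of how that wrapper could be used end to end. The column names and the synthetic data are made up for illustration, and the estimator/RFECV settings mirror the question. Note that ColumnTransformer clones its transformers, so the fitted RFECV is retrieved through named_transformers_ rather than the original rfecv object, and the passed-through columns come out first in the transformed array, followed by whichever remaining columns RFECV selected.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import RepeatedKFold, train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # synthetic stand-in data with hypothetical column names
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 6)), columns=list('abcdef'))
    y = X['a'] + 2 * X['c'] + rng.normal(scale=0.1, size=200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    est = DecisionTreeRegressor()
    rfecv = RFECV(
        estimator=est,
        step=1,
        cv=RepeatedKFold(n_splits=10, n_repeats=1, random_state=1),
        scoring='neg_mean_squared_error',
    )

    cols_to_always_keep = ['a', 'b']  # always passed through untouched
    col_sel = ColumnTransformer(
        transformers=[('keep', 'passthrough', cols_to_always_keep)],
        remainder=rfecv,  # RFECV prunes only the remaining columns
    )

    # kept columns come out first, then whichever remaining columns RFECV selects
    X_train_sel = col_sel.fit_transform(X_train, y_train)
    X_test_sel = col_sel.transform(X_test)

    est.fit(X_train_sel, y_train)
    y_pred = est.predict(X_test_sel)

    # inspect the fitted RFECV (a clone of `rfecv`) to see which of the
    # non-kept columns it retained
    fitted_rfecv = col_sel.named_transformers_['remainder']
    print(fitted_rfecv.support_)

    If you want the final model bundled with the selection step, the ColumnTransformer can also be dropped into a Pipeline together with the estimator, so a single fit/predict handles both.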