python, machine-learning, scikit-learn, feature-selection

How to find the best features efficiently?


I am looking to find the best possible model for predicting a target variable (categorical, 9 classes), using up to 30 available features. I have a dataset with 12k rows.

When I worked on similar problems previously, I had access to high-performance computing clusters, so I didn't have to worry much about resource constraints when tuning a model. Now I'm restricted to a 2021 M1 MacBook Pro or a less powerful Ubuntu server. This is proving a huge challenge, as everything I try takes far too long to be practical.

I started by running a very basic cross-validation shoot-out between 7 candidate classifiers, using all available features. This left 3 potential classifiers (SVC-linear, random forest, multinomial logistic regression), all of which returned mean accuracy values around .73 (which isn't bad, but I'm aiming for >.8).

Now, I want to find the best possible model configuration by a) finding the best feature combo for each model, and b) the best hyperparameters.

I've tried two strategies for feature selection:

One - mlxtend's SequentialFeatureSelector, utilising all available processor cores. For only one model (SVC), this process ran for >30 hours and then crashed the entire system. Not a feasible strategy.

Two - I tried a more statistical approach, SelectKBest, which avoids having to test every possible feature combination. This is the code that I came up with to do that:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# X, y, classifiers, clf_names, folds and RANDOM_STATE are defined earlier (not shown).
rnd = RANDOM_STATE
rows = []  # one dict per (classifier, fold, k); turned into a DataFrame once at the end

for i, clf in enumerate(classifiers):
    for f in range(folds):
        # A fresh 70/30 split per fold, with a different seed each time
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, shuffle=True, random_state=rnd)

        for k in range(1, len(X.columns) + 1):
            # Keep the k features with the highest chi2 score
            # (chi2 requires non-negative feature values)
            selector = SelectKBest(chi2, k=k)
            selector.fit(X_train, y_train)

            X_train_selected = selector.transform(X_train)
            X_test_selected = selector.transform(X_test)

            clf.fit(X_train_selected, y_train)
            y_pred = clf.predict(X_test_selected)

            f1 = np.round(f1_score(y_test, y_pred, average='weighted'), 3)
            acc = np.round(accuracy_score(y_test, y_pred), 3)

            features_used = ', '.join(X_train.columns[selector.get_support()])

            rows.append({
                'classifier': clf_names[i],
                'fold': f,
                'random_state': rnd,
                'k': k,
                'features': features_used,
                'f1': f1,
                'acc': acc
            })

        rnd += 1

model_feature_performance_df = pd.DataFrame(rows)

Again, after over 24 hours, it had only completed one fold for the SVC model, and then it crashed without returning anything.

I am looking for any advice as to how to make an informed decision on what my best possible model could be within hours, not days.


Solution

  • Your two approaches are indeed standard approaches when selecting features.

    Please note that SelectKBest, like any univariate feature selection method, evaluates each feature independently, without considering potential relationships between features. This might not result in the "best" combination of features.

    Please have a look at the scikit-learn website, which has an extensive guide on feature selection: "1.13. Feature selection". I couldn't explain it better or give a more comprehensive overview.
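As a rough illustration of how the univariate approach could be made cheaper, here is a minimal sketch that puts SelectKBest inside a scikit-learn Pipeline and lets GridSearchCV search over k with cross-validation. Everything specific here is an assumption rather than part of the question: the data is synthetic (make_classification), the score function is f_classif instead of chi2 (so that scaled, possibly negative features are allowed), the k grid is coarse, and LinearSVC stands in for the linear-kernel SVC. LinearSVC is not an identical model, but the scikit-learn documentation notes that it scales better to large numbers of samples.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (assumption): 9 classes, 30 features.
X, y = make_classification(
    n_samples=1500, n_features=30, n_informative=10, n_redundant=5,
    n_classes=9, n_clusters_per_class=1, random_state=0,
)

# Scaling and selection are part of the pipeline, so they are refit on the
# training portion of every CV split and never see the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LinearSVC(dual=False)),  # dual=False suits n_samples > n_features
])

# A coarse grid over k instead of every value from 1 to 30 (illustrative values).
param_grid = {
    "select__k": [5, 10, 20, 30],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    n_jobs=1,  # can be raised to use more cores
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Once a promising region of k is found, the grid can be refined around it. The same pipeline pattern should also work with the other selectors described in the guide (for example SelectFromModel or RFE) in place of SelectKBest.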