python-2.7, machine-learning, scikit-learn, feature-selection, rfe

Extract Optimal Features from Recursive Feature Elimination (RFE)


I have a dataset consisting of categorical and numerical data with 124 features. In order to reduce its dimensionality I want to remove irrelevant features. However, to run the dataset against a feature selection algorithm I one-hot encoded it with get_dummies, which increased the number of features to 391.

In[16]:
X_train.columns
Out[16]:
Index([u'port_7', u'port_9', u'port_13', u'port_17', u'port_19', u'port_21',
   ...
   u'os_cpes.1_2', u'os_cpes.1_1'], dtype='object', length=391)

With the resulting data I can run recursive feature elimination with cross validation, as per the Scikit Learn example:
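
A minimal sketch of that setup (assuming X_train as above and a corresponding encoded target y_train; the estimator settings mirror the snippet further down):

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Linear-kernel SVC so that RFECV can rank features by their coefficients
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(3), scoring='accuracy')
rfecv.fit(X_train, y_train)

print("Optimal number of features : {}".format(rfecv.n_features_))

# Plot number of features vs. cross-validation score
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()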

Which produces:

[Plot: cross-validated score vs. number of features selected]

Given that the optimal number of features identified was 8, how do I identify the feature names? I am assuming that I can extract them into a new DataFrame for use in a classification algorithm?


[EDIT]

I have achieved this as follows, with help from this post:

import numpy as np

def column_index(df, query_cols):
    # Map column names to their positional indices in the DataFrame
    # (helper taken from the referenced post; not strictly needed below)
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

feature_index = []
features = []

# rfecv.get_support() returns a boolean mask over the encoded feature columns
for num, selected in enumerate(rfecv.get_support()):
    if selected:
        feature_index.append(str(num))

# Translate the positional indices back into column names
for num, name in enumerate(X_dev_train.columns.values):
    if str(num) in feature_index:
        features.append(name)

print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))

which produces:

Features Selected: 8
Features Indexes: 
['5', '6', '20', '26', '27', '28', '67', '98']
Feature Names: 
['port_21', 'port_22', 'port_199', 'port_512', 'port_513', 'port_514', 'port_3306', 'port_32768']
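
As an aside, the same lookup can be written more directly, since get_support() returns a boolean mask that can index the columns (a sketch, assuming rfecv has already been fitted on X_dev_train):

selected_features = X_dev_train.columns[rfecv.get_support()]
print("Feature Names: \n{}".format(list(selected_features)))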

Given that one-hot encoding introduces multicollinearity, I don't think the target column selection is ideal, because the features it has chosen are non-encoded, continuous data features. I have tried re-adding the target column unencoded, but RFE throws the following error because the data is categorical:

ValueError: could not convert string to float: Wireless Access Point

Do I need to group multiple one-hot encoded feature columns to act as the target?


[EDIT 2]

If I simply LabelEncode the target column, I can use this target as 'y' (see the example again). However, the output identifies only a single feature (the target column itself) as optimal. I think this might be because of the one-hot encoding. Should I be looking at producing a dense array, and if so, can it be run against RFE?
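
For reference, the label encoding of the target looks like this (a minimal sketch; 'os_type' is only a stand-in for whichever column is actually used as the target):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# 'os_type' is a hypothetical placeholder for the real target column
y = le.fit_transform(df['os_type'].astype(str))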


Solution

  • Answering my own question, I figured out the issue was related to the way I had one-hot encoded the data. Initially, I ran one-hot encoding against all categorical columns as follows:

    ohe_df = pd.get_dummies(df[df.columns])              # One-hot encode all columns
    

    This introduced a large number of additional features. Taking a different approach, with some help from here, I modified the encoding to work on a per-column/feature basis, as follows:

    cf_df = df.select_dtypes(include=[object])      # Get categorical features
    nf_df = df.select_dtypes(exclude=[object])      # Get numerical features
    ohe_df = nf_df.copy()

    # Encode each categorical column into a single column of dummy-vector lists
    for feature in cf_df:
        ohe_df[feature] = cf_df.loc[:, feature].str.get_dummies().values.tolist()
    

    Producing:

    ohe_df.head(2)      # Only showing a subset of the data
    +---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
    |   |                      os_name                      |    os_family    |     os_type     |             os_vendor             |                     os_cpes.0                     |
    +---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
    | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 0, 0, 0] | [1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
    | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 1, 0] | [0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
    +---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
    

    Unfortunately, although this was the representation I was searching for, it didn't execute against RFECV: scikit-learn estimators expect a 2-D numeric array, and columns whose cells contain Python lists can't be converted into one. Next I thought perhaps I could take a slice of all the new features and pass them in as the target, but this resulted in an error. Finally, I realised I would have to iterate through all the candidate target columns and take the top outputs from each. The code ended up looking something like this:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.svm import SVC
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold

    # 'features' is the list of candidate target columns to try;
    # kfold is assumed to be defined earlier as the number of CV folds
    for num, feature in enumerate(features):

        X = X_dev_train
        y = X_dev_train[feature]

        # Create the RFE object and compute a cross-validated score.
        svc = SVC(kernel="linear")
        # The "accuracy" scoring is proportional to the number of correct classifications
        # step is the number of features to remove at each iteration
        rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(kfold), scoring='accuracy')
        try:
            rfecv.fit(X, y)

            print("Number of observations in each fold: {}".format(len(X) / kfold))
            print("Optimal number of features : {}".format(rfecv.n_features_))

            g_scores = rfecv.grid_scores_
            indices = np.argsort(g_scores)[::-1]

            print('Printing RFECV results:')
            for num2, f in enumerate(range(X.shape[1])):
                if g_scores[indices[f]] > 0.80:
                    if num2 < 10:
                        print("{}. Number of features: {} Grid_Score: {:0.3f}".format(f + 1, indices[f] + 1, g_scores[indices[f]]))

            print("\nTop features sorted by rank:")
            results = sorted(zip(map(lambda x: round(x, 4), rfecv.ranking_), X.columns.values))
            for num3, result in enumerate(results):
                if num3 < 10:
                    print(result)

            # Plot number of features vs. cross-validation scores
            plt.rc("figure", figsize=(8, 5))
            plt.figure()
            plt.xlabel("Number of features selected")
            plt.ylabel("CV score (of correct classifications)")
            plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
            plt.show()

        except ValueError:
            # Some candidate targets fail to fit (e.g. a fold ends up with a single class)
            pass
    

    I'm sure this could be cleaner, and maybe even plotted in one graph, but it works for me.