I have a dataset consisting of categorical and numerical data with 124 features. In order to reduce its dimensionality I want to remove irrelevant features. However, to run the dataset against a feature selection algorithm I one hot encoded it with get_dummies, which increased the number of features to 391.
In[16]:
X_train.columns
Out[16]:
Index([u'port_7', u'port_9', u'port_13', u'port_17', u'port_19', u'port_21',
...
u'os_cpes.1_2', u'os_cpes.1_1'], dtype='object', length=391)
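To illustrate where the extra columns come from, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the real dataset. `get_dummies` replaces each categorical column with one indicator column per distinct value, so the column count grows with the number of categories:

```python
import pandas as pd

# Hypothetical toy data: one numerical column and two categorical columns.
df_demo = pd.DataFrame({
    "port_22": [1, 0, 1, 0],
    "os_family": ["Linux", "Windows", "Linux", "embedded"],
    "os_type": ["router", "phone", "WAP", "router"],
})

# Each categorical column is expanded into one column per distinct value;
# the numerical column is passed through unchanged.
encoded = pd.get_dummies(df_demo)

print(df_demo.shape[1], "columns before,", encoded.shape[1], "columns after")
print(list(encoded.columns))
```

Here three distinct values in each of the two categorical columns turn 3 input columns into 7 encoded ones.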
With the resulting data I can run recursive feature elimination with cross validation, as per the Scikit Learn example:
Which produces:
[Figure: Cross Validated Score vs Features Graph]
Given that the optimal number of features identified was 8, how do I identify the feature names? I am assuming that I can extract them into a new DataFrame for use in a classification algorithm?
[EDIT]
I have achieved this as follows, with help from this post:
import numpy as np

def column_index(df, query_cols):
    # Return the positions of query_cols within df.columns
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

feature_index = []
features = []
column_index(X_dev_train, X_dev_train.columns.values)
# get_support() returns a boolean mask; True marks a selected feature
for num, i in enumerate(rfecv.get_support(), start=0):
    if i:
        feature_index.append(str(num))
# Look up the column name for every selected index
for num, i in enumerate(X_dev_train.columns.values, start=0):
    if str(num) in feature_index:
        features.append(X_dev_train.columns.values[num])
print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))
which produces:
Features Selected: 8
Features Indexes:
['5', '6', '20', '26', '27', '28', '67', '98']
Feature Names:
['port_21', 'port_22', 'port_199', 'port_512', 'port_513', 'port_514', 'port_3306', 'port_32768']
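The same lookup can probably be written more compactly by indexing the columns with the boolean mask directly. The sketch below is self-contained, so it uses a hypothetical `support_mask` array in place of `rfecv.get_support()` and a small made-up frame in place of `X_dev_train`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_dev_train: four already-encoded columns.
X_demo = pd.DataFrame(
    np.zeros((3, 4), dtype=int),
    columns=["port_21", "port_22", "port_80", "os_type_router"],
)

# Hypothetical stand-in for rfecv.get_support(): True marks a kept column.
support_mask = np.array([True, True, False, False])

selected_names = X_demo.columns[support_mask].tolist()  # names of kept columns
selected_idx = np.flatnonzero(support_mask).tolist()    # their integer positions
X_selected = X_demo.loc[:, support_mask]                # reduced DataFrame

print("Features Selected: {}".format(len(selected_names)))
print("Feature Indexes: {}".format(selected_idx))
print("Feature Names: {}".format(selected_names))
```

`X_selected` is the reduced DataFrame that could then be passed to a classifier.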
Given that one hot encoding introduces multicollinearity, I don't think the target column selection is ideal, because the features it has chosen are all non-encoded continuous features. I have tried re-adding the target column unencoded, but RFE throws the following error because the data is categorical:
ValueError: could not convert string to float: Wireless Access Point
Do I need to group multiple one hot encoded feature columns to act as the target?
[EDIT 2]
If I simply LabelEncode the target column, I can use it as 'y' (see the example again). However, the output identifies only a single feature (the target column) as optimal. I think this might be because of the one hot encoding. Should I be looking at producing a dense array and, if so, can it be run against RFE?
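For reference, the label-encoding step can be sketched as follows. This is a minimal example on made-up data, not my actual pipeline: the column names, the 60-row size and the deliberately informative `port_22` column are all assumptions. The string target is converted to integers with `LabelEncoder` and kept out of `X`, which avoids the "could not convert string to float" error:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 60

# Hypothetical string target with two classes.
os_type = np.array(["Router", "Wireless Access Point"] * (n // 2))

# Hypothetical features: port_22 mirrors the target, the others are noise.
X_demo = pd.DataFrame({
    "port_22": (os_type == "Wireless Access Point").astype(int),
    "port_80": rng.randint(0, 2, n),
    "port_443": rng.randint(0, 2, n),
})

# Encode the string labels as integers; the target is not a column of X_demo.
y_demo = LabelEncoder().fit_transform(os_type)

rfecv_demo = RFECV(estimator=SVC(kernel="linear"), step=1,
                   cv=StratifiedKFold(3), scoring="accuracy")
rfecv_demo.fit(X_demo, y_demo)

selected = X_demo.columns[rfecv_demo.get_support()].tolist()
print("Selected features: {}".format(selected))
```

If the target (or a one hot encoded copy of it) is left inside `X`, the estimator can predict `y` from that column alone, which would be consistent with only a single feature being reported as optimal.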
Answering my own question, I figured out the issue was related to the way I had one-hot encoded the data. Initially, I ran one hot encoding against all categorical columns as follows:
ohe_df = pd.get_dummies(df[df.columns]) # One-hot encode all columns
This introduced a large number of additional features. Taking a different approach, with some help from here, I have modified the encoding to encode multiple columns on a per-column/feature basis as follows:
cf_df = df.select_dtypes(include=[object]) # Get categorical features
nf_df = df.select_dtypes(exclude=[object]) # Get numerical features
ohe_df = nf_df.copy()
for feature in cf_df:
    # Read from cf_df: ohe_df starts out with the numerical columns only
    ohe_df[feature] = cf_df[feature].str.get_dummies().values.tolist()
Producing:
ohe_df.head(2) # Only showing a subset of the data
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| | os_name | os_family | os_type | os_vendor | os_cpes.0 |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 0, 0, 0] | [1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
| 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 1, 0] | [0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
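To make the shape of this result concrete, here is a small sketch of the same per-column encoding on a toy frame (the names `df_demo` and `packed` and all values are hypothetical). Each cell ends up holding a Python list, so the column has `object` dtype rather than a numeric one:

```python
import pandas as pd

# Hypothetical toy frame: one numerical and one categorical column.
df_demo = pd.DataFrame({
    "port_22": [1, 0, 1],
    "os_type": pd.Series(["router", "phone", "router"], dtype=object),
})

cf_demo = df_demo.select_dtypes(include=[object])  # categorical features
nf_demo = df_demo.select_dtypes(exclude=[object])  # numerical features

packed = nf_demo.copy()
for feature in cf_demo:
    # One list of indicator values per row, stored in a single column
    packed[feature] = cf_demo[feature].str.get_dummies().values.tolist()

print(packed)
print(packed.dtypes)
```

Because `os_type` now contains lists, an estimator that expects a 2-D numeric array cannot consume this frame directly.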
Unfortunately, although this was what I was searching for, it didn't execute against RFECV. Next I thought perhaps I could take a slice of all the new features and pass them in as the target, but this resulted in an error. Finally, I realised I would have to iterate through all target values and take the top outputs from each. The code ended up looking something like this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

for num, feature in enumerate(features, start=0):
    X = X_dev_train
    y = X_dev_train[feature]
    # Create the RFE object and compute a cross-validated score.
    svc = SVC(kernel="linear")
    # The "accuracy" scoring is proportional to the number of correct classifications
    # step is the number of features to remove at each iteration
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(kfold), scoring='accuracy')
    try:
        rfecv.fit(X, y)
        print("Number of observations in each fold: {}".format(len(X)/kfold))
        print("Optimal number of features : {}".format(rfecv.n_features_))
        # Note: grid_scores_ was removed in scikit-learn 1.2;
        # newer versions expose rfecv.cv_results_["mean_test_score"] instead
        g_scores = rfecv.grid_scores_
        indices = np.argsort(g_scores)[::-1]
        print('Printing RFECV results:')
        for num2, f in enumerate(range(X.shape[1]), start=0):
            if g_scores[indices[f]] > 0.80:
                if num2 < 10:
                    print("{}. Number of features: {} Grid_Score: {:0.3f}".format(f + 1, indices[f]+1, g_scores[indices[f]]))
        print("\nTop features sorted by rank:")
        results = sorted(zip(map(lambda x: round(x, 4), rfecv.ranking_), X.columns.values))
        for num3, i in enumerate(results, start=0):
            if num3 < 10:
                print(i)
        # Plot number of features VS. cross-validation scores
        plt.rc("figure", figsize=(8, 5))
        plt.figure()
        plt.xlabel("Number of features selected")
        plt.ylabel("CV score (of correct classifications)")
        plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
        plt.show()
    except ValueError:
        # Skip targets that RFECV cannot fit
        pass
I'm sure this could be cleaner, and maybe even plotted in one graph, but it works for me.
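One possible way to tidy up the output is to collect the rankings from every run into a single table instead of printing them per target. The sketch below is only an illustration: the `rankings_by_target` dictionary holds made-up values standing in for `rfecv.ranking_` from three runs, and the target and feature names are hypothetical.

```python
import pandas as pd

# Hypothetical feature names and rankings (rank 1 means the feature was kept).
feature_names = ["port_21", "port_22", "port_80", "port_443"]
rankings_by_target = {
    "os_type_router": [1, 1, 3, 2],
    "os_type_phone": [1, 2, 1, 3],
    "os_type_wap": [1, 1, 2, 3],
}

# Rows are features, columns are the targets each RFECV run was fitted against.
ranking_df = pd.DataFrame(rankings_by_target, index=feature_names)

# Count how often each feature received rank 1 across all targets.
times_selected = (ranking_df == 1).sum(axis=1).sort_values(ascending=False)

print(ranking_df)
print(times_selected)
```

Features that are kept for most targets rise to the top of `times_selected`, which gives a single ordering to work from.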