I am using Python's sklearn
random forest (ensemble.RandomForestClassifier
) to do classification and am using feature_importances_
to find significant feature for the classifier. Now my code is:
for trip in database:
venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}),actually key is the feature
feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)
orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# so DictVectorizer() and TfidfTransformer() help me to phrase the features and for each instance, the feature dimension is 580, which means that there are 580 venue types
data = orig_ven_feat.tocsr()
le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
unlabelled_int = int(le.transform(["Unlabelled"]))
else:
unlabelled_int = -1
valid_rows_idx = np.where(labels!=unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification
clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
train_data = data[train_ind,:]
test_data = data[test_ind,:]
labels_train = labels[train_ind]
labels_test = labels[test_ind]
print ("Training classifier...")
clf.fit(train_data,labels_train)
importances = clf.feature_importances_
Now the problem is that, I get an array of dimension 580 (same as feature dimension) when I use feature_importances, I want to know the top 20 important features (top 20 important venues)
I think at least what I should know is the indices of the 20 biggest number from importances, but I don't know:
How to get indices of top 20 from importances
Since I used Dictvectorizer and TfidfTransformer so I don't know how to match the indices with the real venue names ('school', 'home',....)
Any idea to help me? Thank you very much!
The feature_importances_
method returns the relative importance numbers in the order the features were fed to the algorithm. So in order to get the top 20 features you'll want to sort the features from most to least important for instance like this:
importances = forest.feature_importances_
indices = numpy.argsort(importances)[-20:]
([-20:]
because you need to take the last 20 elements of the array since argsort
sorts in ascending order)