pythonscikit-learnclassificationfeature-selection

python: How to get real feature name from feature_importances


I am using Python's sklearn random forest (ensemble.RandomForestClassifier) to do classification and am using feature_importances_ to find significant feature for the classifier. Now my code is:

for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}),actually key is the feature

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)

orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())

# so DictVectorizer() and TfidfTransformer() help me to phrase the features and for each instance, the feature dimension is 580, which means that there are 580 venue types 

data = orig_ven_feat.tocsr()

le = LabelEncoder() 
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"]))
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels!=unlabelled_int)[0]  
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)                      
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind,:]
    test_data = data[test_ind,:]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print ("Training classifier...")
    clf.fit(train_data,labels_train)
    importances = clf.feature_importances_

Now the problem is that, I get an array of dimension 580 (same as feature dimension) when I use feature_importances, I want to know the top 20 important features (top 20 important venues)

I think at least what I should know is the indices of the 20 biggest number from importances, but I don't know:

  1. How to get indices of top 20 from importances

  2. Since I used Dictvectorizer and TfidfTransformer so I don't know how to match the indices with the real venue names ('school', 'home',....)

Any idea to help me? Thank you very much!


Solution

  • The feature_importances_ method returns the relative importance numbers in the order the features were fed to the algorithm. So in order to get the top 20 features you'll want to sort the features from most to least important for instance like this:

    importances = forest.feature_importances_
    indices = numpy.argsort(importances)[-20:]
    

    ([-20:] because you need to take the last 20 elements of the array since argsort sorts in ascending order)