I have a dataset of reviews which has a class label of positive/negative. I am applying Naive Bayes to that reviews dataset. Firstly, I am converting into Bag of words. Here sorted_data['Text'] is reviews and final_counts is a sparse matrix
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
I am splitting the data into train and test dataset.
X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0)
I am applying the naive bayes algorithm as follows
optimal_alpha = 1
NB_optimal = BernoulliNB(alpha=optimal_aplha)
# fitting the model
NB_optimal.fit(X_tr, y_tr)
# predict the response
pred = NB_optimal.predict(X_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the NB classifier for k = %d is %f%%' % (optimal_aplha, acc))
Here X_test is test dataset in which pred variable gives us whether the vector in X_test is positive or negative class.
The X_test shape is (54626 rows, 82343 dimensions)
length of pred is 54626
My question is I want to get the words with highest probability in each vector so that I can get to know by the words that why it predicted as positive or negative class. Therefore, how to get the words which have highest probability in each vector?
You can get the importantance of each word out of the fit model by using the coefs_
or feature_log_prob_
attributes. For example
neg_class_prob_sorted = NB_optimal.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = NB_optimal.feature_log_prob_[1, :].argsort()[::-1]
print(np.take(count_vect.get_feature_names(), neg_class_prob_sorted[:10]))
print(np.take(count_vect.get_feature_names(), pos_class_prob_sorted[:10]))
Prints the top 10 most predictive words for each of your classes.