I'm trying to train a classifier for tweets. However, it reports 100% accuracy, and the list of most informative features doesn't display anything. Does anyone know what I'm doing wrong? I believe all the inputs to the classifier are correct, so I have no idea where it is going wrong.
This is the dataset I'm using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
This is my code:
import nltk
import random

file = open('Train/train.txt', 'r')

documents = []
all_words = []  # TODO remove punctuation?
INPUT_TWEETS = 3000

print("Preprocessing...")
for line in (file):
    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])
    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))
    for word in tweet_words:
        all_words.append(word.lower())
    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents)
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]  # top 3000 words

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

# Categorize as positive or negative
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]
training_set = feature_set[:1000]
testing_set = feature_set[1000:]

print("Training...")
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)
There is a typo in your code:
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]
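The loop variable is spelled sentment, so sentiment inside the tuple refers to the variable left over from your preprocessing loop. As a result, sentiment always has the same value (namely the label of the last tweet from your preprocessing step), so training is pointless and all features are irrelevant.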
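To make the effect concrete, here is a minimal sketch on toy data. The three sample documents, the simplified find_features, and the names buggy and fixed are made up for illustration and are not from your code. With only one label in the data, the classifier always predicts that label, which would explain both the 100% accuracy and the empty feature list.

```python
# Toy stand-in for the (tokens, label) pairs built during preprocessing.
documents = [
    (["i", "love", "this"], "positive"),
    (["so", "sad", "today"], "negative"),
    (["great", "music"], "positive"),
]

# Simulates the variable left over after the preprocessing loop finished:
# it holds the label of the last tweet that was processed.
sentiment = "positive"

def find_features(words):
    # Simplified, hypothetical feature extractor (not the one from the question).
    return {w: True for w in words}

# With the typo: the loop binds `sentment`, so `sentiment` is the stale outer value.
buggy = [(find_features(words), sentiment) for (words, sentment) in documents]

# Corrected: the name unpacked in the loop is the one used in the tuple.
fixed = [(find_features(words), label) for (words, label) in documents]

buggy_labels = [label for (_, label) in buggy]
fixed_labels = [label for (_, label) in fixed]
print(buggy_labels)  # every entry carries the same stale label
print(fixed_labels)  # each entry carries its own label
```

In your code the corrected line would unpack sentiment (spelled consistently) in the for clause of the comprehension.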
Fix it and you will get:
('Naive Bayes Accuracy:', 66.75)
Most Informative Features
         -- = True    positi : negati = 6.9 : 1.0
      these = True    positi : negati = 5.6 : 1.0
       face = True    positi : negati = 5.6 : 1.0
        saw = True    positi : negati = 5.6 : 1.0
          ] = True    positi : negati = 4.4 : 1.0
      later = True    positi : negati = 4.4 : 1.0
       love = True    positi : negati = 4.1 : 1.0
         ta = True    positi : negati = 4.0 : 1.0
      quite = True    positi : negati = 4.0 : 1.0
     trying = True    positi : negati = 4.0 : 1.0
      small = True    positi : negati = 4.0 : 1.0
        thx = True    positi : negati = 4.0 : 1.0
      music = True    positi : negati = 4.0 : 1.0
          p = True    positi : negati = 4.0 : 1.0
    husband = True    positi : negati = 4.0 : 1.0
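Two other lines in the question's code may be worth checking, separately from the typo. First, line[0] is a one-character string, and in Python a string never compares equal to the integer 0, so the "negative" branch cannot be taken as written. Second, in NLTK 3 FreqDist is a Counter subclass, so slicing keys() gives the first words seen rather than the most frequent ones; most_common is the usual way to get the top words. The sketch below assumes each line starts with a "0" or "1" label followed by a one-character separator (which is what line[0] and line[2:] suggest); the sample lines and the helper label_of are hypothetical, and collections.Counter stands in for nltk.FreqDist so the snippet runs without NLTK.

```python
from collections import Counter

# Hypothetical sample lines in the assumed "<label><separator><text>" layout.
lines = ["0,so sad today", "1,love this music", "1,love it"]

def label_of(line):
    # line[0] is a str, so compare against "0" rather than the integer 0.
    return "negative" if line[0] == "0" else "positive"

labels = [label_of(line) for line in lines]

# Counter is a stand-in for nltk.FreqDist, which subclasses Counter in NLTK 3.
counts = Counter(word.lower() for line in lines for word in line[2:].split())

first_seen = list(counts.keys())[:1]                  # insertion order, not frequency
top_words = [w for w, _ in counts.most_common(1)]     # highest count first

print(labels)
print(first_seen, top_words)
```

With a real FreqDist, the same most_common call could replace the keys() slice when building word_features.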