python, nltk, sentiment-analysis, naivebayes, nltk-trainer

NLTK Naive Bayes Classifier Training issues


I'm trying to train a classifier on tweets. The problem is that it reports 100% accuracy, and the list of most informative features is empty. Does anyone know what I'm doing wrong? I believe all my inputs to the classifier are correct, so I have no idea where it's going wrong.

This is the dataset I'm using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

This is my code:

import nltk
import random

file = open('Train/train.txt', 'r')


documents = []
all_words = []           #TODO remove punctuation?
INPUT_TWEETS = 3000

print("Preprocessing...")
for line in (file):

    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])

    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]   #top 3000 words

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#Categorize as positive or Negative
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]


training_set = feature_set[:1000]
testing_set = feature_set[1000:]  

print("Training...")
classifier = nltk.NaiveBayesClassifier.train(training_set)

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)

Solution

  • There is a typo in your code:

    feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

    Because the comprehension variable is misspelled as sentment, the sentiment inside the tuple is not the loop variable at all; it picks up the leftover sentiment variable from your preprocessing loop, i.e. the label of the last tweet read. Every feature set therefore gets the same label, so training is pointless and all features are irrelevant.
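    The fix is just to spell the loop variable consistently (renaming it also avoids shadowing the all_words FreqDist). A minimal self-contained sketch, where documents and find_features stand in for the ones in the question:

    ```python
    # Two tiny stand-in documents with different labels.
    documents = [
        (["good", "day"], "positive"),
        (["bad", "day"], "negative"),
    ]

    word_features = ["good", "bad", "day"]

    def find_features(document):
        words = set(document)
        return {w: (w in words) for w in word_features}

    # Corrected comprehension: both occurrences of each loop
    # variable are spelled the same, so every document's own
    # label is used.
    feature_set = [(find_features(tweet_words), sentiment)
                   for (tweet_words, sentiment) in documents]

    # Each entry now carries its own label, not the last label
    # seen during preprocessing.
    print([label for (_, label) in feature_set])  # ['positive', 'negative']
    ```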

    Fix it and you will get:

    Naive Bayes Accuracy: 66.75
    Most Informative Features
                      -- = True           positi : negati =      6.9 : 1.0
                   these = True           positi : negati =      5.6 : 1.0
                    face = True           positi : negati =      5.6 : 1.0
                     saw = True           positi : negati =      5.6 : 1.0
                       ] = True           positi : negati =      4.4 : 1.0
                   later = True           positi : negati =      4.4 : 1.0
                    love = True           positi : negati =      4.1 : 1.0
                      ta = True           positi : negati =      4.0 : 1.0
                   quite = True           positi : negati =      4.0 : 1.0
                  trying = True           positi : negati =      4.0 : 1.0
                   small = True           positi : negati =      4.0 : 1.0
                     thx = True           positi : negati =      4.0 : 1.0
                   music = True           positi : negati =      4.0 : 1.0
                       p = True           positi : negati =      4.0 : 1.0
                 husband = True           positi : negati =      4.0 : 1.0
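
As a sanity check on the symptom: when every example carries the same label, any classifier trivially scores 100% on held-out data with that label, and no feature can distinguish the classes. A minimal sketch using the same NLTK API, with tiny hypothetical feature dicts rather than the question's data:

```python
import nltk

# Every training and test example gets the same label, mimicking
# the effect of the typo: the classifier can only predict "positive".
train = [({"good": True}, "positive"), ({"bad": True}, "positive")]
test = [({"bad": True}, "positive")]

clf = nltk.NaiveBayesClassifier.train(train)

# Accuracy is perfect because there is only one label to predict.
print(nltk.classify.accuracy(clf, test))  # 1.0
```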