python-3.x nlp nltk text-classification

Python Text Classification Accuracy Measurement Inconsistency


I'm trying to get accuracy, recall and precision measurements from the NLTK movie review corpus, but the three numbers come out different on every run:

import nltk
import random
import collections
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)



all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[300:], featuresets[:300]
print('train on', len(train_set), 'instances, test on', len(test_set), 'instances')
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
print('Precision:', nltk.precision(refsets['pos'], testsets['pos']))
print('Recall:', nltk.recall(refsets['pos'], testsets['pos']))
print('F_Measure:', nltk.f_measure(refsets['pos'], testsets['pos']))

Is there anything I can do, or am I just misunderstanding something?

Edit: a sufficient workaround is to call random.seed(x) before shuffling, or to average the metrics over several runs. It doesn't explain, however, why deleting the shuffle breaks the program.
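The averaging idea can be sketched without the corpus; here `score_split` is a hypothetical stand-in for the NLTK train-and-score code above, and the rest is plain Python:

```python
import random
import statistics

# Hedged sketch of the "average several runs" idea: run the same
# shuffle/split/score pipeline under several fixed seeds and report the
# mean. score_split is a placeholder for the NLTK train/test code.
def mean_over_seeds(items, score_split, runs=5):
    scores = []
    for seed in range(runs):
        rng = random.Random(seed)    # independent, reproducible shuffle per run
        shuffled = items[:]
        rng.shuffle(shuffled)
        scores.append(score_split(shuffled))
    return statistics.mean(scores)

# Toy demo scorer: fraction of 'pos' items landing in a 20-item test slice
items = ['pos'] * 50 + ['neg'] * 50
score = mean_over_seeds(items, lambda xs: xs[:20].count('pos') / 20)
print(score)
```

Because each run uses its own seeded `random.Random`, the averaged score is identical every time you run it, while still sampling several different splits.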


Solution

  • The reason deleting the shuffle breaks the program is not the classifier but the ordering of the corpus: the documents list is built by iterating movie_reviews.categories() and then the fileids of each category, so it starts with every 'neg' review followed by every 'pos' review. Without shuffling, featuresets[:300] (your test set) therefore contains only 'neg' documents. That leaves refsets['pos'] empty, so nltk.recall and nltk.f_measure for the 'pos' label return None and the precision is either None or zero, which is what "breaks" the metrics. The run-to-run inconsistency, meanwhile, comes from random.shuffle producing a different train/test split each time. To get consistent results, set the seed for the random number generator before shuffling with random.seed(); the shuffle then becomes deterministic and you get the same split on every run. Here's the updated code with the seed set to 42:

    import nltk
    import random
    import collections
    
    from nltk.corpus import movie_reviews
    from nltk.tokenize import word_tokenize
    
    random.seed(42)  # Set the random seed for reproducibility
    
    documents = [(list(movie_reviews.words(fileid)), category)
                for category in movie_reviews.categories()
                for fileid in movie_reviews.fileids(category)]
    
    random.shuffle(documents)
    
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = list(all_words)[:2000]
    
    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features[word] = (word in document_words)
        return features
    
    featuresets = [(document_features(d), c) for (d,c) in documents]
    train_set, test_set = featuresets[300:], featuresets[:300]
    print('train on', len(train_set), 'instances, test on', len(test_set), 'instances')
    
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(10)
    
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_set):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
    print('Precision:', nltk.precision(refsets['pos'], testsets['pos']))
    print('Recall:', nltk.recall(refsets['pos'], testsets['pos']))
    print('F_Measure:', nltk.f_measure(refsets['pos'], testsets['pos']))
    

    This should give you consistent results each time you run the code.
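The unshuffled failure mode can be reproduced without the corpus by building refsets/testsets exactly as in the question, but from a single-class test set (a minimal sketch with hand-made labels):

```python
import collections

# Toy illustration (hand-made labels, not the real corpus): unshuffled,
# the first slice of `documents` holds only 'neg' reviews, so the test
# set contains no 'pos' documents at all.
gold_labels = ['neg'] * 5    # what an unshuffled featuresets[:300] looks like
predictions = ['neg'] * 5    # whatever the classifier happens to answer

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (gold, observed) in enumerate(zip(gold_labels, predictions)):
    refsets[gold].add(i)
    testsets[observed].add(i)

print(len(refsets['pos']), len(testsets['pos']))   # 0 0
# With no positive reference items, nltk.recall(refsets['pos'],
# testsets['pos']) has nothing to measure against and returns None.
```

Shuffling (or a stratified split) guarantees both classes appear in the test slice, which is why the program only works when the shuffle is present.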