I'm trying to get accuracy, recall and precision measurements from the NLTK movie review corpus, but I get three undesirable outcomes:

1. If I keep the random.shuffle call, accuracy, recall and precision come out different on every run. This isn't good.
2. If I delete the shuffle line, precision and recall don't work anymore and show 0.0 and None respectively.
3. If I delete the shuffle line and change the training and test sets to [500:1500] and [:1500] respectively, like in this thread: How to get the precision and recall from a nltk classifier?, recall and precision do work, but the test set is now larger than the training set (and the two slices overlap), which works at first glance but I believe you can't do that.

import nltk
import random
import collections
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[300:], featuresets[:300]
print('train on', len(train_set), 'instances, test on', len(test_set), 'instances')
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print('Precision:', nltk.precision(refsets['pos'], testsets['pos']))
print('Recall:', nltk.recall(refsets['pos'], testsets['pos']))
print('F_Measure:', nltk.f_measure(refsets['pos'], testsets['pos']))
Is there anything I can do, or am I just misunderstanding something?
Edit: a sufficient solution is to call random.seed(x) before shuffling, or to average the metrics over several runs. It doesn't explain, however, why deleting the shuffle breaks the program.
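For completeness, here is a minimal sketch of the averaging approach; run_once is a hypothetical helper, not part of the original code, that just repeats the train/evaluate steps above on a freshly shuffled copy of featuresets:

import random
import nltk

def run_once(featuresets, seed):
    # Use a dedicated RNG so each run is reproducible and independent.
    rng = random.Random(seed)
    shuffled = featuresets[:]  # copy, so the original order is untouched
    rng.shuffle(shuffled)
    train_set, test_set = shuffled[300:], shuffled[:300]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return nltk.classify.accuracy(classifier, test_set)

# Average accuracy over five deterministic shuffles (seeds 0..4).
scores = [run_once(featuresets, seed) for seed in range(5)]
print('mean accuracy:', sum(scores) / len(scores))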
The reason why deleting the shuffle breaks the program has nothing to do with the NaiveBayesClassifier itself: the movie_reviews corpus lists all 1000 'neg' files before the 1000 'pos' files, so documents is ordered by category. Without shuffling, test_set = featuresets[:300] contains only negative reviews. refsets['pos'] is therefore empty, which makes recall for 'pos' return None, and every document the classifier labels 'pos' is a false positive, which makes precision 0.0. The training set is also skewed (700 'neg' versus 1000 'pos'), so the classifier may not generalize well. To get consistent results while keeping a proper random split, set the seed for the random number generator with random.seed() before shuffling: the shuffle is then deterministic and you get the same train/test split every time you run the code. Here's the updated code with the seed set to 42:
import nltk
import random
import collections
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
random.seed(42) # Set the random seed for reproducibility
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # deterministic now that the seed is fixed

# Build the feature vocabulary: 2000 words from the corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    # Map each feature word to True/False depending on its presence.
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[300:], featuresets[:300]
print('train on', len(train_set), 'instances, test on', len(test_set), 'instances')
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

# Collect gold labels (refsets) and predicted labels (testsets) by index.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print('Precision:', nltk.precision(refsets['pos'], testsets['pos']))
print('Recall:', nltk.recall(refsets['pos'], testsets['pos']))
print('F_Measure:', nltk.f_measure(refsets['pos'], testsets['pos']))
This should give you consistent results each time you run the code.
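Alternatively, if you want determinism without shuffling at all, you can stratify the split by hand: take the same number of test documents from each category. This is a sketch under the assumption that featuresets is ordered by category (1000 'neg' followed by 1000 'pos'), as it is when the shuffle is removed:

# Stratified, deterministic split: 150 'neg' + 150 'pos' test documents,
# the remaining 1700 documents for training. Assumes featuresets is
# ordered by category (all 'neg' first, then all 'pos').
neg, pos = featuresets[:1000], featuresets[1000:]
test_set = neg[:150] + pos[:150]
train_set = neg[150:] + pos[150:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Both categories are now present in the test set, so precision and recall for 'pos' are well defined, and the training set stays larger than the test set.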