I am learning Word2Vec and was trying to replicate a Word2Vec model from my textbook. Unlike what the textbook shows, however, my model gives a warning saying that the supplied example count (0) did not equal the expected count (2381). Apparently, my model was not trained at all. The corpus I fed to the model was a re-usable iterable (it was a list), as it passed this test:
>>> print(sum(1 for _ in corpus))
2381
>>> print(sum(1 for _ in corpus))
2381
>>> print(sum(1 for _ in corpus))
2381
I tried with gensim 3.6 and gensim 4.3, and both versions gave me the same warning. Here is a code snippet I used with gensim 3.6:
from gensim.models import Word2Vec

word2vec_model = Word2Vec(size=300, window=5, min_count=2, workers=-1)
word2vec_model.build_vocab(corpus)
word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin.gz', lockf=1.0, binary=True)
word2vec_model.train(corpus, total_examples=word2vec_model.corpus_count, epochs=15)
This is the warning message:
WARNING:gensim.models.base_any2vec:EPOCH - 1 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 2 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 3 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 4 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 5 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 6 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 7 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 8 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 9 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 10 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 11 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 12 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 13 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 14 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 15 : supplied example count (0) did not equal expected count (2381)
(0, 0)
I also tried to train a Doc2Vec model on a different corpus in the form of TaggedDocuments, and it gave me the same warning message.
Gensim's Word2Vec & Doc2Vec (& related models) don't accept a workers=-1 value. You have to set a specific count of worker threads.
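For example, a minimal corrected version of your setup might look like the sketch below – gensim 3.x parameter names, and the thread count of 4 is just an illustration, not a tuned value:

from gensim.models import Word2Vec

# a specific, positive thread count -- never -1
word2vec_model = Word2Vec(size=300, window=5, min_count=2, workers=4)
word2vec_model.build_vocab(corpus)
word2vec_model.train(corpus, total_examples=word2vec_model.corpus_count, epochs=15)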
Setting -1 means no threads at all, which produces the no-training situation you've observed. (The latest Gensim, or logging at at least the INFO level, might give better messaging about what's gone wrong.)
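To surface that kind of detail, you can enable Gensim's standard Python logging before training – this is the usual idiom, nothing specific to this problem:

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)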
Generally, the worker count should never be higher than the number of CPU cores. But also, when training with a corpus iterable on a machine with more than 8 cores, optimal throughput is more likely to be reached in the 6-12 thread range than anything higher, because of contention/bottlenecking in the single-reader-thread, fan-out-to-many-workers approach Gensim uses, and the Python "GIL".
Unfortunately, the exact best-throughput value will vary based on your other parameters, especially window, vector_size, and negative, and can only be found via trial and error. I often start with 6 on an 8-core machine, and 12 on any machine with 16 or more cores. (Another key tip: make sure your corpus iterable does as little as possible in the main thread – such as reading a pre-tokenized file from disk, rather than repeating any other preprocessing on every iteration. See the sketch below.)
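As a sketch of that "do as little as possible" idea – the class name and the one-pre-tokenized-text-per-line file format here are just assumptions for illustration:

class PretokenizedCorpus:
    # restartable iterable: each __iter__ call re-opens the file,
    # so Gensim can make multiple passes (vocab scan + training epochs)
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                # one pre-tokenized text per line, tokens separated by spaces
                yield line.split()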
If you can get all your text from a pre-tokenized text file, you can also consider the corpus_file mode, which lets each worker read its own unique range of the file, and thus can better achieve maximum throughput with workers set to the full number of cores.
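That mode might look like the following – assuming corpus.txt is in the space-delimited, one-text-per-line format corpus_file expects, and workers=8 matches an 8-core machine; note that supplying the corpus to the constructor builds the vocabulary and trains in one step:

model = Word2Vec(corpus_file='corpus.txt', size=300, window=5, min_count=2, workers=8)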
Separate tips:
A min_count=2 value that low usually hurts word2vec results: rare words don't learn good representations for themselves from a small number of usage examples, but in aggregate they can dilute/interfere with other words. Discarding more rare words, as the size of the corpus allows, often improves all surviving words enough to improve overall downstream evaluations.
.intersect_word2vec_format() is an advanced/experimental option with no established best practices; try to understand what it does from the source code, and the unusual ways it changes the normal SGD tradeoffs, before using it – and be sure to run extra checks that it's actually doing what you want, compared to more typical approaches.