Tags: gensim, word2vec, doc2vec

Word2Vec / Doc2Vec training fails: Supplied example count (0) did not equal expected count


I am learning Word2Vec and was trying to replicate a Word2Vec model from my textbook. Unlike what the textbook shows, however, my model gives a warning saying that the supplied example count (0) did not equal the expected count (2381); apparently, my model was not trained at all. The corpus I fed to the model is a re-usable iterable (it is a list), as it passes this test:

>>> print(sum(1 for _ in corpus))
2381
>>> print(sum(1 for _ in corpus))
2381
>>> print(sum(1 for _ in corpus))
2381

I tried with gensim 3.6 and gensim 4.3, and both versions gave me the same warning. Here is a code snippet I used with gensim 3.6:

word2vec_model = Word2Vec(size = 300, window=5, min_count = 2, workers = -1)
word2vec_model.build_vocab(corpus)
word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin.gz', lockf=1.0, binary=True)
word2vec_model.train(corpus, total_examples = word2vec_model.corpus_count, epochs = 15)

This is the warning message:

WARNING:gensim.models.base_any2vec:EPOCH - 1 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 2 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 3 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 4 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 5 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 6 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 7 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 8 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 9 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 10 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 11 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 12 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 13 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 14 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 15 : supplied example count (0) did not equal expected count (2381)
(0, 0)

I also tried training a Doc2Vec model on a different corpus (a list of TaggedDocument objects), and it gave me the same warning message.


Solution

  • Gensim's Word2Vec & Doc2Vec (& related models) don't take a workers=-1 value. You have to set a specific count of worker threads.

    Setting -1 results in no worker threads at all, and hence the no-training situation you've observed. (The latest Gensim may report this more clearly, as may enabling logging at the INFO level or above.)
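
    For example, the snippet from the question starts training once workers is a positive integer (gensim 3.6 parameter names, as in the question; 4 is just an illustrative value, not a recommendation):

        from gensim.models import Word2Vec

        # identical to the question's snippet except for a concrete worker count
        word2vec_model = Word2Vec(size=300, window=5, min_count=2, workers=4)
        word2vec_model.build_vocab(corpus)  # corpus: the same list of token lists as above
        word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin.gz',
                                                  lockf=1.0, binary=True)
        word2vec_model.train(corpus, total_examples=word2vec_model.corpus_count, epochs=15)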

    Generally the worker count should never be higher than the number of CPU cores. But when training with a corpus iterable on a machine with more than 8 cores, optimal throughput is more likely to be reached in the 6-12 thread range than anything higher, because of contention/bottlenecking in the single-reader-thread, fan-out-to-many-workers approach Gensim uses, and the Python "GIL".
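
    As a rough, illustrative starting point only (this is not a Gensim rule, and trial and error still applies, as noted below), you might derive the count from the machine's core count:

        import os

        # cap at the core count; on larger machines stay in the 6-12 range
        cores = os.cpu_count() or 1
        workers = cores if cores <= 8 else min(cores, 12)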

    Unfortunately, the exact best throughput value will vary based on your other parameters, especially window, vector_size, and negative, and can only be found via trial and error. I often start with 6 on an 8-core machine, and 12 on any machine with 16 or more cores. (Another key tip: make sure your corpus iterable does as little as possible in the main thread, such as reading a pre-tokenized file from disk rather than repeating any other preprocessing on every iteration; see the sketch below.)
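
    One common way to keep per-epoch work minimal, sketched here as an illustration (the file name is hypothetical), is to pre-tokenize once, write one space-separated sentence per line, and then stream that file with Gensim's LineSentence, which does nothing per iteration beyond reading and splitting lines:

        from gensim.models.word2vec import LineSentence

        # restartable iterable over a pre-tokenized, one-sentence-per-line file
        corpus = LineSentence('pretokenized_corpus.txt')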

    If you can get all your text into a pre-tokenized text file, you can also consider the corpus_file mode, which lets each worker thread read its own unique range of the file, and thus comes closer to maximum throughput with workers set to the full number of cores.
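
    A minimal sketch of that mode, using gensim 4.x parameter names and a hypothetical file name (the constructor builds the vocabulary and trains immediately when given corpus_file):

        import os
        from gensim.models import Word2Vec

        # each worker reads its own range of the pre-tokenized file
        model = Word2Vec(corpus_file='pretokenized_corpus.txt',
                         vector_size=300, window=5, min_count=2,
                         epochs=15, workers=os.cpu_count())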

    Separate tips: