As you may know, when building a Doc2Vec model you typically call model.build_vocab(corpus_file='...') first, then model.train(corpus_file='...', total_examples=..., total_words=..., epochs=10).
I am building the model from a huge Wikipedia data file, so I have to specify total_examples and total_words as parameters of train(). Gensim's tutorial says I can get the first one as total_examples=model.corpus_count, which is fine. But I don't know how to get the second one, total_words. I can see the total number of words in the last log line from model.build_vocab(), shown below, so I put the number in directly, like total_words=1304592715, but I'd like to specify it the same way as model.corpus_count.
Can someone tell me how to obtain the number?
Thank you,
:
2022-01-29 15:03:22,377 : INFO : PROGRESS: at example #1290000, processed 1253078267 words (6147969/s), 7881288 word types, 0 tags
2022-01-29 15:03:26,434 : INFO : PROGRESS: at example #1300000, processed 1277357579 words (5984975/s), 7959581 word types, 0 tags
2022-01-29 15:03:30,955 : INFO : collected 8039609 word types and 1309452 unique tags from a corpus of 1309452 examples and 1304592715 words
:
Similar to model.corpus_count, the tally of words from the last corpus provided to .build_vocab() should be cached in the model as model.corpus_total_words.
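As a minimal sketch of how that might look in practice: the helper below passes both cached counts straight into train(), so no number has to be copied from the log. The helper name train_from_corpus_file is hypothetical, and it assumes model is a gensim Doc2Vec instance that has not yet had its vocabulary built.

```python
def train_from_corpus_file(model, corpus_path, epochs=10):
    """Build the vocabulary from corpus_path, then train on the same file.

    Hypothetical helper: `model` is assumed to be a gensim Doc2Vec instance.
    Both counts are read back from the model after build_vocab() instead of
    being hard-coded.
    """
    # Scanning the corpus caches the document and word tallies on the model.
    model.build_vocab(corpus_file=corpus_path)

    model.train(
        corpus_file=corpus_path,
        total_examples=model.corpus_count,     # documents seen by build_vocab()
        total_words=model.corpus_total_words,  # words seen by build_vocab()
        epochs=epochs,
    )
    # Return the counts so the caller can compare them with the log output.
    return model.corpus_count, model.corpus_total_words
```

With gensim installed, you would call it with something like train_from_corpus_file(Doc2Vec(vector_size=100, workers=4), 'wiki.txt'), where the file name and constructor settings are placeholders. For the corpus in the question, the second returned value should match the 1304592715 words reported in the final build_vocab() log line.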