I am training a large word2vec model with gensim and using logging to follow the training process. The log shows lines like:

PROGRESS: at sentence #3060000, processed 267654284 words, keeping 940042 word types

What are these word types? The unique words among the 200M+ tokens in the data? I cannot find anything in the documentation.
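For context, the kind of setup that produces these lines looks roughly like this (the corpus and parameters below are placeholders for my real data):

```python
# Minimal sketch of gensim training with INFO-level logging enabled;
# the corpus and parameters are stand-ins, not the real data.
import logging

from gensim.models import Word2Vec

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,  # INFO level is what surfaces the vocabulary-survey PROGRESS lines
)

sentences = [["hello", "word2vec"], ["hello", "gensim"]]  # stand-in corpus
model = Word2Vec(sentences, min_count=1)  # vocabulary survey + training, with progress logged
```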
Yes, this is reporting progress during the initial vocabulary survey, and "word types" is the log's term for the distinct words (unique tokens) discovered so far, as opposed to the running total of word tokens processed.
During the scan, this will be a precise count of the unique tokens encountered, unless you're using the max_vocab_size parameter, which can trigger mid-scan purging of rarer tokens. (I strongly recommend against using the max_vocab_size setting unless there's no way to proceed without it, because of the non-intuitive effects it has on the survey's running count and final vocabulary size.)
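As a toy illustration (not the original corpus), the "word types" figure tracks unique words rather than the running token total:

```python
# Toy example: "word types" counts unique words, while "processed ... words"
# counts the running total of tokens. Corpus here is purely illustrative.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]
total_tokens = sum(len(s) for s in sentences)            # 12 word tokens
unique_words = len({w for s in sentences for w in s})    # 7 word types

model = Word2Vec(min_count=1)   # max_vocab_size left at its default, so no mid-scan pruning
model.build_vocab(sentences)    # the survey step that logs "keeping ... word types"
print(total_tokens, unique_words, len(model.wv))         # 12 7 7 (gensim 4.x API)
```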
At the end of the scan, there will also be a report of the final unique count, then the unique count remaining after your min_count is applied.
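For example, a small sketch of that end-of-scan min_count trim (toy data, gensim 4.x attribute names assumed):

```python
# Toy sketch of the end-of-scan min_count trim: words seen fewer than
# min_count times are dropped from the surviving vocabulary.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

model = Word2Vec(min_count=2)   # keep only words appearing at least twice
model.build_vocab(sentences)    # logs the raw unique count, then the post-min_count count
print(sorted(model.wv.key_to_index))  # ['sat', 'the']
```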
If you want to place a hard cap on the known vocabulary, for example to limit the size of your model during training, the max_final_vocab parameter can be used. (It trims to exactly the most-frequent N words at the end of the full scan, rather than applying the larger interim mid-scan cullings that max_vocab_size can trigger.)
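And a minimal sketch of capping the surviving vocabulary with max_final_vocab (again toy data; which words survive depends on their frequencies):

```python
# Toy sketch of max_final_vocab: after the full scan, the vocabulary is trimmed
# to at most the N most frequent words (here N=3; gensim 4.x attribute names).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

model = Word2Vec(min_count=1, max_final_vocab=3)  # hard cap of 3 surviving words
model.build_vocab(sentences)
print(sorted(model.wv.key_to_index))  # e.g. ['on', 'sat', 'the'], the 3 most frequent
```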