python gensim

What are the "word types" in gensim?


I am training a large word2vec model with gensim and using logging to follow the training process. The log shows that

PROGRESS: at sentence #3060000, processed 267654284 words, keeping 940042 word types

What are these word types? The unique words among the 200M+ tokens in the data? I cannot find anything in the documentation.


Solution

  • Yes. This line reports progress during the initial vocabulary survey, and "word types" is the logging's odd term for the unique word tokens discovered so far.

    During the scan, this is an exact count of the unique tokens encountered so far, unless you're using the max_vocab_size parameter, which can trigger mid-scan purging of rarer tokens. (I strongly recommend against using the max_vocab_size setting unless there's no way to proceed without it, because of its non-intuitive effects on the survey's running count and final vocabulary size.)

    At the end of the scan, there will also be a report of the final unique count, then the unique count after your min_count is applied.

    If you want to place a hard cap on the known vocabulary – for example to cap the size of your model during training – the max_final_vocab parameter can be used. (It trims to exactly the N most frequent words, and only at the end of the full scan, rather than applying the larger interim culls that max_vocab_size can trigger mid-scan.)
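To make the terminology concrete, here is a minimal pure-Python sketch of the bookkeeping described above. It is not gensim's code: the function name survey_vocab and the toy corpus are made up for illustration, and the parameters min_count and max_final_vocab only mimic the behaviour of the gensim settings of the same name as described in this answer. "Words" are all tokens seen, "word types" are the distinct ones.

```python
from collections import Counter


def survey_vocab(sentences, min_count=1, max_final_vocab=None):
    """Toy vocabulary survey (hypothetical helper, not gensim's implementation)."""
    counts = Counter()
    total_tokens = 0
    for sentence in sentences:
        counts.update(sentence)        # per-word frequency
        total_tokens += len(sentence)  # "processed N words"

    raw_types = len(counts)            # "keeping N word types"

    # After the full scan: drop words rarer than min_count.
    kept = {w: c for w, c in counts.items() if c >= min_count}

    # Optional hard cap: keep only the most frequent words.
    # Assumption of this sketch: ties at the cutoff are broken arbitrarily.
    if max_final_vocab is not None and len(kept) > max_final_vocab:
        kept = dict(Counter(kept).most_common(max_final_vocab))

    return total_tokens, raw_types, kept


corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "sat"],
]

total_tokens, raw_types, kept = survey_vocab(corpus, min_count=2, max_final_vocab=2)
print(total_tokens, "words,", raw_types, "word types,", len(kept), "kept")
```

Here the corpus has 15 tokens but only 7 distinct word types; min_count=2 leaves 4 of them, and the cap then keeps the 2 most frequent. In gensim itself, the corresponding settings are the min_count and max_final_vocab parameters mentioned above.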