python, python-3.x, google-colaboratory, word-embedding, fasttext

How to get a progress bar for gensim.models.FastText.train()?


I have the following code to train a FastText embedding model.

from gensim.models import FastText
import time

embed_model = FastText(vector_size=meta_hyper['vector_size'],
                       window=meta_hyper['window'],
                       alpha=meta_hyper['alpha'],
                       workers=meta_hyper['CPU'])

embed_model.build_vocab(data)

start = time.time()

embed_model.train(data, total_examples=len(data), epochs=meta_hyper['epochs'])

I have a fairly large dataset (several million tokens), and I need to understand how close the model is to the end of training. What can I do?

I have tried using tqdm and searching the official documentation, but that did not help.


Solution

  • To get an accurate estimate of the time remaining, the easiest thing is to enable logging at the INFO level. For example, a simple 2-liner that does this globally is:

    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    

    Then, many Gensim functions (including .train()) will log their internal steps to the console, including progress reports during long operations.
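
    If you'd rather not raise the root logger to INFO globally, a slightly narrower variant (still plain standard-library logging, just scoping the INFO level to Gensim's own loggers) should also work:

    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.getLogger('gensim').setLevel(logging.INFO)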

    As training is an essentially uniform-cost operation across all ranges/epochs of your corpus, even just a few minutes' progress will generally be representative of the overall rate, and thus sufficient to (manually) project when a training session will end.

    For example, if it takes 5 minutes to get through 8% of your 1st training-epoch, and you've requested 10 epochs, then training should complete in about (5 minutes * 100%/8% * 10 epochs =) 625 minutes.
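
    To avoid redoing that arithmetic by hand, a small back-of-the-envelope helper along these lines (the function and argument names here are just illustrative, not anything from Gensim) can turn an observed rate into an estimate:

    def projected_total_minutes(elapsed_minutes, fraction_of_epoch_done, total_epochs):
        # Linear projection: assumes cost is roughly uniform across the corpus & epochs.
        minutes_per_epoch = elapsed_minutes / fraction_of_epoch_done
        return minutes_per_epoch * total_epochs

    projected_total_minutes(5, 0.08, 10)  # -> 625.0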

    (The main thing that might ruin such a linear projection is a corpus that is wildly different in text-size or token-diversity in different ranges – for example, if the first 8% of your corpus is all short docs with common words, while other ranges are all long docs with rarer words. But that'd also be bad for other reasons – the model optimization works best if the full variety of training data is evenly mixed throughout the full corpus. So if your data has any potential 'clumping' of texts by length, vocabulary, etc., a single pre-shuffle before training helps both training efficiency and predictable progress.)

    If there's too much logging, .train() has an optional report_delay parameter to specify the number of seconds to wait before each new progress-report.
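
    For example, to throttle the progress lines to roughly one per minute, reusing the variables from the question's code:

    embed_model.train(data, total_examples=len(data),
                      epochs=meta_hyper['epochs'],
                      report_delay=60.0)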

    Getting a true progress-bar leveraging other tools would be a bit trickier, as .train() generally needs a multiply-re-iterable Python sequence to run its configured number of epochs, while the most straightforward use of tqdm expects to show progress over a single pass of a single iterator. (It might be possible to hack around this with some corpus/parameter changes & custom iterable-wrapping, but I'm not sure such an approach wouldn't hit other gotchas.)
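
    That said, if epoch-level granularity is enough, Gensim's CallbackAny2Vec hook can drive a tqdm bar that ticks once per epoch. A rough sketch (untested against your exact setup) might look like:

    from gensim.models.callbacks import CallbackAny2Vec
    from tqdm import tqdm

    class EpochProgressBar(CallbackAny2Vec):
        """Advance a tqdm bar by one step at the end of every training epoch."""
        def __init__(self, total_epochs):
            self.bar = tqdm(total=total_epochs, desc='epochs')

        def on_epoch_end(self, model):
            self.bar.update(1)

        def on_train_end(self, model):
            self.bar.close()

    embed_model.train(data, total_examples=len(data),
                      epochs=meta_hyper['epochs'],
                      callbacks=[EpochProgressBar(meta_hyper['epochs'])])

    Note this only shows whole-epoch progress, not progress within an epoch, so the INFO-level logging above remains the better guide during a long first epoch.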