python nlp doc2vec tqdm

Problem with the tqdm function when training a Doc2Vec model


I am using this article https://actsusanli.medium.com/ to implement the Doc2Vec model and I have a problem in the training step.

model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs = 40)

As you can see, I am using the tqdm function. When I run the code, the tqdm bar reaches 100% after a few minutes, but the algorithm keeps running in the same shell for a long time afterwards.

Do you have any idea whether this is a problem with the tqdm function or something else?


Solution

  • By using the "list comprehension" ([..])...

    [x for x in tqdm(train_tagged.values)]
    

    ...you are having tqdm iterate once over your train_tagged.values sequence, collecting the items into an actual in-memory Python list. This will show the tqdm progress bar running to 100% rather quickly, after which tqdm is no longer involved at all.

    Then, you're passing that plain result list (without any tqdm features) into Doc2Vec.train(), where Doc2Vec does its epochs=40 training passes. tqdm is no longer involved, so there'll be no incremental progress-bar output.
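
    To make that ordering concrete, here is a small sketch that uses only the standard library. Both progress() and fake_train() are hypothetical stand-ins (for tqdm and for Doc2Vec.train() respectively), invented purely for illustration; they only record when each stage starts and ends.

```python
events = []


def progress(iterable):
    """Hypothetical stand-in for tqdm: notes when the wrapped pass starts and ends."""
    events.append("progress start")
    for item in iterable:
        yield item
    events.append("progress done")


def fake_train(corpus, epochs):
    """Hypothetical stand-in for Doc2Vec.train(): makes `epochs` passes over the corpus."""
    events.append("train start")
    for _ in range(epochs):
        for _ in corpus:
            pass
    events.append("train done")


docs = ["a", "b", "c"]

# The list comprehension runs to completion BEFORE fake_train() is even called,
# so the wrapper has finished by the time "training" begins.
fake_train([x for x in progress(docs)], epochs=3)

print(events)
```

    The recorded order shows the wrapper finishing its single pass before training starts, which matches a bar that hits 100% while the real work is still ahead.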

    You might be tempted to try (or have already tried) something that skips the extra list creation, passing the tqdm-wrapped sequence directly in like:

    corpus = utils.shuffle(train_tagged.values)
    model_dbow.train(tqdm(corpus), total_examples=len(corpus), epochs = 40)
    

    But this has a different problem: the tqdm-wrapper is only designed to allow (& report the progress of) one iteration over the wrapped sequence. So this will show that one iteration's incremental progress.

    But when .train() tries its next necessary 39 re-iterations, to complete its epochs=40 training-runs, the single-pass tqdm object will be exhausted, preventing full & proper training.

    Note that there is an option for progress-logging within Gensim, by setting the Python logging level (globally, or just for the class Doc2Vec) to INFO. Doc2Vec will then emit a log-line showing progress, within each epoch and between epochs, about every 1 second. You can also make such logging less frequent by supplying a different seconds value to the optional report_delay argument of .train(), for example report_delay=60 (for a log line every minute instead of every second).
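
    A minimal sketch of that logging setup, using only the standard logging module. It assumes Gensim's loggers are named after its module paths (so that they all sit under the "gensim" namespace); the format string is just one common choice. The .train() call is left as a comment because model_dbow and corpus come from your own code.

```python
import logging

# Attach a handler to the root logger. Keeping the root at WARNING means
# other libraries stay quiet.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.WARNING,
)

# Lower the threshold only for loggers under the "gensim" namespace
# (assumption: Gensim names its loggers after its module paths).
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)

# Hypothetical training call, reusing the names from the question:
# model_dbow.train(corpus, total_examples=len(corpus), epochs=40, report_delay=60)
```

    To turn on INFO logging globally instead, passing level=logging.INFO to basicConfig() would do it, at the cost of more output from other libraries.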

    If you really want a progress-bar, it should be possible to use tqdm - but you will have to work around its assumption that the iterable you're wrapping with tqdm() will only be iterated over once.

    I believe there'd be two possible approaches, each with different tradeoffs:

    (1) Instead of letting .train() repeat the corpus N times, do it yourself - adjusting the other .train() parameters accordingly. Roughly, that'd mean changing a line like...

    model.train(corpus, total_examples=len(corpus), epochs=40)
    

    ...into code that makes your desired 40 epochs look like just one iteration to both tqdm & Gensim's .train(), like...

    import itertools

    # 40 back-to-back passes over corpus, presented as one long single-use iterator
    repeated_corpus = itertools.chain(*[corpus]*40)
    repeated_len = 40 * len(corpus)
    model.train(tqdm(repeated_corpus, total=repeated_len), total_examples=repeated_len, epochs=1)
    

    (Note that you now have to give tqdm a hint as to the sequence's length, because the one-time chained-iterator from itertools.chain() doesn't report its own length.)

    Then you'll get one progress-bar across the whole training run - which the model now sees as one pass over a larger corpus, but which ultimately involves the same 40 passes.

    You'll want to reinterpret any remaining log lines with this change in mind, and you'll lose a chance to install your own per-epoch callbacks via the model's end-of-epoch callback mechanism. (But, that's a seldom-used feature, anyway.)
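
    The chaining trick in approach (1) can be checked on its own, without Gensim or tqdm. This sketch uses a tiny made-up corpus and itertools.chain.from_iterable() with itertools.repeat(), which is an equivalent spelling of the chain(*[corpus]*40) call; it confirms that the chained iterator has no length of its own and can only be consumed once.

```python
import itertools

corpus = ["doc0", "doc1", "doc2"]  # made-up stand-in for the tagged documents
epochs = 40

# One long, single-use iterator that walks the corpus `epochs` times in a row.
repeated_corpus = itertools.chain.from_iterable(itertools.repeat(corpus, epochs))
repeated_len = epochs * len(corpus)

# The chained iterator can't report its own length, hence the explicit total.
has_len = hasattr(repeated_corpus, "__len__")

# Consuming it yields exactly epochs * len(corpus) items...
consumed = sum(1 for _ in repeated_corpus)

# ...and afterwards nothing is left for a second pass.
leftover = list(repeated_corpus)

print(has_len, consumed, repeated_len, leftover)
```

    That single-use behaviour is why epochs=1 is needed alongside it: the model must not expect to loop over this object again.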

    (2) Instead of wrapping the corpus with a single tqdm() (which can only show a progress-bar for one-iteration), wrap the corpus as a new fully-re-iterable object that itself will start a new tqdm() each time. For example, something like:

    from collections.abc import Iterable
    class TqdmEveryIteration(Iterable):
        def __init__(self, inner_iterable):
            self.inner_iterable = inner_iterable
        def __iter__(self):
            return iter(tqdm(self.inner_iterable))
    

    Then, using this new extra tqdm-adding wrapper, you should be able to do:

    corpus = utils.shuffle(train_tagged.values)
    model_dbow.train(TqdmEveryIteration(corpus), total_examples=len(corpus), epochs = 40)
    

    In this case, you should get one progress bar per epoch, because a new tqdm() wrapper will be started each training pass.
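
    As a rough, Gensim-free check of the per-pass behaviour, here is a variant of that wrapper that can be run on its own. The class name LabelledTqdmEveryIteration, its pass counter, and the per-pass desc label are my own additions for illustration; the fallback tqdm() function exists only so the sketch still runs where tqdm isn't installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: a no-op wrapper.
    def tqdm(iterable, **kwargs):
        return iterable


class LabelledTqdmEveryIteration:
    """Re-iterable wrapper that starts a fresh, labelled tqdm bar on every pass."""

    def __init__(self, inner_iterable):
        self.inner_iterable = inner_iterable
        self.passes_started = 0

    def __iter__(self):
        # Each call to iter() means a new pass, so it gets a new progress bar.
        self.passes_started += 1
        return iter(tqdm(self.inner_iterable, desc=f"pass {self.passes_started}"))

    def __len__(self):
        return len(self.inner_iterable)


docs = ["a", "b", "c"]
wrapped = LabelledTqdmEveryIteration(docs)

# Simulate three training passes: every pass sees the full corpus again.
seen = [list(wrapped) for _ in range(3)]
print(wrapped.passes_started, seen)
```

    Passing an instance of this variant to .train() in place of TqdmEveryIteration(corpus) should behave the same way, with each bar labelled by its pass number, though like the original it is untested against Gensim itself.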

    (If you try either of these approaches & they work well, please let me know! They should be roughly correct, but I haven't tested them yet.)

    Separately: if the article from the author at actsusanli.medium.com that you're modeling your work on is...

    https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4

    ...note that it's using an overly-complex & fragile anti-pattern, calling .train() multiple times in a loop with manual alpha management. That has problems as described in this other answer. But that approach also has the side-effect of re-wrapping the corpus in a new tqdm on each call (like the TqdmEveryIteration class above), so despite its other issues, it does produce one actual progress-bar per call to .train().

    (I sent the author a private note via Medium about a month ago about this problem.)