I am using this article https://actsusanli.medium.com/ to implement the Doc2Vec model, and I have a problem in the training step.
model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs = 40)
As you can see, I am using the tqdm function. When I run the code, the tqdm bar reaches 100% after a few minutes, but the algorithm still keeps running in the same shell for a long time afterwards.
Do you have any idea whether this is a problem with the tqdm function or something else?
By using a list comprehension ([x for x in tqdm(train_tagged.values)]), you are having tqdm iterate once over your train_tagged.values sequence, materializing it into an actual in-memory Python list. This will show the tqdm progress rather quickly – and then tqdm's involvement is completely finished.
Then, you're passing that plain result list (without any tqdm features) into Doc2Vec.train(), where Doc2Vec does its epochs=40 training passes. tqdm is no longer involved, so there'll be no incremental progress-bar output.
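A minimal sketch of that effect, with illustrative names not from your code:

from tqdm import tqdm

data = range(1_000_000)
materialized = [x for x in tqdm(data)]  # the bar runs to 100% right here
# materialized is now a plain list; whatever iterates over it afterwards
# produces no further tqdm output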
You might be tempted to try (or have already tried) something that skips the extra list creation, passing the tqdm-wrapped sequence directly in, like:

corpus = utils.shuffle(train_tagged.values)
model_dbow.train(tqdm(corpus), total_examples=len(corpus), epochs=40)
But this has a different problem: the tqdm wrapper is only designed to allow (& report the progress of) one iteration over the wrapped sequence, so this will show that one iteration's incremental progress. But when .train() tries its next necessary 39 re-iterations, to complete its epochs=40 training-runs, the single-pass tqdm object will be exhausted, preventing full & proper training.
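You can see that single-pass behavior in isolation with a one-shot generator underneath the wrapper (treat this as illustrative – exact behavior can vary with the wrapped object & tqdm version):

from tqdm import tqdm

wrapped = tqdm(x for x in [1, 2, 3])  # tqdm around a one-shot generator
print(list(wrapped))  # 1st pass consumes everything: [1, 2, 3]
print(list(wrapped))  # 2nd pass finds it exhausted: []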
Note that there is an option for progress-logging within Gensim, by setting the Python logging level (globally, or just for the Doc2Vec class) to INFO. Doc2Vec will then emit a log-line showing progress, within each epoch and between epochs, about every 1 second. But you can also make such logging less frequent by supplying a larger seconds value to the optional report_delay argument of .train(), for example report_delay=60 (for a log line every minute instead of every second).
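Enabling that logging looks something like this (a standard Python logging setup; the format string is just one common choice):

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

model_dbow.train(corpus, total_examples=len(corpus), epochs=40,
                 report_delay=60)  # a progress log-line about once a minute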
If you really want a progress-bar, it should be possible to use tqdm – but you will have to work around its assumption that the iterable you're wrapping with tqdm() will only be iterated over once.
I believe there'd be two possible approaches, each with different tradeoffs:
(1) Instead of letting .train() repeat the corpus N times, do it yourself, adjusting the other .train() parameters accordingly. Roughly, that'd mean changing a line like...

model.train(corpus, total_examples=len(corpus), epochs=40)

...into something that turns your desired 40 epochs into what looks like just one iteration, to both tqdm & Gensim's .train(), like...
import itertools

# chain 40 copies of the re-iterable corpus into one long one-shot sequence
repeated_corpus = itertools.chain(*[corpus] * 40)
repeated_len = 40 * len(corpus)
model.train(tqdm(repeated_corpus, total=repeated_len), total_examples=repeated_len, epochs=1)
(Note that you now have to give tqdm a hint as to the sequence's length, because the one-shot chained iterator from itertools.chain() doesn't report its own length.)
Then you'll get one progress-bar across the whole training – which the model now sees as one pass over a larger corpus, but which ultimately involves the same 40 passes over your data.
You'll want to reinterpret any remaining log lines with this change in mind, and you'll lose the chance to install your own per-epoch callbacks via the model's end-of-epoch callback mechanism. (But that's a seldom-used feature, anyway.)
(2) Instead of wrapping the corpus with a single tqdm() (which can only show a progress-bar for one iteration), wrap the corpus as a new fully-re-iterable object that itself starts a new tqdm() each time it's iterated. For example, something like:
from collections.abc import Iterable
from tqdm import tqdm

class TqdmEveryIteration(Iterable):
    def __init__(self, inner_iterable):
        self.inner_iterable = inner_iterable

    def __iter__(self):
        # start a fresh tqdm progress-bar for each new iteration
        return iter(tqdm(self.inner_iterable))
Then, using this new tqdm-adding wrapper, you should be able to do:

corpus = utils.shuffle(train_tagged.values)
model_dbow.train(TqdmEveryIteration(corpus), total_examples=len(corpus), epochs=40)

In this case, you should get one progress-bar per epoch, because a new tqdm() wrapper will be started for each training pass.
(If you try either of these approaches & they work well, please let me know! They should be roughly correct, but I haven't tested them yet.)
Separately: if the article from the author at actsusanli.medium.com that you're modeling your work on is...

...note that it's using an overly-complex & fragile anti-pattern: calling .train() multiple times in a loop with manual alpha management. That has problems as described in this other answer. But that approach would also have the side-effect of re-wrapping the corpus each time in a new tqdm (like the TqdmEveryIteration class above), so despite its other issues, it would achieve one actual progress-bar per call to .train().
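(That anti-pattern generally looks something like the following sketch – the loop count & alpha decrement here are illustrative, not necessarily the article's exact values:)

# fragile: manually looping over .train() calls & decaying alpha by hand
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]),
                     total_examples=len(train_tagged.values), epochs=1)
    model_dbow.alpha -= 0.002                 # manual learning-rate decay
    model_dbow.min_alpha = model_dbow.alpha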
(I sent the author a private note via Medium about a month ago about this problem.)