multithreading machine-learning centos7 gensim doc2vec

Why is Doc2Vec slower with multiple cores rather than one?


I'm trying to train on multiple "documents" (here mostly log lines), and Doc2Vec takes longer if I specify more than one core (which I have).

My data looks like this:

print(len(train_corpus))

7930196
print(train_corpus[:5])

[TaggedDocument(words=['port', 'ssh'], tags=[0]),
 TaggedDocument(words=['session', 'initialize', 'by', 'client'], tags=[1]),
 TaggedDocument(words=['dfs', 'fsnamesystem', 'block', 'namesystem', 'addstoredblock', 'blockmap', 'update', 'be', 'to', 'blk', 'size'], tags=[2]),
 TaggedDocument(words=['appl', 'selfupdate', 'component', 'amd', 'microsoft', 'windows', 'kernel', 'none', 'elevation', 'lower', 'version', 'revision', 'holder'], tags=[3]),
 TaggedDocument(words=['ramfs', 'tclass', 'blk', 'file'], tags=[4])]
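
For reference, a corpus in this shape can be built along these lines (the two sample lines below are illustrative stand-ins for the real preprocessing):

from gensim.models.doc2vec import TaggedDocument

# two illustrative pre-tokenized log lines; the real corpus has ~7.9M entries
tokenized_logs = [['port', 'ssh'],
                  ['session', 'initialize', 'by', 'client']]

# each document gets a unique integer tag, matching the output above
train_corpus = [TaggedDocument(words=tokens, tags=[i])
                for i, tokens in enumerate(tokenized_logs)]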

I have 8 cores available:

print(os.cpu_count())

8

I am using gensim 4.1.2 on CentOS 7. Using this approach (stackoverflow.com/a/37190672/130288), it looks like my BLAS library is OpenBLAS, so I set OPENBLAS_NUM_THREADS=1 in my .bashrc (and it is visible from Jupyter, using !echo $OPENBLAS_NUM_THREADS).
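
The check from that answer boils down to inspecting numpy's build configuration, roughly:

import numpy

# shows which BLAS/LAPACK libraries numpy was linked against
numpy.__config__.show()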

This is my test code:

import time

from gensim.models.doc2vec import Doc2Vec

dict_time_workers = dict()
for workers in range(1, 9):
    # fresh model per worker count; timing covers training only, not vocab build
    model = Doc2Vec(vector_size=20,
                    min_count=1,
                    workers=workers,
                    epochs=1)
    model.build_vocab(train_corpus, update=False)
    t1 = time.time()
    model.train(train_corpus, epochs=1, total_examples=model.corpus_count)
    dict_time_workers[workers] = time.time() - t1

And the variable dict_time_workers is equal to:

{1: 224.23211407661438,
 2: 273.408652305603,
 3: 313.1667754650116,
 4: 331.1840877532959,
 5: 433.83785605430603,
 6: 545.671571969986,
 7: 551.6248495578766,
 8: 548.430994272232}

As you can see, the time taken increases instead of decreasing. The results seem similar with a larger epochs parameter. Nothing else is running on my CentOS 7 machine.

If I look at what's happening on my threads using htop, I see that the right number of threads is used for each training run. But the more threads are used, the lower each one's utilization (for example, with only one thread, 95% is used; with 2, both run at around 65% of their maximum; with 6 threads, around 20-25% ...). I suspected an IO issue, but iotop showed me that nothing bad is happening on the disk.

This question now seems related to this post: Not efficiently to use multi-Core CPU for training Doc2vec with gensim.


Solution

  • When getting no benefit from extra cores like that, it's likely that the BLAS library you've got installed is already configured to try to use all cores for every bulk array operation. That means that other attempts to engage more cores, like Gensim's workers specification, just increase contention overhead, as each worker thread's own BLAS callouts also try to use 8 threads.

    Depending on the BLAS library in use, its own propensity to use more cores can typically be limited by environment variables named something like OPENBLAS_NUM_THREADS and/or MKL_NUM_THREADS.

    If you set these to just 1 before your process launches, you may see different, and possibly better, multithreaded behavior.
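
    A minimal sketch, assuming the standard OpenBLAS/MKL variable names (they must be set before numpy/gensim are imported, since the BLAS thread pool is typically sized at library load time):

    import os

    # must run before numpy/gensim are imported: BLAS reads these at load time
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'

    from gensim.models.doc2vec import Doc2Vec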

    Note, though: 1 just restores the assumption that every worker-thread only ever engages a single core. Some other mix of BLAS-cores & Gensim-worker-threads might actually achieve the best training throughput & non-contending core-utilization.

    And, at least for Gensim workers, the actual thread-count value achieving the best throughput will vary based on other model parameters that influence the relative amount of calculation time in highly-parallelizable code blocks versus highly-contended ones, especially window, vector_size, & negative. And there's not really a shortcut to finding the best workers value except trial-and-error: observing reported training rates in logs over a few minutes of running. (Though: any rate observed in, say, minutes 2-4 of an abbreviated trial run should be representative of the training rate through the whole corpus over multiple epochs.)
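
    Gensim reports progress, including effective words/second rates, through standard Python logging; a minimal sketch to surface those rates during a trial run:

    import logging

    # Gensim emits periodic progress lines (with words/s rates) at INFO level
    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s',
        level=logging.INFO)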

    (For any system with at least 4 cores, the optimal value with a classic iterable corpus of TaggedDocuments is usually at least 3 and no more than the number of cores, but also rarely more than 8-12 threads, given other inherent sources of contention from both Gensim's approach to fanning out work among worker threads and the Python 'GIL'.)

    Other thoughts: