I am using multiprocessing in Python 3.7.
Some articles say that a good choice for the number of processes to use in a Pool is the number of CPU cores.
My AMD Ryzen CPU has 8 cores and can run 16 threads.
So, should the number of processes be 8 or 16?
import multiprocessing as mp
pool = mp.Pool( processes = 16 ) # since 16 threads are supported?
Q : "So, should the number of processes be 8 or 16?"
So, should the herd of to-be-distributed workloads be cache-reuse intensive (not memory-I/O bound), the SpaceDOMAIN-constraints rule, as the size of the cache-able data will play the cardinal role in deciding whether 8 or 16.
Why ?
Because the costs of memory-I/O are about a thousand times more expensive in the TimeDOMAIN, paying about 3xx - 4xx [ns] per memory-I/O, compared to 0.1 ~ 0.4 [ns] for in-cache data.
How to Make The Decision ?
Make a small-scale test before deciding on the production-scale configuration.
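A minimal sketch of such a small-scale test, assuming a purely illustrative cache-friendly kernel cache_heavy_work() (the function name, the chunk sizes and the candidate Pool sizes are all assumptions, not measurements); timing the very same job with 4, 8 and 16 processes on the actual hardware tells more than any rule of thumb:

import multiprocessing as mp
import time

def cache_heavy_work(chunk):
    # stand-in kernel: repeatedly touches the same small block of data,
    # so its working set can stay cache-resident
    total = 0
    for _ in range(200):
        total += sum(x * x for x in chunk)
    return total

def time_pool(n_processes, chunks):
    start = time.perf_counter()
    with mp.Pool(processes=n_processes) as pool:
        pool.map(cache_heavy_work, chunks)
    return time.perf_counter() - start

if __name__ == "__main__":
    chunks = [list(range(10_000))] * 64      # 64 work-items of identical size
    for n in (4, 8, 16):                     # candidate Pool sizes to compare
        print(n, "processes:", round(time_pool(n, chunks), 3), "[s]")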
So, should the herd of to-be-distributed workloads be dependent on network-I/O, or on some other remarkable (locally non-singular) source of latency, the TimeDOMAIN may benefit from a latency-masking trick: running 16, 160 or even 1600 threads (not processes in this case).
Why ?
Because the costs of doing the over-the-network I/O provide so much waiting time (a few [ms] of network-I/O RTT latency are time enough for about 1E7 ~ 10.000.000 uop-s per CPU-core, which is quite a lot of work) that smart interleaving pays off. Here even plain latency-masked, thread-based concurrent processing may fit, as the threads waiting for the remote "answer" from the over-the-network I/O need not fight for the GIL-lock: they have nothing to compute until they receive their expected I/O-bytes back, have they?
How to Make The Decision ?
Review the code to determine how many over-the-network I/O fetches and how many roughly cache-footprint-sized reads are in the game (in 2020/Q2+, L3-caches grew to tens of [MB]-s). For those cases where these operations repeat many times, do not hesitate to spin up one thread per each "slow" network-I/O target, as the processing will benefit from the masking of the "long" waiting times created just by that coincidence, at the cost of only a cheap ("fast") and, due to the "many" and "long" waiting times, rather sparse thread-switching, or even of the O/S-driven process-scheduler mapping full sub-processes onto a free CPU-core.
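A minimal sketch of that latency-masking trick, assuming purely illustrative "slow" targets fetched with urllib over HTTPS (the URL, the fetch() helper and the thread count are assumptions, not part of the original post); the threads spend almost all of their time waiting for the RTT, so the GIL is not a bottleneck here:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["https://example.com/"] * 32         # stand-in "slow" network-I/O targets

def fetch(url):
    # the thread blocks here, releasing the GIL while waiting for the reply
    with urllib.request.urlopen(url, timeout=10) as response:
        return len(response.read())

if __name__ == "__main__":
    # one thread per "slow" target lets each waiting time mask all the others
    with ThreadPoolExecutor(max_workers=len(URLS)) as executor:
        sizes = list(executor.map(fetch, URLS))
    print(sum(sizes), "bytes fetched from", len(URLS), "targets")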
So, should the herd of to-be-distributed workloads be some mix of the above cases, there is no other way than to experiment on the actual hardware and its local / non-local resources.
Why ?
Because there is no rule of thumb to fine-tune the mapping of the workload processing onto the actual CPU-core resources.
Still, one may easily find oneself having paid way more than one ever gets back.

The known trap
of achieving a SlowDown, instead of a (just wished-to-get) SpeedUp
In all cases, the overhead-strict, resources-aware and workload-atomicity-respecting revised Amdahl's Law identifies a point of diminishing returns, after which adding more workers (CPU-cores) will not improve the wished-to-get Speedup. Many surprises of getting S << 1 are described in Stack Overflow posts, so one may read as many examples of what not to do (learning by anti-patterns) as one may wish.
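A minimal sketch of such an overhead-strict speedup estimate, assuming the add-on costs of spawning, feeding and joining each worker can be expressed as a fraction of the original single-process run-time (the parameter values below are purely illustrative, not measurements of any real workload):

def revised_amdahl_speedup(p, n_workers, overhead_fraction):
    # p                 -- parallelisable fraction of the original run-time
    # n_workers         -- number of workers (processes / CPU-cores)
    # overhead_fraction -- add-on costs (spawn, IPC, join) per worker,
    #                      as a fraction of the single-process run-time
    serial   = 1.0 - p
    parallel = p / n_workers
    overhead = overhead_fraction * n_workers
    return 1.0 / (serial + parallel + overhead)

if __name__ == "__main__":
    # the speedup first grows, then passes the point of diminishing returns
    for n in (1, 2, 4, 8, 16, 32):
        print(n, "workers -> S =", round(revised_amdahl_speedup(0.95, n, 0.01), 2))

With these (made-up) numbers the estimated speedup peaks at about ten workers and decays beyond that, which is exactly the SlowDown-instead-of-SpeedUp trap described above.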