Tags: python, python-3.x, parallel-processing, multiprocessing, parallelism-amdahl

If my 8-core CPU supports 16 threads, would 16 be a better number than 8 for the number of processes in a Pool?


I am using multiprocessing in Python 3.7.

Some articles say that a good choice for the number of processes to use in a Pool is the number of CPU cores.

My AMD Ryzen CPU has 8 cores and can run 16 threads.

So, should the number of processes be 8 or 16?

import multiprocessing as mp
pool = mp.Pool( processes = 16 )         # since 16 threads are supported?

Solution

  • Q : "So, should the number of processes be 8 or 16?"


    If the herd of to-be-distributed workloads is cache-reuse intensive (not memory-I/O bound), the SpaceDOMAIN constraints rule, as the size of the cache-able data will play the cardinal role in deciding between 8 and 16.

    Why ?
    Because the costs of memory-I/O are about a thousand times more expensive in the TimeDOMAIN, paying about 300 ~ 400 [ns] per memory-I/O, compared to about 0.1 ~ 0.4 [ns] for in-cache data.

    How to Make The Decision ?
    Make a small-scale test before deciding on the production-scale configuration.
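
    A minimal sketch of such a small-scale test (the crunch() function and the batch sizing are placeholders only, to be replaced by the real per-task workload) might read:

import multiprocessing as mp
import time

def crunch(chunk_size):
    # placeholder CPU-bound, cache-friendly work - replace with the real per-task workload
    return sum(i * i for i in range(chunk_size))

def time_pool(n_procs, tasks):
    # wall-clock time of mapping the whole batch over a Pool of n_procs workers
    t0 = time.perf_counter()
    with mp.Pool(processes=n_procs) as pool:
        pool.map(crunch, tasks)
    return time.perf_counter() - t0

if __name__ == "__main__":
    tasks = [200_000] * 64                 # a small, representative batch of tasks
    for n in (8, 16):                      # physical cores vs. SMT hardware-threads
        print(f"{n:2d} processes: {time_pool(n, tasks):6.2f} [s]")

    Run it a few times and with realistic data-sizes; if the 16-process run is not measurably faster than the 8-process one, the extra SMT hardware-threads bring nothing for this cache-bound case.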


    If the herd of to-be-distributed workloads depends on network-I/O, or on some other remarkable (locally non-singular) source of latency, the TimeDOMAIN may benefit from a latency-masking trick: running 16, 160 or even 1600 threads (not processes in this case).

    Why ?
    Because the costs of over-the-network-I/O create so much waiting-time (a few [ms] of network-I/O RTT latency are time enough to do about 1E7 ~ 10.000.000 uops per CPU-core, which is quite a lot of work) that smart interleaving pays off. Here even simple, latency-masked, thread-based concurrent processing may fit, as the threads waiting for the remote "answer" from the over-the-network-I/O need not fight for the GIL-lock: they have nothing to compute until they receive their expected I/O-bytes back, have they?

    How to Make The Decision ?
    Review the code to determine how many over-the-network-I/O fetches and how many cache-footprint-sized reads are in the game (in 2020/Q2+, last-level caches grew to a few tens of [MB]). For those cases where these operations repeat many times, do not hesitate to spin up one thread per each "slow" network-I/O target, as the processing will benefit from the (just-by-coincidence created) masking of the "long" waiting-times, at the cost of only a cheap ("fast") and, given the "many" and "long" waiting-times, rather sparse thread-switching, or even of the O/S-driven process-scheduler mapping full sub-processes onto a free CPU-core.
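
    A thread-based sketch of that latency-masking trick (the URL list, the timeout and the thread-count are illustrative placeholders only), using the standard-library concurrent.futures, might look like this:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["https://example.com/"] * 100      # placeholder "slow" network-I/O targets

def fetch(url, timeout=10):
    # while blocked on the network the thread releases the GIL, so it competes with nobody
    with urlopen(url, timeout=timeout) as response:
        return len(response.read())

if __name__ == "__main__":
    # far more threads than CPU-cores is fine here: they mask latency, they do not compute
    with ThreadPoolExecutor(max_workers=100) as executor:
        sizes = list(executor.map(fetch, URLS))
    print(f"fetched {len(sizes)} responses")

    Each such thread spends nearly all of its life waiting, so even hundreds of them cost next to nothing in compute terms.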


    If the herd of to-be-distributed workloads is some mix of the above cases, there is no other way than to experiment on the actual hardware and its local / non-local resources.

    Why ?
    Because there is no rule of thumb to fine-tune the mapping of the workload processing onto the actual CPU-core resources.
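
    A hedged sketch of such an experiment (mixed_task() below merely stands in for the real, mixed workload) could simply sweep the worker-count and record the wall-clock times and the resulting speedups:

from concurrent.futures import ProcessPoolExecutor
import time

def mixed_task(n):
    # placeholder for whatever mix of compute and I/O is actually done per task
    return sum(i * i for i in range(n))

def measure(n_workers, tasks):
    # wall-clock time of processing the whole batch with n_workers processes
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        list(executor.map(mixed_task, tasks))
    return time.perf_counter() - t0

if __name__ == "__main__":
    tasks = [300_000] * 64
    baseline = measure(1, tasks)
    for n in (2, 4, 8, 12, 16):
        t = measure(n, tasks)
        print(f"{n:2d} workers: {t:6.2f} [s]   speedup {baseline / t:4.2f} x")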


    Still, one may easily find that one has paid way more than one ever gets back: the known trap of achieving a SlowDown, instead of the (just wished-to-get) SpeedUp.

    In all cases, the overhead-strict, resources-aware and workload-atomicity-respecting revision of Amdahl's Law identifies a point of diminishing returns, after which adding more workers (CPU-cores) will not improve the wished-to-get Speedup. Many surprises of getting S << 1 are described in Stack Overflow posts, so one may read as many cases of what not to do (learning by anti-patterns) as one may wish.
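
    For a quick feel of where that point of diminishing returns may sit, here is a minimal sketch, assuming a deliberately simplified, overhead-aware form of the speedup formula (the parallel fraction p and the overhead terms below are illustrative placeholders, not measurements):

def speedup(p, n_workers, o_setup=0.01, o_teardown=0.01):
    # overhead-aware estimate: serial part + add-on overheads + the split parallel part
    return 1.0 / ((1.0 - p) + o_setup + o_teardown + p / n_workers)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"N = {n:3d}  ->  S = {speedup(p=0.90, n_workers=n):5.2f}")
    # the curve flattens fast: past the point of diminishing returns, adding
    # more workers buys (almost) nothing, while all the overheads stay paid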