Tags: python, multithreading, numpy, htop

What am I setting when I limit the number of "threads"?


I have a fairly large codebase that uses the libraries numpy, scipy, sklearn and matplotlib. I need to limit its CPU usage so it does not consume all the available processing power on our computational cluster. Following this answer, I implemented the following block of code, which is executed as soon as the script starts:

import os

# Limit the thread pools of the various BLAS/linear-algebra backends.
# These variables are only read when the libraries are loaded, so this
# must run before numpy/scipy are imported.
parallel_procs = "4"
os.environ["OMP_NUM_THREADS"] = parallel_procs
os.environ["MKL_NUM_THREADS"] = parallel_procs
os.environ["OPENBLAS_NUM_THREADS"] = parallel_procs
os.environ["VECLIB_MAXIMUM_THREADS"] = parallel_procs
os.environ["NUMEXPR_NUM_THREADS"] = parallel_procs

My understanding is that this should limit the number of cores used to 4, but apparently this is not happening. This is what htop shows for my user and that script:

[screenshot of htop output]

There are 16 processes, 4 of which show CPU percentages above 100%. This is an excerpt of lscpu:

CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2

I am also using the multiprocessing library later in the code, where I set the same number of processes with multiprocessing.Pool(processes=4); a rough sketch of that setup follows. Without the block of code shown above, the script insisted on using as many cores as possible, apparently ignoring that limit entirely.
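For reference, the pool is set up roughly like this (worker_function is a stand-in for my actual per-task code):

import multiprocessing

def worker_function(task):
    # Stand-in for the real per-task work (numpy/scipy heavy).
    return task * task

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(worker_function, range(100))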

My questions, then: what exactly am I limiting when I use the code above, and how should I interpret the htop output?


Solution

  • (This might be better as a comment, feel free to remove this if a better answer comes up, as it's based on my experience using the libraries.)

    I had a similar issue when multiprocessing parts of my code. The numpy/scipy libraries appear to spin up extra threads for vectorised operations when they are compiled against BLAS or MKL (or when the conda channel you pulled them from bundled a BLAS/MKL library), to accelerate certain calculations.

    This is fine when your script runs in a single process: it will spawn threads up to the number specified by OPENBLAS_NUM_THREADS or MKL_NUM_THREADS (depending on whether you have a BLAS build or an MKL build; you can identify which with numpy.__config__.show()). But if you are explicitly using a multiprocessing.Pool, you likely want to control the number of workers through multiprocessing itself. In that case it makes sense to set n=1 (before importing numpy & scipy), or some other small number, so that you are not oversubscribing the machine:

    import os

    n = '1'
    # Keep each process single-threaded in the BLAS/OpenMP layer;
    # must be set before numpy/scipy are imported.
    os.environ["OMP_NUM_THREADS"] = n
    os.environ["MKL_NUM_THREADS"] = n
    

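    To check which backend your numpy build actually links against, you can print its build configuration (the exact output format varies between numpy versions):

    import numpy

    # Look for "openblas", "mkl" or "accelerate" in the
    # BLAS/LAPACK sections of the output.
    numpy.__config__.show()
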
    If you set multiprocessing.Pool(processes=4), it will use 4 processes with n threads each, i.e. 4*n threads in total. In your case, it seems you have a pool of 4 processes, each firing up 4 threads, hence the 16 python "processes" (htop lists each thread as a separate entry by default).
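
    Putting the two together, a minimal sketch looks like this (the worker body and matrix sizes are only illustrative; it assumes Linux's default fork start method, so the workers inherit the environment variables set in the parent):

    import os

    # Must happen before numpy is imported anywhere in the process.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

    import multiprocessing
    import numpy as np

    def worker(seed):
        # Each worker's BLAS calls are limited to 1 thread, so the pool
        # uses ~4 cores in total instead of 4 * (all available cores).
        rng = np.random.default_rng(seed)
        a = rng.standard_normal((1000, 1000))
        return float((a @ a).sum())

    if __name__ == "__main__":
        with multiprocessing.Pool(processes=4) as pool:
            print(pool.map(worker, range(8)))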

    In the htop output, 100% corresponds to one fully loaded logical CPU. Linux counts each hardware thread as a CPU (I might be wrong in the terminology here), so a process running 4 threads can show up to 400%. Whether it actually reaches that depends on the operations being performed (and on caching, as your machine looks hyperthreaded).

    So for the parts of your code that run numpy/scipy in a single process/single thread, you are better off setting a larger n, while for the multiprocessing sections a larger pool with a single (or small) n per worker works better. Unfortunately, if you are passing the limits in through environment variables, you can only set them once, at the beginning of your script, before the libraries are imported. If you want to change the limits dynamically, I saw in a numpy issues discussion somewhere that you should use threadpoolctl (I'll add a link if I can find it again); a sketch of its use follows.
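
    For example, something along these lines (threadpoolctl is a separate package, installed with pip install threadpoolctl; this follows its documented context-manager API, and the matrix size is arbitrary):

    import numpy as np
    from threadpoolctl import threadpool_limits

    a = np.random.randn(2000, 2000)

    # Temporarily cap every detected BLAS thread pool at 1 thread;
    # the previous limits are restored when the block exits.
    with threadpool_limits(limits=1, user_api="blas"):
        b = a @ a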