Tags: python, numpy, lapack, blas

Why doesn't linalg.solve use all available threads?


I would like to know why linalg.solve from NumPy isn't using all available threads to do its calculations.

I'm using it to solve a multidimensional system, in such a way that it should solve for a vector with 27 entries n·N·N times. Since it has to do the same type of calculation many times, I thought it would take advantage of all available threads, but that isn't what is actually happening. It only uses 16 threads, even if I change n; I have tested n = 300 and 400. I also tried to specify the thread count manually with threadpoolctl, but that only worked to decrease the number of running threads; I didn't see any difference in performance though (I have no idea why).
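For reference, capping the BLAS threads with threadpoolctl looks roughly like this (a minimal sketch; the limit of 8 is an arbitrary example, and T and k_star are the arrays from the test code below):

from threadpoolctl import threadpool_limits
import numpy as np

# Limit the threads used by the underlying BLAS/LAPACK (OpenBLAS here):
with threadpool_limits(limits=8, user_api='blas'):
    fout = np.transpose(np.linalg.solve(np.transpose(T), np.transpose(k_star)))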

The test code is

import numpy as np
from timeit import default_timer as timer

import os, psutil
process = psutil.Process(os.getpid())

###### Main time loop ##########################################################

n = 400
N = 300

s = timer()

k_star = np.random.rand(27*n*N*N).reshape((27,n,N,N))
T = np.random.rand(27*27*n*N*N).reshape((27,27,n,N,N))

t1 = timer() - s

print("time to allocate k_star and T = " + str(t1))



for time in range(0,1001):

    s = timer()

    # Solve the batched 27x27 systems; the transposes move the batch axes first
    fout = np.transpose(np.linalg.solve(np.transpose(T), np.transpose(k_star)))

    t = timer() - s

    print("time to calculate fout solving the system = " + str(t))


    print(time)
    
    if time % 1000 == 0:
        memory = process.memory_info().rss / 1e9
        print("RAM memory occupied = " + str(round(memory, 2)) + " GB")

It may be useful for an answer: the output from np.__config__.show() is

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL,AVX512_ICL

I would appreciate any help in understanding what is happening and how I can improve the performance of this code by using all available threads. Thanks in advance.


Solution

  • You use 4D/5D matrices with a pretty small number of items in the last dimensions. The thing is that the current implementation of NumPy is inefficient in this case.

    Indeed, linalg.solve internally calls solve (or the similar solve1, depending on the use-case), which linearizes the ND array so as to call dgesv many times in a sequential loop. You use the OpenBLAS implementation, which is able to compute dgesv in parallel with multiple threads. However, dgesv is a LAPACK primitive, not really a BLAS one. OpenBLAS is mainly a BLAS implementation, though it implements some LAPACK primitives; it is not meant to compute LAPACK primitives very efficiently. In fact, only a few libraries can do so (this is actually hard to do). Still, the problem here is that NumPy makes a lot of LAPACK calls and each dgesv call only lasts for a very short time. Thus, OpenBLAS cannot fully benefit from multicore processors, since it takes a significant time (i.e. at least a few microseconds) to share the work with other cores and wait for them to do it. In fact, the higher the number of cores, the higher the parallel overhead. After all, if you want to organize a party with 100 friends, it generally takes more time than with 1 friend, and if the party lasts for 10 minutes, the organization time will be much larger than the party time.

    With n = 40 and N = 30, here is the scheduling of the OpenBLAS threads on my i5-9600KF cores (only a portion lasting 0.5 ms, so we can see what is happening):

    [Image: schedule of the OpenBLAS threads across the cores]

    Each coloured slice represents the execution of an OpenBLAS thread; black slices are idle time. Most of the time is idle, and the main thread, located on core 1, does more work than the others because the computational work is not well balanced.

    If you want this computation to be fast in your use-case, it is better to use only 1 OpenBLAS thread and to run the different dgesv calls on multiple threads yourself, rather than trying to parallelize each dgesv call like OpenBLAS does by default. While NumPy could theoretically do that, NumPy is not meant to use multiple threads yet: the only parallel functions are mainly the ones of the BLAS/LAPACK libraries, not NumPy itself. A sketch of this approach follows.
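    Here is a minimal sketch of that approach. It assumes threadpoolctl is installed; solve_batched_parallel is a hypothetical helper name, and the whole idea relies on np.linalg.solve releasing the GIL while it is inside LAPACK, so the Python-level threads can actually run concurrently:

    import os
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor
    from threadpoolctl import threadpool_limits

    def solve_batched_parallel(T, k_star, workers=os.cpu_count()):
        # Shapes from the question: T is (27, 27, n, N, N), k_star is (27, n, N, N).
        # Transposing reverses the axes so each batch item is one 27x27 system:
        A = np.ascontiguousarray(np.transpose(T))       # (N, N, n, 27, 27)
        b = np.ascontiguousarray(np.transpose(k_star))  # (N, N, n, 27)
        batch_shape = b.shape[:-1]
        A = A.reshape(-1, 27, 27)
        b = b.reshape(-1, 27)
        out = np.empty_like(b)
        # Split the flat batch into one contiguous chunk of systems per worker:
        chunks = np.array_split(np.arange(b.shape[0]), workers)

        def work(idx):
            # Explicit trailing RHS axis avoids the vector/matrix ambiguity of solve:
            out[idx] = np.linalg.solve(A[idx], b[idx][:, :, None])[:, :, 0]

        # 1 OpenBLAS thread per dgesv call; the parallelism comes from our own threads:
        with threadpool_limits(limits=1, user_api='blas'):
            with ThreadPoolExecutor(max_workers=workers) as ex:
                list(ex.map(work, chunks))

        # Back to the (27, n, N, N) layout of fout in the question:
        return np.transpose(out.reshape(batch_shape + (27,)))

    With this, fout = solve_batched_parallel(T, k_star) should match the original np.transpose(np.linalg.solve(np.transpose(T), np.transpose(k_star))) line, while every core stays busy on its own chunk of small independent solves instead of all cores waiting on each tiny dgesv.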