python, performance, numpy, multiprocessing, numexpr

Why does multi-processing slow down a nested for loop?


I have a large set of feature matrices stored in an array AFeatures that I am comparing against another array BFeatures using the Euclidean distance; both have shape (878, 2, 4, 15, 17, 512). I am trying to parallelise this process to speed up the comparison. I am using Python 3 in a Conda environment, and my original code uses an average of two CPU cores at 100%:

    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))
    
    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    per_slice_comparisons[i, j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

I have tried two approaches for speeding up the code.

  1. Using multi-processing

    from multiprocessing import Pool

    def fill_array(i):
        comparisons = np.zeros(shape=(878, 2, 4))

        for j in range(878):
            for k in range(2):
                for l in range(4):
                    comparisons[j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

        return comparisons

    pool = Pool(processes=6)
    list_start_vals = range(878)
    per_slice_comparisons = np.array(pool.map(fill_array, list_start_vals))
    pool.close()
    

This approach increases the run time by around 5%, even though all 8 CPU cores are now being used at 100%. I have tried a number of different process counts; the more processes there are, the slower it gets.

  2. This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

    import os
    os.environ['NUMEXPR_MAX_THREADS'] = '8'
    os.environ['NUMEXPR_NUM_THREADS'] = '4'
    import numexpr as ne

    def linalg_norm(a):
        sq_norm = ne.evaluate('sum(a**2)')
        return ne.evaluate('sqrt(sq_norm)')

    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))
    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    per_slice_comparisons[i, j, k, l] = linalg_norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])
    

However, for a nested for loop this approach increases total execution time by a factor of 3. I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically? If anyone has any ideas on how to fix this I would really appreciate it!


Solution

  • Why does multi-processing slow down a nested for loop in Python?

    Creating a process is a very expensive system operation. The operating system has to remap a lot of pages (program, shared libraries, data, etc.) so that the newly created processes can access those of the initial process. The multiprocessing package also uses inter-process communication to share the work between processes, which is slow too, not to mention the required final join operation. To be efficient (i.e. to reduce the overheads as much as possible), a Python program using the multiprocessing package should share a small amount of data and perform expensive computations. In your case, you do not need the multiprocessing package since you only use Numpy arrays (see later).
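
    As a side note, since everything here is a Numpy array, the whole comparison can also stay inside Numpy with no Python loop over i and j at all, by using the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b so that each (k, l) slice becomes one BLAS matrix product. The following is only a sketch of that idea; it assumes AFeatures and BFeatures are float arrays of the shape given in the question:

        import numpy as np

        D = 15 * 17 * 512
        A = AFeatures.reshape(878, 2, 4, D)   # each (i, k, l) slice becomes one vector
        B = BFeatures.reshape(878, 2, 4, D)

        per_slice_comparisons = np.empty((878, 878, 2, 4))
        for k in range(2):
            for l in range(4):
                a = A[:, k, l, :]                       # (878, D)
                b = B[:, k, l, :]                       # (878, D)
                # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j
                sq = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * (a @ b.T)
                np.maximum(sq, 0.0, out=sq)             # clamp tiny negative round-off
                per_slice_comparisons[:, :, k, l] = np.sqrt(sq)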

    This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

    Numexpr uses threads rather than processes, and threads are lightweight compared to processes (i.e. much cheaper to create). Numexpr also uses aggressive optimization to speed up the evaluated expression as much as possible (something CPython does not do).
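
    Numexpr only pays off when each ne.evaluate call processes a lot of data, because every call has a fixed cost (looking up the compiled expression, dispatching the worker threads). A small sketch of that usage pattern with the array names from the question (the euclidean_distance helper is just for illustration), fusing the subtraction, squaring and reduction into a single call instead of building a Numpy temporary and then calling evaluate twice:

        import numexpr as ne
        import numpy as np

        def euclidean_distance(a, b):
            # One fused evaluate call: no Python-level temporary for (a - b)
            # and only one round of per-call overhead.
            return np.sqrt(ne.evaluate('sum((a - b)**2)'))

        d = euclidean_distance(AFeatures[0, 0, 0, :], BFeatures[0, 0, 0, :])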

    I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically?

    The default implementation of Python is CPython, which is an interpreter. Interpreters are generally very slow (especially CPython), and CPython performs almost no optimization of your code. If you want fast loops, then you need alternatives that compile them to native code or JIT-compile them. You can use Cython or Numba for that. Both provide simple ways to parallelize your program. Using Numba is probably the simplest solution in your case. You can start by looking at the example programs.
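
    A sketch of what that Numba route could look like (the all_comparisons helper is just for illustration, and it assumes AFeatures and BFeatures are contiguous float64 arrays of the shape given in the question, reshaped so the norm becomes a plain 1D reduction); the loops are compiled to native code and the outer loop is parallelised with prange:

        import numpy as np
        import numba as nb

        @nb.njit(parallel=True, fastmath=True)
        def all_comparisons(A, B):
            n, K, L, D = A.shape
            out = np.empty((n, n, K, L))
            for i in nb.prange(n):          # parallel outer loop
                for j in range(n):
                    for k in range(K):
                        for l in range(L):
                            s = 0.0
                            for d in range(D):
                                diff = A[i, k, l, d] - B[j, k, l, d]
                                s += diff * diff
                            out[i, j, k, l] = np.sqrt(s)
            return out

        A = np.ascontiguousarray(AFeatures.reshape(878, 2, 4, -1), dtype=np.float64)
        B = np.ascontiguousarray(BFeatures.reshape(878, 2, 4, -1), dtype=np.float64)
        per_slice_comparisons = all_comparisons(A, B)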


    Update: if the implementation of Numpy is multithreaded, then the multiprocessing code can be much slower. Indeed, each process will create N threads on a machine with N cores, so N*N threads will run in total. This situation is called over-subscription and is known to be inefficient (due to preemptive multitasking and especially context switches). One way to check this hypothesis is simply to look at how many threads are created (e.g. using the hwloc tool on POSIX systems) or simply to monitor the processor usage.
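
    If you do want to keep the multiprocessing version, one way to avoid the over-subscription is to pin every library to a single thread before Numpy is imported, so that 6 worker processes really do mean 6 threads. This is only a sketch; the environment variables are the standard OpenMP/BLAS/numexpr knobs, not anything specific to this code:

        import os

        # Must run before 'import numpy' so the BLAS/OpenMP thread pools
        # are created with a single thread in every process.
        for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                    "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
            os.environ[var] = "1"

        import numpy as np  # now single-threaded in this process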