python, for-loop, parallel-processing, numba

How to optimise parallelisation of nested numba loops


I'm having issues figuring out how to optimise the parallelisation of nested loops with numba. As a basic example I've written a 2D convolution algorithm which relies on multiple nested loops. Obviously there are better options for calculating a 2D convolution; this is just for illustration. As you can see in the code below, I've written three versions of this algorithm, all of which are identical except for the loop implementation:

  1. convolve_numba_0p: Every loop is using the standard Python range function
  2. convolve_numba_2p: Only the image height and width use numba's prange function
  3. convolve_numba_allp: Every loop except the image-channel one uses numba's prange function
import numba as nb
import numpy as np
import time


@nb.njit(parallel=True)
def convolve_numba_0p(image, kernel, out):
    for img_b in range(image.shape[0]):  # batch
        for k_c in range(kernel.shape[0]):  # kernel channel
            for img_c in range(image.shape[1]):  # image channel
                for out_i in range(out.shape[2]):  # image height
                    for out_j in range(out.shape[3]):  # image width
                        for k_i in range(kernel.shape[1]):  # kernel height
                            for k_j in range(kernel.shape[2]):  # kernel width
                                out[img_b, k_c, out_i, out_j] += (
                                    image[img_b, img_c, out_i + k_i, out_j + k_j] * kernel[k_c, k_i, k_j]
                                )
    return out


@nb.njit(parallel=True)
def convolve_numba_2p(image, kernel, out):
    for img_b in range(image.shape[0]):  # batch
        for k_c in range(kernel.shape[0]):  # kernel channel
            for img_c in range(image.shape[1]):  # image channel
                for out_i in nb.prange(out.shape[2]):  # image height
                    for out_j in nb.prange(out.shape[3]):  # image width
                        for k_i in range(kernel.shape[1]):  # kernel height
                            for k_j in range(kernel.shape[2]):  # kernel width
                                out[img_b, k_c, out_i, out_j] += (
                                    image[img_b, img_c, out_i + k_i, out_j + k_j] * kernel[k_c, k_i, k_j]
                                )
    return out


@nb.njit(parallel=True)
def convolve_numba_allp(image, kernel, out):
    for img_b in nb.prange(image.shape[0]):  # batch
        for k_c in nb.prange(kernel.shape[0]):  # kernel channel
            for img_c in range(image.shape[1]):  # image channel
                for out_i in nb.prange(out.shape[2]):  # image height
                    for out_j in nb.prange(out.shape[3]):  # image width
                        for k_i in nb.prange(kernel.shape[1]):  # kernel height
                            for k_j in nb.prange(kernel.shape[2]):  # kernel width
                                out[img_b, k_c, out_i, out_j] += (
                                    image[img_b, img_c, out_i + k_i, out_j + k_j] * kernel[k_c, k_i, k_j]
                                )
    return out


if __name__ == "__main__":
    # Prepare variables
    shape_image = (1, 3, 10000, 10000)
    image = np.random.randint(0, 65535, shape_image).astype(np.float32)
    kernel = np.random.rand(2, 3, 3)
    kernel -= np.mean(kernel, axis=(1, 2))[:, None, None]
    padding = kernel.shape[1] // 2  # half the spatial kernel size
    image_padded = np.pad(image, ((0, 0), (0, 0), (padding, padding), (padding, padding)))

    # Precompile functions 
    _ = convolve_numba_0p(image_padded, kernel, out=np.zeros(shape_image))
    _ = convolve_numba_2p(image_padded, kernel, out=np.zeros(shape_image))
    _ = convolve_numba_allp(image_padded, kernel, out=np.zeros(shape_image))

    # Time functions 
    start = time.time()
    out_numba_0p = convolve_numba_0p(image_padded, kernel, out=np.zeros(shape_image))
    print(f"Numba 0p: {time.time()-start}")
    start = time.time()
    out_numba_2p = convolve_numba_2p(image_padded, kernel, out=np.zeros(shape_image))
    print(f"Numba 2p: {time.time()-start}")
    start = time.time()
    out_numba_allp = convolve_numba_allp(image_padded, kernel, out=np.zeros(shape_image))
    print(f"Numba allp: {time.time()-start}")

Times using different numbers of CPU threads:

# 1 thread
Numba 0p: 9.103097200393677
Numba 2p: 7.535120010375977
Numba allp: 6.83448052406311

# 2 threads
Numba 0p: 9.253679990768433
Numba 2p: 4.163503646850586
Numba allp: 6.80088472366333

# 10 threads
Numba 0p: 9.133044004440308
Numba 2p: 0.9814767837524414
Numba allp: 6.881485939025879

# 50 Threads
Numba 0p: 9.317977905273438 
Numba 2p: 0.5669231414794922
Numba allp: 6.787745952606201

As you can see, using a couple of prange loops improves performance significantly, but using it for all loops results in a performance drop so dramatic that it's almost as slow as using none. In fact, using prange everywhere seems to always give the same performance regardless of the number of threads available. I'm aware that more workers can mean more overhead, which could limit the benefits of more parallelisation, but considering that every operation is independent and that the number of operations is orders of magnitude higher than the number of threads, I would have expected at worst no significant difference compared to 2p.

Also, to reiterate: I'm very aware that relying only on nested loops is often not the best approach. The algorithm used as an example here is not the point; it is specifically meant to represent situations where nested loops are the only option.

My questions are:

Q1) What could be causing such a drop in performance for a higher number of nested prange loops?

Q2) Is there something wrong in my implementation of numba here? I'm not talking about the algorithm, but the implementation itself, e.g. wrong parameters.

Q3) Is there a better method to optimise the parallelisation of nested loops?


Solution

  • Q1) What could be causing such a drop in performance for a higher number of nested prange loops?

    Numba typically uses OpenMP under the hood by default if TBB is not available (otherwise TBB is preferred). With the OpenMP threading layer (I do not know about the other ones), nested prange loops are not collapsed (although OpenMP itself supports that). This means prange effectively applies only to the outermost parallel loop. If that loop does not provide enough parallelism, the worker threads will starve and the function will typically be slow. In convolve_numba_allp the outermost prange iterates over the batch dimension, which has only a single iteration here, so essentially one thread ends up doing all the work (you can check which loops were actually parallelised with the diagnostics snippet below).

    I think the reason why convolve_numba_allp is still a bit faster than convolve_numba_0p is that some of the bounds checks are likely optimised out, because prange changes the type of the loop iterator (see the documentation for more information about that).
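    As a quick way to verify this (not part of the original answer), Numba can report which prange loops were actually turned into parallel regions and which threading layer was selected. Both calls below only work after the function has been compiled and run at least once:

# Print Numba's parallel transformation report for the all-prange version,
# then the threading layer that was actually used (tbb, omp or workqueue).
convolve_numba_allp.parallel_diagnostics(level=2)
print(nb.threading_layer())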

  • Q2) Is there something wrong in my implementation of numba here? I'm not talking about the algorithm, but the implementation itself, e.g. wrong parameters.

    I think convolve_numba_allp theoretically has a race condition, so it can result in corrupted data (or even a crash of the CPython interpreter in the worst case). Not to mention that race conditions generally tend to slow the code down (due to shared access to the same cache lines). For example, multiple iterations of the k_i loop can access the same items of out at the same time. It is your job to ensure there are no race conditions in parallel loops (Numba cannot do that for you). Here, since there is a single batch and only the first loop is actually parallelised, it is not so critical in practice, but it is definitely something to fix (a race-free variant is sketched below).

    Besides, it does not make much sense for a parallel loop to contain a non-parallel loop that itself contains parallel loops (with the current definition of parallel loops in Numba, and more specifically its static scheduling).
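    As an illustration (my own sketch, not the original poster's code; the function name is hypothetical), one way to remove the race entirely is to parallelise only a loop whose iterations write to disjoint parts of out, accumulating into a local scalar:

@nb.njit(parallel=True)
def convolve_numba_no_race(image, kernel, out):  # hypothetical name, for illustration only
    for img_b in range(image.shape[0]):  # batch
        for k_c in range(kernel.shape[0]):  # kernel channel
            for out_i in nb.prange(out.shape[2]):  # image height: the only parallel loop
                for out_j in range(out.shape[3]):  # image width
                    acc = 0.0  # thread-private accumulator
                    for img_c in range(image.shape[1]):  # image channel
                        for k_i in range(kernel.shape[1]):  # kernel height
                            for k_j in range(kernel.shape[2]):  # kernel width
                                acc += image[img_b, img_c, out_i + k_i, out_j + k_j] * kernel[k_c, k_i, k_j]
                    out[img_b, k_c, out_i, out_j] += acc
    return out

    Each prange iteration only writes the row out[img_b, k_c, out_i, :], so no two threads ever touch the same output element.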

  • Q3) Is there a better method to optimise the parallelisation of nested loops?

    You can collapse the loops yourself, but you need to be careful about how you do it, since operations like modulo and division are expensive. This is not a problem if the collapsed loop contains another loop that is big enough to amortise the overhead of the modulo/division (see the sketch below).
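    For example, here is a minimal sketch of manual collapsing (my code, under the assumption that fusing the kernel-channel and image-height loops is acceptable; the function name is hypothetical):

@nb.njit(parallel=True)
def convolve_numba_collapsed(image, kernel, out):  # hypothetical name, for illustration only
    n_kc = kernel.shape[0]  # number of kernel channels
    n_h = out.shape[2]      # image height
    for img_b in range(image.shape[0]):  # batch (only 1 here, kept sequential)
        for flat in nb.prange(n_kc * n_h):  # collapsed (kernel channel, image height) loop
            k_c = flat // n_h  # recover the original indices once per collapsed iteration
            out_i = flat % n_h
            for img_c in range(image.shape[1]):  # image channel
                for out_j in range(out.shape[3]):  # image width
                    for k_i in range(kernel.shape[1]):  # kernel height
                        for k_j in range(kernel.shape[2]):  # kernel width
                            out[img_b, k_c, out_i, out_j] += (
                                image[img_b, img_c, out_i + k_i, out_j + k_j] * kernel[k_c, k_i, k_j]
                            )
    return out

    The division/modulo run only once per collapsed iteration while the inner loops perform tens of thousands of multiply-adds, so the overhead is negligible; each collapsed iteration also writes to a distinct out[img_b, k_c, out_i, :] row, so there is no race.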

    For embarrassingly parallel operations, the best solution is generally to ensure there is enough parallelism by collapsing enough loops. Collapsing the innermost loops can be pretty expensive.

    For more complex cases where the cache matters, the best strategy is platform dependent. Parameters like the sizes of the various CPU caches, the access pattern, the sizes of the arrays, or even the SIMD instruction set can significantly impact the best strategy. Parallelism experts (i.e. humans) are the best at solving such a problem. An auto-tuning strategy can provide a (nearly) optimal solution, assuming it can be generated from the target search space; in your case, I do not think it can. For example, changing the loop order and applying tiling can certainly improve performance significantly. Not to mention that specialising the function for the target kernel should also result in a big speed-up (thanks to loop unrolling); see the sketch below. Writing more efficient code should be easy, but finding the optimal code is far from simple.
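    As an illustration of that last point (again my own sketch, not the answer's code; it assumes the spatial kernel size is exactly 3x3), the innermost loops can be unrolled by hand so the nine kernel weights stay in local variables, which helps the compiler keep them in registers and vectorise the out_j loop:

@nb.njit(parallel=True)
def convolve_numba_3x3(image, kernel, out):  # hypothetical name, assumes a 3x3 kernel
    for img_b in range(image.shape[0]):  # batch
        for k_c in range(kernel.shape[0]):  # kernel channel
            # Load the nine weights of this kernel channel once
            k00, k01, k02 = kernel[k_c, 0, 0], kernel[k_c, 0, 1], kernel[k_c, 0, 2]
            k10, k11, k12 = kernel[k_c, 1, 0], kernel[k_c, 1, 1], kernel[k_c, 1, 2]
            k20, k21, k22 = kernel[k_c, 2, 0], kernel[k_c, 2, 1], kernel[k_c, 2, 2]
            for out_i in nb.prange(out.shape[2]):  # image height (parallel, race-free)
                for img_c in range(image.shape[1]):  # image channel
                    for out_j in range(out.shape[3]):  # image width
                        out[img_b, k_c, out_i, out_j] += (
                            image[img_b, img_c, out_i, out_j] * k00
                            + image[img_b, img_c, out_i, out_j + 1] * k01
                            + image[img_b, img_c, out_i, out_j + 2] * k02
                            + image[img_b, img_c, out_i + 1, out_j] * k10
                            + image[img_b, img_c, out_i + 1, out_j + 1] * k11
                            + image[img_b, img_c, out_i + 1, out_j + 2] * k12
                            + image[img_b, img_c, out_i + 2, out_j] * k20
                            + image[img_b, img_c, out_i + 2, out_j + 1] * k21
                            + image[img_b, img_c, out_i + 2, out_j + 2] * k22
                        )
    return out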

    Here, parallelising the batches is not a good idea because there is only 1 batch to compute. Parallelising the loop iterating over the kernel channels may not be a good idea either, since it will certainly result in a memory-bound execution on most platforms.