Tags: numpy, performance, numba, openblas

Using `numba` to speed up vectorization on very large `numpy` arrays


I originally had some code that operates on very large arrays using for loops. I wanted to see if I could speed it up with numpy and numba, so I tried 4 incremental steps to make it faster.

Setup:

from timeit import timeit

import numba as nb
import numpy as np
import numpy.typing as npt

TIMESTAMPS = 3
REGIONS = 150
HEIGHT = 240
WIDTH = 320

TRIALS = 100

np.random.seed(0)

FRAME_MASKS = np.ascontiguousarray(
    np.random.randint(0, 2, (TIMESTAMPS, HEIGHT, WIDTH), dtype=bool)
)
REGION_MASKS = np.ascontiguousarray(
    np.random.randint(0, 2, (REGIONS, HEIGHT, WIDTH), dtype=bool)
)
REGION_AREAS = np.sum(REGION_MASKS, axis=(-1, -2))

# OLFR = "Overlap of Frames with Regions"

...

# Timing
test_params = (FRAME_MASKS, REGION_MASKS, REGION_AREAS)

# Run each function once first to compile it (although this is not trivial either...)
olfr_native_for_loops(*test_params)
olfr_fully_vectorized(*test_params)
olfr_with_numba(*test_params)
olfr_all_flags_numba(*test_params)

for prefix, func in [
    ["for_loops", lambda: olfr_native_for_loops(*test_params)],
    ["fully_vectorized", lambda: olfr_fully_vectorized(*test_params)],
    ["with_numba", lambda: olfr_with_numba(*test_params)],
    ["all_flags_numba", lambda: olfr_all_flags_numba(*test_params)],
]:
    speed = timeit(func, number=TRIALS)
    print(f"Done. {prefix} speed={speed:.4f}")

  1. Native for loops
def olfr_native_for_loops(
    frame_masks: npt.ArrayLike,  # T x H x W
    region_masks: npt.ArrayLike,  # R x H x W
    region_areas: npt.ArrayLike,  # R
):
    ratios = np.zeros((TIMESTAMPS, REGIONS))

    for t in range(TIMESTAMPS):
        for r in range(REGIONS):
            raw_intersection = frame_masks[t] & region_masks[r]
            intersection_area = np.sum(raw_intersection, axis=(-1, -2))
            ratios[t, r] = intersection_area / region_areas[r]

    return ratios
Done. for_loops speed=2.0005

It already has some vectorization, as I learned from StackOverflow - 38760898 (np-newaxis-with-numba-nopython), but the for loops are easy to vectorize, so I tried that next.

  2. Fully vectorized
def olfr_fully_vectorized(
    frame_masks: npt.ArrayLike,  # T x H x W
    region_masks: npt.ArrayLike,  # R x H x W
    region_areas: npt.ArrayLike,  # R
):
    raw_intersection = frame_masks[:, np.newaxis, ...] & region_masks[np.newaxis]
    intersection_area = np.sum(raw_intersection, axis=(-1, -2))
    ratios = intersection_area / region_areas
    return ratios
Done. fully_vectorized speed=2.4903

This is great, because it is clean, but it got slower. As mentioned in StackOverflow - 38760898, this isn't too surprising because the array is huge so it is probably getting evicted from the memory cache more easily.

So I want a solution that allows me to write vectorized code, but get the performance benefits of smaller array sizes that don't get evicted from the cache.

In my production example, I saw that CPU usage was half of what the "native python" solution had, so I assumed BLAS was not able to schedule enough threads to parallelize these operations (a separate issue), and I turned to numba.

  3. Regular numba
@nb.njit
def olfr_with_numba(
    frame_masks: npt.ArrayLike,  # T x H x W
    region_masks: npt.ArrayLike,  # R x H x W
    region_areas: npt.ArrayLike,  # R
):
    raw_intersection = frame_masks.reshape(TIMESTAMPS, 1, -1) & region_masks.reshape(
        1, REGIONS, -1
    )
    intersection_area = np.sum(raw_intersection, axis=-1)
    result = intersection_area / region_areas
    return result
Done. with_numba speed=14.7773

Okay, now it got even slower. Fine, I read more docs and learned that numba is not great with big arrays (StackOverflow - 57819913#7460739), and that I should enable numba's optimization flags.

  4. numba with flags
@nb.njit(parallel=True, fastmath=True, nogil=True, cache=True)
def olfr_all_flags_numba(
    frame_masks: npt.ArrayLike,  # T x H x W
    region_masks: npt.ArrayLike,  # R x H x W
    region_areas: npt.ArrayLike,  # R
):
    raw_intersection = frame_masks.reshape(TIMESTAMPS, 1, -1) & region_masks.reshape(
        1, REGIONS, -1
    )
    intersection_area = np.sum(raw_intersection, axis=-1)
    result = intersection_area / region_areas
    return result
Done. all_flags_numba speed=7.6477

That helped, but we're still worse than where we started.

In my production code I am able to make it faster by using `from multiprocessing.pool import ThreadPool` and sending out threads for all my slow "huge vectorization" operations, but why can't BLAS/numba do that for me automatically?
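
Roughly, the ThreadPool workaround looks something like this (a simplified sketch rather than my actual production code; the worker count is arbitrary and it splits the work by timestamp):

from multiprocessing.pool import ThreadPool

def olfr_threadpool_sketch(
    frame_masks: npt.ArrayLike,  # T x H x W
    region_masks: npt.ArrayLike,  # R x H x W
    region_areas: npt.ArrayLike,  # R
):
    # Handle one timestamp per task so each temporary array stays small
    def one_timestamp(t):
        raw_intersection = frame_masks[t, np.newaxis] & region_masks
        intersection_area = np.sum(raw_intersection, axis=(-1, -2))
        return intersection_area / region_areas

    # Threads help here because numpy typically releases the GIL in these array ops
    with ThreadPool(8) as pool:
        rows = pool.map(one_timestamp, range(TIMESTAMPS))

    return np.stack(rows)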

Thank you in advance!

PS: My AWS c5.4xlarge CPU stats are below (I've seen a bigger L3 cache make fully_vectorized go faster, but that's the point of this question: what library can automatically split the array so it can stay in the L1/L2 cache as much as possible)

CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             3400.948
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K

Solution

  • TL;DR: Numba generates significantly less efficient assembly code than Numpy, especially on Intel CPUs like yours (due to port 5 saturation). Numba also parallelizes the code poorly, since it seems to consider only the first axis for parallelization while it could operate on both the first and the second one (loop collapsing).


    Performance results and setup

    First things first, I am able to reproduce the Numba performance issue on my Debian Linux machine with an i5-9600KF CPU, ~40 GiB/s RAM, CPython 3.11.6, Numpy 1.24.2 and Numba 0.57.1. However, I am not able to reproduce the slower Numpy execution for the vectorized code. Here are the performance results:

    # Time per run (mean ± std. dev. of 7 runs, 10 loops each)
    olfr_native_for_loops:       15.5 ms ± 63.2 µs
    olfr_fully_vectorized:       15.3 ms ± 85.6 µs
    olfr_with_numba:             91.3 ms ± 54.6 µs
    olfr_all_flags_numba:        61.7 ms ± 111 µs
    

    Performance of the Numpy codes

    It already has some vectorization, as I learned from StackOverflow - 38760898 (np-newaxis-with-numba-nopython), but the for loops are easy to vectorize, so I tried that next.

    This is great, because it is clean, but it got slower. As mentioned in StackOverflow - 38760898, this isn't too surprising because the array is huge so it is probably getting evicted from the memory cache more easily.

    This is indeed totally right. I mentioned a similar issue in your previous post.

    However, the timings are about the same on my machine, because the additional memory pressure is not big enough there to slow the whole code down. We can see that by analyzing the RAM throughput of the two versions:

    olfr_native_for_loops:  1.0 GiB/s (99% loads,  1% stores)
    olfr_fully_vectorized:  6.4 GiB/s (69% loads, 31% stores)
    

    We can see that the vectorized implementation puts significantly more pressure on the RAM (6.4 times more). This is due to the large temporary arrays, as expected (they need to be stored and read back). This vectorized code can typically be slower on machines with a small RAM bandwidth (i.e. not on my machine). It also requires more memory, so it can be a problem on machines with a very small amount of memory. The first version, which computes chunks, should clearly be preferred over the latter (at least regarding performance).

    So I want a solution that allows me to write vectorized code, but get the performance benefits of smaller array sizes that don't get evicted from the cache.

    Unfortunately, Numpy cannot automatically merge operations in order to execute them in an efficient, interleaved way. Such an operation is sometimes called kernel merging. Modules like CuPy (a Numpy-like module for Nvidia GPUs) can do such a thing in simple cases. I think Dask can do it too (AFAIK Numpy is used as a backend for Dask, and Dask arrays are a grid of Numpy arrays). AFAIK, NumExpr can do that in simple cases too. For complex cases, it is a rather difficult research problem (I actually worked on this during my PhD thesis).
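
    For illustration, a Dask-based version of this computation could look roughly like the sketch below (assuming Dask is installed; the chunk size along the region axis is arbitrary and would need tuning):

    import dask.array as da

    def olfr_dask_sketch(frame_masks, region_masks, region_areas, region_chunk=16):
        # Chunk only along the region axis so each intermediate block stays small
        f = da.from_array(frame_masks, chunks=(frame_masks.shape[0], -1, -1))
        r = da.from_array(region_masks, chunks=(region_chunk, -1, -1))
        raw_intersection = f[:, np.newaxis] & r[np.newaxis]
        intersection_area = raw_intersection.sum(axis=(-1, -2))
        # compute() materializes the result, scheduling the blocks on a thread pool
        return (intersection_area / region_areas).compute()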


    Performance of the Numba code

    Fine, I read more docs and learned that numba is not great with big arrays (StackOverflow - 57819913#7460739)

    This tends to be right quite often. Indeed, Numpy is manually optimized to use SIMD instructions when the compiler fails to do that automatically. Numba does not do that (certainly for the sake of maintainability). Thus, Numpy tends to be faster when operating on large arrays. Numpy code is slow when operating on many small arrays or iterating over scalars because of the (quite big) Numpy function-call overhead and the interpreted CPython loops. This is why vectorization (i.e. calling C functions from Python) is generally a very good idea performance-wise. However, this simply does not apply to Numba: Numba loops are fast (as fast as C ones, and so as fast as the loops of the Numpy implementation). Since registers are faster than memory, it is actually often better to use loops in Numba. However, sometimes it is faster to operate on chunks (like the first code) because compilers can sometimes fail to generate SIMD instructions when the code does not operate on small arrays. The best is to check the generated code, but it is unfortunately not simple for non-experts to extract the assembly code of the executed sections and understand it correctly.
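
    For reference, it is at least easy to dump the assembly Numba generated once a function has been compiled, using the inspect_asm() method of the compiled dispatcher (interpreting the output is the hard part):

    # Dump the generated assembly for every compiled signature
    olfr_with_numba(*test_params)  # make sure it has been compiled at least once
    for signature, asm in olfr_with_numba.inspect_asm().items():
        print(signature)
        print(asm)  # thousands of lines of assembly per signature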

    I should enable numba's optimization flags.

    The added flags do not really solve the problem.

    In my production code I am able to make it faster by using `from multiprocessing.pool import ThreadPool` and sending out threads for all my slow "huge vectorization" operations, but why can't BLAS/numba do that for me automatically?

    This is actually the purpose of the parallel=True flag, but the automatic parallelization of code is often not very efficient. In fact, this is not specific to Numba: this approach tends not to be efficient, especially on machines with many cores. It is often better to manually parallelize a loop with prange and to avoid temporary arrays (while still using parallel=True, since otherwise prange is simply useless).

    Note that no BLAS is used here (BLAS libraries are generally used for matrix-matrix/matrix-vector/vector-vector multiplications, but not for such an operation, and especially not on integers).

    That helped, but we're still worse than where we started.

    Indeed.

    We can profile the code using objmode (the timing pattern is sketched after the measurements) to find out that the first two lines are slow and that the np.sum is the slowest part:

    raw_intersection computation:   41 ms
    intersection_area computation:  50 ms
    Final result computation:       <1 ms
    
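    The objmode timing pattern used for this breakdown is essentially the following sketch: we briefly drop back to object mode just to call time.perf_counter() between the steps.

    import time

    @nb.njit
    def olfr_with_numba_profiled(frame_masks, region_masks, region_areas):
        with nb.objmode(t0="f8"):
            t0 = time.perf_counter()
        raw_intersection = frame_masks.reshape(TIMESTAMPS, 1, -1) & region_masks.reshape(1, REGIONS, -1)
        with nb.objmode(t1="f8"):
            t1 = time.perf_counter()
        intersection_area = np.sum(raw_intersection, axis=-1)
        with nb.objmode(t2="f8"):
            t2 = time.perf_counter()
        result = intersection_area / region_areas
        print("raw_intersection:", t1 - t0, "np.sum:", t2 - t1)
        return result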

    My hypothesis is that the Numba code for np.sum is much slower because Numba generates significantly less efficient SIMD code in this case on most Intel CPUs, as opposed to Numpy. Unfortunately, this is difficult to prove since the generated assembly code is really huge (5000 lines), it is also complex with many generated variants, and it is not yet possible to use low-level profilers on the binaries generated by Numba (so I cannot know which variant is actually executed). On top of that, I have often found that logical binary AND operations are slow because Numba does not generate SIMD code but scalar code (or sometimes really inefficient SIMD code).

    That being said, here is the kind of loop generated by Numba for the np.sum (AT&T assembly syntax -- one of the dozens of generated variants):

    .LBB4_111:
        vmovdqu (%rcx), %ymm3
        vmovdqu 32(%rcx), %ymm4
        vmovdqu 64(%rcx), %ymm5
        vmovdqu 96(%rcx), %ymm6
        vperm2i128  $32, %ymm5, %ymm3, %ymm7
        vperm2i128  $32, %ymm6, %ymm4, %ymm8
        vperm2i128  $49, %ymm5, %ymm3, %ymm3
        vperm2i128  $49, %ymm6, %ymm4, %ymm4
        vpunpcklqdq %ymm8, %ymm7, %ymm5
        vpunpcklqdq %ymm4, %ymm3, %ymm6
        vpunpckhqdq %ymm8, %ymm7, %ymm7
        vpunpckhqdq %ymm4, %ymm3, %ymm3
        vpbroadcastb    (%rdx), %xmm4
        vpcmpeqb    %xmm2, %xmm4, %xmm4
        vpxor   %xmm1, %xmm4, %xmm4
        vpmovzxdq   %xmm4, %ymm4
        vpand   %ymm0, %ymm4, %ymm4
        vpaddq  %ymm4, %ymm5, %ymm5
        vpaddq  %ymm4, %ymm7, %ymm7
        vpaddq  %ymm4, %ymm6, %ymm6
        vpaddq  %ymm4, %ymm3, %ymm3
        vinserti128 $1, %xmm6, %ymm5, %ymm4
        vinserti128 $1, %xmm3, %ymm7, %ymm8
        vperm2i128  $49, %ymm6, %ymm5, %ymm5
        vperm2i128  $49, %ymm3, %ymm7, %ymm3
        vpunpcklqdq %ymm8, %ymm4, %ymm6
        vpunpcklqdq %ymm3, %ymm5, %ymm7
        vpunpckhqdq %ymm8, %ymm4, %ymm4
        vpunpckhqdq %ymm3, %ymm5, %ymm3
        vmovdqu %ymm7, 64(%rcx)
        vmovdqu %ymm3, 96(%rcx)
        vmovdqu %ymm6, (%rcx)
        vmovdqu %ymm4, 32(%rcx)
        subq    $-128, %rcx
        addq    $4, %rax
        jne .LBB4_111
    

    While this code uses SIMD units (the instructions starting with v here), it is especially slow on my machine for one reason: it converts 8-bit boolean items to 64-bit ones on the fly using the instructions vperm2i128, vpunpcklqdq and vpunpckhqdq, and this specific set of instructions is inefficient on many Intel CPUs. More specifically, this issue impacts at least Haswell/Broadwell CPUs and Skylake/KabyLake/CoffeeLake/CometLake CPUs, but significantly less IceLake and newer CPUs. Your CPU is a Skylake-based CPU like mine (CoffeeLake), so it is strongly impacted by this performance issue. To understand more precisely why, one should understand how Skylake-based CPUs execute instructions. All generated variants suffer from this same problem.

    Each Skylake CPU core can decode and execute multiple instructions in parallel in an out-of-order way. More specifically, it can execute several instructions like vpaddq at the same time thanks to superscalar execution, and each of these instructions operates on many items at once thanks to the SIMD units. Once decoded, instructions are scheduled on execution ports. Skylake-based cores have 8 execution ports. Most execution ports are specialized, so they can only execute a specific set of instructions. For example, port 4 can only store data to memory. Port 5 is the only one that can execute the instructions vperm2i128, vpunpcklqdq and vpunpckhqdq (which are massively used in the Numba-generated code of np.sum). Meanwhile, the vpaddq instructions can be executed on ports 0, 1 and 5. Note that things are actually much more complex than that, but I simplified a bit for the sake of clarity.

    The thing is, most instructions of the loop can only be executed on port 5, which is completely saturated. Indeed, in the above variant, 18 of the 36 instructions go to port 5. As a result, each iteration of the loop takes at least 18 cycles. In fact, this is the main bottleneck, so each iteration should actually take about this long (or a bit more, since the instruction scheduler is not perfect). Note that some other variants are even worse than this one.

    Numpy takes a completely different approach: it first converts the boolean array to 64-bit integers and then performs the sum on it pretty efficiently. Here is the code doing the sum:

    add          $0x1,%rax
    shl          $0x5,%rdi
    vmovdqu      (%rsi,%rdi,1),%xmm6
    vinserti128  $0x1,0x10(%rsi,%rdi,1),%ymm6,%ymm0
    vpaddq       %ymm0,%ymm1,%ymm1
    cmp          %rax,%r8
    jne          438
    
    

    The downside of this approach is that it tends to use significantly more memory and saturate the memory bandwidth. Thus, the Numpy code is more likely to be memory-bound. That being said, the Numpy conversion code is certainly also limited a bit by port 5, although the code generated by GCC is better than the one generated by Numba (LLVM). On my machine, the conversion is particularly expensive: it takes >50% of the execution time!


    Faster Numpy implementation

    The thing is, we do not need 64-bit integers for the sum. Indeed, we are adding booleans, so 32-bit integers are more than enough as long as we add fewer than ~2e9 boolean items (so as not to cause an overflow). Using 32-bit integers instead of 64-bit ones not only reduces the pressure on port 5 but also saves memory (both the amount of memory required and the bandwidth), not to mention it makes the computation more cache friendly due to the smaller amount of memory used! 16-bit integers should be even better, but note that they can only be used to sum up to 65535 boolean items (assuming the integer is unsigned, i.e. np.uint16), while the last two axes contain 240*320=76800 items.

    Based on this, we can conclude that we should use 32-bit integers in the Numpy code for better performance. This can be easily done with the dtype parameter of np.sum. Here is the corrected Numpy code:

    def olfr_faster_numpy_int32(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        ratios = np.zeros((TIMESTAMPS, REGIONS))
    
        for t in range(TIMESTAMPS):
            for r in range(REGIONS):
                raw_intersection = frame_masks[t] & region_masks[r]
                intersection_area = np.sum(raw_intersection, axis=(-1, -2), dtype=np.int32)
                ratios[t, r] = intersection_area / region_areas[r]
    
        return ratios
    

    While it should be even faster with 16-bit integers (using 2 reduction steps so as to avoid overflow), in practice it is not faster (due to Numpy overheads).
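
    For reference, the 2-step 16-bit reduction mentioned above could look like the following sketch (the per-row sums fit in np.uint16 since each row only has 320 items, and the 240 row sums are then accumulated with 32-bit integers):

    def olfr_faster_numpy_int16(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        ratios = np.zeros((TIMESTAMPS, REGIONS))

        for t in range(TIMESTAMPS):
            for r in range(REGIONS):
                raw_intersection = frame_masks[t] & region_masks[r]
                row_sums = np.sum(raw_intersection, axis=-1, dtype=np.uint16)  # each <= 320
                intersection_area = np.sum(row_sums, dtype=np.int32)           # <= 76800
                ratios[t, r] = intersection_area / region_areas[r]

        return ratios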

    Another approach is simply to use the highly optimized np.count_nonzero function instead of np.sum, since the inputs are booleans. It removes the bottlenecks related to the use of np.sum (there is no need to convert anything, and Numpy already does the above optimization internally). However, Numba does not optimize np.count_nonzero yet (and I do not expect it to be as efficient as Numpy, since the Numpy developers manually use SIMD intrinsics in their code to aggressively optimize this function, while Numba tends not to do that for the sake of portability and maintainability).

    Here is the modified code:

    def olfr_faster_numpy_nnz(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        ratios = np.zeros((TIMESTAMPS, REGIONS))
    
        for t in range(TIMESTAMPS):
            for r in range(REGIONS):
                raw_intersection = frame_masks[t] & region_masks[r]
                intersection_area = np.count_nonzero(raw_intersection)
                ratios[t, r] = intersection_area / region_areas[r]
    
        return ratios
    

    Faster Numba implementation

    The same thing applies to Numba. However, we can write more efficient code than the above Numpy version, since we are no longer limited by the overhead of the Numpy internals and we can also avoid creating expensive temporary arrays. All of this using plain basic loops, which Numba likes, just like native C code. On top of that, we can also use multiple threads easily in Numba, but we need to collapse the loops to get enough available parallelism.

    Here is an optimized Numba implementation:

    @nb.njit(parallel=True, cache=True)
    def olfr_faster_numba(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        # Prevent Numba from reading the global variables, which are assumed constant while they may not be in practice
        TIMESTAMPS = frame_masks.shape[0]
        REGIONS = region_masks.shape[0]
    
        ratios = np.zeros((TIMESTAMPS, REGIONS))
        h, w = frame_masks.shape[-2:]
        assert region_masks.shape[-2] == h and region_masks.shape[-1] == w
        assert frame_masks.dtype.type is np.bool_
        assert region_masks.dtype.type is np.bool_
    
        # Collapse the r and t loops so to get more parallelism
        for tr in nb.prange(TIMESTAMPS * REGIONS):
            t = tr // REGIONS
            r = tr % REGIONS
    
            # Significantly speed up the next loop by avoiding the inefficient boolean logical AND,
            # since Numba does not like it at all, partially because of lazy evaluation
            # (replace it with a binary AND, which is AFAIK what Numpy does internally)
            frame_masks_block = frame_masks[t].view(np.int8)
            region_masks_block = region_masks[r].view(np.int8)
    
            intersection_area = 0
            for y in range(h):
                # Use a 16-bit integer for the accumulator since 
                # it is better for SIMD units (faster)
                s = np.int16(0)
                for x in range(w):
                    s += frame_masks_block[y,x] & region_masks_block[y,x]
                intersection_area += s
    
            ratios[t, r] = intersection_area / region_areas[r]
    
        return ratios
    

    If you find this a bit too complicated, here is a trade-off solution that is simpler but a bit less efficient, inspired by the comment of @Aaron:

    @nb.njit(parallel=True, cache=True)
    def olfr_faster_numba_simple(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        # Prevent Numba from reading the global variables, which are assumed constant while they may not be in practice
        TIMESTAMPS = frame_masks.shape[0]
        REGIONS = region_masks.shape[0]
    
        ratios = np.zeros((TIMESTAMPS, REGIONS))
    
        for r in nb.prange(REGIONS):
            for t in range(TIMESTAMPS):
                raw_intersection = frame_masks[t] & region_masks[r]
                intersection_area = np.sum(raw_intersection)
                ratios[t, r] = intersection_area / region_areas[r]
    
        return ratios
    

    It is (almost magically) relatively fast because Numba surprisingly generates completely different assembly code when np.sum does not have an axis argument! The generated code does not contain the 3 types of instructions saturating port 5 like in the initial implementation, so the main bottleneck just vanishes, resulting in significantly faster execution! It is still not as efficient as the above code, since the sum is performed using 64-bit integers (and Numba does not support the dtype parameter of np.sum yet).

    Actually, there is an even faster implementation than the one proposed, but it is a bit complex to implement, especially in Numba. One can convert the boolean arrays to bit-fields (using np.packbits), then use a binary AND so as to compute the ANDs 8 times faster, and finally use the popcnt x86-64 instruction to count the number of 1 bits and accumulate that into an 8-bit integer accumulator. This solution should be 8~10 times faster! Unfortunately, popcnt is AFAIK not available in Numba, so one may need to use a native language like C/C++ to do that. Interestingly, the next major version of Numpy (2.1) should introduce the new function bitwise_count, which does exactly the operation needed (i.e. popcnt). It will certainly take some time for that function to be integrated into Numba though.
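
    To give an idea of the bit-field approach, here is a plain-Numpy sketch using np.packbits together with a 256-entry popcount lookup table (a real implementation would use the hardware popcnt instruction instead of the table, which is why it should be much faster than this sketch):

    # Precomputed number of 1 bits for every possible byte value
    POPCOUNT_TABLE = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

    def olfr_packbits_sketch(
        frame_masks: npt.ArrayLike,  # T x H x W
        region_masks: npt.ArrayLike,  # R x H x W
        region_areas: npt.ArrayLike,  # R
    ):
        # Pack each H*W boolean plane into a bit-field (1 bit per item, 8 times smaller)
        packed_frames = np.packbits(frame_masks.reshape(TIMESTAMPS, -1), axis=-1)
        packed_regions = np.packbits(region_masks.reshape(REGIONS, -1), axis=-1)

        ratios = np.zeros((TIMESTAMPS, REGIONS))
        for t in range(TIMESTAMPS):
            for r in range(REGIONS):
                anded = packed_frames[t] & packed_regions[r]
                intersection_area = POPCOUNT_TABLE[anded].sum()
                ratios[t, r] = intersection_area / region_areas[r]

        return ratios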

    Note that Numba compiles functions lazily at runtime, so the first call may be slow. You can specify the signature to compile the function eagerly. However, this requires knowing the exact types of the arrays (i.e. the number of dimensions and the item types, as well as whether the arrays are contiguous).
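
    For example, an eagerly-compiled variant of the simple implementation could be declared as in the sketch below (the signature is an assumption: it expects C-contiguous boolean masks and int64 areas, which matches the setup of the question on a 64-bit Linux machine):

    @nb.njit(
        "float64[:, :](boolean[:, :, ::1], boolean[:, :, ::1], int64[::1])",
        parallel=True,
        cache=True,
    )
    def olfr_faster_numba_simple_eager(frame_masks, region_masks, region_areas):
        timestamps = frame_masks.shape[0]
        regions = region_masks.shape[0]
        ratios = np.zeros((timestamps, regions))

        for r in nb.prange(regions):
            for t in range(timestamps):
                raw_intersection = frame_masks[t] & region_masks[r]
                ratios[t, r] = np.sum(raw_intersection) / region_areas[r]

        return ratios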


    Discussion

    I do not expect any magical tool to write the optimized Numba code for you. Besides, merging vectorized Numpy operations was actually a bad idea performance-wise with 64-bit integers on Intel CPUs. I do not expect any tool to write the above Numba code or to generate similar assembly code. In fact, I expect such a magical tool not to solve this kind of performance issue and to hide it further by merging kernels, making profiling even harder. In general, low-level details matter a lot for performance, but high-level tools very often miss them (it is too complex to consider both the many things that can happen on all available architectures and the specialized code one can write for the specific operation you want to perform).


    Final performance results

    # Time per run (mean ± std. dev. of 7 runs, 10 loops each)
    olfr_native_for_loops:         15.5 ms ±  63.2 µs
    olfr_fully_vectorized:         15.3 ms ±  85.6 µs
    olfr_with_numba:               91.3 ms ±  54.6 µs
    olfr_all_flags_numba:          61.7 ms ± 111.0 µs
    
    # Time per run (mean ± std. dev. of 7 runs, 100 loops each)
    olfr_faster_numpy_int32:       9.02 ms ± 11.3 µs
    olfr_faster_numpy_nnz:         4.17 ms ±  3.3 µs
    olfr_faster_numba_simple:      0.86 ms ± 26.2 µs
    olfr_faster_numba:             0.66 ms ± 20.5 µs           <----------
    
    Optimal time (memory-bound):   0.28 ms
    

    The final optimized Numba implementation clearly outperforms all the other implementations by a large margin. It is 92 times faster than the initial parallel Numba implementation and 23 times faster than the fastest initial Numpy version. This is mostly because it uses multiple threads, but the sequential version takes about 3.8 ms on my machine, which is still faster than all the other implementations, though olfr_faster_numpy_nnz is pretty good (since it is sequential). The parallel Numba implementation scales very well.