So I have this function:
import numpy as np
import numba as nb


@nb.njit(cache=True, parallel=True, nogil=True)
def triangle_half_UR_LL(size: int, swap: bool = False) -> tuple[np.ndarray, np.ndarray]:
    total = (size + 1) * size // 2
    x_coords = np.full(total, 0, dtype=np.uint16)
    y_coords = np.full(total, 0, dtype=np.uint16)
    offset = 0
    side = np.arange(size, dtype=np.uint16)
    for i in nb.prange(size):
        offset = i * size - (i - 1) * i // 2
        end = offset + size - i
        x_coords[offset:end] = i
        y_coords[offset:end] = side[i:]

    return (x_coords, y_coords) if not swap else (y_coords, x_coords)
What it does is not important; the point is that it is JIT-compiled with Numba and therefore very fast:
In [2]: triangle_half_UR_LL(10)
Out[2]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9], dtype=uint16),
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4,
5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8,
9, 6, 7, 8, 9, 7, 8, 9, 8, 9, 9], dtype=uint16))
In [3]: %timeit triangle_half_UR_LL(1000)
166 μs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [4]: %timeit triangle_half_UR_LL(1000)
166 μs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [5]: %timeit triangle_half_UR_LL(1000)
166 μs ± 506 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Now, if I define another function and JIT-compile it with Numba, the performance of the fast function inexplicably drops:
In [6]: @nb.njit(cache=True)
...: def dummy():
...: pass
In [7]: dummy()
In [8]: %timeit triangle_half_UR_LL(1000)
980 μs ± 20 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit triangle_half_UR_LL(1000)
976 μs ± 9.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit triangle_half_UR_LL(1000)
974 μs ± 3.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is real; I have reproduced this issue many times without fail. I start a new interpreter session, I paste the code, and it runs fast. Then I define the dummy function, call it, and the fast function inexplicably slows down.
I am using Windows 11, and I absolutely have no idea what the hell is going on.
Is there an explanation for this? And how can I prevent this issue?
Interestingly, if I get rid of the nogil parameter without changing anything else, the problem magically goes away:
In [1]: import numpy as np
...: import numba as nb
...:
...:
...: @nb.njit(cache=True, parallel=True)
...: def triangle_half_UR_LL(size: int, swap: bool = False) -> tuple[np.ndarray, np.ndarray]:
...: total = (size + 1) * size // 2
...: x_coords = np.full(total, 0, dtype=np.uint16)
...: y_coords = np.full(total, 0, dtype=np.uint16)
...: offset = 0
...: side = np.arange(size, dtype=np.uint16)
...: for i in nb.prange(size):
...: offset = i * size - (i - 1) * i // 2
...: end = offset + size - i
...: x_coords[offset:end] = i
...: y_coords[offset:end] = side[i:]
...:
...: return (x_coords, y_coords) if not swap else (y_coords, x_coords)
In [2]: %timeit triangle_half_UR_LL(1000)
186 μs ± 47.9 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit triangle_half_UR_LL(1000)
167 μs ± 1.61 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [4]: %timeit triangle_half_UR_LL(1000)
166 μs ± 109 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [5]: @nb.njit(cache=True)
...: def dummy():
...: pass
In [6]: dummy()
In [7]: %timeit triangle_half_UR_LL(1000)
167 μs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [8]: %timeit triangle_half_UR_LL(1000)
166 μs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [9]: %timeit triangle_half_UR_LL(1000)
167 μs ± 624 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does this happen?
But no: if I define other functions, the first function somehow slows down again. The simplest way to reproduce the issue is simply to redefine it:
In [7]: dummy()
In [8]: %timeit triangle_half_UR_LL(1000)
168 μs ± 750 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [9]: import numpy as np
In [10]: %timeit triangle_half_UR_LL(1000)
167 μs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: import numba as nb
In [12]: %timeit triangle_half_UR_LL(1000)
167 μs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [13]: @nb.njit(cache=True, parallel=True)
...: def triangle_half_UR_LL(size: int, swap: bool = False) -> tuple[np.ndarray, np.ndarray]:
...: total = (size + 1) * size // 2
...: x_coords = np.full(total, 0, dtype=np.uint16)
...: y_coords = np.full(total, 0, dtype=np.uint16)
...: offset = 0
...: side = np.arange(size, dtype=np.uint16)
...: for i in nb.prange(size):
...: offset = i * size - (i - 1) * i // 2
...: end = offset + size - i
...: x_coords[offset:end] = i
...: y_coords[offset:end] = side[i:]
...:
...: return (x_coords, y_coords) if not swap else (y_coords, x_coords)
In [14]: %timeit triangle_half_UR_LL(1000)
1.01 ms ± 94.3 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit triangle_half_UR_LL(1000)
964 μs ± 2.02 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The slowdown also happens if I define the following function and call it:
@nb.njit(cache=True)
def Farey_sequence(n: int) -> np.ndarray:
    a, b, c, d = 0, 1, 1, n
    result = [(a, b)]
    while 0 <= c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        result.append((a, b))
    return np.array(result, dtype=np.uint64)
In [6]: %timeit triangle_half_UR_LL(1000)
166 μs ± 296 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [7]: %timeit Farey_sequence(16)
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
6.03 μs ± 5.72 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit Farey_sequence(16)
2.77 μs ± 50.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [9]: %timeit triangle_half_UR_LL(1000)
966 μs ± 6.48 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TL;DR: This mainly comes from the system allocator, whose behaviour depends on the current state of memory (which is hard to predict). When the function is fast, there are no page faults; when the function is slow, there seem to be a lot of page faults slowing down the master thread.
When a Numpy array is created, it requests some memory from the standard C library (a.k.a. libc), typically using the malloc function. For large allocations, the libc requests memory space from the OS. For small allocations, it can preallocate some memory space and recycle it so as to avoid asking the OS; indeed, such OS requests are pretty expensive. Moreover, the virtual memory pages requested from the OS are not physically mapped in RAM at the time the OS gives you the memory space. This is done lazily, when the memory is read or written (page by page). This is, by the way, why np.zeros and np.empty can be faster than np.full (see this post for more information about this). This principle is called overcommit. When a program reads/writes a page which is not yet mapped in RAM, the MMU triggers an interrupt so that the OS can do the actual mapping of the virtual page to a physical page in RAM. This is called a page fault and it is very expensive: not only are exceptions very expensive, but all modern OSes also write zeros to mapped pages during a page fault for security reasons (because the page could otherwise contain possibly sensitive data of other processes, like a browser's passwords).
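Here is a quick illustrative check of this lazy mapping (a sketch of mine, not a benchmark from the question; exact timings vary with the machine and the allocator state). Allocating a large array with np.zeros should be far cheaper than with np.full, because the latter writes every page and therefore pays the page faults immediately:
import time
import numpy as np

n = 100_000_000  # large enough that the libc forwards the request straight to the OS

t0 = time.perf_counter()
a = np.zeros(n, dtype=np.uint8)    # virtual pages reserved, not yet mapped in RAM
t1 = time.perf_counter()
b = np.full(n, 0, dtype=np.uint8)  # every page is written, so the faults are paid here
t2 = time.perf_counter()

print(f"np.zeros: {t1 - t0:.4f} s   np.full: {t2 - t1:.4f} s")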
The default libc allocator tends not to be conservative. I mean that it tends to give memory back to the OS and request it again. This means other processes can use the freed memory, but it makes processes that allocate frequently significantly slower. Depending on the amount of data allocated so far and the overall state of the cached pages, the allocator may or may not request more memory from the OS. Thus, page faults may or may not happen.
On top of that, when there is enough contiguous space in the preallocated memory buffers of the libc, the default allocator may or may not recycle a buffer that has already been used recently and is still stored in the CPU cache. When this happens, there is no need to fetch data from the slow DRAM: it is already there. This effect often shows up in loops that create temporary arrays and write/read them before (implicitly) deleting them.
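A small probe of this recycling behaviour (my own sketch; the outcome depends on the allocator and the current memory state, so it is not guaranteed): if consecutive temporary arrays of the same size report the same data address, the freed buffer was recycled rather than returned to the OS:
import numpy as np

addresses = set()
for _ in range(5):
    # Same size as the two output arrays of triangle_half_UR_LL(1000).
    tmp = np.full(500_500, 0, dtype=np.uint16)
    addresses.add(tmp.ctypes.data)  # address of the underlying buffer
    del tmp                         # freed, but possibly kept by the allocator

print(addresses)  # a single address strongly suggests the buffer was recycled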
Here is a practical explanation:
Case 1) After the first call of the function, the two arrays are stored in the cache, assuming they are small enough to fit in it (which is the case here on my PC). Once the function has executed, the arrays are automatically freed, but the default allocator does not release the memory back to the OS. Instead, the memory space is recycled for the next function call, which can be much faster: no page faults and no need to fetch data from the DRAM! This can repeat many times, as long as the state of the allocated memory does not suddenly change in a way that leaves no available space for the requested arrays. If this case happens, we should see almost no DRAM reads/writes nor any page-fault-related system calls (hard to track on Windows).
Case 2) At the end of the first function call, the memory is released back to the OS, and the next function call then has to pay the huge overhead of page faults and also of DRAM fetches. In fact, data should AFAIK even be written back to the DRAM and fetched again (since the chance of the virtual/physical pages being the same is small, and also because of TLB flushes during the unmapping). If this case happens, we should see significant page-fault-related system calls and certainly also significantly more DRAM reads/writes.
Case 3) The libc always recycles memory, but the addresses of the arrays are not always the same. In this case, page faults do not happen, but the amount of memory space read/written can be much bigger than what is strictly required. This memory space can be so large that it does not fit in the cache. In that case, cache thrashing happens, causing expensive DRAM reads/writes. If this case happens, we should not see any page-fault-related function calls or associated overhead, but still significant DRAM reads/writes.
Based on this, we should be able to make the problem more likely to happen (not guaranteed) by allocating many arrays of variable size so as to change the state of the allocated memory space.
On my (Windows 10) machine, here is what I get when I succeed in reproducing the effect:
In [2]: %timeit triangle_half_UR_LL(1000)
135 μs ± 42 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit triangle_half_UR_LL(1000)
138 μs ± 37.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [4]: %timeit triangle_half_UR_LL(1000)
117 μs ± 873 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [5]: %timeit triangle_half_UR_LL(1000)
140 μs ± 11.7 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [6]: a = np.full(10, 0, dtype=np.uint16)
In [7]: b = np.full(100, 0, dtype=np.uint16)
In [8]: c = np.full(1000, 0, dtype=np.uint16)
In [9]: d = np.full(10000, 0, dtype=np.uint16)
In [10]: e = np.full(100000, 0, dtype=np.uint16)
In [11]: f = np.full(1000000, 0, dtype=np.uint16)
In [12]: g = np.full(10000000, 0, dtype=np.uint16)
In [13]: h = np.full(100000000, 0, dtype=np.uint16)
In [14]: %timeit triangle_half_UR_LL(1000)
932 μs ± 4.04 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [15]: %timeit triangle_half_UR_LL(1000)
935 μs ± 1.23 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
We can see that creating temporary arrays does impact the performance of the function, as expected. Compiling a Numba function also allocates some memory (quite a lot AFAIK), so it can produce a similar impact.
When the function is fast, the DRAM throughput is 1 GiB/s (~30% of which are writes), while it is 6 GiB/s (50~60% writes) when the function is significantly slower. Note that I have some processes running in the background taking about ~0.1 GiB/s on average (pretty variable).
When the function is fast, I can also see that 20~35% of the time is spent in the kernel, which is not great (but, strangely, there is nearly no kernel time in the master thread). For such code this is typically due to page faults, but possibly also to I/O operations due to the cache=True option (unlikely, since that should only happen in the master thread), or to context switches (I see some occurring, though it does not seem to be a major issue), especially due to the parallel execution, or to the waiting time of worker threads (certainly due to some load imbalance). The latter seems to be a significant issue, but this looks normal to me for rather memory-bound codes like yours.
When the function is slow, the fraction of time spent in the kernel is slightly bigger in the workers (20~40%) and much bigger in the master thread (30~85%).
Unfortunately, I cannot profile the Windows kernel (as opposed to Linux) so as to track the page-fault-related function calls... The call stack reported by the low-level profiler I used (VTune) does not seem to provide that information either. I can safely say that this is apparently not related to I/O operations, since there are very few I/Os during the computation.
Experimentally, the clearest difference is the kernel time spent in the main thread (shown in red on the scheduling plot).
Thus, I think the kernel time spent in the main thread is likely what increases the execution time significantly. This points to the portion of the code outside the prange loop, unless page faults can happen inside it.
The culprits are thus these lines:
x_coords = np.full(total, 0, dtype=np.uint16)
y_coords = np.full(total, 0, dtype=np.uint16)
side = np.arange(size, dtype=np.uint16)
This confirms the theory explained before.
On Linux, you can easily use another, more conservative allocator (e.g. TCMalloc) by tweaking the LD_PRELOAD environment variable. On Windows, this is apparently more complicated (this might work with TCMalloc)...
An alternative solution is simply to perform the page faults in parallel as much as possible so as to reduce their overhead: create the Numpy arrays with np.empty or np.zeros so that the first writes, and therefore the page faults, happen inside the parallel loop. That being said, page faults tend to scale poorly on Windows (as opposed to Linux, where scaling is often significantly better).
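Here is a minimal sketch of that idea (an assumed variant I did not benchmark, not the asker's original code): allocate with np.empty so no page is touched in the master thread, and perform the first writes, and thus the page faults, inside the prange loop:
import numpy as np
import numba as nb


@nb.njit(cache=True, parallel=True)
def triangle_half_UR_LL_prefault(size: int, swap: bool = False):
    total = (size + 1) * size // 2
    x_coords = np.empty(total, dtype=np.uint16)  # no eager fill in the master thread
    y_coords = np.empty(total, dtype=np.uint16)
    for i in nb.prange(size):
        offset = i * size - (i - 1) * i // 2
        for j in range(size - i):
            # First touch of these pages happens here, spread across the workers.
            x_coords[offset + j] = i
            y_coords[offset + j] = i + j
    return (x_coords, y_coords) if not swap else (y_coords, x_coords)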
The best solution is simply to preallocate and pre-fault the arrays at initialisation time and to avoid temporary arrays as much as possible in the critical portions of your program (even in benchmarks, for the sake of reproducibility), especially large temporary arrays. In HPC applications, preallocation is very common (also to avoid out-of-memory issues after >10 h of computation on >100 machines). In native programs that care about performance, like games, HPC applications (and even browsers), custom allocators are often used to avoid such issues (and the overhead of allocations). Unfortunately, in Python, and especially with Numpy, it is very common (and even often promoted) to create a lot of new objects and temporary arrays, although this is pretty inefficient (and makes benchmarking more complicated), and AFAIK we cannot easily write custom allocators for Numpy/CPython. It is sometimes difficult to preallocate/pre-fault arrays because we do not always know their size before some computations, and it also often makes the Python code more complex. The same is true of in-place Numpy operations (which can be used to avoid temporary arrays).
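As a hedged sketch of the preallocation approach (again my own illustrative variant, not code from the question): allocate and pre-fault the output buffers once at start-up, then have the JIT-compiled function only write into them:
import numpy as np
import numba as nb


@nb.njit(cache=True, parallel=True)
def fill_triangle_half_UR_LL(size, x_coords, y_coords):
    # Writes into caller-provided buffers; no allocation in the hot path.
    for i in nb.prange(size):
        offset = i * size - (i - 1) * i // 2
        for j in range(size - i):
            x_coords[offset + j] = i
            y_coords[offset + j] = i + j


size = 1000
total = (size + 1) * size // 2
# np.full writes every element, so the pages are physically mapped ("pre-faulted") here, once.
x_buf = np.full(total, 0, dtype=np.uint16)
y_buf = np.full(total, 0, dtype=np.uint16)

fill_triangle_half_UR_LL(size, x_buf, y_buf)  # later calls reuse the same buffers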