python, numpy, performance, caching

Surprising lack of speedup in caching numpy calculations


I need to do a lot of calculations on numpy arrays, with some of the calculations being repeated. I had the idea of caching the results, but I observe that:

  1. In most cases, the cached version is slower than just carrying out all calculations.
  2. Not only is the cached version slower, but line profiling also indicates that the absolute time spent on numpy operations increases, even though there are fewer of them.

I could accept the first observation as some combined magic of numpy and the Python interpreter, but the second one makes no sense to me. I also see similar behavior when operating on scipy sparse matrices.

The full application is complex, but the behavior can be reproduced by the following:

import numpy as np
from time import time

def numpy_comparison(do_cache: bool, array_size: int, num_arrays: int, num_iter: int):
    # Create random arrays
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):  
        arrays[i] = np.random.rand(array_size)

    if do_cache:  # Set up the cache if needed - I cannot use lru_cache or similar in practice
        cache: dict[tuple[int, int], np.ndarray] = {}

    for _ in range(num_iter):  # Loop over random pairs of array, add, store if relevant
        i, j = np.random.randint(num_arrays, size=2)

        if do_cache and (i, j) in cache:
            a = cache[(i, j)]  # a is not used further here, but would be in the real case
        else:
            a = arrays[i] + arrays[j]
            if do_cache:
                cache[(i, j)] = a

Now running (with no multithreading)

%timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
%timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)

gives the following results:

num_iter    No caching    With caching
100         10.3 ms       13.7 ms
1000        28.8 ms       62.7 ms
10000       225 ms        392 ms
100000      2.12 s        1.62 s

Varying the array size and the number of arrays gives similar behavior. When num_iter is sufficiently high, retrieving from the cache becomes the most efficient option, but the regime relevant for my application is around num_iter=1000, where the average chance of hitting a cached value is about 5%. Line profiling indicates the extra time is not spent working with the cache itself, but on the addition of the arrays, which becomes slower.
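
For reference, the ~5% figure can be sanity-checked with a quick simulation that just counts repeated (i, j) draws (this snippet is only an illustration, not part of the benchmark):

import numpy as np

rng = np.random.default_rng(0)
num_arrays, num_iter = 100, 1000
seen, hits = set(), 0
for _ in range(num_iter):
    i, j = rng.integers(num_arrays, size=2)   # same kind of random pair selection as above
    if (i, j) in seen:
        hits += 1
    seen.add((i, j))
print(f"simulated cache hit rate: {hits / num_iter:.1%}")   # around 5%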

Can anyone give a hint of what is going on here?


Solution

  • TL;DR: Page faults explain why the cache-based version is significantly slower than the one without a cache when num_iter is small. This is a side effect of creating many new Numpy arrays that are only deleted at the end. When num_iter is big, the cache becomes more effective (as explained in JonSG's answer). Using another system allocator such as TCMalloc can strongly reduce this overhead.


    When you create many new temporary arrays, Numpy requests memory from the system allocator, which in turn requests relatively large buffers from the operating system (OS). The first touch of these memory pages causes a page fault so the OS can actually set up the pages: each virtual page is mapped to a physical one and filled with zeros for security reasons (information must not leak from one process to another). When all arrays are deleted, Numpy frees the memory and the underlying allocator has a good chance of giving it back to the OS (so other processes can use it).

    Page faults are very expensive because the CPU needs to switch from user-land to kernel-land (with elevated privileges), and the kernel has to set up many data structures and call a lot of functions to handle each fault.
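
    To see this first-touch cost in isolation, here is a minimal sketch (Linux-only; it relies on the standard resource module, whose ru_minflt counter reports minor page faults): writing to a freshly allocated array faults roughly once per page, while writing to it a second time barely faults at all.

    import resource
    import numpy as np

    def minor_faults() -> int:
        # Minor page faults of this process so far (serviced without disk I/O)
        return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

    a = np.empty(10_000_000)   # ~80 MB buffer: pages are reserved but not touched yet

    before = minor_faults()
    a[:] = 1.0                 # first touch: the kernel maps and zero-fills each page
    print("first touch :", minor_faults() - before, "minor page faults")

    before = minor_faults()
    a[:] = 2.0                 # second touch: pages already mapped, almost no faults
    print("second touch:", minor_faults() - before, "minor page faults")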


    Profiling & Analysis

    To show that page faults are indeed the issue and how bad they are performance-wise, here is the low-level breakdown of where time is spent when the cache is enabled (using perf):

      13,75%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add_AVX2
       7,47%  [kernel]                                           [k] __irqentry_text_end
       6,94%  _mt19937.cpython-311-x86_64-linux-gnu.so           [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
       3,63%  [kernel]                                           [k] clear_page_erms
       3,22%  [kernel]                                           [k] error_entry
       2,98%  [kernel]                                           [k] native_irq_return_iret
       2,88%  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
       2,35%  [kernel]                                           [k] sync_regs
       2,28%  [kernel]                                           [k] __list_del_entry_valid_or_report
       2,27%  [kernel]                                           [k] unmap_page_range
       1,62%  [kernel]                                           [k] __handle_mm_fault
       1,45%  [kernel]                                           [k] __mod_memcg_lruvec_state
       1,43%  mtrand.cpython-311-x86_64-linux-gnu.so             [.] random_standard_uniform_fill
       1,10%  [kernel]                                           [k] do_anonymous_page
       1,10%  _mt19937.cpython-311-x86_64-linux-gnu.so           [.] mt19937_gen
       0,98%  [kernel]                                           [k] mas_walk
       0,93%  libpython3.11.so.1.0                               [.] PyObject_GenericGetAttr
       0,91%  [kernel]                                           [k] get_mem_cgroup_from_mm
       0,89%  [kernel]                                           [k] get_page_from_freelist
       0,79%  libpython3.11.so.1.0                               [.] _PyObject_Malloc
       0,77%  [kernel]                                           [k] lru_gen_add_folio
       0,72%  [nvidia]                                           [k] _nv039919rm
       0,65%  [kernel]                                           [k] lru_gen_del_folio.constprop.0
       0,63%  [kernel]                                           [k] blk_cgroup_congested
       0,62%  [kernel]                                           [k] handle_mm_fault
       0,59%  [kernel]                                           [k] __alloc_pages_noprof
       0,57%  [kernel]                                           [k] lru_add
       0,57%  [kernel]                                           [k] folio_batch_move_lru
       0,56%  [kernel]                                           [k] __rcu_read_lock
       0,52%  [kernel]                                           [k] do_user_addr_fault
    [...] (many other functions taking <0.52% each)
    

    As we can see, a lot of [kernel] functions are called and most of them are due to page faults. For example, __irqentry_text_end and native_irq_return_iret (taking ~10% of the time) are caused by the CPU interrupts triggered when the CPython process accesses pages for the first time. clear_page_erms is the function clearing a memory page on first touch. Several other functions are related to the virtual-to-physical memory mapping (e.g. AFAIK the ones related to the LRU cache). Note that DOUBLE_add_AVX2 is the internal native Numpy function actually summing the two arrays. In comparison, here is the breakdown with the cache disabled:

      20,85%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add_AVX2
      17,39%  _mt19937.cpython-311-x86_64-linux-gnu.so           [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
       5,69%  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
       3,35%  mtrand.cpython-311-x86_64-linux-gnu.so             [.] random_standard_uniform_fill
       2,46%  _mt19937.cpython-311-x86_64-linux-gnu.so           [.] mt19937_gen
       2,15%  libpython3.11.so.1.0                               [.] PyObject_GenericGetAttr
       1,76%  [kernel]                                           [k] __irqentry_text_end
       1,46%  libpython3.11.so.1.0                               [.] _PyObject_Malloc
       1,07%  libpython3.11.so.1.0                               [.] PyUnicode_FromFormatV
       1,03%  libc.so.6                                          [.] printf_positional
       0,93%  libpython3.11.so.1.0                               [.] _PyObject_Free
       0,88%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] NpyIter_AdvancedNew
       0,79%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] ufunc_generic_fastcall
       0,77%  [kernel]                                           [k] error_entry
       0,74%  _bounded_integers.cpython-311-x86_64-linux-gnu.so  [.] __pyx_f_5numpy_6random_17_bounded_integers__rand_int64
       0,72%  [nvidia]                                           [k] _nv039919rm
       0,69%  [kernel]                                           [k] native_irq_return_iret
       0,66%  [kernel]                                           [k] clear_page_erms
       0,55%  libpython3.11.so.1.0                               [.] PyType_IsSubtype
       0,55%  [kernel]                                           [k] sync_regs
       0,52%  mtrand.cpython-311-x86_64-linux-gnu.so             [.] __pyx_pw_5numpy_6random_6mtrand_11RandomState_31randint
       0,48%  [kernel]                                           [k] unmap_page_range
       0,46%  libpython3.11.so.1.0                               [.] PyDict_GetItemWithError
       0,43%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] PyArray_NewFromDescr_int
       0,40%  libpython3.11.so.1.0                               [.] _PyFunction_Vectorcall
       0,38%  [kernel]                                           [k] __mod_memcg_lruvec_state
       0,38%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] promote_and_get_ufuncimpl
       0,37%  libpython3.11.so.1.0                               [.] PyDict_SetItem
       0,36%  libpython3.11.so.1.0                               [.] PyObject_RichCompareBool
       0,36%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] PyUFunc_GenericReduction
    [...] (many other functions taking <0.35% each)
    

    There are clearly fewer kernel functions among the top 30 most expensive functions. We can still see the interrupt-related functions, but note that this low-level profiling is global to my whole machine, so these likely come from other processes (e.g. perf itself, the graphical desktop environment, and Firefox running as I write this answer).

    There is another observation pointing to page faults as the main culprit: during my tests, the system allocator suddenly changed its behavior after I allocated many arrays before re-running the same commands, and this resulted in only a few kernel calls (in perf) as well as very close timings between the two analyzed variants (with/without cache):

    # First execution:
    In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
       ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    20.2 ms ± 63.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    46.4 ms ± 81.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # Second execution:
    In [4]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
       ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    19.8 ms ± 43.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    46.3 ms ± 155 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # After creating many Numpy arrays (not deleted since) with no change of `numpy_comparison`:
    In [95]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
        ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    18.4 ms ± 15.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    19.9 ms ± 26.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-----
    

    We can clearly see that the overhead of using a cache is now pretty small. This also means performance could be very different in a real-world application (because the allocator would be in a different state), or even on a different platform.
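
    A quick way to check this on a given machine is to count the minor page faults around each variant of numpy_comparison (same ru_minflt counter as in the earlier sketch); if the cached run shows far more faults than the uncached one, freed buffers are being handed back to the OS between allocations.

    import resource

    def minor_faults_during(fn, **kwargs) -> int:
        # Difference in this process' minor page fault count across one call
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        fn(**kwargs)
        return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before

    common = dict(array_size=10000, num_arrays=100, num_iter=1000)
    print("no cache  :", minor_faults_during(numpy_comparison, do_cache=False, **common))
    print("with cache:", minor_faults_during(numpy_comparison, do_cache=True, **common))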


    Solution to mitigate page faults

    You can use an alternative system allocator like TCMalloc, which requests a big chunk of memory from the OS at startup so page faults are paid far less often. Here are the results with it (launched with LD_PRELOAD=libtcmalloc_minimal.so.4 ipython):

    In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
       ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    16.9 ms ± 51.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    19.5 ms ± 29.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    The overhead related to the cache, which was actually due to the creation of many temporary arrays and more specifically to page faults, is now more than 10 times smaller!
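
    If preloading another allocator is not an option, a glibc-specific alternative (sketched here, with thresholds that are assumptions to tune) is to adjust the default allocator at runtime via mallopt, called through ctypes, so that freed memory stays inside the process instead of being trimmed back to the OS:

    import ctypes

    libc = ctypes.CDLL("libc.so.6")

    M_TRIM_THRESHOLD = -1   # glibc malloc.h constants
    M_MMAP_THRESHOLD = -3

    # Serve large allocations from the heap and keep freed memory in the process
    # instead of trimming it back to the OS (1 GiB is an arbitrary large value).
    libc.mallopt(M_MMAP_THRESHOLD, 1 << 30)
    libc.mallopt(M_TRIM_THRESHOLD, 1 << 30)

    How much this helps compared to TCMalloc depends on the allocation pattern, so it is worth re-measuring page faults and timings after the change.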

    I think the remaining overhead is due to the larger memory working set, as pointed out by NickODell in the comments (this causes more CPU cache misses due to cache thrashing and more data to be loaded from the slow DRAM). Put shortly, the cached version is simply less cache-friendly.

    Here are the results with num_iter=100_000, with and without TCMalloc:

    # Default system allocator (glibc)
    In [97]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
        ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    1.31 s ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    859 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # TCMalloc allocator
    In [3]: %timeit -n 1 numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
       ...: %timeit -n 1 numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
    1.28 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    774 ms ± 83.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    This behavior is expected: the cache starts paying off once there are enough hits. Note that TCMalloc also makes the overall execution faster in most cases!