Tags: python, numba, montecarlo

Why is a Monte Carlo simulation of pi with Numba slowest for the lowest number of points?


As part of a homework exercise, I am implementing a Monte Carlo simulation of pi in Python, using Numba to accelerate and parallelize the computation. In a previous test, I found that the parallelized Numba version runs faster than both a non-parallelized Numba version and a pure NumPy version, so that is what I chose to go with. The exercise asks to measure the performance of the algorithm for n = 10**k with k = 1, 2, ..., 7, and when I implemented this I got a very strange result: The first run with the lowest value of n = 10 runs the slowest of them all. It is consistently about 3 times slower than the run for n = 10,000,000:

At first, I thought the first run was slow because Numba needs to compile my function, but I am using cache, so this explanation only holds for the very first time I run the script. You can see in the picture above that the very first run of the script was even slower for n = 10 because the function had to be compiled. However, subsequent runs use the cached version, and n = 10 is still significantly slower than the runs with bigger n.

Another weird result is that sometimes the run for n = 1,000 is faster than the run for n = 100. I thought this algorithm was supposed to be linear in time. I don't understand why the first run, with the lowest value of n, turns out to be the slowest, and why some runs are faster than previous runs with a smaller value of n. Can somebody explain these results?

Here's the code I used:

from datetime import timedelta
from time import perf_counter

from numba import jit, prange
import numpy as np
import numpy.typing as npt

jit_opts = dict(
    nopython=True, nogil=True, cache=True, error_model="numpy", fastmath=True
)

rng = np.random.default_rng()


@jit(**jit_opts, parallel=True)
def count_points_in_circle(points: npt.NDArray[float]) -> tuple[npt.NDArray[bool], int]:
    in_circle = np.empty(points.shape[0], dtype=np.bool_)
    in_circle_count = 0
    for i in prange(points.shape[0]):
        in_ = in_circle[i] = points[i, 0] ** 2 + points[i, 1] ** 2 < 1
        in_circle_count += in_
    return in_circle, in_circle_count


def monte_carlo_pi(n: int) -> tuple[npt.NDArray[float], npt.NDArray[bool], float]:
    points = rng.random((n, 2))
    in_circle, count = count_points_in_circle(points)
    return points, in_circle, 4 * count / n


def main() -> None:
    n_values = 10 ** np.arange(1, 8)
    for n in n_values:
        start = perf_counter()
        points, in_circle, pi_approx = monte_carlo_pi(n)
        end = perf_counter()
        duration = end - start
        delta = timedelta(seconds=duration)
        elapsed_msg = (
            f"[{delta} (Raw time: {duration} s)]"
            if delta
            else f"[Raw time: {duration} s]"
        )
        print(
            f"n = {n:,}:".ljust(20),
            f"\N{GREEK SMALL LETTER PI} \N{ALMOST EQUAL TO} {pi_approx}".ljust(20),
            elapsed_msg,
        )


if __name__ == "__main__":
    main()

Edit:

Following @Jérôme Richard's answer, I added explicit signatures to avoid lazy compilation and tested the execution time both with and without parallelization:

The first 3 runs are with parallel=True, deleting the cache before the first run. The last 3 runs use parallel=False and range instead of prange (I believe setting parallel=False would have been enough, but I wanted to make sure). Once again, I deleted the cache before running these 3 tests. The modified code:

import numba as nb


@nb.jit(
    [
        nb.types.Tuple((nb.bool_[:], nb.int64))(nb.float64[:, :]),
        nb.types.Tuple((nb.bool_[:], nb.int32))(nb.float32[:, :]),
    ],
    **jit_opts,
    parallel=False,  # or parallel=True
)
def count_points_in_circle(points: npt.NDArray[float]) -> tuple[npt.NDArray[bool], int]:
    in_circle = np.empty(points.shape[0], dtype=np.bool_)
    in_circle_count = 0
    for i in range(points.shape[0]):  # or nb.prange
        in_ = in_circle[i] = points[i, 0] ** 2 + points[i, 1] ** 2 < 1
        in_circle_count += in_
    return in_circle, in_circle_count

Indeed, using parallelization for n <= 10,000 turns out to be slower on my machine. However, the first run with n = 10 is still consistently slower than every other run with n <= 10,000, which is weird. What other "startup overhead" is there, and how can it be eliminated?

Another follow-up question: how can I enable parallelization dynamically based on the input? From these results, it looks like the best approach is to use parallelization only for n > 10,000 (or some other user-specified threshold). I don't want to maintain two copies of the function, one with parallel=True and prange and the other with parallel=False and range. How can I enable parallelization dynamically without duplicating the function?
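
One idea I am considering (I am not sure it is the intended pattern) is to keep a single undecorated implementation and compile it twice from the same Python function, once with parallel=False and once with parallel=True, then pick one at call time based on a threshold. This relies on the fact that prange behaves like plain range when parallel=False, so the body only has to be written once. A rough sketch, where the helper names and the threshold are placeholders and jit_opts is the dict defined above:

import numba as nb
import numpy as np
import numpy.typing as npt
from numba import prange


def _count_points_in_circle_impl(points):
    in_circle = np.empty(points.shape[0], dtype=np.bool_)
    in_circle_count = 0
    for i in prange(points.shape[0]):  # behaves like range when parallel=False
        in_ = in_circle[i] = points[i, 0] ** 2 + points[i, 1] ** 2 < 1
        in_circle_count += in_
    return in_circle, in_circle_count


# Compile the same Python function twice (jit_opts is the dict defined earlier).
_count_serial = nb.jit(**jit_opts, parallel=False)(_count_points_in_circle_impl)
_count_parallel = nb.jit(**jit_opts, parallel=True)(_count_points_in_circle_impl)

PARALLEL_THRESHOLD = 10_000  # guessed cut-off, to be tuned per machine


def count_points_in_circle(points: npt.NDArray[float]) -> tuple[npt.NDArray[bool], int]:
    # Use the parallel version only when there is enough work to amortize
    # the thread start-up overhead.
    if points.shape[0] > PARALLEL_THRESHOLD:
        return _count_parallel(points)
    return _count_serial(points)

I do not know whether this is idiomatic, but it avoids duplicating the loop body.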


Solution

  • The first run with the lowest value of n = 10 runs the slowest of them all.

    This is due to lazy compilation. Indeed, I can reproduce the behaviour, and I can fix it by compiling the function eagerly. You can do that by manually specifying the signature of the function:

    @jit('(float64[:,:],)', **jit_opts, parallel=True)
    

    I thought the first run was slow because Numba needs to compile my function, but I am using cache, so this explanation only holds for the very first time I run the script.

    In practice, the cache is fragile. For example, if the environment changes, if a global variable used in the function changes, or if you do not run the script from a file, then the function will be recompiled. Besides, loading the compiled function from the storage device is not free, though it is generally much faster than 0.3 seconds.
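
    If you cannot rely on the cache, a complementary trick (a minimal sketch reusing rng and count_points_in_circle from the question) is to make a tiny warm-up call before the timed loop, so that compilation, or loading the cached binary from disk, happens outside the measured region:

    def warm_up() -> None:
        # Dummy call with the same dtype and dimensionality as the real input;
        # it triggers compilation (or loading from the cache) ahead of time.
        dummy = rng.random((2, 2))
        count_points_in_circle(dummy)

    warm_up()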

    Another weird result is that sometimes the run for n = 1,000 is faster than the run for n = 100. I thought this algorithm was supposed to be linear in time.

    They both run in only dozens of microseconds, which is a very small amount of time. At this scale, timings tend not to be very stable. One issue is that the code uses multiple threads, and spawning threads, sharing the work, and waiting for the results is slow: it often takes at least dozens of microseconds on most PCs. In fact, it can take hundreds of microseconds on large-scale servers (the higher the number of threads to spawn, the higher the overhead). Other sources of variability are the scheduler, the availability of the cores, C-states (cores can sleep to consume less energy, but they need more time to wake up), and P-states (i.e. frequency scaling).
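
    On machines with many cores, one knob that can reduce this overhead (not shown in the question) is to cap the number of worker threads Numba spawns with numba.set_num_threads; the value below is only an illustration and should be tuned:

    import numba as nb

    nb.set_num_threads(4)        # illustrative value; must not exceed nb.config.NUMBA_NUM_THREADS
    print(nb.get_num_threads())  # check how many threads will actually be used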

    On my machine, not using threads is faster than using threads for n = 100 and n = 1,000, so most of the remaining overhead seems to come from the array allocation. Moreover, the variability between two runs is large compared to the difference between them. Computing the standard deviation (or even the full time distribution) is very important in benchmarks to measure their accuracy: it enables you to tell whether one case is statistically faster than another by computing a p-value. Timings alone are not enough. A higher execution time could simply be due to noise, such as a CPU-intensive background program waking up for a short time (bad luck), cold-cache effects, etc.
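
    For example, a quick way to get a mean and a standard deviation rather than a single number is timeit.repeat from the standard library (a sketch reusing monte_carlo_pi from the question):

    from statistics import mean, stdev
    from timeit import repeat

    def bench(n: int, repeats: int = 20) -> None:
        # Run the same case several times and report mean +/- standard deviation.
        times = repeat(lambda: monte_carlo_pi(n), number=1, repeat=repeats)
        print(f"n = {n:>12,}: {mean(times):.6f} s +/- {stdev(times):.6f} s")

    for k in range(1, 8):
        bench(10 ** k)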