I have written four functions that modify a square 2D array in place. Each one reflects half of the array, the triangle delimited by two adjacent sides and the 45-degree diagonal joining them, onto the other half across that same diagonal. There is one function for each of the four possible cases, i.e. one per corner triangle: lower-left, upper-right, lower-right, and upper-left.
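For a concrete picture, this is what the lower-left-to-upper-right reflection does to a small array (a pure-NumPy sketch of the operation, not the Numba code below):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# reflect the lower-left triangle across the main diagonal
# onto the upper-right triangle, in place
x, y = np.tril_indices(3)
a[y, x] = a[x, y]
print(a)
# [[1 4 7]
#  [4 5 8]
#  [7 8 9]]
```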
They are compiled just-in-time with Numba and parallelized with numba.prange, and are therefore much faster than the equivalent NumPy approach:
In [2]: sqr = np.random.randint(0, 256, (1000, 1000), dtype=np.uint8)
In [3]: %timeit x, y = np.tril_indices(1000); sqr[x, y] = sqr[y, x]
9.16 ms ± 30.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the pure-NumPy approach above is comparatively slow.
import numpy as np
import numba as nb


@nb.njit(cache=True, parallel=True, nogil=True)
def triangle_flip_LL2UR(arr: np.ndarray) -> None:
    height, width = arr.shape[:2]
    if height != width:
        raise ValueError("argument arr must be a square")

    for i in nb.prange(height):
        arr[i, i:] = arr[i:, i]


@nb.njit(cache=True, parallel=True, nogil=True)
def triangle_flip_UR2LL(arr: np.ndarray) -> None:
    height, width = arr.shape[:2]
    if height != width:
        raise ValueError("argument arr must be a square")

    for i in nb.prange(height):
        arr[i:, i] = arr[i, i:]


@nb.njit(cache=True, parallel=True, nogil=True)
def triangle_flip_LR2UL(arr: np.ndarray) -> None:
    height, width = arr.shape[:2]
    if height != width:
        raise ValueError("argument arr must be a square")

    last = height - 1
    for i in nb.prange(height):
        arr[i, last - i :: -1] = arr[i:, last - i]


@nb.njit(cache=True, parallel=True, nogil=True)
def triangle_flip_UL2LR(arr: np.ndarray) -> None:
    height, width = arr.shape[:2]
    if height != width:
        raise ValueError("argument arr must be a square")

    last = height - 1
    for i in nb.prange(height):
        arr[i:, last - i] = arr[i, last - i :: -1]
In [4]: triangle_flip_LL2UR(sqr)
In [5]: triangle_flip_UR2LL(sqr)
In [6]: triangle_flip_LR2UL(sqr)
In [7]: triangle_flip_UL2LR(sqr)
In [8]: %timeit triangle_flip_LL2UR(sqr)
194 μs ± 634 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [9]: %timeit triangle_flip_UR2LL(sqr)
488 μs ± 3.26 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: %timeit triangle_flip_LR2UL(sqr)
196 μs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: %timeit triangle_flip_UL2LR(sqr)
486 μs ± 855 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Why do their execution times differ so much? Two of them take around 200 microseconds to execute, the other two around 500 microseconds, despite the fact that the four functions are almost identical.
I have discovered something: triangle_flip_UR2LL(arr) is equivalent to triangle_flip_LL2UR(arr.T), and vice versa.
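The equivalence can be checked with plain-NumPy reference versions of the two reflections (the helper names `ll2ur` and `ur2ll` here are hypothetical, introduced just for this check):

```python
import numpy as np

def ll2ur(a):
    # reference version of triangle_flip_LL2UR: copy the
    # lower-left triangle onto the upper-right one, in place
    x, y = np.tril_indices(a.shape[0])
    a[y, x] = a[x, y]

def ur2ll(a):
    # reference version of triangle_flip_UR2LL: the opposite reflection
    x, y = np.tril_indices(a.shape[0])
    a[x, y] = a[y, x]

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (8, 8), dtype=np.uint8)
b = a.copy()

ur2ll(a)    # reflect upper-right onto lower-left ...
ll2ur(b.T)  # ... same reflection, expressed on the transposed view
assert np.array_equal(a, b)
```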
Now, if I transpose the array before calling the functions, the performance trend is reversed:
In [109]: %timeit triangle_flip_UR2LL(sqr.T)
196 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [110]: %timeit triangle_flip_LL2UR(sqr.T)
490 μs ± 1.24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Why is this happening?
This is a mixture of false sharing and a memory-bandwidth bottleneck. If you remove the parallelization by converting nb.prange to range, you get roughly equal times for all four functions.
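As a check, here is a serial variant of the first function with prange replaced by range (a sketch with an optional fallback so it also runs where Numba is not installed):

```python
import numpy as np

try:
    import numba as nb
    njit = nb.njit          # JIT-compile when Numba is available
except ImportError:         # otherwise run as plain Python
    njit = lambda f: f

@njit
def triangle_flip_LL2UR_serial(arr):
    height, width = arr.shape[:2]
    if height != width:
        raise ValueError("argument arr must be a square")
    for i in range(height):  # plain range: one thread, no false sharing
        arr[i, i:] = arr[i:, i]
```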
The first function writes a single row at a time, which is contiguous in memory; the second writes a column at a time, and that column is not contiguous.
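The difference is easy to see from NumPy's strides: in a C-ordered array, a row slice steps one element at a time, while a column slice steps a whole row at a time (a small illustrative check):

```python
import numpy as np

sqr = np.zeros((1000, 1000), dtype=np.uint8)  # C-ordered, 1 byte per element

row = sqr[0, :]  # one row: consecutive bytes in memory
col = sqr[:, 0]  # one column: elements a full 1000-byte row apart

print(row.strides, row.flags['C_CONTIGUOUS'])  # (1,) True
print(col.strides, col.flags['C_CONTIGUOUS'])  # (1000,) False
```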
The CPU doesn't work with individual bytes; it works with cache lines. A cache line is a contiguous block of memory, 64 bytes on most systems. When a thread writes to a cache line, the entire line is marked dirty, and other threads must refresh their copy of that dirty line before they read or write any of the other values in it. This is essentially false sharing.
In the column-wise version, each thread writes to a different cache line for every element of its column, marking many cache lines dirty at a time and thus stepping on the lines that other threads are trying to read.
Lastly, I need to point out that parallelizing this code risks race conditions in all versions: you should either use separate arrays for input and output, or have each thread work on a square block at a time instead of an entire row or column. Most optimized matrix-transpose implementations avoid this false sharing by doing exactly that.
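A block-wise variant of the first reflection might look like this (a pure-NumPy sketch of the tiling idea, serial here; in a parallel version the outer loop over tile rows would become the prange loop):

```python
import numpy as np

def triangle_flip_LL2UR_blocked(arr, bs=64):
    # process bs x bs tiles: each upper-triangle tile is filled from its
    # transposed mirror tile in the lower triangle, so each worker's reads
    # and writes stay in a compact region instead of sweeping whole rows
    n = arr.shape[0]
    for bi in range(0, n, bs):           # tile row (prange candidate)
        for bj in range(bi, n, bs):      # tile column, upper triangle only
            i1, j1 = min(bi + bs, n), min(bj + bs, n)
            if bi == bj:
                # diagonal tile: reflect it in place, row by row
                for i in range(bi, i1):
                    arr[i, i:j1] = arr[i:i1, i]
            else:
                # off-diagonal tile: copy the mirrored lower tile
                arr[bi:i1, bj:j1] = arr[bj:j1, bi:i1].T
```

Each tile touches only a small set of cache lines owned by one worker at a time, which is the usual way optimized transpose kernels sidestep false sharing.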