I need to generate an N-dimensional index array in C order, and I've tried a few different NumPy approaches.
The fastest for larger arrays, but the least readable:
np.stack(np.meshgrid(*[np.arange(i, dtype=dtype) for i in sizes], indexing="ij"), axis=-1).reshape(-1, len(sizes))
More readable with good performance:
np.ascontiguousarray(np.indices(sizes, dtype=dtype).reshape(len(sizes), -1).T)
Here I'm not sure if the ascontiguousarray copy is actually necessary, or if there's a better way to make sure the result is in C-order without forcing a copy.
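For reference, one quick way to check whether that copy actually happens is np.shares_memory (the sizes and dtype values below are placeholders for the check only):
import numpy as np

sizes, dtype = (3, 4, 5), np.int32  # placeholder values for this check only
view = np.indices(sizes, dtype=dtype).reshape(len(sizes), -1).T
print(view.flags['C_CONTIGUOUS'])   # False: the transposed view is F-contiguous
out = np.ascontiguousarray(view)
print(np.shares_memory(view, out))  # False here, so a copy was made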
Most readable, but by far the slowest:
np.vstack([*np.ndindex(sizes)], dtype=dtype)
The iterator conversion is quite slow for larger arrays.
Is there a more straightforward and readable built-in way to achieve this in NumPy that matches the performance of the np.meshgrid or np.indices approaches? If not, can either of those approaches be optimized to avoid unnecessary memory copies (like the one forced by ascontiguousarray) while still guaranteeing that the result is C-contiguous?
Example:
sizes = (3, 1, 2)
idx = np.ascontiguousarray(np.indices(sizes).reshape(len(sizes), -1).T)
print(idx)
print(f"C_CONTIGUOUS: {idx.flags['C_CONTIGUOUS']}")
# [[0 0 0]
# [0 0 1]
# [1 0 0]
# [1 0 1]
# [2 0 0]
# [2 0 1]]
# C_CONTIGUOUS: True
Here is a (quite-naive) solution in Numba using multiple threads:
import numba as nb
import numpy as np

@nb.njit(
    [
        # Eagerly compiled for common types
        # Please add your type if it is missing
        '(int32[:,:], int32[:])',
        '(int64[:,:], int32[:])',
        '(float32[:,:], int32[:])',
        '(float64[:,:], int32[:])',
    ],
    parallel=True,
    cache=True
)
def nb_kernel(res, sizes):
    n = np.prod(sizes)
    m = sizes.size
    chunk_size = 1024
    assert n > 0 and m > 0
    for i in range(m):
        assert sizes[i] > 0
    # Compute blocks of chunk_size rows.
    # Multiple threads compute separate blocks.
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # Compute the first row of the block
        jump = 1
        for j in range(m-1, -1, -1):
            res[start, j] = (start // jump) % sizes[j]
            jump *= sizes[j]
        # The next rows of the block incrementally
        for i in range(start+1, end):
            inc = 1
            for j in range(m-1, -1, -1):
                val = res[i-1, j] + inc
                if val >= sizes[j]:
                    val = 0
                    inc = 1
                else:
                    inc = 0
                res[i, j] = val

def nb_compute(sizes, dtype):
    res = np.empty((np.prod(sizes), len(sizes)), dtype=dtype)
    nb_kernel(res, np.array(sizes, dtype=np.int32))
    return res
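A minimal usage sketch, reusing the example sizes from the question (the np.int32 dtype is just a choice for illustration):
sizes = (3, 1, 2)
idx = nb_compute(sizes, np.int32)
print(idx)                                           # same rows as the np.indices example
print(f"C_CONTIGUOUS: {idx.flags['C_CONTIGUOUS']}")  # True: np.empty allocates in C order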
On my machine (i5-9600KF CPU), on Windows, here are the results with sizes=(101,53,71) and dtype=np.int32:
np.vstack(...): 626.1 ms
np.ascontiguousarray(...): 3.5 ms
np.stack(...): 2.6 ms
nb_compute(...): 1.1 ms <----
nb_kernel(...): 0.5 ms <----
With a fixed length of `sizes` known at compile time:
nb_compute(...): 0.8 ms <----
nb_kernel(...): 0.2 ms <----
We can see that calling nb_kernel directly on a preallocated array is significantly faster. Indeed, filling an array for the first time causes a lot of memory page faults, which are inherently slow. Doing that in parallel helps (but it does not scale well on Windows).
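For example, if the same-shaped index array is regenerated many times, the allocation (and its page faults) can be paid only once; the outer loop below is hypothetical and only illustrates the pattern with the nb_kernel defined above:
sizes = np.array((101, 53, 71), dtype=np.int32)
res = np.empty((int(np.prod(sizes)), sizes.size), dtype=np.int32)  # allocated once

for _ in range(10):        # hypothetical outer loop
    nb_kernel(res, sizes)  # refills `res` in place, no fresh pages to fault in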
If you already do this in each thread of a parallel code, nb_kernel will not be significantly faster, because most of Numba's speed-up here comes from using multiple threads. In that case, we need to optimise the Numba kernel itself. One major optimisation is to specialise the function for a specific length of sizes (then known at compile time). Indeed, the code is more than twice as fast if we replace m by 3 (so it only supports len(sizes) == 3). I expect most use cases to have a very small len(sizes), so you can specialise the function for the cases 2, 3, 4 and 5 and write a pure-Python function that calls the right specialisation (see the sketch below). This optimisation also makes the parallel code faster.
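Here is a rough sketch of that idea for the 3D case only; nb_kernel_3d and nb_compute_fixed are names of my own (not part of the code above), and the kernel is the same odometer logic as nb_kernel with m hard-coded to 3:
@nb.njit(parallel=True, cache=True)
def nb_kernel_3d(res, sizes):
    # Same odometer logic as nb_kernel, but with m hard-coded to 3
    # so the compiler can unroll the per-row inner loop.
    n = sizes[0] * sizes[1] * sizes[2]
    chunk_size = 1024
    for block in nb.prange((n + chunk_size - 1) // chunk_size):
        start = block * chunk_size
        end = min(start + chunk_size, n)
        # First row of the block, decoded from the flat index
        res[start, 2] = start % sizes[2]
        res[start, 1] = (start // sizes[2]) % sizes[1]
        res[start, 0] = (start // (sizes[2] * sizes[1])) % sizes[0]
        # Next rows incrementally
        for i in range(start + 1, end):
            i0, i1, i2 = res[i-1, 0], res[i-1, 1], res[i-1, 2] + 1
            if i2 >= sizes[2]:
                i2 = 0
                i1 += 1
                if i1 >= sizes[1]:
                    i1 = 0
                    i0 += 1
            res[i, 0], res[i, 1], res[i, 2] = i0, i1, i2

def nb_compute_fixed(sizes, dtype):
    # Pure-Python dispatcher calling the right specialisation
    res = np.empty((np.prod(sizes), len(sizes)), dtype=dtype)
    sizes_arr = np.array(sizes, dtype=np.int32)
    if len(sizes) == 3:
        nb_kernel_3d(res, sizes_arr)
    else:
        nb_kernel(res, sizes_arr)  # generic fallback
    return res
The 2D, 4D and 5D variants follow the same pattern.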
For better performance, you should avoid filling big arrays at all when possible, because DRAM is slow. This is especially true for temporary arrays (arrays that are filled once and then never reused), due to page faults. I think the above code is optimal for output arrays that do not fit in the last-level cache (LLC) of your CPU.
For output arrays that fit in the LLC, faster implementations than the one above are possible, for example in a native SIMD-friendly language (but they are fairly complex to implement).