I have a large matrix and I want to output all the indices where the elements in the matrix are less than 0. I have this MWE in numba:
import numba as nb
import numpy as np
A = np.random.random(size = (1000, 1000)) - 0.1
@nb.njit(cache=True)
def numba_only(arr):
    rows = np.empty(arr.shape[0]*arr.shape[1])
    cols = np.empty(arr.shape[0]*arr.shape[1])
    idx = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < 0:
                rows[idx] = i
                cols[idx] = j
                idx += 1
    return rows[:idx], cols[:idx]
Timing, I get:
%timeit numba_only(A)
2.29 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is faster than np.where(A < 0) (without numba), which gives:
%timeit numpy_only(A)
3.56 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
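(Here numpy_only is presumably just the NumPy call wrapped in a plain Python function, e.g.:)

def numpy_only(arr):
    # Pure NumPy baseline, no numba involved
    return np.where(arr < 0)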
But it is a little slower than np.where wrapped in @nb.njit:
@nb.njit(cache=True)
def numba_where(arr):
    return np.where(arr < 0)
which gives:
%timeit numba_where(A)
1.76 ms ± 84.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Can the numba code be sped up by parallelization somehow? I realise it might be memory bound but I think that modern hardware should allow some level of parallel access to memory.
I am not sure how to use nb.prange to achieve this due to the index idx in the loop.
The provided algorithm is inherently sequential due to its long, data-dependent chain of instructions. However, it is possible to perform an equivalent operation in parallel using a scan.
Implementing such a scan-based parallel algorithm is more complex (especially an efficient, cache-friendly one). A relatively simple approach is to iterate over the dataset twice: first compute per-chunk counts, then turn them into cumulative offsets, and finally do the actual work.
@nb.njit(cache=True, parallel=True)
def numba_parallel_where(arr):
    flattenArr = arr.reshape(-1)
    arrSize = flattenArr.size
    chunkSize = 2048
    chunkCount = (arrSize + chunkSize - 1) // chunkSize

    # First pass (parallel): count the matching items in each chunk
    counts = np.empty(chunkCount, dtype=np.int32)
    for chunkId in nb.prange(chunkCount):
        start = chunkId * chunkSize
        end = min(start + chunkSize, arrSize)
        count = 0
        for i in range(start, end):
            count += flattenArr[i] < 0
        counts[chunkId] = count

    # Sequential exclusive scan: starting offset of each chunk in the output
    offsets = np.empty(chunkCount + 1, dtype=np.int64)
    offsets[0] = 0
    for chunkId in range(chunkCount):
        offsets[chunkId + 1] = offsets[chunkId] + counts[chunkId]

    # Second pass (parallel): each chunk writes its indices at its own offset
    outSize = offsets[-1]
    rows = np.empty(outSize, dtype=np.int32)
    cols = np.empty(outSize, dtype=np.int32)
    n = np.int32(arr.shape[1])
    for chunkId in nb.prange(chunkCount):
        start = chunkId * chunkSize
        end = min(start + chunkSize, arrSize)
        idx = offsets[chunkId]
        for i in range(start, end):
            if flattenArr[i] < 0:
                rows[idx] = i // n
                cols[idx] = i % n
                idx += 1
    return rows, cols
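The chunks are processed in flattened (row-major) order and each one writes at its own offset, so the output should match np.where exactly. A quick sanity check (not part of the original post):

rows, cols = numba_parallel_where(A)
ref_rows, ref_cols = np.where(A < 0)
# Same indices, in the same (row-major) order
assert np.array_equal(rows, ref_rows) and np.array_equal(cols, ref_cols)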
Here are performance results on my 6-core i5-9600KF CPU with 40 GiB/s RAM (optimal bandwidth for reads), on Windows:
Sequential: 1.68 ms
Parallel (no pre-allocations): 0.85 ms
Parallel (with pre-allocations): 0.55 ms <-----
Theoretical optimal: 0.20 ms
The parallel implementation is about twice as fast. It reaches a memory throughput of ~19 GiB/s on my machine, which is sub-optimal but not that bad. While this may seem disappointing on a 6-core CPU, the parallel implementation is not heavily optimized and there are several points to consider:
Pre-allocating the output arrays (filled manually with zeros to avoid possible page faults) and passing them to the function significantly reduces the execution time. This optimized version is 3 times faster than the initial (sequential) one. It reaches ~28 GiB/s, which is relatively good.
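The pre-allocated variant is not shown above; here is a minimal sketch of what it could look like (the function name, the out_rows/out_cols buffers and the returned count are my assumptions, not the author's code):

# Hypothetical pre-allocated variant (sketch): the caller owns the output
# buffers, so the timed region performs no large allocations.
@nb.njit(cache=True, parallel=True)
def numba_parallel_where_prealloc(arr, out_rows, out_cols):
    flattenArr = arr.reshape(-1)
    arrSize = flattenArr.size
    chunkSize = 2048
    chunkCount = (arrSize + chunkSize - 1) // chunkSize

    # Per-chunk counts (parallel), then a sequential exclusive scan
    counts = np.empty(chunkCount, dtype=np.int32)
    for chunkId in nb.prange(chunkCount):
        start = chunkId * chunkSize
        end = min(start + chunkSize, arrSize)
        count = 0
        for i in range(start, end):
            count += flattenArr[i] < 0
        counts[chunkId] = count
    offsets = np.empty(chunkCount + 1, dtype=np.int64)
    offsets[0] = 0
    for chunkId in range(chunkCount):
        offsets[chunkId + 1] = offsets[chunkId] + counts[chunkId]

    # Write directly into the caller-provided buffers (parallel)
    n = np.int32(arr.shape[1])
    for chunkId in nb.prange(chunkCount):
        start = chunkId * chunkSize
        end = min(start + chunkSize, arrSize)
        idx = offsets[chunkId]
        for i in range(start, end):
            if flattenArr[i] < 0:
                out_rows[idx] = i // n
                out_cols[idx] = i % n
                idx += 1

    # Number of valid entries written to the output buffers
    return offsets[-1]

# Buffers allocated (and touched with zeros) once, outside the timed call
rows_buf = np.zeros(A.size, dtype=np.int32)
cols_buf = np.zeros(A.size, dtype=np.int32)
count = numba_parallel_where_prealloc(A, rows_buf, cols_buf)
rows, cols = rows_buf[:count], cols_buf[:count]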
Other possible optimizations include:
AFAIK, there has been an open bug about slow Numba allocations for a few years, but it has obviously not been solved yet. On top of that, relatively large allocations can be quite expensive on Windows too. Natively-compiled code (written in C/C++/Fortran/Rust) running on Linux should not exhibit this issue.
Note that such optimizations will make the implementation even more complex than the above one.