I'm trying to implement a cumulative sum in Numba CUDA as a parallel calculation. The function takes a 1-D array (e.g. `a = [1, 1, 1, 1]`) and computes it in place so that each element becomes the sum of itself and all preceding elements (`ans = [1, 2, 3, 4]`). Here is the function as I have implemented it:
```python
from numba import cuda
import math
import numpy as np

@cuda.jit
def c_sum(array):
    tidx = cuda.threadIdx.x
    nb_it = int(math.log2(cuda.blockDim.x))
    if tidx < array.size:
        for l in range(nb_it):
            if tidx > 2**l - 1:
                array[tidx] += array[tidx - (2**l)]
            cuda.syncthreads()

input_ar = np.ones([1024])
c_sum[1, 1024](input_ar)
```
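(For reference, the result I'm after is just what NumPy's cumulative sum gives on the host:)

```python
import numpy as np
print(np.cumsum(np.ones(8)))   # [1. 2. 3. 4. 5. 6. 7. 8.]
```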
The requirement for this function to work is that the number of threads is greater than or equal to the size of the array. I have tested it on arrays of ones of sizes 16, 32, 64, 128, and so on up to 512, and it works fine! For some reason it doesn't work with an array of size 1024, which is unfortunately what I need for my application. 1024 should be the maximum allowed number of threads in a thread block. The last 5 elements of the array should be [1020, 1021, 1022, 1023, 1024]. Here is what I get:
```python
print(input_ar[-5:])
# [1352. 1348. 1344. 1348. 1344.]
```
The result actually changes every time I run it. Anyone have an idea?
You have a global memory race condition. The threads do not all execute in lockstep. Therefore this line results in ambiguous behavior:

```python
array[tidx] += array[tidx - (2**l)]
```

For example, what if thread 32 (i.e. `tidx == 32`) did its work entirely, and then thread 31 ran (if it helps for clarity, assume we are on the loop iteration where `l` is zero)? Conversely, what if thread 31 did its work entirely and then thread 32 ran? You would expect two different results for those two different cases.
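To make the hazard concrete, here is a purely host-side sketch (ordinary Python, not CUDA; the `step` helper exists only for this illustration) that applies the `l == 0` update in those two orders:

```python
import numpy as np

# One in-place update step for "thread" tidx at l == 0 (illustration only)
def step(array, tidx, l=0):
    array[tidx] += array[tidx - 2**l]

a = np.ones(64)
step(a, 31)        # thread 31 finishes first...
step(a, 32)        # ...then thread 32 reads the already-updated a[31]
print(a[32])       # 3.0

b = np.ones(64)
step(b, 32)        # thread 32 finishes first, reading the old b[31]
step(b, 31)
print(b[32])       # 2.0
```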
You can fix it by separating the loads from the stores with appropriate barriers:
```
$ cat t3.py
from numba import cuda
import math
import numpy as np

@cuda.jit
def c_sum(array):
    tidx = cuda.threadIdx.x
    nb_it = int(math.log2(cuda.blockDim.x))
    if tidx < array.size:
        for l in range(nb_it):
            if tidx > 2**l - 1:
                c = array[tidx - (2**l)]
            cuda.syncthreads()
            if tidx > 2**l - 1:
                array[tidx] += c
            cuda.syncthreads()

input_ar = np.ones([1024])
c_sum[1, 1024](input_ar)
print(input_ar[-5:])
$ python3 t3.py
/home/bob/.local/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
/home/bob/.local/lib/python3.10/site-packages/numba/cuda/cudadrv/devicearray.py:888: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))
[1020. 1021. 1022. 1023. 1024.]
$
```
This is not the only possible way to fix it, of course. You could use a temporary array and ping-pong back and forth between buffers, so that the operation does not work entirely "in place" (a sketch of that idea follows below), and there are probably other possible fixes as well. For performance reasons, you would probably also get the advice not to do this kind of work purely in global memory, but to use shared memory as well. But this is clearly a toy problem (indeed, recent versions of Numba CUDA emit boilerplate warnings to that effect, as can be seen in the output above). Suffice it to say there are better, faster, more efficient methods for this kind of work.
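As one illustration of the ping-pong idea, here is a minimal sketch (my own, untested; the kernel name `c_sum_pingpong` and the second buffer are assumptions, not part of the original code). Each iteration reads from one buffer and writes to the other, so no thread ever reads a location another thread is writing in the same step:

```python
from numba import cuda
import math
import numpy as np

@cuda.jit
def c_sum_pingpong(a, b):
    # a holds the input; a and b swap roles each iteration
    tidx = cuda.threadIdx.x
    nb_it = int(math.log2(cuda.blockDim.x))
    src, dst = a, b
    for l in range(nb_it):
        if tidx < src.size:
            if tidx > 2**l - 1:
                dst[tidx] = src[tidx] + src[tidx - 2**l]
            else:
                dst[tidx] = src[tidx]
        cuda.syncthreads()
        src, dst = dst, src
    # leave the result in a, whichever buffer it ended up in
    if tidx < src.size:
        a[tidx] = src[tidx]

input_ar = np.ones(1024)
tmp = np.zeros_like(input_ar)
c_sum_pingpong[1, 1024](input_ar, tmp)
print(input_ar[-5:])   # expect [1020. 1021. 1022. 1023. 1024.]
```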
This methodology cannot be directly extended beyond a single threadblock, because `cuda.syncthreads()` is only an execution barrier for the threads within a block. To go beyond a single block you would need to use a grid-wide (device-wide) barrier, or else use an alternative formulation to get a prefix sum.
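For completeness, here is a rough sketch of the grid-wide-barrier route, assuming a GPU, driver, and Numba version that support cooperative launches (the kernel name and launch configuration are my own, and the whole grid must be able to be resident on the device at once for such a launch to be legal):

```python
from numba import cuda
import math
import numpy as np

@cuda.jit
def c_sum_grid(array):
    # Same load/store-split scan, but across the whole grid,
    # using a grid-wide barrier instead of cuda.syncthreads()
    tidx = cuda.grid(1)
    g = cuda.cg.this_grid()
    nb_it = int(math.log2(array.size))   # assumes a power-of-two size
    for l in range(nb_it):
        c = 0.0
        if tidx < array.size and tidx > 2**l - 1:
            c = array[tidx - 2**l]
        g.sync()                         # all loads complete before any store
        if tidx < array.size and tidx > 2**l - 1:
            array[tidx] += c
        g.sync()                         # all stores complete before the next round

d_arr = cuda.to_device(np.ones(4096))
c_sum_grid[4, 1024](d_arr)               # grid sync requires a cooperative launch
print(d_arr.copy_to_host()[-5:])         # expect [4092. 4093. 4094. 4095. 4096.]
```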