cudanumbagpureduction

numba cuda does not produce correct result with += (gpu reduction needed?)


I am using numba cuda to calculate a function.

The code is simply to add up all the values into one result, but numba cuda gives me a different result from numpy.

numba code

 import math
 def numba_example(number_of_maximum_loop,gs,ts,bs):
        from numba import cuda
        result = cuda.device_array([3,])

        @cuda.jit(device=True)

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    @cuda.jit
    def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
        i = cuda.grid(1)
        if i < number_of_maximum_loop:
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)

    # Configure the blocks
    threadsperblock = 128
    blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock

    # Start the kernel 
    cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs) 

    return result.copy_to_host()

numba_example(1000,20,20,20) 

output:

array([ 0.17770302,  0.34166728,  0.35132036])

numpy code

import math
def numpy_example(number_of_maximum_loop,gs,ts,bs):
    import numpy as np
    result = np.zeros([3,])

    def BesselJ0(x):
        return math.sqrt(2/math.pi/x)

    for i in range(number_of_maximum_loop):
        result[0] += BesselJ0(i/100+gs)
        result[1] += BesselJ0(i/100+ts)
        result[2] += BesselJ0(i/100+bs)

    return result

numpy_example(1000,20,20,20) 

output:

array([ 160.40546935,  160.40546935,  160.40546935])

I don't know where I am being wrong. I guess I might use reduction. But it seems impossible to finish it with one cuda kernel.


Solution

  • Yes, a proper parallel reduction is needed to sum data from multiple GPU threads to a single variable.

    Here's one trivial example of how it could be done from a single kernel:

    $ cat t23.py
    import math
    def numba_example(number_of_maximum_loop,gs,ts,bs):
        from numba import cuda
        result = cuda.device_array([3,])
    
        @cuda.jit(device=True)
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)
    
        @cuda.jit
        def cuda_kernel(number_of_maximum_loop,result,gs,ts,bs):
            i = cuda.grid(1)
            if i < number_of_maximum_loop:
                cuda.atomic.add(result, 0, BesselJ0(i/100+gs))
                cuda.atomic.add(result, 1, BesselJ0(i/100+ts))
                cuda.atomic.add(result, 2, BesselJ0(i/100+bs))
    
    # Configure the blocks
        threadsperblock = 128
        blockspergrid = (number_of_maximum_loop + (threadsperblock - 1)) // threadsperblock
    
     # Start the kernel
        init = [0.0,0.0,0.0]
        result = cuda.to_device(init)
        cuda_kernel[blockspergrid, threadsperblock](number_of_maximum_loop,result,gs,ts,bs)
    
        return result.copy_to_host()
    
    print(numba_example(1000,20,20,20))
    $ python t23.py
    [ 162.04299487  162.04299487  162.04299487]
    $
    

    You can also do a proper reduction in numba directly with the reduce decorator as described here although I'm not sure 3 reductions can be done in a single kernel that way.

    Finally, you could write an ordinary cuda parallel reduction using numba cuda as indicated here. It should not be difficult I think to extend that to performing 3 reductions in a single kernel.

    These 3 different methods will likely have performance differences, of course.

    As an aside, if you're wondering about the results discrepancy between my code above and your python code in the question, I can't explain it. When I run your python code I get results matching the numba cuda code in my answer:

    $ cat t24.py
    import math
    def numpy_example(number_of_maximum_loop,gs,ts,bs):
        import numpy as np
        result = np.zeros([3,])
    
        def BesselJ0(x):
            return math.sqrt(2/math.pi/x)
    
        for i in range(number_of_maximum_loop):
            result[0] += BesselJ0(i/100+gs)
            result[1] += BesselJ0(i/100+ts)
            result[2] += BesselJ0(i/100+bs)
    
        return result
    
    print(numpy_example(1000,20,20,20))
    $ python t24.py
    [ 162.04299487  162.04299487  162.04299487]
    $