Tags: python, cuda, numba, heisenbug

Numba CUDA code crashing due to an unknown error, fixed by adding a blank print statement in any thread


I'm writing some Hamiltonian evolution code that relies heavily on matrix multiplication, so I've been trying to learn about developing for a GPU in Python.

However, when I run these lines of code inside my time evolution wrapper function:

momentum_update_kernel[blocks, threads_per_block](config, dt / 2, staple_gpu, Barray_gpu, V2Barray_gpu, g_in)
cuda.synchronize()
print("momentum updated")
link_update_kernel[blocks, threads_per_block](config, dt, lie_gens)

I started getting an "error 700" on the cuda.synchronize() line.

While attempting to debug this with print statements, I noticed that the code started running to completion and giving plausible answers; it apparently runs properly whenever a print statement is added to the kernel. This leads me to believe that there's some sort of thread synchronization issue, and that the version compiled with the print statement somehow avoids it. The momentum_update code is as follows:

from numba import cuda, complex128

@cuda.jit(device=True)
def momentum_update(config, dt, staple_index_array, Barray, V2Barray, idx, g):

    links = config[0]     # the (N, d, 2, 2) link array
    momentum = config[1]  # the (N, d, 2, 2) momentum array
    inshape = links.shape

    numnodes = inshape[0]
    numdims = inshape[1]

    total_matrices = numnodes * numdims

    # one thread per (node, direction) pair; surplus threads just return
    if idx >= total_matrices:
        return

    nodeindex = idx // numdims
    direction = idx % numdims

    if idx == 0:
        print("")  # there for debug reasons, not entirely sure what's happening here

    # making the staple
    temp = cuda.local.array((2, 2), dtype=complex128)
    temp2 = cuda.local.array((2, 2), dtype=complex128)
    Vdirection = cuda.local.array((2, 2), dtype=complex128)

    Vdirection[0, 0] = 0
    Vdirection[0, 1] = 0
    Vdirection[1, 0] = 0
    Vdirection[1, 1] = 0

    for i in range(numdims):
        # each entry of staple_index_array holds a (node, direction) pair into links
        idx_tuple_1 = staple_index_array[nodeindex, direction, i, 0, 0]
        idx_tuple_2 = staple_index_array[nodeindex, direction, i, 0, 1]
        idx_tuple_3 = staple_index_array[nodeindex, direction, i, 0, 2]
        idx_tuple_4 = staple_index_array[nodeindex, direction, i, 1, 0]
        idx_tuple_5 = staple_index_array[nodeindex, direction, i, 1, 1]
        idx_tuple_6 = staple_index_array[nodeindex, direction, i, 1, 2]

        staple_matrix_1 = links[idx_tuple_1[0], idx_tuple_1[1]]
        staple_matrix_2 = links[idx_tuple_2[0], idx_tuple_2[1]]
        staple_matrix_3 = links[idx_tuple_3[0], idx_tuple_3[1]]
        staple_matrix_4 = links[idx_tuple_4[0], idx_tuple_4[1]]
        staple_matrix_5 = links[idx_tuple_5[0], idx_tuple_5[1]]
        staple_matrix_6 = links[idx_tuple_6[0], idx_tuple_6[1]]

        Bval = Barray[nodeindex, direction, i]
        V2Bval = V2Barray[nodeindex, direction, i]

        # calculating the first staple
        dagger_2x2_cuda(staple_matrix_2, temp)
        matmul_2x2_cuda(staple_matrix_1, temp, temp2)
        dagger_2x2_cuda(staple_matrix_3, temp)
        matmul_2x2_cuda(temp2, temp, temp2)
        scale_2x2_cuda(temp2, Bval, temp2)
        add_2x2_cuda(temp2, Vdirection, Vdirection)

        # calculating the second staple
        dagger_2x2_cuda(staple_matrix_4, temp)
        dagger_2x2_cuda(staple_matrix_5, temp2)
        matmul_2x2_cuda(temp, temp2, temp2)
        matmul_2x2_cuda(temp2, staple_matrix_6, temp2)
        scale_2x2_cuda(temp2, V2Bval, temp2)
        add_2x2_cuda(temp2, Vdirection, Vdirection)

    # calculating the staple contribution
    matmul_2x2_cuda(links[nodeindex, direction], Vdirection, temp)
    dagger_2x2_cuda(temp, temp2)

    scale_2x2_cuda(temp, -1, temp)
    add_2x2_cuda(temp2, temp, temp)

    scale_2x2_cuda(temp, 1 / g**2, temp)
    scale_2x2_cuda(temp, dt, temp)

    # calculating the new momentum
    add_2x2_cuda(momentum[nodeindex, direction], temp, temp2)
    for i in range(2):
        for j in range(2):
            config[1][nodeindex, direction, i, j] = temp2[i, j]

The actual momentum_update_kernel code isn't particularly interesting, and essentially just defines idx and calls momentum_update, so I haven't included it here.
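
In outline, it is just a thin wrapper along these lines (a sketch, assuming the usual cuda.grid(1) pattern for the flat thread index):

@cuda.jit
def momentum_update_kernel(config, dt, staple_index_array, Barray, V2Barray, g):
    idx = cuda.grid(1)  # flat global thread index, one per (node, direction) pair
    momentum_update(config, dt, staple_index_array, Barray, V2Barray, idx, g)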

Now, one possible issue is some sort of race condition, because momentum_update both reads from and writes to config. However, I suspect that this isn't the case. config is a (2, N, d, 2, 2) array, and each thread in momentum_update should only read from:

  1. the link array, i.e. the (N, d, 2, 2) array stored in config[0], and
  2. the specific momentum value that it updates.

So multiple threads may be reading from the same memory at once, but nothing should be reading from something being written to by a different thread.
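
One quick host-side sanity check of that read pattern is to verify, before copying the data to the GPU, that every (node, direction) pair stored in staple_index_array actually lands inside the links array. A sketch, assuming staple_index_array is still a NumPy array on the host and that its last axis holds the (node, direction) pair (check_staple_indices is a hypothetical helper, not part of my code):

def check_staple_indices(staple_index_array, numnodes, numdims):
    # last axis holds the (node, direction) pair used to index into links
    nodes = staple_index_array[..., 0]
    dirs = staple_index_array[..., 1]
    assert 0 <= nodes.min() and nodes.max() < numnodes
    assert 0 <= dirs.min() and dirs.max() < numdims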

When I scale the simulation up to the size I actually want to run it at, it starts crashing regardless of whether the print statement is there, so I can't just leave the print statement in as a hacky fix.

Is there at least a way I can debug this? Obviously momentum_update calls several other functions, but if there were a bug in one of those I'd expect the Hamiltonian evolution to start giving incorrect answers, which doesn't seem to be happening. (I did have a bug in, say, matmul_2x2_cuda earlier that made it misbehave when the output array was the same as one of the inputs, but it was easy to spot because the simulation suddenly became numerically unstable.)
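
For reference, an aliasing-safe 2x2 multiply can be written along these lines (a sketch of the pattern, not my exact code: all four products are buffered in scalars before anything is written, so out may safely alias a or b):

@cuda.jit(device=True)
def matmul_2x2_cuda(a, b, out):
    # buffer every product before writing, so out may alias a or b
    r00 = a[0, 0] * b[0, 0] + a[0, 1] * b[1, 0]
    r01 = a[0, 0] * b[0, 1] + a[0, 1] * b[1, 1]
    r10 = a[1, 0] * b[0, 0] + a[1, 1] * b[1, 0]
    r11 = a[1, 0] * b[0, 1] + a[1, 1] * b[1, 1]
    out[0, 0] = r00
    out[0, 1] = r01
    out[1, 0] = r10
    out[1, 1] = r11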

I'm running this on the GTX 1060 in my desktop, but the bug is also present on, say, Compute Canada's GPU nodes (and in fact the print statement hack doesn't work there).


Solution

  • I started getting an "error 700" on the cuda.synchronize() line.

    Error 700 is, AFAIK, an illegal memory access (cudaErrorIllegalAddress), so it is typically caused by an out-of-bounds access.

    Is there at least a way I can debug this?

    Debugging CUDA code tends to be harder than debugging CPU code, especially with Numba. Fortunately, there are tools that help track down this kind of issue. This is mentioned in the documentation:

    By setting the debug argument to cuda.jit to True (@cuda.jit(debug=True)), Numba will emit source location in the compiled CUDA code. Unlike the CPU target, only filename and line information are available, but no variable type information is emitted. The information is sufficient to debug memory error with cuda-memcheck.
    [...]
    We can use cuda-memcheck to find the memory error:
    $ cuda-memcheck python chk_cuda_debug.py

    This should be enough to track down the illegal access in your Numba code.
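
    Applied to the wrapper kernel sketched in the question, that looks something like this (a sketch; note that on recent CUDA toolkits cuda-memcheck has been superseded by compute-sanitizer, which is used the same way):

    from numba import cuda

    @cuda.jit(debug=True)  # emit source locations so the sanitizer can map the fault to a file and line
    def momentum_update_kernel(config, dt, staple_index_array, Barray, V2Barray, g):
        idx = cuda.grid(1)
        momentum_update(config, dt, staple_index_array, Barray, V2Barray, idx, g)

    Then run the whole script under the sanitizer (your_script.py stands in for the actual entry point):

    $ compute-sanitizer python your_script.py
    $ cuda-memcheck python your_script.py    # on older toolkits

    The report names the kind of illegal access (e.g. an out-of-bounds global read), the offending source line, and the block/thread coordinates, which is usually enough to locate the bad index.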