PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).

The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.

Kernel code and setup:

kernel_code = '''
    #include <curand_kernel.h>
    #include <math.h>

    extern "C" {

    __device__ float get_random_number(curandState* global_state, int thread_id) {

        curandState local_state = global_state[thread_id];
        float num = curand_uniform(&local_state);
        global_state[thread_id] = local_state;
        return num;
    }

    __global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {

        int x = threadIdx.x + blockIdx.x * blockDim.x;             // column index of cell
        int y = threadIdx.y + blockIdx.y * blockDim.y;             // row index of cell

        // make sure this cell is within bounds of grid
        if (x < grid_size && y < grid_size) {

            int thread_id = y * grid_size + x;                      // thread index
            grid_b[thread_id] = grid_a[thread_id];                  // copy current cell
            float num;

            // ignore cell if it is not already populated
            if (grid_a[thread_id] > 0.0) {

                num = get_random_number(global_state, thread_id);

                // agents in this cell die
                if (num < survival_probabilities[thread_id]) {
                    grid_b[thread_id] = 0.0;                        // cell dies
                    //printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
                }
            }
        }
    }

mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')

Data setup:

matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims

grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))

Kernel call:

survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities, 
    grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))

I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.

pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered

Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?

Solution

The problem here is a disconnect between this code which determines the size of the state which the PyCUDA curandom interface allocates for its internal state and this code in your post:

matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims

You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimension you select in you code. That is obviously unlikely, particularly at large grid sizes. You either need to

Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select

I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.