python, cuda, jupyter-notebook, pycuda, curand

Does order of memory allocation matter in PyCUDA's curandom?


I use PyCUDA's interface [1] over CUDA Unified Memory [2]. At some point I added random number generators [3] and started to see dead kernels in Jupyter Notebook: "The kernel appears to have died".

I narrowed the problem down to the creation of the random number generator, or, to be precise, to the moment I run this:

import pycuda.curandom
from pycuda import autoinit, driver
import numpy as np

gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)

The code above fails without any error message, but if I move the gpu_generator = ... line one line up or down, it appears to work fine.
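
For example, creating the generator after both allocations runs cleanly for me:

gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
# Generator moved one line down: no dead kernel
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)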

I believe PyCUDA might somehow fail to execute the prepare call, which comes down to this kernel:

extern "C" {
    __global__ void prepare(curandStateXORWOW *s, const int n,
        unsigned int *v, const unsigned int o)
    {
      const int id = blockIdx.x*blockDim.x+threadIdx.x;
      if (id < n)
        curand_init(v[id], id, o, &s[id]);
    }
}

Any idea what might be the problem?


Solution

  • It is illegal in the pre-Pascal UM (Unified Memory) regime for host code to touch a managed allocation after a kernel has been launched but before a cudaDeviceSynchronize() has been issued.
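
    As an illustration, here is a minimal sketch of that pattern in isolation. The touch kernel is my own stand-in for any kernel launch (it is not part of your repro), and on a pre-Pascal device I would expect the commented-out host write to fault:

    import numpy as np
    import pycuda.autoinit
    from pycuda import driver
    from pycuda.compiler import SourceModule

    # Any kernel launch opens the window in which the host must not
    # touch managed memory until the device has been synchronized.
    mod = SourceModule("""
    __global__ void touch(int *a) { a[threadIdx.x] += 1; }
    """)
    touch = mod.get_function("touch")

    a = driver.managed_zeros(shape=5, dtype=np.int32,
                             mem_flags=driver.mem_attach_flags.GLOBAL)
    touch(a, grid=(1, 1), block=(5, 1, 1))  # kernel launched, may still be in flight
    # a[0] = 1                              # illegal here on pre-Pascal: host touch before sync
    pycuda.autoinit.context.synchronize()   # closes the window
    a[0] = 1                                # legal: the kernel is known to have finished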

    I am guessing your code violates this rule. If I run your repro case on a Maxwell system, I get this:

    $ cuda-memcheck python ./idontthinkso.py
    ========= CUDA-MEMCHECK
    ========= Error: process didn't terminate successfully
    ========= Fatal UVM CPU fault due to invalid operation
    =========     during write access to address 0x703bc1000
    =========
    ========= ERROR SUMMARY: 1 error
    

    That is the managed memory system blowing up. Placing a synchronization call between the random generator setup (which launches a kernel) and the zeros call (which touches managed memory from the host) gets rid of it on my system:

    $ cat idontthinkso.py 
    import pycuda.curandom
    from pycuda import autoinit, driver
    import numpy as np
    
    gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
    gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
    autoinit.context.synchronize()
    gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
    
    $ cuda-memcheck python ./idontthinkso.py
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    

    The UM regime you are in will vary depending on which GPU, driver, and OS you use.
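
    If you want to check which regime a given machine is in, querying the device attribute below is, to my knowledge, the most direct way, since it reflects the GPU, driver, and OS together. (This check is my own suggestion, and it assumes a CUDA 8.0+ build of PyCUDA, which exposes CONCURRENT_MANAGED_ACCESS.)

    import pycuda.autoinit
    from pycuda import driver

    dev = pycuda.autoinit.device
    print(dev.compute_capability())  # (major, minor); below (6, 0) is pre-Pascal
    print(bool(dev.get_attribute(
        driver.device_attribute.CONCURRENT_MANAGED_ACCESS)))  # False = restrictive regime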