I use PyCUDA's interface [1] over CUDA Unified Memory [2]. At some point I added random number generators [3] and started to see dead kernels in Jupyter Notebook.
I narrowed the problem down to the creation of the random number generator, or, to be precise, to the moment when I do this:
import pycuda.curandom
from pycuda import autoinit, driver
import numpy as np
gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
The code above fails without any error message, but if I move the gpu_generator = ... line one line up or down, everything appears to work fine.
I believe PyCUDA might somehow fail to execute the prepare
call, which comes down to this kernel:
extern "C" {
__global__ void prepare(curandStateXORWOW *s, const int n,
unsigned int *v, const unsigned int o)
{
const int id = blockIdx.x*blockDim.x+threadIdx.x;
if (id < n)
curand_init(v[id], id, o, &s[id]);
}
}
Any idea what might be the problem?
It is illegal in a pre-Pascal UM (Unified Memory) regime for host code to touch a managed allocation after a kernel has been launched but before a cudaDeviceSynchronize() has been issued.
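
To make the rule concrete, here is a minimal PyCUDA sketch of the failure mode, assuming a pre-Pascal GPU in the non-concurrent UM regime; the busy kernel and the buffer names are made up for illustration:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Any kernel launch will do; it does not even have to touch managed memory.
mod = SourceModule("""
__global__ void busy(float *x) { x[threadIdx.x] *= 2.0f; }
""")
busy = mod.get_function("busy")

managed = drv.managed_zeros(shape=5, dtype=np.int32,
                            mem_flags=drv.mem_attach_flags.GLOBAL)
scratch = drv.mem_alloc(32 * np.float32().nbytes)  # ordinary device memory
busy(scratch, block=(32, 1, 1), grid=(1, 1))       # kernel launched, possibly still running

# Pre-Pascal rule: until the context is synchronized, the CPU must not touch
# ANY managed allocation, so this host write can trigger the UVM CPU fault:
# managed[0] = 1

pycuda.autoinit.context.synchronize()  # wait for the kernel to finish
managed[0] = 1                         # now the host access is legal

On a Pascal-or-newer setup with concurrent managed access, the commented-out host write would be legal, which is why the behaviour can differ between machines.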
I am guessing your code violates that rule. If I run your repro case on a Maxwell system, I get this:
$ cuda-memcheck python ./idontthinkso.py
========= CUDA-MEMCHECK
========= Error: process didn't terminate successfully
========= Fatal UVM CPU fault due to invalid operation
========= during write access to address 0x703bc1000
=========
========= ERROR SUMMARY: 1 error
That is the managed memory system blowing up. Placing a synchronization call between the random generator setup (which runs a kernel) and the zeros call (which touches managed memory) gets rid of it on my system:
$ cat idontthinkso.py
import pycuda.curandom
from pycuda import autoinit, driver
import numpy as np
gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
autoinit.context.synchronize()
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
$ cuda-memcheck python ./idontthinkso.py
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
The UM regime you are in will vary depending on which GPU, driver, and OS you use.
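
If you want to check which regime you are on, one option (an untested sketch; the attribute requires CUDA 8.0 or newer and a PyCUDA build that exposes it) is to query the CONCURRENT_MANAGED_ACCESS device attribute:

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
if dev.get_attribute(drv.device_attribute.CONCURRENT_MANAGED_ACCESS):
    # Pascal or newer on a supporting OS: CPU and GPU may access managed
    # memory concurrently, so the repro above would not fault.
    print("concurrent managed access: post-Pascal UM regime")
else:
    # Pre-Pascal regime: synchronize before the host touches managed memory.
    print("no concurrent managed access: pre-Pascal UM regime")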