python, cuda, jupyter-notebook, pycuda, curand

Does order of memory allocation matter in PyCUDA's curandom?


I use PyCUDA's interface [1] over CUDA Unified Memory [2]. At some point I added random number generators [3] and started to see dead kernels in Jupyter Notebook: "The kernel appears to have died".

I narrowed the problem down to the creation of the random number generator, or, to be precise, to the moment I run this:

import pycuda.curandom
from pycuda import autoinit, driver
import numpy as np

gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)

The code above fails without any error message, but if I move the gpu_generator = ... line one line up or down, it appears to work fine.
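
For example, creating the generator after both allocations runs cleanly for me:

gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
# Generator moved one line down: no dead kernel
gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)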

I believe PyCUDA might somehow fail to execute the prepare call, which comes down to this kernel:

extern "C" {
    __global__ void prepare(curandStateXORWOW *s, const int n,
        unsigned int *v, const unsigned int o)
    {
      const int id = blockIdx.x*blockDim.x+threadIdx.x;
      if (id < n)
        curand_init(v[id], id, o, &s[id]);
    }
}

Any idea what might be the problem?


Solution

  • It is illegal in the pre-Pascal UM (Unified Memory) regime for host code to touch a managed allocation after a kernel has been launched but before a cudaDeviceSynchronize() has been issued.
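
    As an illustration, here is a minimal sketch of that pattern in isolation. The touch kernel is my own stand-in for any kernel launch (it is not part of your repro), and on a pre-Pascal device I would expect the commented-out host write to fault:

    import numpy as np
    import pycuda.autoinit
    from pycuda import driver
    from pycuda.compiler import SourceModule

    # Any kernel launch opens the window in which the host must not
    # touch managed memory until the device has been synchronized.
    mod = SourceModule("""
    __global__ void touch(int *a) { a[threadIdx.x] += 1; }
    """)
    touch = mod.get_function("touch")

    a = driver.managed_zeros(shape=5, dtype=np.int32,
                             mem_flags=driver.mem_attach_flags.GLOBAL)
    touch(a, grid=(1, 1), block=(5, 1, 1))  # kernel launched, may still be in flight
    # a[0] = 1                              # illegal here on pre-Pascal: host touch before sync
    pycuda.autoinit.context.synchronize()   # closes the window
    a[0] = 1                                # legal: the kernel is known to have finished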

    I am guessing your code violates this rule. If I run your repro case on a Maxwell system, I get this:

    $ cuda-memcheck python ./idontthinkso.py
    ========= CUDA-MEMCHECK
    ========= Error: process didn't terminate successfully
    ========= Fatal UVM CPU fault due to invalid operation
    =========     during write access to address 0x703bc1000
    =========
    ========= ERROR SUMMARY: 1 error
    

    That is the managed memory system blowing up. Placing a synchronization call between the random generator setup (which launches a kernel) and the zeros call (which touches managed memory from the host) gets rid of it on my system:

    $ cat idontthinkso.py 
    import pycuda.curandom
    from pycuda import autoinit, driver
    import numpy as np
    
    gpu_data_1 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
    gpu_generator = pycuda.curandom.XORWOWRandomNumberGenerator(pycuda.curandom.seed_getter_uniform)
    autoinit.context.synchronize()
    gpu_data_2 = driver.managed_zeros(shape=5, dtype=np.int32, mem_flags=driver.mem_attach_flags.GLOBAL)
    
    $ cuda-memcheck python ./idontthinkso.py
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    

    The UM regime you are in will vary depending on which GPU, driver, and OS you use.
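
    If you want to check which regime a given machine is in, querying the device attribute below is, to my knowledge, the most direct way, since it reflects the GPU, driver, and OS together. (This check is my own suggestion, and it assumes a CUDA 8.0+ build of PyCUDA, which exposes CONCURRENT_MANAGED_ACCESS.)

    import pycuda.autoinit
    from pycuda import driver

    dev = pycuda.autoinit.device
    print(dev.compute_capability())  # (major, minor); below (6, 0) is pre-Pascal
    print(bool(dev.get_attribute(
        driver.device_attribute.CONCURRENT_MANAGED_ACCESS)))  # False = restrictive regime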