[SOLVED] error when calling a cuda kernel in python using pycuda

I'm trying to call a CUDA kernel from a Python script using PyCUDA. The main CUDA kernel launches other CUDA kernels. Compiling with the following command produces no errors:

nvcc -shared -rdc=true -o test.so test.cu -Xcompiler -fPIC

However, when I try to run the python script, I get the following error:

raise CompileError(
pycuda.driver.CompileError: nvcc compilation of /tmp/tmpkyzts09l/kernel.cu failed
[command: nvcc --cubin -arch sm_86 -I/home/user/wrapper_test/wrapper/lib/python3.8/site-packages/pycuda/cuda kernel.cu]
[stderr:
kernel.cu(111): error: kernel launch from __device__ or __global__ functions requires separate compilation mode

kernel.cu(145): error: kernel launch from __device__ or __global__ functions requires separate compilation mode

This is the cuda code:

__global__ void square(float *array, int n) {
    if (array[0] == 0) {
        array[0] = 5;
        array[1] = 6;
        array[2] = 7;
    }
}

__global__ void square_kernel(float *array, int n) {
    square<<<1, 1>>>(array, n);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        array[idx] = array[idx] * array[idx];
    }
}

And this is the python code:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

# Load the CUDA module
module = SourceModule(open('test.cu', 'r').read())

# Get a reference to the CUDA kernel function
square_kernel = module.get_function("square_kernel")

# Create data on the CPU
data = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
n = len(data)

# Allocate GPU memory
gpu_data = cuda.mem_alloc(data.nbytes)

# Transfer data from CPU to GPU
cuda.memcpy_htod(gpu_data, data)

# Define block and grid sizes
block_size = (1024, 1, 1)
grid_size = ((n + block_size[0] - 1) // block_size[0], 1)

# Launch the CUDA kernel
square_kernel(gpu_data, np.int32(n), block=block_size, grid=grid_size)

# Wait for kernel to finish
cuda.Context.synchronize()

# Transfer the result back from GPU to CPU
cuda.memcpy_dtoh(data, gpu_data)

# Print the result (data now holds the values copied back from the GPU)
print("Result:", data)

How can I solve this problem? Thanks in advance for your help.

Solution

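To JIT-compile and link source that uses dynamic parallelism (kernels launched from device code), pass the source to a DynamicSourceModule instance instead of the conventional SourceModule. DynamicSourceModule adds the device linker steps required to produce a complete object that the driver API can load into a context, which avoids the error you are seeing. Note that the test.so you built with nvcc plays no part here: the script reads test.cu and PyCUDA compiles it at run time without relocatable device code, which is why the standalone nvcc command succeeds while the script fails.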
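
As a minimal sketch of that change, the script from the question could be adapted as below. It assumes a PyCUDA version that ships DynamicSourceModule and a GPU that supports dynamic parallelism. The names KERNEL_SOURCE, grid_for and run_square are hypothetical helpers introduced for this example, the kernel source is inlined rather than read from test.cu, and the block size of 256 is an arbitrary choice.

```python
# Kernels from the question: square_kernel launches square from device code,
# which is what requires relocatable device code and a device link step.
KERNEL_SOURCE = r"""
__global__ void square(float *array, int n)
{
    if (array[0] == 0) {
        array[0] = 5;
        array[1] = 6;
        array[2] = 7;
    }
}

__global__ void square_kernel(float *array, int n)
{
    square<<<1, 1>>>(array, n);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        array[idx] = array[idx] * array[idx];
    }
}
"""


def grid_for(n, block_x):
    """Hypothetical helper: blocks needed to cover n elements (ceiling division)."""
    return ((n + block_x - 1) // block_x, 1)


def run_square(data):
    """Hypothetical helper: run square_kernel in place on a float32 NumPy array."""
    # Third-party imports are deferred so the module can be imported
    # on a machine without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    import pycuda.driver as cuda
    from pycuda.compiler import DynamicSourceModule

    # The only change from the question: DynamicSourceModule, not SourceModule.
    module = DynamicSourceModule(KERNEL_SOURCE)
    square_kernel = module.get_function("square_kernel")

    # Allocate GPU memory and copy the input over.
    gpu_data = cuda.mem_alloc(data.nbytes)
    cuda.memcpy_htod(gpu_data, data)

    # Launch with enough blocks to cover every element.
    block = (256, 1, 1)
    square_kernel(gpu_data, np.int32(data.size),
                  block=block, grid=grid_for(data.size, block[0]))
    cuda.Context.synchronize()

    # Copy the result back into the caller's array.
    cuda.memcpy_dtoh(data, gpu_data)
    return data
```

Calling run_square(np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)) should then compile, link and launch without the "separate compilation mode" error. The rest of the original script can stay as it was; only the module class needs to change.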