I'm trying to call a CUDA kernel from a Python script using PyCUDA. The main CUDA kernel launches other CUDA kernels from inside it. Compiling with the following command completes without errors:
nvcc -shared -rdc=true -o test.so test.cu -Xcompiler -fPIC
However, when I try to run the python script, I get the following error:
raise CompileError(
pycuda.driver.CompileError: nvcc compilation of /tmp/tmpkyzts09l/kernel.cu failed
[command: nvcc --cubin -arch sm_86 -I/home/user/wrapper_test/wrapper/lib/python3.8/site-packages/pycuda/cuda kernel.cu]
[stderr:
kernel.cu(111): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
kernel.cu(145): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
This is the cuda code:
__global__ void square(float *array, int n) {
    if (array[0] == 0)
    {
        array[0] = 5;
        array[1] = 6;
        array[2] = 7;
    }
}

__global__ void square_kernel(float *array, int n) {
    square<<<1,1>>>(array, n);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        array[idx] = array[idx] * array[idx];
    }
}
And this is the python code:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
# Load the CUDA module
module = SourceModule(open('test.cu', 'r').read())
# Get a reference to the CUDA kernel function
square_kernel = module.get_function("square_kernel")
# Create data on the CPU
data = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
n = len(data)
# Allocate GPU memory
gpu_data = cuda.mem_alloc(data.nbytes)
# Transfer data from CPU to GPU
cuda.memcpy_htod(gpu_data, data)
# Define block and grid sizes
block_size = (1024, 1, 1)
grid_size = ((n + block_size[0] - 1) // block_size[0], 1)
# Launch the CUDA kernel
square_kernel(gpu_data, np.int32(n), block=block_size, grid=grid_size)
# Wait for kernel to finish
cuda.Context.synchronize()
# Transfer the result back from GPU to CPU
cuda.memcpy_dtoh(data, gpu_data)
# Print the result (data now holds the values copied back from the GPU)
print("Result:", data)
How can I solve this problem? Thanks in advance for your help.
To JIT compile and link source containing dynamic parallelism, you must pass the source to a DynamicSourceModule instance rather than the conventional SourceModule. The former adds the device linker steps required to produce a complete object which the driver API can load into a context without the error you are seeing.
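
As a minimal sketch of that change, assuming a PyCUDA version that ships pycuda.compiler.DynamicSourceModule and a GPU that supports dynamic parallelism, the script from the question could be adapted roughly as below. The kernels are the ones from the question, embedded as a string so the example is self-contained. The names KERNEL_SOURCE, grid_size_for and run_square are hypothetical helpers introduced only for illustration. Note that the error message shows PyCUDA invoking nvcc itself on the source text, so the separately built test.so does not appear to be involved when the script runs.

```python
import numpy as np

# Same kernels as in the question, embedded as a string for self-containment.
KERNEL_SOURCE = r"""
__global__ void square(float *array, int n) {
    if (array[0] == 0) {
        array[0] = 5;
        array[1] = 6;
        array[2] = 7;
    }
}

__global__ void square_kernel(float *array, int n) {
    // Device-side launch: this is what requires separate compilation mode.
    square<<<1, 1>>>(array, n);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        array[idx] = array[idx] * array[idx];
    }
}
"""


def grid_size_for(n, block_x):
    """Hypothetical helper: blocks needed to cover n elements."""
    return ((n + block_x - 1) // block_x, 1)


def run_square(data):
    """Hypothetical helper: run square_kernel on `data`, return the result."""
    # Imported here because pycuda.autoinit creates a CUDA context on import.
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as cuda
    from pycuda.compiler import DynamicSourceModule

    # DynamicSourceModule compiles relocatable device code and links it
    # against the CUDA device runtime before loading the module.
    module = DynamicSourceModule(KERNEL_SOURCE)
    square_kernel = module.get_function("square_kernel")

    data = np.ascontiguousarray(data, dtype=np.float32)
    gpu_data = cuda.mem_alloc(data.nbytes)
    cuda.memcpy_htod(gpu_data, data)

    block_size = (1024, 1, 1)
    grid_size = grid_size_for(data.size, block_size[0])
    square_kernel(gpu_data, np.int32(data.size),
                  block=block_size, grid=grid_size)
    cuda.Context.synchronize()

    # Copy into a fresh array so the caller's input stays untouched.
    result = np.empty_like(data)
    cuda.memcpy_dtoh(result, gpu_data)
    return result
```

Calling run_square(np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)) on a machine with a working CUDA toolkit should then compile, link and launch the kernel. The only change relative to the original script that matters for the error is swapping SourceModule for DynamicSourceModule; the rest is the same allocate, copy, launch, copy-back sequence.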