I'm working with the code from the JCuda documentation. Currently it just adds vectors on the GPU. What should I do to reuse the add function on the CPU (host)?
I know that I have to change __global__ to __host__ __device__, but I have no idea how to call it from my main function. I suspect I have to use another nvcc option.
My goal is to run this same function on the GPU and on the CPU and compare execution times (I know how to measure them).
The .cu file (compiled with nvcc -ptx file.cu -o file.ptx):
extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i<n)
{
sum[i] = a[i] + b[i];
}
}
A fragment of the main function:
public static void main(String[] args) {
    // Initialize the driver API and acquire a context on the first device.
    cuInit(0);
    CUdevice device = new CUdevice();
    cuDeviceGet(device, 0);
    CUcontext context = new CUcontext();
    cuCtxCreate(context, 0, device);

    // Load the compiled PTX and obtain a handle to the "add" kernel.
    CUmodule module = new CUmodule();
    cuModuleLoad(module, "kernels/JCudaVectorAdd.ptx");
    CUfunction function = new CUfunction();
    cuModuleGetFunction(function, module, "add");
    ...
    // Kernel parameters: pointers to the argument values, in declaration order.
    Pointer kernelParameters = Pointer.to(
        Pointer.to(new int[]{numElements}),
        Pointer.to(deviceInputA),
        Pointer.to(deviceInputB),
        Pointer.to(deviceOutput)
    );
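
To make the goal concrete: in a plain CUDA C++ program compiled directly with nvcc (so not the JCuda/PTX path above), I imagine the reuse would look roughly like this sketch; addElement and addOnHost are placeholder names of mine, and the part I don't know is how to get the CPU half of this from Java.

// vectorAdd.cu -- built natively: nvcc vectorAdd.cu -o vectorAdd

// The element-wise work, compiled for both host and device.
__host__ __device__ float addElement(float a, float b)
{
    return a + b;
}

// GPU entry point: one thread per element, as before.
extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        sum[i] = addElement(a[i], b[i]);
    }
}

// CPU version: a plain loop over the same helper.
void addOnHost(int n, float *a, float *b, float *sum)
{
    for (int i = 0; i < n; i++)
    {
        sum[i] = addElement(a[i], b[i]);
    }
}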
You can't, and probably never will be able to do this in JCuda, because of the API interface it uses to interact with CUDA.
While CUDA can now "launch" a host function into a stream, that API isn't exposed by JCuda at present, and it wouldn't work the way that device code does now (this restriction applies to PyCUDA and other driver-API-based frameworks as well).
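
For reference, this is roughly what that stream-ordered host "launch" looks like in native CUDA (the runtime API call cudaLaunchHostFunc, available since CUDA 10); it is a sketch of the native feature, not something you can currently reach from JCuda:

#include <stdio.h>
#include <cuda_runtime.h>

// CUDA invokes this on a CPU thread once the preceding work in the stream
// has finished; the callback must not call CUDA API functions itself.
void CUDART_CB hostAdd(void *userData)
{
    printf("host function ran after the stream's preceding work\n");
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the host function behind whatever is already in the stream.
    cudaLaunchHostFunc(stream, hostAdd, NULL);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}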
You would likely need to use JNI, or some other mechanism, to load the host function from a library and call it that way.
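
If you go the JNI route, the native side might look like the sketch below; the NativeAdd class name and the add signature are placeholders of mine (a matching Java class would declare public static native void add(int n, float[] a, float[] b, float[] sum) and load the library), and the library itself could be built with nvcc so the same __host__ __device__ helper is shared with the kernel.

#include <jni.h>

// CPU implementation of the same addition.
static void addOnHost(int n, const float *a, const float *b, float *sum)
{
    for (int i = 0; i < n; i++)
    {
        sum[i] = a[i] + b[i];
    }
}

// Matches the hypothetical Java declaration NativeAdd.add(int, float[], float[], float[]).
extern "C" JNIEXPORT void JNICALL
Java_NativeAdd_add(JNIEnv *env, jclass cls, jint n,
                   jfloatArray a, jfloatArray b, jfloatArray sum)
{
    jfloat *pa = env->GetFloatArrayElements(a, NULL);
    jfloat *pb = env->GetFloatArrayElements(b, NULL);
    jfloat *ps = env->GetFloatArrayElements(sum, NULL);

    addOnHost(n, pa, pb, ps);

    env->ReleaseFloatArrayElements(a, pa, JNI_ABORT);  // inputs: discard changes
    env->ReleaseFloatArrayElements(b, pb, JNI_ABORT);
    env->ReleaseFloatArrayElements(sum, ps, 0);        // copy results back to Java
}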