
JCuda: copy multidimensional array from device to host

I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).

In short, my kernel writes its results into a two-dimensional array of shorts, where the first dimension is the number of threads, so that each thread writes to its own location.

Here is an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?


Apparently, there is a limitation that doesn't allow memory allocated on the device (inside the kernel) to be copied back to the host, as stated in this (and many other) threads: link

Any workaround? I'm using CUDA Toolkit v5.0


  • Here we copy a two-dimensional array of integers from the device to the host. All device memory is allocated from the host with cuMemAlloc, not inside the kernel.

    1. First, create a one-dimensional host array of device pointers, one per row (here blockSizeX rows), and allocate device memory for each row.

      CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
      for (int i = 0; i < blockSizeX; i++) {
          hostDevicePointers[i] = new CUdeviceptr();
          // Each row holds size ints
          cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
      }
    2. Allocate device memory for the array of row pointers, and copy the row pointers from the host to the device.

      CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
      cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
      cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
    3. Launch the kernel, passing hostDevicePointersArray as one of its arguments.
    4. Transfer the output from the device to host.

      // Copy the rows back one at a time; each row was allocated with size ints in step 1
      int hostOutputData[] = new int[size];
      for (int i = 0; i < blockSizeX; i++) {
          cuMemcpyDtoH(Pointer.to(hostOutputData), hostDevicePointers[i], size * Sizeof.INT);
          for (int j = 0; j < size; j++)
              sum = sum + hostOutputData[j];
      }
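
If every row has the same length, another common option (not part of the steps above) is to keep the whole result in one flat, row-major buffer, so that a single cuMemAlloc and a single cuMemcpyDtoH are enough and the kernel indexes it as row * cols + col. The host-side index arithmetic can be sketched in plain Java as follows. This is only a sketch under that assumption: the class and method names are hypothetical, no JCuda calls are made, and the hard-coded array stands in for the buffer that the copy would fill.

```java
import java.util.Arrays;

public class FlatLayoutSketch {

    // Index of element (row, col) in a row-major flat buffer with cols elements per row.
    public static int flatIndex(int row, int col, int cols) {
        return row * cols + col;
    }

    // Splits a flat row-major buffer into one array per row (one row per thread).
    public static int[][] unflatten(int[] flat, int rows, int cols) {
        if (flat.length != rows * cols) {
            throw new IllegalArgumentException("flat.length must equal rows * cols");
        }
        int[][] result = new int[rows][];
        for (int r = 0; r < rows; r++) {
            result[r] = Arrays.copyOfRange(flat, r * cols, (r + 1) * cols);
        }
        return result;
    }

    public static void main(String[] args) {
        int rows = 2; // hypothetical number of threads
        int cols = 3; // hypothetical number of results per thread
        // Stand-in for the host buffer that a single device-to-host copy would fill
        int[] flat = {1, 2, 3, 4, 5, 6};
        int[][] perThread = unflatten(flat, rows, cols);
        System.out.println(Arrays.deepToString(perThread)); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```

With this layout no array of device pointers is needed at all, at the cost of requiring equal-length rows.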