I have a very simple Scala JCuda program that adds two very large arrays. Everything compiles and runs fine until I try to copy more than 4 bytes from the device to the host, at which point I get CUDA_ERROR_INVALID_VALUE.
// This pukes and gives CUDA_ERROR_INVALID_VALUE
var hostOutput = new Array[Int](numElements)
cuMemcpyDtoH(
Pointer.to(hostOutput),
deviceOutput,
8
)
// This runs just fine
var hostOutput = new Array[Int](numElements)
cuMemcpyDtoH(
Pointer.to(hostOutput),
deviceOutput,
4
)
To give better context for the actual program, below is my kernel code, which compiles and runs just fine:
extern "C"
__global__ void add(int n, int *a, int *b, int *sum) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i<n)
{
sum[i] = a[i] + b[i];
}
}
I translated some Java sample code into Scala. Below is the entire program:
package dev
import jcuda.driver.JCudaDriver._
import jcuda._
import jcuda.driver._
import jcuda.runtime._
/**
* Created by dev on 6/7/15.
*/
object TestCuda {
def init = {
JCudaDriver.setExceptionsEnabled(true)
// Path to the compiled kernel (a cubin file)
val kernelPath = "/home/dev/IdeaProjects/jniopencl/src/main/resources/kernels/JCudaVectorAddKernel30.cubin"
cuInit(0)
val device = new CUdevice
cuDeviceGet(device, 0)
val context = new CUcontext
cuCtxCreate(context, 0, device)
// Create and load module
val module = new CUmodule()
cuModuleLoad(module, kernelPath)
// Obtain a function pointer to the kernel function.
var add = new CUfunction()
cuModuleGetFunction(add, module, "add")
val numElements = 100000
val hostInputA = 1 to numElements toArray
val hostInputB = 1 to numElements toArray
val SI: Int = Sizeof.INT.asInstanceOf[Int]
// Allocate the device input data, and copy
// the host input data to the device
var deviceInputA = new CUdeviceptr
cuMemAlloc(deviceInputA, numElements * SI)
cuMemcpyHtoD(
deviceInputA,
Pointer.to(hostInputA),
numElements * SI
)
var deviceInputB = new CUdeviceptr
cuMemAlloc(deviceInputB, numElements * SI)
cuMemcpyHtoD(
deviceInputB,
Pointer.to(hostInputB),
numElements * SI
)
// Allocate device output memory
val deviceOutput = new CUdeviceptr()
cuMemAlloc(deviceOutput, SI)
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
val kernelParameters = Pointer.to(
Pointer.to(Array[Int](numElements)),
Pointer.to(deviceInputA),
Pointer.to(deviceInputB),
Pointer.to(deviceOutput)
)
// Call the kernel function
val blockSizeX = 256
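// Integer ceiling division: numElements / blockSizeX truncates before Math.ceil
// is applied, which would leave the last partial block of elements unprocessed.
val gridSizeX = (numElements + blockSizeX - 1) / blockSizeX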
cuLaunchKernel(
add,
gridSizeX, 1, 1,
blockSizeX, 1, 1,
0, null,
kernelParameters, null
)
cuCtxSynchronize
// **** Code pukes here with that error
// If I comment this out the program runs fine
var hostOutput = new Array[Int](numElements)
cuMemcpyDtoH(
Pointer.to(hostOutput),
deviceOutput,
numElements
)
hostOutput.foreach(print(_))
}
}
For reference, here are the specs of my computer: I'm running Ubuntu 14.04 on an Optimus setup with a GTX 770M card, which is compute capability 3.0. I'm also running NVCC version 5.5, and Scala 2.11.6 with Java 8. I'm a noob and would greatly appreciate any help.
Here
val deviceOutput = new CUdeviceptr()
cuMemAlloc(deviceOutput, SI)
you are allocating SI bytes, which is 4 bytes (the size of one int). The kernel writes numElements ints through this device pointer, so writing more than 4 bytes to it will mess things up. It should be
cuMemAlloc(deviceOutput, SI * numElements)
And similarly, I think that the call in question should be
cuMemcpyDtoH(
Pointer.to(hostOutput),
deviceOutput,
numElements * SI
)
(note the * SI for the last parameter, since the size is given in bytes).
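To make the byte arithmetic concrete, here is a minimal sketch in plain Scala with no JCuda dependency. The names BufferSizes, intBufferBytes and intsCovered are hypothetical helpers made up for illustration, and BytesPerInt is assumed to match the value JCuda exposes as Sizeof.INT:

```scala
object BufferSizes {
  // Bytes per 32-bit Int (assumed equal to JCuda's Sizeof.INT).
  val BytesPerInt: Int = 4

  // Byte count to pass to cuMemAlloc / cuMemcpyHtoD / cuMemcpyDtoH
  // for a buffer holding numElements Ints.
  def intBufferBytes(numElements: Int): Long =
    numElements.toLong * BytesPerInt

  // Number of whole Ints covered by a given byte count.
  def intsCovered(byteCount: Long): Long =
    byteCount / BytesPerInt
}

object BufferSizesDemo extends App {
  val numElements = 100000

  // Bytes needed for the full output array.
  println(BufferSizes.intBufferBytes(numElements)) // prints 400000

  // Passing the element count as the byte count covers only a quarter of the array.
  println(BufferSizes.intsCovered(numElements.toLong)) // prints 25000
}
```

With numElements = 100000, the allocation and the copy both need 400000 bytes. Passing numElements alone as the size would ask for 100000 bytes, which is only 25000 ints and still far more than the 4 bytes that were allocated.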