I'm trying to optimize my code which I accelerated using basically OpenACC only. Is it a good approach to insert CUDA such as in the example that follows? In this case, u_device and v_device are used by the device only. Using cudaMalloc assures me that I allocate memory on the device memory and not on the host memory too.
int size = NVAR * sizeof(double);
// Declare pointers that will point to the memory allocated on the device.
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) v_device[nv] = V[nv][k][j][i];
PrimToCons (v_device, u_device);
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) U[k][j][i][nv] = u_device[nv];
}
cudaFree(u_device);
cudaFree(v_device);
Before I would have used OpenACC and written something like this:
double* v_device = (double*)malloc(size);
double* u_device = (double*)malloc(size);
#pragma acc enter data create(u_device[:size],v[:device])
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
...
}
#pragma acc exit data delete(u_device[:size],v[:device])
Is there a way with OpenACC to avoid host memory allocation?
Another doubt I have regarding cudaMalloc is the possibility to put the routine inside the kernel, in order to make the arrays private:
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
.
.
.
cudaFree(u_device);
cudaFree(v_device);
}
Writing in this way I get the error:
182, Accelerator restriction: call to 'cudaMalloc' with no acc routine information
Is there a way with OpenACC to avoid host memory allocation?
You can use cudaMalloc, but for pure OpenACC, you'd use "acc_malloc" and "acc_free". For example: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/acc_malloc.c
Note the use the "deviceptr" clause which indicates that the pointer is a device pointer. Though here, you're wanting to privatize these arrays so you can keep the private.
I've never used a device pointer in a private clause, but just tried and it seems to work. Which make sense since all the compiler really needs is the size and type of the private array to make the private copies. In this case since it's on the gang loop, the compiler will attempt to put the private arrays in shared memory, assuming they aren't too big to fit. I'd recommend using the triplet notation for the array, i.e. "private(v_device[:NVAR],...) so the compiler will know the size.
Though I'm not sure there's much of an advantage to using device arrays here. The device memory you're allocating isn't going to be used taking up space on the device. Device memory is often much smaller than host memory, so if you do need to waste space, probably better this be on the host. Plus having to use acc_malloc or cudaMalloc limits portability of the code. Not that there isn't cases where using device only memory is beneficial, I just don't think it is for this case.
Note you can call "malloc" within device code, but it's not recommended. Malloc's get serialized causing performance issues, but also the default heap is relatively small which can lead to heap overflows. Granted, this can be increased by either calling cudaDeviceLimits or via the environment variable "NV_ACC_CUDA_HEAPSIZE".