memory-management cuda unified-memory

How to apply Unified Memory to existing aligned host memory


I'm involved in an effort to integrate CUDA into some existing software. The software is pseudo real-time, so it has a memory-manager library that hands out pointers from a single large allocation made up front. CUDA's Unified Memory is attractive to us: in theory we could make this large memory chunk managed, keep the existing CPU code working unchanged, and add GPU kernels with very few changes to the existing data I/O stream.

Parts of our existing CPU processing code require memory to meet certain alignment constraints. cudaMallocManaged() does not let me specify an alignment, and having to copy between "managed" and strictly-CPU buffers for those sections would largely defeat the purpose of UM. Is there a known way to address this issue that I'm missing?

I found this answer on Stack Overflow that seems to solve it in theory, but I've been unable to get it to work. Using CUDA 9.1 on a Tesla M40 (24 GB):

#include <stdio.h>
#include <malloc.h>
#include <cuda_runtime.h>

#define USE_HOST_REGISTER 1

int main (int argc, char **argv)
{
   int num_float = 10;
   int num_bytes = num_float * sizeof(float);

   float *f_data = NULL;

   #if (USE_HOST_REGISTER > 0)
   printf(
      "%s: Using memalign + cudaHostRegister..\n",
       argv[0]);

   /* Pin an existing 32-byte-aligned host allocation. */
   f_data = (float *) memalign(32, num_bytes);

   cudaHostRegister(
      (void *) f_data,
      num_bytes,
      cudaHostRegisterDefault);
   #else
   printf(
      "%s: Using cudaMallocManaged..\n",
       argv[0]);

   /* Allocate managed (unified) memory directly. */
   cudaMallocManaged(
      (void **) &f_data,
      num_bytes);
   #endif

   struct cudaPointerAttributes att;
   cudaPointerGetAttributes(
      &att,
      f_data);

   printf(
      "%s: ptr is managed: %i\n",
       argv[0],
       att.isManaged);
   fflush(stdout);

   return 0;
}

When using memalign() + cudaHostRegister() (USE_HOST_REGISTER == 1), the last print statement prints 0. Device accesses via kernel launches in my larger test programs unsurprisingly report illegal accesses.

When using cudaMallocManaged() (USE_HOST_REGISTER == 0), the last print statement prints 1 as expected.

edit: cudaHostRegister() and cudaMallocManaged() do return success for me. I left the error checking out of the sample I shared, but I did check them during my initial integration work. I've just re-added the checks, and both calls still return cudaSuccess.

Thanks for your insights and suggestions.


Solution

  • There is no method currently available in CUDA to take an existing host memory allocation and convert it into a managed memory allocation. cudaHostRegister() page-locks (pins) the host memory, but it does not make it managed, which is why the attributes query reports isManaged == 0.