c++, pointers, cuda, windows-subsystem-for-linux

Allocating CUDA Unified Memory by overloading operator new causes illegal access


I need a very specific structure.

The base is an array of interval objects. I am using the already-prepared interval_gpu class from the CUDA samples.

On top of that I have created a class named ivector (interval vector) that wraps this array; each ivector represents a box (or hyperbox).

So far so good: using CUDA Unified Memory and inheriting from the Managed class they provide, it works!

Here is where the issue starts: I need an array of ivectors for my experiments, and I cannot find a way to make it work.

It's a bit hard to provide all the code required for compilation because it's very specific.

Let's say our kernel is the one below, and that our ivector_gpu class has operator[] overloaded, such as:

#define DIMENSIONS 2

class Managed {
public:
  void *operator new(size_t len) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    cudaDeviceSynchronize();
    return ptr;
  }

  void operator delete(void *ptr) {
    cudaDeviceSynchronize();
    cudaFree(ptr);
  }
};

class ivector_gpu: public Managed {
 public:
  __host__ ivector_gpu();
  __device__ __host__ ivector_gpu(const ivector_gpu &iv);
  __host__ ivector_gpu(int N);
  __device__ __host__ interval_gpu<double>& operator[](int i);
  __device__ __host__ ivector_gpu& operator=(ivector_gpu &x);
  __device__ __host__ int size() const;
 private:
  interval_gpu<double> *ivector;
  int dims;
};

inline __host__ ivector_gpu::ivector_gpu(){
  dims = DIMENSIONS;
  ivector = new interval_gpu<double>(DIMENSIONS);
}

inline __host__ ivector_gpu::ivector_gpu(int N){
  dims = N;
  ivector = new interval_gpu<double>(dims);
}

inline __host__ ivector_gpu::ivector_gpu(const ivector_gpu &iv){
  ivector = iv.ivector;
  dims = iv.dims;
  cudaMallocManaged(&ivector, dims);
  memcpy(ivector, iv.ivector, dims);
}

inline __device__ __host__ ivector_gpu& ivector_gpu::operator=(ivector_gpu &x){
  for(int i=0; i<size(); i++){
    ivector[i]=x[i];
  }
  return *this;
}

inline __device__ __host__ interval_gpu<double>& ivector_gpu::operator[](int i) {
  return ivector[i];
}

^ I remind you that this works if I instantiate one object, but not when I want to create an array of ivector_gpus. Assume that the interval_gpu<T> class also functions as intended, as it is provided by NVIDIA.

My kernel is the following, and I want to access the 0th interval_gpu element of the 0th ivector_gpu element:

__global__ void test(interval_gpu<double> a, ivector_gpu *&c){
    interval_gpu<double> first = interval_gpu<double>::empty();
    c[0][0] = first;
}

My main is like this:

//create the array
ivector_gpu * v = new ivector_gpu[1];

//fill it with something
v[0][0] = interval_gpu<double>(0,10);
v[0][1] = interval_gpu<double>(5,10);

//let's print it for test purposes
std::cout << v[0][0].lower() << ' ' << v[0][0].upper()  << std::endl;
std::cout << v[0][1].lower() << ' ' << v[0][1].upper()  << std::endl;

// ^ so far so good, it compiles and works

//let's call the kernel
test<<<1,1>>>(t, v);  // t is an interval_gpu<double> defined earlier (not shown)
CHECKED_CALL(cudaGetLastError());
CHECKED_CALL(cudaDeviceSynchronize()); 

The kernel throws

interval.cu(89): ERROR: cudaDeviceSynchronize() returned an illegal memory access was encountered (err#700)

I assume I am doing something wrong with the pointers, or that another cudaMallocManaged pointer is needed, but I am completely burned out after trying to make this work for many hours. I cannot comprehend why one object works but an array of objects does not.

Some more info: I am trying to make it work on an RTX 3060 Ti:

GPU Device 0: "Ampere" with compute capability 8.6

Any help is greatly appreciated!


Solution

  • Adding overloads of operator new[] and operator delete[] to the Managed class should allow you to allocate and free arrays of these objects. Without the array forms, a new ivector_gpu[1] expression falls back to the global operator new[], which allocates ordinary host heap memory that the GPU cannot access; that is why a single object (which goes through the existing operator new overload) works while an array does not:

    class Managed {
    public:
      void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
      }
    
      void *operator new[](size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
      }
    
      void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
      }
    
      void operator delete[](void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
      }
    };
    

    The C++ mechanics of this are documented on cppreference.com in the "Class-specific overloads" sections of operator new / operator new[] and operator delete / operator delete[].
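
    As a quick sanity check, here is a minimal sketch (Box and fill are illustrative names, not from the question) showing that with these overloads both allocation forms route through cudaMallocManaged, so a kernel can access the array, assuming a platform where managed memory is supported:

    #include <iostream>

    // Any type deriving from Managed picks up the class-specific overloads.
    struct Box : public Managed {
      double lo, hi;
    };

    __global__ void fill(Box *b, int n) {
      for (int i = 0; i < n; ++i) { b[i].lo = i; b[i].hi = i + 1; }
    }

    int main() {
      Box *boxes = new Box[4];   // calls Managed::operator new[]
      fill<<<1, 1>>>(boxes, 4);
      cudaDeviceSynchronize();   // wait for the kernel before host access
      std::cout << boxes[2].lo << ' ' << boxes[2].hi << std::endl;  // 2 3
      delete[] boxes;            // calls Managed::operator delete[]
      return 0;
    }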

    That being said, Unified Memory has changed since CUDA 6, at least under Linux. For example, I do not think that the cudaDeviceSynchronize() calls still make sense. One can find the reasoning for them in the comments below the original Unified Memory in CUDA 6 blog post (where this Managed class comes from):

    Basically, the programming model assumes that any kernel launched can be accessing any managed memory attached to the “global” stream, even if that memory was allocated after the kernel was launched. This means that if you want to allocate managed memory and access on the CPU right away, you have to either make sure that all kernels have been synchronized OR you have to attach to the “host” stream when you allocate (i.e. do cudaMallocManaged(&ptr, size, cudaMemAttachHost)). The latter choice then requires that the data be attached to “global” or a specific stream if it needs to be accessed from the GPU.
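
    In code, the pattern described in that comment looks roughly like the following sketch (stream, N, grid, block, init_on_cpu and kernel are placeholders, not names from the question):

    float *data;
    // Attach to the "host" stream so the CPU can access the memory right
    // away, even while previously launched kernels are still running:
    cudaMallocManaged(&data, N * sizeof(float), cudaMemAttachHost);
    init_on_cpu(data, N);  // placeholder host-side initialization
    // Before GPU use, attach the memory to a specific stream (or to
    // cudaMemAttachGlobal to make it visible to all streams):
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    kernel<<<grid, block, 0, stream>>>(data, N);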

    Most of this does not hold true anymore for newer versions of CUDA combined with newer GPU architectures (under Linux). To quote Unified Memory for CUDA Beginners (a somewhat newer blog post):

    Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault. On Pascal and later GPUs, the CPU and the GPU can simultaneously access managed memory, since they can both handle page faults; however, it is up to the application developer to ensure there are no race conditions caused by simultaneous accesses.

    So, while one might still want to synchronize all kernels working on some memory before freeing it (in operator delete or operator delete[]), synchronizing the whole device (instead of specific streams) is very inflexible. One might want to leave the synchronization to the user to allow for more flexibility.

    Synchronization after allocation (in operator new and operator new[]) seems completely unnecessary nowadays.
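
    Putting those two points together, a leaner variant for Pascal-or-newer GPUs under Linux might look like this sketch, which leaves the pre-free synchronization to the caller:

    class Managed {
    public:
      // No synchronization after allocation: on Pascal+ under Linux the
      // host can access freshly allocated managed memory immediately.
      void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        return ptr;
      }

      void *operator new[](size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        return ptr;
      }

      // Freeing still requires that no kernel is using the memory; here
      // that is the caller's job (e.g. cudaStreamSynchronize on the
      // stream(s) that touched the allocation).
      void operator delete(void *ptr) { cudaFree(ptr); }
      void operator delete[](void *ptr) { cudaFree(ptr); }
    };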

    Alternatives

    For a modern C++ interface I would recommend using the memory resources provided by, e.g., Thrust or RMM instead. See thrust::universal_memory_resource and rmm::mr::managed_memory_resource. These should do the right thing with regard to synchronizing or not.

    Thrust has the advantage of being shipped with the CUDA Toolkit, but the documentation of the Thrust memory resources is lacking in comparison.
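
    For illustration, here is a sketch using thrust::universal_vector, a container that allocates through the same universal (managed) memory, assuming a recent Thrust as shipped with current CUDA Toolkits:

    #include <thrust/universal_vector.h>
    #include <iostream>

    __global__ void scale(double *p, int n) {
      for (int i = 0; i < n; ++i) p[i] *= 2.0;
    }

    int main() {
      // The vector's storage is managed memory, so both host and device
      // can dereference its elements (with appropriate synchronization).
      thrust::universal_vector<double> v(4, 1.0);
      scale<<<1, 1>>>(thrust::raw_pointer_cast(v.data()), 4);
      cudaDeviceSynchronize();         // synchronize before host access
      std::cout << v[0] << std::endl;  // prints 2
      return 0;
    }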

    Windows

    While CUDA managed memory is still hardly usable under Windows, the Windows Display Driver Model (WDDM) seems to have had a similar virtual-memory paging mechanism in place since Windows 10 that might allow oversubscribing device memory allocated via cudaMalloc() using system memory.