Tags: memory, cuda, unified-memory

Does CUDA unified memory solve data movement issues on newer GPUs?


How does CUDA unified memory handle data movement?

Let's say I have a program that looks like this:

cudaMallocManaged(&y, N * sizeof(float));  // y is accessible from host and device code

host_kernel(y);                            // host code touches y
device_kernel<<<grid, block>>>(y);         // device kernel touches y
cudaDeviceSynchronize();
host_kernel(y);
device_kernel<<<grid, block>>>(y);
cudaDeviceSynchronize();
host_kernel(y);
device_kernel<<<grid, block>>>(y);
cudaDeviceSynchronize();

Will I take a performance hit due to data movement between device and host kernels?

If this depends on the GPU, at what generation does it start to become efficient? Would Compute Capability 6+ (Pascal) suffice? https://developer.nvidia.com/blog/unified-memory-cuda-beginners/#what_happens_on_pascal_when_i_call_cudamallocmanaged

Do I need to prefetch the data on the host to reduce data movement overheads? https://developer.nvidia.com/blog/unified-memory-cuda-beginners/#what_happens_on_pascal_when_i_call_cudamallocmanaged

For better performance, should I come up with an algorithm that only allocates as much memory as would fit on the GPU and then allocates the next batch? https://stackoverflow.com/a/50683703/4647107


Solution

  • The conventional wisdom here is that you should prefetch y to the processor that is about to use it, at each step along the way.

    On Windows with any GPU, or on Linux with a Maxwell or older GPU, UM behaves such that the data is "automatically" prefetched in your example, and this is generally efficient.

    If you don't prefetch the data in the other cases (Linux, Pascal and newer), then the data movement may be demand-paged and inefficient.

    Will I take a performance hit due to data movement between device and host kernels?

    Yes. If you are on Linux with a Pascal or newer GPU and you don't do sensible prefetching, then you are relying on demand-paged movement of data, which, when done in bulk, is inefficient compared to a bulk copy such as via cudaMemcpy.

    Do I need to prefetch the data on the host to reduce data movement overheads?

    Yes. To avoid the efficiency loss of demand-paged data movement, you should generally prefetch the data, even on the newest GPUs, when on Linux; see the sketch at the end of this answer.

    For better performance should I come up with an algorithm to only allocate the amount of memory that would fit on the GPU and then allocate the next batch?

    I generally wouldn't recommend oversubscription, except as a last resort. If you understand how the data needs to be moved, it's generally better to move it yourself, e.g. via prefetch. Oversubscription doesn't help with this in any way.
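
    Here is a minimal sketch of that prefetching pattern. The kernel body, the host-side work, the problem size N, and the launch configuration below are placeholder assumptions; the relevant parts are the cudaDevAttrConcurrentManagedAccess query (which tells you whether you are in the demand-paging regime) and the cudaMemPrefetchAsync calls before each host and device phase:

        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void device_kernel(float *y, size_t n)
        {
            size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
            if (i < n) y[i] *= 2.0f;                       // placeholder device work
        }

        void host_kernel(float *y, size_t n)
        {
            for (size_t i = 0; i < n; i++) y[i] += 1.0f;   // placeholder host work
        }

        int main()
        {
            const size_t N = 1 << 20;
            int dev = 0;
            cudaGetDevice(&dev);

            // 1 on Linux with a Pascal (cc 6.x) or newer GPU: managed data migrates
            // on demand, so explicit prefetching matters. 0 on Windows or on Maxwell
            // and older, where the runtime migrates the data in bulk for you.
            int concurrent = 0;
            cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);

            float *y = nullptr;
            cudaMallocManaged(&y, N * sizeof(float));
            unsigned int blocks = (unsigned int)((N + 255) / 256);

            for (int iter = 0; iter < 3; iter++) {
                // Move y to the host in bulk before the host touches it.
                if (concurrent) cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId, 0);
                cudaDeviceSynchronize();
                host_kernel(y, N);

                // Move y to the GPU in bulk before the kernel, instead of paying
                // for page faults inside the kernel.
                if (concurrent) cudaMemPrefetchAsync(y, N * sizeof(float), dev, 0);
                device_kernel<<<blocks, 256>>>(y, N);
                cudaDeviceSynchronize();
            }

            cudaFree(y);
            return 0;
        }

    On a Windows or Maxwell-class system the attribute is 0 and the prefetch calls are simply skipped, since in that regime the runtime already migrates the managed data in bulk for you.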