c++ cuda unified-memory

Replacing cudaMemcpy2D with cudaMemPrefetchAsync


I'm trying to do an asynchronous memory transfer (Host to Device) of data residing in Unified Memory in the same way that cudaMemcpy2DAsync() offers, i.e., using a pitch value for the destination (Device) memory.

However, to my understanding:

My question:

How can I transfer my data, which (a) resides in Unified Memory, in (b) an asynchronous manner from the Host to the Device while, at the same time, making sure that the data is copied (c) with the right pitch value?

My current transfer using cudaMemcpy2D() looks something like this:

cudaMemcpy2D(
    dest_ptr, dest_pitch,          // dst address & pitch
    src_ptr, dim_x*sizeof(float),  // src address & pitch (tightly packed rows)
    dim_x*sizeof(float), dim_y,    // transfer width (in bytes) & height (in rows)
    cudaMemcpyHostToDevice );

(As you can see, the source rows are tightly packed, i.e., the source pitch equals the row width and there is no padding, while the pitch at the destination is dest_pitch -- maybe that helps?)

An additional hassle is that I do not allocate the data that needs to be transferred myself and so I cannot apply the pitch manually without creating an additional copy of the data (which would be problematic).

One idea I had was to simply copy the data row-wise. However, this would mean a very large number of very small data transfers, which sounds a bit horrible. In my example it would be 3,040,000 transfers of 304 bytes each, but maybe that's fine on Pascal if the transfers are spread asynchronously across multiple streams?

Any pointers/ideas would be appreciated!


Solution

  • The answer is that you'll need to make a copy. Asynchronous copies with unified memory do not support pitches, because unified memory does not support pitches. You can either copy the prefetched unified buffer on the device into a pitched memory region, or you can copy the unified buffer on the host into a pitched memory region and then asynchronously copy it. You cannot satisfy all three conditions (a, b, and c) simultaneously.
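
As a rough illustration of the second option above (repacking on the host into a pitched layout before the asynchronous copy), here is a minimal host-only C++ sketch. The helper name pack_rows_pitched and the use of std::vector as the staging buffer are assumptions made for this example and are not part of the original post. In a real program pitch_bytes would be the pitch returned by cudaMallocPitch, and the staging buffer would more likely be pinned host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post): copies dim_y tightly
// packed rows of dim_x floats into a staging buffer whose rows start every
// pitch_bytes bytes, mirroring the layout of the pitched device allocation.
std::vector<unsigned char> pack_rows_pitched(const float* src,
                                             std::size_t dim_x,
                                             std::size_t dim_y,
                                             std::size_t pitch_bytes)
{
    const std::size_t row_bytes = dim_x * sizeof(float);
    assert(pitch_bytes >= row_bytes);  // the pitch must cover a full row

    // Zero-initialised, so the padding bytes at the end of each row are 0.
    std::vector<unsigned char> staged(pitch_bytes * dim_y, 0);

    const unsigned char* src_bytes =
        reinterpret_cast<const unsigned char*>(src);
    for (std::size_t y = 0; y < dim_y; ++y) {
        // Source rows are contiguous; destination rows are pitch_bytes apart.
        std::memcpy(staged.data() + y * pitch_bytes,
                    src_bytes + y * row_bytes,
                    row_bytes);
    }
    return staged;
}
```

Because the staged buffer already has the destination's layout, it could then be sent to dest_ptr with a single cudaMemcpyAsync of pitch_bytes * dim_y bytes instead of millions of row-sized transfers. This is exactly the extra copy the answer says cannot be avoided.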