I'm trying to do an asynchronous memory transfer (Host to Device) of data residing in Unified Memory in the same way that cudaMemcpy2DAsync() does, i.e., using a pitch value for the destination (Device) memory.
However, to my understanding:
I cannot use cudaMemcpy2DAsync() since it requires pinned memory, and a Unified Memory block allocated using cudaMallocManaged() is not pinned by default and cannot be pinned manually.
I cannot use cudaMemPrefetchAsync() because it only copies a contiguous chunk of memory, whereas cudaMemcpy2DAsync() includes additional options to pad the rows of my data according to a given pitch value.
My question:
How can I transfer my data, which (a) resides in Unified Memory, in (b) an asynchronous manner from the Host to the Device while, at the same time, making sure that the data is copied (c) with the right pitch value?
My current transfer using cudaMemcpy2D() looks something like this:
cudaMemcpy2D(
    dest_ptr, dest_pitch,            // dst address & pitch
    src_ptr, dim_x * sizeof(float),  // src address & pitch
    dim_x * sizeof(float), dim_y,    // transfer width (bytes) & height (rows)
    cudaMemcpyHostToDevice);
(As you can see, the source rows are tightly packed, i.e., the source pitch equals the row width and there is no padding, while the pitch at the destination is dest_pitch. Maybe that helps?)
An additional hassle is that I do not allocate the data that needs to be transferred myself and so I cannot apply the pitch manually without creating an additional copy of the data (which would be problematic).
One idea I had was to simply copy the data row-wise. However, this would mean a very large number of very small data transfers, which sounds a bit horrible. In my example it would be 3,040,000 transfers of 304 bytes each, but maybe that's fine on Pascal if the transfers are spread asynchronously across multiple streams?
Any pointers/ideas would be appreciated!
The answer is that you'll need to make a copy. Asynchronous copies with unified memory do not support pitches, because unified memory does not support pitches. You can either copy the prefetched unified buffer on the device into a pitched memory region, or you can copy the unified buffer on the host into a pitched memory region and then asynchronously copy it. You cannot satisfy all three conditions (a, b, and c) simultaneously.
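The first option (prefetch the unified buffer, then copy it on the device into a pitched region) could look roughly like the sketch below. It is only an illustration under a few assumptions: the names pack_rows, prefetch_to_device and demo_mismatches are made up for this example, the managed buffer allocated here stands in for the one you receive from elsewhere, and the prefetch is treated as an optional hint whose failure is ignored. The extra device-side copy the answer describes is still there; it just runs asynchronously in the stream.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Copies tightly packed rows from src into a pitched destination.
// dst_pitch is in bytes, as returned by cudaMallocPitch.
__global__ void pack_rows(const float* src, char* dst, size_t dst_pitch,
                          size_t dim_x, size_t dim_y)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx >= dim_x * dim_y) return;
    size_t y = idx / dim_x;  // row index
    size_t x = idx % dim_x;  // column index within the row
    float* row = reinterpret_cast<float*>(dst + y * dst_pitch);
    row[x] = src[idx];
}

// Hypothetical helper: asks the driver to migrate the managed range to `dev`.
static void prefetch_to_device(const void* p, size_t bytes, int dev, cudaStream_t s)
{
#if CUDART_VERSION >= 13000
    // Newer toolkits take a cudaMemLocation plus flags.
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;
    loc.id = dev;
    cudaMemPrefetchAsync(p, bytes, loc, 0, s);
#else
    cudaMemPrefetchAsync(p, bytes, dev, s);
#endif
    cudaGetLastError();  // assumption: prefetch is only a hint, so clear any error
}

// Hypothetical demo: returns the number of mismatching elements, or -1 on error.
int demo_mismatches(size_t dim_x, size_t dim_y)
{
    const size_t n = dim_x * dim_y;
    float* src = nullptr;   // stands in for the externally allocated managed buffer
    char* dst = nullptr;    // pitched device buffer
    size_t dst_pitch = 0;
    cudaStream_t stream;

    if (cudaMallocManaged(&src, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocPitch(&dst, &dst_pitch, dim_x * sizeof(float), dim_y) != cudaSuccess) return -1;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;

    for (size_t i = 0; i < n; ++i) src[i] = static_cast<float>(i);

    int dev = 0;
    cudaGetDevice(&dev);
    prefetch_to_device(src, n * sizeof(float), dev, stream);

    // One thread per element; a 1D grid avoids the 65535 limit on gridDim.y
    // when there are millions of rows.
    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    pack_rows<<<blocks, threads, 0, stream>>>(src, dst, dst_pitch, dim_x, dim_y);

    // Everything above is asynchronous with respect to the host; wait here
    // only so that the result can be checked.
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1;

    std::vector<float> host(n);
    if (cudaMemcpy2D(host.data(), dim_x * sizeof(float), dst, dst_pitch,
                     dim_x * sizeof(float), dim_y,
                     cudaMemcpyDeviceToHost) != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (host[i] != static_cast<float>(i)) ++mismatches;

    cudaStreamDestroy(stream);
    cudaFree(dst);
    cudaFree(src);
    return mismatches;
}
```

Calling demo_mismatches(76, 1000) exercises the same row width as in the question (76 floats = 304 bytes) on a smaller number of rows. In real code the synchronize and the read-back would be dropped, and later kernels in the same stream would simply consume the pitched buffer.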