I have a single kernel that writes data to two output parameters (dev_out_1 and dev_out_2) using a single stream. I want to copy the data back from the device to the host in parallel. My requirement is to use a single stream and still copy both results back to the host in parallel.
How do you handle this kind of issue?
SomeCudaCall<<<25,34>>>(input, dev_out_1,dev_out_2);
cudaMemcpyAsync(toHere_1, dev_out_1, sizeof(int), cudaMemcpyDeviceToHost,0);
cudaMemcpyAsync(toHere_2, dev_out_2, sizeof(int), cudaMemcpyDeviceToHost,0);
"I wanted to copy back the data from the device to host in parallel"
That is not possible.
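NVIDIA GPUs can only use one DMA engine for device-to-host transfers (even when there is more than one DMA engine), and that DMA engine can only perform a single transfer at a time. So "parallel" copies in the same direction over the PCI Express bus are not possible. Independently of the hardware, operations issued to the same stream execute in issue order, so the two cudaMemcpyAsync calls in the question would run one after the other anyway.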
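Since the copies are going to be serialized either way, one thing that may help when the outputs are small is to place both of them in one contiguous device allocation and issue a single copy instead of two. The following is only a minimal sketch under that assumption, not the code from the question: the kernel fillOutputs, the buffer names dev_in, dev_out and host_out, and the input value are all hypothetical stand-ins. Pinned host memory (cudaMallocHost) is used so that cudaMemcpyAsync can actually run asynchronously with respect to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question:
// a single thread writes one int to each of the two outputs.
__global__ void fillOutputs(const int *input, int *out_1, int *out_2)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out_1 = *input + 1;
        *out_2 = *input + 2;
    }
}

int main()
{
    const int host_in = 40;      // illustrative input value
    int *dev_in = nullptr;
    int *dev_out = nullptr;      // one allocation holding both outputs
    int *host_out = nullptr;     // pinned host buffer for both results

    cudaMalloc((void **)&dev_in, sizeof(int));
    cudaMalloc((void **)&dev_out, 2 * sizeof(int));
    cudaMallocHost((void **)&host_out, 2 * sizeof(int));

    cudaMemcpy(dev_in, &host_in, sizeof(int), cudaMemcpyHostToDevice);

    // Both output pointers refer to adjacent slots of the same buffer.
    fillOutputs<<<25, 34>>>(dev_in, dev_out, dev_out + 1);

    // One device-to-host transfer in the same (default) stream moves both values.
    cudaMemcpyAsync(host_out, dev_out, 2 * sizeof(int),
                    cudaMemcpyDeviceToHost, 0);

    // Wait for the kernel and the copy before reading the results on the host.
    cudaError_t err = cudaStreamSynchronize(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%d %d\n", host_out[0], host_out[1]);

    cudaFreeHost(host_out);
    cudaFree(dev_out);
    cudaFree(dev_in);
    return 0;
}
```

This does not make the transfer parallel. It only replaces two serialized copies with one, which saves the per-call overhead of the second cudaMemcpyAsync. Whether that matters in practice depends on the transfer sizes.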