c++ · optimization · cuda · unified-memory

Overcoming the copy overhead in CUDA


I want to parallelize an image operation on the GPU using CUDA, using a thread for each pixel (or group of pixels) of an image. The operation is quite simple: each pixel is multiplied by a value.

However, if I understand it correctly, in order to put the image on the GPU and have it processed in parallel, I have to copy it to unified memory or some other GPU-accessible memory, which is basically a double for loop like the one that would process the image on the CPU. I am wondering whether there is a more efficient way to copy an image (i.e. a 1D or 2D array) to the GPU that does not have an overhead such that the parallelization is useless.
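
For reference, the kernel I have in mind is something along these lines (just a sketch; the names and launch configuration are placeholders):

    __global__ void multiply(float *img, int n, float value)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
        if (i < n)
            img[i] *= value;
    }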


Solution

  • However, if I understand it correctly, in order to put the image on the GPU and have it processed in parallel, I have to copy it to unified memory or some other GPU-accessible memory

    You understand correctly.

    I am wondering whether there is a more efficient way to copy an image (i.e. a 1D or 2D array) to the GPU that does not have an overhead

    There isn't. Data in host system memory must pass over the PCIe bus to get to GPU memory. This is bounded by the PCIe bus bandwidth (roughly 12 GB/s achievable on a PCIe Gen3 x16 link) and also carries some fixed overhead, at least on the order of a few microseconds per transfer, so very small transfers fare even worse from a throughput (bytes/s) perspective.
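
    Note that the copy itself is not a double for loop that you write: a single cudaMemcpy call moves the whole image to the device in one bulk (DMA) transfer. A minimal sketch of the host-to-device path (error checking omitted; the image size is just illustrative):

        int n = 4096 * 4096;                       // illustrative image size
        size_t bytes = n * sizeof(float);

        float *h_img;
        cudaMallocHost((void **)&h_img, bytes);    // pinned host memory
        float *d_img;
        cudaMalloc((void **)&d_img, bytes);        // device memory

        // one call moves the whole image across the PCIe bus;
        // no per-pixel loop on the host
        cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);

    Pinned (page-locked) host memory via cudaMallocHost usually gets you closer to the bus's peak rate than ordinary pageable memory, but it does not remove the transfer itself.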

    such that the parallelization is useless.

    If the only operation you want to perform is to take an image and multiply each pixel by a value, and the image is not already on the GPU for some reason, nobody in their right mind would use a GPU for that (except maybe for learning purposes). You need to find more involved work for the GPU to do before the performance starts to become interesting.
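
    To put rough, illustrative numbers on it: a 4096 x 4096 single-channel float image is 64 MB, so at ~12 GB/s the host-to-device copy takes about 5.3 ms and the copy back another 5.3 ms. A GPU with, say, ~300 GB/s of device memory bandwidth reads and rewrites those pixels in well under 1 ms, so the multiply itself is nearly free: over 90% of the elapsed time goes to the PCIe transfers.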

    The operation is quite simple

    That's generally not a good indicator of a performance benefit from GPU acceleration: the less work the GPU does per byte transferred, the more the copies dominate the overall time.
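
    For completeness, here is roughly what the full round trip looks like (a minimal sketch with no error checking; the size, grid configuration, and scale factor are arbitrary):

        #include <cstdio>
        #include <cuda_runtime.h>

        // multiply each pixel by a constant: one thread per pixel
        __global__ void scale(float *img, int n, float factor)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                img[i] *= factor;
        }

        int main()
        {
            const int n = 4096 * 4096;                 // illustrative image size
            const size_t bytes = n * sizeof(float);

            float *h_img = (float *)malloc(bytes);     // host image
            for (int i = 0; i < n; i++) h_img[i] = 1.0f;

            float *d_img;
            cudaMalloc((void **)&d_img, bytes);        // device image

            cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);  // H->D transfer
            scale<<<(n + 255) / 256, 256>>>(d_img, n, 2.0f);          // the "simple" operation
            cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);  // D->H transfer

            printf("%f\n", h_img[0]);                  // expect 2.000000
            cudaFree(d_img);
            free(h_img);
            return 0;
        }

    Profiling this with Nsight Systems (or timing the two cudaMemcpy calls against the kernel yourself) makes the imbalance obvious: the transfers dominate, which is the point made above.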