Tags: opencl, gpgpu, amd-gpu, opencl-c, memory-bandwidth

Can't seem to achieve anywhere near my GPU global memory bandwidth in OpenCL


Using OpenCL on my AMD GPU, I've only been able to achieve about 4% (15 GB/sec) of the global-memory bandwidth reported by clpeak (375 GB/sec).

Before resigning myself to this, I want to make sure I'm not guilty of some newbie mistake (e.g., I learned the hard way that the CL_MEM_SVM_FINE_GRAIN_BUFFER flag leads to a 40x slowdown).

To put the problem in context: I'm relatively new to OpenCL and have been trying to perform a certain calculation on my GPU. This calculation involves a number of unavoidable and essentially random global GPU memory reads/writes in my inner loop. Unsurprisingly, these account for the vast majority of kernel time. I know that global access is suboptimal, but let's just assume it really is unavoidable here. [To clarify, by "global" I mean memory on the GPU. None of the performance I'm referring to involves data transfers between host and GPU.]

I was able to reproduce the performance issue with a simple toy kernel, which still does no better than 20 GB/sec.

// Each work-item chases bsize random indices through loc and sums the
// corresponding bytes from val, forcing dependent random global reads.
kernel void testkernel(__global const ulong *loc, __global uchar *res,
                       __global const uchar *val, __private ulong bsize)
{
    uint n = get_global_id(0);   // one work-item per result entry
    ulong i = n * bsize;         // start of this work-item's slice of loc
    ulong j = i + bsize;         // end of the slice (exclusive)
    uchar z = 0;
    for (ulong k = i; k < j; ++k)
    {
        ulong l = loc[k];        // first random read  (8 bytes)
        ulong l2 = loc[l];       // dependent random read (8 bytes)
        z += val[l2];            // third random read  (1 byte)
    }
    res[n] = z;                  // keep the loop from being optimized away
}

bsize is the inner-loop trip count, which I set to 1000.

If nkern denotes the number of kernel instances I enqueue, then loc and val each have nkern*bsize elements, and res has nkern elements.

res is just a result array (one per kernel instance) to prevent the loop from being optimized away.

loc is what emulates random access across my full global memory. I populated it with random values in [0, asize), where asize = nkern*bsize is the array length, so every entry is a valid index back into loc.

val is just another array to add some flavor. I populated it with random values [0,256).

I did all the randomization, etc., in host memory, then copied the arrays to the GPU with clEnqueueSVMMemcpy before enqueueing the kernel instances. The GPU memory was allocated using clSVMAlloc with just the CL_MEM_READ_WRITE flag. A sketch of this host-side setup is shown below.
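
For completeness, the host-side setup looks roughly like this sketch. It is only a sketch: the helper name run_toy, the use of rand() for the randomization, and a single NDRange enqueue covering all nkern work-items are assumptions of the example, and error checking plus the usual platform/context/queue/program boilerplate are omitted.

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdlib.h>

void run_toy(cl_context ctx, cl_command_queue queue, cl_kernel kern,
             size_t nkern, cl_ulong bsize)
{
    size_t asize = nkern * (size_t)bsize;

    /* Coarse-grained SVM allocations: CL_MEM_READ_WRITE only, no fine-grain flag. */
    cl_ulong *loc = clSVMAlloc(ctx, CL_MEM_READ_WRITE, asize * sizeof(cl_ulong), 0);
    cl_uchar *val = clSVMAlloc(ctx, CL_MEM_READ_WRITE, asize * sizeof(cl_uchar), 0);
    cl_uchar *res = clSVMAlloc(ctx, CL_MEM_READ_WRITE, nkern * sizeof(cl_uchar), 0);

    /* Randomize in host memory first ... */
    cl_ulong *h_loc = malloc(asize * sizeof(cl_ulong));
    cl_uchar *h_val = malloc(asize * sizeof(cl_uchar));
    for (size_t k = 0; k < asize; ++k) {
        h_loc[k] = (cl_ulong)rand() % asize;   /* random index in [0, asize) */
        h_val[k] = (cl_uchar)(rand() % 256);   /* random byte in [0, 256)    */
    }

    /* ... then copy the arrays into the SVM buffers. */
    clEnqueueSVMMemcpy(queue, CL_TRUE, loc, h_loc, asize * sizeof(cl_ulong), 0, NULL, NULL);
    clEnqueueSVMMemcpy(queue, CL_TRUE, val, h_val, asize * sizeof(cl_uchar), 0, NULL, NULL);

    /* Bind the kernel arguments and launch all nkern work-items in one enqueue. */
    clSetKernelArgSVMPointer(kern, 0, loc);
    clSetKernelArgSVMPointer(kern, 1, res);
    clSetKernelArgSVMPointer(kern, 2, val);
    clSetKernelArg(kern, 3, sizeof(cl_ulong), &bsize);
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &nkern, NULL, 0, NULL, NULL);
    clFinish(queue);

    free(h_loc);
    free(h_val);
    clSVMFree(ctx, loc);
    clSVMFree(ctx, val);
    clSVMFree(ctx, res);
}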

I ran this toy kernel with various nkern values (10K, 100K, and 700K), but always get around 20 GB/sec. [In computing this I count each inner-loop iteration as 17 bytes of traffic: two ulongs and one uchar. I also ran the entire round, enqueueing and waiting for all nkern instances, 10 times to get decent statistics. At nkern = 700K I'm pretty close to the usable memory of my GPU, so this means I'm reading my entire GPU memory 10x. The arithmetic is sketched below.]
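
The effective-bandwidth figure is computed roughly as in this sketch, where elapsed_s is assumed to be the measured wall-clock time for all 10 rounds:

/* 17 bytes per inner-loop iteration (two 8-byte ulong reads plus one uchar read),
 * summed over all work-items and rounds, divided by the measured time. */
double effective_gb_per_sec(size_t nkern, size_t bsize, int rounds, double elapsed_s)
{
    double bytes = (double)nkern * (double)bsize * 17.0 * (double)rounds;
    return bytes / elapsed_s / 1e9;   /* decimal GB/sec */
}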

For reference, my GPU is an AMD RX 5700 XT (a gfx1010 device) with 8 GB of global memory arranged in 32 banks and running at around 2 GHz. clpeak reports around 375 GB/sec global bandwidth for floatn (regardless of n). AMD reports 2304 stream processors (arranged in 20 compute units). I'm using OpenCL 2.0 (the latest my card supports) and am running on Linux with the official AMD ROCm driver, which is the only one I could get working.

I'm perfectly willing to accept that I'm constrained by the number of memory transactions rather than by raw memory bandwidth, and that 15 GB/sec is the best I'll get. However, I'd like to confirm that this actually is the case before giving up on a potential 25x improvement.

Any insights or suggestions would be greatly appreciated!


Solution

  • It's been a while since I've done any OpenCL programming, and I never got particularly good at optimizing memory-bound applications. I have also seen performance of the same order of magnitude, so I don't think you're doing it wrong or that there's something inherently wrong with your setup.

    I put my results down to the following argument, though I haven't had it confirmed. When you request a single ulong at a random location in a buffer, GPUs, like CPUs, don't fetch just those 8 bytes; they fill a whole cache line. On modern x86 CPUs that's 64 bytes, and the hardware may read ahead even more than that. I don't know what the line size is on your GPU. The quoted bandwidth counts all of the data transferred, not just the few bytes you actually need. I don't think you'll be able to get anywhere near a 20x performance improvement unless you experiment with bigger data requests; see the sketch at the end of this answer.

    I hope this lays some of your concerns about your program and setup to rest.
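
    To make "bigger data requests" concrete, here is a minimal sketch of a variant of the toy kernel that fetches a contiguous ulong4 (32 bytes) per random index instead of a lone ulong. The kernel name testkernel_vec, the val4 argument, and the assumption that the host fills loc with indices in [0, asize/4) are inventions of this sketch, not something from the question.

// Illustrative variant only: each random lookup fetches a contiguous
// ulong4 (32 bytes) with vload4, so more of the cache line brought in
// by each fetch is actually consumed. Assumes the host fills loc with
// indices in [0, asize/4) so the vector load stays in bounds.
kernel void testkernel_vec(__global const ulong *loc, __global uchar *res,
                           __global const ulong *val4, __private ulong bsize)
{
    uint n = get_global_id(0);
    ulong i = n * bsize;
    ulong j = i + bsize;
    ulong acc = 0;
    for (ulong k = i; k < j; ++k)
    {
        ulong4 v = vload4(loc[k], val4);   // one 32-byte contiguous read per random index
        acc += v.s0 + v.s1 + v.s2 + v.s3;  // use everything that was fetched
    }
    res[n] = (uchar)acc;                   // keep the loop from being optimized away
}

    Comparing the GB/sec this variant reports against the scalar kernel should give a rough idea of how much of the gap is simply unused cache-line traffic.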