openclpragmaamd-processorloop-unrolling

Is there a way to unroll loops in an AMD OpenCL kernel with the compiler?


I'm trying to assess the performance differences between OpenCL for AMD and Nvidia GPUs. I have a kernel which performs matrix-vector multiplication. I'm running the kernel on two different systems at the moments, my laptop which has an NVidia GT525m with Ubuntu 12.04 and CUDA 4.0 (which contains the OpenCL libraries and headers) and the other is a desktop with an AMD Radeon HD7970 again with Ubuntu 12.04 and the latest Catalyst drivers.

In the kernel I have two #pragma unroll statements which produce a large speed-up for the Nvidia OpenCL implementation (~6x). However the AMD OpenCL version does not produce any speedup. Looking at the kernel with the AMD APP kernel analyzer gives the error that the unroll is not used because the trip count is not known. So my question is, does #pragma unroll work with AMD OpenCL or is there an alternative (perhaps a compiler flag that i'm unaware of). I've included the kernel below

__kernel void mvKernel(__global float* a, const __global float* x, __global float* y, int m, int n)
{
    float sum = 0.0f;
    __global float* A;
    int i;
    int j = 0;
    int indx = get_global_id(0);
    __local float xs[12000];
#pragma unroll 
    for(i = get_local_id(0); i < n; i+= get_local_size(0)) {
        xs[i] = x[i];
    } 
    barrier(CLK_LOCAL_MEM_FENCE);
    A = &a[indx];
#pragma unroll 256
    for(i = 0; i < n; i++) {
        sum += xs[i] * A[j];
        j += m;
    }
    y[indx] = sum;
}

This same kernel produces correct results in both implementations but the #pragma unroll commands don't do anything for the AMD (checked by commenting them out).


Solution

  • It's not documented, but it should actually work with #pragma unroll. Can you check the compiler log to see if the unroll is applied? I'm not sure if the kernel analyzer uses the same compiler as the OpenCL runtime, you might want to check.

    Otherwise, if you know that n comes in chunks of 256, you can unroll manually by having one loop over blocks of 256 elements and another one inside with a fixed size of 256, which might be easier to unroll. This will surely solve the problem that the trip count is not known statically.

    However, keep in mind unrolling a loop is usually not that big of a win anyway, as you don't have many registers to cache your computation. The increased register pressure from the loop unrolling might lead to register spilling, which is even slower. You should check how fast the kernel actually is on the AMD card. A newer NVIDIA OpenCL compiler might also not benefit any more from the unroll pragma.