I'm trying to assess performance with OpenCL on AMD. I have a kernel for the Hough circle transform; it contains two #pragma unroll statements, but when I run the kernel they do not produce any speedup.
__kernel void hough_circle(read_only image2d_t imageIn,
                           __global int* in,
                           const int w_hough,
                           __global int* circle)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);
    uint4 pixel;
    int x0 = 0, y0 = 0;

    pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));
    if (pixel.x == 255)
    {
        #pragma unroll 20
        for (int r = 90; r < 110; r += 1)
        {
            // int r = 190;
            #pragma unroll 360
            for (int theta = 0; theta < 360; theta++)
            {
                x0 = (int) round(gid0 - r * cos((float) radians((float) theta)));
                y0 = (int) round(gid1 - r * sin((float) radians((float) theta)));
                // if ((x0 > 0) && (x0 < get_global_size(0)) && (y0 > 0) && (y0 < get_global_size(1)))
                //     in[w_hough * y0 + x0]++;
            }
        }
    }
}
Does #pragma unroll work with AMD's OpenCL compiler? Any help is appreciated.
Unrolling does work with AMD.
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
That tool includes KernelAnalyzer, which lets you see the actual output of AMD's compiler. I used it to verify that the unrolling does produce a different kernel.
However, unrolling loops does not necessarily give you any speedup. It only saves on the jump instructions, at the expense of program size, and on a GPU you are usually bound by memory latency anyway.
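For illustration, this is roughly the transformation the compiler performs; the toy loop below is hypothetical and not taken from the kernel above:

    // Original loop: one compare-and-jump per iteration.
    int acc = 0;
    for (int i = 0; i < 4; i++)
        acc += buf[i];

    // After #pragma unroll 4 the body is replicated and the jumps disappear,
    // at the cost of a larger program.
    int acc2 = buf[0] + buf[1] + buf[2] + buf[3];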
In your case the bottleneck is probably the sin/cos functions; those are extremely slow on AMD hardware (and on other GPUs too). You should use native_sin and native_cos. They are not as precise and do not support as wide an input range as the standard versions, which is why the compiler does not use them by default, but in most cases they are accurate enough. Incidentally, the precision of the native_ functions is the same as DirectX requires of its sin and cos shader instructions.
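As a sketch of what that change could look like in the inner loop of your kernel (same variable names as in the question; the accumulator update is left commented out, as in the original):

    #pragma unroll 360
    for (int theta = 0; theta < 360; theta++)
    {
        // radians() is just a constant multiply; only the trig evaluation
        // goes through the fast native path.
        float t = radians((float) theta);
        x0 = (int) round(gid0 - r * native_cos(t));
        y0 = (int) round(gid1 - r * native_sin(t));
        // in[w_hough * y0 + x0]++;
    }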