I'm trying to use CUDA to make a basic fragment shader, and I have found that actually executing the kernel takes over a second, which is unacceptable for a shader that I'm trying to run in real time. Using the synchronize method and commenting out parts of the kernel, I found that it's the memory accesses to the output array that are causing it to be so slow. I haven't really tried anything to solve the problem because I can't even fathom where to start. This is in PyCUDA, which I don't think really matters, but here's the kernel code:
__global__ void fragment_shader(int palette_lim, float *palette, float *input, float *output) {
    int fragment_idx = (3*gridDim.y*blockIdx.x)+(3*blockIdx.y);
    float min_dist = sqrtf(3);
    float color_dist;
    int best_c = 0;
    for (int c=0;c<palette_lim;c++) {
        color_dist = sqrtf(pow(input[fragment_idx]-palette[c*3],2)+pow(input[fragment_idx+1]-palette[c*3+1],2)+pow(input[fragment_idx+2]-palette[c*3+2],2));
        if (color_dist < min_dist) {
            min_dist = color_dist;
            best_c = c;
        }
    }
    //These are the lines that make it slow. If these lines get commented out, it runs in a time that would be acceptable
    output[fragment_idx] = palette[best_c*3];
    output[fragment_idx+1] = palette[best_c*3+1];
    output[fragment_idx+2] = palette[best_c*3+2];
}
EDIT: After playing around with it a bit more, I found that it also has to do with what's being assigned to the output array, because when I had it write some constants rather than something from the palette it also ran just fine; it just didn't do anything useful then.
First some remarks on your actual computation:

- sqrtf(x) < sqrtf(3): roots are expensive. Just compare x < 3.f
- sqrt(pow(x, 2)+...): don't take the root at all, and for that matter don't use pow just for squaring. Use hypotf for 2D or norm3df for 3D vectors (see the sketch below)
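For illustration, a minimal sketch of that comparison (the helper name and signature are mine, not from the answer); comparing squared distances avoids the root entirely:

__device__ bool closer_than(float3 a, float3 b, float threshold_sqr) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    // norm3df(dx, dy, dz) would give the true Euclidean distance in one call,
    // but for an "is it closer?" test the squared distance is equivalent and cheaper
    return dx * dx + dy * dy + dz * dz < threshold_sqr;
}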
Now let's analyze your memory accesses:
Let's look at fragment_idx = 3*gridDim.y*blockIdx.x+3*blockIdx.y: You're not taking threadIdx.x and threadIdx.y into account. This is your main problem: many threads act on the same input and output data. You likely want this: fragment_idx = 3 * (threadIdx.y * blockDim.x + threadIdx.x)
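If the image spans more than one block, the index also needs to fold in blockIdx. A sketch, assuming width and height parameters in pixels (not part of the original kernel) and a bounds check for partial blocks:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {                // guard threads outside the image
    int fragment_idx = 3 * (y * width + x);   // 3 floats per pixel, row-major
    // ... rest of the kernel body ...
}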
For the input color you load 3 floats. For starters, why do you reload them inside the loop when they don't depend on the loop iteration? I assume the compiler saves you from that access, but don't get into the habit of doing that.
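In code, that just means hoisting the loads out of the loop (sketch, using the corrected fragment_idx from above):

// Load the fragment's color once; it does not change across palette entries
float in_r = input[fragment_idx];
float in_g = input[fragment_idx + 1];
float in_b = input[fragment_idx + 2];
for (int c = 0; c < palette_lim; c++) {
    // ... compare (in_r, in_g, in_b) against palette color c ...
}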
Second, your access pattern isn't properly coalesced since a) these are 3 independent accesses and b) CUDA cannot coalesce accesses to float3 vectors even if you did it properly. Please read section 9.2.1 Coalesced Access to Global Memory of the Best Practices Guide. For better performance you have two options:

- Pad the colors to 4 floats per pixel (RGBA) and adapt fragment_idx so you can load the whole thing as a single float4 (see the sketch below)
- Stage the loads through shared memory so that consecutive threads read consecutive floats
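A sketch of the first option, assuming the input buffer has been padded to RGBA on the host side (cudaMalloc returns pointers aligned well enough for the float4 reinterpretation):

// Option 1 sketch: with RGBA padding, one coalesced 16-byte load fetches the whole pixel
const float4 *input4 = reinterpret_cast<const float4 *>(input);
int pixel_idx = threadIdx.y * blockDim.x + threadIdx.x;   // one float4 per pixel
float4 inputcolor = input4[pixel_idx];                    // .w (alpha) is unused here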
The palette reads have the same problem with the access of 3 floats. Plus, now every thread reads the same values since c doesn't depend on the thread index. At the very least, the access should go through the __ldg function to use the L1 cache. Preferably, you prefetch the palette into shared memory.
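The __ldg part is a one-line change per load (sketch; the shared-memory prefetch is shown in the second kernel further down):

// __ldg routes the load through the read-only cache; fine here because the
// kernel never writes to palette
float pal_r = __ldg(palette + 3 * c);
float pal_g = __ldg(palette + 3 * c + 1);
float pal_b = __ldg(palette + 3 * c + 2);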
The write access has the same issue with uncoalesced accesses. Plus, since best_c varies among threads, the read accesses to palette are effectively random. You already loaded those palette values inside your loop, so just save the best palette color in a local variable and reuse it to store the output at the end.
Two remarks:

- fragment_idx still ignores blockIdx, just like the original; adjust the index to match however you actually launch the kernel
- This is the simplest code to rectify the issues. It doesn't solve the issues with loading vector3 variables and it doesn't use shared memory; that requires more involved changes
__device__ float sqr_norm(float3 a, float3 b) {
    a.x -= b.x, a.y -= b.y, a.z -= b.z;
    return a.x * a.x + a.y * a.y + a.z * a.z;
}
__global__ void fragment_shader(int palette_lim,
        const float *palette, const float *input,
        float *output) {
    int fragment_idx = 3 * (threadIdx.y * blockDim.x + threadIdx.x);
    /* TODO: Switch to float4 for better memory access patterns */
    float3 inputcolor = make_float3(
          input[fragment_idx], input[fragment_idx + 1], input[fragment_idx + 2]);
    float min_dist_sqr = 3.f;
    /* The old code always used color index 0 if there was no fit */
    float3 best_color = make_float3(
          __ldg(palette), __ldg(palette + 1), __ldg(palette + 2));
    float best_dist = sqr_norm(best_color, inputcolor);
    for(int c = 1; c < palette_lim; c++) {
        /* TODO: Prefetch into shared memory */
        float3 color = make_float3(
              __ldg(palette + 3 * c), __ldg(palette + 3 * c + 1), __ldg(palette + 3 * c + 2));
        float dist = sqr_norm(color, inputcolor);
        /* Since we always used color 0 in the old code,
         * the min_dist is somewhat pointless */
        if(dist < min_dist_sqr && dist < best_dist) {
            best_color = color;
            best_dist = dist;
        }
    }
    output[fragment_idx] = best_color.x;
    output[fragment_idx + 1] = best_color.y;
    output[fragment_idx + 2] = best_color.z;
}
Here is a more extensive rewrite:

- Colors are padded to float4 (RGBA instead of RGB). The extra channel is ignored for the distance computation but it is propagated. Typically one tries to use the value for something, e.g. you could store the distance value in there
- The palette is prefetched into shared memory, using a double-buffer scheme to reduce the number of __syncthreads calls

__device__ float sqr_dist_rgb(float4 a, float4 b) {
    a.x -= b.x, a.y -= b.y, a.z -= b.z;
    return a.x * a.x + a.y * a.y + a.z * a.z;
}
__global__ void fragment_shader(int palette_lim,
        const float4 *palette, const float4 *input,
        float4 *output) {
    /* Call with dynamic shared memory:
     * 2 * sizeof(float4) * blockDim.x * blockDim.y */
    extern __shared__ float4 colorbuf[];
    const int buf_size = blockDim.x * blockDim.y;
    const int buf_idx = threadIdx.y * blockDim.x + threadIdx.x;
    const int fragment_idx = threadIdx.y * blockDim.x + threadIdx.x;
    const float4 inputcolor = input[fragment_idx];
    float4 best_color = __ldg(palette);
    const float min_dist_sqr = 3.f;
    float best_dist = sqr_dist_rgb(best_color, inputcolor);
    for(int cb = 0, b = 0; cb < palette_lim; b ^= 1, cb += buf_size) {
        /* We use a double buffer scheme to reduce the __syncthreads calls */
        float4* cur_buf = b ? colorbuf + buf_size : colorbuf;
        if(cb + buf_idx < palette_lim)
            cur_buf[buf_idx] = __ldg(palette + cb + buf_idx);
        __syncthreads();
        const int n = min(buf_size, palette_lim - cb);
        for(int c = 0; c < n; c++) {
            float4 color = cur_buf[c];
            float dist = sqr_dist_rgb(color, inputcolor);
            if(dist < min_dist_sqr && dist < best_dist) {
                best_color = color;
                best_dist = dist;
            }
        }
    }
    output[fragment_idx] = best_color;
}
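For completeness, a host-side launch sketch for this version (CUDA C++ shown; in PyCUDA the same block, grid and shared-memory byte count are passed when calling the kernel). The 16x16 block and the device pointer names are just placeholders:

// Example launch: one 16x16 block, matching the thread-only index above
dim3 block(16, 16);
dim3 grid(1, 1);
// Two float4 buffers of blockDim.x * blockDim.y entries for the double buffering
size_t smem_bytes = 2 * sizeof(float4) * block.x * block.y;
fragment_shader<<<grid, block, smem_bytes>>>(palette_lim, d_palette, d_input, d_output);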
For larger palettes, a brute-force search like this is suboptimal. Spatial index structures can do the same lookup faster; the classic structure for this would be a KD tree. If you search for that, you will find some CUDA implementations that might be worth checking out.