Tags: optimization, cuda, nvcc

Trying to understand kernel optimization in CUDA. These changes had a negligible effect.


I have a CUDA Mandelbrot kernel; this version takes its parameters in the standard way:

__global__ void mandelbrot_worker(uint32_t* count_arr, uint32_t w, uint32_t h,
                                  const uint32_t max_count, const float xmin,
                                  const float xmax, const float ymin, const float ymax) {
    // Calculate global thread coordinates in the 2D Mandelbrot set grid
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // x-coordinate
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // y-coordinate

    // if (i >= h || j >= w) {
    //     printf("ERROR: OUT OF BOUNDS");
    //     return;
    // } // Bounds check

    // Map pixel coordinate (i, j) to complex plane coordinates (x, y)
    float x0 = xmin + (xmax - xmin) * j / w;
    float y0 = ymin + (ymax - ymin) * i / h;

Commenting out the if statement that checks for an out-of-bounds thread changed nothing. In this case, passing in

w = 3840*8
h = 2160*8

the condition is never executed.

But I assumed that removing the if statement would change the timings at least a bit. It might have, but obviously the iteration loop in the Mandelbrot calculation is the biggest part.
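For anyone reproducing this, a minimal way to time just the kernel (a sketch; the block/grid shape and the d_count name are assumptions, since the original launch code is not shown) is with CUDA events:

// Sketch: timing only the kernel with CUDA events.
// Block/grid shapes and the d_count pointer name are assumptions.
dim3 block(32, 8);
dim3 grid(w / block.x, h / block.y);   // exact fit: w and h are multiples of 32 and 8

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mandelbrot_worker<<<grid, block>>>(d_count, w, h, 64, xmin, xmax, ymin, ymax);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

Timing this way excludes the copy of count_arr back to the host, which can otherwise hide small kernel-side differences.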

I also thought that computing x0 and y0 from the bounds was inefficient, so I passed in the per-pixel step instead:

__global__ void mandelbrot_worker2(uint32_t* count_arr, uint32_t w, uint32_t h,
                                  const uint32_t max_count, const float xmin,
                                  const float dx_per_pixel, const float ymin, const float dy_per_pixel) {
    // Calculate global thread coordinates in the 2D Mandelbrot set grid
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // x-coordinate
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // y-coordinate

    // Map pixel coordinate (i, j) to complex plane coordinates (x, y)
    float x0 = xmin + dx_per_pixel * j ;
    float y0 = ymin + dy_per_pixel * i ;

The maximum iteration count is 64, so perhaps the setup is just a small percentage of the work. Lastly, it seems like optimization is always on for CUDA? The -O3 flag affects the host side a lot, but does not seem to impact device code speed at all.

So the questions are:

  1. Do floating-point expressions get rearranged in CUDA, unlike in C?
  2. Does pulling the constant computation out of the kernel do anything? Forget this particular instance, where the loop takes most of the time.
  3. Are there any further compiler options that will generate better code?
    We were just using nvcc -O3 mandel.cu

Last, does this code write sequentially to memory, does the hardware try to schedule it that way, or is the order effectively random since the threads do what they want? One idea that came to mind is to have each thread write values to 8 registers and then write them all out at once.


Solution

  • the condition is never executed.

    This depends on the kernel configuration (i.e. the number of threads and blocks in each dimension). Generally, people use a thread count that is divisible by the warp size, which is 32 (so far on all devices). Since w and h are divisible by 32, there is no problem. Also note that conditions are generally cheap unless some threads are made inactive by predication. Last but not least, the main computing loop should take most of the time anyway, so you should not worry about this part.
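    For illustration, a launch configuration along these lines (a sketch; the block shape and the d_count name are assumptions, since the launch code was not shown in the question) covers the image exactly, so the commented-out bounds check can indeed never trigger:

    // Sketch of a launch configuration (assumed, not taken from the question).
    // w = 3840*8 and h = 2160*8 are multiples of 32 and 8, so the ceil-division
    // is exact and no thread falls outside the image.
    dim3 block(32, 8);                         // x-dimension is one full warp
    dim3 grid((w + block.x - 1) / block.x,     // == w / 32 here
              (h + block.y - 1) / block.y);    // == h / 8 here
    mandelbrot_worker<<<grid, block>>>(d_count, w, h, 64, xmin, xmax, ymin, ymax);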

    Note that the number of iterations depends on the location in the Mandelbrot set. This has huge implications: for a significant part of the set, z neither diverges nor settles within a reasonable number of iterations, so those pixels always run to the iteration limit. In some regions the iteration count is noisy from pixel to pixel, and warp divergence becomes a problem. Remapping work to threads is the key to increasing the number of active threads. You need to profile your code to see which parts are efficient and which are not (Nsight is really great for that).

    Do floating-point expressions get rearranged in CUDA, unlike in C?

    No, unless you manually request it. The same is true in C (at least with mainstream compilers): -ffast-math on Clang and GCC enables this reordering, along with many other optimisations that assume there are no NaN or Inf values (which can be dangerous depending on your target application). IEEE-754 compliance is AFAIK guaranteed by NVCC by default.

    Does pulling the constant computation out of the kernel do anything? Forget this particular instance, where the loop takes most of the time.

    I am not sure I understand, but using compile-time constants for w and h can significantly improve the performance of the division, for example. Using constants is not always faster, though: it depends on the actual computation. The benefit for multiplications and additions is small, if any (except in some very specific cases). For the iteration count, a compile-time constant helps the compiler fully unroll the loop (when the count is reasonably small).
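    As an illustration of that last point (a sketch: the escape-time loop below is the textbook formulation, not the original code, and the MAX_COUNT template parameter is hypothetical), making the iteration limit a compile-time constant lets the compiler fully unroll the loop:

    // Sketch: compile-time iteration limit so the loop can be fully unrolled.
    // MAX_COUNT is a hypothetical template parameter; the loop body is the
    // textbook escape-time iteration, not the original kernel's code.
    template <uint32_t MAX_COUNT>
    __global__ void mandelbrot_unrolled(uint32_t* count_arr, uint32_t w, uint32_t h,
                                        const float xmin, const float dx_per_pixel,
                                        const float ymin, const float dy_per_pixel) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;

        float x0 = xmin + dx_per_pixel * j;
        float y0 = ymin + dy_per_pixel * i;

        float x = 0.0f, y = 0.0f;
        uint32_t count = 0;
        #pragma unroll
        for (uint32_t k = 0; k < MAX_COUNT; ++k) {
            if (x * x + y * y <= 4.0f) {        // still inside the escape radius
                float xt = x * x - y * y + x0;
                y = 2.0f * x * y + y0;
                x = xt;
                ++count;
            }
        }
        count_arr[(size_t)i * w + j] = count;
    }

    // Launched as: mandelbrot_unrolled<64><<<grid, block>>>(...)

    Note that this fully unrolled form runs all MAX_COUNT iterations for every thread (escaped pixels simply stop updating), trading extra work for less divergence, so whether it is faster has to be measured.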

    Are there any further compiler options that will generate better code? We were just using nvcc -O3 mandel.cu

    Fast-math can help. That being said, you can optimize the code yourself so that it is not even needed. The idea is to use FMA operations as much as possible instead of separate multiplications and additions.
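    For example (a sketch of the standard z = z² + c update, since the original loop body is not shown), the update can be written with explicit FMAs; nvcc usually contracts mul+add pairs into FMAs on its own (--fmad defaults to true), but spelling it out makes the intent explicit:

    // Sketch: the z = z^2 + c step with explicit FMAs (device code).
    // x, y hold the current z; x0, y0 are the pixel's c, as in the kernels above.
    float x_new = fmaf(x, x, fmaf(-y, y, x0));   // x*x + (-y*y + x0)
    float y_new = fmaf(2.0f * x, y, y0);         // 2*x*y + y0
    x = x_new;
    y = y_new;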

    does this code write sequentially to memory

    Threads are executed by warps (32 threads), and a warp performs packed SIMD-like loads/stores (i.e. contiguous accesses) when its memory accesses are coalesced. This is critical for performance. Warp scheduling is hardware-dependent. The block size can sometimes impact performance.

    What matters for performance is to do coalesced loads/stores, if possible with all threads of the warp active (otherwise bandwidth is simply wasted). Caching can also help, but here the output is so big that you cannot benefit from it.
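    Concretely (a sketch: the store line was not shown in the original kernel, and iterate() below is a hypothetical helper), the usual row-major store is already coalesced, and the "8 registers, then one write" idea from the question essentially becomes a vectorized store:

    // (a) Plain row-major store: thread (i, j) writes element i*w + j. Within a
    //     warp, j = threadIdx.x is consecutive, so the 32 stores fall in one
    //     contiguous 128-byte segment and are coalesced.
    count_arr[(size_t)i * w + j] = count;

    // (b) Sketch of "compute several values, write them at once": each thread
    //     handles 4 horizontally adjacent pixels (so j indexes groups of 4 and
    //     x0 = xmin + dx_per_pixel * (4 * j)) and issues one 128-bit store.
    //     iterate() is a hypothetical per-pixel escape-count helper.
    //     Assumes w % 4 == 0; cudaMalloc buffers are aligned for uint4.
    uint4 packed;
    packed.x = iterate(x0,                       y0);
    packed.y = iterate(x0 +        dx_per_pixel, y0);
    packed.z = iterate(x0 + 2.0f * dx_per_pixel, y0);
    packed.w = iterate(x0 + 3.0f * dx_per_pixel, y0);
    reinterpret_cast<uint4*>(count_arr)[(size_t)i * (w / 4) + j] = packed;

    Since (a) is already coalesced, (b) mainly reduces the number of store instructions; it is not guaranteed to help and needs measuring.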


    Mandelbrot is generally compute-bound, so you should focus on minimizing the number of operations in the main loop (memory-related optimizations are not very useful except when the iteration count is small, which is rarely the case on average). Loop unrolling can help achieve that, and so can merging MULs and ADDs into FMA operations. Moreover, warp divergence is a critical issue for Mandelbrot that you should reduce as much as possible so that all SIMD units are used efficiently; as stated before, remapping is an effective solution for that. You can also detect certain patterns to skip many iterations (rather complicated in practice).

    On top of that, as for any CUDA kernel, you should check the occupancy to maximize performance. Finally, as pointed out by @paleonix in the comments, instruction pipelining (sometimes described as instruction-level parallelism) can also improve performance on GPUs (even with a relatively bad occupancy), though it is certainly hard to benefit from it in a Mandelbrot kernel.
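    To make the instruction-level-parallelism point concrete (a sketch only, and as said above it may well not pay off for a divergent Mandelbrot loop), a thread can iterate two independent pixels in the same loop so that their arithmetic can overlap in the pipeline:

    // Sketch: two independent escape-time iterations interleaved in one thread.
    // Assumes this thread handles pixels (i, 2*j) and (i, 2*j + 1); x0a/x0b and
    // y0 are derived as in the kernels above.
    float x0a = xmin + dx_per_pixel * (2 * j);
    float x0b = x0a + dx_per_pixel;
    float y0  = ymin + dy_per_pixel * i;
    float xa = 0.0f, ya = 0.0f, xb = 0.0f, yb = 0.0f;
    uint32_t ca = 0, cb = 0;
    for (uint32_t k = 0; k < max_count; ++k) {
        bool alive_a = (xa * xa + ya * ya <= 4.0f);
        bool alive_b = (xb * xb + yb * yb <= 4.0f);
        if (!alive_a && !alive_b) break;          // both pixels escaped: stop early
        if (alive_a) {
            float xt = fmaf(xa, xa, fmaf(-ya, ya, x0a));
            ya = fmaf(2.0f * xa, ya, y0);
            xa = xt;
            ++ca;
        }
        if (alive_b) {
            float xt = fmaf(xb, xb, fmaf(-yb, yb, x0b));
            yb = fmaf(2.0f * xb, yb, y0);
            xb = xt;
            ++cb;
        }
    }
    count_arr[(size_t)i * w + 2 * j]     = ca;
    count_arr[(size_t)i * w + 2 * j + 1] = cb;

    The two chains share no data, so their FMAs can be issued back to back; the cost is more registers per thread, which in turn can lower occupancy.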