cuda, nvidia, gpu-warp

Understanding thread utilization in the CUDA reduction examples


I'm currently learning about GPU architecture and CUDA, and I'm looking at the reduction examples in the CUDA-samples GitHub repository here, while also following along with this presentation here.

Now I have a question about the changes between version 0 and version 1.

There are two changes:

  1. The most obvious: the removal of the modulo operation, which is slow (I'll take it for granted because it says so, although I'm assuming it's probably because division is a historically slow instruction)
  2. Instead of using whichever threads line up with the current stride (which doubles each iteration: 1, 2, 4, 8, ...), it always uses contiguous threads (0,1,2,3,4,5,6,7 for the first iteration of the loop, 0,1,2,3 for the second, 0,1 for the third...); see the loop sketch just after this list
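
For reference, the inner loops of the two kernels look roughly like this (paraphrased from the samples and the presentation, so the exact code may differ slightly; tid is threadIdx.x and sdata is the shared-memory array):

    // Version 0 (reduce0): interleaved addressing; the threads that do work are
    // the ones whose index is a multiple of twice the current stride.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if ((tid % (2 * s)) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Version 1 (reduce1): same interleaved addressing of the data, but the work
    // is remapped onto the lowest-numbered (contiguous) threads.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }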

I don't understand what advantage change number 2 brings in isolation. (I wonder whether most of the speedup really comes from the removal of the modulo operation?)

In the context of version 2 of the kernel, I can see it: using contiguous threads to access contiguous memory takes advantage of memory coalescing and removes bank conflicts in shared memory, leading to higher bandwidth.

But by itself, does using contiguous threads bring an advantage? From what I've studied so far, it kinda feels like it shouldn't. You still have divergence in the warp, which means you're using fewer cores than are available, and using two adjacent cores instead of cores that are far apart doesn't really feel like it should matter, considering that the memory they're accessing is sparse and the indices are far from each other (which is what version 2 fixes). The warp scheduler doesn't pack active threads together either, so you're not getting a full warp to run either way. Basically, using contiguous threads with memory that's not contiguous doesn't seem that useful to me. Am I wrong? Am I missing something from the examples?

My other idea is that maybe this was more efficient on older NVIDIA architectures, which had fewer cores per SM. For example, my textbook describes an architecture with 8 CUDA cores per SM, which would require 4 passes (context switches) for a warp; even if those are fast on the GPU, they are not necessarily free, compared to recent architectures with 32 cores per SM, which would not need intra-warp context switches. But I feel like I'm overthinking it with this last paragraph.

Also, as an aside, the initial comment for version 2 of the kernel says:

This version uses sequential addressing -- no divergence or bank conflicts.

Isn't the warp still divergent, considering that many of the threads eventually won't do anything? In the first iteration you only have half the threads active, and it only gets worse from there. I just want to be sure I have my terminology right.


Solution

  • In the absence of further comments, let me summarize and expand on the existing ones.

    Modulo

    The most obvious: the removal of the modulo operation, which is slow (I'll take it for granted because it says so, although I'm assuming it's probably because division is a historically slow instruction)

    Well, yes and no. Yes, integer division (and by extension, modulo) is generally slow. However, in this particular case it is possible to write the code so that the compiler may be able to deduce that the divisors are powers of 2 and the dividend is an unsigned integer. In that case the modulo can be reduced to a bitmask. If you don't trust the compiler to do this, it is easy enough to do the transformation manually.

    This, from the reduce0 sample,

    if ((tid % (2 * s)) == 0)
    

    becomes

    if ((tid & (2 * s - 1)) == 0)
    
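    As a quick sanity check of that identity (my own minimal host-side sketch, not part of the samples): for an unsigned dividend and a power-of-two divisor, the modulo and the bitmask give the same result.

    // Hypothetical standalone check, not taken from the CUDA samples.
    #include <cassert>

    int main() {
        for (unsigned int s = 1; s < 512; s *= 2) {          // strides 1, 2, 4, ...
            for (unsigned int tid = 0; tid < 1024; ++tid) {
                assert((tid % (2 * s)) == (tid & (2 * s - 1)));
            }
        }
        return 0;
    }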

    Using contiguous threads

    does using contiguous threads bring an advantage?

    First, a bit of nomenclature. For the purpose of this answer, I define a thread or warp as productive if it does actual work towards the result, and as unproductive if it is predicated off or just waiting: either waiting for reconvergence after a divergent branch, or waiting at a __syncthreads() without having done any work in the current iteration. This notation is non-standard; I use it here only to avoid confusion with active and inactive threads, which usually refer to divergence or to threads that exit the call early.

    In the reduce0 sample, at least one thread from every warp stays productive for the first 5 iterations. And even afterwards, the number of productive warps remains larger than in the reduce1 sample except for the last iteration. Does this matter?
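
    To make that concrete, here is a small host-side sketch (my own, assuming 256 threads per block and a warp size of 32; these parameters are not taken from the samples) that counts, per iteration, how many warps contain at least one productive thread under the reduce0 and reduce1 conditions:

    #include <cstdio>
    #include <set>

    int main() {
        const unsigned int blockSize = 256, warpSize = 32;
        for (unsigned int s = 1; s < blockSize; s *= 2) {
            std::set<unsigned int> w0, w1;  // warps with at least one productive thread
            for (unsigned int tid = 0; tid < blockSize; ++tid) {
                if ((tid % (2 * s)) == 0) w0.insert(tid / warpSize);      // reduce0 condition
                if (2 * s * tid < blockSize) w1.insert(tid / warpSize);   // reduce1 condition
            }
            std::printf("s=%3u: reduce0 -> %zu warps, reduce1 -> %zu warps\n",
                        s, w0.size(), w1.size());
        }
        return 0;
    }

    With these assumptions it prints 8, 8, 8, 8, 8, 4, 2, 1 productive warps per iteration for reduce0 versus 4, 2, 1, 1, 1, 1, 1, 1 for reduce1.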

    Well, this depends on details of the GPU architecture and is (a) subject to change and (b) not always explicitly documented. We can infer some information, however. For example, the Volta Whitepaper shows 4 warp schedulers (32 threads per clock cycle each) per multiprocessor. The older Fermi Whitepaper makes it more explicit:

    Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.

    It is therefore reasonable to assume that an instruction takes the same amount of time whether a single thread in the warp is active or all of them are (not counting bank conflicts or additional global memory transactions). So if we can arrange the code so that entire warps skip over a piece of code, this reduces the number of instructions actually executed before all threads reach the synchronization point.

    In particular, we (sort of) know that only one warp per multiprocessor can access the shared memory in each clock cycle. I don't have an authorative source for this but it can be inferred from the maximum theoretical bandwidth and that there are no bank conflicts between warps, only within a single warp. So even if we had enough warp schedulers, a shared-memory heavy workload like reduce0 would have many warps simply waiting for their turn accessing the shared memory just to then spend that cycle doing one or two read/write operations while reduce1 consistently does 32 per cycle. Of course reduce1 suffers from bank conflicts but that is addressed by the later code samples.