I was reading a slide-deck from Nvidia (Optimizing Parallel Reduction in CUDA) discussing a parallel reduction algorithm.
Here the writer discusses that by using sequential addressing, we avoid shared memory bank conflicts. I think this is wrong and here is my reason:
On slide 10 we assume that each thread block has 128 threads. In the sequential code provided in the blue box on the slide, thread 0 tries to use both sdata[0] and sdata[64] at the same time, and since we have 32 memory banks, these two accesses lead to a memory bank conflict.
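For reference, the sequential-addressing loop in the blue box looks roughly like this (my reconstruction, so details may differ from the slide):

    // sequential addressing: each active thread adds the element
    // s positions to its right, with s halving every iteration
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

With 128 threads, the first iteration has s == 64, so thread 0 writes sdata[0] and reads sdata[64], which is what I mean by using both at the same time.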
Which part of my inference is wrong?
This isn't about a bank conflict between the read side sdata[tid + s] and the write side sdata[tid]. It is about bank conflicts between neighboring threads within the same warp for each operation (read or write) on its own. A warp can't do a read and a write operation in parallel anyway, and in this case the two operations aren't even independent. Operations that do not happen in parallel/lockstep can't cause bank conflicts.
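To make that concrete, here is a tiny host-side sketch of the documented bank mapping for 4-byte words (bank = word index modulo 32), applied to the sdata[tid] write and the sdata[tid + s] read in the first iteration of a 128-thread block (s == 64):

    #include <cstdio>

    // Toy illustration only: shared memory banks for 4-byte words follow
    // bank = word_index % 32, so we can compute them on the host.
    int main() {
        const int s = 64;  // first reduction step for a 128-thread block
        for (int tid = 0; tid < 4; ++tid) {
            int write_bank = tid % 32;        // bank hit by sdata[tid]
            int read_bank  = (tid + s) % 32;  // bank hit by sdata[tid + s]
            std::printf("tid %d: write bank %d, read bank %d\n",
                        tid, write_bank, read_bank);
        }
        return 0;
    }

Neighboring threads land on neighboring banks for the writes, and likewise for the reads, so neither operation conflicts with itself; and thread 0's write and thread 0's read never execute in the same lockstep access.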
Remember that threads within a warp (a half-warp on very old GPUs) do their memory accesses in lockstep. In the first code, the access index is strided by 2 * s. So, for example, when 2 * s == 32, all the active threads of a warp hit the same bank, assuming 32 memory banks.
In the second code, all threads access array elements in sequential order without a stride, and the index differs only by thread ID, so neighboring threads always hit different memory banks.
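For comparison, the first (interleaved-addressing) code looks roughly like this; again a reconstruction, not a verbatim copy of the slide:

    // interleaved addressing: the indices of consecutive active threads
    // are strided by 2 * s
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

For 2 * s == 32 and a 128-thread block, the active threads write sdata[0], sdata[32], sdata[64] and sdata[96], which all map to bank 0, and read sdata[16], sdata[48], sdata[80] and sdata[112], which all map to bank 16; each operation on its own is a multi-way bank conflict.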
A quick warning: note this from Nvidia's Volta Tuning Guide:

Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new __syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
This affects the code in slide 22 (the unrolling of the last warp). Also, I'm pretty sure that part is better solved with Warp Shuffle Functions rather than using shared memory.
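A minimal sketch of that last-warp step done with warp shuffles instead of shared memory (assuming each lane already holds its partial sum in val; the function name is mine):

    // Reduce one warp's partial sums entirely in registers.
    // __shfl_down_sync performs the required synchronization among the
    // lanes named in the mask, so no volatile shared memory or
    // __syncwarp() is needed for this exchange.
    __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }
        return val;  // lane 0 ends up with the warp's total
    }

Lane 0 of the warp can then write the block's result back to shared or global memory.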
The corresponding CUDA sample on GitHub has been updated accordingly. There is probably an updated version of the webinar somewhere, too.