[SOLVED] Bank Conflicts From Non-Sequential Access in Shared Memory on CUDA

I'm writing some N-body simulation code with short-ranged interactions in CUDA, targeted at Volta and Turing series cards. I plan on using shared memory, but it's not quite clear to me how to avoid bank conflicts when doing so. Since my interactions are local, I was planning on sorting my particle data into local groups that I can send to each SM's shared memory (not yet worrying about particles that have a neighbor being worked on by another SM). In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?

All of the information I see seems to only mention that memory accesses should be coalesced for the copy from global memory to shared memory, but I don't see anything about whether threads in a warp (or the whole SM) care about coalescence in shared memory.

Solution

In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?

Bank conflicts are only possible between threads in a single warp that are performing a shared memory access, and then only on a per-instruction (issued) basis. The instructions I am talking about here are SASS (GPU assembly code) instructions, but they should nevertheless be directly identifiable from shared memory references in CUDA C++ source code.

There is no such thing as a bank conflict:

between threads in different warps
between shared memory accesses arising from different (issued) instructions

A given thread may access shared memory in any pattern, with no concern about, or possibility of, shared memory bank conflicts due to its own activity. Bank conflicts only arise between 2 or more threads in a single warp, as a result of a particular shared memory instruction or access issued warp-wide.

Furthermore, it is not sufficient that each thread reads/writes from/to a different address. For a given issued instruction (i.e. a given access), roughly speaking, each thread in the warp must read from a different bank, or else it must read from an address that is the same as another thread's address in the warp (broadcast).

Let's assume that we are referring to 32-bit banks, and an arrangement of 32 banks. Shared memory can readily be imagined as a 2D arrangement:

Addr                Bank
 v     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

 0     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32    32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64    64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96    96 97 98 ...

We see that addresses (indices, offsets, locations) 0, 32, 64, 96, etc. are in the same bank. Addresses 1, 33, 65, 97, etc. are in the same bank, and so on, for each of the 32 banks. Each location here is one 32-bit word. Banks are like columns of locations when the addresses of shared memory are visualized in this 2D arrangement.
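The row/column picture above can be written down as a small host-side helper. This is only a sketch under the same assumptions as the table (32 banks, each 32 bits wide, with the table entries read as 32-bit word indices); the names `BankLocation`, `locate_word` and `locate_byte` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstddef>

// Assumptions taken from the text: 32 banks, each 32 bits (4 bytes) wide.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;

// Position of a location in the 2D picture: which row, which bank (column).
struct BankLocation {
    std::size_t row;
    std::size_t bank;
};

// Map a 32-bit word index (the numbers shown in the table) to row and bank.
constexpr BankLocation locate_word(std::size_t word_index) {
    return {word_index / kNumBanks, word_index % kNumBanks};
}

// Map a byte offset into shared memory to row and bank
// by first converting it to a 32-bit word index.
constexpr BankLocation locate_byte(std::size_t byte_offset) {
    return locate_word(byte_offset / kBankWidthBytes);
}
```

For a shared array of `float`, element `i` sits at word index `i` relative to the start of the array; the absolute bank also depends on where the array begins in shared memory.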

The requirement for non-bank-conflicted access for a given instruction (load or store) issued to a warp is:

No 2 threads in the warp may access locations in the same bank/column.
A special case exists if the locations in the same column are actually the same location. This invokes the broadcast rule and does not lead to bank conflicts.
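To make the two rules concrete, here is a rough host-side model that scores one warp-wide access. It is a simplification, assuming 32 lanes, 32 banks and 32-bit accesses, and it is not a measurement of real hardware; `conflict_degree` and `strided_access` are hypothetical helper names. A result of 1 means conflict-free, and N means some bank is asked for N different words.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Assumptions taken from the text: 32 banks of 32-bit words, 32 threads per warp.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kWarpSize = 32;

using WarpAccess = std::array<std::size_t, kWarpSize>;  // word index per lane

// Largest number of DISTINCT words requested from any single bank
// by one warp-wide access. Lanes that hit the same word are counted
// once, which models the broadcast rule.
std::size_t conflict_degree(const WarpAccess& word_index) {
    std::array<std::set<std::size_t>, kNumBanks> words_per_bank;
    for (std::size_t idx : word_index) {
        words_per_bank[idx % kNumBanks].insert(idx);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}

// Hypothetical access pattern: lane t touches word t * stride.
WarpAccess strided_access(std::size_t stride) {
    WarpAccess access{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        access[lane] = lane * stride;
    }
    return access;
}
```

In this model, `strided_access(7)` is non-sequential yet lands every lane in a different bank, while `strided_access(32)` gives every lane a different address but puts them all in bank 0. That is the sense in which "a different address per thread" is not the right criterion, and "sequential" is not required.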

And to repeat some statements above in a slightly different way:

If I have a loop in CUDA code, there is no possibility for bank conflicts to arise between separate iterations of that loop.
If I have two separate lines of CUDA C++ code, there is no possibility for bank conflicts to arise between those two separate lines of CUDA C++ code.
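The loop statement can be illustrated with the same kind of host-side model. The sketch below assumes a hypothetical loop in which, on iteration k, lane t reads word t + 32 * k; each iteration is treated as its own warp-wide access and scored separately. The helper names are invented for this example, and the model again assumes 32 banks of 32-bit words.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kNumBanks = 32;  // assumed, as in the text
constexpr std::size_t kWarpSize = 32;

// Conflict degree of ONE warp-wide access (1 = conflict-free).
std::size_t conflict_degree(const std::array<std::size_t, kWarpSize>& word_index) {
    std::array<std::set<std::size_t>, kNumBanks> words_per_bank;
    for (std::size_t idx : word_index) {
        words_per_bank[idx % kNumBanks].insert(idx);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}

// Score each loop iteration on its own, since each is a separate access.
std::vector<std::size_t> degree_per_iteration(std::size_t iterations) {
    std::vector<std::size_t> degrees;
    for (std::size_t k = 0; k < iterations; ++k) {
        std::array<std::size_t, kWarpSize> access{};
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            access[lane] = lane + kNumBanks * k;
        }
        degrees.push_back(conflict_degree(access));
    }
    return degrees;
}

// Banks a single lane touches over the whole loop.
std::set<std::size_t> banks_touched_by_lane(std::size_t lane, std::size_t iterations) {
    std::set<std::size_t> banks;
    for (std::size_t k = 0; k < iterations; ++k) {
        banks.insert((lane + kNumBanks * k) % kNumBanks);
    }
    return banks;
}
```

Here every lane returns to the same bank on every iteration, yet each iteration on its own is conflict-free, because only the lanes participating in the same access are compared against each other.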