I learned from this nice SO answer that the CUDA function cudaMallocPitch creates padded memory that helps avoid bank conflicts.
I understand how the padding helps alignment, since it closely resembles its CPU counterpart.
However, I am not sure how it minimizes bank conflicts. Does it pad so that, say, the first column of each row starts in a different bank, like below?
Before padding:
Bank 0 | Bank 1 | Bank 2 | Bank 3
-------|--------|--------|-------
A[0][0]| A[0][1]| A[0][2]| A[0][3]
A[1][0]| A[1][1]| A[1][2]| A[1][3]
A[2][0]| A[2][1]| A[2][2]| A[2][3]
...
After padding:
Bank 0 | Bank 1 | Bank 2 | Bank 3
-------|--------|--------|-------
A[0][0]| A[0][1]| A[0][2]| A[0][3]
padding| A[1][0]| A[1][1]| A[1][2]
A[1][3]| padding| padding| A[2][0]
...
However, I am not sure how does it optimize for minimizing bank-conflict, …
The extremely concise answer is that it doesn’t.
Bank conflicts are warp-level access conflicts in shared memory. They have nothing to do with global memory, which is what cudaMallocPitch operates on. As you note, pitched global allocation is about getting good alignment in DRAM for the cache lines in the memory controller and texture units. That is completely orthogonal to bank conflicts.
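
To make the distinction concrete, here is a small host-side C++ sketch (plain arithmetic, no CUDA calls) that models the two kinds of padding separately. It assumes the commonly documented shared memory layout of 32 banks, each 4 bytes wide, and a 4-byte float. The helpers bankOf, distinctBanksForColumnRead and roundUpPitch are hypothetical names made up for this illustration. In particular, roundUpPitch is only a stand-in: the real pitch is chosen by the runtime and returned by cudaMallocPitch, and the 512-byte alignment used in the note below is an assumption, not a documented value.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed shared-memory model: 32 banks, each 4 bytes wide, 32 threads per warp.
constexpr std::size_t kBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;
constexpr std::size_t kWarpSize = 32;

// Bank holding the float at tile[row][col] when each row is `rowStride` floats long.
constexpr std::size_t bankOf(std::size_t row, std::size_t col, std::size_t rowStride) {
    return ((row * rowStride + col) * sizeof(float) / kBankWidthBytes) % kBanks;
}

// Number of distinct banks touched when thread t of a warp reads tile[t][col].
// 1 means every thread hits the same bank; 32 means the read is conflict-free.
constexpr std::size_t distinctBanksForColumnRead(std::size_t col, std::size_t rowStride) {
    std::uint32_t seen = 0;  // bit b is set once bank b has been touched
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        seen |= std::uint32_t{1} << bankOf(t, col, rowStride);
    }
    std::size_t count = 0;
    for (; seen != 0; seen >>= 1) {
        count += seen & 1u;
    }
    return count;
}

// Hypothetical stand-in for a pitch choice in global memory: round the row width
// up to a multiple of `alignment` so that every row starts on an aligned address.
constexpr std::size_t roundUpPitch(std::size_t widthBytes, std::size_t alignment) {
    return (widthBytes + alignment - 1) / alignment * alignment;
}
```

Under these assumptions, distinctBanksForColumnRead(0, 32) returns 1 (a column read of a 32x32 float tile lands entirely in one bank), while distinctBanksForColumnRead(0, 33) returns 32 (one extra float per row shifts each row by one bank). That kind of padding is something you do yourself on a shared memory array. By contrast, roundUpPitch(400, 512) returns 512: padding of the cudaMallocPitch kind only rounds the row size up so that rows start at aligned global addresses, and it has no effect on shared memory banks.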