In my program I use shared memory to do prefetching of data. A 2D block of threads, dimentions 8 by 4 (32), gets 8 * 4 * 8 * sizeof(float4)
bytes of shared memory. Each thread copies 8 float4
s in a loop:
inline __device__ void pack(const float4 *g_src, float4 *s_dst, const unsigned int w, const unsigned int d) {
uint2 indx = { blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y };
uint2 sindx = { threadIdx.x, threadIdx.y };
int i;
for (i = 0; i < d; ++i) s_dst[(sindx.y * blockDim.x + sindx.x) * d + i] = g_src[(w * indx.y + indx.x) * d + i];
}
where w
is set to width of the global memory buffer (in number of float4
s) and d
is set to 8 (number of float4
s copied).
Can such configuration and further usage of the memory, lead to bank conflicts, or will broadcasting be applied? Will this be a case also when threads copy only, say 5 float4
s, not 8?
MK
P.S. Same topic in Nvidia forum
During prefetching phase bank conflicts will occur. E.g. threads within first warp with IDs (computed as threadIdx.x + threadIdx.y * blockDim.x
) 0, 4, 8, ... 28 access same bank. You can see it as thread (0,0) and thread (4,0) for i
equals 0 access s_dst[0]
and s_dst[32]
belonging to the same bank.
If bank conflicts occur during further usage depends on the scheme you will access s_dst
.
Broadcast mechanism is applied only when threads simultaneously read the same address.
How many bank conflicts occur depends on the value of d
. If d mod 32 == 1
there won't be any conflicts.
EDIT:
IMHO the best way to avoid bank conflicts in prefetching phase, specially if d
is changing, is to equaly split the work among the warps. Lets say you need to prefetch n
values to shared memory, w_id
is ID of warp and l_id
is ID of thread within warp (from 0 to 31). Than prefetching should look like this:
for(int i = l_id + w_id*WARP_SIZE; i < n; i += WARP_SIZE*COUNT_OF_WARPS_IN_BLOCK)
{
s_dst[i] = ...;
}
But this helps only to avoid bank conflicts during prefetching. As I have already said to avoid conflicts during further usage depends on the scheme you will access s_dst
.