Will this lead to inconsistencies in shared memory?
My kernel code looks like this (pseudocode):
__shared__ uint histogram[32][64];
uint threadLane = threadIdx.x % 32;
for (data){
histogram[threadLane][data]++;
}
Will this lead to collisions, given that, in a block with 64 threads, threads with id x
and (x + 32)
will very often write into the same position in the matrix?
This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.
It is dangerous. It leads to race conditions.
If you cannot guarantee that each thread within a block has unique write access to a location in the shared memory then you have a problem because that you need to solve by synchronization.
Take a look at this paper for a correct and efficient way of using SM for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf
Note that is plenty of libraries online that allows you to compute histograms in one line, Thrust
for instance .