Tags: c++, cuda, gpu, histogram, gpu-shared-memory

CUDA: 2 threads from different warps but same block attempt to write into same SHARED memory position: dangerous?


Will this lead to inconsistencies in shared memory?

My kernel code looks like this (pseudocode):

__shared__ uint histogram[32][64];        // one 64-bin sub-histogram per warp lane

uint threadLane = threadIdx.x % 32;       // lane of this thread within its warp

for (/* each data element assigned to this thread */) {
    histogram[threadLane][data]++;        // unsynchronized read-modify-write on shared memory
}

Will this lead to collisions, given that, in a block with 64 threads, threads with ids x and x + 32 will very often write into the same position in the matrix?

This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.


Solution

  • It is dangerous: it leads to race conditions. When two threads read, increment, and write back the same bin without synchronization, one of the increments can be lost, which is exactly why the GPU histogram comes out lower than the CPU reference.

    If you cannot guarantee that each thread within a block has exclusive write access to its shared-memory location, then you have a race condition that you need to resolve with synchronization, typically atomic operations (see the atomicAdd sketch at the end of this answer).

    Take a look at this paper for a correct and efficient way of using shared memory for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf

    Note that there are plenty of libraries that let you compute a histogram in a few lines of code, Thrust for instance (see the sketch below).
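
For completeness, here is a minimal sketch of the atomic fix in the spirit of the question's layout (one 64-bin sub-histogram per warp lane). The kernel name, input format, and the final reduction into global memory are assumptions, not code from the question; the key point is that each increment becomes an atomicAdd on the shared-memory bin.

// Minimal sketch: per-warp-lane sub-histograms in shared memory, updated
// atomically. Kernel name, input layout, and bin count are assumptions.
__global__ void histogramKernel(const unsigned int *input, unsigned int n,
                                unsigned int *globalHist /* 64 bins */)
{
    __shared__ unsigned int histogram[32][64];

    unsigned int threadLane = threadIdx.x % 32;

    // Zero the shared histogram cooperatively before use.
    for (unsigned int i = threadIdx.x; i < 32 * 64; i += blockDim.x)
        histogram[i / 64][i % 64] = 0;
    __syncthreads();

    // Grid-stride loop over the input; atomicAdd makes the
    // read-modify-write on each bin safe across threads.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        unsigned int data = input[i];              // assumed to be in [0, 64)
        atomicAdd(&histogram[threadLane][data], 1u);
    }
    __syncthreads();

    // Reduce the 32 per-lane histograms and accumulate into global memory.
    for (unsigned int bin = threadIdx.x; bin < 64; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int lane = 0; lane < 32; ++lane)
            sum += histogram[lane][bin];
        atomicAdd(&globalHist[bin], sum);
    }
}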
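
For the library route, here is a hedged sketch of the classic Thrust dense-histogram recipe (sort, then vectorized upper_bound, then adjacent_difference). The function name and the assumption that every value lies in [0, numBins) are illustrative.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Dense histogram via sort / upper_bound / adjacent_difference.
// data is taken by value so the caller's copy is not reordered.
thrust::device_vector<unsigned int>
denseHistogram(thrust::device_vector<unsigned int> data, unsigned int numBins)
{
    // Sorting groups equal values so each bin's boundary can be found by search.
    thrust::sort(data.begin(), data.end());

    thrust::device_vector<unsigned int> histogram(numBins);

    // For each bin value b in [0, numBins), count how many elements are <= b.
    thrust::upper_bound(data.begin(), data.end(),
                        thrust::counting_iterator<unsigned int>(0),
                        thrust::counting_iterator<unsigned int>(numBins),
                        histogram.begin());

    // Adjacent differences of the cumulative counts give the per-bin counts.
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
    return histogram;
}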