cudagpuhistogram

How to implement a CUDA histogram kernel?


I want write a CUDA kernel that computes histograms.

Simple example

Suppose I have the following array:

| 1 | 10 | 30 | 39 | 32 | 2 | 4 | 5 | 1 |

with no value exceeding maxValue (40 in my example). I want to create a histogram, say using the following 4 buckets:

0 - 9 (1st bucket)
10 - 19 (2nd bucket)
20 - 29 (3rd bucket)
30 - 39 (4th bucket)

I first thought of creating partial histogram in each block using shared memory (temp array).

__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, unsigned int *histo) {
    extern __shared__ unsigned int temp[];
    
    temp[threadIdx.x] = 0;
    __syncthreads();
    
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    int bucketID;
    while (i < size) {
        bucketID = array[i] / Bwidth;
        atomicAdd(&temp[bucketID], 1);
        i += offset;
    }
    __syncthreads();
    
    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}

This is how I invoke my kernel:

histo_kernel_optimized<<<array_size / buckets, buckets, buckets * sizeof(unsigned int)>>>(buffer, SIZE, histogram);

But compilation fails with:

Instruction '{atom,red}.shared' requires .target sm_12 or higher

Note: My GPU has Compute Capability 1.1.


I also tried having each thread create its own temp array:

__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, unsigned int *histo) {
    unsigned int temp[buckets];
    int j;
    for (j = 0; j < buckets; j++) {
        temp[j] = 0;
    }
    
    int bucketID;
    
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        bucketID = array[i] / Bwidth;
        temp[bucketID]++;
        i += offset;
    }
    
    for (j = 0; j < buckets; j++) {
        histo[j] += temp[j];    
    }
}

This does not compile either. Apparently temp has to be declared with a constant size. But I want to support setting the number of buckets dynamically (the user should be able to set them via the command line upon invoking my program).


What am I doing wrong? How to implement this correctly?


Solution

  • When using atomics, launching fewer blocks will reduce contention (and hence improve performance) because it will not have to coordinate between fewer blocks. Launch fewer blocks and have each block loop over more of the input elements.

    for (unsigned tid = blockIdx.x*blockDim.x+threadIdx.x; 
                  tid < size; tid += gridDim.x*blockDim.x) {
        unsigned char value = array[tid]; // borrowing notation from another answer here
        int bin = value % buckets;
        atomicAdd(&histo[bin],1);
    }