python-2.7, cuda, numba, pycuda, numba-pro

Inconsistent results in CUDA GPU-accelerated code


I was trying to compute Local Binary Patterns for an image on my GPU, using the cuda module in Python. But executing the same algorithm on the CPU and on the GPU produces different results. Can you help me figure out the problem?

Below is the snippet of code I was trying to execute:

from __future__ import division
from skimage.io import imread, imshow
from numba import cuda
import time
import math
import numpy

# CUDA Kernel
@cuda.jit
def pointKernelLBP(imgGPU, histVec, pos) :
    ''' Computes Point Local Binary Pattern '''
    row, col = cuda.grid(2)
    if row+1 < imgGPU.shape[0] and col+1 < imgGPU.shape[1] and col-1>=0 and row-1>=0 :
        curPos = 0
        mask = 0
        for i in xrange(-1, 2) :
            for j in xrange(-1, 2) :
                if i==0 and j==0 :
                    continue
                if imgGPU[row+i][col+j] > imgGPU[row][col] :
                    mask |= (1<<curPos)     
                curPos+=1
        histVec[mask]+=1


#Host Code for computing LBP 
def pointLBP(x, y, img) :
    ''' Computes Local Binary Pattern around a point (x,y),
    considering 8 nearest neighbours '''
    pos = [0, 1, 2, 7, 3, 6, 5, 4]  
    curPos = 0
    mask = 0
    for i in xrange(-1, 2) :
        for j in xrange(-1, 2) :
            if i==0 and j==0 :
                continue
            if img[x+i][y+j] > img[x][y] :
                mask |= (1<<curPos)         
            curPos+=1
    return mask                 

def LBPHistogram(img, n, m) :
    ''' Computes LBP Histogram for given image '''
    HistVec = [0] * 256 
    for i in xrange(1, n-1) :
        for j in xrange(1, m-1) :
            HistVec[ pointLBP(i, j, img) ]+=1
    return HistVec

if __name__ == '__main__' :

    # Reading Image
    img = imread('cat.jpg', as_grey=True)
    n, m = img.shape

    start = time.time() 
    imgHist = LBPHistogram(img, n, m)
    print "Computation time incurred on CPU : %s seconds.\n" % (time.time() - start)    

    print "LBP Histogram Vector Using CPU :\n"
    print imgHist

    print type(img)

    pos = numpy.array([0, 1, 2, 7, 3, 6, 5, 4])  # numpy.array, not numpy.ndarray (which treats the list as a shape)

    img_global_mem = cuda.to_device(img)
    imgHist_global_mem = cuda.to_device(numpy.full(256, 0, numpy.uint8))
    pos_global_mem = cuda.to_device(pos)

    threadsperblock = (32, 32)
    blockspergrid_x = int(math.ceil(img.shape[0] / threadsperblock[0]))
    blockspergrid_y = int(math.ceil(img.shape[1] / threadsperblock[1]))
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    start = time.time() 
    pointKernelLBP[blockspergrid, threadsperblock](img_global_mem, imgHist_global_mem, pos_global_mem)
    print "Computation time incurred on GPU : %s seconds.\n" % (time.time() - start)

    imgHist = imgHist_global_mem.copy_to_host()

    print "LBP Histogram as computed on the GPU : \n"
    print imgHist, len(imgHist)

Solution

  • Now that you have fixed the glaringly obvious mistake in the original kernel code you posted, there are two remaining problems preventing this code from working correctly.

    The first, and most serious, is a memory race in the kernel. This update of the histogram bins:

    histVec[mask]+=1
    

    is not safe. The increment is really a separate load, add, and store, so multiple threads in multiple blocks reading and writing the same bin counters in global memory simultaneously can read the same old value, and increments are lost. CUDA offers no guarantees of either correctness or repeatability in such circumstances.

    The simplest (but not necessarily the most performant, depending on your hardware) solution to this is to use atomic memory transactions. These do guarantee that the increment operation will be serialized, but of course that serialization implies some performance penalty. You can do this by changing the update code to something like:

    cuda.atomic.add(histVec, mask, 1)
    

    Note that CUDA only supports 32 and 64 bit atomic memory transactions, so you will need to ensure the type of histVec is a compatible 32 or 64 bit integer type.

    This leads to the second problem, which is that you have defined the bin counter vector to be numpy.uint8. That means that even if you didn't have a memory race, you would only have 8 bits to store the counts, and they would quickly overflow for images of any meaningful size. So for both compatibility with atomic memory transactions and to prevent counter rollover, you will need to change the type of your counters.
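    Putting the two fixes together, a minimal sketch of the changed pieces might look like this (keeping the rest of your posted code as-is):

    from numba import cuda
    import numpy

    @cuda.jit
    def pointKernelLBP(imgGPU, histVec, pos):
        ''' Computes Point Local Binary Pattern '''
        row, col = cuda.grid(2)
        if row+1 < imgGPU.shape[0] and col+1 < imgGPU.shape[1] and col-1 >= 0 and row-1 >= 0:
            curPos = 0
            mask = 0
            for i in range(-1, 2):
                for j in range(-1, 2):
                    if i == 0 and j == 0:
                        continue
                    if imgGPU[row+i, col+j] > imgGPU[row, col]:
                        mask |= (1 << curPos)
                    curPos += 1
            # Atomic read-modify-write: concurrent updates to the same bin are serialized
            cuda.atomic.add(histVec, mask, 1)

    # 32-bit counters: wide enough not to overflow and compatible with CUDA atomics
    imgHist_global_mem = cuda.to_device(numpy.zeros(256, dtype=numpy.uint32))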

    When I changed these things in the code you posted (and fixed the missing code problem from the earlier version), I could get exact agreement between the GPU and host calculated histograms for a random 8 bit input array.
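    For reference, a hypothetical test harness along these lines (it assumes the fixed pointKernelLBP above and the LBPHistogram host function from the question) reproduces that agreement:

    import numpy
    from numba import cuda

    # Random 8 bit test image; the size here is arbitrary
    img = numpy.random.randint(0, 256, (512, 512)).astype(numpy.uint8)
    n, m = img.shape

    cpuHist = LBPHistogram(img, n, m)   # host reference from the question

    d_img = cuda.to_device(img)
    d_hist = cuda.to_device(numpy.zeros(256, dtype=numpy.uint32))
    d_pos = cuda.to_device(numpy.array([0, 1, 2, 7, 3, 6, 5, 4]))

    threadsperblock = (32, 32)
    blockspergrid = ((n + threadsperblock[0] - 1) // threadsperblock[0],
                     (m + threadsperblock[1] - 1) // threadsperblock[1])

    pointKernelLBP[blockspergrid, threadsperblock](d_img, d_hist, d_pos)
    gpuHist = d_hist.copy_to_host()

    # Bin counts must match exactly once the race is removed
    assert (numpy.asarray(cpuHist) == gpuHist).all()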

    The underlying parallel histogram problem is very well described for CUDA, and there are a lot of examples and codebases you can study when you get to the point of worrying about performance, for example here. The usual first step is to privatize the histogram in shared memory, one copy per block, so that most of the atomic traffic stays on-chip.
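    A rough numba sketch of that shared memory privatization idea (just the structure, not tuned) could look like:

    from numba import cuda, uint32

    @cuda.jit
    def sharedMemLBP(imgGPU, histVec):
        # One private 256-bin histogram per block in fast shared memory
        sHist = cuda.shared.array(256, uint32)

        tid = cuda.threadIdx.x + cuda.threadIdx.y * cuda.blockDim.x
        nthreads = cuda.blockDim.x * cuda.blockDim.y

        # Zero the shared histogram cooperatively
        for b in range(tid, 256, nthreads):
            sHist[b] = 0
        cuda.syncthreads()

        row, col = cuda.grid(2)
        if row >= 1 and row < imgGPU.shape[0]-1 and col >= 1 and col < imgGPU.shape[1]-1:
            curPos = 0
            mask = 0
            for i in range(-1, 2):
                for j in range(-1, 2):
                    if i == 0 and j == 0:
                        continue
                    if imgGPU[row+i, col+j] > imgGPU[row, col]:
                        mask |= (1 << curPos)
                    curPos += 1
            # Shared memory atomic: contended only within this block
            cuda.atomic.add(sHist, mask, 1)
        cuda.syncthreads()

        # One global atomic per non-empty bin per block, instead of one per pixel
        for b in range(tid, 256, nthreads):
            if sHist[b] > 0:
                cuda.atomic.add(histVec, b, sHist[b])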